Generator Tricks
        For Systems Programmers
                                                (Version 2.0)

                                                 David Beazley
                                            http://www.dabeaz.com

                                        Presented at PyCon UK 2008

Copyright (C) 2008, http://www.dabeaz.com                            1- 1
Introduction 2.0
              • At PyCon'2008 (Chicago), I gave this tutorial
                     on generators that you are about to see
              • About 80 people attended
              • Afterwards, there were more than 15000
                     downloads of my presentation slides
              • Clearly there was some interest
              • This is a revised version...
An Introduction

              • Generators are cool!
              • But what are they?
              • And what are they good for?
              • That's what this tutorial is about

About Me

              • I'm a long-time Pythonista
              • First started using Python with version 1.3
              • Author : Python Essential Reference
              • Responsible for a number of open source
                     Python-related packages (Swig, PLY, etc.)



My Story
                My addiction to generators started innocently
                      enough. I was just a happy Python
                 programmer working away in my secret lair
                 when I got "the call." A call to sort through
                  1.5 Terabytes of C++ source code (~800
                weekly snapshots of a million line application).
                    That's when I discovered the os.walk()
                   function. This was interesting I thought...



Back Story
              • I started using generators on a more day-to-
                      day basis and found them to be wicked cool
              • One of Python's most powerful features!
              • Yet, they still seem rather exotic
              • In my experience, most Python programmers
                      view them as kind of a weird fringe feature
              • This is unfortunate.
Python Books Suck!
               • Let's take a look at ... my book


                                            (a motivating example)


             • Whoa! Counting.... that's so useful.
             • Almost as useful as sequences of Fibonacci
                    numbers, random numbers, and squares
Our Mission
               • Some more practical uses of generators
               • Focus is "systems programming"
               • Which loosely includes files, file systems,
                      parsing, networking, threads, etc.
               • My goal : To provide some more compelling
                      examples of using generators
               • I'd like to plant some seeds and inspire tool
                      builders

Support Files

               • Files used in this tutorial are available here:
                             http://www.dabeaz.com/generators-uk/

               • Go there to follow along with the examples


Disclaimer
               • This isn't meant to be an exhaustive tutorial
                      on generators and related theory
               • Will be looking at a series of examples
               • I don't know if the code I've written is the
                      "best" way to solve any of these problems.
               • Let's have a discussion

Performance Disclosure
              • There are some later performance numbers
              • Python 2.5.1 on OS X 10.4.11
              • All tests were conducted on the following:
                    • Mac Pro 2 x 2.66 GHz Dual-Core Xeon
                    • 3 GB RAM
                    • WDC WD2500JS-41SGB0 disk (250 GB)
              • Timings are 3-run average of 'time' command
Part I
                           Introduction to Iterators and Generators




Iteration
                  • As you know, Python has a "for" statement
                  • You use it to loop over a collection of items
                             >>> for x in [1,4,5,10]:
                             ...      print x,
                             ...
                             1 4 5 10
                             >>>


                   • And, as you have probably noticed, you can
                          iterate over many different kinds of objects
                          (not just lists)


Iterating over a Dict
                  • If you loop over a dictionary you get keys
                           >>> prices = { 'GOOG' : 490.10,
                           ...            'AAPL' : 145.23,
                           ...            'YHOO' : 21.71 }
                           >>> for key in prices:
                           ...     print key
                           ...
                           YHOO
                           GOOG
                           AAPL
                           >>>




Iterating over a String
                  • If you loop over a string, you get characters
                          >>> s = "Yow!"
                          >>> for c in s:
                          ...     print c
                          ...
                          Y
                          o
                          w
                          !
                          >>>




Iterating over a File
                   • If you loop over a file you get lines
                        >>> for line in open("real.txt"):
                        ...     print line,
                        ...
                                 Real Programmers write in FORTRAN

                                    Maybe they do now,
                                    in this decadent era of
                                    Lite beer, hand calculators, and "user-friendly" software
                                    but back in the Good Old Days,
                                    when the term "software" sounded funny
                                    and Real Computers were made out of drums and vacuum tubes,
                                    Real Programmers wrote in machine code.
                                    Not FORTRAN. Not RATFOR. Not, even, assembly language.
                                    Machine Code.
                                    Raw, unadorned, inscrutable hexadecimal numbers.
                                    Directly.

Consuming Iterables
                   • Many functions consume an "iterable" object
                   • Reductions:
                               sum(s), min(s), max(s)

                    • Constructors
                               list(s), tuple(s), set(s), dict(s)


                    • in operator
                               item in s


                    • Many others in the library
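For example, any of these will consume a generator just as readily as a list (a quick sketch in Python 3 syntax; the slides themselves use Python 2):

```python
# Reductions, constructors, and 'in' all drive the iteration themselves
squares = (n * n for n in range(1, 5))      # generates 1, 4, 9, 16

print(sum(n * n for n in range(1, 5)))      # 30
print(max(n * n for n in range(1, 5)))      # 16
print(list(squares))                        # [1, 4, 9, 16]
print(9 in (n * n for n in range(10)))      # True (stops at the first match)
```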
Iteration Protocol
               • The reason why you can iterate over different
                       objects is that there is a specific protocol
                         >>> items = [1, 4, 5]
                         >>> it = iter(items)
                         >>> it.next()
                         1
                         >>> it.next()
                         4
                         >>> it.next()
                         5
                         >>> it.next()
                         Traceback (most recent call last):
                           File "<stdin>", line 1, in <module>
                         StopIteration
                         >>>


Iteration Protocol
                   • An inside look at the for statement
                           for x in obj:
                               # statements


                  • Underneath the covers
                          _iter = iter(obj)           # Get iterator object
                          while 1:
                              try:
                                   x = _iter.next()   # Get next item
                              except StopIteration:   # No more items
                                   break
                              # statements
                              ...

                  • Any object that supports iter() and next() is
                          said to be "iterable."
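In Python 3 the method is renamed `__next__()` and is normally invoked via the builtin `next()` (available since Python 2.6). The same session in Python 3 syntax (a sketch, not from the slides):

```python
items = [1, 4, 5]
it = iter(items)      # ask the object for an iterator
print(next(it))       # 1
print(next(it))       # 4
print(next(it))       # 5
try:
    next(it)          # exhausted: raises StopIteration
except StopIteration:
    print("done")
```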

Supporting Iteration
                 • User-defined objects can support iteration
                 • Example: Counting down...
                          >>> for x in countdown(10):
                          ...     print x,
                          ...
                          10 9 8 7 6 5 4 3 2 1
                          >>>


                • To do this, you just have to make the object
                        implement __iter__() and next()


Supporting Iteration
               • Sample implementation
                            class countdown(object):
                                def __init__(self,start):
                                    self.count = start
                                def __iter__(self):
                                    return self
                                def next(self):
                                    if self.count <= 0:
                                        raise StopIteration
                                    r = self.count
                                    self.count -= 1
                                    return r




Iteration Example

                  • Example use:
                                >>> c = countdown(5)
                                >>> for i in c:
                                ...     print i,
                                ...
                                5 4 3 2 1
                                >>>




Iteration Commentary

                  • There are many subtle details involving the
                         design of iterators for various objects
                  • However, we're not going to cover that
                  • This isn't a tutorial on "iterators"
                  • We're talking about generators...

Generators
                  • A generator is a function that produces a
                         sequence of results instead of a single value
                            def countdown(n):
                                while n > 0:
                                    yield n
                                    n -= 1
                            >>> for i in countdown(5):
                            ...     print i,
                            ...
                            5 4 3 2 1
                            >>>


                  • Instead of returning a value, you generate a
                         series of values (using the yield statement)

Generators
                  • Behavior is quite different from a normal function
                  • Calling a generator function creates a
                         generator object. However, it does not start
                         running the function.
                         def countdown(n):
                             print "Counting down from", n
                             while n > 0:
                                 yield n
                                 n -= 1

                         >>> x = countdown(10)    # notice: no output produced yet
                         >>> x
                         <generator object at 0x58490>
                         >>>


Generator Functions
                 • The function only executes on next()
                          >>> x = countdown(10)
                          >>> x
                          <generator object at 0x58490>
                          >>> x.next()             # function starts executing here
                          Counting down from 10
                          10
                          >>>

                 • yield produces a value, but suspends the function
                 • Function resumes on next call to next()
                         >>> x.next()
                         9
                         >>> x.next()
                         8
                         >>>


Generator Functions

                 • When the generator returns, iteration stops
                         >>> x.next()
                         1
                         >>> x.next()
                         Traceback (most recent call last):
                           File "<stdin>", line 1, in ?
                         StopIteration
                         >>>




Generator Functions

                • A generator function is mainly a more
                       convenient way of writing an iterator
                • You don't have to worry about the iterator
                       protocol (.next, .__iter__, etc.)
                • It just works

Generators vs. Iterators
               • A generator function is slightly different
                      than an object that supports iteration
               • A generator is a one-time operation. You
                      can iterate over the generated data once,
                      but if you want to do it again, you have to
                      call the generator function again.
               • This is different than a list (which you can
                      iterate over as many times as you want)
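A small sketch of the one-time behavior (Python 3 syntax):

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

c = countdown(3)
print(list(c))             # [3, 2, 1]
print(list(c))             # []  (the generator is spent)
print(list(countdown(3)))  # [3, 2, 1]  (call the function again for a fresh one)
```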


Generator Expressions
                   • A generated version of a list comprehension
                            >>> a = [1,2,3,4]
                            >>> b = (2*x for x in a)
                            >>> b
                            <generator object at 0x58760>
                             >>> for i in b: print i,
                            ...
                            2 4 6 8
                            >>>


                   • This loops over a sequence of items and applies
                          an operation to each item
                   • However, results are produced one at a time
                          using a generator
Generator Expressions
                  • Important differences from a list comp.
                           •      Does not construct a list.

                           •      Only useful purpose is iteration

                           •      Once consumed, can't be reused

                  • Example:
                          >>> a = [1,2,3,4]
                          >>> b = [2*x for x in a]
                          >>> b
                          [2, 4, 6, 8]
                           >>> c = (2*x for x in a)
                           >>> c
                           <generator object at 0x58760>
                           >>>

Generator Expressions

                  • General syntax
                             (expression for i in s if condition)


                   • What it means
                               for i in s:
                                   if condition:
                                      yield expression
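The equivalence is easy to check directly (a sketch in Python 3 syntax; `gen` is just an illustrative name):

```python
s = [1, 2, 3, 4, 5]

# Generator expression form
squares_of_odds = (x * x for x in s if x % 2)

# Equivalent generator function form
def gen(seq):
    for i in seq:
        if i % 2:              # the condition
            yield i * i        # the expression

print(list(squares_of_odds))   # [1, 9, 25]
print(list(gen(s)))            # [1, 9, 25]
```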




A Note on Syntax
                    • The parens on a generator expression can be
                           dropped if used as a single function argument
                   • Example:
                           sum(x*x for x in s)




                               Generator expression




Interlude
                  • We now have two basic building blocks
                  • Generator functions:
                            def countdown(n):
                                while n > 0:
                                     yield n
                                     n -= 1

                  • Generator expressions
                            squares = (x*x for x in s)


                  • In both cases, we get an object that
                         generates values (which are typically
                         consumed in a for loop)
Part 2
                                            Processing Data Files

                                  (Show me your Web Server Logs)




Programming Problem
                    Find out how many bytes of data were
                    transferred by summing up the last column
                    of data in this Apache web server log
            81.107.39.38 -  ... "GET /ply/ HTTP/1.1" 200 7587
            81.107.39.38 -  ... "GET /favicon.ico HTTP/1.1" 404 133
            81.107.39.38 -  ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
            81.107.39.38 -  ... "GET /ply/ply.html HTTP/1.1" 200 97238
            81.107.39.38 -  ... "GET /ply/example.html HTTP/1.1" 200 2359
            66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447



              Oh yeah, and the log file might be huge (Gbytes)


The Log File
                  • Each line of the log looks like this:
                        81.107.39.38 -        ... "GET /ply/ply.html HTTP/1.1" 200 97238


                  • The number of bytes is the last column
                           bytestr = line.rsplit(None,1)[1]


                  • It's either a number or a missing value (-)
                         81.107.39.38 -       ... "GET /ply/ HTTP/1.1" 304 -


                  • Converting the value
                           if bytestr != '-':
                              bytes = int(bytestr)



A Non-Generator Soln
                • Just do a simple for-loop
                       wwwlog = open("access-log")
                       total = 0
                       for line in wwwlog:
                           bytestr = line.rsplit(None,1)[1]
                           if bytestr != '-':
                               total += int(bytestr)

                       print "Total", total


                 • We read line-by-line and just update a sum
                 • However, that's so 90s...
A Generator Solution
                  • Let's use some generator expressions
                          wwwlog     = open("access-log")
                          bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
                          bytes      = (int(x) for x in bytecolumn if x != '-')

                          print "Total", sum(bytes)



                  • Whoa! That's different!
                     • Less code
                     • A completely different programming style
Generators as a Pipeline
                   • To understand the solution, think of it as a data
                          processing pipeline

  access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total




                   • Each step is defined by iteration/generation
                          wwwlog     = open("access-log")
                          bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
                          bytes      = (int(x) for x in bytecolumn if x != '-')

                          print "Total", sum(bytes)



Being Declarative
            • At each step of the pipeline, we declare an
                   operation that will be applied to the entire
                   input stream
  access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total




              bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)




                            This operation gets applied to
                               every line of the log file

Being Declarative

                • Instead of focusing on the problem at a
                       line-by-line level, you just break it down
                       into big operations that operate on the
                       whole file
                • This is very much a "declarative" style
                • The key : Think big...

Iteration is the Glue
               • The glue that holds the pipeline together is the
                      iteration that occurs in each step
                        wwwlog     = open("access-log")

                        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)

                        bytes      = (int(x) for x in bytecolumn if x != '-')

                        print "Total", sum(bytes)


               • The calculation is being driven by the last step
               • The sum() function is consuming values being
                      pulled through the pipeline (via .next() calls)
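You can watch the pulling happen by driving the pipeline by hand (a sketch with a few in-memory lines standing in for the log file; Python 3 spells the call `next(obj)`):

```python
lines = [
    '81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587\n',
    '81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -\n',
    '66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447\n',
]
bytecolumn = (line.rsplit(None, 1)[1] for line in lines)
values = (int(x) for x in bytecolumn if x != '-')

print(next(values))   # 7587 -- one line pulled through both stages
print(next(values))   # 4447 -- the '-' line was filtered on the way through
```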

Performance

                       • Surely, this generator approach has all
                              sorts of fancy-dancy magic that is slow.
                       • Let's check it out on a 1.3Gb log file...
                  % ls -l big-access-log
                  -rw-r--r-- beazley 1303238000 Feb 29 08:06 big-access-log




Performance Contest
            wwwlog = open("big-access-log")
            total = 0
            for line in wwwlog:
                bytestr = line.rsplit(None,1)[1]
                if bytestr != '-':
                    total += int(bytestr)
            print "Total", total
                                                        Time: 27.20s


            wwwlog     = open("big-access-log")
            bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
            bytes      = (int(x) for x in bytecolumn if x != '-')
            print "Total", sum(bytes)
                                                        Time: 25.96s
Commentary
                  • Not only was it not slow, it was 5% faster
                  • And it was less code
                  • And it was relatively easy to read
                   • And frankly, I like it a whole lot better...
               "Back in the old days, we used AWK for this and
                 we liked it. Oh, yeah, and get off my lawn!"


Performance Contest
             wwwlog     = open("access-log")
             bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
             bytes      = (int(x) for x in bytecolumn if x != '-')
             print "Total", sum(bytes)
                                                        Time: 25.96s


             % awk '{ total += $NF } END { print total }' big-access-log
                                                        Time: 37.33s

                   Note: extracting the last column might not be
                   awk's strong point

Food for Thought

                    • At no point in our generator solution did
                           we ever create large temporary lists
                    • Thus, not only is that solution faster, it can
                           be applied to enormous data files
                    • It's competitive with traditional tools
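That claim is easy to illustrate: the same pipeline can sum a million synthetic log values while holding only one line in memory at a time (a sketch in Python 3 syntax; the log lines are made up):

```python
# A synthetic "log" of one million lines, generated lazily
loglines = ('"GET /x HTTP/1.1" 200 %d\n' % i for i in range(1000000))
bytecolumn = (line.rsplit(None, 1)[1] for line in loglines)
values = (int(x) for x in bytecolumn if x != '-')

print(sum(values))    # 499999500000 -- no list of a million items ever exists
```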

More Thoughts
                    • The generator solution was based on the
                           concept of pipelining data between
                           different components
                    • What if you had more advanced kinds of
                           components to work with?
                    • Perhaps you could perform different kinds
                           of processing by just plugging various
                           pipeline components together


This Sounds Familiar

                    • The Unix philosophy
                    • Have a collection of useful system utils
                    • Can hook these up to files or each other
                    • Perform complex tasks by piping data


Part 3
                                       Fun with Files and Directories




Programming Problem
           You have hundreds of web server logs scattered
           across various directories. In addition, some of
           the logs are compressed. Modify the last program
           so that you can easily read all of these logs

                              foo/
                                        access-log-012007.gz
                                        access-log-022007.gz
                                        access-log-032007.gz
                                        ...
                                        access-log-012008
                              bar/
                                        access-log-092007.bz2
                                        ...
                                        access-log-022008

os.walk()
              • A very useful function for searching the
                      file system
                       import os

                       for path, dirlist, filelist in os.walk(topdir):
                           # path     : Current directory
                           # dirlist : List of subdirectories
                           # filelist : List of files
                           ...



              • This utilizes generators to recursively walk
                      through the file system
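A self-contained demo (a sketch that builds a throwaway directory tree with `tempfile` rather than touching a real file system):

```python
import os
import tempfile

# Build a tiny tree: top/a.py, top/sub/b.py, top/sub/c.txt
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "sub"))
for name in ("a.py", os.path.join("sub", "b.py"), os.path.join("sub", "c.txt")):
    open(os.path.join(top, name), "w").close()

found = []
for path, dirlist, filelist in os.walk(top):
    for name in filelist:
        found.append(os.path.join(path, name))

print(sorted(os.path.basename(f) for f in found))   # ['a.py', 'b.py', 'c.txt']
```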


find
                • Generate all filenames in a directory tree
                       that match a given filename pattern
                       import os
                       import fnmatch

                       def gen_find(filepat,top):
                           for path, dirlist, filelist in os.walk(top):
                               for name in fnmatch.filter(filelist,filepat):
                                   yield os.path.join(path,name)


                • Examples
                       pyfiles = gen_find("*.py","/")
                       logs    = gen_find("access-log*","/usr/www/")




Performance Contest
              pyfiles = gen_find("*.py","/")
              for name in pyfiles:
                  print name
                                               Wall Clock Time: 559s


              % find / -name '*.py'
                                               Wall Clock Time: 468s

           Performed on a 750GB file system
            containing about 140000 .py files

A File Opener
               • Open a sequence of filenames
                         import gzip, bz2
                         def gen_open(filenames):
                             for name in filenames:
                                 if name.endswith(".gz"):
                                       yield gzip.open(name)
                                 elif name.endswith(".bz2"):
                                       yield bz2.BZ2File(name)
                                 else:
                                       yield open(name)


               • This is interesting.... it takes a sequence of
                      filenames as input and yields a sequence of open
                      file objects

cat
                • Concatenate items from one or more
                       sources into a single sequence of items
                       def gen_cat(sources):
                           for s in sources:
                               for item in s:
                                   yield item


                • Example:
                       lognames = gen_find("access-log*", "/usr/www")
                       logfiles = gen_open(lognames)
                       loglines = gen_cat(logfiles)




grep
              • Generate a sequence of lines that contain
                      a given regular expression
                       import re

                       def gen_grep(pat, lines):
                           patc = re.compile(pat)
                           for line in lines:
                               if patc.search(line): yield line


              • Example:
                        lognames = gen_find("access-log*", "/usr/www")
                        logfiles = gen_open(lognames)
                        loglines = gen_cat(logfiles)
                        patlines = gen_grep(pat, loglines)



Example
              • Find out how many bytes transferred for a
                     specific pattern in a whole directory of logs
                      pat                   = r"somepattern"
                      logdir                = "/some/dir/"

                      filenames             =   gen_find("access-log*",logdir)
                      logfiles              =   gen_open(filenames)
                      loglines              =   gen_cat(logfiles)
                      patlines              =   gen_grep(pat,loglines)
                      bytecolumn            =   (line.rsplit(None,1)[1] for line in patlines)
                      bytes                 =   (int(x) for x in bytecolumn if x != '-')

                      print "Total", sum(bytes)
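The same pipeline can be exercised end-to-end without a log directory by feeding it a short list of strings. The sample lines below are invented for the demo, and the code is written for Python 3 (unlike the Python 2 print statement on the slide):

```python
import re

loglines = [
    '1.2.3.4 - - [x] "GET /a" 200 1000\n',
    '1.2.3.4 - - [x] "GET /b" 200 -\n',
    '5.6.7.8 - - [x] "GET /spam" 404 250\n',
]

def gen_grep(pat, lines):
    patc = re.compile(pat)
    return (line for line in lines if patc.search(line))

patlines   = gen_grep(r"GET", loglines)
bytecolumn = (line.rsplit(None, 1)[1] for line in patlines)   # last column
nbytes     = (int(x) for x in bytecolumn if x != '-')          # skip missing
total = sum(nbytes)
print("Total", total)   # Total 1250
```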




Important Concept
                • Generators decouple iteration from the
                       code that uses the results of the iteration
                • In the last example, we're performing a
                       calculation on a sequence of lines
                • It doesn't matter where or how those
                       lines are generated
                • Thus, we can plug any number of
                       components together up front as long as
                       they eventually produce a line sequence

Part 4
                                            Parsing and Processing Data




Programming Problem
           Web server logs consist of different columns of
           data. Parse each line into a useful data structure
           that allows us to easily inspect the different fields.


      81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET ..." 200 7587




                    host referrer user [datetime] "request" status bytes




Parsing with Regex
               • Let's route the lines through a regex parser
                         logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] '                                   r'"(\S+) (\S+) (\S+)" (\S+) (\S+)'

                        logpat = re.compile(logpats)

                        groups              = (logpat.match(line) for line in loglines)
                        tuples              = (g.groups() for g in groups if g)


                 • This generates a sequence of tuples
                       ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
                       'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
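Here is the pattern applied to one sample line, written out so the escapes are unambiguous (the backslashes in `\S` and `\[` tend to get mangled when slides are converted to text):

```python
import re

logpats = (r'(\S+) (\S+) (\S+) \[(.*?)\] '
           r'"(\S+) (\S+) (\S+)" (\S+) (\S+)')
logpat = re.compile(logpats)

line = ('81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] '
        '"GET /ply/ply.html HTTP/1.1" 200 7587')
t = logpat.match(line).groups()
print(t)
# ('81.107.39.38', '-', '-', '24/Feb/2008:00:08:59 -0600',
#  'GET', '/ply/ply.html', 'HTTP/1.1', '200', '7587')
```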




Tuple Commentary
                • I generally don't like data processing on tuples
                       ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
                       'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')


                • First, they are immutable--so you can't modify
                • Second, to extract specific fields, you have to
                       remember the column number--which is
                       annoying if there are a lot of columns
                • Third, existing code breaks if you change the
                       number of fields
Tuples to Dictionaries
               • Let's turn tuples into dictionaries
                    colnames                = ('host','referrer','user','datetime',
                                               'method','request','proto','status','bytes')

                    log                     = (dict(zip(colnames,t)) for t in tuples)


               • This generates a sequence of named fields
                          { 'status' :           '200',
                            'proto'   :          'HTTP/1.1',
                            'referrer':          '-',
                            'request' :          '/ply/ply.html',
                            'bytes'   :          '97238',
                            'datetime':          '24/Feb/2008:00:08:59 -0600',
                            'host'    :          '140.180.132.213',
                            'user'    :          '-',
                            'method' :           'GET'}
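The dict(zip(...)) trick in isolation, using the tuple from the previous slide:

```python
colnames = ('host','referrer','user','datetime',
            'method','request','proto','status','bytes')
t = ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
     'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')

# zip pairs each column name with the matching tuple element
r = dict(zip(colnames, t))
print(r['host'], r['status'])   # 71.201.176.194 200
```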

Field Conversion
              • You might want to map specific dictionary fields
                     through a conversion function (e.g., int(), float())
                       def field_map(dictseq,name,func):
                           for d in dictseq:
                               d[name] = func(d[name])
                               yield d


              • Example: Convert a few field values
                       log = field_map(log,"status", int)
                       log = field_map(log,"bytes",
                                       lambda s: int(s) if s !='-' else 0)
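field_map can be tested on a couple of hand-written dictionaries (Python 3 syntax; the sample records are made up):

```python
def field_map(dictseq, name, func):
    # Apply func to the named field of every dictionary in the stream
    for d in dictseq:
        d[name] = func(d[name])
        yield d

log = iter([{'status': '200', 'bytes': '97238'},
            {'status': '404', 'bytes': '-'}])
log = field_map(log, "status", int)
log = field_map(log, "bytes", lambda s: int(s) if s != '-' else 0)
records = list(log)
print(records)   # [{'status': 200, 'bytes': 97238}, {'status': 404, 'bytes': 0}]
```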




Field Conversion
            • Creates dictionaries of converted values
                    { 'status': 200,                  # <-- converted
                      'proto': 'HTTP/1.1',
                      'referrer': '-',
                      'request': '/ply/ply.html',
                      'datetime': '24/Feb/2008:00:08:59 -0600',
                      'bytes': 97238,                 # <-- converted
                      'host': '140.180.132.213',
                      'user': '-',
                      'method': 'GET'}


              • Again, this is just one big processing pipeline

The Code So Far
            lognames             =    gen_find("access-log*","www")
            logfiles             =    gen_open(lognames)
            loglines             =    gen_cat(logfiles)
            groups               =    (logpat.match(line) for line in loglines)
            tuples               =    (g.groups() for g in groups if g)

            colnames = ('host','referrer','user','datetime','method',
                          'request','proto','status','bytes')

            log                  = (dict(zip(colnames,t)) for t in tuples)
            log                  = field_map(log,"bytes",
                                             lambda s: int(s) if s != '-' else 0)
            log                  = field_map(log,"status",int)




Getting Organized
            • As a processing pipeline grows, certain parts of it
                   may be useful components on their own


                [generate lines from a set         [parse a sequence of lines from
                 of files in a directory]  --->     Apache server logs into a
                                                    sequence of dictionaries]

            • A series of pipeline stages can be easily
                   encapsulated by a normal Python function

Packaging
            • Example : multiple pipeline stages inside a function
                      def lines_from_dir(filepat, dirname):
                          names   = gen_find(filepat,dirname)
                          files   = gen_open(names)
                          lines   = gen_cat(files)
                          return lines


            • This is now a general purpose component that can
                   be used as a single element in other pipelines


Packaging
         • Example : Parse an Apache log into dicts
             def apache_log(lines):
                 groups     = (logpat.match(line) for line in lines)
                 tuples     = (g.groups() for g in groups if g)

                      colnames              = ('host','referrer','user','datetime','method',
                                               'request','proto','status','bytes')

                      log                   = (dict(zip(colnames,t)) for t in tuples)
                      log                   = field_map(log,"bytes",
                                                        lambda s: int(s) if s != '-' else 0)
                      log                   = field_map(log,"status",int)

                      return log
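Pulling the pieces together, apache_log can be run self-contained on one sample line (Python 3 sketch; non-matching lines are silently dropped by the `if g` filter):

```python
import re

logpat = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] '
                    r'"(\S+) (\S+) (\S+)" (\S+) (\S+)')

def field_map(dictseq, name, func):
    for d in dictseq:
        d[name] = func(d[name])
        yield d

def apache_log(lines):
    groups = (logpat.match(line) for line in lines)
    tuples = (g.groups() for g in groups if g)        # drop non-matches
    colnames = ('host','referrer','user','datetime','method',
                'request','proto','status','bytes')
    log = (dict(zip(colnames, t)) for t in tuples)
    log = field_map(log, "bytes", lambda s: int(s) if s != '-' else 0)
    log = field_map(log, "status", int)
    return log

sample = ['81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] '
          '"GET /ply/ply.html HTTP/1.1" 200 7587\n',
          'not a log line\n']
records = list(apache_log(sample))
print(len(records), records[0]['status'], records[0]['bytes'])   # 1 200 7587
```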




Example Use
               • It's easy
                        lines = lines_from_dir("access-log*","www")
                        log   = apache_log(lines)

                        for r in log:
                            print r



               • Different components have been subdivided
                       according to the data that they process


Food for Thought
               • When creating pipeline components, it's
                       critical to focus on the inputs and outputs
               • You will get the most flexibility when you
                       use a standard set of datatypes
               • Is it simpler to have a bunch of components
                       that all operate on dictionaries or to have
                       components that require inputs/outputs to
                       be different kinds of user-defined instances?

A Query Language
            • Now that we have our log, let's do some queries
            • Find the set of all documents that 404
                     stat404 = set(r['request'] for r in log
                                         if r['status'] == 404)


            • Print all requests that transfer over a megabyte
                     large = (r for r in log
                                if r['bytes'] > 1000000)

                     for r in large:
                         print r['request'], r['bytes']



A Query Language
            • Find the largest data transfer
                    print "%d %s" % max((r['bytes'],r['request'])
                                         for r in log)


            • Collect all unique host IP addresses
                    hosts = set(r['host'] for r in log)


            • Find the number of downloads of a file
                     sum(1 for r in log
                              if r['request'] == '/ply/ply-2.3.tar.gz')
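All of these queries run fine over a plain list of dictionaries too, which makes them easy to try out. The three sample records below are invented for the demo (Python 3 syntax):

```python
log = [
    {'host': '1.2.3.4', 'request': '/ply/ply-2.3.tar.gz', 'status': 200, 'bytes': 500000},
    {'host': '5.6.7.8', 'request': '/missing.html',       'status': 404, 'bytes': 0},
    {'host': '1.2.3.4', 'request': '/ply/ply-2.3.tar.gz', 'status': 200, 'bytes': 500000},
]

stat404   = set(r['request'] for r in log if r['status'] == 404)
hosts     = set(r['host'] for r in log)
biggest   = max((r['bytes'], r['request']) for r in log)
downloads = sum(1 for r in log if r['request'] == '/ply/ply-2.3.tar.gz')
print(stat404, len(hosts), biggest[0], downloads)
# {'/missing.html'} 2 500000 2
```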




A Query Language
            • Find out who has been hitting robots.txt
                     addrs = set(r['host'] for r in log
                                   if 'robots.txt' in r['request'])

                     import socket
                     for addr in addrs:
                         try:
                              print socket.gethostbyaddr(addr)[0]
                         except socket.herror:
                              print addr




Performance Study
            • Sadly, the last example doesn't run so fast on a
                   huge input file (53 minutes on the 1.3GB log)
            • But, the beauty of generators is that you can plug
                   filters in at almost any stage
                    lines         =   lines_from_dir("big-access-log",".")
                    lines         =   (line for line in lines if 'robots.txt' in line)
                    log           =   apache_log(lines)
                    addrs         =   set(r['host'] for r in log)
                    ...


            • That version takes 93 seconds
Some Thoughts
           • I like the idea of using generator expressions as a
                   pipeline query language
           • You can write simple filters, extract data, etc.
           • If you pass dictionaries/objects through the
                   pipeline, it becomes quite powerful
           • Feels similar to writing SQL queries

Part 5
                                            Processing Infinite Data




Question
                 • Have you ever used 'tail -f' in Unix?
                          % tail -f logfile
                          ...
                          ... lines of output ...
                          ...


                 • This prints the lines written to the end of a file
                 • The "standard" way to watch a log file
                 • I used this all of the time when working on
                        scientific simulations ten years ago...


Infinite Sequences

                 • Tailing a log file results in an "infinite" stream
                 • It constantly watches the file and yields lines as
                        soon as new data is written
                 • But you don't know how much data will actually
                        be written (in advance)
                 • And log files can often be enormous

Tailing a File
               • A Python version of 'tail -f'
                         import time
                         def follow(thefile):
                             thefile.seek(0,2)      # Go to the end of the file
                             while True:
                                  line = thefile.readline()
                                  if not line:
                                      time.sleep(0.1)    # Sleep briefly
                                      continue
                                  yield line


               • Idea : Seek to the end of the file and repeatedly
                       try to read new lines. If new data is written to
                       the file, we'll pick it up.
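A small self-test of follow() (Python 3): append a line from a background thread and watch the generator pick it up. The 0.3-second delay and the temporary file name are arbitrary choices for the demo.

```python
import os, tempfile, threading, time

def follow(thefile):
    thefile.seek(0, 2)              # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)         # Sleep briefly and retry
            continue
        yield line

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("old line\n")           # Already present; follow() must skip it

def writer():
    time.sleep(0.3)
    with open(path, "a") as f:
        f.write("new line\n")

threading.Thread(target=writer).start()
first = next(follow(open(path)))    # Blocks until the writer appends
print(repr(first))   # 'new line\n'
```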
Example
               • Using our follow function
                           logfile = open("access-log")
                           loglines = follow(logfile)

                           for line in loglines:
                               print line,



               • This produces the same output as 'tail -f'


Example
               • Turn the real-time log file into records
                        logfile = open("access-log")
                        loglines = follow(logfile)
                        log      = apache_log(loglines)



               • Print out all 404 requests as they happen
                        r404 = (r for r in log if r['status'] == 404)
                        for r in r404:
                            print r['host'],r['datetime'],r['request']




Commentary
              • We just plugged this new input scheme onto
                     the front of our processing pipeline
              • Everything else still works, with one caveat:
                     functions that consume an entire iterable won't
                     terminate (min, max, sum, set, etc.)
              • Nevertheless, we can easily write processing
                     steps that operate on an infinite data stream


Part 6
                                            Feeding the Pipeline




Feeding Generators
                 • In order to feed a generator processing
                        pipeline, you need to have an input source
                 • So far, we have looked at two file-based inputs
                 • Reading a file
                            lines = open(filename)


                 • Tailing a file
                            lines = follow(open(filename))




A Thought
                 • There is no rule that says you have to
                        generate pipeline data from a file.
                 • Or that the input data has to be a string
                 • Or that it has to be turned into a dictionary
                 • Remember: All Python objects are "first-class"
                 • Which means that all objects are fair-game for
                        use in a generator pipeline


Generating Connections
                 • Generate a sequence of TCP connections
                           import socket
                           def receive_connections(addr):
                               s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
                               s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
                               s.bind(addr)
                               s.listen(5)
                               while True:
                                    client = s.accept()
                                    yield client


                 • Example:
                          for c,a in receive_connections(("",9000)):
                              c.send("Hello World\n")
                              c.close()


Generating Messages
             • Receive a sequence of UDP messages
                      import socket
                      def receive_messages(addr,maxsize):
                          s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
                          s.bind(addr)
                          while True:
                               msg = s.recvfrom(maxsize)
                               yield msg

             • Example:
                     for msg, addr in receive_messages(("",10000),1024):
                         print msg, "from", addr




Part 7
                                            Extending the Pipeline




Multiple Processes
               • Can you extend a processing pipeline across
                      processes and machines?



                      process 1 ---(socket or pipe)---> process 2


Pickler/Unpickler
               • Turn a generated sequence into pickled objects
                         import pickle

                         def gen_pickle(source, protocol=pickle.HIGHEST_PROTOCOL):
                            for item in source:
                                yield pickle.dumps(item, protocol)

                        def gen_unpickle(infile):
                            while True:
                                 try:
                                      item = pickle.load(infile)
                                      yield item
                                 except EOFError:
                                      return


               • Now, attach these to a pipe or socket
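Before attaching them to a socket, the pair can be round-tripped through an in-memory byte stream (Python 3 sketch; io.BytesIO stands in for the pipe):

```python
import io, pickle

def gen_pickle(source, protocol=pickle.HIGHEST_PROTOCOL):
    # Serialize each item to a bytes chunk
    for item in source:
        yield pickle.dumps(item, protocol)

def gen_unpickle(infile):
    # Read back pickled items until the stream is exhausted
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return

buf = io.BytesIO()
for chunk in gen_pickle([{'status': 200}, (1, 2), "hello"]):
    buf.write(chunk)
buf.seek(0)
items = list(gen_unpickle(buf))
print(items)   # [{'status': 200}, (1, 2), 'hello']
```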
Sender/Receiver
               • Example: Sender
                        def sendto(source,addr):
                            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
                            s.connect(addr)
                            for pitem in gen_pickle(source):
                                s.sendall(pitem)
                            s.close()

               • Example: Receiver
                        def receivefrom(addr):
                            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
                            s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
                            s.bind(addr)
                            s.listen(5)
                            c,a = s.accept()
                            for item in gen_unpickle(c.makefile()):
                                yield item
                            c.close()
Example Use
               • Example: Read log lines and parse into records
                        # netprod.py

                        lines = follow(open("access-log"))
                        log   = apache_log(lines)
                        sendto(log,("",15000))


               • Example: Pick up the log on another machine
                        # netcons.py
                        for r in receivefrom(("",15000)):
                            print r




Generators and Threads
               • Processing pipelines sometimes come up in the
                      context of thread programming
               • Producer/consumer problems
                      Producer (Thread 1) ---> Consumer (Thread 2)


               • Question: Can generator pipelines be
                      integrated with thread programming?

Multiple Threads
               • For example, can a generator pipeline span
                      multiple threads?

                         Thread 1 ---> ? ---> Thread 2



               • Yes, if you connect them with a Queue object

Generators and Queues
             • Feed a generated sequence into a queue
                      # genqueue.py
                      def sendto_queue(source, thequeue):
                          for item in source:
                               thequeue.put(item)
                          thequeue.put(StopIteration)


             • Generate items received on a queue
                      def genfrom_queue(thequeue):
                          while True:
                               item = thequeue.get()
                               if item is StopIteration: break
                               yield item


             • Note: Using StopIteration as a sentinel
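The queue pair can be exercised without threads at all, since the sentinel marks the end of the stream. Note that Python 3 renames the Py2 `Queue` module on the slide to `queue`:

```python
import queue

def sendto_queue(source, thequeue):
    for item in source:
        thequeue.put(item)
    thequeue.put(StopIteration)     # Sentinel marking end of stream

def genfrom_queue(thequeue):
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item

q = queue.Queue()
sendto_queue(iter([1, 2, 3]), q)    # Single-threaded here; normally a producer thread
result = list(genfrom_queue(q))
print(result)   # [1, 2, 3]
```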
Thread Example
            • Here is a consumer function
                     # A consumer.   Prints out 404 records.
                     def print_r404(log_q):
                         log = genfrom_queue(log_q)
                         r404 = (r for r in log if r['status'] == 404)
                         for r in r404:
                             print r['host'],r['datetime'],r['request']


            • This function will be launched in its own thread
            • Using a Queue object as the input source

Thread Example
             • Launching the consumer
                     import threading, Queue
                     log_q = Queue.Queue()
                     r404_thr = threading.Thread(target=print_r404,
                                                 args=(log_q,))
                     r404_thr.start()


              • Code that feeds the consumer
                      lines = follow(open("access-log"))
                      log   = apache_log(lines)
                      sendto_queue(log,log_q)




Part 8
                                            Advanced Data Routing




The Story So Far
              • You can use generators to set up pipelines
              • You can extend the pipeline over the network
              • You can extend it between threads
              • However, it's still just a pipeline (there is one
                      input and one output).
              • Can you do more than that?

Multiple Sources
               • Can a processing pipeline be fed by multiple
                       sources---for example, multiple generators?

                        source1    source2    source3
                             \        |        /
                              v       v       v
                      for item in sources:
                          # Process item




Concatenation
               • Concatenate one source after another (reprise)
                          def gen_cat(sources):
                              for s in sources:
                                  for item in s:
                                      yield item


               • This generates one big sequence
               • Consumes each generator one at a time
               • But only works if generators terminate
               • So, you wouldn't use this for real-time streams
Parallel Iteration
                • Zipping multiple generators together
                        import itertools

                        z = itertools.izip(s1,s2,s3)


                • This one is only marginally useful
                • Requires generators to go lock-step
                • Terminates when any input ends

Multiplexing
               • Feed a pipeline from multiple generators in
                       real-time--producing values as they arrive

               • Example use
                       log1 = follow(open("foo/access-log"))
                       log2 = follow(open("bar/access-log"))

                       lines = multiplex([log1,log2])


               • There is no way to poll a generator
               • And only one for-loop executes at a time
Multiplexing
               • You can multiplex if you use threads and you
                       use the tools we've developed so far

                • Idea :
                            source1    source2    source3
                                 \        |        /
                                  v       v       v
                                        queue
                                          |
                                          v
                              for item in queue:
                                  # Process item



Multiplexing
                   # genmultiplex.py

                   import threading, Queue
                   from genqueue import *
                   from gencat import *

                   def multiplex(sources):
                       in_q = Queue.Queue()
                       consumers = []
                       for src in sources:
                           thr = threading.Thread(target=sendto_queue,
                                                  args=(src,in_q))
                           thr.start()
                           consumers.append(genfrom_queue(in_q))
                       return gen_cat(consumers)


              • Note: This is the trickiest example so far...
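Because it is the trickiest example, it is worth running on two small finite sources (Python 3 sketch; `queue` replaces the Py2 `Queue` module, and the output is sorted because thread interleaving is nondeterministic):

```python
import queue, threading

def sendto_queue(source, thequeue):
    for item in source:
        thequeue.put(item)
    thequeue.put(StopIteration)     # One sentinel per source

def genfrom_queue(thequeue):
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item

def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item

def multiplex(sources):
    in_q = queue.Queue()
    consumers = []
    for src in sources:
        thr = threading.Thread(target=sendto_queue, args=(src, in_q))
        thr.start()
        consumers.append(genfrom_queue(in_q))
    return gen_cat(consumers)       # Each consumer stops at one sentinel

merged = sorted(multiplex([iter([1, 3, 5]), iter([2, 4, 6])]))
print(merged)   # [1, 2, 3, 4, 5, 6]
```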
Multiplexing
                • Each input source is wrapped by a thread which
                       runs the generator and dumps the items into a
                       shared queue

                         source1          source2          source3
                            |                |                |
                      sendto_queue     sendto_queue     sendto_queue
                             \               |               /
                              +--------->  in_q  <----------+



Multiplexing
                • For each source, we create a consumer of
                       queue data
                                    in_q (queue)
                                         |
              consumers = [genfrom_queue, genfrom_queue, genfrom_queue]


                 • Now, just concatenate the consumers together
                           gen_cat(consumers)


                 • Each time a producer terminates, we move to
                         the next consumer (until there are no more)

Broadcasting

               • Can you broadcast to multiple consumers?
                                    generator
                                  /     |     \
                                 v      v      v
                         consumer1  consumer2  consumer3




Broadcasting
               • Consume a generator and send to consumers
                           def broadcast(source, consumers):
                               for item in source:
                                   for c in consumers:
                                       c.send(item)



               • It works, but now the control-flow is unusual
               • The broadcast loop is what runs the program
               • Consumers run by having items sent to them

Consumers
               • To create a consumer, define an object with a
                       send() method on it
                         class Consumer(object):
                             def send(self,item):
                                  print self, "got", item


               • Example:
                         c1 = Consumer()
                         c2 = Consumer()
                         c3 = Consumer()

                         lines = follow(open("access-log"))
                         broadcast(lines,[c1,c2,c3])
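The broadcast loop itself is easy to test with a consumer that simply records what it is sent (the `Collector` class is a hypothetical stand-in, invented for the demo; Python 3 syntax):

```python
def broadcast(source, consumers):
    # Push every item from the source to every consumer
    for item in source:
        for c in consumers:
            c.send(item)

class Collector(object):
    # Hypothetical consumer that records everything sent to it
    def __init__(self):
        self.items = []
    def send(self, item):
        self.items.append(item)

c1, c2 = Collector(), Collector()
broadcast(iter(["a", "b"]), [c1, c2])
print(c1.items, c2.items)   # ['a', 'b'] ['a', 'b']
```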



Network Consumer
                  • Example:
                           import socket,pickle
                           class NetConsumer(object):
                               def __init__(self,addr):
                                    self.s = socket.socket(socket.AF_INET,
                                                           socket.SOCK_STREAM)
                                    self.s.connect(addr)
                               def send(self,item):
                                   pitem = pickle.dumps(item)
                                   self.s.sendall(pitem)
                               def close(self):
                                   self.s.close()


                    • This will route items across the network
Network Consumer
                  • Example Usage:
                          class Stat404(NetConsumer):
                              def send(self,item):
                                  if item['status'] == 404:
                                      NetConsumer.send(self,item)

                          lines = follow(open("access-log"))
                          log   = apache_log(lines)

                          stat404 = Stat404(("somehost",15000))

                          broadcast(log, [stat404])


                    • The 404 entries will go elsewhere...
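The receiving side isn't shown in the slides. A plausible sketch relies on the fact that pickles are self-delimiting, so repeated pickle.load() calls on a file wrapped around the socket recover one item at a time (the function name recv_items is invented):

```python
import pickle

def recv_items(conn):
    # wrap the connected socket in a file object and unpickle
    # items until the sender closes the connection (EOFError)
    f = conn.makefile("rb")
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        pass
```

On the server you would bind, listen, and accept as usual, then simply iterate over recv_items(conn), which gives you back the same kind of record stream the sender had.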
Commentary

               • Once you start broadcasting, consumers can't
                       follow the same programming model as before
               • Only one for-loop can run the pipeline.
               • However, you can feed an existing pipeline if
                       you're willing to run it in a different thread or in
                       a different process



Consumer Thread
                 • Example: Routing items to a separate thread
                             import Queue, threading
                             from genqueue import genfrom_queue

                             class ConsumerThread(threading.Thread):
                                 def __init__(self,target):
                                      threading.Thread.__init__(self)
                                      self.setDaemon(True)
                                      self.in_q   = Queue.Queue()
                                      self.target = target
                                 def send(self,item):
                                      self.in_q.put(item)
                                 def run(self):
                                     self.target(genfrom_queue(self.in_q))
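The genqueue module imported above isn't shown in the slides. A plausible minimal genfrom_queue turns a Queue back into a generator; the sentinel convention for ending the stream is my assumption, the real module may differ:

```python
def genfrom_queue(thequeue, sentinel=None):
    # yield items pulled off a Queue until the sentinel shows up
    while True:
        item = thequeue.get()
        if item is sentinel:
            break
        yield item
```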




Consumer Thread
                 • Sample usage (building on earlier code)
                        def find_404(log):
                            for r in (r for r in log if r['status'] == 404):
                                 print r['status'],r['datetime'],r['request']

                        def bytes_transferred(log):
                            total = 0
                            for r in log:
                                total += r['bytes']
                                print "Total bytes", total

                        c1 = ConsumerThread(find_404)
                        c1.start()
                        c2 = ConsumerThread(bytes_transferred)
                        c2.start()

                        lines = follow(open("access-log")) # Follow a log
                        log   = apache_log(lines)           # Turn into records
                        broadcast(log,[c1,c2])         # Broadcast to consumers
Part 9
                    Various Programming Tricks (And Debugging)




Putting it all Together

               • This data processing pipeline idea is powerful
               • But, it's also potentially mind-boggling
               • Especially when you have dozens of pipeline
                       stages, broadcasting, multiplexing, etc.
               • Let's look at a few useful tricks

Creating Generators
               • Any single-argument function is easy to turn
                       into a generator function
                          def generate(func):
                              def gen_func(s):
                                  for item in s:
                                      yield func(item)
                              return gen_func


               • Example:
                        gen_sqrt = generate(math.sqrt)
                        for x in gen_sqrt(xrange(100)):
                            print x




Debug Tracing
               • A debugging function that will print items going
                       through a generator
                          def trace(source):
                              for item in source:
                                  print item
                                  yield item


               • This can easily be placed around any generator
                          lines = follow(open("access-log"))
                          log   = trace(apache_log(lines))

                          r404          = trace(r for r in log if r['status'] == 404)


               • Note: you might consider the logging module for this
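Taking that note literally, here is a hedged variant of trace that reports through the logging module instead of print (the logger name "pipeline" is arbitrary):

```python
import logging

def trace(source, logger=None):
    # log each item at DEBUG level as it flows through the pipeline
    log = logger or logging.getLogger("pipeline")
    for item in source:
        log.debug("item: %r", item)
        yield item
```

This keeps the pass-through behavior but lets you turn tracing on and off per stage with standard logging configuration.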
Recording the Last Item
               • Store the last item generated in the generator
                          class storelast(object):
                              def __init__(self,source):
                                  self.source = source
                              def next(self):
                                  item = self.source.next()
                                  self.last = item
                                  return item
                              def __iter__(self):
                                  return self

               • This can be easily wrapped around a generator
                          lines = storelast(follow(open("access-log")))
                          log   = apache_log(lines)

                          for r in log:
                              print r
                              print lines.last
Shutting Down
               • Generators can be shut down using .close()
                         import time
                         def follow(thefile):
                             thefile.seek(0,2)      # Go to the end of the file
                             while True:
                                  line = thefile.readline()
                                  if not line:
                                      time.sleep(0.1)    # Sleep briefly
                                      continue
                                  yield line


              • Example:
                         lines = follow(open("access-log"))
                         for i,line in enumerate(lines):
                             print line,
                             if i == 10: lines.close()


Shutting Down
                • In the generator, GeneratorExit is raised
                          import time
                          def follow(thefile):
                              thefile.seek(0,2)      # Go to the end of the file
                              try:
                                   while True:
                                       line = thefile.readline()
                                       if not line:
                                           time.sleep(0.1)    # Sleep briefly
                                           continue
                                       yield line
                              except GeneratorExit:
                                   print "Follow: Shutting down"


                • This allows for resource cleanup (if needed)
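Since GeneratorExit also arrives when the generator is garbage collected or exits normally, cleanup is often written with try/finally instead of catching the exception. A sketch of the same follow(), restructured that way:

```python
import time

def follow(thefile):
    thefile.seek(0, 2)               # go to the end of the file
    try:
        while True:
            line = thefile.readline()
            if not line:
                time.sleep(0.1)      # sleep briefly
                continue
            yield line
    finally:
        # runs on .close(), on garbage collection, or on normal exit
        print("Follow: Shutting down")
```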
Ignoring Shutdown
                • Question: Can you ignore GeneratorExit?
                          import time
                          def follow(thefile):
                              thefile.seek(0,2)      # Go to the end of the file
                              while True:
                                  try:
                                       line = thefile.readline()
                                       if not line:
                                           time.sleep(0.1)     # Sleep briefly
                                           continue
                                       yield line
                                  except GeneratorExit:
                                       print "Forget about it"


                • Answer: No. You'll get a RuntimeError
Shutdown and Threads
              • Question : Can a thread shut down a generator
                     running in a different thread?
                 lines = follow(open("foo/test.log"))

                 def sleep_and_close(s):
                     time.sleep(s)
                     lines.close()

                 threading.Thread(target=sleep_and_close,args=(30,)).start()

                 for line in lines:
                     print line,




Shutdown and Threads
                 • Separate threads cannot call .close()
                • Output:
                      Exception in thread Thread-1:
                      Traceback (most recent call last):
                        File "/Library/Frameworks/Python.framework/Versions/2.5/
                      lib/python2.5/threading.py", line 460, in __bootstrap
                          self.run()
                        File "/Library/Frameworks/Python.framework/Versions/2.5/
                      lib/python2.5/threading.py", line 440, in run
                          self.__target(*self.__args, **self.__kwargs)
                        File "genfollow.py", line 31, in sleep_and_close
                          lines.close()
                      ValueError: generator already executing




Shutdown and Signals
                • Can you shut down a generator with a signal?
                        import signal
                        def sigusr1(signo,frame):
                            print "Closing it down"
                            lines.close()

                        signal.signal(signal.SIGUSR1,sigusr1)

                        lines = follow(open("access-log"))
                        for line in lines:
                            print line,

               • From the command line
                        % kill -USR1 pid



Shutdown and Signals
                • This also fails:
                         Traceback (most recent call last):
                           File "genfollow.py", line 35, in <module>
                             for line in lines:
                           File "genfollow.py", line 8, in follow
                             time.sleep(0.1)
                           File "genfollow.py", line 30, in sigusr1
                             lines.close()
                         ValueError: generator already executing


                • Sigh.

Shutdown
                 • The only way to shut down a generator
                        externally is to instrument it with a flag or
                        some kind of check
                        def follow(thefile,shutdown=None):
                            thefile.seek(0,2)
                            while True:
                                if shutdown and shutdown.isSet(): break
                                line = thefile.readline()
                                if not line:
                                   time.sleep(0.1)
                                   continue
                                yield line




Shutdown
                • Example:
                         import threading,signal

                         shutdown = threading.Event()
                         def sigusr1(signo,frame):
                             print "Closing it down"
                             shutdown.set()
                         signal.signal(signal.SIGUSR1,sigusr1)

                         lines = follow(open("access-log"),shutdown)
                         for line in lines:
                             print line,




Part 10
                                            Parsing and Printing




Incremental Parsing
               • Generators are a useful way to incrementally
                       parse almost any kind of data
                         # genrecord.py
                         import struct

                         def gen_records(record_format, thefile):
                             record_size = struct.calcsize(record_format)
                             while True:
                                   raw_record = thefile.read(record_size)
                                   if not raw_record:
                                       break
                                   yield struct.unpack(record_format, raw_record)


               • This function sweeps through a file and
                       generates a sequence of unpacked records
Incremental Parsing
               • Example:
                        from genrecord import *

                        f = open("stockdata.bin","rb")
                        for name, shares, price in gen_records("<8sif",f):
                             # Process data
                             ...
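To try this without a real data file, a few records can be packed into an in-memory buffer first (the ticker values here are invented; the format "<8sif" is an 8-byte name, an int, and a float, little-endian):

```python
import io
import struct

def gen_records(record_format, thefile):
    # same incremental parser as above
    record_size = struct.calcsize(record_format)
    while True:
        raw_record = thefile.read(record_size)
        if not raw_record:
            break
        yield struct.unpack(record_format, raw_record)

# pack some sample records into a byte buffer
buf = io.BytesIO()
for rec in [(b"ACME", 50, 91.1), (b"YHOO", 75, 34.25)]:
    buf.write(struct.pack("<8sif", *rec))
buf.seek(0)

records = list(gen_records("<8sif", buf))
```

Note that struct pads the 8s field with null bytes, so the names come back as b"ACME\x00\x00\x00\x00" and so on.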



               • Tip : Look at xml.etree.ElementTree.iterparse
                       for a neat way to incrementally process large
                       XML documents using generators
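A minimal sketch of that tip (the document, tag names, and helper name are invented): iterparse yields elements as they complete, and clearing each one after use keeps memory bounded on large documents.

```python
import io
import xml.etree.ElementTree as ET

def iter_elements(source, tag):
    # generate matching elements one at a time as parsing proceeds
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()   # discard the element's contents once handled

doc = io.BytesIO(b"<log><rec>a</rec><rec>b</rec></log>")
texts = [e.text for e in iter_elements(doc, "rec")]
```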


yield as print
               • Generator functions can use yield like a print
                       statement
               • Example:
                            def print_count(n):
                                 yield "Hello World\n"
                                 yield "\n"
                                 yield "Look at me count to %d\n" % n
                                 for i in xrange(n):
                                     yield "   %d\n" % i
                                 yield "I'm done!\n"


               • This is useful if you're producing I/O output, but
                       you want flexibility in how it gets handled

yield as print
               • Examples of processing the output stream:
                            # Generate the output
                            out = print_count(10)

                            # Turn it into one big string
                            out_str = "".join(out)

                            # Write it to a file
                            f = open("out.txt","w")
                            for chunk in out:
                                f.write(chunk)

                            # Send it across a network socket
                            for chunk in out:
                                s.sendall(chunk)



yield as print
           • This technique of producing output leaves the
                  exact output method unspecified
           • So, the code is not hardwired to use files,
                  sockets, or any other specific kind of output
           • There is an interesting code-reuse element
           • One use of this : WSGI applications
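As a hedged sketch of the WSGI case (the app itself is hypothetical; WSGI bodies are byte strings in Python 3, so this version of the counting generator yields bytes): the server iterates over whatever the application returns, so returning a generator streams the output.

```python
def print_count(n):
    # the earlier text-producing generator, emitting bytes for WSGI
    yield b"Hello World\n"
    yield b"Look at me count to %d\n" % n
    for i in range(n):
        yield b"   %d\n" % i
    yield b"I'm done!\n"

def app(environ, start_response):
    # a WSGI application can return any iterable of byte strings,
    # including a generator: the server pulls chunks as it sends them
    start_response("200 OK", [("Content-Type", "text/plain")])
    return print_count(10)
```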

Part 11
                                            Co-routines




The Final Frontier
                 • In Python 2.5, generators picked up the ability
                        to receive values using .send()
                             def recv_count():
                                 try:
                                      while True:
                                           n = (yield)     # Yield expression
                                           print "T-minus", n
                                 except GeneratorExit:
                                      print "Kaboom!"


                 • Think of this function as receiving values rather
                        than generating them

Example Use
                 • Using a receiver
                           >>> r = recv_count()
                           >>> r.next()                  Note: must call .next() here
                           >>> for i in range(5,0,-1):
                           ...       r.send(i)
                           ...
                           T-minus 5
                           T-minus 4
                           T-minus 3
                           T-minus 2
                           T-minus 1
                           >>> r.close()
                           Kaboom!
                           >>>




Co-routines
               • This form of a generator is a "co-routine"
               • Also sometimes called a "reverse-generator"
               • Python books (mine included) do a pretty poor
                      job of explaining how co-routines are supposed
                      to be used
              • I like to think of them as "receivers" or
                       "consumers": they receive values sent to them


Setting up a Coroutine
                 • To get a co-routine to run properly, you have to
                        ping it with a .next() operation first
                             def recv_count():
                                 try:
                                      while True:
                                           n = (yield)     # Yield expression
                                           print "T-minus", n
                                 except GeneratorExit:
                                      print "Kaboom!"


                  • Example:
                            r = recv_count()
                           r.next()


                  • This advances it to the first yield, where it will
                         receive its first value
@consumer decorator
                 • The .next() bit can be handled via decoration
                             def consumer(func):
                                 def start(*args,**kwargs):
                                     c = func(*args,**kwargs)
                                     c.next()
                                     return c
                                 return start

                  • Example:
                            @consumer
                           def recv_count():
                               try:
                                    while True:
                                         n = (yield)     # Yield expression
                                         print "T-minus", n
                               except GeneratorExit:
                                    print "Kaboom!"
@consumer decorator
                 • Using the decorated version
                            >>> r = recv_count()
                            >>> for i in range(5,0,-1):
                            ...     r.send(i)
                            ...
                            T-minus 5
                            T-minus 4
                            T-minus 3
                            T-minus 2
                            T-minus 1
                            >>> r.close()
                            Kaboom!
                            >>>

                  • Don't need the extra .next() step here
Coroutine Pipelines
                 • Co-routines also set up a processing pipeline
                  • Instead of being defined by iteration, it's
                         defined by pushing values into the pipeline
                         using .send()


                        .send()             .send()   .send()


                  • We already saw some of this with broadcasting
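A small runnable sketch of such a pipeline, reusing the @consumer decorator from a few slides back (the grep and collect stage names are invented, and the next() builtin is used so the code runs across Python versions):

```python
def consumer(func):
    # prime each co-routine on construction (from the @consumer slide)
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)
        return c
    return start

@consumer
def grep(pattern, target):
    # filter stage: push matching items further down the pipeline
    while True:
        item = (yield)
        if pattern in item:
            target.send(item)

@consumer
def collect(result):
    # sink stage: accumulate whatever reaches the end
    while True:
        result.append((yield))

results = []
pipe = grep("python", collect(results))
for line in ["python rocks", "java", "monty python"]:
    pipe.send(line)
```

Each .send() pushes one item through both stages before returning, so there is exactly one flow of control and no threads.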
Broadcasting (Reprise)
               • Consume a generator and send items to a set
                       of consumers
                           def broadcast(source, consumers):
                               for item in source:
                                   for c in consumers:
                                       c.send(item)



               • Notice that send() operation there
               • The consumers could be co-routines

Example
              @consumer
              def find_404():
                  while True:
                        r = (yield)
                        if r['status'] == 404:
                            print r['status'],r['datetime'],r['request']

              @consumer
              def bytes_transferred():
                  total = 0
                  while True:
                        r = (yield)
                        total += r['bytes']
                        print "Total bytes", total

              lines = follow(open("access-log"))
              log   = apache_log(lines)
              broadcast(log,[find_404(),bytes_transferred()])


Discussion

                  • In the last example, there were multiple consumers
                 • However, there were no threads
                 • Further exploration along these lines can take
                        you into co-operative multitasking, concurrent
                        programming without using threads
                 • But that's an entirely different tutorial!

Perl-C/C++ Integration with SwigPerl-C/C++ Integration with Swig
Perl-C/C++ Integration with Swig
 
Interfacing C/C++ and Python with SWIG
Interfacing C/C++ and Python with SWIGInterfacing C/C++ and Python with SWIG
Interfacing C/C++ and Python with SWIG
 
Python in Action (Part 1)
Python in Action (Part 1)Python in Action (Part 1)
Python in Action (Part 1)
 
Python Generator Hacking
Python Generator HackingPython Generator Hacking
Python Generator Hacking
 
Generator Tricks for Systems Programmers
Generator Tricks for Systems ProgrammersGenerator Tricks for Systems Programmers
Generator Tricks for Systems Programmers
 
WAD : A Module for Converting Fatal Extension Errors into Python Exceptions
WAD : A Module for Converting Fatal Extension Errors into Python ExceptionsWAD : A Module for Converting Fatal Extension Errors into Python Exceptions
WAD : A Module for Converting Fatal Extension Errors into Python Exceptions
 
SWIG : An Easy to Use Tool for Integrating Scripting Languages with C and C++
SWIG : An Easy to Use Tool for Integrating Scripting Languages with C and C++SWIG : An Easy to Use Tool for Integrating Scripting Languages with C and C++
SWIG : An Easy to Use Tool for Integrating Scripting Languages with C and C++
 
Python in Action (Part 2)
Python in Action (Part 2)Python in Action (Part 2)
Python in Action (Part 2)
 
In Search of the Perfect Global Interpreter Lock
In Search of the Perfect Global Interpreter LockIn Search of the Perfect Global Interpreter Lock
In Search of the Perfect Global Interpreter Lock
 
Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)
 
The New Preparation of Financial Statements Standard (SSARS 21)
The New Preparation of Financial Statements Standard (SSARS 21)The New Preparation of Financial Statements Standard (SSARS 21)
The New Preparation of Financial Statements Standard (SSARS 21)
 
Ctypes в игровых приложениях на python
Ctypes в игровых приложениях на pythonCtypes в игровых приложениях на python
Ctypes в игровых приложениях на python
 
Understanding the Python GIL
Understanding the Python GILUnderstanding the Python GIL
Understanding the Python GIL
 
An Introduction to Python Concurrency
An Introduction to Python ConcurrencyAn Introduction to Python Concurrency
An Introduction to Python Concurrency
 
Подключение внешних библиотек в python
Подключение внешних библиотек в pythonПодключение внешних библиотек в python
Подключение внешних библиотек в python
 

Ähnlich wie Generator Tricks for Systems Programmers, v2.0

Generator Tricks for Systems Programmers
Generator Tricks for Systems ProgrammersGenerator Tricks for Systems Programmers
Generator Tricks for Systems ProgrammersHiroshi Ono
 
Frequently asked questions answered frequently - but now for the last time
Frequently asked questions answered frequently - but now for the last timeFrequently asked questions answered frequently - but now for the last time
Frequently asked questions answered frequently - but now for the last timeAndreas Jung
 
yocto_scale_handout-with-notes
yocto_scale_handout-with-notesyocto_scale_handout-with-notes
yocto_scale_handout-with-notesSteve Arnold
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...Ambassador Labs
 
Design and Evolution of cyber-dojo
Design and Evolution of cyber-dojoDesign and Evolution of cyber-dojo
Design and Evolution of cyber-dojoJon Jagger
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
What is Python? (Silicon Valley CodeCamp 2014)
What is Python? (Silicon Valley CodeCamp 2014)What is Python? (Silicon Valley CodeCamp 2014)
What is Python? (Silicon Valley CodeCamp 2014)wesley chun
 
Scale11x lxc talk
Scale11x lxc talkScale11x lxc talk
Scale11x lxc talkdotCloud
 
Intro to Python (High School) Unit #1
Intro to Python (High School) Unit #1Intro to Python (High School) Unit #1
Intro to Python (High School) Unit #1Jay Coskey
 
Janus Workshop @ ClueCon 2020
Janus Workshop @ ClueCon 2020Janus Workshop @ ClueCon 2020
Janus Workshop @ ClueCon 2020Lorenzo Miniero
 
Python programming language introduction unit
Python programming language introduction unitPython programming language introduction unit
Python programming language introduction unitmichaelaaron25322
 
Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Itzik Kotler
 
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기Heejong Ahn
 
VA Smalltalk Update
VA Smalltalk UpdateVA Smalltalk Update
VA Smalltalk UpdateESUG
 

Ähnlich wie Generator Tricks for Systems Programmers, v2.0 (20)

Generator Tricks for Systems Programmers
Generator Tricks for Systems ProgrammersGenerator Tricks for Systems Programmers
Generator Tricks for Systems Programmers
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Stackato v3
Stackato v3Stackato v3
Stackato v3
 
Frequently asked questions answered frequently - but now for the last time
Frequently asked questions answered frequently - but now for the last timeFrequently asked questions answered frequently - but now for the last time
Frequently asked questions answered frequently - but now for the last time
 
yocto_scale_handout-with-notes
yocto_scale_handout-with-notesyocto_scale_handout-with-notes
yocto_scale_handout-with-notes
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
What is Python?
What is Python?What is Python?
What is Python?
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
 
Design and Evolution of cyber-dojo
Design and Evolution of cyber-dojoDesign and Evolution of cyber-dojo
Design and Evolution of cyber-dojo
 
Handling I/O in Java
Handling I/O in JavaHandling I/O in Java
Handling I/O in Java
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
What is Python? (Silicon Valley CodeCamp 2014)
What is Python? (Silicon Valley CodeCamp 2014)What is Python? (Silicon Valley CodeCamp 2014)
What is Python? (Silicon Valley CodeCamp 2014)
 
Scale11x lxc talk
Scale11x lxc talkScale11x lxc talk
Scale11x lxc talk
 
Intro to Python (High School) Unit #1
Intro to Python (High School) Unit #1Intro to Python (High School) Unit #1
Intro to Python (High School) Unit #1
 
Janus Workshop @ ClueCon 2020
Janus Workshop @ ClueCon 2020Janus Workshop @ ClueCon 2020
Janus Workshop @ ClueCon 2020
 
Python programming language introduction unit
Python programming language introduction unitPython programming language introduction unit
Python programming language introduction unit
 
Stackato
StackatoStackato
Stackato
 
Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)
 
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
TypeScript와 Flow: 
자바스크립트 개발에 정적 타이핑 도입하기
 
VA Smalltalk Update
VA Smalltalk UpdateVA Smalltalk Update
VA Smalltalk Update
 

Kürzlich hochgeladen

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Kürzlich hochgeladen (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Generator Tricks for Systems Programmers, v2.0

  • 1. Generator Tricks For Systems Programmers (Version 2.0) David Beazley http://www.dabeaz.com Presented at PyCon UK 2008 Copyright (C) 2008, http://www.dabeaz.com 1- 1
  • 2. Introduction 2.0 • At PyCon'2008 (Chicago), I gave this tutorial on generators that you are about to see • About 80 people attended • Afterwards, there were more than 15000 downloads of my presentation slides • Clearly there was some interest • This is a revised version... Copyright (C) 2008, http://www.dabeaz.com 1- 2
  • 3. An Introduction • Generators are cool! • But what are they? • And what are they good for? • That's what this tutorial is about Copyright (C) 2008, http://www.dabeaz.com 1- 3
  • 4. About Me • I'm a long-time Pythonista • First started using Python with version 1.3 • Author : Python Essential Reference • Responsible for a number of open source Python-related packages (Swig, PLY, etc.) Copyright (C) 2008, http://www.dabeaz.com 1- 4
  • 5. My Story My addiction to generators started innocently enough. I was just a happy Python programmer working away in my secret lair when I got "the call." A call to sort through 1.5 Terabytes of C++ source code (~800 weekly snapshots of a million line application). That's when I discovered the os.walk() function. This was interesting I thought... Copyright (C) 2008, http://www.dabeaz.com 1- 5
  • 6. Back Story • I started using generators on a more day-to- day basis and found them to be wicked cool • One of Python's most powerful features! • Yet, they still seem rather exotic • In my experience, most Python programmers view them as kind of a weird fringe feature • This is unfortunate. Copyright (C) 2008, http://www.dabeaz.com 1- 6
  • 7. Python Books Suck! • Let's take a look at ... my book (a motivating example) • Whoa! Counting.... that's so useful. • Almost as useful as sequences of Fibonacci numbers, random numbers, and squares Copyright (C) 2008, http://www.dabeaz.com 1- 7
  • 8. Our Mission • Some more practical uses of generators • Focus is "systems programming" • Which loosely includes files, file systems, parsing, networking, threads, etc. • My goal : To provide some more compelling examples of using generators • I'd like to plant some seeds and inspire tool builders Copyright (C) 2008, http://www.dabeaz.com 1- 8
  • 9. Support Files • Files used in this tutorial are available here: http://www.dabeaz.com/generators-uk/ • Go there to follow along with the examples Copyright (C) 2008, http://www.dabeaz.com 1- 9
  • 10. Disclaimer • This isn't meant to be an exhaustive tutorial on generators and related theory • Will be looking at a series of examples • I don't know if the code I've written is the "best" way to solve any of these problems. • Let's have a discussion Copyright (C) 2008, http://www.dabeaz.com 1- 10
  • 11. Performance Disclosure • There are some later performance numbers • Python 2.5.1 on OS X 10.4.11 • All tests were conducted on the following: • Mac Pro 2x2.66 Ghz Dual-Core Xeon • 3 Gbytes RAM • WDC WD2500JS-41SGB0 Disk (250G) • Timings are 3-run average of 'time' command Copyright (C) 2008, http://www.dabeaz.com 1- 11
  • 12. Part I Introduction to Iterators and Generators Copyright (C) 2008, http://www.dabeaz.com 1- 12
  • 13. Iteration

    • As you know, Python has a "for" statement
    • You use it to loop over a collection of items

        >>> for x in [1,4,5,10]:
        ...     print x,
        ...
        1 4 5 10
        >>>

    • And, as you have probably noticed, you can iterate over many different kinds of objects (not just lists)

    Copyright (C) 2008, http://www.dabeaz.com 1-13
  • 14. Iterating over a Dict

    • If you loop over a dictionary you get keys

        >>> prices = { 'GOOG' : 490.10,
        ...            'AAPL' : 145.23,
        ...            'YHOO' : 21.71 }
        ...
        >>> for key in prices:
        ...     print key
        ...
        YHOO
        GOOG
        AAPL
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-14
  • 15. Iterating over a String

    • If you loop over a string, you get characters

        >>> s = "Yow!"
        >>> for c in s:
        ...     print c
        ...
        Y
        o
        w
        !
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-15
  • 16. Iterating over a File

    • If you loop over a file you get lines

        >>> for line in open("real.txt"):
        ...     print line,
        ...
        Real Programmers write in FORTRAN
        Maybe they do now, in this decadent era of
        Lite beer, hand calculators, and "user-friendly"
        software but back in the Good Old Days, when
        the term "software" sounded funny and Real
        Computers were made out of drums and vacuum
        tubes, Real Programmers wrote in machine code.
        Not FORTRAN. Not RATFOR. Not, even, assembly
        language. Machine Code. Raw, unadorned,
        inscrutable hexadecimal numbers. Directly.

    Copyright (C) 2008, http://www.dabeaz.com 1-16
  • 17. Consuming Iterables

    • Many functions consume an "iterable" object
    • Reductions: sum(s), min(s), max(s)
    • Constructors: list(s), tuple(s), set(s), dict(s)
    • in operator: item in s
    • Many others in the library

    Copyright (C) 2008, http://www.dabeaz.com 1-17
  • 18. Iteration Protocol

    • The reason why you can iterate over different objects is that there is a specific protocol

        >>> items = [1, 4, 5]
        >>> it = iter(items)
        >>> it.next()
        1
        >>> it.next()
        4
        >>> it.next()
        5
        >>> it.next()
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        StopIteration
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-18
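    [Editor's note: the slides use Python 2, where the iterator method is spelled it.next(). A small sketch of the same protocol in Python 3 (an adaptation, not from the slides): the method is __next__() and you call it through the next() builtin.]

    ```python
    # The iteration protocol, Python 3 style: iter() asks an object for
    # an iterator, next() pulls items until StopIteration is raised.
    items = [1, 4, 5]
    it = iter(items)
    print(next(it))   # 1
    print(next(it))   # 4
    print(next(it))   # 5
    try:
        next(it)      # exhausted: raises StopIteration
    except StopIteration:
        print("done")
    ```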
  • 19. Iteration Protocol

    • An inside look at the for statement

        for x in obj:
            # statements

    • Underneath the covers

        _iter = iter(obj)           # Get iterator object
        while 1:
            try:
                x = _iter.next()    # Get next item
            except StopIteration:   # No more items
                break
            # statements
            ...

    • Any object that supports iter() and next() is said to be "iterable."

    Copyright (C) 2008, http://www.dabeaz.com 1-19
  • 20. Supporting Iteration

    • User-defined objects can support iteration
    • Example: Counting down...

        >>> for x in countdown(10):
        ...     print x,
        ...
        10 9 8 7 6 5 4 3 2 1
        >>>

    • To do this, you just have to make the object implement __iter__() and next()

    Copyright (C) 2008, http://www.dabeaz.com 1-20
  • 21. Supporting Iteration

    • Sample implementation

        class countdown(object):
            def __init__(self,start):
                self.count = start
            def __iter__(self):
                return self
            def next(self):
                if self.count <= 0:
                    raise StopIteration
                r = self.count
                self.count -= 1
                return r

    Copyright (C) 2008, http://www.dabeaz.com 1-21
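    [Editor's note: the class above is Python 2. In Python 3 the only change needed is renaming next() to __next__(); a sketch of that adaptation (an assumption, not from the slides):]

    ```python
    # Python 3 version of the slide's countdown iterator class.
    # Only the method name differs: __next__ instead of next.
    class countdown:
        def __init__(self, start):
            self.count = start
        def __iter__(self):
            return self          # an iterator is its own iterable
        def __next__(self):
            if self.count <= 0:
                raise StopIteration
            r = self.count
            self.count -= 1
            return r

    print(list(countdown(5)))    # [5, 4, 3, 2, 1]
    ```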
  • 22. Iteration Example

    • Example use:

        >>> c = countdown(5)
        >>> for i in c:
        ...     print i,
        ...
        5 4 3 2 1
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-22
  • 23. Iteration Commentary • There are many subtle details involving the design of iterators for various objects • However, we're not going to cover that • This isn't a tutorial on "iterators" • We're talking about generators... Copyright (C) 2008, http://www.dabeaz.com 1-23
  • 24. Generators

    • A generator is a function that produces a sequence of results instead of a single value

        def countdown(n):
            while n > 0:
                yield n
                n -= 1

        >>> for i in countdown(5):
        ...     print i,
        ...
        5 4 3 2 1
        >>>

    • Instead of returning a value, you generate a series of values (using the yield statement)

    Copyright (C) 2008, http://www.dabeaz.com 1-24
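    [Editor's note: the generator function itself is identical in Python 3; only the print syntax changes. A runnable sketch (a Python 3 adaptation, not from the slides) showing that calling the function merely creates a generator and that iteration drives it:]

    ```python
    # Same countdown generator; list() consumes it the way the
    # for loop on the slide does.
    def countdown(n):
        while n > 0:
            yield n
            n -= 1

    g = countdown(5)
    print(g)           # a generator object; nothing has run yet
    print(list(g))     # [5, 4, 3, 2, 1]
    print(list(g))     # [] -- a generator is a one-time operation
    ```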
  • 25. Generators

    • Behavior is quite different than a normal function
    • Calling a generator function creates a generator object. However, it does not start running the function.

        def countdown(n):
            print "Counting down from", n
            while n > 0:
                yield n
                n -= 1

        >>> x = countdown(10)      # Notice that no output was produced
        >>> x
        <generator object at 0x58490>
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-25
  • 26. Generator Functions

    • The function only executes on next()

        >>> x = countdown(10)
        >>> x
        <generator object at 0x58490>
        >>> x.next()               # Function starts executing here
        Counting down from 10
        10
        >>>

    • yield produces a value, but suspends the function
    • Function resumes on next call to next()

        >>> x.next()
        9
        >>> x.next()
        8
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-26
  • 27. Generator Functions

    • When the generator returns, iteration stops

        >>> x.next()
        1
        >>> x.next()
        Traceback (most recent call last):
          File "<stdin>", line 1, in ?
        StopIteration
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-27
  • 28. Generator Functions • A generator function is mainly a more convenient way of writing an iterator • You don't have to worry about the iterator protocol (.next, .__iter__, etc.) • It just works Copyright (C) 2008, http://www.dabeaz.com 1-28
  • 29. Generators vs. Iterators • A generator function is slightly different than an object that supports iteration • A generator is a one-time operation. You can iterate over the generated data once, but if you want to do it again, you have to call the generator function again. • This is different than a list (which you can iterate over as many times as you want) Copyright (C) 2008, http://www.dabeaz.com 1-29
  • 30. Generator Expressions

    • A generated version of a list comprehension

        >>> a = [1,2,3,4]
        >>> b = (2*x for x in a)
        >>> b
        <generator object at 0x58760>
        >>> for i in b: print i,
        ...
        2 4 6 8
        >>>

    • This loops over a sequence of items and applies an operation to each item
    • However, results are produced one at a time using a generator

    Copyright (C) 2008, http://www.dabeaz.com 1-30
  • 31. Generator Expressions

    • Important differences from a list comp.
    • Does not construct a list.
    • Only useful purpose is iteration
    • Once consumed, can't be reused
    • Example:

        >>> a = [1,2,3,4]
        >>> b = [2*x for x in a]
        >>> b
        [2, 4, 6, 8]
        >>> c = (2*x for x in a)
        >>> c
        <generator object at 0x58760>
        >>>

    Copyright (C) 2008, http://www.dabeaz.com 1-31
  • 32. Generator Expressions

    • General syntax

        (expression for i in s if condition)

    • What it means

        for i in s:
            if condition:
                yield expression

    Copyright (C) 2008, http://www.dabeaz.com 1-32
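    [Editor's note: the equivalence on this slide can be checked directly. A small Python 3 sketch (the sample data is made up for illustration) comparing the expression form with the expanded generator function:]

    ```python
    # A generator expression with a filter produces the same stream
    # as the expanded generator-function form on the slide.
    s = [1, 2, 3, 4, 5, 6]

    genexp = (x * x for x in s if x % 2 == 0)

    def expanded(s):
        for x in s:
            if x % 2 == 0:
                yield x * x

    print(list(genexp))        # [4, 16, 36]
    print(list(expanded(s)))   # [4, 16, 36]
    ```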
  • 33. A Note on Syntax

    • The parens on a generator expression can be dropped if used as a single function argument
    • Example:

        sum(x*x for x in s)     # x*x for x in s is the generator expression

    Copyright (C) 2008, http://www.dabeaz.com 1-33
  • 34. Interlude

    • We now have two basic building blocks
    • Generator functions:

        def countdown(n):
            while n > 0:
                yield n
                n -= 1

    • Generator expressions:

        squares = (x*x for x in s)

    • In both cases, we get an object that generates values (which are typically consumed in a for loop)

    Copyright (C) 2008, http://www.dabeaz.com 1-34
  • 35. Part 2 Processing Data Files (Show me your Web Server Logs) Copyright (C) 2008, http://www.dabeaz.com 1- 35
  • 36. Programming Problem

    Find out how many bytes of data were transferred by summing up the last column of data in this Apache web server log

        81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587
        81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133
        81.107.39.38 - ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
        81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
        81.107.39.38 - ... "GET /ply/example.html HTTP/1.1" 200 2359
        66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447

    Oh yeah, and the log file might be huge (Gbytes)

    Copyright (C) 2008, http://www.dabeaz.com 1-36
  • 37. The Log File

    • Each line of the log looks like this:

        81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238

    • The number of bytes is the last column

        bytestr = line.rsplit(None,1)[1]

    • It's either a number or a missing value (-)

        81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -

    • Converting the value

        if bytestr != '-':
            bytes = int(bytestr)

    Copyright (C) 2008, http://www.dabeaz.com 1-37
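    [Editor's note: the column extraction above can be exercised on a couple of sample lines. A Python 3 sketch (the helper name field_bytes and the choice to map '-' to 0 are assumptions for illustration; the slides simply skip '-' entries):]

    ```python
    # rsplit(None, 1) splits once from the right on any whitespace,
    # so [1] is always the last field, regardless of what's in the URL.
    line1 = '81.107.39.38 - "GET /ply/ply.html HTTP/1.1" 200 97238\n'
    line2 = '81.107.39.38 - "GET /ply/ HTTP/1.1" 304 -\n'

    def field_bytes(line):
        bytestr = line.rsplit(None, 1)[1]
        return int(bytestr) if bytestr != '-' else 0

    print(field_bytes(line1))   # 97238
    print(field_bytes(line2))   # 0
    ```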
  • 38. A Non-Generator Soln

    • Just do a simple for-loop

        wwwlog = open("access-log")
        total = 0
        for line in wwwlog:
            bytestr = line.rsplit(None,1)[1]
            if bytestr != '-':
                total += int(bytestr)
        print "Total", total

    • We read line-by-line and just update a sum
    • However, that's so 90s...

    Copyright (C) 2008, http://www.dabeaz.com 1-38
  • 39. A Generator Solution

    • Let's use some generator expressions

        wwwlog = open("access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)

    • Whoa! That's different!
    • Less code
    • A completely different programming style

    Copyright (C) 2008, http://www.dabeaz.com 1-39
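    [Editor's note: the same pipeline can be run self-contained. A Python 3 sketch with io.StringIO standing in for the log file (the three log lines are made up; bytes_ avoids shadowing the bytes builtin, a Python 3 concern the 2.x slides don't have):]

    ```python
    import io

    # Fake "access-log": two lines with byte counts, one with '-'.
    wwwlog = io.StringIO(
        '81.107.39.38 - "GET /ply/ HTTP/1.1" 200 7587\n'
        '81.107.39.38 - "GET /ply/ HTTP/1.1" 304 -\n'
        '66.249.72.134 - "GET /index.html HTTP/1.1" 200 4447\n'
    )
    bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)
    bytes_ = (int(x) for x in bytecolumn if x != '-')
    print("Total", sum(bytes_))   # Total 12034
    ```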
  • 40. Generators as a Pipeline

    • To understand the solution, think of it as a data processing pipeline

        access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total

    • Each step is defined by iteration/generation

        wwwlog = open("access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)

    Copyright (C) 2008, http://www.dabeaz.com 1-40
  • 41. Being Declarative

    • At each step of the pipeline, we declare an operation that will be applied to the entire input stream

        access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total

        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)

    • This operation gets applied to every line of the log file

    Copyright (C) 2008, http://www.dabeaz.com 1-41
  • 42. Being Declarative • Instead of focusing on the problem at a line-by-line level, you just break it down into big operations that operate on the whole file • This is very much a "declarative" style • The key : Think big... Copyright (C) 2008, http://www.dabeaz.com 1-42
  • 43. Iteration is the Glue • The glue that holds the pipeline together is the iteration that occurs in each step wwwlog = open("access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) • The calculation is being driven by the last step • The sum() function is consuming values being pulled through the pipeline (via .next() calls) Copyright (C) 2008, http://www.dabeaz.com 1-43
  • 44. Performance • Surely, this generator approach has all sorts of fancy-dancy magic that is slow. • Let's check it out on a 1.3Gb log file... % ls -l big-access-log -rw-r--r-- beazley 1303238000 Feb 29 08:06 big-access-log Copyright (C) 2008, http://www.dabeaz.com 1-44
• 45. Performance Contest wwwlog = open("big-access-log") total = 0 for line in wwwlog: bytestr = line.rsplit(None,1)[1] if bytestr != '-': total += int(bytestr) print "Total", total Time 27.20 wwwlog = open("big-access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) Time 25.96 Copyright (C) 2008, http://www.dabeaz.com 1-45
• 46. Commentary • Not only was it not slow, it was 5% faster • And it was less code • And it was relatively easy to read • And frankly, I like it a whole lot better... "Back in the old days, we used AWK for this and we liked it. Oh, yeah, and get off my lawn!" Copyright (C) 2008, http://www.dabeaz.com 1-46
• 47. Performance Contest wwwlog = open("access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) Time 25.96 % awk '{ total += $NF } END { print total }' big-access-log Time 37.33 Note: extracting the last column might not be awk's strong point Copyright (C) 2008, http://www.dabeaz.com 1-47
  • 48. Food for Thought • At no point in our generator solution did we ever create large temporary lists • Thus, not only is that solution faster, it can be applied to enormous data files • It's competitive with traditional tools Copyright (C) 2008, http://www.dabeaz.com 1-48
  • 49. More Thoughts • The generator solution was based on the concept of pipelining data between different components • What if you had more advanced kinds of components to work with? • Perhaps you could perform different kinds of processing by just plugging various pipeline components together Copyright (C) 2008, http://www.dabeaz.com 1-49
  • 50. This Sounds Familiar • The Unix philosophy • Have a collection of useful system utils • Can hook these up to files or each other • Perform complex tasks by piping data Copyright (C) 2008, http://www.dabeaz.com 1-50
  • 51. Part 3 Fun with Files and Directories Copyright (C) 2008, http://www.dabeaz.com 1- 51
• 52. Programming Problem You have hundreds of web server logs scattered across various directories. In addition, some of the logs are compressed. Modify the last program so that you can easily read all of these logs foo/ access-log-012007.gz access-log-022007.gz access-log-032007.gz ... access-log-012008 bar/ access-log-092007.bz2 ... access-log-022008 Copyright (C) 2008, http://www.dabeaz.com 1-52
  • 53. os.walk() • A very useful function for searching the file system import os for path, dirlist, filelist in os.walk(topdir): # path : Current directory # dirlist : List of subdirectories # filelist : List of files ... • This utilizes generators to recursively walk through the file system Copyright (C) 2008, http://www.dabeaz.com 1-53
  • 54. find • Generate all filenames in a directory tree that match a given filename pattern import os import fnmatch def gen_find(filepat,top): for path, dirlist, filelist in os.walk(top): for name in fnmatch.filter(filelist,filepat): yield os.path.join(path,name) • Examples pyfiles = gen_find("*.py","/") logs = gen_find("access-log*","/usr/www/") Copyright (C) 2008, http://www.dabeaz.com 1-54
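The function can be exercised end-to-end against a throwaway directory tree built with tempfile; the file names below are invented to mimic the slide's layout:

```python
import fnmatch
import os
import tempfile

# gen_find as on the slide: lazily yield matching paths under a top dir.
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Build a tiny tree: top/access-log-012008, top/notes.txt, top/foo/access-log-022008
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "foo"))
for name in ("access-log-012008", "notes.txt", os.path.join("foo", "access-log-022008")):
    open(os.path.join(top, name), "w").close()

found = sorted(os.path.basename(p) for p in gen_find("access-log*", top))
```

Only the two log files match; notes.txt is filtered out by the pattern.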
• 55. Performance Contest pyfiles = gen_find("*.py","/") for name in pyfiles: print name Wall Clock Time 559s % find / -name '*.py' Wall Clock Time 468s Performed on a 750GB file system containing about 140000 .py files Copyright (C) 2008, http://www.dabeaz.com 1-55
  • 56. A File Opener • Open a sequence of filenames import gzip, bz2 def gen_open(filenames): for name in filenames: if name.endswith(".gz"): yield gzip.open(name) elif name.endswith(".bz2"): yield bz2.BZ2File(name) else: yield open(name) • This is interesting.... it takes a sequence of filenames as input and yields a sequence of open file objects Copyright (C) 2008, http://www.dabeaz.com 1-56
• 57. cat • Concatenate items from one or more sources into a single sequence of items def gen_cat(sources): for s in sources: for item in s: yield item • Example: lognames = gen_find("access-log*", "/usr/www") logfiles = gen_open(lognames) loglines = gen_cat(logfiles) Copyright (C) 2008, http://www.dabeaz.com 1-57
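A quick sanity check of gen_cat on plain in-memory iterables (there is nothing file-specific about it):

```python
# gen_cat chains several iterables into one lazy stream.
def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item

chained = list(gen_cat([[1, 2], (3,), range(4, 6)]))
print(chained)  # [1, 2, 3, 4, 5]
```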
  • 58. grep • Generate a sequence of lines that contain a given regular expression import re def gen_grep(pat, lines): patc = re.compile(pat) for line in lines: if patc.search(line): yield line • Example: lognames = gen_find("access-log*", "/usr/www") logfiles = gen_open(lognames) loglines = gen_cat(logfiles) patlines = gen_grep(pat, loglines) Copyright (C) 2008, http://www.dabeaz.com 1-58
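Run over a few invented lines, gen_grep behaves like a lazy grep:

```python
import re

# gen_grep yields only the lines whose text matches the regex.
def gen_grep(pat, lines):
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line

lines = ['GET /ply/ply.html', 'GET /index.html', 'POST /ply/upload']
hits = list(gen_grep(r'/ply/', lines))
print(hits)  # ['GET /ply/ply.html', 'POST /ply/upload']
```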
  • 59. Example • Find out how many bytes transferred for a specific pattern in a whole directory of logs pat = r"somepattern" logdir = "/some/dir/" filenames = gen_find("access-log*",logdir) logfiles = gen_open(filenames) loglines = gen_cat(logfiles) patlines = gen_grep(pat,loglines) bytecolumn = (line.rsplit(None,1)[1] for line in patlines) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) Copyright (C) 2008, http://www.dabeaz.com 1-59
  • 60. Important Concept • Generators decouple iteration from the code that uses the results of the iteration • In the last example, we're performing a calculation on a sequence of lines • It doesn't matter where or how those lines are generated • Thus, we can plug any number of components together up front as long as they eventually produce a line sequence Copyright (C) 2008, http://www.dabeaz.com 1-60
  • 61. Part 4 Parsing and Processing Data Copyright (C) 2008, http://www.dabeaz.com 1- 61
  • 62. Programming Problem Web server logs consist of different columns of data. Parse each line into a useful data structure that allows us to easily inspect the different fields. 81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET ..." 200 7587 host referrer user [datetime] "request" status bytes Copyright (C) 2008, http://www.dabeaz.com 1-62
• 63. Parsing with Regex • Let's route the lines through a regex parser logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] ' r'"(\S+) (\S+) (\S+)" (\S+) (\S+)' logpat = re.compile(logpats) groups = (logpat.match(line) for line in loglines) tuples = (g.groups() for g in groups if g) • This generates a sequence of tuples ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600', 'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238') Copyright (C) 2008, http://www.dabeaz.com 1-63
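A runnable check of the pattern against one sample line from the slides; note the backslashes in `\S` and `\[`, which are easy to lose when copying slide text:

```python
import re

# The slide's Apache log pattern, verified against one sample line.
logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] ' \
          r'"(\S+) (\S+) (\S+)" (\S+) (\S+)'
logpat = re.compile(logpats)

line = ('71.201.176.194 - - [26/Feb/2008:10:30:08 -0600] '
        '"GET /ply/ply.html HTTP/1.1" 200 97238')
g = logpat.match(line)
print(g.groups())
```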
  • 64. Tuple Commentary • I generally don't like data processing on tuples ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600', 'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238') • First, they are immutable--so you can't modify • Second, to extract specific fields, you have to remember the column number--which is annoying if there are a lot of columns • Third, existing code breaks if you change the number of fields Copyright (C) 2008, http://www.dabeaz.com 1-64
  • 65. Tuples to Dictionaries • Let's turn tuples into dictionaries colnames = ('host','referrer','user','datetime', 'method','request','proto','status','bytes') log = (dict(zip(colnames,t)) for t in tuples) • This generates a sequence of named fields { 'status' : '200', 'proto' : 'HTTP/1.1', 'referrer': '-', 'request' : '/ply/ply.html', 'bytes' : '97238', 'datetime': '24/Feb/2008:00:08:59 -0600', 'host' : '140.180.132.213', 'user' : '-', 'method' : 'GET'} Copyright (C) 2008, http://www.dabeaz.com 1-65
  • 66. Field Conversion • You might want to map specific dictionary fields through a conversion function (e.g., int(), float()) def field_map(dictseq,name,func): for d in dictseq: d[name] = func(d[name]) yield d • Example: Convert a few field values log = field_map(log,"status", int) log = field_map(log,"bytes", lambda s: int(s) if s !='-' else 0) Copyright (C) 2008, http://www.dabeaz.com 1-66
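Here is field_map applied to two invented records, using the same two conversions the slide shows:

```python
# field_map: apply a conversion function to one field of each dict,
# passing the (mutated) dicts along lazily.
def field_map(dictseq, name, func):
    for d in dictseq:
        d[name] = func(d[name])
        yield d

records = [{'status': '200', 'bytes': '97238'},
           {'status': '304', 'bytes': '-'}]
log = field_map(records, 'status', int)
log = field_map(log, 'bytes', lambda s: int(s) if s != '-' else 0)
converted = list(log)   # drives both stages
```

The two field_map calls stack just like any other pipeline stage; nothing runs until list() consumes the result.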
• 67. Field Conversion • Creates dictionaries of converted values (note the converted status and bytes fields) { 'status': 200, 'proto': 'HTTP/1.1', 'referrer': '-', 'request': '/ply/ply.html', 'datetime': '24/Feb/2008:00:08:59 -0600', 'bytes': 97238, 'host': '140.180.132.213', 'user': '-', 'method': 'GET'} • Again, this is just one big processing pipeline Copyright (C) 2008, http://www.dabeaz.com 1-67
  • 68. The Code So Far lognames = gen_find("access-log*","www") logfiles = gen_open(lognames) loglines = gen_cat(logfiles) groups = (logpat.match(line) for line in loglines) tuples = (g.groups() for g in groups if g) colnames = ('host','referrer','user','datetime','method', 'request','proto','status','bytes') log = (dict(zip(colnames,t)) for t in tuples) log = field_map(log,"bytes", lambda s: int(s) if s != '-' else 0) log = field_map(log,"status",int) Copyright (C) 2008, http://www.dabeaz.com 1-68
  • 69. Getting Organized • As a processing pipeline grows, certain parts of it may be useful components on their own generate lines Parse a sequence of lines from from a set of files Apache server logs into a in a directory sequence of dictionaries • A series of pipeline stages can be easily encapsulated by a normal Python function Copyright (C) 2008, http://www.dabeaz.com 1-69
  • 70. Packaging • Example : multiple pipeline stages inside a function def lines_from_dir(filepat, dirname): names = gen_find(filepat,dirname) files = gen_open(names) lines = gen_cat(files) return lines • This is now a general purpose component that can be used as a single element in other pipelines Copyright (C) 2008, http://www.dabeaz.com 1-70
  • 71. Packaging • Example : Parse an Apache log into dicts def apache_log(lines): groups = (logpat.match(line) for line in lines) tuples = (g.groups() for g in groups if g) colnames = ('host','referrer','user','datetime','method', 'request','proto','status','bytes') log = (dict(zip(colnames,t)) for t in tuples) log = field_map(log,"bytes", lambda s: int(s) if s != '-' else 0) log = field_map(log,"status",int) return log Copyright (C) 2008, http://www.dabeaz.com 1-71
  • 72. Example Use • It's easy lines = lines_from_dir("access-log*","www") log = apache_log(lines) for r in log: print r • Different components have been subdivided according to the data that they process Copyright (C) 2008, http://www.dabeaz.com 1-72
  • 73. Food for Thought • When creating pipeline components, it's critical to focus on the inputs and outputs • You will get the most flexibility when you use a standard set of datatypes • Is it simpler to have a bunch of components that all operate on dictionaries or to have components that require inputs/outputs to be different kinds of user-defined instances? Copyright (C) 2008, http://www.dabeaz.com 1-73
  • 74. A Query Language • Now that we have our log, let's do some queries • Find the set of all documents that 404 stat404 = set(r['request'] for r in log if r['status'] == 404) • Print all requests that transfer over a megabyte large = (r for r in log if r['bytes'] > 1000000) for r in large: print r['request'], r['bytes'] Copyright (C) 2008, http://www.dabeaz.com 1-74
  • 75. A Query Language • Find the largest data transfer print "%d %s" % max((r['bytes'],r['request']) for r in log) • Collect all unique host IP addresses hosts = set(r['host'] for r in log) • Find the number of downloads of a file sum(1 for r in log if r['request'] == '/ply/ply-2.3.tar.gz') Copyright (C) 2008, http://www.dabeaz.com 1-75
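These queries need nothing but a sequence of dictionaries, so they can be sketched against a tiny invented record set:

```python
# The query idioms from the slides, run over made-up records.
log = [
    {'request': '/ply/ply.html', 'status': 200, 'bytes': 97238, 'host': 'a'},
    {'request': '/missing', 'status': 404, 'bytes': 0, 'host': 'b'},
    {'request': '/big.tar.gz', 'status': 200, 'bytes': 2000000, 'host': 'a'},
]

stat404 = set(r['request'] for r in log if r['status'] == 404)
hosts = set(r['host'] for r in log)
largest = max((r['bytes'], r['request']) for r in log)
```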
  • 76. A Query Language • Find out who has been hitting robots.txt addrs = set(r['host'] for r in log if 'robots.txt' in r['request']) import socket for addr in addrs: try: print socket.gethostbyaddr(addr)[0] except socket.herror: print addr Copyright (C) 2008, http://www.dabeaz.com 1-76
  • 77. Performance Study • Sadly, the last example doesn't run so fast on a huge input file (53 minutes on the 1.3GB log) • But, the beauty of generators is that you can plug filters in at almost any stage lines = lines_from_dir("big-access-log",".") lines = (line for line in lines if 'robots.txt' in line) log = apache_log(lines) addrs = set(r['host'] for r in log) ... • That version takes 93 seconds Copyright (C) 2008, http://www.dabeaz.com 1-77
• 78. Some Thoughts • I like the idea of using generator expressions as a pipeline query language • You can write simple filters, extract data, etc. • If you pass dictionaries/objects through the pipeline, it becomes quite powerful • Feels similar to writing SQL queries Copyright (C) 2008, http://www.dabeaz.com 1-78
  • 79. Part 5 Processing Infinite Data Copyright (C) 2008, http://www.dabeaz.com 1- 79
  • 80. Question • Have you ever used 'tail -f' in Unix? % tail -f logfile ... ... lines of output ... ... • This prints the lines written to the end of a file • The "standard" way to watch a log file • I used this all of the time when working on scientific simulations ten years ago... Copyright (C) 2008, http://www.dabeaz.com 1-80
  • 81. Infinite Sequences • Tailing a log file results in an "infinite" stream • It constantly watches the file and yields lines as soon as new data is written • But you don't know how much data will actually be written (in advance) • And log files can often be enormous Copyright (C) 2008, http://www.dabeaz.com 1-81
  • 82. Tailing a File • A Python version of 'tail -f' import time def follow(thefile): thefile.seek(0,2) # Go to the end of the file while True: line = thefile.readline() if not line: time.sleep(0.1) # Sleep briefly continue yield line • Idea : Seek to the end of the file and repeatedly try to read new lines. If new data is written to the file, we'll pick it up. Copyright (C) 2008, http://www.dabeaz.com 1-82
  • 83. Example • Using our follow function logfile = open("access-log") loglines = follow(logfile) for line in loglines: print line, • This produces the same output as 'tail -f' Copyright (C) 2008, http://www.dabeaz.com 1-83
  • 84. Example • Turn the real-time log file into records logfile = open("access-log") loglines = follow(logfile) log = apache_log(loglines) • Print out all 404 requests as they happen r404 = (r for r in log if r['status'] == 404) for r in r404: print r['host'],r['datetime'],r['request'] Copyright (C) 2008, http://www.dabeaz.com 1-84
• 85. Commentary • We just plugged this new input scheme onto the front of our processing pipeline • Everything else still works, with one caveat: functions that consume an entire iterable won't terminate (min, max, sum, set, etc.) • Nevertheless, we can easily write processing steps that operate on an infinite data stream Copyright (C) 2008, http://www.dabeaz.com 1-85
  • 86. Part 6 Feeding the Pipeline Copyright (C) 2008, http://www.dabeaz.com 1- 86
  • 87. Feeding Generators • In order to feed a generator processing pipeline, you need to have an input source • So far, we have looked at two file-based inputs • Reading a file lines = open(filename) • Tailing a file lines = follow(open(filename)) Copyright (C) 2008, http://www.dabeaz.com 1-87
  • 88. A Thought • There is no rule that says you have to generate pipeline data from a file. • Or that the input data has to be a string • Or that it has to be turned into a dictionary • Remember: All Python objects are "first-class" • Which means that all objects are fair-game for use in a generator pipeline Copyright (C) 2008, http://www.dabeaz.com 1-88
• 89. Generating Connections • Generate a sequence of TCP connections import socket def receive_connections(addr): s = socket.socket(socket.AF_INET,socket.SOCK_STREAM) s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1) s.bind(addr) s.listen(5) while True: client = s.accept() yield client • Example: for c,a in receive_connections(("",9000)): c.send("Hello World\n") c.close() Copyright (C) 2008, http://www.dabeaz.com 1-89
  • 90. Generating Messages • Receive a sequence of UDP messages import socket def receive_messages(addr,maxsize): s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM) s.bind(addr) while True: msg = s.recvfrom(maxsize) yield msg • Example: for msg, addr in receive_messages(("",10000),1024): print msg, "from", addr Copyright (C) 2008, http://www.dabeaz.com 1-90
  • 91. Part 7 Extending the Pipeline Copyright (C) 2008, http://www.dabeaz.com 1- 91
  • 92. Multiple Processes • Can you extend a processing pipeline across processes and machines? process 2 socket pipe process 1 Copyright (C) 2008, http://www.dabeaz.com 1-92
  • 93. Pickler/Unpickler • Turn a generated sequence into pickled objects def gen_pickle(source, protocol=pickle.HIGHEST_PROTOCOL): for item in source: yield pickle.dumps(item, protocol) def gen_unpickle(infile): while True: try: item = pickle.load(infile) yield item except EOFError: return • Now, attach these to a pipe or socket Copyright (C) 2008, http://www.dabeaz.com 1-93
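The pair can be tested without any network at all by round-tripping through an in-memory io.BytesIO buffer standing in for a socket or pipe:

```python
import io
import pickle

# Serialize a stream of objects...
def gen_pickle(source, protocol=pickle.HIGHEST_PROTOCOL):
    for item in source:
        yield pickle.dumps(item, protocol)

# ...and lazily deserialize them back from a file-like object.
def gen_unpickle(infile):
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return

buf = io.BytesIO()
for chunk in gen_pickle([{'status': 200}, {'status': 404}]):
    buf.write(chunk)
buf.seek(0)
items = list(gen_unpickle(buf))
```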
  • 94. Sender/Receiver • Example: Sender def sendto(source,addr): s = socket.socket(socket.AF_INET,socket.SOCK_STREAM) s.connect(addr) for pitem in gen_pickle(source): s.sendall(pitem) s.close() • Example: Receiver def receivefrom(addr): s = socket.socket(socket.AF_INET,socket.SOCK_STREAM) s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1) s.bind(addr) s.listen(5) c,a = s.accept() for item in gen_unpickle(c.makefile()): yield item c.close() Copyright (C) 2008, http://www.dabeaz.com 1-94
  • 95. Example Use • Example: Read log lines and parse into records # netprod.py lines = follow(open("access-log")) log = apache_log(lines) sendto(log,("",15000)) • Example: Pick up the log on another machine # netcons.py for r in receivefrom(("",15000)): print r Copyright (C) 2008, http://www.dabeaz.com 1-95
  • 96. Generators and Threads • Processing pipelines sometimes come up in the context of thread programming • Producer/consumer problems Thread 1 Thread 2 Producer Consumer • Question: Can generator pipelines be integrated with thread programming? Copyright (C) 2008, http://www.dabeaz.com 1-96
  • 97. Multiple Threads • For example, can a generator pipeline span multiple threads? Thread 1 Thread 2 ? • Yes, if you connect them with a Queue object Copyright (C) 2008, http://www.dabeaz.com 1-97
  • 98. Generators and Queues • Feed a generated sequence into a queue # genqueue.py def sendto_queue(source, thequeue): for item in source: thequeue.put(item) thequeue.put(StopIteration) • Generate items received on a queue def genfrom_queue(thequeue): while True: item = thequeue.get() if item is StopIteration: break yield item • Note: Using StopIteration as a sentinel Copyright (C) 2008, http://www.dabeaz.com 1-98
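A minimal end-to-end check of the queue bridge, in Python 3 (where the module is named queue rather than Queue):

```python
import queue
import threading

# Producer side: push items, then the StopIteration sentinel.
def sendto_queue(source, thequeue):
    for item in source:
        thequeue.put(item)
    thequeue.put(StopIteration)

# Consumer side: pull items back out until the sentinel appears.
def genfrom_queue(thequeue):
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item

q = queue.Queue()
t = threading.Thread(target=sendto_queue, args=(range(3), q))
t.start()
received = list(genfrom_queue(q))
t.join()
```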
  • 99. Thread Example • Here is a consumer function # A consumer. Prints out 404 records. def print_r404(log_q): log = genfrom_queue(log_q) r404 = (r for r in log if r['status'] == 404) for r in r404: print r['host'],r['datetime'],r['request'] • This function will be launched in its own thread • Using a Queue object as the input source Copyright (C) 2008, http://www.dabeaz.com 1-99
  • 100. Thread Example • Launching the consumer import threading, Queue log_q = Queue.Queue() r404_thr = threading.Thread(target=print_r404, args=(log_q,)) r404_thr.start() • Code that feeds the consumer lines = follow(open("access-log")) log = apache_log(lines) sendto_queue(log,log_q) Copyright (C) 2008, http://www.dabeaz.com 1- 100
  • 101. Part 8 Advanced Data Routing Copyright (C) 2008, http://www.dabeaz.com 1-101
  • 102. The Story So Far • You can use generators to set up pipelines • You can extend the pipeline over the network • You can extend it between threads • However, it's still just a pipeline (there is one input and one output). • Can you do more than that? Copyright (C) 2008, http://www.dabeaz.com 1- 102
  • 103. Multiple Sources • Can a processing pipeline be fed by multiple sources---for example, multiple generators? source1 source2 source3 for item in sources: # Process item Copyright (C) 2008, http://www.dabeaz.com 1- 103
  • 104. Concatenation • Concatenate one source after another (reprise) def gen_cat(sources): for s in sources: for item in s: yield item • This generates one big sequence • Consumes each generator one at a time • But only works if generators terminate • So, you wouldn't use this for real-time streams Copyright (C) 2008, http://www.dabeaz.com 1- 104
  • 105. Parallel Iteration • Zipping multiple generators together import itertools z = itertools.izip(s1,s2,s3) • This one is only marginally useful • Requires generators to go lock-step • Terminates when any input ends Copyright (C) 2008, http://www.dabeaz.com 1- 105
  • 106. Multiplexing • Feed a pipeline from multiple generators in real-time--producing values as they arrive • Example use log1 = follow(open("foo/access-log")) log2 = follow(open("bar/access-log")) lines = multiplex([log1,log2]) • There is no way to poll a generator • And only one for-loop executes at a time Copyright (C) 2008, http://www.dabeaz.com 1- 106
  • 107. Multiplexing • You can multiplex if you use threads and you use the tools we've developed so far • Idea : source1 source2 source3 queue for item in queue: # Process item Copyright (C) 2008, http://www.dabeaz.com 1- 107
  • 108. Multiplexing # genmultiplex.py import threading, Queue from genqueue import * from gencat import * def multiplex(sources): in_q = Queue.Queue() consumers = [] for src in sources: thr = threading.Thread(target=sendto_queue, args=(src,in_q)) thr.start() consumers.append(genfrom_queue(in_q)) return gen_cat(consumers) • Note: This is the trickiest example so far... Copyright (C) 2008, http://www.dabeaz.com 1- 108
  • 109. Multiplexing • Each input source is wrapped by a thread which runs the generator and dumps the items into a shared queue source1 source2 source3 sendto_queue sendto_queue sendto_queue in_q queue Copyright (C) 2008, http://www.dabeaz.com 1- 109
• 110. Multiplexing • For each source, we create a consumer of queue data in_q queue consumers = [genfrom_queue, genfrom_queue, genfrom_queue ] • Now, just concatenate the consumers together gen_cat(consumers) • Each time a producer terminates, we move to the next consumer (until there are no more) Copyright (C) 2008, http://www.dabeaz.com 1- 110
  • 111. Broadcasting • Can you broadcast to multiple consumers? generator consumer1 consumer2 consumer3 Copyright (C) 2008, http://www.dabeaz.com 1- 111
  • 112. Broadcasting • Consume a generator and send to consumers def broadcast(source, consumers): for item in source: for c in consumers: c.send(item) • It works, but now the control-flow is unusual • The broadcast loop is what runs the program • Consumers run by having items sent to them Copyright (C) 2008, http://www.dabeaz.com 1- 112
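A sketch using a hypothetical Collector consumer that simply records what it is sent:

```python
# broadcast pushes each generated item to every consumer's send() method.
def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)

class Collector:
    # Invented consumer for illustration: remembers everything sent to it.
    def __init__(self):
        self.items = []
    def send(self, item):
        self.items.append(item)

c1, c2 = Collector(), Collector()
broadcast(iter([1, 2, 3]), [c1, c2])
```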
  • 113. Consumers • To create a consumer, define an object with a send() method on it class Consumer(object): def send(self,item): print self, "got", item • Example: c1 = Consumer() c2 = Consumer() c3 = Consumer() lines = follow(open("access-log")) broadcast(lines,[c1,c2,c3]) Copyright (C) 2008, http://www.dabeaz.com 1- 113
  • 114. Network Consumer • Example: import socket,pickle class NetConsumer(object): def __init__(self,addr): self.s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) self.s.connect(addr) def send(self,item): pitem = pickle.dumps(item) self.s.sendall(pitem) def close(self): self.s.close() • This will route items across the network Copyright (C) 2008, http://www.dabeaz.com 1- 114
  • 115. Network Consumer • Example Usage: class Stat404(NetConsumer): def send(self,item): if item['status'] == 404: NetConsumer.send(self,item) lines = follow(open("access-log")) log = apache_log(lines) stat404 = Stat404(("somehost",15000)) broadcast(log, [stat404]) • The 404 entries will go elsewhere... Copyright (C) 2008, http://www.dabeaz.com 1- 115
  • 116. Commentary • Once you start broadcasting, consumers can't follow the same programming model as before • Only one for-loop can run the pipeline. • However, you can feed an existing pipeline if you're willing to run it in a different thread or in a different process Copyright (C) 2008, http://www.dabeaz.com 1- 116
  • 117. Consumer Thread • Example: Routing items to a separate thread import Queue, threading from genqueue import genfrom_queue class ConsumerThread(threading.Thread): def __init__(self,target): threading.Thread.__init__(self) self.setDaemon(True) self.in_q = Queue.Queue() self.target = target def send(self,item): self.in_q.put(item) def run(self): self.target(genfrom_queue(self.in_q)) Copyright (C) 2008, http://www.dabeaz.com 1- 117
  • 118. Consumer Thread • Sample usage (building on earlier code) def find_404(log): for r in (r for r in log if r['status'] == 404): print r['status'],r['datetime'],r['request'] def bytes_transferred(log): total = 0 for r in log: total += r['bytes'] print "Total bytes", total c1 = ConsumerThread(find_404) c1.start() c2 = ConsumerThread(bytes_transferred) c2.start() lines = follow(open("access-log")) # Follow a log log = apache_log(lines) # Turn into records broadcast(log,[c1,c2]) # Broadcast to consumers Copyright (C) 2008, http://www.dabeaz.com 1- 118
  • 119. Part 9 Various Programming Tricks (And Debugging) Copyright (C) 2008, http://www.dabeaz.com 1-119
  • 120. Putting it all Together • This data processing pipeline idea is powerful • But, it's also potentially mind-boggling • Especially when you have dozens of pipeline stages, broadcasting, multiplexing, etc. • Let's look at a few useful tricks Copyright (C) 2008, http://www.dabeaz.com 1- 120
  • 121. Creating Generators • Any single-argument function is easy to turn into a generator function def generate(func): def gen_func(s): for item in s: yield func(item) return gen_func • Example: gen_sqrt = generate(math.sqrt) for x in gen_sqrt(xrange(100)): print x Copyright (C) 2008, http://www.dabeaz.com 1- 121
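For instance, in Python 3 syntax (xrange replaced by a plain list):

```python
import math

# generate() lifts any one-argument function into a generator stage.
def generate(func):
    def gen_func(s):
        for item in s:
            yield func(item)
    return gen_func

gen_sqrt = generate(math.sqrt)
roots = list(gen_sqrt([0, 1, 4, 9]))
print(roots)  # [0.0, 1.0, 2.0, 3.0]
```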
  • 122. Debug Tracing • A debugging function that will print items going through a generator def trace(source): for item in source: print item yield item • This can easily be placed around any generator lines = follow(open("access-log")) log = trace(apache_log(lines)) r404 = trace(r for r in log if r['status'] == 404) • Note: Might consider logging module for this Copyright (C) 2008, http://www.dabeaz.com 1- 122
  • 123. Recording the Last Item • Store the last item generated in the generator class storelast(object): def __init__(self,source): self.source = source def next(self): item = self.source.next() self.last = item return item def __iter__(self): return self • This can be easily wrapped around a generator lines = storelast(follow(open("access-log"))) log = apache_log(lines) for r in log: print r print lines.last Copyright (C) 2008, http://www.dabeaz.com 1- 123
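In Python 3 the iterator protocol method is __next__ rather than next; an equivalent sketch:

```python
# storelast wraps any iterator and remembers the most recent item produced.
class storelast:
    def __init__(self, source):
        self.source = iter(source)
    def __next__(self):
        item = next(self.source)
        self.last = item
        return item
    def __iter__(self):
        return self

s = storelast(['a', 'b', 'c'])
out = list(s)
print(out, s.last)  # ['a', 'b', 'c'] c
```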
  • 124. Shutting Down • Generators can be shut down using .close() import time def follow(thefile): thefile.seek(0,2) # Go to the end of the file while True: line = thefile.readline() if not line: time.sleep(0.1) # Sleep briefly continue yield line • Example: lines = follow(open("access-log")) for i,line in enumerate(lines): print line, if i == 10: lines.close() Copyright (C) 2008, http://www.dabeaz.com 1- 124
  • 125. Shutting Down • In the generator, GeneratorExit is raised import time def follow(thefile): thefile.seek(0,2) # Go to the end of the file try: while True: line = thefile.readline() if not line: time.sleep(0.1) # Sleep briefly continue yield line except GeneratorExit: print "Follow: Shutting down" • This allows for resource cleanup (if needed) Copyright (C) 2008, http://www.dabeaz.com 1- 125
  • 126. Ignoring Shutdown • Question: Can you ignore GeneratorExit? import time def follow(thefile): thefile.seek(0,2) # Go to the end of the file while True: try: line = thefile.readline() if not line: time.sleep(0.1) # Sleep briefly continue yield line except GeneratorExit: print "Forget about it" • Answer: No. You'll get a RuntimeError Copyright (C) 2008, http://www.dabeaz.com 1- 126
  • 127. Shutdown and Threads • Question : Can a thread shutdown a generator running in a different thread? lines = follow(open("foo/test.log")) def sleep_and_close(s): time.sleep(s) lines.close() threading.Thread(target=sleep_and_close,args=(30,)).start() for line in lines: print line, Copyright (C) 2008, http://www.dabeaz.com 1- 127
  • 128. Shutdown and Threads • Separate threads can not call .close() • Output: Exception in thread Thread-1: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/ lib/python2.5/threading.py", line 460, in __bootstrap self.run() File "/Library/Frameworks/Python.framework/Versions/2.5/ lib/python2.5/threading.py", line 440, in run self.__target(*self.__args, **self.__kwargs) File "genfollow.py", line 31, in sleep_and_close lines.close() ValueError: generator already executing Copyright (C) 2008, http://www.dabeaz.com 1- 128
  • 129. Shutdown and Signals • Can you shutdown a generator with a signal? import signal def sigusr1(signo,frame): print "Closing it down" lines.close() signal.signal(signal.SIGUSR1,sigusr1) lines = follow(open("access-log")) for line in lines: print line, • From the command line % kill -USR1 pid Copyright (C) 2008, http://www.dabeaz.com 1- 129
  • 130. Shutdown and Signals • This also fails: Traceback (most recent call last): File "genfollow.py", line 35, in <module> for line in lines: File "genfollow.py", line 8, in follow time.sleep(0.1) File "genfollow.py", line 30, in sigusr1 lines.close() ValueError: generator already executing • Sigh. Copyright (C) 2008, http://www.dabeaz.com 1- 130
  • 131. Shutdown • The only way to externally shutdown a generator would be to instrument with a flag or some kind of check def follow(thefile,shutdown=None): thefile.seek(0,2) while True: if shutdown and shutdown.isSet(): break line = thefile.readline() if not line: time.sleep(0.1) continue yield line Copyright (C) 2008, http://www.dabeaz.com 1- 131
  • 132. Shutdown • Example: import threading,signal shutdown = threading.Event() def sigusr1(signo,frame): print "Closing it down" shutdown.set() signal.signal(signal.SIGUSR1,sigusr1) lines = follow(open("access-log"),shutdown) for line in lines: print line, Copyright (C) 2008, http://www.dabeaz.com 1- 132
  • 133. Part 10 Parsing and Printing Copyright (C) 2008, http://www.dabeaz.com 1-133
  • 134. Incremental Parsing • Generators are a useful way to incrementally parse almost any kind of data # genrecord.py import struct def gen_records(record_format, thefile): record_size = struct.calcsize(record_format) while True: raw_record = thefile.read(record_size) if not raw_record: break yield struct.unpack(record_format, raw_record) • This function sweeps through a file and generates a sequence of unpacked records Copyright (C) 2008, http://www.dabeaz.com 1- 134
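The function can be tested against an in-memory "file" packed with the same "&lt;8sif" format the slides use; note that the f field is a 32-bit float, so values come back only approximately:

```python
import io
import struct

# gen_records: read fixed-size binary records and unpack them lazily.
def gen_records(record_format, thefile):
    record_size = struct.calcsize(record_format)
    while True:
        raw_record = thefile.read(record_size)
        if not raw_record:
            break
        yield struct.unpack(record_format, raw_record)

# Invented stock data, packed into an in-memory buffer (16 bytes/record).
buf = io.BytesIO(struct.pack("<8sif", b"ACME", 50, 91.1) +
                 struct.pack("<8sif", b"IBM", 75, 123.45))
records = list(gen_records("<8sif", buf))
```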
  • 135. Incremental Parsing • Example: from genrecord import * f = open("stockdata.bin","rb") for name, shares, price in gen_records("<8sif",f): # Process data ... • Tip : Look at xml.etree.ElementTree.iterparse for a neat way to incrementally process large XML documents using generators Copyright (C) 2008, http://www.dabeaz.com 1- 135
• 136. yield as print • Generator functions can use yield like a print statement • Example: def print_count(n): yield "Hello World\n" yield "\n" yield "Look at me count to %d\n" % n for i in xrange(n): yield " %d\n" % i yield "I'm done!\n" • This is useful if you're producing I/O output, but you want flexibility in how it gets handled Copyright (C) 2008, http://www.dabeaz.com 1- 136
  • 137. yield as print • Examples of processing the output stream: # Generate the output out = print_count(10) # Turn it into one big string out_str = "".join(out) # Write it to a file f = open("out.txt","w") for chunk in out: f.write(chunk) # Send it across a network socket for chunk in out: s.sendall(chunk) Copyright (C) 2008, http://www.dabeaz.com 1- 137
• 138. yield as print

  • This technique of producing output leaves the exact output method unspecified
  • So, the code is not hardwired to use files, sockets, or any other specific kind of output
  • There is an interesting code-reuse element
  • One use of this: WSGI applications
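The yield-as-print idea translates directly to Python 3 (range instead of xrange, explicit "\n" in the strings). A minimal sketch, joining the stream into one string as on slide 137:

```python
def print_count(n):
    # Each yield produces one chunk of output text; nothing is printed here.
    yield "Hello World\n"
    yield "\n"
    yield "Look at me count to %d\n" % n
    for i in range(n):
        yield "    %d\n" % i
    yield "I'm done!\n"

# One of several possible consumers: collect everything into a single string.
out_str = "".join(print_count(3))
print(out_str)
```

The same generator could instead be iterated by code that writes each chunk to a file or a socket; the producer never has to know which.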
• 139. Part 11: Co-routines
• 140. The Final Frontier

  • In Python 2.5, generators picked up the ability to receive values using .send()

        def recv_count():
            try:
                while True:
                    n = (yield)   # Yield expression
                    print "T-minus", n
            except GeneratorExit:
                print "Kaboom!"

  • Think of this function as receiving values rather than generating them
• 141. Example Use

  • Using a receiver

        >>> r = recv_count()
        >>> r.next()          # Note: must call .next() here
        >>> for i in range(5,0,-1):
        ...     r.send(i)
        ...
        T-minus 5
        T-minus 4
        T-minus 3
        T-minus 2
        T-minus 1
        >>> r.close()
        Kaboom!
        >>>
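In Python 3 the same receiver works with next(r) instead of r.next() and print() as a function. A sketch that collects the output in a list (the list is added here just to make the flow of values visible):

```python
received = []

def recv_count():
    try:
        while True:
            n = (yield)   # yield expression: receives the value sent in
            received.append("T-minus %d" % n)
    except GeneratorExit:
        # Raised inside the generator when .close() is called
        received.append("Kaboom!")

r = recv_count()
next(r)                   # prime it: advance to the first yield
for i in range(3, 0, -1):
    r.send(i)             # each send resumes the generator with a value
r.close()                 # triggers GeneratorExit at the yield
print(received)           # ['T-minus 3', 'T-minus 2', 'T-minus 1', 'Kaboom!']
```

The priming call to next() is essential: until the generator has reached its first yield, there is nowhere for a sent value to land.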
• 142. Co-routines

  • This form of a generator is a "co-routine"
  • Also sometimes called a "reverse-generator"
  • Python books (mine included) do a pretty poor job of explaining how co-routines are supposed to be used
  • I like to think of them as "receivers" or "consumers": they receive values sent to them
• 143. Setting up a Coroutine

  • To get a co-routine to run properly, you have to ping it with a .next() operation first

        def recv_count():
            try:
                while True:
                    n = (yield)   # Yield expression
                    print "T-minus", n
            except GeneratorExit:
                print "Kaboom!"

  • Example:

        r = recv_count()
        r.next()

  • This advances it to the first yield, where it will receive its first value
• 144. @consumer decorator

  • The .next() bit can be handled via decoration

        def consumer(func):
            def start(*args, **kwargs):
                c = func(*args, **kwargs)
                c.next()
                return c
            return start

  • Example:

        @consumer
        def recv_count():
            try:
                while True:
                    n = (yield)   # Yield expression
                    print "T-minus", n
            except GeneratorExit:
                print "Kaboom!"
• 145. @consumer decorator

  • Using the decorated version

        >>> r = recv_count()
        >>> for i in range(5,0,-1):
        ...     r.send(i)
        ...
        T-minus 5
        T-minus 4
        T-minus 3
        T-minus 2
        T-minus 1
        >>> r.close()
        Kaboom!
        >>>

  • Don't need the extra .next() step here
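A Python 3 version of the decorator only needs next(c) in place of c.next(); functools.wraps is an addition of this sketch (not on the slide) so the wrapped coroutine keeps its name. Again a list stands in for the print statements so the effect is observable:

```python
from functools import wraps

def consumer(func):
    @wraps(func)
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)            # prime the coroutine to its first yield
        return c
    return start

log = []

@consumer
def recv_count():
    try:
        while True:
            n = (yield)
            log.append(n)
    except GeneratorExit:
        log.append("done")

r = recv_count()           # already primed; no separate next() call needed
r.send(5)
r.send(4)
r.close()
print(log)                 # [5, 4, 'done']
```

The decorator moves the easy-to-forget priming step into one place, so every decorated coroutine is ready to receive values the moment it is created.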
• 146. Coroutine Pipelines

  • Co-routines also set up a processing pipeline
  • Instead of being defined by iteration, it's defined by pushing values into the pipeline using .send()

        source --.send()--> coroutine --.send()--> coroutine --.send()--> coroutine

  • We already saw some of this with broadcasting
• 147. Broadcasting (Reprise)

  • Consume a generator and send items to a set of consumers

        def broadcast(source, consumers):
            for item in source:
                for c in consumers:
                    c.send(item)

  • Notice the send() operation there
  • The consumers could be co-routines
• 148. Example

        @consumer
        def find_404():
            while True:
                r = (yield)
                if r['status'] == 404:
                    print r['status'], r['datetime'], r['request']

        @consumer
        def bytes_transferred():
            total = 0
            while True:
                r = (yield)
                total += r['bytes']
                print "Total bytes", total

        lines = follow(open("access-log"))
        log = apache_log(lines)
        broadcast(log, [find_404(), bytes_transferred()])
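The follow() and apache_log() functions come from earlier parts of the tutorial, so as a self-contained Python 3 sketch of the same broadcast pipeline, a small hard-coded list of log-record dicts stands in for the live log, and lists stand in for the print statements:

```python
def consumer(func):
    # Same decorator as slide 144, in Python 3 form
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)
        return c
    return start

def broadcast(source, consumers):
    # Push every item from the source into each consumer coroutine
    for item in source:
        for c in consumers:
            c.send(item)

not_found = []
running_totals = []

@consumer
def find_404():
    while True:
        r = (yield)
        if r['status'] == 404:
            not_found.append(r['request'])

@consumer
def bytes_transferred():
    total = 0
    while True:
        r = (yield)
        total += r['bytes']
        running_totals.append(total)

records = [
    {'status': 200, 'request': '/index.html', 'bytes': 1024},
    {'status': 404, 'request': '/missing.html', 'bytes': 0},
    {'status': 200, 'request': '/logo.png', 'bytes': 2048},
]
broadcast(records, [find_404(), bytes_transferred()])
print(not_found)            # ['/missing.html']
print(running_totals[-1])   # 3072
```

Both consumers see every record, yet there is only a single pass over the source and no threads anywhere.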
• 149. Discussion

  • In the last example, there were multiple consumers
  • However, there were no threads
  • Further exploration along these lines can take you into co-operative multitasking and concurrent programming without using threads
  • But that's an entirely different tutorial!