The main focus of this chapter is using Python to process data presented in different kinds of common encodings, such as CSV files, JSON, XML, and binary packed records. Unlike the chapter on data structures, this chapter is not focused on specific algorithms, but instead on the problem of getting data in and out of a program.
For most kinds of CSV data, use the csv library. For
example, suppose you have some stock market data
in a file named stocks.csv like this:
Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
"AXP",62.58,"6/11/2007","9:36am",-0.46,935000
"BA",98.31,"6/11/2007","9:36am",+0.12,104800
"C",53.08,"6/11/2007","9:36am",-0.25,360900
"CAT",78.29,"6/11/2007","9:36am",-0.23,225400Here’s how you would read the data as a sequence of tuples:
import csv
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Process row
        ...
In the preceding code, row will be a tuple. Thus, to access certain
fields, you will need to use indexing, such as row[0] (Symbol) and
row[4] (Change).
Since such indexing can often be confusing, this is one place where you might want to consider the use of named tuples. For example:
from collections import namedtuple
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headings = next(f_csv)
    Row = namedtuple('Row', headings)
    for r in f_csv:
        row = Row(*r)
        # Process row
        ...
This would allow you to use the column headers such as row.Symbol and
row.Change instead of indices. It should be noted that this only works if the column headers are valid Python identifiers. If not, you might have
to massage the initial headings (e.g., replacing nonidentifier characters
with underscores or similar).
Another alternative is to read the data as a sequence of dictionaries instead. To do that, use this code:
import csv
with open('stocks.csv') as f:
    f_csv = csv.DictReader(f)
    for row in f_csv:
        # process row
        ...
In this version, you would access the elements of each row using
the row headers. For example, row['Symbol'] or row['Change'].
To write CSV data, you also use the csv module but create a writer
object. For example:
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),
        ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),
        ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),
       ]

# Open with newline='' so the csv module controls line endings itself
with open('stocks.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)
If you have the data as a sequence of dictionaries, do this:
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [{'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.18, 'Volume':181800},
        {'Symbol':'AIG', 'Price':71.38, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.15, 'Volume':195500},
        {'Symbol':'AXP', 'Price':62.58, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.46, 'Volume':935000},
       ]

# Open with newline='' so the csv module controls line endings itself
with open('stocks.csv', 'w', newline='') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)
You should almost always prefer the use of the csv module over manually
trying to split and parse CSV data yourself. For instance, you might be
inclined to just write some code like this:
with open('stocks.csv') as f:
    for line in f:
        row = line.split(',')
        # process row
        ...
The problem with this approach is that you’ll still need to deal with some nasty details. For example, if any of the fields are surrounded by quotes, you’ll have to strip the quotes. In addition, if a quoted field happens to contain a comma, the code will break by producing a row with the wrong size.
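To see the difference concretely, here is a small illustrative comparison (the quoted company name is made up for the example):

```python
import csv
import io

# A row whose quoted first field contains a comma
line = '"ACME, Inc.",100,490.10'

# Naive splitting breaks the quoted field apart and leaves the quotes in
naive = line.split(',')
print(naive)        # ['"ACME', ' Inc."', '100', '490.10']

# csv.reader honors the quoting rules
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)       # ['ACME, Inc.', '100', '490.10']
```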
By default, the csv library is programmed to understand CSV encoding rules
used by Microsoft Excel. This is probably the most common variant, and will
likely give you the best compatibility. However, if you consult the
documentation for csv, you’ll see a few ways to tweak the encoding to
different formats (e.g., changing the separator character, etc.). For
example, if you want to read tab-delimited data instead, use this:
# Example of reading tab-separated values
with open('stock.tsv') as f:
    f_tsv = csv.reader(f, delimiter='\t')
    for row in f_tsv:
        # Process row
        ...
If you’re reading CSV data and converting it into named tuples, you need to be a little careful with validating column headers. For example, a CSV file could have a header line containing nonvalid identifier characters like this:
Street Address,Num-Premises,Latitude,Longitude
5412 N CLARK,10,41.980262,-87.668452
This will actually cause the creation of a namedtuple to fail with a ValueError
exception. To work around this, you might have to scrub the headers first.
For instance, carrying out a regex substitution on nonvalid identifier characters like
this:
import re
from collections import namedtuple

with open('stock.csv') as f:
    f_csv = csv.reader(f)
    headers = [re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv)]
    Row = namedtuple('Row', headers)
    for r in f_csv:
        row = Row(*r)
        # Process row
        ...
It’s also important to emphasize that csv does not try to interpret
the data or convert it to a type other than a string. If such conversions
are important, that is something you’ll need to do yourself. Here is
one example of performing extra type conversions on CSV data:
col_types = [str, float, str, str, float, int]
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Apply conversions to the row items
        row = tuple(convert(value) for convert, value in zip(col_types, row))
        ...
Alternatively, here is an example of converting selected fields of dictionaries:
print('Reading as dicts with type conversion')
field_types = [('Price', float),
               ('Change', float),
               ('Volume', int)]

with open('stocks.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                   for key, conversion in field_types)
        print(row)
In general, you’ll probably want to be a bit careful with such conversions, though. In the real world, it’s common for CSV files to have missing values, corrupted data, and other issues that would break type conversions. So, unless your data is guaranteed to be error free, that’s something you’ll need to consider (you might need to add suitable exception handling).
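One possible defensive approach, shown here purely as a sketch, is to wrap each conversion in a small helper that substitutes a default value on failure; the safe_convert() name and the 'N/A' field are assumptions for the example:

```python
# Illustrative helper: fall back to a default when a field can't be converted
def safe_convert(func, value, default=None):
    try:
        return func(value)
    except ValueError:
        return default

col_types = [str, float, str, str, float, int]
row = ['AA', '39.48', '6/11/2007', '9:36am', 'N/A', '181800']

# The bad 'N/A' field becomes None instead of raising ValueError
converted = tuple(safe_convert(func, val) for func, val in zip(col_types, row))
print(converted)   # ('AA', 39.48, '6/11/2007', '9:36am', None, 181800)
```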
Finally, if your goal in reading CSV data is to perform data analysis
and statistics, you might want to look at the Pandas package. Pandas includes a convenient
pandas.read_csv() function that will load CSV data into a
DataFrame object. From there, you can generate various summary
statistics, filter the data, and perform other kinds of high-level operations.
An example is given in Recipe 6.13.
The json module provides an easy way to encode and decode data in JSON.
The two main functions are json.dumps() and json.loads(), mirroring the
interface used in other serialization libraries, such as pickle. Here is
how you turn a Python data structure into JSON:
import json

data = {
    'name' : 'ACME',
    'shares' : 100,
    'price' : 542.23
}

json_str = json.dumps(data)
Here is how you turn a JSON-encoded string back into a Python data structure:
data = json.loads(json_str)
If you are working with files instead of strings, you can alternatively use
json.dump() and json.load() to encode and decode JSON data. For example:
# Writing JSON data
with open('data.json', 'w') as f:
    json.dump(data, f)

# Reading data back
with open('data.json', 'r') as f:
    data = json.load(f)
JSON encoding supports the basic types of None, bool, int,
float, and str, as well as lists, tuples, and dictionaries
containing those types. For dictionaries, keys are assumed to be
strings (any nonstring keys in a dictionary are converted to strings
when encoding). To be compliant with the JSON specification, you
should only encode Python lists and dictionaries. Moreover, in web
applications, it is standard practice for the top-level object to
be a dictionary.
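A quick demonstration of these conversion rules:

```python
import json

# Tuples are encoded as JSON arrays, so they come back as lists
arr = json.dumps((1, 2, 3))
print(arr)                 # '[1, 2, 3]'

# Nonstring dictionary keys are converted to strings when encoding
obj = json.dumps({1: 'one', 2: 'two'})
print(obj)                 # '{"1": "one", "2": "two"}'

# Round-tripping therefore does not preserve the original key types
print(json.loads(obj))     # {'1': 'one', '2': 'two'}
```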
The format of JSON encoding is almost identical to Python syntax except
for a few minor changes. For instance, True is mapped to true, False is
mapped to false, and None is mapped to null. Here is an example
that shows what the encoding looks like:
>>> json.dumps(False)
'false'
>>> d = {'a': True,
...      'b': 'Hello',
...      'c': None}
>>> json.dumps(d)
'{"b": "Hello", "c": null, "a": true}'
>>>
If you are trying to examine data you have decoded from JSON, it can
often be hard to ascertain its structure simply by printing it
out—especially if the data contains a deep level of nested structures
or a lot of fields. To assist with this, consider using the
pprint() function in the pprint module. This will alphabetize the
keys and output a dictionary in a more sane way. Here is an example
that illustrates how you would pretty print the results of a search on
Twitter:
>>> from urllib.request import urlopen
>>> import json
>>> u = urlopen('http://search.twitter.com/search.json?q=python&rpp=5')
>>> resp = json.loads(u.read().decode('utf-8'))
>>> from pprint import pprint
>>> pprint(resp)
{'completed_in': 0.074,
 'max_id': 264043230692245504,
 'max_id_str': '264043230692245504',
 'next_page': '?page=2&max_id=264043230692245504&q=python&rpp=5',
 'page': 1,
 'query': 'python',
 'refresh_url': '?since_id=264043230692245504&q=python',
 'results': [{'created_at': 'Thu, 01 Nov 2012 16:36:26 +0000',
              'from_user': ...},
             {'created_at': 'Thu, 01 Nov 2012 16:36:14 +0000',
              'from_user': ...},
             {'created_at': 'Thu, 01 Nov 2012 16:36:13 +0000',
              'from_user': ...},
             {'created_at': 'Thu, 01 Nov 2012 16:36:07 +0000',
              'from_user': ...},
             {'created_at': 'Thu, 01 Nov 2012 16:36:04 +0000',
              'from_user': ...}],
 'results_per_page': 5,
 'since_id': 0,
 'since_id_str': '0'}
>>>
Normally, JSON decoding will create dicts or lists from the supplied data. If you want
to create different kinds of objects, supply the object_pairs_hook or object_hook
to json.loads(). For example, here is how you would decode JSON data, preserving its
order in an OrderedDict:
>>> s = '{"name": "ACME", "shares": 50, "price": 490.1}'
>>> from collections import OrderedDict
>>> data = json.loads(s, object_pairs_hook=OrderedDict)
>>> data
OrderedDict([('name', 'ACME'), ('shares', 50), ('price', 490.1)])
>>>
Here is how you could turn a JSON dictionary into a Python object:
>>> class JSONObject:
...     def __init__(self, d):
...         self.__dict__ = d
...
>>>
>>> data = json.loads(s, object_hook=JSONObject)
>>> data.name
'ACME'
>>> data.shares
50
>>> data.price
490.1
>>>
In this last example, the dictionary created by decoding the JSON data is passed
as a single argument to __init__(). From there, you are free to use it as you will,
such as using it directly as the instance dictionary of the object.
There are a few options that can be useful for encoding JSON. If you would like the output
to be nicely formatted, you can use the indent argument to json.dumps(). This causes
the output to be pretty printed in a format similar to the output of the pprint() function.
For example:
>>> print(json.dumps(data))
{"price": 542.23, "name": "ACME", "shares": 100}
>>> print(json.dumps(data, indent=4))
{
    "price": 542.23,
    "name": "ACME",
    "shares": 100
}
>>>
If you want the keys to be sorted on output, use the sort_keys argument:
>>> print(json.dumps(data, sort_keys=True))
{"name": "ACME", "price": 542.23, "shares": 100}
>>>
Instances are not normally serializable as JSON. For example:
>>> class Point:
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...
>>> p = Point(2, 3)
>>> json.dumps(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/json/__init__.py", line 226, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python3.3/json/encoder.py", line 187, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.3/json/encoder.py", line 245, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.3/json/encoder.py", line 169, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <__main__.Point object at 0x1006f2650> is not JSON serializable
>>>
If you want to serialize instances, you can supply a function that takes an instance as input and returns a dictionary that can be serialized. For example:
def serialize_instance(obj):
    d = { '__classname__' : type(obj).__name__ }
    d.update(vars(obj))
    return d
If you want to get an instance back, you could write code like this:
# Dictionary mapping names to known classes
classes = {
    'Point' : Point
}

def unserialize_object(d):
    clsname = d.pop('__classname__', None)
    if clsname:
        cls = classes[clsname]
        obj = cls.__new__(cls)    # Make instance without calling __init__
        for key, value in d.items():
            setattr(obj, key, value)
        return obj
    else:
        return d
Here is an example of how these functions are used:
>>> p = Point(2, 3)
>>> s = json.dumps(p, default=serialize_instance)
>>> s
'{"__classname__": "Point", "y": 3, "x": 2}'
>>> a = json.loads(s, object_hook=unserialize_object)
>>> a
<__main__.Point object at 0x1017577d0>
>>> a.x
2
>>> a.y
3
>>>
The json module has a variety of other options for controlling the
low-level interpretation of numbers, special values such as NaN, and
more. Consult the documentation for further details.
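For instance, two of those options are the parse_float hook on decoding and the encoder's handling of special float values; here is a brief sketch:

```python
import json
from decimal import Decimal

# parse_float controls how real numbers are decoded; here they become
# exact Decimal instances instead of binary floats
data = json.loads('{"price": 542.23}', parse_float=Decimal)
print(data['price'])               # Decimal('542.23')

# By default, the encoder emits out-of-spec tokens for special values
# such as NaN; passing allow_nan=False makes it raise ValueError instead
print(json.dumps(float('nan')))    # 'NaN' (not valid strict JSON)
```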
The xml.etree.ElementTree module can be used to extract data from
simple XML documents. To illustrate, suppose you want to parse and
make a summary of the RSS feed on Planet Python. Here is a script that will do it:
from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind('channel/item'):
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')

    print(title)
    print(date)
    print(link)
    print()
If you run the preceding script, the output looks similar to the following:
Steve Holden: Python for Data Analysis
Mon, 19 Nov 2012 02:13:51 +0000
http://holdenweb.blogspot.com/2012/11/python-for-data-analysis.html
Vasudev Ram: The Python Data model (for v2 and v3)
Sun, 18 Nov 2012 22:06:47 +0000
http://jugad2.blogspot.com/2012/11/the-python-data-model.html
Python Diary: Been playing around with Object Databases
Sun, 18 Nov 2012 20:40:29 +0000
http://www.pythondiary.com/blog/Nov.18,2012/been-...-object-databases.html
Vasudev Ram: Wakari, Scientific Python in the cloud
Sun, 18 Nov 2012 20:19:41 +0000
http://jugad2.blogspot.com/2012/11/wakari-scientific-python-in-cloud.html
Jesse Jiryu Davis: Toro: synchronization primitives for Tornado coroutines
Sun, 18 Nov 2012 20:17:49 +0000
http://feedproxy.google.com/~r/EmptysquarePython/~3/_DOZT2Kd0hQ/

Obviously, if you want to do more processing, you need to replace the
print() statements with something more interesting.
Working with data encoded as XML is commonplace in many applications. Not only is XML widely used as a format for exchanging data on the Internet, it is a common format for storing application data (e.g., word processing, music libraries, etc.). The discussion that follows already assumes the reader is familiar with XML basics.
In many cases, when XML is simply being used to store data, the document structure is compact and straightforward. For example, the RSS feed from the example looks similar to the following:
<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Planet Python</title>
<link>http://planet.python.org/</link>
<language>en</language>
<description>Planet Python - http://planet.python.org/</description>
<item>
<title>Steve Holden: Python for Data Analysis</title>
<guid>http://holdenweb.blogspot.com/...-data-analysis.html</guid>
<link>http://holdenweb.blogspot.com/...-data-analysis.html</link>
<description>...</description>
<pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
</item>
<item>
<title>Vasudev Ram: The Python Data model (for v2 and v3)</title>
<guid>http://jugad2.blogspot.com/...-data-model.html</guid>
<link>http://jugad2.blogspot.com/...-data-model.html</link>
<description>...</description>
<pubDate>Sun, 18 Nov 2012 22:06:47 +0000</pubDate>
</item>
<item>
<title>Python Diary: Been playing around with Object Databases</title>
<guid>http://www.pythondiary.com/...-object-databases.html</guid>
<link>http://www.pythondiary.com/...-object-databases.html</link>
<description>...</description>
<pubDate>Sun, 18 Nov 2012 20:40:29 +0000</pubDate>
</item>
...
</channel>
</rss>

The xml.etree.ElementTree.parse() function parses the entire XML
document into a document object. From there, you use methods such as
find(), iterfind(), and findtext() to search for specific XML
elements. The arguments to these functions are the names of a specific
tag, such as channel/item or title.
When specifying tags, you need to take the overall document structure
into account. Each find operation takes place relative to a starting
element. Likewise, the tagname that you supply to each operation is also
relative to the start. In the example, the call to
doc.iterfind('channel/item') looks for all “item” elements under a
“channel” element. doc represents the top of the document (the
top-level “rss” element). The later calls to item.findtext() take
place relative to the found “item” elements.
Each element represented by the ElementTree module has a few
essential attributes and methods that are useful when parsing. The
tag attribute contains the name of the tag, the text attribute
contains enclosed text, and the get() method can be used to extract
attributes (if any). For example:
>>> doc
<xml.etree.ElementTree.ElementTree object at 0x101339510>
>>> e = doc.find('channel/title')
>>> e
<Element 'title' at 0x10135b310>
>>> e.tag
'title'
>>> e.text
'Planet Python'
>>> e.get('some_attribute')
>>>
It should be noted that xml.etree.ElementTree is not the only option
for XML parsing. For more advanced applications, you might consider lxml. It uses the same
programming interface as ElementTree, so the example shown in this
recipe works in the same manner. You simply need to change the first import to
from lxml.etree import parse. lxml provides the benefit of being
fully compliant with XML standards. It is also extremely fast, and provides support for features
such as validation, XSLT, and XPath.
Any time you are faced with the problem of incremental data processing, you should think of iterators and generators. Here is a simple function that can be used to incrementally process huge XML files using a very small memory footprint:
from xml.etree.ElementTree import iterparse

def parse_and_remove(filename, path):
    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    # Skip the root element
    next(doc)

    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass
To test the function, you now need to find a large XML file to work with. You can often find such files on government and open data websites. For example, you can download Chicago’s pothole database as XML. At the time of this writing, the downloaded file consists of more than 100,000 rows of data, which are encoded like this:
<response>
<row>
<row ...>
<creation_date>2012-11-18T00:00:00</creation_date>
<status>Completed</status>
<completion_date>2012-11-18T00:00:00</completion_date>
<service_request_number>12-01906549</service_request_number>
<type_of_service_request>Pot Hole in Street</type_of_service_request>
<current_activity>Final Outcome</current_activity>
<most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
<street_address>4714 S TALMAN AVE</street_address>
<zip>60632</zip>
<x_coordinate>1159494.68618856</x_coordinate>
<y_coordinate>1873313.83503384</y_coordinate>
<ward>14</ward>
<police_district>9</police_district>
<community_area>58</community_area>
<latitude>41.808090232127896</latitude>
<longitude>-87.69053684711305</longitude>
<location latitude="41.808090232127896"
longitude="-87.69053684711305" />
</row>
<row ...>
<creation_date>2012-11-18T00:00:00</creation_date>
<status>Completed</status>
<completion_date>2012-11-18T00:00:00</completion_date>
<service_request_number>12-01906695</service_request_number>
<type_of_service_request>Pot Hole in Street</type_of_service_request>
<current_activity>Final Outcome</current_activity>
<most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
<street_address>3510 W NORTH AVE</street_address>
<zip>60647</zip>
<x_coordinate>1152732.14127696</x_coordinate>
<y_coordinate>1910409.38979075</y_coordinate>
<ward>26</ward>
<police_district>14</police_district>
<community_area>23</community_area>
<latitude>41.91002084292946</latitude>
<longitude>-87.71435952353961</longitude>
<location latitude="41.91002084292946"
longitude="-87.71435952353961" />
</row>
</row>
</response>

Suppose you want to write a script that ranks ZIP codes by the number of pothole reports. To do it, you could write code like this:
from xml.etree.ElementTree import parse
from collections import Counter

potholes_by_zip = Counter()

doc = parse('potholes.xml')
for pothole in doc.iterfind('row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
The only problem with this script is that it reads and parses the entire XML file into memory. On our machine, it takes about 450 MB of memory to run. Using this recipe’s code, the program changes only slightly:
from collections import Counter

potholes_by_zip = Counter()

data = parse_and_remove('potholes.xml', 'row/row')
for pothole in data:
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
This version of code runs with a memory footprint of only 7 MB—a huge savings!
This recipe relies on two core features of the ElementTree
module. First, the iterparse() method allows incremental processing
of XML documents. To use it, you supply the filename along with an
event list consisting of one or more of the following: start, end,
start-ns, and end-ns. The iterator created by iterparse()
produces tuples of the form (event, elem), where event is one of
the listed events and elem is the resulting XML element. For
example:
>>> data = iterparse('potholes.xml', ('start', 'end'))
>>> next(data)
('start', <Element 'response' at 0x100771d60>)
>>> next(data)
('start', <Element 'row' at 0x100771e68>)
>>> next(data)
('start', <Element 'row' at 0x100771fc8>)
>>> next(data)
('start', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('end', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('start', <Element 'status' at 0x1006a7f18>)
>>> next(data)
('end', <Element 'status' at 0x1006a7f18>)
>>>
start events are created when an element is first created but not
yet populated with any other data (e.g., child elements). end
events are created when an element is completed. Although not shown
in this recipe, start-ns and end-ns events are used to handle
XML namespace declarations.
In this recipe, the start and end events are used to manage stacks of
elements and tags. The stacks represent the current hierarchical structure
of the document as it’s being parsed, and are also used to determine if
an element matches the requested path given to the parse_and_remove()
function. If a match is made, yield is used to emit it back to the caller.
The following statement after the yield is the core feature of
ElementTree that makes this recipe save memory:
elem_stack[-2].remove(elem)
This statement causes the previously yielded element to be removed from its parent. Assuming that no references are left to it anywhere else, the element is destroyed and memory reclaimed.
The end effect of the iterative parse and the removal of nodes is a highly efficient incremental sweep over the document. At no point is a complete document tree ever constructed. Yet, it is still possible to write code that processes the XML data in a straightforward manner.
The primary downside to this recipe is its runtime performance. When tested, the version of code that reads the entire document into memory first runs approximately twice as fast as the version that processes it incrementally. However, it requires more than 60 times as much memory. So, if memory use is a greater concern, the incremental version is a big win.
Although the xml.etree.ElementTree library is commonly used for
parsing, it can also be used to create XML documents. For example,
consider this function:
from xml.etree.ElementTree import Element

def dict_to_xml(tag, d):
    '''
    Turn a simple dict of key/value pairs into XML
    '''
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem
Here is an example:
>>> s = { 'name': 'GOOG', 'shares': 100, 'price': 490.1 }
>>> e = dict_to_xml('stock', s)
>>> e
<Element 'stock' at 0x1004b64c8>
>>>
The result of this conversion is an Element instance. For I/O,
it is easy to convert this to a byte string using the tostring()
function in xml.etree.ElementTree. For example:
>>> from xml.etree.ElementTree import tostring
>>> tostring(e)
b'<stock><price>490.1</price><shares>100</shares><name>GOOG</name></stock>'
>>>
If you want to attach attributes to an element, use its set() method:
>>> e.set('_id', '1234')
>>> tostring(e)
b'<stock _id="1234"><price>490.1</price><shares>100</shares><name>GOOG</name></stock>'
>>>
If the order of the elements matters, consider making an OrderedDict instead of
a normal dictionary. See Recipe 1.7.
When creating XML, you might be inclined to just make strings instead. For example:
def dict_to_xml_str(tag, d):
    '''
    Turn a simple dict of key/value pairs into XML
    '''
    parts = ['<{}>'.format(tag)]
    for key, val in d.items():
        parts.append('<{0}>{1}</{0}>'.format(key, val))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)
The problem is that you’re going to make a real mess for yourself if you try to do things manually. For example, what happens if the dictionary values contain special characters like this?
>>> d = { 'name' : '<spam>' }

>>> # String creation
>>> dict_to_xml_str('item', d)
'<item><name><spam></name></item>'

>>> # Proper XML creation
>>> e = dict_to_xml('item', d)
>>> tostring(e)
b'<item><name>&lt;spam&gt;</name></item>'
>>>
Notice how in the latter example, the characters < and > got
replaced with &lt; and &gt;.
Just for reference, if you ever need to manually escape or unescape
such characters, you can use the escape() and unescape() functions
in xml.sax.saxutils. For example:
>>> from xml.sax.saxutils import escape, unescape
>>> escape('<spam>')
'&lt;spam&gt;'
>>> unescape(_)
'<spam>'
>>>
Aside from creating correct output, the other reason why it’s a good idea to create Element instances instead
of strings is that they can be more easily combined together to make
a larger document. The resulting Element instances can also be
processed in various ways without ever having to worry about parsing
the XML text. Essentially, you can do all of the processing of the
data in a more high-level form and then output it as a string at the very end.
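As a small sketch of this idea (the make_stock() helper is hypothetical), separately built elements combine naturally into a larger document:

```python
from xml.etree.ElementTree import Element, tostring

# Hypothetical helper: build one <stock> element from plain values
def make_stock(name, shares):
    elem = Element('stock')
    name_elem = Element('name')
    name_elem.text = name
    elem.append(name_elem)
    shares_elem = Element('shares')
    shares_elem.text = str(shares)
    elem.append(shares_elem)
    return elem

# Independently created elements are appended into one larger tree,
# and the string form is produced only at the very end
portfolio = Element('portfolio')
portfolio.append(make_stock('GOOG', 100))
portfolio.append(make_stock('AAPL', 50))
print(tostring(portfolio))
```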
The xml.etree.ElementTree module makes it easy to perform
such tasks. Essentially, you start out by parsing the document
in the usual way. For example, suppose you have a document
named pred.xml that looks like this:
<?xml version="1.0"?><stop><id>14791</id><nm>Clark&Balmoral</nm><sri><rt>22</rt><d>North Bound</d><dd>North Bound</dd></sri><cr>22</cr><pre><pt>5 MIN</pt><fd>Howard</fd><v>1378</v><rn>22</rn></pre><pre><pt>15 MIN</pt><fd>Howard</fd><v>1867</v><rn>22</rn></pre></stop>
Here is an example of using ElementTree to read it and make changes
to the structure:
>>> from xml.etree.ElementTree import parse, Element
>>> doc = parse('pred.xml')
>>> root = doc.getroot()
>>> root
<Element 'stop' at 0x100770cb0>

>>> # Remove a few elements
>>> root.remove(root.find('sri'))
>>> root.remove(root.find('cr'))

>>> # Insert a new element after <nm>...</nm>
>>> list(root).index(root.find('nm'))
1
>>> e = Element('spam')
>>> e.text = 'This is a test'
>>> root.insert(2, e)

>>> # Write back to a file
>>> doc.write('newpred.xml', xml_declaration=True)
>>>
The result of these operations is a new XML file that looks like this:
<?xml version='1.0' encoding='us-ascii'?>
<stop>
    <id>14791</id>
    <nm>Clark&amp;Balmoral</nm>
    <spam>This is a test</spam>
    <pre>
        <pt>5 MIN</pt>
        <fd>Howard</fd>
        <v>1378</v>
        <rn>22</rn>
    </pre>
    <pre>
        <pt>15 MIN</pt>
        <fd>Howard</fd>
        <v>1867</v>
        <rn>22</rn>
    </pre>
</stop>
Modifying the structure of an XML document is straightforward, but you
must remember that all modifications are generally made to the parent
element, treating it as if it were a list. For example, if you remove
an element, it is removed from its immediate parent using the parent’s
remove() method. If you insert or append new elements, you also use
insert() and append() methods on the parent. Elements can also be
manipulated using indexing and slicing operations, such as
element[i] or element[i:j].
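A short sketch of such list-like access, using a small inline document rather than the pred.xml file:

```python
from xml.etree.ElementTree import fromstring, tostring

# A small inline document standing in for a parsed file
root = fromstring('<stop><id>14791</id><nm>Clark</nm><cr>22</cr></stop>')

# Children can be accessed by index...
print(root[0].tag)                  # 'id'

# ...and sliced like a list
print([e.tag for e in root[1:]])    # ['nm', 'cr']

# Items can also be deleted with del, just as with list items
del root[2]
print(tostring(root))               # b'<stop><id>14791</id><nm>Clark</nm></stop>'
```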
If you need to make new elements, use the Element class, as shown in this recipe’s solution.
This is described further in Recipe 6.5.
Consider a document that uses namespaces like this:
<?xml version="1.0" encoding="utf-8"?><top><author>David Beazley</author><content><htmlxmlns="http://www.w3.org/1999/xhtml"><head><title>Hello World</title></head><body><h1>Hello World!</h1></body></html></content></top>
If you parse this document and try to perform the usual queries, you’ll find that it doesn’t work so easily because everything becomes incredibly verbose:
>>> # Some queries that work
>>> doc.findtext('author')
'David Beazley'
>>> doc.find('content')
<Element 'content' at 0x100776ec0>

>>> # A query involving a namespace (doesn't work)
>>> doc.find('content/html')

>>> # Works if fully qualified
>>> doc.find('content/{http://www.w3.org/1999/xhtml}html')
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>

>>> # Doesn't work
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/head/title')

>>> # Fully qualified
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/'
... '{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')
'Hello World'
>>>
You can often simplify matters for yourself by wrapping namespace handling up into a utility class.
class XMLNamespaces:
    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)

    def register(self, name, uri):
        self.namespaces[name] = '{' + uri + '}'

    def __call__(self, path):
        return path.format_map(self.namespaces)
To use this class, you do the following:
>>> ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')
>>> doc.find(ns('content/{html}html'))
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> doc.findtext(ns('content/{html}html/{html}head/{html}title'))
'Hello World'
>>>
Parsing XML documents that contain namespaces can be messy. The XMLNamespaces class is really just meant to clean it up
slightly by allowing you to use the shortened namespace names
in subsequent operations as opposed to fully qualified URIs.
Unfortunately, there is no mechanism in the basic ElementTree parser
to get further information about namespaces. However, you can get a bit
more information about the scope of namespace processing
if you’re willing to use the iterparse() function instead. For example:
>>> from xml.etree.ElementTree import iterparse
>>> for evt, elem in iterparse('ns2.xml', ('end', 'start-ns', 'end-ns')):
...     print(evt, elem)
...
end <Element 'author' at 0x10110de10>
start-ns ('', 'http://www.w3.org/1999/xhtml')
end <Element '{http://www.w3.org/1999/xhtml}title' at 0x1011131b0>
end <Element '{http://www.w3.org/1999/xhtml}head' at 0x1011130a8>
end <Element '{http://www.w3.org/1999/xhtml}h1' at 0x101113310>
end <Element '{http://www.w3.org/1999/xhtml}body' at 0x101113260>
end <Element '{http://www.w3.org/1999/xhtml}html' at 0x10110df70>
end-ns None
end <Element 'content' at 0x10110de68>
end <Element 'top' at 0x10110dd60>
>>> elem    # This is the topmost element
<Element 'top' at 0x10110dd60>
>>>
As a final note, if the text you are parsing makes use of namespaces in
addition to other advanced XML features, you’re really better off using
the lxml library instead of ElementTree. For
instance, lxml provides better support for validating documents against
a DTD, more complete XPath support, and other advanced XML features.
This recipe is really just a simple fix to make parsing a little
easier.
A standard way to represent rows of data in Python is as a sequence of tuples. For example:
stocks = [
    ('GOOG', 100, 490.1),
    ('AAPL', 50, 545.75),
    ('FB', 150, 7.45),
    ('HPQ', 75, 33.2),
]
Given data in this form, it is relatively straightforward to interact with a relational database using Python’s standard database API, as described in PEP 249. The gist of the API is that all operations on the database are carried out by SQL queries. Each row of input or output data is represented by a tuple.
To illustrate, you can use the sqlite3 module that comes with
Python. If you are using a different database (e.g., MySQL, Postgres, or ODBC),
you’ll have to install a third-party module to support it.
However, the underlying programming interface will be virtually the same,
if not identical.
The first step is to connect to the database. Typically, you execute a
connect() function, supplying parameters such as the name of the database,
hostname, username, password, and other details as needed. For example:
>>> import sqlite3
>>> db = sqlite3.connect('database.db')
>>>
To do anything with the data, you next create a cursor. Once you have a cursor, you can start executing SQL queries. For example:
>>> c = db.cursor()
>>> c.execute('create table portfolio (symbol text, shares integer, price real)')
<sqlite3.Cursor object at 0x10067a730>
>>> db.commit()
>>>
To insert a sequence of rows into the table, use a statement like this:
>>> c.executemany('insert into portfolio values (?,?,?)', stocks)
<sqlite3.Cursor object at 0x10067a730>
>>> db.commit()
>>>
To perform a query, use a statement such as this:
>>> for row in db.execute('select * from portfolio'):
...     print(row)
...
('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
('FB', 150, 7.45)
('HPQ', 75, 33.2)
>>>
If you want to perform queries that accept user-supplied
input parameters, make sure you escape the parameters using ?
like this:
>>> min_price = 100
>>> for row in db.execute('select * from portfolio where price >= ?',
...                       (min_price,)):
...     print(row)
...
('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
>>>
At a low level, interacting with a database is an extremely straightforward thing to do. You simply form SQL statements and feed them to the underlying module to either update the database or retrieve data. That said, there are still some tricky details you’ll need to sort out on a case-by-case basis.
One complication is the mapping of data from the database into Python
types. For entries such as dates, it is most common to use datetime
instances from the datetime module, or possibly system timestamps, as
used in the time module. For numerical data, especially financial
data involving decimals, numbers may be represented as Decimal
instances from the decimal module. Unfortunately, the exact mapping
varies by database backend so you’ll have to read the associated
documentation.
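As one concrete illustration of how this mapping can be configured, the sqlite3 module can be told to convert columns declared with a date type into datetime.date instances by enabling its type detection. This is a sketch specific to sqlite3 (the table and column names are made up for the example); note that recent Python versions deprecate these built-in converters in favor of registering your own:

```python
import sqlite3
import datetime

# PARSE_DECLTYPES makes sqlite3 apply registered converters
# based on the declared column types (here, "date")
db = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
c = db.cursor()
c.execute('create table trades (symbol text, traded date)')
c.execute('insert into trades values (?, ?)',
          ('GOOG', datetime.date(2007, 6, 11)))
row = c.execute('select * from trades').fetchone()
# row[1] comes back as a datetime.date instance, not a string
print(type(row[1]))
```

Without detect_types, the same query would simply return the stored ISO date string.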
Another extremely critical complication concerns the formation of SQL
statement strings. You should never use Python string formatting
operators (e.g., %) or the .format() method to create such
strings. If the values provided to such formatting operators are
derived from user input, this opens up your program to an SQL-injection
attack (see http://xkcd.com/327). The special ? wildcard in
queries instructs the database backend to use its own string
substitution mechanism, which (hopefully) will do it safely.
Sadly, there is some inconsistency across database backends with respect to the
wildcard. Many modules use ? or %s, while others may use a different symbol, such
as :0 or :1, to refer to parameters. Again, you’ll have to consult the
documentation for the database module you’re using. The paramstyle
attribute of a database module also contains information about the quoting style.
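For instance, sqlite3 reports the qmark style through this attribute, and it additionally accepts named parameters of the form :name. A small sketch, reusing the portfolio table from above:

```python
import sqlite3

# sqlite3 uses ? placeholders
print(sqlite3.paramstyle)        # 'qmark'

db = sqlite3.connect(':memory:')
db.execute('create table portfolio (symbol text, shares integer, price real)')

# Named parameters are also supported, supplied via a dict
db.execute('insert into portfolio values (:symbol, :shares, :price)',
           {'symbol': 'GOOG', 'shares': 100, 'price': 490.1})
for row in db.execute('select * from portfolio where price >= :min',
                      {'min': 100}):
    print(row)                   # ('GOOG', 100, 490.1)
```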
For simply pulling data in and out of a database table, using the database API is usually simple enough. If you’re doing something more complicated, it may make sense to use a higher-level interface, such as that provided by an object-relational mapper. Libraries such as SQLAlchemy allow database tables to be described as Python classes and for database operations to be carried out while hiding most of the underlying SQL.
You need to decode a string of hexadecimal digits into a byte string or encode a byte string as hex.
If you simply need to decode or encode a raw string of hex digits, use
the binascii module. For example:
>>> # Initial byte string
>>> s = b'hello'

>>> # Encode as hex
>>> import binascii
>>> h = binascii.b2a_hex(s)
>>> h
b'68656c6c6f'

>>> # Decode back to bytes
>>> binascii.a2b_hex(h)
b'hello'
>>>
Similar functionality can also be found in the base64 module. For example:
>>> import base64
>>> h = base64.b16encode(s)
>>> h
b'68656C6C6F'
>>> base64.b16decode(h)
b'hello'
>>>
For the most part, converting to and from hex is straightforward using
the functions shown. The main difference between the two techniques
is in case folding. The base64.b16decode() and base64.b16encode()
functions only operate with uppercase hexadecimal letters, whereas the
functions in binascii work with either case.
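The difference is easy to demonstrate. b16decode() rejects lowercase digits unless its casefold flag is given, while binascii accepts either case (a small sketch):

```python
import base64
import binascii

h = b'68656c6c6f'   # lowercase hex encoding of b'hello'

# binascii doesn't care about case
print(binascii.a2b_hex(h))                   # b'hello'

# base64.b16decode() insists on uppercase by default...
try:
    base64.b16decode(h)
except binascii.Error as e:
    print('rejected:', e)

# ...unless casefold=True is supplied
print(base64.b16decode(h, casefold=True))    # b'hello'
```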
It’s also important to note that the output produced by the encoding functions is always a byte string. To coerce it to Unicode for output, you may need to add an extra decoding step. For example:
>>> h = base64.b16encode(s)
>>> print(h)
b'68656C6C6F'
>>> print(h.decode('ascii'))
68656C6C6F
>>>
When decoding hex digits, the b16decode() and a2b_hex() functions
accept either bytes or unicode strings. However, those strings must
only contain ASCII-encoded hexadecimal digits.
To encode and decode binary data using Base64, the base64 module has two functions—b64encode() and b64decode()—that do exactly what you want. For example:
>>> # Some byte data
>>> s = b'hello'
>>> import base64

>>> # Encode as Base64
>>> a = base64.b64encode(s)
>>> a
b'aGVsbG8='

>>> # Decode from Base64
>>> base64.b64decode(a)
b'hello'
>>>
Base64 encoding is only meant to be used on byte-oriented data such as byte strings and byte arrays. Moreover, the output of the encoding process is always a byte string. If you are mixing Base64-encoded data with Unicode text, you may have to perform an extra decoding step. For example:
>>> a = base64.b64encode(s).decode('ascii')
>>> a
'aGVsbG8='
>>>
When decoding Base64, both byte strings and Unicode text strings can be supplied. However, Unicode strings can only contain ASCII characters.
You want to read or write data encoded as a binary array of uniform structures into Python tuples.
To work with binary data, use the struct module.
Here is an example of code that writes a list of Python tuples out to
a binary file, encoding each tuple as a structure using struct:
from struct import Struct

def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

# Example
if __name__ == '__main__':
    records = [ (1, 2.3, 4.5),
                (6, 7.8, 9.0),
                (12, 13.4, 56.7) ]

    with open('data.b', 'wb') as f:
        write_records(records, '<idd', f)
There are several approaches for reading this file back into a list of tuples. First, if you’re going to read the file incrementally in chunks, you can write code such as this:
from struct import Struct

def read_records(format, f):
    record_struct = Struct(format)
    chunks = iter(lambda: f.read(record_struct.size), b'')
    return (record_struct.unpack(chunk) for chunk in chunks)

# Example
if __name__ == '__main__':
    with open('data.b', 'rb') as f:
        for rec in read_records('<idd', f):
            # Process rec
            ...
If you want to read the file entirely into a byte string with a single read and convert it piece by piece, you can write the following:
from struct import Struct

def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack_from(data, offset)
            for offset in range(0, len(data), record_struct.size))

# Example
if __name__ == '__main__':
    with open('data.b', 'rb') as f:
        data = f.read()

    for rec in unpack_records('<idd', data):
        # Process rec
        ...
In both cases, the result is an iterable that produces the tuples originally stored when the file was created.
For programs that must encode and decode binary data, it is
common to use the struct module. To declare a new
structure, simply create an instance of Struct such as:
# Little endian 32-bit integer, two double precision floats
record_struct = Struct('<idd')
Structures are always defined using a set of structure codes
such as i, d, f, and so forth [see
the Python documentation]. These codes
correspond to specific binary data types such as 32-bit integers,
64-bit floats, 32-bit floats, and so forth. The < as the first
character specifies the byte ordering; in this example, it
indicates "little endian." Change the character to > for
big endian or ! for network byte order.
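To see the effect of the byte-order character, compare the same integer packed both ways (a quick sketch):

```python
from struct import Struct

# The same 32-bit integer in little- and big-endian byte order
little = Struct('<i').pack(1)
big = Struct('>i').pack(1)
print(little)   # b'\x01\x00\x00\x00'
print(big)      # b'\x00\x00\x00\x01'

# '!' (network byte order) is the same as big endian
print(Struct('!i').pack(1) == big)   # True
```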
The resulting Struct instance has various attributes and methods for
manipulating structures of that type. The size attribute contains
the size of the structure in bytes, which is useful to have in I/O
operations. pack() and unpack() methods are used to pack and
unpack data. For example:
>>> from struct import Struct
>>> record_struct = Struct('<idd')
>>> record_struct.size
20
>>> record_struct.pack(1, 2.0, 3.0)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>>> record_struct.unpack(_)
(1, 2.0, 3.0)
>>>
Sometimes you’ll see the pack() and unpack() operations called as module-level
functions, as in the following:
>>> import struct
>>> struct.pack('<idd', 1, 2.0, 3.0)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>>> struct.unpack('<idd', _)
(1, 2.0, 3.0)
>>>
This works, but feels less elegant than creating a single Struct
instance—especially if the same structure appears in multiple places in
your code. By creating a Struct instance, the format code is only
specified once and all of the useful operations are grouped together
nicely. This certainly makes it easier to maintain your code if you
need to fiddle with the structure code (as you only have to change it
in one place).
The code for reading binary structures involves a number of interesting,
yet elegant programming idioms. In the read_records() function,
iter() is being used to make an iterator that returns fixed-sized
chunks. See Recipe 5.8. This iterator repeatedly calls a user-supplied callable (e.g.,
lambda: f.read(record_struct.size)) until it returns a specified
sentinel value (e.g., b''), at which point iteration stops. For example:
>>> f = open('data.b', 'rb')
>>> chunks = iter(lambda: f.read(20), b'')
>>> chunks
<callable_iterator object at 0x10069e6d0>
>>> for chk in chunks:
...     print(chk)
...
b'\x01\x00\x00\x00ffffff\x02@\x00\x00\x00\x00\x00\x00\x12@'
b'\x06\x00\x00\x00333333\x1f@\x00\x00\x00\x00\x00\x00"@'
b'\x0c\x00\x00\x00\xcd\xcc\xcc\xcc\xcc\xcc*@\x9a\x99\x99\x99\x99YL@'
>>>
One reason for creating an iterable is that it nicely allows records to be created using a generator comprehension, as shown in the solution. If you didn’t use this approach, the code might look like this:
def read_records(format, f):
    record_struct = Struct(format)
    while True:
        chk = f.read(record_struct.size)
        if chk == b'':
            break
        yield record_struct.unpack(chk)
In the unpack_records() function, a different approach using the
unpack_from() method is used. unpack_from() is a useful method
for extracting binary data from a larger binary array, because it does
so without making any temporary objects or memory copies. You just
give it a byte string (or any array) along with a byte offset, and
it will unpack fields directly from that location.
If you used unpack() instead of unpack_from(), you would need to
modify the code to make a lot of small slices and offset calculations. For example:
def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack(data[offset:offset + record_struct.size])
            for offset in range(0, len(data), record_struct.size))
In addition to being more complicated to read, this version also requires a
lot more work, as it performs various offset calculations, copies data,
and makes small slice objects. If you’re going to be unpacking a lot
of structures from a large byte string you’ve already read,
unpack_from() is a more elegant approach.
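Because unpack_from() takes an offset, it also works directly on a memoryview, so even wrapping the data incurs no copy. A small sketch using the '<idd' format from the solution:

```python
from struct import Struct

record_struct = Struct('<idd')

# Build three packed records back to back
data = b''.join(record_struct.pack(i, i + 0.5, i + 0.25) for i in range(3))

# Unpack the second record in place -- no slicing, no copies
mv = memoryview(data)
print(record_struct.unpack_from(mv, record_struct.size))   # (1, 1.5, 1.25)
```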
Unpacking records is one place where you might want to use namedtuple
objects from the collections module. This allows you to
set attribute names on the returned tuples. For example:
from collections import namedtuple

Record = namedtuple('Record', ['kind', 'x', 'y'])

with open('data.b', 'rb') as f:
    records = (Record(*r) for r in read_records('<idd', f))
    for r in records:
        print(r.kind, r.x, r.y)
If you’re writing a program that needs to work with a large amount of
binary data, you may be better off using a library such as numpy. For
example, instead of reading the binary data into a list of tuples, you could
read it into a structured array, like this:
>>> import numpy as np
>>> f = open('data.b', 'rb')
>>> records = np.fromfile(f, dtype='<i,<d,<d')
>>> records
array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
>>> records[0]
(1, 2.3, 4.5)
>>> records[1]
(6, 7.8, 9.0)
>>>
Last, but not least, if you’re faced with the task of reading binary data in some known file format (i.e., image formats, shape files, HDF5, etc.), check to see if a Python module already exists for it. There’s no reason to reinvent the wheel if you don’t have to.
You need to read complicated binary-encoded data that contains a collection of nested and/or variable-sized records. Such data might include images, video, shapefiles, and so on.
The struct module can be used to decode and encode almost any kind
of binary data structure. To illustrate the kind of data in
question here, suppose you have this Python data structure
representing a collection of points that make up a series of polygons:
polys = [
    [ (1.0, 2.5), (3.5, 4.0), (2.5, 1.5) ],
    [ (7.0, 1.2), (5.1, 3.0), (0.5, 7.5), (0.8, 9.0) ],
    [ (3.4, 6.3), (1.2, 0.5), (4.6, 9.2) ],
]
Now suppose this data was to be encoded into a binary file where the file started with the following header:
| Byte | Type   | Description                        |
| 0    | int    | File code (0x1234, little endian)  |
| 4    | double | Minimum x (little endian)          |
| 12   | double | Minimum y (little endian)          |
| 20   | double | Maximum x (little endian)          |
| 28   | double | Maximum y (little endian)          |
| 36   | int    | Number of polygons (little endian) |
Following the header, a series of polygon records follow, each encoded as follows:
| Byte | Type   | Description                              |
| 0    | int    | Record length including length (N bytes) |
| 4-N  | Points | Pairs of (X,Y) coords as doubles         |
To write this file, you can use Python code like this:
import struct
import itertools

def write_polys(filename, polys):
    # Determine bounding box
    flattened = list(itertools.chain(*polys))
    min_x = min(x for x, y in flattened)
    max_x = max(x for x, y in flattened)
    min_y = min(y for x, y in flattened)
    max_y = max(y for x, y in flattened)

    with open(filename, 'wb') as f:
        f.write(struct.pack('<iddddi',
                            0x1234,
                            min_x, min_y,
                            max_x, max_y,
                            len(polys)))

        for poly in polys:
            size = len(poly) * struct.calcsize('<dd')
            f.write(struct.pack('<i', size + 4))
            for pt in poly:
                f.write(struct.pack('<dd', *pt))

# Call it with our polygon data
write_polys('polys.bin', polys)
To read the resulting data back, you can write very similar looking
code using the struct.unpack() function, reversing the operations
performed during writing. For example:
import struct

def read_polys(filename):
    with open(filename, 'rb') as f:
        # Read the header
        header = f.read(40)
        file_code, min_x, min_y, max_x, max_y, num_polys = \
            struct.unpack('<iddddi', header)

        polys = []
        for n in range(num_polys):
            pbytes, = struct.unpack('<i', f.read(4))
            poly = []
            for m in range(pbytes // 16):
                pt = struct.unpack('<dd', f.read(16))
                poly.append(pt)
            polys.append(poly)
    return polys
Although this code works, it’s also a rather messy mix of small reads, struct unpacking, and other details. If code like this is used to process a real datafile, it can quickly become even messier. Thus, it’s an obvious candidate for an alternative solution that might simplify some of the steps and free the programmer to focus on more important matters.
In the remainder of this recipe, a rather advanced solution for interpreting binary data will be built up in pieces. The goal will be to allow a programmer to provide a high-level specification of the file format, and to simply have the details of reading and unpacking all of the data worked out under the covers. As a forewarning, the code that follows may be the most advanced example in this entire book, utilizing various object-oriented programming and metaprogramming techniques. Be sure to carefully read the discussion section as well as cross-references to other recipes.
First, when reading binary data, it is common for the file to contain
headers and other data structures. Although the struct module can
unpack this data into a tuple, another way to represent such
information is through the use of a class. Here’s some code
that allows just that:
import struct

class StructField:
    '''
    Descriptor representing a simple structure field
    '''
    def __init__(self, format, offset):
        self.format = format
        self.offset = offset

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            r = struct.unpack_from(self.format, instance._buffer, self.offset)
            return r[0] if len(r) == 1 else r

class Structure:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)
This code uses a descriptor to represent each structure field. Each
descriptor contains a struct-compatible format code along
with a byte offset into an underlying memory buffer. In the __get__() method,
the struct.unpack_from() function is used to unpack a value from the
buffer without having to make extra slices or copies.
The Structure class just serves as a base class that accepts some
byte data and stores it as the underlying memory buffer used by the
StructField descriptor. The use of a memoryview() in this class
serves a purpose that will become clear later.
Using this code, you can now define a structure as a high-level class that mirrors the information found in the tables that described the expected file format. For example:
class PolyHeader(Structure):
    file_code = StructField('<i', 0)
    min_x = StructField('<d', 4)
    min_y = StructField('<d', 12)
    max_x = StructField('<d', 20)
    max_y = StructField('<d', 28)
    num_polys = StructField('<i', 36)
Here is an example of using this class to read the header from the polygon data written earlier:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader(f.read(40))
>>> phead.file_code == 0x1234
True
>>> phead.min_x
0.5
>>> phead.min_y
0.5
>>> phead.max_x
7.0
>>> phead.max_y
9.2
>>> phead.num_polys
3
>>>
This is interesting, but there are a number of annoyances with this approach. For one, even though you get the convenience of a class-like
interface, the code is rather verbose and requires the user to specify
a lot of low-level detail (e.g., repeated uses of StructField,
specification of offsets, etc.). The resulting class is also missing
common conveniences such as providing a way to compute the total size
of the structure.
Any time you are faced with class definitions that are overly verbose
like this, you might consider the use of a class decorator or metaclass. One of the
features of a metaclass is that it can be used to fill in a lot of
low-level implementation details, taking that burden off of the
user. As an example, consider this metaclass and slight reformulation
of the Structure class:
class StructureMeta(type):
    '''
    Metaclass that automatically creates StructField descriptors
    '''
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if format.startswith(('<', '>', '!', '@')):
                byte_order = format[0]
                format = format[1:]
            format = byte_order + format
            setattr(self, fieldname, StructField(format, offset))
            offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)

class Structure(metaclass=StructureMeta):
    def __init__(self, bytedata):
        self._buffer = bytedata

    @classmethod
    def from_file(cls, f):
        return cls(f.read(cls.struct_size))
Using this new Structure class, you can now write a structure
definition like this:
class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        ('d', 'min_x'),
        ('d', 'min_y'),
        ('d', 'max_x'),
        ('d', 'max_y'),
        ('i', 'num_polys')
    ]
As you can see, the specification is a lot less verbose. The added
from_file() class method also makes it easier to read the
data from a file without knowing any details about the size or structure
of the data. For example:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.file_code == 0x1234
True
>>> phead.min_x
0.5
>>> phead.min_y
0.5
>>> phead.max_x
7.0
>>> phead.max_y
9.2
>>> phead.num_polys
3
>>>
Once you introduce a metaclass into the mix, you can build more intelligence into it. For example, suppose you want to support nested binary structures. Here’s a reformulation of the metaclass along with a new supporting descriptor that allows it:
class NestedStruct:
    '''
    Descriptor representing a nested structure
    '''
    def __init__(self, name, struct_type, offset):
        self.name = name
        self.struct_type = struct_type
        self.offset = offset

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            data = instance._buffer[self.offset:
                                    self.offset + self.struct_type.struct_size]
            result = self.struct_type(data)
            # Save resulting structure back on instance to avoid
            # further recomputation of this step
            setattr(instance, self.name, result)
            return result

class StructureMeta(type):
    '''
    Metaclass that automatically creates StructField descriptors
    '''
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if isinstance(format, StructureMeta):
                setattr(self, fieldname,
                        NestedStruct(fieldname, format, offset))
                offset += format.struct_size
            else:
                if format.startswith(('<', '>', '!', '@')):
                    byte_order = format[0]
                    format = format[1:]
                format = byte_order + format
                setattr(self, fieldname, StructField(format, offset))
                offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)
In this code, the NestedStruct descriptor is used to overlay another
structure definition over a region of memory. It does this by taking a
slice of the original memory buffer and using it to instantiate the
given structure type. Since the underlying memory buffer was
initialized as a memoryview, this slicing does not incur any extra
memory copies. Instead, it’s just an overlay on the original memory.
Moreover, to avoid repeated instantiations, the descriptor then stores the
resulting inner structure object on the instance using the same
technique described in Recipe 8.10.
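That caching technique relies on NestedStruct being a non-data descriptor (it defines only __get__()), so an entry in the instance dictionary takes precedence on later lookups. The same idea can be shown in isolation with a hypothetical Lazy descriptor (the class and attribute names here are invented for illustration):

```python
class Lazy:
    '''Non-data descriptor: the computed value replaces itself on the instance'''
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, instance, cls):
        if instance is None:
            return self
        value = self.func(instance)
        # Store the value in the instance dict, shadowing the descriptor
        setattr(instance, self.name, value)
        return value

class Circle:
    def __init__(self, radius):
        self.radius = radius

    @Lazy
    def area(self):
        print('computing area')
        return 3.14159 * self.radius ** 2

c = Circle(2.0)
print(c.area)   # computes, then caches
print(c.area)   # second access hits the cached instance attribute
```

After the first access, 'area' lives in c.__dict__ and the descriptor is never consulted again for that instance.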
Using this new formulation, you can start to write code like this:
class Point(Structure):
    _fields_ = [
        ('<d', 'x'),
        ('d', 'y')
    ]

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        (Point, 'min'),    # nested struct
        (Point, 'max'),    # nested struct
        ('i', 'num_polys')
    ]
Amazingly, it will all still work as you expect. For example:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.file_code == 0x1234
True
>>> phead.min   # Nested structure
<__main__.Point object at 0x1006a48d0>
>>> phead.min.x
0.5
>>> phead.min.y
0.5
>>> phead.max.x
7.0
>>> phead.max.y
9.2
>>> phead.num_polys
3
>>>
At this point, a framework for dealing with fixed-sized records has been developed, but what about the variable-sized components? For example, the remainder of the polygon files contain sections of variable size.
One way to handle this is to write a class that simply represents a chunk of binary data along with a utility function for interpreting the contents in different ways. This is closely related to the code in Recipe 6.11:
class SizedRecord:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)

    @classmethod
    def from_file(cls, f, size_fmt, includes_size=True):
        sz_nbytes = struct.calcsize(size_fmt)
        sz_bytes = f.read(sz_nbytes)
        sz, = struct.unpack(size_fmt, sz_bytes)
        buf = f.read(sz - includes_size * sz_nbytes)
        return cls(buf)

    def iter_as(self, code):
        if isinstance(code, str):
            s = struct.Struct(code)
            for off in range(0, len(self._buffer), s.size):
                yield s.unpack_from(self._buffer, off)
        elif isinstance(code, StructureMeta):
            size = code.struct_size
            for off in range(0, len(self._buffer), size):
                data = self._buffer[off:off + size]
                yield code(data)
The SizedRecord.from_file() class method is a utility for reading a
size-prefixed chunk of data from a file, which is common in many file
formats. As input, it accepts a structure format code containing the
encoding of the size, which is expected to be in bytes. The optional
includes_size argument specifies whether the number of bytes
includes the size header or not. Here’s an example of how you would
use this code to read the individual polygons in the polygon file:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.num_polys
3
>>> polydata = [ SizedRecord.from_file(f, '<i')
...              for n in range(phead.num_polys) ]
>>> polydata
[<__main__.SizedRecord object at 0x1006a4d50>, <__main__.SizedRecord object at 0x1006a4f50>, <__main__.SizedRecord object at 0x10070da90>]
>>>
As shown, the contents of the SizedRecord instances have not yet been
interpreted. To do that, use the iter_as() method, which accepts a structure format code or Structure class as input.
This gives you a lot of flexibility in how to interpret the data.
For example:
>>> for n, poly in enumerate(polydata):
...     print('Polygon', n)
...     for p in poly.iter_as('<dd'):
...         print(p)
...
Polygon 0
(1.0, 2.5)
(3.5, 4.0)
(2.5, 1.5)
Polygon 1
(7.0, 1.2)
(5.1, 3.0)
(0.5, 7.5)
(0.8, 9.0)
Polygon 2
(3.4, 6.3)
(1.2, 0.5)
(4.6, 9.2)
>>>

>>> for n, poly in enumerate(polydata):
...     print('Polygon', n)
...     for p in poly.iter_as(Point):
...         print(p.x, p.y)
...
Polygon 0
1.0 2.5
3.5 4.0
2.5 1.5
Polygon 1
7.0 1.2
5.1 3.0
0.5 7.5
0.8 9.0
Polygon 2
3.4 6.3
1.2 0.5
4.6 9.2
>>>
Putting all of this together, here’s an alternative formulation
of the read_polys() function:
class Point(Structure):
    _fields_ = [
        ('<d', 'x'),
        ('d', 'y')
    ]

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        (Point, 'min'),
        (Point, 'max'),
        ('i', 'num_polys')
    ]

def read_polys(filename):
    polys = []
    with open(filename, 'rb') as f:
        phead = PolyHeader.from_file(f)
        for n in range(phead.num_polys):
            rec = SizedRecord.from_file(f, '<i')
            poly = [ (p.x, p.y) for p in rec.iter_as(Point) ]
            polys.append(poly)
    return polys
This recipe provides a practical application of various advanced programming techniques, including descriptors, lazy evaluation, metaclasses, class variables, and memoryviews. However, they all serve a very specific purpose.
A major feature of the implementation is that it is strongly based
on the idea of lazy-unpacking. When an instance of Structure
is created, the __init__() merely creates a memoryview of the
supplied byte data and does nothing else. Specifically, no unpacking
or other structure-related operations take place at this time.
One motivation for taking this approach is that you might only be
interested in a few specific parts of a binary record. Rather
than unpacking the whole file, only the parts that are actually accessed
will be unpacked.
To implement the lazy unpacking and packing of values, the StructField
descriptor class is used. Each attribute the user lists in
_fields_ gets converted to a StructField descriptor that stores
the associated structure format code and byte offset into the stored
buffer. The StructureMeta metaclass is what creates these
descriptors automatically when various structure classes are defined.
The main reason for using a metaclass is to make it extremely easy for
a user to specify a structure format with a high-level description
without worrying about low-level details.
One subtle aspect of the StructureMeta metaclass is that it makes
byte order sticky. That is, if any attribute specifies a byte order
(< for little endian or > for big endian), that ordering is
applied to all fields that follow. This helps avoid extra typing,
but also makes it possible to switch in the middle of a definition.
For example, you might have something more complicated, such as this:
class ShapeFile(Structure):
    _fields_ = [ ('>i', 'file_code'),      # Big endian
                 ('20s', 'unused'),
                 ('i', 'file_length'),
                 ('<i', 'version'),        # Little endian
                 ('i', 'shape_type'),
                 ('d', 'min_x'),
                 ('d', 'min_y'),
                 ('d', 'max_x'),
                 ('d', 'max_y'),
                 ('d', 'min_z'),
                 ('d', 'max_z'),
                 ('d', 'min_m'),
                 ('d', 'max_m') ]
As noted, the use of a memoryview() in the solution serves
a useful role in avoiding memory copies. When structures
start to nest, memoryviews can be used to
overlay different parts of the structure definition
on the same region of memory. This aspect of the solution
is subtle, but it concerns the slicing behavior of a memoryview
versus a normal byte array. If you slice a byte string or byte array,
you usually get a copy of the data. Not so with a memoryview—slices
simply overlay the existing memory. Thus, this approach is
more efficient.
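The difference is easy to demonstrate: a memoryview slice still refers to the original buffer, whereas a bytes slice is an independent copy (a small sketch):

```python
data = bytearray(b'hello world')

copy_slice = bytes(data)[0:5]        # an independent copy
view_slice = memoryview(data)[0:5]   # an overlay on the same memory

data[0] = ord('j')                   # mutate the underlying buffer

print(copy_slice)             # b'hello' -- unaffected by the change
print(view_slice.tobytes())   # b'jello' -- sees the change
```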
A number of related recipes will help expand upon the topics used in
the solution. See Recipe 8.13 for a closely related recipe that
uses descriptors to build a type system. Recipe 8.10 has
information about lazily computed properties and is related to the
implementation of the NestedStruct descriptor. Recipe 9.19
has an example of using a metaclass to initialize class members, much
in the same manner as the StructureMeta class. The source
code for Python’s ctypes library may also be of interest, due to
its similar support for defining data structures, nesting of data
structures, and similar functionality.
You need to crunch through large datasets and generate summaries or other kinds of statistics.
For any kind of data analysis involving statistics, time series, and other related techniques, you should look at the Pandas library.
To give you a taste, here’s an example of using Pandas to analyze the City of Chicago rat and rodent database. At the time of this writing, it’s a CSV file with about 74,000 entries:
>>> import pandas

>>> # Read a CSV file, skipping last line
>>> rats = pandas.read_csv('rats.csv', skip_footer=1)
>>> rats
<class 'pandas.core.frame.DataFrame'>
Int64Index: 74055 entries, 0 to 74054
Data columns:
Creation Date                      74055  non-null values
Status                             74055  non-null values
Completion Date                    72154  non-null values
Service Request Number             74055  non-null values
Type of Service Request            74055  non-null values
Number of Premises Baited          65804  non-null values
Number of Premises with Garbage    65600  non-null values
Number of Premises with Rats       65752  non-null values
Current Activity                   66041  non-null values
Most Recent Action                 66023  non-null values
Street Address                     74055  non-null values
ZIP Code                           73584  non-null values
X Coordinate                       74043  non-null values
Y Coordinate                       74043  non-null values
Ward                               74044  non-null values
Police District                    74044  non-null values
Community Area                     74044  non-null values
Latitude                           74043  non-null values
Longitude                          74043  non-null values
Location                           74043  non-null values
dtypes: float64(11), object(9)

>>> # Investigate range of values for a certain field
>>> rats['Current Activity'].unique()
array([nan, Dispatch Crew, Request Sanitation Inspector], dtype=object)

>>> # Filter the data
>>> crew_dispatched = rats[rats['Current Activity'] == 'Dispatch Crew']
>>> len(crew_dispatched)
65676

>>> # Find 10 most rat-infested ZIP codes in Chicago
>>> crew_dispatched['ZIP Code'].value_counts()[:10]
60647    3837
60618    3530
60614    3284
60629    3251
60636    2801
60657    2465
60641    2238
60609    2206
60651    2152
60632    2071

>>> # Group by completion date
>>> dates = crew_dispatched.groupby('Completion Date')
>>> dates
<pandas.core.groupby.DataFrameGroupBy object at 0x10d0a2a10>
>>> len(dates)
472

>>> # Determine counts on each day
>>> date_counts = dates.size()
>>> date_counts[0:10]
Completion Date
01/03/2011      4
01/03/2012    125
01/04/2011     54
01/04/2012     38
01/05/2011     78
01/05/2012    100
01/06/2011    100
01/06/2012     58
01/07/2011      1
01/09/2012     12

>>> # Sort the counts
>>> date_counts.sort()
>>> date_counts[-10:]
Completion Date
10/12/2012    313
10/21/2011    314
09/20/2011    316
10/26/2011    319
02/22/2011    325
10/26/2012    333
03/17/2011    336
10/13/2011    378
10/14/2011    391
10/07/2011    457
>>>
Yes, October 7, 2011, was indeed a very busy day for rats.
Pandas is a large library that has more features than can be described here. However, if you need to analyze large datasets, group data, perform statistics, or other similar tasks, it’s definitely worth a look.
Python for Data Analysis by Wes McKinney (O’Reilly) also contains much more information.