Introducing templatemaker | Holovaty.com
I've just released templatemaker, which is something I've been hacking on and off (mostly off) the past couple of months. It's a Python library for extracting data from similarly formatted text strings.
What the heck does that mean?
Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the "template" that was used to create those files. ("Template," in this case, means a string with a number of "holes" in it, where the holes represent the parts of the page that change.) Once you've got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: "The value for hole 1 is 'July 6, 2007', the value for hole 2 is 'blue'," etc.

No comment on this link yet.