XML::Twig is a Perl module used to process XML documents efficiently.
Twig offers a tree-oriented interface to a document while still allowing the processing of documents of any size. I think the current buzzword for it would be something like "push-pull" processing ;--)
When I was younger I wanted to grow up and write a tool that would allow people to process text the way they wanted, offering tons of features, various ways to achieve the same result, not forcing them into any processing model but allowing them to use the one they felt the most comfortable with. Eventually I grew up and I realized a guy named Larry Wall had already written a language named Perl... Darn! So as I was quite involved in dealing with SGML, then XML documents, I decided to settle for the next best thing: writing a module that would allow people to process XML the way they wanted, offering them tons of features, various ways... you get the point.
So I wrote XML::Twig. XML::Twig gives you a tree interface to XML documents... if you want. It also lets you dump parts of the tree, set callbacks during processing, both on tags and on subtrees, process only part of the tree... you name it. The only thing XML::Twig does not do is follow standards (except XML of course). Consider yourself warned!
This talk is aimed at programmers who want to process XML data with the XML::Twig module.
It will go from the basic functionality of the module to its most advanced uses, offering numerous code examples, from HTML conversion to database integration.
XML::Twig is a Perl module offering a push-pull processing model of XML data. In other words it lets you build a tree from an XML document, while letting you output the results of your processing as it's built. But more on that later...
This tutorial is available in XML (yapc_xmltwig.xml), converted to html using the talk2html script (which uses XML::Twig).
The latest version of the XML::Twig tutorial can be found on the XML::Twig page.
Prior knowledge of Perl, especially its object-oriented aspects and regular expressions, will probably help the reader. Familiarity with the DBI module wouldn't hurt either, but the examples are simple and detailed enough to offer a first introduction to database processing using Perl.
Very little prior knowledge of XML is assumed, although a selection of related links is offered and would be of interest to the complete beginner.
Of course other ways of processing XML documents exist, both using Perl and other languages, especially Java and Python.
You can find information on Perl modules in the Perl-XML FAQ; for a list of Python XML resources see Python and XML Processing, and for a list of Java XML resources see Java (TM) Technology and XML.
XML could be described as "HTML on steroids". Or conversely as "SGML on Prozac".
XML is a markup language, just like HTML, using the same basic syntax: pointy brackets, attributes... just slightly more dictatorial than HTML: tags MUST be closed, attributes MUST be enclosed in quotes, either single or double.
In fact it is just a little more than comma separated files, apart from the fact that fields are somewhat documented (by the element name and by attributes) and that they can be nested, thus defining a tree structure instead of a table.
What XML brings is syntactic consistency, allowing the same tools to be used to process all XML files, and a host of associated standards to do formatting, transformation, linking...
XML complexity stems from 2 main facts:
A simple example would be:
The best resource on XML, and SGML by the way, is certainly Robin Cover's SGML/XML Web Page, which links to everything else anyway. XML.com and xmlhack are 2 good sites respectively for detailed articles on XML and for the latest news on the topic.
Just a word on the XML I use in this tutorial.
XML is usually used for 2 purposes these days: either purely to store data, to be exchanged between 2 pieces of software, or to store documents, possibly including data, that are destined to be printed or displayed on the web.
Data-oriented XML should be tagged according to a DTD that faithfully represents the data; we will see examples of that in the section about database integration.
For document-oriented XML, after using SGML then XML for nearly 8 years, in all sorts of flavors and according to all sorts of DTD's, I have become a firm believer in what I'd call "HTML++". By this I mean that as much as possible of the HTML DTD should be used for text. There is really no need to redefine paragraphs, lists, code, headers etc... Structuring elements can be added, such as sections, possibly typed ones: that's one +. Specific inline elements, for domain relevant data, such as part numbers and prices in a catalog, standard references in a standard, etc... constitute the second +. Links can either use the familiar <a> tag or use different tags, possibly typed.
XMLnews is a good example of such a DTD.
Starting from the XHTML DTD and adding the extra elements is definitely the easiest way to create that kind of DTD.
Although I did not use a DTD for this tutorial it would look like:
html_stuff is just the usual html content, plus a couple of elements:
XML::Parser, first developed by Larry Wall and now supported by Clark Cooper, is the basis of most other XML modules. It includes a non-validating parser, Expat, written by James Clark, who amongst other feats also wrote the nsgmls parser for SGML.
XML::Parser allows calling software to set handlers on parsing events. Those events include start tags (and XML::Parser gives the name of the tag and the attributes), end tags, text, processing instructions etc...
XML::Twig is a sub-class of XML::Parser that allows higher level processing of XML. XML::Twig offers a tree interface to a document, both once the document has been completely parsed and during the parsing, by allowing handlers to be defined on elements. Additional methods help manage the resources needed by XML::Twig.
A whole bunch of methods can be used on elements in the twig, to navigate it, transform it, create new elements...
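As a rough illustration of that tree interface, here is a minimal sketch that parses a small string and walks the resulting twig. The sample data is hypothetical (it only borrows the stats, player, name and ppg element names used later in this tutorial); the methods used are new, parse, root, children, first_child and text.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, loosely modelled on the stats document used later.
my $xml = '<stats>'
        . '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>'
        . '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>'
        . '</stats>';

my $twig = XML::Twig->new;
$twig->parse($xml);               # build the whole tree in memory

my $root  = $twig->root;          # the top-level element of the document
my @names = map { $_->first_child('name')->text } $root->children('player');

print "$_\n" for @names;
```

Nothing here is specific to big documents yet: the whole tree is built before any of it is used.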
XML::Twig is only one of the dozen or so Perl modules that process XML. Other popular ones are XML::DOM, XML::Simple, XML::PYX, XML::Grove or just plain vanilla XML::Parser.
So why would you use XML::Twig?
XML::Twig uses a tree-based processing model, lets you control how much of the tree is loaded in memory at once, and is very perlish, up to TIMTOWTDI and DWIM.
Now let's see our first code example. The purpose of this one is to reorder a list of elements on the value of an attribute.
The DTD is quite simple:
And the data is:
The complete xml data.
The script is
Note how we get the root of the twig using the
The
The
Another example, in which we will create new elements: our statistics include the total number of blocks for each player, but in order to find out the best blocker in our selection we want the number of blocks per game, and we want to store it in the document (conveniently the DTD allows for an optional blg element).
Here is the
The
You can omit first_child and just write $elt->paste( $ref). What you can't do is paste an element that already belongs to a document: that will cause a fatal error.
An important feature of the paste method is that it is called on the element being pasted: $child->paste( $parent) and not the other way around.
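The calling convention can be sketched as follows. This is not the script from the tutorial: the g (games) and blk (blocks) element names and their values are assumptions made up for the illustration; only blg comes from the text above.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
# g and blk are hypothetical element names for games played and total blocks.
$twig->parse('<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>');

my $player = $twig->root;
my $blg    = sprintf '%.2f',
             $player->first_child('blk')->text / $player->first_child('g')->text;

# Create a brand new element (it belongs to no document yet)...
my $elt = XML::Twig::Elt->new( blg => $blg);

# ...and paste it: the method is called on the child, the parent is the argument.
$elt->paste( last_child => $player);

my $result = $player->sprint;
print "$result\n";
```

An element taken from an existing tree would need to be detached with cut before being pasted somewhere else.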
Note that the output is now generated by the
Another way to accomplish the same task, a more "twig-ish" way, would be to set a handler on the player element. A handler is attached to an element name through the twig_handlers option when the twig is created. The subroutine will be called every time an element with that name has been completely parsed. It is called with 2 parameters: the twig itself and the element.
Note that the handler is called as soon as the element is completely parsed. That means that the handler will be called when the end tag for that element is parsed. A somewhat surprising consequence of that is that if you set twig handlers on nested elements, the handlers on the inner elements will be called before the handlers on the outer elements.
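The calling order is easy to check with a small sketch that just records which handler fires when. The document and the @order array are made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @order;   # records the name of each element as its handler is called

my $twig = XML::Twig->new(
    twig_handlers => {
        # each handler receives the twig and the freshly parsed element
        player => sub { my ($t, $elt) = @_; push @order, $elt->gi; },
        name   => sub { my ($t, $elt) = @_; push @order, $elt->gi; },
    },
);
$twig->parse('<stats><player><name>Houston, Allan</name></player></stats>');

print join( ' then ', @order), "\n";
```

The name handler runs first because its end tag is seen before the end tag of the enclosing player.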
Here is the
This is basically similar to the previous example, except the interesting code is in the handler instead of being in the loop. It gets more interesting in the next section though...
Now in the previous examples the whole document was being loaded, then printed. This is not very memory efficient, especially as once a player has been updated it is never used again.
Hence the use of the
Here is the
Still very similar to the previous example, except that instead of printing
the whole twig at the end of the processing the calls to
Note: as of XML::Twig 3.23, there is no longer any need to call flush one last time after the document is completely parsed. If the document was flushed, then it will be "auto-flushed" (to the same filehandle used for the first flush) after the parse.
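A minimal sketch of a flushing handler, under a few assumptions: the g and blk element names and all the numbers are hypothetical, and the output is sent to an in-memory filehandle (rather than STDOUT) only so that it can be inspected afterwards.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical data: g is games played, blk is total blocks.
my $xml = '<stats>'
        . '<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>'
        . '<player><name>Sprewell, Latrell</name><g>77</g><blk>12</blk></player>'
        . '</stats>';

# Collect the flushed output in a string through an in-memory filehandle.
my $out = '';
open my $fh, '>', \$out or die "cannot open in-memory filehandle: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub {
            my ($t, $player) = @_;
            my $blg = sprintf '%.2f',
                      $player->first_child('blk')->text / $player->first_child('g')->text;
            XML::Twig::Elt->new( blg => $blg)->paste( last_child => $player);
            $t->flush($fh);   # output everything parsed so far, then free it
        },
    },
);
$twig->parse($xml);
close $fh;

print "$out\n";
```

At any given time only one player sub-tree is held in memory, whatever the size of the document.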
The flush method is useful if you want to output the modified document. But you might not always want that. Suppose you just want to output the leader in a category:
Here is the
Very simple, yet very memory efficient. You still get the advantage of local tree-processing, having access to the whole player sub-tree, while not having to pay the price of loading the whole document in memory.
But wait! There's more...
Actually in the previous example we build the complete twig for each player element, even though we are really only interested in the name and one of the sub-elements. It's OK as the XML file we are working on is not too big, but it can be a problem, both in terms of speed and memory, for bigger files. Fortunately XML::Twig offers a way to build the twig only for those elements we are interested in.
The twig_roots option, set when the twig is created, gives a list (well, actually a hash) of elements for which the twig will be built. Other elements will be ignored. The result is a twig that includes the root of the document (we need a root for the tree in any case) and the twig_roots elements as children of that root. For each element in the twig_roots list the whole sub-tree is built.
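Here is a small sketch of the option, not the tutorial's own script. The data is hypothetical apart from the names and ppg values quoted in this tutorial, and the g and blk elements stand for "other stats" that should never make it into the twig.

```perl
use strict;
use warnings;
use XML::Twig;

# g and blk are hypothetical extra stats that we do not care about here.
my $xml = '<stats>'
        . '<player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><blk>14</blk></player>'
        . '<player><name>Sprewell, Latrell</name><g>77</g><ppg>19.2</ppg><blk>12</blk></player>'
        . '</stats>';

# Only name and ppg elements (and the document root) are built.
my $twig = XML::Twig->new( twig_roots => { name => 1, ppg => 1 } );
$twig->parse($xml);

my $root  = $twig->root;
my @names = map { $_->text } $root->descendants('name');
my @ppg   = map { $_->text } $root->descendants('ppg');
my @blk   = $root->descendants('blk');      # nothing was built for these

# names and ppg values come out in document order, so they pair up by index
my ($leader, $best) = ('', -1);
for my $i (0 .. $#names) {
    ($leader, $best) = ($names[$i], $ppg[$i]) if $ppg[$i] > $best;
}
print "$leader leads with $best ppg\n";
```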
Here is the
The virtual twig built (looking for the leader in ppg) is <stats><name>Houston, Allan</name><ppg>20.1</ppg><name>Sprewell, Latrell</name><ppg>19.2</ppg>...</stats>. The script doesn't spend memory storing useless information on other stats, nor time building the twig for those stats.
Now suppose all we want to do is remove a statistical category from the document. Ideally we would like to build as little of the twig as possible, using the twig_roots option, but we also want most of the document to be output as-is. twig_print_outside_roots to the rescue! By setting that option when we create the twig, anything outside of the twig_roots elements will simply be printed.
Here is the
Note the use of the
And of course, as There's More Than One Way To Do It, here is a real short script that does the same thing, just in a more lazy way (and actually a slightly faster but more memory intensive one).
The
Figuring out how it works is left as an exercise for the reader (hint: twig_print_outside_roots does just what its name suggests, no more).
Now with what we've learned so far we are just a couple of additional tricks away from building a simple "HTML+" converter. The + here means that we can include additional inline elements to an HTML document. Provided of course that HTML document is a valid XML instance (and I admit this can be hard to achieve).
So here is the
We use 3 new methods here:
Also note the neat trick (thanks to Clark Cooper for this one) that consists of setting the handler as a sub that just adds an extra parameter to the usual ones: sub { make(@_, 'tt') }.
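The trick can be sketched like this. The body of make below is a guess at the simplest useful version (it only renames the element), and the method and option element names are hypothetical inline elements, not necessarily the ones used in the tutorial's script.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: the third argument is the HTML tag to convert to.
sub make {
    my ($t, $elt, $tag) = @_;
    $elt->set_gi($tag);          # change the element name, keep its content
}

my $twig = XML::Twig->new(
    twig_handlers => {
        # each anonymous sub forwards the usual (twig, element) pair plus a tag
        method => sub { make(@_, 'tt') },
        option => sub { make(@_, 'b')  },
    },
);
$twig->parse('<p>Call <method>flush</method> with <option>PrettyPrint</option></p>');

my $html = $twig->root->sprint;
print "$html\n";
```

One generic subroutine then serves any number of inline elements, each handler differing only by the extra argument.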
An additional option is to set handlers not for elements but for elements in a given context: instead of giving just the gi of the element you can use an XPath-like expression in the twig_handlers (as well as in the twig_roots) argument.
A valid path can be of the form /root/elt1/elt2 for a complete path to the element, or elt1/elt2 for a partial path.
Note that this path is given in the original document, not in the current twig.
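A short sketch of context-dependent handlers using partial paths. The doc, section and title names are assumptions for the example, as is the choice of h1 and h2 as targets.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # same element name, different handler depending on the parent
        'doc/title'     => sub { $_[1]->set_gi('h1') },
        'section/title' => sub { $_[1]->set_gi('h2') },
    },
);
$twig->parse(
    '<doc><title>Intro</title><section><title>Details</title><p>text</p></section></doc>'
);

my $converted = $twig->root->sprint;
print "$converted\n";
```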
So if we want to convert the simple document we saw in the XML examples we would write the conversion as in
When we process the doc element the title has already been processed, so we have to look for a h1 child.
We also use two new methods here:
We now have all the tools we need to build documents that include data straight out of relational databases. The only decision we have to make is how to design our documents, and our DTD's: are we going to include entire tables or single values, and how?
Here are some simple examples of what can be done:
For this first example we will include a whole table in the document.
The document we use is
The code in
This code can also be used to process slightly trickier queries, as in
Depending on how generic, and how convenient to write we want the queries to be, several options are possible. Here are a couple:
The first document is
It can be processed using the
A shorter but less generic way would be a document like
It can be processed using the
We are now going to fill a relational table from an XML file, which could come from another, incompatible, database for example.
The XML file looks like this:
The script to load the table:
Now let's see some other features of XML::Twig, beyond the basic examples.
Sometimes all we need is to extract or update part of the document. In this case there is no reason to bother with building the twig for the rest of the
document. We just want to be done with it and exit or go through the rest of the document and just output it. That's what the
So here is
Probably more interesting is
For some applications, especially when the whole document is loaded in memory, it can be very convenient to get direct access to elements through an
ID attribute. XML::Twig provides such a feature. By default if an element has
an attribute named id then a hash id => element is created. This
hash can be accessed through the
The name of the ID attribute can be changed when the twig is created by using the
The id attribute can still be accessed through the
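A minimal sketch of ID-based access, assuming the default attribute name id and using the elt_id method of the twig. The document is made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p id="p1">one</p><p id="p2">two</p></doc>');

# Direct access through the id => element hash, no navigation needed.
my $second = $twig->elt_id('p2');
my $text   = $second->text;

print "p2 contains: $text\n";
```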
XML::Twig also offers methods to compare the order of 2 elements in the document.
Although the next_sibling and first_child methods are
often the most convenient way to navigate there are some cases where another
method is easier to use: the
The next_elt of an element is the first element opened after the open tag of the element. This is either the first child of the element, or its next sibling, or the next sibling of one of its ancestors. Note that as usual PCDATA is considered an element.
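The traversal order can be observed with a small loop that follows next_elt from the root until it runs out of elements. The document is a made-up example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><sec><p>one</p></sec><p>two</p></doc>');

# Follow next_elt from the root and record the name of each element met.
my @seen;
my $elt = $twig->root;
while ($elt) {
    push @seen, $elt->gi;        # text shows up as a #PCDATA element
    $elt = $elt->next_elt;
}

print join( ' ', @seen), "\n";
```

The loop visits elements in the order of their opening tags, which is what makes next_elt convenient for a complete document-order walk.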
This method has 2 forms:
By popular demand I have included a number of pretty printing options, both for documents and for data.
The useful options to pretty print a document are:
The NOT SAFE options can produce invalid XML (that would not conform to the original DTD) in some cases. I have included them anyway because it rarely happens with simple DTDs and they look good!
The
The output is
To pretty print tables 2 options can be used (besides the faithful none):
The
The output is
These options can be set either when creating the twig, using the PrettyPrint
option, by using the PrettyPrint option in the print method (on a twig or on an
element) or by using the
or "I hope you don't need those"
Sometimes you might want to just change a tag name, or store some attributes,
BEFORE the whole tree for the element is built. This is often the case when you need to
In that case you can use the StartTagHandlers option when you create the twig, which will call a handler when the start tag of the element is found. The arguments passed to the handler will be the twig and the element. The element will be empty at that point but the attributes will be there.
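A sketch of a start tag handler follows. It uses the lowercase spelling start_tag_handlers, which I assume to be the form accepted by current releases of the module; the section element and its type attribute are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;   # what the handler could observe at start-tag time

my $twig = XML::Twig->new(
    start_tag_handlers => {
        section => sub {
            my ($t, $elt) = @_;
            # the attributes are available, the content is not built yet
            my $state = $elt->first_child ? 'has children' : 'empty';
            push @seen, $elt->att('type') . ":$state";
        },
    },
);
$twig->parse('<doc><section type="intro"><p>text</p></section></doc>');

print "@seen\n";
```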
The other new feature used in this script is the _all_ keyword in the twig_handlers option. This calls the handler (which in this case just flushes the twig) for every single element in the document. Another keyword, _default_ calls a handler for each element that does not have a handler. _all_ and _default_ can be used both with StartTagHandlers and with the twig_handlers option.
Sometimes, especially when converting an XML file to several HTML ones it
is convenient to purge the twig only up to the next-to-last sibling, not
up to the current one. Hence the
Here is an example of how to use them to list the difference in a given
stat between 2 consecutive players.
I just thought I'd mention, because I think it's cool, that you can overload
the comparison operators to use the
So just insert these lines in your script:
\&cmp, 'lt' => \&lt, 'le' => \&le, 'gt' => \&gt, 'ge' => \&ge, '+=' => \&suffix, '-=' => \&prefix, '>>' => \&suffix, '<<' => \&prefix, fallback => 1, ;
Then you will be able to write if( $elt1 le $elt2) { print "$elt1 is before $elt2\n"; }. As an added bonus you get 2 new ways to prefix or suffix an element:
$elt += "suffix"; $elt -= "prefix"; $elt << "prefix"; $elt >> "suffix";
This is just syntactic sugar, and IMHO pretty useless (hence it is not included in the module), plus it slows the module down by a good 30%. It's cute though, and if you don't care about speed and need to do a lot of comparisons of elements it can be handy.
Now let's have a look under the hood at some of the things that go on in XML::Twig from a developer's standpoint.
I think one of the most interesting features of XML::Twig is the optimization step that takes place when the module is installed.
The module is written in pure OO style, with accessors for every field of the objects, even inside the module. But as we all know method calls are expensive. So an optimization pass replaces method calls by hash accesses where possible.
For example $elt->parent is replaced by $elt->{parent} and $elt->set_parent( $parent) is replaced by $elt->{parent}= $parent.
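To give an idea of what such a pass involves, here is a toy sketch based on two regular expressions. It is not the tool shipped with the module: the inline_accessors name and the list of fields are assumptions, and a real pass would have to be much more careful about what it rewrites.

```perl
use strict;
use warnings;

# Hypothetical list of fields that are plain hash entries in the element object.
my $fields = join '|', qw(parent first_child next_sibling);

# Rewrite accessor calls found in a string of Perl source into hash accesses.
sub inline_accessors {
    my ($code) = @_;
    # setters first: $x->set_parent( $y) becomes $x->{parent}= $y
    $code =~ s/(\$\w+)->set_($fields)\(\s*(.+?)\s*\)/$1 . '->{' . $2 . '}= ' . $3/ge;
    # getters: $x->parent becomes $x->{parent} (calls with arguments are left alone)
    $code =~ s/(\$\w+)->($fields)\b(?!\s*\()/$1 . '->{' . $2 . '}'/ge;
    return $code;
}

print inline_accessors('my $p= $elt->parent; $elt->set_parent( $new);'), "\n";
```

The rewritten source behaves the same as long as the accessors do nothing more than read or write the hash entry, which is exactly the assumption the optimization relies on.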
The
The result is an improvement of about 30% of the speed of the module.
Speedup could also be used to... speed up a production script, with the caveat that as the XML::Twig implementation changes it might be necessary to re-run the tool with new versions of the module.
A minor optimization in XML::Twig is that element names, which are stored as hash values, are replaced by an index into an array holding all names.
Not all attempts at optimizing XML::Twig succeeded, so I think it might be useful for me to share at least my biggest failure in this area...
Twig elements are stored in hashes, one element per hash. In order to reduce the potential overhead of all too much memory being allocated for each one of them I tried to store elements in global arrays, each array storing one field for all the elements: instead of the parent of an element being stored in $elt->{parent} it was stored in $parent[$elt], $elt being a blessed scalar.
It did not work.
The twig was just as big and slower to access than the original version.
Oh well... there goes 2 days of work...
(c) 2000 Michel Rodriguez
This tutorial is free
documentation. It can be redistributed and modified under the same terms as
perl itself