Processing XML efficiently with Perl and XML::Twig

Michel Rodriguez <mirod@xmltwig.com> 2003-03-31

Introduction

XML::Twig is a Perl module used to process XML documents efficiently.

Twig offers a tree-oriented interface to a document while still allowing the processing of documents of any size. I think the current buzzword for it would be something like "push-pull" processing ;--)

When I was younger I wanted to grow up and write a tool that would allow people to process text the way they wanted, offering tons of features, various ways to achieve the same result, not forcing them into any processing model but allowing them to use the one they felt the most comfortable with. Eventually I grew up and I realized a guy named Larry Wall had already written a language named Perl... Darn! So as I was quite involved in dealing with SGML, then XML documents, I decided to settle for the next best thing: writing a module that would allow people to process XML the way they wanted, offering them tons of features, various ways... you get the point.

So I wrote XML::Twig. XML::Twig gives you a tree interface to XML documents... if you want. It also lets you dump parts of the tree, set callbacks during processing, both on tags and on subtrees, process only part of the tree... you name it. The only thing XML::Twig does not do is follow standards (except XML of course). Consider yourself warned!

This talk is aimed at programmers who want to process XML data with the XML::Twig module.

It will go from the basic functionalities of the module to its most advanced uses, offering numerous examples of code, from HTML conversion to database integration.

XML::Twig is a Perl module offering a push-pull processing model of XML data. In other words it lets you build a tree from an XML document, while letting you output the results of your processing as it's built. But more on that later...

This tutorial is available in XML (yapc_xmltwig.xml), converted to html using the talk2html script (which uses XML::Twig).

The latest version of the XML::Twig tutorial can be found on the XML::Twig page.

Knowledge

Prior knowledge of Perl, especially its object-oriented aspects and regular expressions will probably help the reader. Familiarity with the DBI module wouldn't hurt either, but the examples are simple and detailed enough to offer a first introduction to data base processing using Perl.

Very little prior knowledge of XML is assumed, although a selection of related links is offered and would be of interest to the complete beginner.

Alternatives to XML::Twig

Of course other ways of processing XML documents exist, both using Perl and other languages, especially Java and Python.

You can find information on Perl modules on the Perl-XML FAQ, for a list of Python XML resources see Python and XML Processing and for a list of Java XML resources see Java (TM) Technology and XML.

Introduction to XML

What is XML

XML could be described as "HTML on steroids". Or conversely as "SGML on Prozac".

XML is a markup language, just like HTML, using the same basic syntax: pointy brackets, attributes... just slightly more dictatorial than HTML: tags MUST be closed, attributes MUST be enclosed in quotes, either single or double.

In fact it is just a little more than comma separated files, apart from the fact that fields are somewhat documented (by the element name and by attributes) and that they can be nested, thus defining a tree structure instead of a table.

What XML brings is syntactic coherence, allowing the same tools to be used to process all XML files, and a host of associated standards to do formatting, transformation, linking...

XML complexity stems from 2 main facts:

in order to "unleash the power of XML" you have to design the "right" XML for your system, through DTD's (and soon schemas),
the associated standards, such as CSS, XSL, DOM, XSLT, XPath, XLink, XInclude: you often need them to do anything useful with XML, but their mere number is quite overwhelming.

XML example

A simple example would be: simple_doc.xml.

Resources

The best resource on XML, and SGML by the way, is certainly Robin Cover's SGML/XML Web Page, which links to everything else anyway. XML.com and xmlhack are 2 good sites respectively for detailed articles on XML and for the latest news on the topic.

XML used in this tutorial

Just a word on the XML I use in this tutorial.

XML is usually used for 2 purposes these days: either purely to store data, to be exchanged between 2 pieces of software, or to store documents, possibly including data, that are destined to be printed or displayed on the web.

Data oriented XML

Data-oriented XML should be tagged according to a DTD that represents faithfully the data, we will see examples of that in the section about data base integration.

Document oriented XML

For document-oriented XML, after using SGML then XML for nearly 8 years, in all sorts of flavors and according to all sorts of DTD's, I have become a firm believer in what I'd call "HTML++". By this I mean that as much as possible of the HTML DTD should be used for text. There is really no need to redefine paragraphs, lists, code, headers etc... Structuring elements can be added, such as sections, possibly typed ones: that's one +. Specific inline elements for domain-relevant data, such as part numbers and prices in a catalog, standard references in a standard, etc... constitute the second +. Links can either use the familiar <a> tag or use different tags, possibly typed.

XMLnews is a good example of such a DTD.

Starting from the XHTML DTD and adding the extra elements is definitely the easiest way to create that kind of DTD.

Although I did not use a DTD for this tutorial it would look like:


  
  [DTD listing missing from this copy]
  html_stuff is just the usual html content, plus a couple of elements.

Introduction to XML::Twig

XML::Parser

XML::Parser, first developed by Larry Wall and now supported by Clark Cooper, is the basis of most other XML modules. It includes a non-validating parser, Expat, written by James Clark, who amongst other feats also wrote the nsgmls parser for SGML.

XML::Parser allows calling software to set handlers on parsing events. Those events include start tags (and XML::Parser gives the name of the tag and the attributes), end tags, text, processing instructions etc...
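To make the event model concrete, here is a minimal sketch of XML::Parser handlers. The sample document and the @events array are made up for the illustration; only the Start, End and Char handler types are used.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;   # collected here so the order of the events is easy to inspect

my $parser = XML::Parser->new(
    Handlers => {
        # start tag: the expat object, the tag name, then attribute name/value pairs
        Start => sub { my ($expat, $tag, %atts) = @_; push @events, "start $tag" },
        # end tag: the expat object and the tag name
        End   => sub { my ($expat, $tag) = @_;        push @events, "end $tag" },
        # text: a single text node may arrive in several chunks
        Char  => sub { my ($expat, $text) = @_;
                       push @events, "text" if $text =~ /\S/ },
    },
);

$parser->parse('<doc id="d1"><p>hello</p></doc>');
print "$_\n" for @events;
```

Everything beyond reporting events (keeping track of the context, building a structure) is left to the calling code, which is the gap XML::Twig fills.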

XML::Twig

XML::Twig is a sub-class of XML::Parser that allows higher level processing of XML. XML::Twig offers a tree interface to a document, both once the document has been completely parsed and during the parsing by allowing handlers to be defined on elements. Additional methods help managing the resources needed by XML::Twig.

A whole bunch of methods can be used on elements in the twig, to navigate it, transform it, create new elements...

Why use XML::Twig

XML::Twig is only one of the dozen or so Perl modules that process XML. Other popular ones are XML::DOM, XML::Simple, XML::PYX, XML::Grove or just plain vanilla XML::Parser.

So why would you use XML::Twig?

you need to process huge documents efficiently,
PYX is not quite powerful enough,
the XML data is too complex for XML::Simple to handle,
the processing is hard to write in XML::Parser,
the document is too big to load conveniently in XML::DOM,
XSLT is a pain to write.

XML::Twig uses a tree-based processing model, you can control how much of the tree you want to load at once in memory and it is very perlish, up to TIMTOWTDI and DWIM.

First Examples

Full-tree mode

Creating and navigating the twig

Now let's see our first code example. The purpose of this one is to reorder a list of elements on the value of an attribute.

The DTD is quite simple: stats.dtd

And the data is:



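  <stats>
  <player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><rpg>3.4</rpg><apg>2.8</apg><blk>14</blk></player>
  <player><name>Sprewell, Latrell</name><g>69</g><ppg>19.2</ppg><rpg>4.5</rpg><apg>4.0</apg><blk>15</blk></player>
  <player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg><rpg>10.0</rpg><apg>1.0</apg><blk>68</blk></player>
  </stats>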

The complete xml data.

The script is ex1_1.pl.

Note how we get the root of the twig using the root method, then use the children method to get the list of players.

The first_child method is used to navigate the twig. It accepts an optional parameter, the gi we are interested in; if the parameter is omitted the first child, whatever its gi, is returned. Other navigation methods are last_child, prev_sibling, next_sibling and parent. They all return undef if no element is found.

The text method returns the... text of the element, including all elements included in it, without any tags. Other methods used to retrieve the content of an element include print, which prints the element content, from its start tag to its end tag, included, and including the content (and tags) of all included elements, and sprint, which returns the string that print prints, and accepts an optional parameter which excludes the element tags when true.
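The navigation methods above can be sketched as follows. This is not the ex1_1.pl script itself: the inline sample data, the g and blk element names and the hard-coded sort field are assumptions made for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# sample data, inlined so the sketch is self-contained
my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><blk>14</blk></player>',
    '<player><name>Sprewell, Latrell</name><g>69</g><ppg>19.2</ppg><blk>15</blk></player>',
    '<player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg><blk>68</blk></player>',
    '</stats>';

my $field = 'blk';                      # the stat to sort on

my $twig = XML::Twig->new;              # no option: the whole tree is built
$twig->parse($xml);

my $root    = $twig->root;              # the stats element
my @players = $root->children('player');

# highest value first, comparing the text of the chosen child of each player
my @sorted = sort { $b->first_child($field)->text <=> $a->first_child($field)->text }
             @players;
my @names  = map { $_->first_child('name')->text } @sorted;

print "<stats>\n";
print $_->sprint, "\n" for @sorted;     # sprint returns the element, tags included
print "</stats>\n";
```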

Modifying the twig

Another example, in which we will create new elements: our statistics include the total number of blocks for each player, but in order to find out the best blocker in our selection we want the number of blocks per game, and we want to store it in the document (conveniently the DTD allows for an optional blg element).

Here is the ex1_2.pl.

The paste method accepts 4 different position arguments:

first_child: pastes the element as the first child of the third argument
last_child: pastes the element as the last child of the third argument
before: pastes the element before the third argument
after: pastes the element after the third argument

You can omit first_child and just write $elt->paste( $ref). What you can't do is paste an element that already belongs to a document: that will cause a fatal error.

An important feature of the paste method is that it is called on the element being pasted: $child->paste( $parent) and not the other way around.
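Here is a minimal sketch of creating an element and pasting it, in the spirit of the blocks-per-game example but not the ex1_2.pl script itself. The inline data and the g (games) element name are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>',
    '<player><name>Ewing, Patrick</name><g>49</g><blk>68</blk></player>',
    '</stats>';

my $twig = XML::Twig->new;
$twig->parse($xml);

for my $player ($twig->root->children('player')) {
    my $games  = $player->first_child('g')->text;
    my $blocks = $player->first_child('blk')->text;

    # a brand new element, not yet attached to any document
    my $blg = XML::Twig::Elt->new(blg => sprintf('%.2f', $blocks / $games));

    # paste is called on the new element, the player is the reference
    $blg->paste(last_child => $player);
}

my $output = $twig->root->sprint;
print $output, "\n";
```

To move an element that is already in the document, it would first have to be removed with cut before being pasted elsewhere.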

Note that the output is now generated by the print method, instead of regular print statements, and that the extra line returns that we had inserted in the file have disappeared. We will see a little later how to keep them around.

Twig handlers

Another way to accomplish the same task, a more "twig-ish" way, would be to set a handler on the player element. A handler is a subroutine attached to an element name through the twig_handlers option when the twig is created. It is called every time an element with that name has been completely parsed, with 2 parameters: the twig itself and the element.

Note that the handler is called as soon as the element is completely parsed. That means that the handler will be called when the end tag for that element is parsed. A somewhat surprising consequence of that is that if you set twig handlers on nested elements, the handlers on the inner elements will be called before the handlers on the outer elements.
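The calling order can be checked with a tiny sketch. The document and the @order array are made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @order;   # records which handler ran, and when

my $twig = XML::Twig->new(
    twig_handlers => {
        # each handler receives the twig and the element just parsed
        player => sub { my ($t, $player) = @_; push @order, 'player: ' . $player->first_child('name')->text },
        name   => sub { my ($t, $name)   = @_; push @order, 'name: '   . $name->text },
    },
);
$twig->parse('<stats><player><name>Houston, Allan</name><ppg>20.1</ppg></player></stats>');

print "$_\n" for @order;
```

The name handler runs first because the end tag of name is seen before the end tag of player. By the time the player handler runs, the whole player sub-tree is available.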

Here is the ex1_3.pl.

This is basically similar to the previous example, except the interesting code is in the handler instead of being in the loop. It gets more interesting in the next section though...

The flush and purge methods

The flush method

Now in the previous examples the whole document was being loaded, then printed. This is not very memory efficient, especially as once a player has been updated it is never used again.

Hence the use of the flush method. The flush method just dumps the twig that has been parsed so far. It takes care of printing the proper closing tags when needed and deleting the printed elements, thus allowing the memory to be reused for the rest of the processing. It does not delete the parents of the current element (but might delete most of their children), so they are still available when navigating the twig.

Here is the ex1_4.pl.

Still very similar to the previous example, except that instead of printing the whole twig at the end of the processing the calls to flush at the end of player ensure that each player element stays in memory for just as long as it is needed.

Note: as of XML::Twig 3.23, there is no longer any need to call flush one last time after the document is completely parsed. If the document was flushed, then it will be "auto-flushed" (to the same filehandle used for the first flush) after the parse.
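A minimal sketch of flushing from a handler follows; it is not the ex1_4.pl script. It assumes XML::Twig 3.23 or later (so the final flush is automatic) and writes to an in-memory filehandle only so that the result can be inspected. The inline data and the g element name are also assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><g>69</g><blk>14</blk></player>',
    '<player><name>Sprewell, Latrell</name><g>69</g><blk>15</blk></player>',
    '<player><name>Ewing, Patrick</name><g>49</g><blk>68</blk></player>',
    '</stats>';

# in-memory output; a real script would usually flush to STDOUT or to a file
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub {
            my ($t, $player) = @_;
            my $blg = sprintf '%.2f',
                $player->first_child('blk')->text / $player->first_child('g')->text;
            XML::Twig::Elt->new(blg => $blg)->paste(last_child => $player);
            $t->flush($out);   # output what has been parsed so far, then free it
        },
    },
);
$twig->parse($xml);            # the closing tags are flushed when the parse ends
close $out;

print $buffer, "\n";
```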

The purge method

The flush method is useful if you want to output the modified document. But you might not always want that. Suppose you just want to output the leader in a category:

Here is the ex1_5.pl.

Very simple, yet very memory efficient. You still get the advantage of local tree-processing, having access to the whole player sub-tree, while not having to pay the price of loading the whole document in memory.
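A sketch of the idea, not the ex1_5.pl script itself. The inline data and the hard-coded ppg field are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
    '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
    '<player><name>Ewing, Patrick</name><ppg>14.6</ppg></player>',
    '</stats>';

my $field = 'ppg';
my ($leader, $best) = ('', -1);

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub {
            my ($t, $player) = @_;
            my $value = $player->first_child($field)->text;
            if ($value > $best) {
                $best   = $value;
                $leader = $player->first_child('name')->text;
            }
            $t->purge;   # nothing is output, the memory is simply released
        },
    },
);
$twig->parse($xml);

print "Leader in $field: $leader ($best)\n";
```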

But wait! There's more...

The twig_roots option

Actually in the previous example we build the complete twig for each player element, even though we are really only interested in the name and one of the sub-elements. It's OK as the XML file we are working on is not too big, but it can be a problem, both in terms of speed and memory, for bigger files. Fortunately XML::Twig offers a way to build the twig only for those elements we are interested in.

The twig_roots option, set when the twig is created, gives a list (well, actually a hash) of elements for which the twig will be built. Other elements will be ignored. The result is a twig that includes the root of the document (we need a root for the tree in any case) and the twig_roots elements as children of that root. For each element in the twig_roots list the whole sub-tree is built.

Here is the ex1_6.pl.

The virtual twig built (looking for the leader in ppg) is <stats><name>Houston, Allan</name><ppg>20.1</ppg><name>Sprewell, Latrell</name><ppg>19.2</ppg>...</stats>. The script doesn't spend memory storing useless information on other stats, nor time building the twig for those stats.
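Here is a hedged sketch of the twig_roots option, not the ex1_6.pl script itself. It attaches handlers directly to the roots, which is one possible way to use the option; the inline data and the extra blk element are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><ppg>20.1</ppg><blk>14</blk></player>',
    '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg><blk>15</blk></player>',
    '<player><name>Ewing, Patrick</name><ppg>14.6</ppg><blk>68</blk></player>',
    '</stats>';

my ($current_name, $leader, $best) = ('', '', -1);

my $twig = XML::Twig->new(
    # only name and ppg elements are built, blk is never turned into a twig element
    twig_roots => {
        name => sub { my ($t, $name) = @_; $current_name = $name->text; },
        ppg  => sub {
            my ($t, $ppg) = @_;
            if ($ppg->text > $best) {
                ($leader, $best) = ($current_name, $ppg->text);
            }
            $t->purge;   # release what has been built so far
        },
    },
);
$twig->parse($xml);

print "Leader in ppg: $leader ($best)\n";
```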

The twig_print_outside_roots option

Now suppose all we want to do is remove a statistical category from the document. Ideally we would like to build as little of the twig as possible, using the twig_roots option, but we also want most of the document to be output as-is. twig_print_outside_roots to the rescue! By setting that option when we create the twig, anything outside of the twig_roots elements will simply be printed.

Here is the ex1_7.pl.

Note the use of the cut method, which just removes the element from the twig. It is also possible to use delete instead of cut. The difference is that cut keeps the element around (so it can be for example pasted somewhere else), while delete destroys it (and frees up the memory it used).
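A minimal sketch of the combination, not the ex1_7.pl script itself. The inline data, the choice of blk as the category to drop and the $removed counter are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><ppg>20.1</ppg><blk>14</blk></player>',
    '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg><blk>15</blk></player>',
    '<player><name>Ewing, Patrick</name><ppg>14.6</ppg><blk>68</blk></player>',
    '</stats>';

my $removed = 0;   # only here to report how many elements were dropped

my $twig = XML::Twig->new(
    # blk is the only element built in the twig...
    twig_roots => {
        blk => sub { my ($t, $blk) = @_; $blk->cut; $removed++; },
    },
    # ...everything else goes straight to STDOUT
    twig_print_outside_roots => 1,
);
$twig->parse($xml);

print "\n";
print STDERR "removed $removed blk elements\n";
```

Since the cut element is never pasted back or printed, it simply never reaches the output.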

And of course, as There's More Than One Way To Do It, here is a real short script that does the same thing, just in a more lazy way (and actually a slightly faster but more memory intensive one).

The ex1_8.pl.

Figuring out how it works is left as an exercise for the reader (hint: twig_print_outside_roots does just what its name suggests, no more).

A simple HTML+ converter

Now with what we've learned so far we are just a couple of additional tricks away from building a simple "HTML+" converter. The + here means that we can include additional inline elements to an HTML document. Provided of course that HTML document is a valid XML instance (and I admit this can be hard to achieve).

So here is the xml2html1.pl script. It runs on the html_plus.xml file and includes itself in the output: html_plus.html

We use 3 new methods here:

set_gi predictably sets the gi (the name; gi means generic identifier, a term that comes from SGML) of the element; the gi method returns the gi of an element,
set_att sets (and creates if it does not exist already) an attribute to a value; the att method retrieves the value of an attribute,
insert creates an element which is inserted within another element: the new element becomes the only child of the initial element and all children of the initial element become children of the new element. The method returns the new element.

Also note the neat trick (thanks to Clark Cooper for this one) that consists in setting the handler as a sub that just adds an extra parameter to the usual ones: sub { make(@_, 'tt') }.
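The three methods and the extra-parameter trick can be sketched like this. It is not the xml2html1.pl script: the make sub below, the method and class element names and the span/class convention are all made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical helper: turns an inline element into <span class="old name"><tag>...</tag></span>
sub make {
    my ($twig, $elt, $html_gi) = @_;
    $elt->set_att(class => $elt->gi);   # keep the original name as a class
    $elt->set_gi('span');               # rename the element itself
    $elt->insert($html_gi);             # wrap its content in a new child
}

my $twig = XML::Twig->new(
    twig_handlers => {
        # the closure adds a third argument to the usual twig and element
        'method' => sub { make(@_, 'tt') },
        'class'  => sub { make(@_, 'b')  },
    },
);
$twig->parse('<p>Use <method>paste</method> on an <class>Elt</class> here.</p>');

my $html = $twig->root->sprint;
print $html, "\n";
```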

Setting handlers for elements in context

An additional option is to set handlers not for elements but for elements in a given context: instead of giving just the gi of the element you can use an XPath-like expression in the twig_handlers (as well as in the twig_roots) argument.

A valid path can be of the form /root/elt1/elt2 for a complete path to the element, or elt1/elt2 for a partial path.

Note that this path is given in the original document, not in the current twig.

So if we want to convert the simple document we saw in the XML examples we would write the conversion as in ex1_9.pl.

When we process the doc element the title has already been processed, so we have to look for a h1 child.

We also use two new methods here: erase removes the element and pastes all of its children as children of the element parent. The effect on the output is that the tag has been erased from the document. set_text sets the textual content of the element.
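A hedged sketch of handlers in context follows; it is not the ex1_9.pl script, and the small doc/section/title document is an assumption about what such a simple document might look like.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # the same element name gets a different treatment depending on its parent
        'doc/title'     => sub { $_[1]->set_gi('h1') },
        'section/title' => sub { $_[1]->set_gi('h2') },
        # the section tags disappear, their content moves up to the parent
        'section'       => sub { $_[1]->erase },
        'doc'           => sub {
            my ($t, $doc) = @_;
            $doc->set_gi('body');
            # the title was processed earlier, so by now it is an h1
            my $h1 = $doc->first_child('h1');
            $h1->set_text(uc $h1->text) if $h1;
        },
    },
);
$twig->parse('<doc><title>Guide</title><section><title>Intro</title><p>Hello</p></section></doc>');

my $html = $twig->root->sprint;
print $html, "\n";
```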

Data base integration

We now have all the tools we need to build documents that include data straight out of relational data bases. The only decision we have to make is how to design our documents, and our DTD's. Are we going to include entire tables or single values, and how?

Here are some simple examples of what can be done:

Including a table

For this first example we will include a whole table in the document.

The document we use is books1.xml, where the table is generated by the <rel_table query="SELECT code, name, price FROM books"/> tag.

The code in ex2_1.pl mixes DBI and XML::Twig to build the table.

This code can also be used to process slightly trickier queries, as in books2.xml.
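The overall shape of such a script can be sketched without a data base. This is not ex2_1.pl: the run_query sub is a hypothetical stand-in that returns canned rows where a real script would call DBI (for example selectall_arrayref on a database handle), and the book data is invented.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical stand-in for the data base: returns an array of rows
sub run_query {
    my ($query) = @_;
    return [ [ 'b001', 'First Book',  '10.00' ],
             [ 'b002', 'Second Book', '12.50' ] ];
}

my $twig = XML::Twig->new(
    twig_handlers => {
        rel_table => sub {
            my ($t, $elt) = @_;
            my $rows = run_query($elt->att('query'));
            $elt->del_att('query');        # the query should not show up in the output
            $elt->set_gi('table');
            for my $row (@$rows) {
                my $tr = XML::Twig::Elt->new('tr');
                $tr->paste(last_child => $elt);
                XML::Twig::Elt->new(td => $_)->paste(last_child => $tr) for @$row;
            }
        },
    },
);
$twig->parse('<doc><p>Our books:</p><rel_table query="SELECT code, name, price FROM books"/></doc>');

my $html = $twig->root->sprint;
print $html, "\n";
```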

Including values from a table

Depending on how generic, and how convenient to write we want the queries to be, several options are possible. Here are a couple:

The first document is books3.xml, which includes very generic queries.

It can be processed using the ex2_2.pl script.

A shorter but less generic way would be a document like books3.xml.

It can be processed using the ex2_3.pl script.

Dumping an XML table into a data base table

We are now going to fill a relational table from an XML file, which could come from another, incompatible, data base for example.

The XML file looks like this: teams_extract.xml (the whole file is in teams.xml).

The script to load the table: ex2_4.pl is pretty simple, the only notable features being the fact that we prepare the SQL statement once and then bind parameters to it, and that the purge does not delete the parent element of a name.

Other features

Now let's see some other features of XML::Twig, beyond the basic examples.

Using the finish and finish_print methods

Sometimes all we need is to extract or update part of the document. In this case there is no reason to bother with building the twig for the rest of the document. We just want to be done with it and exit, or go through the rest of the document and just output it. That's what the finish and finish_print methods provide.

finish calls Expat's finish method. It unsets all handlers (including internal ones that set context), but expat continues parsing to the end of the document or until it finds an error. It should finish up a lot faster than with the handlers set.

finish_print stops the twig processing, flushes the twig and proceeds to finish printing the document as fast as possible.

So here is ex3_1.pl, which just displays a stat for a player then finishes parsing. Note that the document is still checked for well-formedness, the script will exit with an error if the document is not well-formed XML.

Probably more interesting is ex3_2.pl which updates the stats for a player.
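A sketch of the first case, not the ex3_1.pl script itself: look up one stat for one player, then stop building the twig. The inline data and the hard-coded player and field are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
    '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
    '<player><name>Ewing, Patrick</name><ppg>14.6</ppg></player>',
    '</stats>';

my ($wanted, $field) = ('Sprewell, Latrell', 'ppg');
my $result;

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub {
            my ($t, $player) = @_;
            if ($player->first_child('name')->text eq $wanted) {
                $result = $player->first_child($field)->text;
                $t->finish;   # no more handlers, the rest is only checked by expat
            }
            else {
                $t->purge;    # not the one we want, forget it
            }
        },
    },
);
$twig->parse($xml);

print "$wanted: $field = $result\n";
```

For the update case the handler would modify the element and call finish_print instead, so that the rest of the document is still output.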

Using set_id and elt_id methods

For some applications, especially when the whole document is loaded in memory, it can be very convenient to get direct access to elements through an ID attribute. XML::Twig provides such a feature. By default if an element has an attribute named id then a hash id => element is created. This hash can be accessed through the id, set_id and del_id methods on an element, and an element can be retrieved from a twig using the elt_id method on the twig.

The name of the ID attribute can be changed when the twig is created by using the Id option.

The id attribute can still be accessed through the att, set_att and del_att methods on the element but in this case the id hash will not be updated.

ex3_3.pl is an example of the set_id method.

ex3_4.pl uses elt_id on the updated XML document to display the name of a player with a given id. perl ex3_3.pl | perl ex3_4.pl 050 will then display player050: Stojakovic, Predrag.
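The two methods can be sketched together; this is neither ex3_3.pl nor ex3_4.pl, and the two-player document and the id values are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;   # by default the ID attribute is named id
$twig->parse(
    '<stats>'
  . '<player id="player001"><name>Houston, Allan</name></player>'
  . '<player><name>Ewing, Patrick</name></player>'
  . '</stats>'
);

# direct access to an element that had an id in the document
my $first = $twig->elt_id('player001');
print $first->first_child('name')->text, "\n";

# give an id to an element that did not have one; set_id also updates the id hash
my $ewing = $twig->root->last_child('player');
$ewing->set_id('player002');

my $found = $twig->elt_id('player002');
print $found->id, ': ', $found->first_child('name')->text, "\n";
```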

Comparing the order of 2 elements

XML::Twig also offers methods to compare the order of 2 elements in the document. before and after are based on the cmp method. An element is before an other one if its opening tag is before the opening tag of the other element. Otherwise it is after. The 2 elements are equal if they are... equal!

ex3_5.pl shows how to use those methods. You can run it on an ordered and "id'ed" document this way: perl ex1_1.pl blk | perl ex3_3.pl | perl ex3_5.pl 001 015.
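A short sketch of the comparison methods, independent of ex3_5.pl. The document is made up; the return values noted in the comments follow the description above and my reading of the module documentation.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<stats>'
  . '<player id="p1"><name>Houston, Allan</name></player>'
  . '<player id="p2"><name>Ewing, Patrick</name></player>'
  . '</stats>'
);

my $houston = $twig->elt_id('p1');
my $ewing   = $twig->elt_id('p2');

my $is_before = $houston->before($ewing);   # true: p1 opens before p2
my $is_after  = $ewing->after($houston);    # true for the same reason
my $order     = $houston->cmp($ewing);      # negative when the first one comes first
my $same      = $houston->cmp($houston);    # 0: an element is equal to itself

print "p1 is before p2\n" if $is_before;
print "cmp gives $order, then $same\n";
```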

The next_elt method

Although the next_sibling and first_child methods are often the most convenient way to navigate there are some cases where another method is easier to use: the next_elt method makes it easier to go through all the elements in a sub-tree.

The next_elt of an element is the first element opened after the open tag of the element. This is either the first child of the element, or its next sibling, or the next sibling of one of its ancestors. Note that as usual PCDATA is considered an element.

This method has 2 forms:

$elt->next_elt returns simply the next element,
$elt->next_elt( $subtree_root) returns the next element, or undef if the next element would be outside of the $subtree_root element.

ex3_6.pl shows how to use next_elt to list all the methods in the html_plus.xml document.
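The second form can be sketched as a loop over one sub-tree. This is not ex3_6.pl; the document and the method element name are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<doc>'
  . '<section><p>Use <method>paste</method> or <method>cut</method>.</p></section>'
  . '<p><method>flush</method></p>'
  . '</doc>'
);

my $section = $twig->root->first_child('section');

# walk every element of the section, text elements included,
# and stop as soon as next_elt would leave the sub-tree
my @methods;
my $elt = $section;
while ($elt = $elt->next_elt($section)) {
    push @methods, $elt->text if $elt->gi eq 'method';
}

print "methods in the section: @methods\n";
```

The flush method is not listed because it sits outside of the section used as the sub-tree root.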

Pretty printing

By popular demand I have included a number of pretty printing options, both for documents and for data.

The useful options to pretty print a document are:

none: the default, no \n is used,
nsgmls: nsgmls style, with \n added within tags,
nice: adds \n wherever possible (NOT SAFE),
indented: same as nice plus indents elements (NOT SAFE).

The NOT SAFE options can produce invalid XML (that would not conform to the original DTD) in some cases. I have included them anyway because it rarely happens with simple DTDs and they look good!

The ex3_7.pl example shows the pretty printer.

The output is ex3_7.res.

To pretty print tables 2 options can be used (besides the faithful none):

record_c: compact, one record per line,
record: one field per line.

The ex3_8.pl example shows the pretty printer.

The output is ex3_8.res.

These options can be set either when creating the twig, using the PrettyPrint option, by using the PrettyPrint option in the print method (on a twig or on an element) or by using the set_pretty_print method either on a twig or on an element. Note that the setting is actually global at the moment.
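A minimal sketch of two of these ways to set the style. It uses the lower-case pretty_print spelling of the option, which I believe current releases accept alongside PrettyPrint; the sample document is made up.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<stats><player><name>Houston, Allan</name><ppg>20.1</ppg></player></stats>';

# style chosen when the twig is created
my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse($xml);
my $indented = $twig->root->sprint;

# style changed later on, through the method
$twig->set_pretty_print('none');
my $compact = $twig->root->sprint;

print "indented:\n$indented\n";
print "none:\n$compact\n";
```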

Advanced features

or "I hope you don't need those"

Using StartTagHandlers

Sometimes you might want to just change a tag name, or store some attributes, BEFORE the whole tree for the element is built. This is often the case when you need to flush the twig while in the element. Then changing the element name for example will only change the end tag, as the start tag will have been output by the time you try to change it.

In that case you can use the StartTagHandlers option when you create the twig, which will call a handler when the start tag of the element is found. The arguments passed to the handler will be the twig and the element. The element will be empty at that point but the attributes will be there.

ex4_0.pl demonstrates the use of StartTagHandlers to change the tags in an XML document.

The other new feature used in this script is the _all_ keyword in the twig_handlers option. This calls the handler (which in this case just flushes the twig) for every single element in the document. Another keyword, _default_ calls a handler for each element that does not have a handler. _all_ and _default_ can be used both with StartTagHandlers and with the twig_handlers option.
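Here is a hedged sketch of a start tag handler combined with an early flush. It is not ex4_0.pl: it uses the lower-case start_tag_handlers spelling, flushes from a handler on name rather than from _all_, assumes XML::Twig 3.23 or later for the final automatic flush, and the athlete name is invented.

```perl
use strict;
use warnings;
use XML::Twig;

# in-memory output, only so that the result can be inspected
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    # called when <player> is opened: the element is still empty at that point
    start_tag_handlers => {
        player => sub { my ($t, $player) = @_; $player->set_gi('athlete'); },
    },
    # flushing here outputs the start tag of the enclosing player,
    # so renaming it in a regular handler on player would come too late
    twig_handlers => {
        name => sub { my ($t, $name) = @_; $t->flush($out); },
    },
);
$twig->parse('<stats><player><name>Houston, Allan</name></player></stats>');
close $out;

print $buffer, "\n";
```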

Purging part of the tree

Sometimes, especially when converting an XML file to several HTML ones it is convenient to purge the twig only up to the next-to-last sibling, not up to the current one. Hence the purge_up_to and flush_up_to methods.

Here is an example of how to use them to list the difference in a given stat between 2 consecutive players. ex4_1.pl can receive the output from ex1_1.pl.
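The idea can be sketched as follows; this is not ex4_1.pl. The inline data (already sorted on ppg) and the @diffs array are assumptions.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = join '',
    '<stats>',
    '<player><name>Houston, Allan</name><ppg>20.1</ppg></player>',
    '<player><name>Sprewell, Latrell</name><ppg>19.2</ppg></player>',
    '<player><name>Ewing, Patrick</name><ppg>14.6</ppg></player>',
    '</stats>';

my $field = 'ppg';
my @diffs;

my $twig = XML::Twig->new(
    twig_handlers => {
        player => sub {
            my ($t, $player) = @_;
            my $prev = $player->prev_sibling('player');
            if ($prev) {
                push @diffs, sprintf '%.1f',
                    $prev->first_child($field)->text - $player->first_child($field)->text;
                # the previous player is no longer needed, but the current one
                # has to stay around for the next comparison
                $t->purge_up_to($prev);
            }
        },
    },
);
$twig->parse($xml);

print "differences in $field: @diffs\n";
```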

Fun with overloading

I just thought I'd mention, because I think it's cool, that you can overload the comparison operators to use the cmp method to compare elements in a twig.

So just insert these lines in your script:

package XML::Twig::Elt;
use overload 'cmp' => \&cmp,
             'lt'  => \&lt,
             'le'  => \&le,
             'gt'  => \&gt,
             'ge'  => \&ge,
             '+='  => \&suffix,
             '-='  => \&prefix,
             '>>'  => \&suffix,
             '<<'  => \&prefix,
             fallback => 1,
;

Then you will be able to write if( $elt1 le $elt2) { print "$elt1 is before $elt2\n"; }. As an added bonus you get 2 new ways to prefix or suffix an element:

$elt += "suffix";
$elt -= "prefix";
$elt << "prefix";
$elt >> "suffix";

This is just syntactic sugar, and IMHO pretty useless (hence it is not included in the module), plus it slows the module down by a good 30%. It's cute though, and if you don't care about speed and need to do a lot of comparisons of elements it can be handy.

Under the hood

Now let's have a look under the hood at some of the things that go on in XML::Twig from a developer stand point.

Speedup

I think one of the most interesting features of XML::Twig is the optimization step that takes place when the module is installed.

The module is written in pure OO style, with accessors for every field of the objects, even inside the module. But as we all know method calls are expensive. So an optimization pass replaces method calls by hash accesses if possible.

For example $elt->parent is replaced by $elt->{parent} and $elt->set_parent( $parent) is replaced by $elt->{parent}= $parent.

The speedup is pretty simple, just a bunch of substitutions, and certainly not foolproof (it would crash miserably if I were to use brackets in the argument list of a method). It works pretty well though, and if it fails then the non-regression tests will catch the problem. It could be improved by using 5.6 new regexp to fix this.
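To give an idea of what such a pass looks like, here is a toy version. It is not the speedup tool shipped with the module: the speedup sub, the @FIELDS list and the exact regular expressions are assumptions, and it has the same weakness with brackets in the argument list.

```perl
use strict;
use warnings;

# hypothetical list of fields that have plain accessors
my @FIELDS = qw(parent first_child last_child prev_sibling next_sibling);

sub speedup {
    my ($code) = @_;
    for my $field (@FIELDS) {
        # setter:  $elt->set_parent( $parent)   becomes   $elt->{parent}= $parent
        # (the argument must not contain brackets)
        $code =~ s/(\$\w+)->set_\Q$field\E\(\s*([^()]*?)\s*\)/$1 . '->{' . $field . '}= ' . $2/ge;
        # getter:  $elt->parent   becomes   $elt->{parent}
        # (calls with an argument list are left alone)
        $code =~ s/(\$\w+)->\Q$field\E\b(?!\s*\()/$1 . '->{' . $field . '}'/ge;
    }
    return $code;
}

print speedup('my $p= $elt->parent;'), "\n";
print speedup('$elt->set_parent( $parent);'), "\n";
```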

The result is an improvement of about 30% of the speed of the module.

Speedup could also be used to... speedup a production script, with the caveat that as XML::Twig implementation changes it might be necessary to re-run the tool with new versions of the module.

Element names "compression"

A minor optimization in XML::Twig is that element names, which are stored as hash values are replaced by an index in an array holding all names.
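The principle can be sketched in a few lines. The names below (%gi2index, @index2gi, set_gi_index, gi_of) are made up for the illustration and are not the module's internals.

```perl
use strict;
use warnings;

my %gi2index;   # element name => index
my @index2gi;   # index => element name, each name stored only once

# store the index of the name in the element instead of the name itself
sub set_gi_index {
    my ($elt, $gi) = @_;
    unless (exists $gi2index{$gi}) {
        push @index2gi, $gi;
        $gi2index{$gi} = $#index2gi;
    }
    $elt->{gi} = $gi2index{$gi};
    return $elt;
}

# get the name back from the index
sub gi_of { my ($elt) = @_; return $index2gi[ $elt->{gi} ]; }

my @elts = map { set_gi_index({}, $_) } qw(player name ppg player name ppg);
print scalar(@elts), ' elements, ', scalar(@index2gi), " distinct names stored\n";
```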

Failed optimizations

Not all attempts at optimizing XML::Twig succeeded, so I think it might be useful for me to share at least my biggest failure in this area...

Twig elements are stored in hashes, one element per hash. In order to reduce the potential overhead of all too much memory being allocated for each one of them I tried to store elements in global arrays, each array storing one field for all the elements: instead of the parent of an element being stored in $elt->{parent} it was stored in $parent[$elt], $elt being a blessed scalar.

It did not work.

The twig was just as big and slower to access than the original version.

Oh well... there goes 2 days of work...

Reference

The XML::Twig documentation.

(c) 2000 Michel Rodriguez
This tutorial is free documentation. It can be redistributed and modified under the same terms as Perl itself.