XML, the Perl Way


NAME

XML::Twig - A Perl module for processing huge XML documents in tree mode.

SYNOPSIS

Note that this documentation is intended as a reference to the module.

Complete docs, including a tutorial, examples, an easier to use HTML version, a quick reference card and a FAQ are available at http://www.xmltwig.org/xmltwig

Small documents (loaded in memory as a tree):

  my $twig=XML::Twig->new();    # create the twig
  $twig->parsefile( 'doc.xml'); # build it
  my_process( $twig);           # use twig methods to process it 
  $twig->print;                 # output the twig

Huge documents (processed in combined stream/tree mode):

  # at most one div will be loaded in memory
  my $twig=XML::Twig->new(   
    twig_handlers => 
      { title   => sub { $_->set_tag( 'h2') }, # change title tags to h2 
                                               # $_ is the current element
        para    => sub { $_->set_tag( 'p')  }, # change para to p
        hidden  => sub { $_->delete;       },  # remove hidden elements
        list    => \&my_list_process,          # process list elements
        div     => sub { $_[0]->flush;     },  # output and free memory
      },
    pretty_print => 'indented',                # output will be nicely formatted
    empty_tags   => 'html',                    # outputs <empty_tag />
                         );
  $twig->parsefile( 'my_big.xml');

  sub my_list_process
    { my( $twig, $list)= @_;
      # ...
    }

See XML::Twig 101 for other ways to use the module, as a filter for example.

DESCRIPTION

This module provides a way to process XML documents. It is built on top of XML::Parser.

The module offers a tree interface to the document, while allowing you to output the parts of it that have been completely processed.

It allows minimal resource (CPU and memory) usage by building the tree only for the parts of the document that need actual processing, through the use of the twig_roots and twig_print_outside_roots options. The finish and finish_print methods also help to improve performance.

XML::Twig tries to make simple things easy, so it tries its best to take care of a lot of the (usually) annoying (but sometimes necessary) features that come with XML and XML::Parser.

TOOLS

XML::Twig comes with a few command-line utilities:

xml_pp - xml pretty-printer

XML pretty printer using XML::Twig

xml_grep - grep XML files looking for specific elements

xml_grep does a grep on XML files. Instead of using regular expressions it uses XPath expressions (in fact the subset of XPath supported by XML::Twig).

xml_split - cut a big XML file into smaller chunks

xml_split takes a (presumably big) XML file and splits it into several smaller files, based on various criteria (level in the tree, size or an XPath expression).

xml_merge - merge back XML files split with xml_split

xml_merge takes several xml files that have been split using xml_split and recreates a single file.

xml_spellcheck - spellcheck XML files

xml_spellcheck lets you spell check the content of an XML file. It extracts the text (the content of elements and optionally of attributes), calls a spell checker on it and then recreates the XML document.

XML::Twig 101

XML::Twig can be used either on "small" XML documents (that fit in memory) or on huge ones, by processing parts of the document and outputting or discarding them once they are processed.

Loading an XML document and processing it

  my $t= XML::Twig->new();
  $t->parse( '<d><title>title</title><para>p 1</para><para>p 2</para></d>');
  my $root= $t->root;
  $root->set_tag( 'html');              # change doc to html
  my $title= $root->first_child( 'title'); # get the title
  $title->set_tag( 'h1');               # turn it into h1
  my @para= $root->children( 'para');   # get the para children
  foreach my $para (@para)
    { $para->set_tag( 'p'); }           # turn them into p
  $t->print;                            # output the document

Other useful methods include:

att: $elt->att( 'foo') returns the foo attribute for an element,

set_att: $elt->set_att( foo => "bar") sets the foo attribute to the bar value,

next_sibling: $elt->next_sibling returns the next sibling in the document (in the example $title->next_sibling is the first para; you can also, and actually should, use $elt->next_sibling( 'para') to get it).

The document can also be transformed through the use of the cut, copy, paste and move methods: $title->cut; $title->paste( after => $p); for example.

And much, much more, see XML::Twig::Elt.
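The accessors and the cut and paste methods listed above can be combined as in the following minimal sketch. The id attribute and its t1 value are made up for the illustration, they are not part of the original example.

  use strict;
  use warnings;
  use XML::Twig;

  my $t= XML::Twig->new();
  $t->parse( '<d><title>title</title><para>p 1</para><para>p 2</para></d>');
  my $root=  $t->root;
  my $title= $root->first_child( 'title');  # get the title
  my $p=     $title->next_sibling( 'para'); # the first para after the title
  $title->set_att( id => 't1');             # hypothetical attribute
  print $title->att( 'id'), "\n";           # read the attribute back
  $title->cut;                              # detach the title from the tree
  $title->paste( after => $p);              # re-attach it after the first para
  $t->print;                                # output the modified document

Using the att and set_att methods here, rather than accessing the element as a hash, keeps the code independent of the way the module stores attributes.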

Processing an XML document chunk by chunk

One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor often being around 10).

To do this you can define handlers that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit, using the navigation and the cut-n-paste methods, plus lots of convenient ones like prefix. Once the element is completely processed you can then flush it, which will output it and free the memory. You can also purge it if you don't need to output it (if you are just extracting some data from the document for example). The handler will be called again once the next relevant element has been parsed.

  my $t= XML::Twig->new( twig_handlers => 
                          { section => \&section,
                            para   => sub { $_->set_tag( 'p'); }
                          },
                       );
  $t->parsefile( 'doc.xml');

  # the handler is called once a section is completely parsed, ie when 
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section 
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->att( 'nb'); # get the attribute
      $title->prefix( "$nb - ");  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }

There is of course more to it: you can trigger handlers on more elaborate conditions than just the name of the element, section/title for example.

  my $t= XML::Twig->new( twig_handlers => 
                           { 'section/title' => sub { $_->print } }
                       )
                  ->parsefile( 'doc.xml');

Here sub { $_->print } simply prints the current element ($_ is aliased to the element in the handler).

You can also trigger a handler on a test on an attribute:

  my $t= XML::Twig->new( twig_handlers => 
                      { 'section[@level="1"]' => sub { $_->print } }
                       )
                  ->parsefile( 'doc.xml');

You can also use start_tag_handlers to process an element as soon as the start tag is found. Besides prefix you can also use suffix.
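As a rough sketch of how start_tag_handlers and purge can work together when you only extract data, the following code records an attribute as soon as each section opens, then collects the title text and frees the memory. The sample document and the @levels and @titles arrays are assumptions made for the illustration.

  use strict;
  use warnings;
  use XML::Twig;

  my( @levels, @titles);
  my $t= XML::Twig->new(
    # called on the start tag: attributes are there, content is not parsed yet
    start_tag_handlers =>
      { section => sub { my( $t, $section)= @_;
                         push @levels, $section->att( 'level');
                       },
      },
    # called once the title is completely parsed
    twig_handlers =>
      { 'section/title' => sub { my( $t, $title)= @_;
                                 push @titles, $title->text;
                                 $t->purge;  # nothing to output, just free memory
                               },
      },
                       );
  $t->parse( '<doc><section level="1"><title>Intro</title><para>p 1</para></section>'
           . '<section level="2"><title>More</title></section></doc>');
  print "level $levels[$_]: $titles[$_]\n" foreach 0..$#titles;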

Processing just parts of an XML document

The twig_roots mode builds only the required sub-trees from the document. Anything outside of the twig roots will just be ignored:

  my $t= XML::Twig->new( 
       # the twig will include just the root and selected titles 
           twig_roots   => { 'section/title' => \&print_n_purge,
                             'annex/title'   => \&print_n_purge
           }
                      );
  $t->parsefile( 'doc.xml');

  sub print_n_purge 
    { my( $t, $elt)= @_;
      print $elt->text;    # print the text (including sub-element texts)
      $t->purge;           # frees the memory
    }

You can use that mode when you want to process parts of a document but are not interested in the rest, and you don't want to pay the price, either in time or memory, to build the tree for it.

Building an XML filter

You can combine the twig_roots and the twig_print_outside_roots options to build filters, which let you modify selected elements and will output the rest of the document as is.

This would convert prices in $ to prices in Euro in a document:

  my $t= XML::Twig->new( 
           twig_roots   => { 'price' => \&convert, },   # process prices 
           twig_print_outside_roots => 1,               # print the rest
                      );
  $t->parsefile( 'doc.xml');

  sub convert 
    { my( $t, $price)= @_;
      my $currency=  $price->att( 'currency');          # get the currency
      if( $currency eq 'USD')
        { my $usd_price= $price->text;                  # get the price
          # %rate is just a conversion table 
          my $euro_price= $usd_price * $rate{usd2euro};
          $price->set_text( $euro_price);               # set the new price
          $price->set_att( currency => 'EUR');          # don't forget this!
        }
      $price->print;                                    # output the price
    }

XML::Twig and various versions of Perl, XML::Parser and expat:

XML::Twig is a lot more sensitive to variations in versions of perl, XML::Parser and expat than to the OS, so this should cover some reasonable configurations.

The "recommended configuration" is perl 5.8.3+ (for good Unicode support), XML::Parser 2.31+ and expat 1.95.5+

See http://testers.cpan.org/search?request=dist&dist=XML-Twig for the CPAN testers reports on XML::Twig, which list all tested configurations.

An Atom feed of the CPAN Testers results is available at http://xmltwig.org/rss/twig_testers.rss

When in doubt, upgrade expat, XML::Parser and Scalar::Util.

Finally, for some optional features, XML::Twig depends on some additional modules. The complete list, which depends somewhat on the version of Perl that you are running, is given by running t/zz_dump_config.t

Simplifying XML processing

CLASSES

XML::Twig uses a very limited number of classes. The ones you are most likely to use are XML::Twig of course, which represents a complete XML document, including the document itself (the root of the document itself is root), its handlers, its input or output filters... The other main class is XML::Twig::Elt, which models an XML element. Element here has a very wide definition: it can be a regular element, but also text, with an element tag of #PCDATA (or #CDATA), an entity (tag is #ENT), a Processing Instruction (#PI) or a comment (#COMMENT).

Those are the 2 commonly used classes.

You might want to look at the elt_class option if you want to subclass XML::Twig::Elt.

Attributes are just attached to their parent element, they are not objects per se. (Please use the provided methods att and set_att to access them. If you access them as a hash, your code becomes implementation dependent and might break in the future.)

Other classes that are seldom used are XML::Twig::Entity_list and XML::Twig::Entity.

If you use XML::Twig::XPath instead of XML::Twig, elements are then created as XML::Twig::XPath::Elt

METHODS

XML::Twig

A twig is a subclass of XML::Parser, so all XML::Parser methods can be called on a twig object, including parse and parsefile. setHandlers, on the other hand, cannot be used, see BUGS.

XML::Twig::Elt

cond

Most of the navigation functions accept a condition as an optional argument. The first element (or all elements for children or ancestors) that passes the condition is returned.
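A minimal sketch of passing a condition to navigation methods, assuming a small made-up document with section elements that carry a level attribute:

  use strict;
  use warnings;
  use XML::Twig;

  my $t= XML::Twig->new();
  $t->parse( '<doc><section level="1"><title>A</title></section>'
           . '<section level="2"><title>B</title></section></doc>');
  my $root= $t->root;

  # first child that passes the condition (here just a tag name)
  my $first= $root->first_child( 'section');

  # all children that pass a condition on an attribute
  my @top= $root->children( 'section[@level="1"]');

  # all ancestors of the first title that are sections
  my @sections= $first->first_child( 'title')->ancestors( 'section');

  print "first section level: ", $first->att( 'level'), "\n";
  print "level 1 sections: ", scalar @top, "\n";
  print "section ancestors: ", scalar @sections, "\n";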

The condition is a single step of an XPath expression using the XPath subset defined by get_xpath. Additional conditions are:

The condition can be

XML::Twig::XPath

XML::Twig implements a subset of XPath through the get_xpath method.
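As an illustration only, here is a short sketch of get_xpath called on the twig. The document and the expression are assumptions chosen to stay within the supported subset.

  use strict;
  use warnings;
  use XML::Twig;

  my $t= XML::Twig->new();
  $t->parse( '<doc><section level="1"><title>Intro</title></section>'
           . '<section level="2"><title>Details</title></section></doc>');

  # titles of level 1 sections, anywhere in the document
  my @titles= $t->get_xpath( '//section[@level="1"]/title');
  print $_->text, "\n" foreach @titles;

get_xpath returns a list of elements, so the usual XML::Twig::Elt methods (text here) can be called on each result.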

If you want to use the whole XPath power, then you can use XML::Twig::XPath instead. In this case XML::Twig uses XML::XPath to execute XPath queries. You will of course need XML::XPath installed to be able to use XML::Twig::XPath.

See XML::XPath for more information.

The methods you can use are:

In order for XML::XPath to be used as the XPath engine the following methods are included in XML::Twig:

in XML::Twig

in XML::Twig::Elt

XML::Twig::XPath::Elt

The methods you can use are the same as on XML::Twig::XPath elements:

XML::Twig::Entity_list

XML::Twig::Entity

EXAMPLES

Additional examples (and a complete tutorial) can be found on the XML::Twig Page.

To figure out what flush does, call the following script with an XML file and an element name as arguments:

  use XML::Twig;

  my ($file, $elt)= @ARGV;
  my $t= XML::Twig->new( twig_handlers => 
      { $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
  $t->parsefile( $file, ErrorContext => 2);
  $t->flush;
  print "\n";

NOTES

Subclassing XML::Twig

Useful methods:

DTD Handling

There are 3 possibilities here. They are:

Flush

Remember that element handlers are called when the element is CLOSED, so if you have handlers for nested elements the inner handlers will be called first. This makes it trickier than it would seem to number nested sections (or clauses, or divs), for example, as the titles in the inner sections are handled before the outer sections.
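The calling order can be checked with a small sketch like the one below. The id attributes and the @closed and @opened arrays are made up for the illustration. It also records the order in which start tags are seen through start_tag_handlers, which follows the document order.

  use strict;
  use warnings;
  use XML::Twig;

  my( @closed, @opened);
  my $t= XML::Twig->new(
    # regular handlers: called when the element is closed
    twig_handlers      => { section => sub { push @closed, $_[1]->att( 'id'); } },
    # start tag handlers: called when the element is opened
    start_tag_handlers => { section => sub { push @opened, $_[1]->att( 'id'); } },
                       );
  $t->parse( '<doc><section id="outer"><section id="inner"/></section></doc>');

  print "closed: @closed\n";  # inner section first
  print "opened: @opened\n";  # outer section first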

BUGS

Globals

These are the things that can mess up calling code, especially if threaded. They might also cause problems under mod_perl.

If you need to manipulate all those values, you can use the following methods on the XML::Twig object:

TODO

AUTHOR

Michel Rodriguez <mirod@cpan.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Bug reports should be sent using: RT

Comments can be sent to mirod@cpan.org

The XML::Twig page is at http://www.xmltwig.org/xmltwig/. It includes the development version of the module, a slightly better version of the documentation, examples and a tutorial, Processing XML efficiently with Perl and XML::Twig.

SEE ALSO

Complete docs, including a tutorial, examples, an easier to use HTML version of the docs, a quick reference card and a FAQ are available at http://www.xmltwig.org/xmltwig/

git repository at http://github.com/mirod/xmltwig

XML::Parser, XML::Parser::Expat, XML::XPath, Encode, Text::Iconv, Scalar::Util

Alternative Modules

XML::Twig is not the only XML processing module available on CPAN (far from it!).

The main alternative I would recommend is XML::LibXML.

Here is a quick comparison of the 2 modules:

XML::LibXML, actually libxml2 on which it is based, sticks to the standards, and implements a good number of them in a rather strict way: XML, XPath, DOM, RelaxNG, I must be forgetting a couple (XInclude?). It is fast and rather frugal memory-wise.

XML::Twig is older: when I started writing it XML::Parser/expat was the only game in town. It implements XML and that's about it (plus a subset of XPath, and you can use XML::Twig::XPath if you have XML::XPathEngine installed for full support). It is slower and requires more memory for a full tree than XML::LibXML. On the plus side (yes, there is a plus side!) it lets you process a big document in chunks, and thus lets you tackle documents that couldn't be loaded in memory by XML::LibXML, and it offers a lot (and I mean a LOT!) of higher-level methods, for everything, from adding structure to "low-level" XML, to shortcuts for XHTML conversions and more. It also DWIMs quite a bit, getting comments and non-significant whitespace out of the way but preserving them in the output for example. As it does not stick to the DOM, it also usually leads to shorter code than in XML::LibXML.

Beyond the pure features of the 2 modules, XML::LibXML seems to be preferred by "XML-purists", while XML::Twig seems to be more used by Perl Hackers who have to deal with XML. As you have noted, XML::Twig also comes with quite a lot of docs, but I am sure if you ask for help about XML::LibXML here or on Perlmonks you will get answers.

Note that it is actually quite hard for me to compare the 2 modules: on one hand I know XML::Twig inside-out and I can get it to do pretty much anything I need to (or I improve it ;--), while I have a very basic knowledge of XML::LibXML. So feature-wise, I'd rather use XML::Twig ;--). On the other hand, I am painfully aware of some of the deficiencies, potential bugs and plain ugly code that lurk in XML::Twig, even though you are unlikely to be affected by them (unless for example you need to change the DTD of a document programmatically), while I haven't looked much into XML::LibXML so it still looks shiny and clean to me.

That said, if you need to process a document that is too big to fit in memory and XML::Twig is too slow for you, my reluctant advice would be to use "bare" XML::Parser. It won't be as easy to use as XML::Twig: basically with XML::Twig you trade some speed (depending on what you do, from a factor of 3 to... none) for ease-of-use, but it will be easier IMHO than using SAX (albeit not standard), and at this point a LOT faster (see the last test in http://www.xmltwig.org/article/simple_benchmark/).