XML, the Perl Way


NAME

XML::Twig - A perl module for processing huge XML documents in tree mode.

SYNOPSIS

Small documents

  my $twig=XML::Twig->new();    # create the twig
  $twig->parsefile( 'doc.xml'); # build it
  my_process( $twig);           # use twig methods to process it 
  $twig->print;                 # output the twig

Huge documents

  my $twig=XML::Twig->new(   
    twig_handlers => 
      { title   => sub { $_->set_gi( 'h2') }, # change title tags to h2
        para    => sub { $_->set_gi( 'p')  }, # change para to p
        hidden  => sub { $_->delete;       }, # remove hidden elements
        list    => \&my_list_process,         # process list elements
        div     => sub { $_[0]->flush;     }, # output and free memory
      },
    pretty_print => 'indented',               # output will be nicely formatted
    empty_tags   => 'html',                   # outputs <empty_tag />
                         );
  $twig->parsefile( 'doc.xml');               # parse, calling the handlers
  $twig->flush;                               # flush the end of the document

See XML::Twig 101 for other ways to use the module, for example as a filter.

DESCRIPTION

This module provides a way to process XML documents. It is built on top of XML::Parser.

The module offers a tree interface to the document, while allowing you to output the parts of it that have been completely processed.

It allows minimal resource (CPU and memory) usage by building the tree only for the parts of the document that need actual processing, through the use of the twig_roots and twig_print_outside_roots options. The finish and finish_print methods also help to improve performance.

XML::Twig tries to make simple things easy, so it does its best to take care of a lot of the (usually) annoying (but sometimes necessary) features that come with XML and XML::Parser.

XML::Twig 101

XML::Twig can be used either on "small" XML documents (that fit in memory) or on huge ones, by processing parts of the document and outputting or discarding them once they are processed.

Loading an XML document and processing it

        my $t= XML::Twig->new();
        $t->parse( '<d><tit>title</tit><para>para1</para><para>p2</para></d>');
        my $root= $t->root;
        $root->set_gi( 'html');                  # change doc to html
        my $title= $root->first_child( 'tit');   # get the title
        $title->set_gi( 'h1');                   # turn it into h1
        my @para= $root->children( 'para');      # get the para children
        foreach my $para (@para)
          { $para->set_gi( 'p'); }               # turn them into p
        $t->print;                               # output the document

Other useful methods include:

att: $elt->{'att'}->{'type'} returns the type attribute for an element,

set_att: $elt->set_att( type => "important") sets the type attribute to the value important,

next_sibling: $elt->{next_sibling} returns the next sibling in the document (in the example $title->{next_sibling} is the first para), while $elt->next_sibling( 'table') is the next table sibling.
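As a minimal sketch of these accessors, the snippet below reuses the small document from the example above, with a type attribute and an empty table element added purely for illustration. It uses the att method rather than direct hash access; both read the same attribute.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
# illustrative document: the type attribute and <table/> are assumptions
$t->parse( '<d><tit type="main">title</tit><para>para1</para><table/><para>p2</para></d>');
my $title= $t->root->first_child( 'tit');

my $type= $title->att( 'type');               # read the type attribute
$title->set_att( type => 'important');        # overwrite it
my $next=  $title->next_sibling;              # next sibling, whatever it is
my $table= $title->next_sibling( 'table');    # next sibling that is a table

print join( ' ', $type, $title->att( 'type'), $next->gi, $table->gi), "\n";
```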

The document can also be transformed through the use of the cut, copy, paste and move methods, for example: $title->cut; $title->paste( 'after', $p);

And much, much more: see XML::Twig::Elt.
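A short sketch of cut and paste on the same small document follows. Here $p is assumed to be the first para child, and the order of the children is checked through their tag names.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<d><tit>title</tit><para>para1</para><para>p2</para></d>');
my $root=  $t->root;
my $title= $root->first_child( 'tit');
my $p=     $root->first_child( 'para');   # assumed target: the first para

$title->cut;                              # detach the title from the tree
$title->paste( 'after', $p);              # re-attach it right after $p

my $order= join ' ', map { $_->gi } $root->children;
print "$order\n";                         # tag names of the children, in order
```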

Processing an XML document chunk by chunk

One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor often being around 10).

To do this you can define handlers that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit, using the navigation and the cut-n-paste methods, plus lots of convenient ones like prefix. Once the element is completely processed you can then flush it, which will output it and free the memory. You can also purge it if you don't need to output it (if you are just extracting some data from the document for example). The handler will be called again once the next relevant element has been parsed.

        my $t= XML::Twig->new( twig_handlers => 
                                { section => \&section,
                                  para    => sub { $_->set_gi( 'p'); },
                                },
                             );
        $t->parsefile( 'doc.xml');
        $t->flush; # don't forget to flush one last time in the end or anything
                   # after the last </section> tag will not be output 

        # the handler is called once a section is completely parsed, ie when 
        # the end tag for section is found, it receives the twig itself and
        # the element (including all its sub-elements) as arguments
        sub section 
          { my( $t, $section)= @_;      # arguments for all twig_handlers
            $section->set_gi( 'div');   # change the gi, my favourite method...
            # let's use the attribute nb as a prefix to the title
            my $title= $section->first_child( 'title'); # find the title
            my $nb= $title->{'att'}->{'nb'}; # get the attribute
            $title->prefix( "$nb - ");  # easy isn't it?
            $section->flush;            # outputs the section and frees memory
          }
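
When nothing needs to be output, purge can replace flush. The sketch below only collects section titles; the inline document and the @titles array are assumptions made for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @titles;                                   # collected data (hypothetical name)
my $t= XML::Twig->new( twig_handlers =>
  { section => sub { my( $t, $section)= @_;
                     my $title= $section->first_child( 'title');
                     push @titles, $title->text if $title;
                     $t->purge;               # free the memory, output nothing
                   },
  });
$t->parse( '<doc><section><title>One</title><para>a</para></section>'
         . '<section><title>Two</title></section></doc>');

print "$_\n" for @titles;
```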

There is of course more to it: you can trigger handlers on more elaborate conditions than just the name of the element, section/title for example.

        my $t= XML::Twig->new( twig_handlers => 
                                { 'section/title' => \&print_elt_text } );
        $t->parsefile( 'doc.xml');

        sub print_elt_text 
          { my( $t, $elt)= @_;
            print $elt->text; 
          }

A handler can also be triggered by a test on an attribute, here on sections whose level attribute is 1:

        my $t= XML::Twig->new( twig_handlers => 
                                { 'section[@level="1"]' => \&print_elt_text }
                             );
        $t->parsefile( 'doc.xml');

You can also use twig_start_handlers to process an element as soon as the start tag is found. Besides prefix you can also use suffix.
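
To illustrate the last two points, here is a hedged sketch combining a start handler with suffix. The id attribute, the inline document and the "(draft)" marker are invented for the example. In a start handler the element has its attributes but no content yet.

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;                                     # ids recorded at the start tags
my $t= XML::Twig->new(
  twig_start_handlers =>
    { section => sub { my( $t, $elt)= @_;
                       push @seen, $elt->att( 'id');   # attributes are available
                     },
    },
  twig_handlers =>
    { title => sub { $_->suffix( ' (draft)'); } },     # append text to the title
);
$t->parse( '<doc><section id="s1"><title>Intro</title></section>'
         . '<section id="s2"><title>End</title></section></doc>');

my @titles= map { $_->text } $t->root->descendants( 'title');
print join( ', ', @seen), "\n";
print join( ', ', @titles), "\n";
```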

Processing just parts of an XML document

The twig_roots mode builds only the required sub-trees from the document. Anything outside of the twig roots will just be ignored:

        my $t= XML::Twig->new( 
                 # the twig will include just the root and selected titles 
                 twig_roots   => { 'section/title' => \&print_elt_text,
                                   'annex/title'   => \&print_elt_text
                                 }
                             );
        $t->parsefile( 'doc.xml');

        sub print_elt_text 
          { my( $t, $elt)= @_;
            print $elt->text;    # print the text (including sub-element texts)
            $t->purge;           # frees the memory
          }

You can use that mode when you want to process parts of a document but are not interested in the rest, and you don't want to pay the price, either in time or memory, of building the tree for it.

Building an XML filter

You can combine the twig_roots and the twig_print_outside_roots options to build filters, which let you modify selected elements and will output the rest of the document as is.

This would convert prices in $ to prices in Euro in a document:

        my $t= XML::Twig->new( 
                 twig_roots   => { 'price' => \&convert, },    # process prices 
                 twig_print_outside_roots => 1,                # print the rest
                             );
        $t->parsefile( 'doc.xml');

        sub convert 
          { my( $t, $price)= @_;
            my $currency= $price->{'att'}->{'currency'};    # get the currency
            if( $currency eq 'USD')
              { my $usd_price= $price->text;                # get the price
                # %rate is just a conversion table 
                my $euro_price= $usd_price * $rate{usd2euro};
                $price->set_text( $euro_price);             # set the new price
                $price->set_att( currency => 'EUR');        # don't forget this!
              }
            $price->print;                                  # output the price
          }

Simplifying XML processing

METHODS

XML::Twig

A twig is a subclass of XML::Parser, so all XML::Parser methods can be called on a twig object, including parse and parsefile. setHandlers, on the other hand, cannot be used; see BUGS.

XML::Twig::Elt

cond

Most of the navigation functions accept a condition as an optional argument. The first element (or all elements for children or ancestors) that passes the condition is returned.
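
A small sketch of navigation with conditions follows. It only uses an element name and the attribute test form already seen in the handler examples; the document itself is made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<doc><section level="1"><title>A</title><para>p1</para><para>p2</para></section>'
         . '<section level="2"><title>B</title></section></doc>');
my $root= $t->root;

my $first=  $root->first_child( 'section');               # first section child
my @paras=  $first->children( 'para');                    # all para children
my $level2= $root->first_child( 'section[@level="2"]');   # attribute condition
my @anc=    $paras[0]->ancestors( 'section');             # section ancestors only

print scalar( @paras), " paras, level 2 title: ",
      $level2->first_child( 'title')->text, "\n";
```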

The condition can be

XML::Twig::Entity_list

XML::Twig::Entity

EXAMPLES

See the test files in t/test[1-n].t. Additional examples (and a complete tutorial) can be found on the XML::Twig Page.

To figure out what flush does, call the following script with an XML file and an element name as arguments:

  use XML::Twig;

  my ($file, $elt)= @ARGV;
  my $t= XML::Twig->new( twig_handlers => 
      { $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
  $t->parsefile( $file, ErrorContext => 2);
  $t->flush;
  print "\n";

NOTES

XML::Twig and various versions of Perl, XML::Parser and expat:

    XML::Twig is tested under the following environments:

Linux, perl 5.004_005, expat 1.95.2 and 1.95.5, XML::Parser 2.27 and 2.31

You cannot use the output_encoding option with perl 5.004_005.

Linux, perl 5.005_03, expat 1.95.2 and 1.95.5, XML::Parser 2.27 and 2.31

Linux, perl 5.6.1, expat 1.95.2 and 1.95.5, XML::Parser 2.27 and 2.31

Linux, perl 5.8.0, expat 1.95.2 and 1.95.5, XML::Parser 2.31

Parsing utf-8 Asian characters with perl 5.8.0 seems not to work (this is under investigation, and probably due to XML::Parser).

Windows NT 4.0 ActivePerl 5.6.1 build 631

You need nmake to make the module on Windows (or you can just copy Twig.pm to the appropriate directory).

Windows 2000 ActivePerl 5.6.1 build 633

XML::Twig does NOT work with expat 1.95.4 (upgrade to 1.95.5).

XML::Parser 2.27 does NOT work under perl 5.8.0, nor does XML::Twig.

DTD Handling

There are 3 possibilities here. They are:

Flush

If you set handlers and use flush, do not forget to flush the twig one last time AFTER the parsing, or you might be missing the end of the document.

Remember that element handlers are called when the element is CLOSED, so if you have handlers for nested elements the inner handlers will be called first. This makes it trickier than it might seem to number nested clauses, for example.
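
The calling order can be seen with a short sketch. The clause element and its id attribute are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @order;                                    # ids in the order handlers fire
my $t= XML::Twig->new( twig_handlers =>
  { clause => sub { my( $t, $clause)= @_;
                    push @order, $clause->att( 'id');
                  },
  });
$t->parse( '<doc><clause id="outer"><clause id="inner1"/><clause id="inner2"/></clause></doc>');

print join( ', ', @order), "\n";   # inner1, inner2, outer
```

The outer clause comes last because its end tag is the last one parsed. A handler set through twig_start_handlers fires at the start tag instead, so it sees the elements in document order, which may be a simpler place to do such numbering.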

BUGS

Globals

These are the things that can mess up calling code, especially if threaded. They might also cause problems under mod_perl.

TODO

BENCHMARKS

You can use the benchmark_twig file to do additional benchmarks. Please send me benchmark information for additional systems.

AUTHOR

Michel Rodriguez <mirod@xmltwig.com>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Bug reports should be sent using: RT

Comments can be sent to mirod@xmltwig.com

The XML::Twig page is at http://www.xmltwig.com/xmltwig/ and includes the development version of the module, a slightly better version of the documentation, examples, a tutorial and "Processing XML efficiently with Perl and XML::Twig".

SEE ALSO

XML::Parser, XML::Parser::Expat, Encode, Text::Iconv, Scalar::Util