XML::Twig FAQ

XML::Twig FAQ 1.7 2003-02-06 Michel Rodriguez

FAQ created by Michel Rodriguez

Thanks to the numerous users of XML::Twig for their questions and suggestions, and to Walter Pienciak for letting me mirror this FAQ on the IEEE website.

This FAQ contains information on XML::Twig, a Perl module used to process XML documents. Please direct all corrections and additions to mirod@xmltwig.com.

This FAQ can be found on the Web at www.xmltwig.com/xmltwig/faq.html.

Information in this FAQ is based mainly on questions sent to the Perl XML email list. To join, send an email to Lyris@ActiveState.com with the message: SUBSCRIBE Perl-XML.

This FAQ was generated using a Perl script (using XML::Twig ;--) and an XML file. The script is at http://www.xmltwig.com/xmltwig/twig_faq. The XML source is at http://www.xmltwig.com/xmltwig/faq.xml. To generate the XML::Twig FAQ, run twig_faq faq.xml which prints the HTML to STDOUT.

I know what a twig is but what is that XML thing anyway?

OK, time for a quick list of XML links:

Where can I get the latest version of XML::Twig?

The latest stable version:

The latest development version:

Where is the documentation?

Development version: html / text

Stable version: html / text

You can also type perldoc XML::Twig once you have installed the module, look at the XML::Twig Quick Reference, or go to xmltwig.com for more information, including a tutorial.

How is XML::Twig supported?

Twig is supported by email (mirod@xmltwig.com) and through the Perl-XML mailing list.

You are encouraged to report bugs using RT at rt.cpan.org.

Please send the following configuration information when you describe a bug:

OS,
version of perl (perl -v),
version of expat (see below),
version of XML::Parser (perl -MXML::Parser -le'print $XML::Parser::VERSION'),
version of XML::Twig (perl -MXML::Twig -le'print $XML::Twig::VERSION').
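
The Perl-side items of that list can be collected in one go. This is only a convenience sketch, not something shipped with XML::Twig: the %config hash and its labels are made up for the illustration, and the OS is simply what perl reports in $^O. The expat version still has to be found by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
use XML::Twig;

# everything perl itself can tell us about the configuration
my %config = (
    'OS'          => $^O,                     # OS name as seen by perl
    'perl'        => $],                      # numeric perl version
    'XML::Parser' => $XML::Parser::VERSION,
    'XML::Twig'   => $XML::Twig::VERSION,
);

printf "%-12s %s\n", $_, $config{$_} for sort keys %config;
```

Paste the output into the bug report, next to the expat version.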

Finding the version of expat that you are running can be a bit tricky, but it is important information. Here is how you can get it:

First, if you are using a version of XML::Parser lower than 2.30, then you don't need to mention expat's version: XML::Parser comes with its own version of expat. It is old though, so you might want to upgrade: first grab expat and install it, then install a recent version of XML::Parser.

If you are using XML::Parser 2.30 or above, run xmlwf -v. If you are lucky this will give you the version of expat. If xmlwf exists but does not like the -v option, then you are most likely running expat 1.95.2. If xmlwf is not installed on your system (which can be the case if you did not install expat yourself but use the one provided with your OS) then (on *nix) you can look for libexpat.so in your library path (using for example slocate libexpat.so): libexpat.so.1.0 is expat 1.95.2, libexpat.so.3.0 is expat 1.95.4 (in which case you should upgrade, as expat 1.95.4 is not compatible with XML::Twig), and libexpat.so.4.0 is expat 1.95.5 or 1.95.6.

This information will help me a lot in figuring out what causes the problem.

What is XML::Twig used for anyway?

I use XML::Twig for all sorts of XML processing: I use it to extract data from XML documents, to update documents from one DTD to another, to convert them to HTML and to extract/store/process data to and from various databases.

Why should I use XML::Twig?

The main purpose of XML::Twig is to allow you to process XML documents that might be too big to fit in memory (with XML::DOM for example). If that is your situation but you don't really like stream-oriented processing, then XML::Twig allows you to use a mixed stream/tree model, where you can process sub-documents as trees and then flush them to free the memory.

In addition it is designed to be easy to use, masking some of the most annoying quirks of XML and XML::Parser, such as whitespace management and encodings (see below).

The main drawback of XML::Twig is that it is not XML::DOM! It does not have a standard interface (feel free to add one ;--) nor does it interface with XML::SAX, although as of version 3.05 it does export SAX streams.

Using the twig_roots option also lets you process (using the tree interface) only the parts of the documents you are interested in, something that can speed up your scripts tremendously.

My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?

Yes, if you use the KeepEncoding option when you create a twig, all PCDATA (character data) will be returned as-is. Don't forget to give the encoding though, either in the XML declaration or when you create the twig, or the parser will die on you. You can also process your document as UTF-8 internally and use the output_encoding option (XML::Twig version 3.05 and above) to convert the output to your favourite encoding.

What's that whitespace management thing?

XML parsers are required by the standard to pass ALL data outside the markup to the calling application. Most of the time this is not desirable. By default XML::Twig discards those pesky \n (in fact XML::Twig discards all element contents that contain only whitespace). This can be changed at twig level.
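
A small sketch of the difference, for illustration only. The option name keep_spaces is not mentioned in this answer; it is taken from the XML::Twig documentation, and the sample document is made up.

```perl
use strict;
use warnings;
use XML::Twig;

# a document indented with whitespace-only text between the tags
my $xml = "<doc>\n  <elt>a</elt>\n</doc>";

# default behaviour: whitespace-only content is discarded
my $default = XML::Twig->new;
$default->parse( $xml);
my $discarded = $default->root->sprint;

# keep_spaces (name taken from the module documentation) keeps it
my $keeping = XML::Twig->new( keep_spaces => 1);
$keeping->parse( $xml);
my $kept = $keeping->root->sprint;

print "default:     [$discarded]\n";
print "keep_spaces: [$kept]\n";
```

The first twig prints the document on a single line, the second one gives back the original line breaks and indentation.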

What's the expansion factor from an XML document to a twig?

If you load the entire document in a twig the expansion factor is about 13 (the 900K file used for the benchmark takes about 11M). Of course if you flush the document as you're parsing then it will be much less!

I have that huge XML document, but I only want to extract information from a couple of elements, can XML::Twig help me there?

Oddly enough yes! Create the twig using the TwigRoots option (it can also be written twig_roots) and the tree will be built only for those elements.
Example: my $twig= XML::Twig->new( twig_roots => { info => \&process_info });
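
A slightly longer, self-contained version of that example might look like the sketch below. The document, the junk elements and the @found array are invented for the illustration; only the info elements are ever built as trees.

```perl
use strict;
use warnings;
use XML::Twig;

# made-up document: we only care about the info elements
my $xml = '<doc><junk>skip me</junk><info id="a">first</info>'
        . '<junk/><info id="b">second</info></doc>';

my @found;
my $twig = XML::Twig->new(
    twig_roots => {
        # called once for each complete info element
        info => sub {
            my( $t, $info) = @_;
            push @found, $info->att( 'id') . '=' . $info->text;
            $t->purge;    # free the memory used so far
            return 1;
        },
    },
);
$twig->parse( $xml);

print "$_\n" for @found;
```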

I process lots of XML documents in batch and there seems to be a memory leak in XML::Twig, any fix for that?

Yes, since version 3.00, XML::Twig has a dispose method that completely releases a twig. With earlier versions you can release it yourself by doing: undef $t->{twig}; undef $t->{twig_root}->{twig}; undef $t->{twig_parser};

The easiest method though, if you are using perl 5.6.0 or above, is to install the WeakRef module, which fixes the memory leak.
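
In a batch script, the dispose call goes at the end of each iteration. The loop below is only a sketch: the in-memory @documents list stands in for whatever files you actually process, and counting item elements is just a placeholder for the real work.

```perl
use strict;
use warnings;
use XML::Twig;

# stand-in for a batch of documents (normally files on disk)
my @documents = ( '<doc><item>1</item></doc>',
                  '<doc><item>2</item><item>3</item></doc>',
                );

my $items = 0;
foreach my $xml (@documents) {
    my $t = XML::Twig->new;
    $t->parse( $xml);
    my @children = $t->root->children( 'item');   # placeholder processing
    $items += @children;
    $t->dispose;    # release the twig before moving to the next document
}

print "$items items\n";
```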

How can I install XML::Twig on Windows?

XML::Twig might be available as a ppm, either from ActiveState or from another repository (see Using PPM to install modules for more information about ppm and for a list of repositories).

If it is not available, or if you want to use the development version, you can just uncompress the distribution file (XML-Twig-x.xx.tar.gz) and copy Twig.pm into the C:\Perl\site\lib\xml directory, alongside Parser.pm. Of course if you use Cygwin you can install the module with the usual perl Makefile.PL; make; make test; make install incantation. You might need to download nmake.

For logging purposes I would like XML::Twig to report line/column number in the original file

Use start_tag_handlers to grab the line and column number through the parser object and store them in private attributes (attributes whose name starts with a # are not output by XML::Twig):

#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $t=XML::Twig->new( start_tag_handlers =>
                       { # called when the start tag for elt is parsed
                         # use '#ELT' or _all_ to call the handler for all elements
                         elt => sub { my( $t, $elt)= @_;
                                      $elt->set_att( '#line' => $t->current_line);
                                    },
                       },
                      twig_handlers =>
                       { # called when elt is completely parsed
                         elt => sub { my( $t, $elt)= @_;
                                      print "error in elt starting line ",
                                            $elt->att( '#line'), "\n"
                                        if( $elt->has_child( 'subelt[@error]'));
                                    },
                       },
                     );
$t->parsefile( "test_track_line_number.xml");

will parse test_track_line_number.xml that looks like:

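<doc>
  <elt>
    <subelt>text 1</subelt>
    <subelt>text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
  <elt>
    <subelt>text 1</subelt>
    <subelt error="1">text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
</doc>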

and will output: error in elt starting line 7
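
The same technique can store the column as well. The variant below is a sketch under a few assumptions: the document is inlined so the script runs on its own, the '#col' attribute name and the @report array are invented here, and the column is whatever the parser's current_column method reports for the start tag.

```perl
use strict;
use warnings;
use XML::Twig;

# made-up document, the second elt (line 5) contains the error
my $xml = <<'XML';
<doc>
  <elt>
    <subelt>fine</subelt>
  </elt>
  <elt>
    <subelt error="1">broken</subelt>
  </elt>
</doc>
XML

my @report;
my $t = XML::Twig->new(
    start_tag_handlers => {
        # store the position in private attributes (not output by XML::Twig)
        elt => sub {
            my( $t, $elt) = @_;
            $elt->set_att( '#line' => $t->current_line);
            $elt->set_att( '#col'  => $t->current_column);
            return 1;
        },
    },
    twig_handlers => {
        # the whole elt is available now, check it and report
        elt => sub {
            my( $t, $elt) = @_;
            push @report, sprintf( 'line %d, column %d',
                                   $elt->att( '#line'), $elt->att( '#col'))
              if( $elt->has_child( 'subelt[@error]'));
            return 1;
        },
    },
);
$t->parse( $xml);

print "error in elt starting at $_\n" for @report;
```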

How do I include bits of (possibly not well-formed) HTML in an XML document and use them to generate HTML?

You can wrap the HTML in a CDATA section, which will prevent the parser from looking into the data. Then use a twig_handler on CDATA to process those sections. Use the set_asis method to get those sections to be output without being "XML escaped" (XML::Twig 3.05 and above).

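#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $t= XML::Twig->new( twig_handlers => { '#CDATA' => sub { $_->set_asis; } });
$t->parse( \*DATA);
$t->print;

__DATA__
<doc>
  <elt>text</elt>
  <!-- the HTML is wrapped in a CDATA section -->
  <elt><![CDATA[<p>hello<br>
world]]]]><![CDATA[>]]></elt>
</doc>

will output (comment stripped for conciseness):

<doc><elt>text</elt><elt><p>hello<br>
world]]></elt></doc>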

Note that the CDATA section will not protect you from encoding problems, so if the included text is likely to be in a different encoding than the main document you will have to do some encoding conversion before including it.

In which order are handlers called?

I have this simple Perl script that parses an XML document. The XML document uses the following DTD:

I've noticed the following: although the element 'doc' is the root, XML::Twig calls its handler last. All the elements 'title' and 'elt' are processed in correct sequence. Why? The element 'doc' handler should be called first and not last.

Is the element's handler called on the opening tag OR on the closing tag?

Element handlers are called on the closing tag, as it is the only time when the entire element has been parsed: the handler is called as soon as the end tag of the element has been parsed.

This indeed causes the handlers for the inner elements to be called before the ones for the outer elements: here the handler on 'doc' will be called after the handlers on 'title' and 'elt'.

This example will show you in which order the handlers are called:

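#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $t= XML::Twig->new( twig_handlers => { '_all_' => sub { print "handler for ", $_->att( 'id'), "\n"; } },
                       error_context => 1,
                     );
$t->parse( \*DATA);

__DATA__
<doc id="doc">
  <title id="title">title</title>
  <elt id="elt1">
    <subelt id="subelt1">subelt</subelt>
    <subelt id="subelt2">subelt</subelt>
  </elt>
  <elt id="elt2">element 2</elt>
</doc>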
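
If you need something to happen on the opening tag, start_tag_handlers is the place for it. The sketch below records both kinds of calls so the order becomes visible; the inline document and the @events array are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @events;
my $t = XML::Twig->new(
    # called when the start tag has been parsed
    start_tag_handlers => {
        _all_ => sub { my( $t, $elt) = @_;
                       push @events, 'start ' . $elt->tag;
                       return 1;
                     },
    },
    # called when the element has been completely parsed
    twig_handlers => {
        _all_ => sub { my( $t, $elt) = @_;
                       push @events, 'end ' . $elt->tag;
                       return 1;
                     },
    },
);
$t->parse( '<doc><title>t</title><elt><subelt>s</subelt></elt></doc>');

print "$_\n" for @events;
```

The 'start doc' line comes first and the 'end doc' line comes last, with the inner elements in between.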

Any neat trick to increase the performance of XML::Twig?

Tom Anderson from tomacorp released an interesting article: Performance Comparison Between SAX XML::Filter::Dispatcher and XML::Twig. He notes:

I learned an interesting performance optimization when writing the anonymous subs for XML::Twig. These subs should not uselessly return a long string. Processing this string can increase processing time by 50% in this example. This is why the start_tag_handlers return the value 1

Using this trick led to a 4x speedup on my first attempt at speeding up Tom's example!

Thanks Tom!
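
In practice this means ending each handler with an explicit short return value, so that the last expression evaluated (possibly a big string) is not what the sub returns. A minimal sketch, with a made-up document and a made-up $count variable:

```perl
use strict;
use warnings;
use XML::Twig;

my $count = 0;
my $t = XML::Twig->new(
    twig_handlers => {
        elt => sub {
            my( $t, $elt) = @_;
            $count++;
            my $text = $elt->sprint;   # potentially a long string
            $t->purge;
            return 1;                  # return a short value, not the string
        },
    },
);
$t->parse( '<doc><elt>a</elt><elt>b</elt><elt>c</elt></doc>');

print "$count elements processed\n";
```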

I seem to be having a spot of trouble getting XML::Twig 3.08 to compile and install on a SuSE 8.1/RedHat 8.0 system.

Here is the result of make test:



  toto &ent1;
============^
  tata &ent2;
  tutu &ent3;
 at /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 185
t/test_entities...........dubious                                            
       Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-6
        Failed 6/6 tests, 0.00% okay
[...]
t/test_spaces.............dubious                                            
        Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-3
        Failed 3/3 tests, 0.00% okay
t/test_twig_roots.........ok
t/test_xpath_cond.........ok
Failed Test       Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/test_entities.t  255 65280     6    6 100.00%  1-6
t/test_spaces.t    255 65280     3    3 100.00%  1-3
Failed 2/18 test scripts, 88.89% okay. 9/400 subtests failed, 97.75% okay.
make: *** [test_dynamic] Error 29

The problem is (probably, I don't use those distributions) an incompatibility between XML::Twig and the version of the libexpat library that comes with RedHat 8.0 / SuSE 8.1. If you upgrade to XML::Twig 3.08 and to libexpat 1.95.5 you should not get the problem anymore.

You can get the latest version of libexpat on sourceforge: http://expat.sourceforge.net/

I need to process XML documents. The problem is that there are several of them, so the parser dies after the first one, with a message telling me that there is junk after the end of the document. Is there any way I could trick the parser into believing they are all part of a single document?

You can open the input file as a pipe, first echo-ing an open tag, then getting the input from wherever you get it, then echo-ing a close tag:

#!/usr/bin/perl -w
use strict;
use XML::Twig;

# the command that outputs the documents, one after the other
my $xml_generator= q{echo '<doc>doc1</doc><doc>doc2</doc>'};
my $wrap= 'docs';

# this is where it all happens:
# the pipe at the end of the "file name" means that the name is a
# shell command, that will be executed then piped to the filehandle
open( IN, qq{echo '<$wrap>'; $xml_generator; echo '</$wrap>' |})
  or die "error opening xml_generator: $!";

my $i=1;
my $t= XML::Twig->new( twig_handlers => {
                         doc => sub { print "document $i: ", $_->sprint, "\n";
                                      $_[0]->purge; # to get the memory back
                                      $i++;
                                    }
                                        },
                     );
$t->parse( \*IN);
close IN or die "error during the execution of xml_generator: $!";
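
The pipe needs a shell. Where there is none, and when the documents are small enough to be held in memory, the same wrapping can be done in Perl. This is only a sketch of that alternative, not the method described above: @fragments stands in for the documents you read, @seen is invented for the illustration, and the fragments must not carry their own XML declaration.

```perl
use strict;
use warnings;
use XML::Twig;

# stand-in for documents read from files or from another program
my @fragments = ( '<doc>doc1</doc>', '<doc>doc2</doc>');
my $wrap = 'docs';

my @seen;
my $t = XML::Twig->new(
    twig_handlers => {
        doc => sub {
            my( $t, $doc) = @_;
            push @seen, $doc->text;
            $t->purge;    # to get the memory back
            return 1;
        },
    },
);

# wrap all the documents in a single root element before parsing
$t->parse( "<$wrap>" . join( '', @fragments) . "</$wrap>");

print "document $_: $seen[$_ - 1]\n" for 1 .. @seen;
```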