XML, the Perl Way

XML::Twig FAQ

Version 1.12 - 2006-02-07

by Michel Rodriguez

Credits

FAQ created by Michel Rodriguez

Thanks to the numerous users of XML::Twig for their questions and suggestions, and to Walter Pienciak for letting me mirror this FAQ on the IEEE website.


Overview

This FAQ contains information on XML::Twig, a Perl module used to process XML documents. Please direct all corrections and additions to mirod@cpan.org.

This FAQ can be found on the Web at www.xmltwig.org/xmltwig/faq.html.

Information in this FAQ is based mainly on questions posted to the Perl XML email list. To join, send an email to Lyris@ActiveState.com with the message: SUBSCRIBE Perl-XML.

This FAQ was generated using a Perl script (using XML::Twig ;--) and an XML file. The script is at xmltwig.org/xmltwig/twig_faq. The XML source is at http://www.xmltwig.org/xmltwig/faq.xml. To generate the XML::Twig FAQ, run twig_faq faq.xml which prints the HTML to STDOUT.


Content


Q1: I know what a twig is but what is that XML thing anyway?

Answer: OK, time for a quick list of XML links:


Q2: Where can I get the latest version of XML::Twig?

Answer: The latest stable version:

The latest development version:


Q3: Where is the documentation?

Answer: Development version: html / text

Stable version: html / text

You can also type perldoc XML::Twig once you have installed the module, look at the XML::Twig Quick Reference, or go to xmltwig.org for more information, including a tutorial.


Q4: How is XML::Twig supported?

Answer: Twig is supported through email mirod@cpan.org and through the Perl-XML mailing list.

You are encouraged to report bugs using RT at rt.cpan.org.

Please send the following configuration information when you describe a bug:

Finding the version of expat that you are running can be a bit tricky, but it is important information. Here is how you can get it:

First, if you are using a version of XML::Parser lower than 2.30, then you don't need to mention expat's version: XML::Parser comes with its own version of expat (it is old though, so you might want to upgrade: first grab expat and install it, then install a recent version of XML::Parser).

If you are using XML::Parser 2.30 or above, run xmlwf -v. If you are lucky this will give you the version of expat. If xmlwf exists but does not like the -v option, then you are most likely running expat 1.95.2. If xmlwf is not installed on your system (which can be the case if you did not install expat yourself but use the one provided with your OS) then (on *nix) you can look for libexpat.so in your library path (using for example slocate libexpat.so). libexpat.so.1.0 is expat 1.95.2, libexpat.so.3.0 is expat 1.95.4 (in which case you should upgrade, as expat 1.95.4 is not compatible with XML::Twig), and libexpat.so.4.0 is expat 1.95.5 or 1.95.6.

This information will help me a lot in figuring out what causes the problem.


Q5: What is XML::Twig used for anyway?

Answer: I use XML::Twig for all sorts of XML processing: I use it to extract data from XML documents, to update documents from one DTD to another, to convert them to HTML and to extract/store/process data to and from various databases.


Q6: Why should I use XML::Twig?

Answer: The main purpose of XML::Twig is to allow you to process XML documents that might be too big to fit in memory (with XML::DOM for example). If you are in that case but don't really like stream oriented processing, then XML::Twig allows you to use a mixed stream/tree model, where you can process sub-documents as trees and then flush them to free the memory.

In addition it is designed to be easy to use, masking some of the most annoying quirks of XML and XML::Parser, such as whitespace management and encodings (see below).

The main drawback of XML::Twig is that it is not XML::DOM! It does not have a standard interface (feel free to add one ;--) nor does it interface with XML::SAX, although as of version 3.05 it does export SAX streams.

Using the twig_roots option also lets you process (using the tree interface) only the parts of the documents you are interested in, something that can speed up your scripts tremendously.
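The mixed stream/tree model described above can be sketched as follows. This is only a minimal illustration: the catalog and record element names and the @seen array are made up for the example, and the document is parsed from a string so the script is self-contained.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: a <catalog> holding many <record> elements.
my $xml = '<catalog><record id="r1">one</record><record id="r2">two</record></catalog>';

my @seen;    # ids collected by the handler

my $twig = XML::Twig->new(
    twig_handlers => {
        # called each time a <record> element has been completely parsed
        record => sub {
            my ( $t, $record ) = @_;
            push @seen, $record->att('id');    # work on the small tree
            $t->purge;                         # then release what has been parsed so far
        },
    },
);
$twig->parse($xml);

print join( ', ', @seen ), "\n";
```

Each record is available as a tree inside the handler, and purge discards it afterwards, so memory use should stay bounded by the size of one record rather than the whole document. Use flush instead of purge if you also want the processed part printed.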


Q7: What are the alternatives to XML::Twig?

Answer: The Perl-XML FAQ lists quite a few other modules that can be used to process XML.

When deciding which module to choose for any slightly complex processing of XML, I would advise you to also have a look at XML::LibXML. Here is a quick comparison of the 2 modules.

XML::LibXML, actually libxml2 on which it is based, sticks to the standards, and implements a good number of them in a rather strict way: XML, XPath, DOM, RelaxNG, I must be forgetting a couple (XInclude?). It is fast and rather frugal memory-wise.

XML::Twig is older: when I started writing it XML::Parser/expat was the only game in town. It implements XML and that's about it (plus a subset of XPath, and you can use XML::Twig::XPath if you have XML::XPath installed for full support). It is slower and requires more memory for a full tree than XML::LibXML. On the plus side (yes, there is a plus side!) it lets you process a big document in chunks, and thus lets you tackle documents that couldn't be loaded in memory by XML::LibXML, and it offers a lot (and I mean a LOT!) of higher-level methods, for everything, from adding structure to "low-level" XML, to shortcuts for XHTML conversions and more. It also DWIMs quite a bit, getting comments and non-significant whitespace out of the way but preserving them in the output for example. As it does not stick to the DOM, it also usually leads to shorter code than in XML::LibXML.

Beyond the pure features of the 2 modules, XML::LibXML seems to be preferred by "XML-purists", while XML::Twig seems to be more used by Perl Hackers who have to deal with XML. As you have noted, XML::Twig also comes with quite a lot of docs, but I am sure if you ask for help about XML::LibXML here or on Perlmonks you will get answers.

Note that it is actually quite hard for me to compare the 2 modules: on one hand I know XML::Twig inside-out and I can get it to do pretty much anything I need to (or I improve it ;--), while I have a very basic knowledge of XML::LibXML. So feature-wise, I'd rather use XML::Twig ;--). On the other hand, I am painfully aware of some of the deficiencies, potential bugs and plain ugly code that lurk in XML::Twig, even though you are unlikely to be affected by them (unless for example you need to change the DTD of a document programmatically), while I haven't looked much into XML::LibXML so it still looks shiny and clean to me.

That said, if you need to process a document that is too big to fit in memory and XML::Twig is too slow for you, my reluctant advice would be to use "bare" XML::Parser. It won't be as easy to use as XML::Twig: basically with XML::Twig you trade some speed (depending on what you do, from a factor of 3 to... none) for ease of use. But it will be easier IMHO than using SAX (albeit not standard), and at this point a LOT faster (see the last test in simple benchmark).


Q8: My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?

Answer: Yes. If you use the KeepEncoding option when you create a twig, all PCDATA (character data) will be returned as-is. Don't forget to use an encoding declaration in the XML declaration or in the twig creation though, or the parser will die on you. You can also process your document as UTF-8 internally and use the output_encoding option (XML::Twig version 3.05 and above) to convert the output to your favourite encoding.
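Here is a small sketch of the second approach (UTF-8 inside, another encoding on output). The sample document and the choice of ISO-8859-1 are just assumptions for the example; the script needs a perl with the Encode module (or one of the other conversion modules XML::Twig knows about).

```perl
use strict;
use warnings;
use XML::Twig;

# The input is declared as UTF-8; &#233; is an e with an acute accent.
my $xml = '<?xml version="1.0" encoding="UTF-8"?><doc>caf&#233;</doc>';

# output_encoding asks XML::Twig to convert the output and to set
# the encoding in the XML declaration accordingly
my $twig = XML::Twig->new( output_encoding => 'ISO-8859-1' );
$twig->parse($xml);

my $out = $twig->sprint;    # the whole document, XML declaration included
print $out, "\n";
```

Inside the handlers (if you had any) the text would still be UTF-8, so only the output is affected.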


Q9: What's that whitespace management thing?

Answer: XML parsers are required by the standard to pass ALL data outside the markup to the calling application. Most of the time this is not desirable. By default XML::Twig discards those pesky \n (in fact XML::Twig discards all element contents that contain only whitespace). This can be changed at twig level.
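To see the difference, here is a minimal sketch that parses the same (made-up) document twice, once with the default settings and once with the keep_spaces option:

```perl
use strict;
use warnings;
use XML::Twig;

# A small indented document: the line breaks and spaces between the
# tags are whitespace-only text.
my $xml = "<doc>\n  <elt>text</elt>\n</doc>";

# default behaviour: whitespace-only content is discarded
my $default = XML::Twig->new;
$default->parse($xml);
my $default_out = $default->sprint;

# keep_spaces => 1: all whitespace is kept in the tree
my $kept = XML::Twig->new( keep_spaces => 1 );
$kept->parse($xml);
my $kept_out = $kept->sprint;

print "default:     $default_out\n";
print "keep_spaces: $kept_out\n";
```

With the defaults the document comes out on a single line; with keep_spaces it comes out as it went in.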


Q10: What's the expansion factor from an XML document to a twig?

Answer: If you load the entire document in a twig the expansion factor is about 13 (the 900K file used for the benchmark takes about 11M). Of course if you flush the document as you're parsing then it will be much less!


Q11: I have that huge XML document, but I only want to extract information from a couple of elements, can XML-Twig help me there?

Answer: Oddly enough yes! Create the twig using the twig_roots option and the tree will be built only for those elements.
Example:

    my $twig= XML::Twig->new( twig_roots => { info => \&process_info });



Q12: I process lots of XML documents in batch and there seems to be a memory leak in XML::Twig, any fix for that?

Answer: Yes, since version 3.00, XML::Twig has a dispose method that completely releases a twig. With earlier versions you can release it yourself by doing:

    undef $t->{twig};
    undef $t->{twig_root}->{twig};
    undef $t->{twig_parser};
    

The easiest method though, if you are using perl 5.6.0 and above, is to install the WeakRef module, which fixes the memory leak.
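For a batch job, the dispose approach might look like the sketch below. The two in-line documents and the n element are placeholders; in real life you would loop over file names and call parsefile.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical batch: two tiny documents stand in for a list of files.
my @docs = ( '<doc><n>1</n></doc>', '<doc><n>2</n></doc>' );

my @values;
foreach my $xml (@docs) {
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    push @values, $twig->root->first_child_text('n');
    $twig->dispose;    # release the twig before moving on to the next document
}

print join( ', ', @values ), "\n";
```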


Q13: How can I install XML::Twig on Windows?

Answer: XML::Twig might be available as a ppm either from Activestate or from another repository (see Using PPM to install modules for more information about ppm and for a list of repositories).

If it is not available, or if you want to use the development version, you can just uncompress the distribution file (XML-Twig-x.xx.tar.gz) and copy the Twig.pm in the C:\Perl\site\lib\xml directory, alongside Parser.pm. Of course if you use Cygwin you can install the module with the usual perl Makefile.PL; make; make test; make install incantation. You might need to download nmake.

Alternatively, KobeSearch lists PPMs for the module.


Q14: I am having a problem installing MythTV on RedHat 9.0:

When I attempt to do an install XML::Twig in CPAN, it goes through its install, but then states: Weak references are not implemented

Answer: You need to upgrade the Scalar::Util module, from CPAN. Then re-run the install from scratch (doing the perl Makefile.PL; make; make test; make install dance, or cleaning up the CPAN/CPANPLUS cache, I suspect you have to exit the shell and launch it again for this to work).


Q15: I seem to be having a spot of trouble getting XML::Twig 3.08 to compile and install on a SuSE 8.1/RedHat 8.0 system.

Here is the result of make test:



  toto &ent1;
============^
  tata &ent2;
  tutu &ent3;
 at /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 185
t/test_entities...........dubious                                            
       Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-6
        Failed 6/6 tests, 0.00% okay
[...]
t/test_spaces.............dubious                                            
        Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-3
        Failed 3/3 tests, 0.00% okay
t/test_twig_roots.........ok
t/test_xpath_cond.........ok
Failed Test       Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/test_entities.t  255 65280     6    6 100.00%  1-6
t/test_spaces.t    255 65280     3    3 100.00%  1-3
Failed 2/18 test scripts, 88.89% okay. 9/400 subtests failed, 97.75% okay.
make: *** [test_dynamic] Error 29

Answer: The problem is an incompatibility between XML::Twig and the version of the libexpat library that comes with RH 8.0 / SuSE 8.1 (1.95.4). If you upgrade to XML::Twig 3.08 or later and to the latest version of libexpat you should not get the problem anymore.

You can get the latest version of libexpat on sourceforge: http://expat.sourceforge.net/


Q16: Setting $SIG{__DIE__} breaks parse()

The problem can be narrowed down to:

parse('');

Answer: This is a bug in XML::Parser. Upgrading to XML::Parser 2.34 or above solves the problem. See the bug report on RT.


Q17: It looks like I can only print a twig (or an element) to STDOUT, how do I redirect the output to a file?

Answer: You can pass a filehandle to print:

  open( FH, ">output.xml") or die "cannot open output.xml: $!";
  $twig->print( \*FH);
  close FH or die "cannot close output.xml: $!";


Q18: For logging purposes I would like XML::Twig to report line/column number in the original file

Answer: Use start_tag_handlers to grab the line and column number through the parser object and store them in private attributes (attributes whose name starts with a # are not output by XML::Twig):

#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $t=XML::Twig->new( start_tag_handlers =>
                       { # called when the start tag for elt is parsed
                         # use '#ELT' or _all_ to call the handler for all elements
                         elt => sub { my( $t, $elt)= @_;
                                      $elt->set_att( '#line' => $t->current_line);
                                    },
                       },
                      twig_handlers =>
                       { # called when elt is completely parsed
                         elt => sub { my( $t, $elt)= @_;
                                      print "error in elt starting line ",
                                            $elt->att( '#line'), "\n"
                                        if( $elt->has_child( 'subelt[@error]'));
                                    },
                       },
                     );
$t->parsefile( "test_track_line_number.xml");

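will parse test_track_line_number.xml that looks like:

<doc>
  <elt>
    <subelt>text 1</subelt>
    <subelt>text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
  <elt>
    <subelt>text 1</subelt>
    <subelt error="1">text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
</doc>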

and will output: error in elt starting line 7


Q19: How do I include bits of (possibly not well-formed) HTML in an XML document and use them to generate HTML?

Answer: You can wrap the HTML in a CDATA section, which will prevent the parser from looking into the data. Then use a twig_handler on CDATA to process those sections. Use the set_asis method to get those sections to be output without being "XML escaped" (XML::Twig 3.05 and above).

  use XML::Twig;

  my $t= XML::Twig->new( twig_handlers => { '#CDATA' => sub { $_->set_asis; } });
  $t->parse( \*DATA);
  $t->print;

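  __DATA__
  <doc>
    <elt>text</elt>
    <!-- a comment -->
    <elt><![CDATA[<p>hello<br>
world]]></elt>
  </doc>

will output (comment stripped for conciseness):

<doc><elt>text</elt><elt><p>hello<br>
world</elt></doc>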

Note that the CDATA section will not protect you from encoding problems, so if the included text is likely to be in a different encoding than the main document you will have to do some encoding conversion before including it.


Q20: In which order are handlers called?

I have this simple Perl script that parses an XML document. The XML document uses the following DTD:




I've noticed the following: although the element 'doc' is the root, XML::Twig calls its handler last. All the elements 'title' and 'elt' are processed in correct sequence. Why? The element 'doc' handler should be called first and not last.

Is the element's handler called on the opening tag OR on the closing tag?

Answer: Element handlers are called on the closing tag, as it is the only time when the entire element has been parsed. The handler is called as soon as the element has been completely parsed, which is when its end tag has been parsed.

This indeed leads to the handlers for the inner elements being called before the ones for the outer elements: here the handler on 'doc' will be called after the handlers on 'title' and 'elt'.

This example will show you in which order the handlers are called:

use XML::Twig;

my $t= XML::Twig->new( twig_handlers => { '_all_' => sub { print "handler for ", $_->att( 'id'); } },
                       error_context => 1,
                     );
$t->parse( \*DATA);


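__DATA__
<doc id="doc">
  <title id="title">title</title>
  <elt id="elt1">
    <subelt id="subelt1">subelt</subelt>
    <subelt id="subelt2">subelt</subelt>
  </elt>
  <elt id="elt2">element 2</elt>
</doc>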

Q21: Any neat trick to increase the performance of XML::Twig?

Answer: Tom Anderson from tomacorp released an interesting article: Performance Comparison Between SAX XML::Filter::Dispatcher and XML::Twig. He notes:

I learned an interesting performance optimization when writing the anonymous subs for XML::Twig. These subs should not uselessly return a long string. Processing this string can increase processing time by 50% in this example. This is why the start_tag_handlers return the value 1

Using this trick led to a 4x speedup on my first attempt at speeding up Tom's example!

Thanks Tom!
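In practice the trick just means ending the handler with a short value. A minimal sketch (the list and item element names and the counter are made up, and the script does not measure anything by itself):

```perl
use strict;
use warnings;
use XML::Twig;

my $count = 0;

my $twig = XML::Twig->new(
    start_tag_handlers => {
        item => sub {
            $count++;
            1;    # return a short value instead of whatever the last expression gave
        },
    },
);
$twig->parse('<list><item/><item/><item/></list>');

print "$count items\n";
```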


Q22: I need to process XML documents. The problem is that there are several of them, so the parser dies after the first one, with a message telling me that there is junk after the end of the document. Is there any way I could trick the parser into believing they are all part of a single document?

Answer: You can open the input file as a pipe, first echo-ing an open tag, then getting the input from wherever you get it, then echo-ing a close tag:

use XML::Twig;

my $xml_generator= q{echo '<doc>doc1</doc><doc>doc2</doc>'};
my $wrap= 'docs';

# this is where it all happens:
# the pipe at the end of the "file name" means that the name is a
# shell command, that will be executed then piped to the filehandle
open( IN, qq{echo '<$wrap>'; $xml_generator; echo '</$wrap>' |})
  or die "error opening xml_generator: $!";

my $i=1;
my $t= XML::Twig->new( twig_handlers => {
                         doc => sub { print "document $i: ", $_->sprint, "\n"; 
                                      $_[0]->purge; # to get the memory back
                                      $i++;
                                    }
                                        },
                     );
$t->parse( \*IN);
close IN or die "error during the execution of xml_generator: $!";

Q23: How to stop processing the document when a certain condition is met?

Answer: There are 2 ways to do this:

Update: there is now a third method: $twig->finish_now is, as you might have guessed, a little more imperative than finish. While finish still finishes parsing the XML, and dies if it isn't well-formed, finish_now just aborts the parsing and returns right away.
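A minimal sketch of finish_now, assuming a version of XML::Twig that has the method. The document, the id values and the $found and $visited variables are invented for the example:

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<list><item id="a">first</item>'
        . '<item id="wanted">second</item>'
        . '<item id="c">third</item></list>';

my $found;          # text of the item we are looking for
my $visited = 0;    # number of items the handler has seen

my $twig = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ( $t, $item ) = @_;
            $visited++;
            if ( $item->att('id') eq 'wanted' ) {
                $found = $item->text;
                $t->finish_now;    # abort the parse, the rest is never read
            }
        },
    },
);
$twig->parse($xml);

print "found '$found' after $visited items\n";
```

Execution resumes right after the call to parse, so the third item is never seen by the handler.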


Q24: When I re-use a twig to parse another document within a handler, I get a mysterious calling depth after parsing is finished... error. What does it mean?

My code:

  XML::Twig->new( twig_handlers => { include => \&include })
            ->parsefile( "main_file.xml");
  sub include
    { my( $t, $include)= @_;
      $t->parsefile( $include->att( 'src'));
    }

Answer: Indeed you cannot re-use the twig object to parse another document. Contrary to most other modules (XML::Parser, XML::LibXML...), the twig is both the parser _and_ the parsed document. You can re-use the object if you parse several documents sequentially, but you cannot re-use it within a parse. So in your case you have to create a new XML::Twig object.

The reason for this is simple: incompetence. Mine. I wasn't very familiar with OO when I started writing the module, back in 1998, and I completely missed the object factory construct. Sorry.

Note that in version 3.22 and up the error message is hopefully more explicit: cannot reuse a twig that is already parsing.
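A sketch of the fix: the handler creates its own twig instead of re-using $t. To keep the example self-contained the included documents come from a hash (%source, a made-up stand-in for the files) and the handler just collects their text; with real files you would call parsefile on the new twig.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the included files, so the sketch runs
# without touching the file system.
my %source = ( 'part.xml' => '<part>included text</part>' );

my @included;    # text pulled from the included documents

my $main = XML::Twig->new( twig_handlers => { include => \&include } );
$main->parse('<doc><include src="part.xml"/></doc>');

print "included: $_\n" for @included;

sub include {
    my ( $t, $include ) = @_;
    # a new twig, not $t, which is still busy parsing the main document
    my $sub_twig = XML::Twig->new;
    $sub_twig->parse( $source{ $include->att('src') } );
    push @included, $sub_twig->root->text;
}
```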


Q25: I want to output the XML with the same format (indentation and line returns) as the input file. I have tried pretty_print but I cannot get what I want.

Answer: You can get the same formatting as in the original file by using the keep_spaces => 1 option when you create the twig. Note that this will create #PCDATA (text) elements that contain the whitespace in your tree.


Q26: What does the error message *** glibc detected *** double free or corruption (!prev): mean, and how do I get rid of it?

Answer: You are using the UTF8 perlIO layer on your input stream, usually because the environment variable PERL_UNICODE or the -C option includes D. This causes problems when reading from a pipe, due to a flaw in IO::Handle, used in XML::Parser in this case.

The workaround is to remove the D option, by setting PERL_UNICODE or using -C with a value that does not include D.

More info at http://rt.cpan.org/Ticket/Display.html?id=17500.


Q27: I want to pass additional arguments to XML::Twig handlers, not just the twig and the element, and I'd rather not use global variables. Can I do this?

Answer: Sure, use a closure:

  my $t= XML::Twig->new( twig_handlers => { foo => sub { bar( @_, @additional_args) } });
  sub bar
    { my( $t, $foo, @more_args)= @_;
      ...
    }

A good explanation of what closures are can be found in Achieving Closure.


Copyright (c)2000-2008 Michel Rodriguez. All rights reserved. Permission is hereby granted to freely distribute this document provided that all credits and copyright notices are retained.