<?xml version="1.0"?>
<faq>
   <header>
      <title>XML::Twig FAQ</title>
      <version>1.7</version><date>2003-02-06</date>
      <author>Michel Rodriguez</author>
   </header>

   <credits>
      <p>FAQ created by Michel Rodriguez</p>
      <p>Thanks to the numerous users of XML::Twig for their questions and suggestions, and to Walter Pienciak for letting me
         mirror this FAQ on the IEEE website</p>
   </credits>

   <overview><p>This FAQ contains information on XML::Twig, a Perl module used to process XML documents.
Please direct all corrections and additions to <a href="mailto:mirod@xmltwig.com">mirod@xmltwig.com</a>. </p>
<p>This FAQ can be found on the Web at <a href="http://www.xmltwig.com/xmltwig/faq.html">
		www.xmltwig.com/xmltwig/faq.html</a>.</p>
<p><a name="mlist"></a>Information in this FAQ is based mainly on questions posted to the Perl XML email list. To join, send an email to <a href="mailto:Lyris@ActiveState.com">Lyris@ActiveState.com</a> with the message: 
<b>SUBSCRIBE Perl-XML</b>.</p>

      <p>This FAQ was generated using a Perl script (using XML::Twig ;--) and an XML file. The script is at <a href="http://www.xmltwig.com/xmltwig/twig_faq">http://www.xmltwig.com/xmltwig/twig_faq</a>. The XML source is at 
	      <a href="http://www.xmltwig.com/xmltwig/faq.xml">http://www.xmltwig.com/xmltwig/faq.xml</a>. To generate the XML::Twig FAQ, run <b>twig_faq faq.xml</b>, which prints the HTML to STDOUT.
</p>
   </overview>

   <q id="1">
      <question>I know what a twig is but what is that XML thing anyway?</question>
      <answer>OK, time for a quick list of XML links:
	<ul><li><a href="http://www.w3.org/XML">The W3C XML page</a></li>
		<li><a href="http://xml.coverpages.org/sgml-xml.html">The XML Cover Pages</a></li> 
		<li><a href="http://www.perlxml.net/perl-xml-faq.dkb">The Perl XML FAQ</a></li>
		<li><a href="http://www.xml.com/pub/au/83">Kip Hampton's Perl and XML column</a></li>
        </ul>
      </answer>
   </q>

   <q id="2">
      <question>Where can I get the latest version of XML::Twig?</question>
      <answer>The latest stable version:
        <ul><li><a href="http://www.cpan.org/modules/by-module/XML/MIROD/">CPAN</a></li>
		<li><a href="http://www.xmltwig.com/xmltwig/">The Twig Homepage</a></li>
		<li><a href="http://standards.ieee.org/resources/spasystem/twig/index.html">The Twig Homepage (mirror hosted by the IEEE)</a></li>
        </ul>
        The latest development version:
	<ul><li><a href="http://www.xmltwig.com/xmltwig/">The Twig Homepage</a></li>
		<li><a href="http://standards.ieee.org/resources/spasystem/twig/index.html">The Twig Homepage (mirror hosted by the IEEE)</a></li>
        </ul>
      </answer>
   </q>


   <q id="3">
      <question>Where is the documentation?</question>
      <answer><p>Development version:
		      <a href="http://www.xmltwig.com/xmltwig/twig_dev.html">html</a> / 
		      <a href="http://www.xmltwig.com/xmltwig/twig_dev.txt">text</a></p>
             <p>Stable version:
		     <a href="http://www.xmltwig.com/xmltwig/twig_stable.html">html</a> / 
	             <a href="http://www.xmltwig.com/xmltwig/twig_stable.txt">text</a>
             </p>
             <p>You can also type <tt>perldoc XML::Twig</tt> once you have installed the module,
		     look at the <a href="http://www.xmltwig.com/xmltwig/quick_ref.html">XML::Twig Quick Reference</a>,
		    or go to <a href="http://www.xmltwig.com">xmltwig.com</a> for more information, including a
		    <a href="http://www.xmltwig.com/xmltwig/tutorial/index.html">tutorial</a>.</p>
      </answer>
   </q>

   <q id="9">
      <question>How is XML::Twig supported?</question>
      <answer><p>Twig is supported through email <a href="mailto:mirod@xmltwig.com">mirod@xmltwig.com</a>
		      and through the <a href="#mlist">Perl-XML mailing list</a>.</p>
	      <p>You are encouraged to report bugs using RT at <a href="http://rt.cpan.org">rt.cpan.org</a>.</p>
	      <p>Please send the following configuration information when you describe a bug:</p>
	      <ul><li>OS</li>
		      <li>version of perl (<tt>perl -v</tt>),</li>
		      <li>version of <tt>expat</tt> (see below),</li>
		      <li>version of XML::Parser (<tt>perl -MXML::Parser -le'print $XML::Parser::VERSION'</tt>),</li>
		      <li>version of XML::Twig (<tt>perl -MXML::Twig -le'print $XML::Twig::VERSION'</tt>).</li>
              </ul>
	      <p>Finding the version of <tt>expat</tt> that you are running can be a bit tricky, but it is
		      important information. Here is how you can get it:</p>
	      <p>First, if you are using a version of XML::Parser lower than 2.30, then you don't need to mention
		      <tt>expat</tt>'s version: XML::Parser comes with 
		      its own version of <tt>expat</tt>. It is old though, so you might want to upgrade: first grab
		      <tt><a href="http://expat.sourceforge.net">expat</a></tt> and install it, then install
		      a recent version of XML::Parser.</p>
	      <p>If you are using XML::Parser 2.30 or above, run <tt>xmlwf -v</tt>. If you are lucky this will
		     give you the version of expat. If <tt>xmlwf</tt> exists but
		     does not like the <tt>-v</tt> option, then you are most likely running expat 1.95.2. If
		     <tt>xmlwf</tt> is not installed on your system (which can be the case if you did not install
		     <tt>expat</tt> yourself but use the one provided with your OS) then (on *nix) you can look for
		     libexpat.so in your library path (using for example <tt>slocate libexpat.so</tt>). 
		     libexpat.so.1.0 is expat 1.95.2, libexpat.so.3.0 is expat 1.95.4 (in which case you should 
		     upgrade: expat 1.95.4 is not compatible with XML::Twig), and libexpat.so.4.0 is expat 1.95.5 or 
		     1.95.6.</p> 
	     <p>This information will help me a lot in figuring out what causes the problem.</p>
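	     <p>If you would rather not type the one-liners above, here is a small sketch of a script that
		     collects most of that information in one go. It only relies on the standard <tt>VERSION</tt>
		     method and on perl's built-in <tt>$^O</tt> and <tt>$]</tt> variables; it does not try to
		     guess the version of <tt>expat</tt>, which you still need to find as described above.</p>
	     <code><![CDATA[#!/usr/bin/perl -w
use strict;

use XML::Parser;
use XML::Twig;

# everything except expat's version, which has to be found separately
print "OS:          $^O\n";
print "perl:        $]\n";
print "XML::Parser: ", XML::Parser->VERSION, "\n";
print "XML::Twig:   ", XML::Twig->VERSION, "\n";
]]></code>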
      </answer>
   </q>

   <q id="4">
      <question>What is XML::Twig used for anyway?</question>
      <answer><p>I use XML::Twig for all sorts of XML processing: I use it to extract data from XML documents, to update documents from one DTD to another, to convert them to HTML and to extract/store/process data to and from various databases.</p></answer>
   </q>
   <q id="5">
      <question>Why should I use XML::Twig?</question>
      <answer><p>The main purpose of XML::Twig is to allow you to process XML documents that might be too big to fit in memory (with XML::DOM for example). If that is your case but you don't really like stream-oriented processing, then XML::Twig allows you to use a mixed stream/tree model, where you can process sub-documents as trees and then flush them to free the memory.</p>
      <p>In addition it is designed to be easy to use, masking some of the most annoying quirks of XML and XML::Parser, such as whitespace management and encodings (see below).</p>
      <p>The main drawback of XML::Twig is that it is not XML::DOM! It does not have a standard interface (feel free to add one ;--) nor does it interface with XML::SAX, although as of version 3.05 it does export SAX streams.</p>
      <p>Using the <tt>twig_roots</tt> option also lets you process (using the tree interface) only the parts of the documents you are interested in, something that can speed up your scripts tremendously.</p>
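      <p>To give an idea of what the mixed stream/tree model looks like, here is a minimal sketch. The element name <tt>record</tt>, the <tt>nb</tt> attribute and the tiny inline document are made up for the example; with a really big document you would call <tt>parsefile</tt> on a file instead of <tt>parse</tt> on a string.</p>
      <code><![CDATA[#!/usr/bin/perl -w
use strict;

use XML::Twig;

my $count= 0;
my $t= XML::Twig->new( twig_handlers =>
                        { # called each time a record has been completely parsed
                          record => sub { my( $t, $record)= @_;
                                          # the record is a complete tree here, work on it
                                          $record->set_att( nb => ++$count);
                                          # print what has been parsed so far, then free it
                                          $t->flush;
                                        },
                        },
                     );
# 'record' and the document below are hypothetical
$t->parse( '<doc><record>one</record><record>two</record></doc>');
$t->flush;   # print the end of the document
]]></code>
      <p>Only one <tt>record</tt> at a time is kept in memory, so the size of the whole document does not matter much.</p>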
      </answer>
   </q>
   <q id="6">
      <question>My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?</question>
      <answer><p>Yes. If you use the <tt>KeepEncoding</tt> option when you create a twig, all PCDATA (character data) will be returned as-is. Don't forget to declare the encoding though, either in the XML declaration or when you create the twig, or the parser will die on you. You can also process your document as UTF-8 internally and use the <tt>output_encoding</tt> option (XML::Twig version 3.05 and above) to convert the output to your favourite encoding.</p></answer>
   </q>
   <q id="7">
      <question>What's that whitespace management thing?</question>
      <answer><p>XML parsers are required by the standard to pass ALL data outside the markup to the calling application. Most of the time this is not desirable. By default XML::Twig discards those pesky \n (in fact XML::Twig discards all element contents that contain only whitespace). This can be changed at twig level, for example with the <tt>keep_spaces</tt> option.</p></answer>
   </q>
   <q id="8">
      <question>What's the expansion factor from an XML document to a twig?</question>
      <answer><p>If you load the entire document in a twig the expansion factor is about 13 (the 900K file used for the benchmark takes about 11M). Of course if you flush the document as you're parsing then it will be <b>much</b> less!</p></answer>
   </q>
   <q id="10">
      <question>I have that huge XML document, but I only want to extract information from a couple of elements, can XML::Twig help me there?</question>
      <answer><p>Oddly enough yes! Create the twig using the <tt>twig_roots</tt> option and the tree will be built only for those elements. <br/>Example:<code>
			      my $twig= XML::Twig->new( twig_roots    =&gt; { info =&gt;  \&amp;process_info });
     </code>
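     <br/>A slightly longer sketch, which you can run as-is, shows that the handler is only called for the roots. The <tt>ignored</tt> elements, the inline document and the <tt>@found</tt> array are made up for the example:<code><![CDATA[
#!/usr/bin/perl -w
use strict;

use XML::Twig;

my @found;   # text of the info elements, in document order

my $t= XML::Twig->new( twig_roots => { info => \&process_info });
# hypothetical document: no tree is built for the 'ignored' elements
$t->parse( '<doc><ignored>skipped</ignored><info>first</info>'
         . '<ignored><sub>skipped too</sub></ignored><info>second</info></doc>'
         );
print "info: $_\n" foreach @found;

sub process_info
  { my( $t, $info)= @_;
    push @found, $info->text;
    $t->purge;   # the element is not needed any more
  }
]]></code>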
     </p></answer>
   </q>
     <q id="11">
	<question>I process lots of XML documents in batch and there seems to 
	          be a memory leak in XML::Twig, any fix for that?</question>
	  <answer><p>Yes, since version 3.00, XML::Twig has a <tt>dispose</tt> method that completely releases a twig.
			With earlier versions you can release it yourself by doing:
    <code>
    undef $t->{twig};
    undef $t->{twig_root}->{twig};
    undef $t->{twig_parser};
    </code>
     </p>
     <p>The easiest method though, if you are using perl 5.6.0 or above, is to install the 
        <a href="http://search.cpan.org/search?dist=WeakRef">WeakRef</a> module, which fixes the memory leak.</p>
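     <p>In a batch script, calling <tt>dispose</tt> could look like the sketch below. The <tt>root_tag</tt> function
        is made up for the example and stands for whatever processing you do on each document; the files are
        taken from the command line.</p>
     <code><![CDATA[#!/usr/bin/perl -w
use strict;

use XML::Twig;

# hypothetical processing: returns the name of the root element of a file
sub root_tag
  { my( $file)= @_;
    my $t= XML::Twig->new;
    $t->parsefile( $file);
    my $tag= $t->root->gi;
    $t->dispose;             # release the twig before processing the next file
    return $tag;
  }

foreach my $file (@ARGV)
  { print "$file: root element is ", root_tag( $file), "\n"; }
]]></code>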
     </answer>
   </q>

   <q id="12"><question>How can I install XML::Twig on Windows?</question>
	   <answer><p>XML::Twig might be available as a ppm either from <a href="http://www.activestate.com">ActiveState</a>
		   or from another repository (see <a href="http://aspn.activestate.com//ASPN/Reference/Products/ActivePerl/faq/ActivePerl-faq2.html">Using PPM to install modules</a> for more information about ppm and for a list of repositories).</p>
	   <p>If it is not available, or if you want to use the development version, you can just uncompress the distribution file (<tt>XML-Twig-x.xx.tar.gz</tt>) and copy <tt>Twig.pm</tt> into the <tt>C:\Perl\site\lib\xml</tt> directory, alongside <tt>Parser.pm</tt>. Of course if you use <a href="http://cygwin.com">Cygwin</a> you can install the module with the usual <tt>perl Makefile.PL; make; make test; make install</tt> incantation. You might need to download <a href="http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe">nmake</a>.</p></answer> 
   </q>
   <q id="13"><question>For logging purposes I would like XML::Twig to report line/column number in the
		   original file</question>
	   <answer><p>Use <tt>start_tag_handlers</tt> to grab the line and column number through the parser object and
		   store them in private attributes (attributes whose name starts with a # are not output by XML::Twig):</p>
	   <code>#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $t=XML::Twig->new( start_tag_handlers => 
                       { # called when the start tag for elt is parsed
                         # use '#ELT' or _all_ to call the handler for all elements
                         elt => sub { my( $t, $elt)= @_;
                                      $elt->set_att( '#line' => $t->current_line);
                                    },
                       },
		      twig_handlers =>
                       { # called when elt is completely parsed
                         elt => sub { my( $t, $elt)= @_; 
			              print "error in elt starting line ",
                                            $elt->att( '#line'), "\n"
                                        if( $elt->has_child( 'subelt[@error]'));
                                    },
                       },
                     );
$t->parsefile( "test_track_line_number.xml");
</code>
<p>will parse <tt>test_track_line_number.xml</tt> that looks like:</p>
<markup><![CDATA[<doc>
  <elt>
    <subelt>text 1</subelt>
    <subelt>text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
  <elt>
    <subelt>text 1</subelt>
    <subelt error="yes">text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
</doc>]]></markup>
<p>and will output: <tt>error in elt starting line 7</tt></p></answer>
</q>

<q id="14"><question>How do I include bits of (possibly not well-formed) HTML in an XML document and
		     use them to generate HTML?</question>
	     <answer><p>You can wrap the HTML in a CDATA section, which will prevent the parser from
			     looking into the data. Then use a twig_handler on CDATA to process those sections.
			     Use the <tt>set_asis</tt> method to get those sections to be output without
			     being "XML escaped" (XML::Twig 3.05 and above).</p>
		     <code><![CDATA[
  #!/usr/bin/perl -w
  use strict;

  use XML::Twig;

  my $t= XML::Twig->new( twig_handlers => { '#CDATA' => sub { $_->set_asis; } });
  $t->parse( \*DATA);
  $t->print;

  __DATA__
  <doc>
    <elt>text</elt>
    <!-- embedded HTML, note the un-closed <br> tag -->	  
    <ehtml><![CDATA[hello<br>world]]]]><![CDATA[></ehtml>
    </doc>]]>
</code>
<p>will output (comment stripped for conciseness):</p>
<markup><![CDATA[<doc><elt>text</elt><ehtml>hello<br>world</ehtml></doc>]]></markup>

	   <p>Note that the CDATA section will not protect you from encoding problems, so if the included text is likely to
	      be in a different encoding than the main document you will have to do some encoding conversion before including it.</p></answer>
</q>  

<q id="15">
	<question><p>In which order are handlers called?</p>
		<p>I have this simple Perl script that parses an XML document. The XML document uses the following DTD:</p>
		<markup><![CDATA[<!ELEMENT doc  (title, elt+)>
<!ELEMENT title  (#PCDATA)>
<!ELEMENT elt    (#PCDATA|subelt)*>
<!ELEMENT subelt (#PCDATA)>]]></markup>

<p>I've noticed the following: although the element 'doc' is the root,
	XML::Twig calls its handler last. All the elements 'title' and 'elt'
	are processed in the correct sequence. Why? The handler for the element 'doc' should be
called first and not last.</p>

<p>Is the element's handler called on the opening tag OR on the closing tag?</p>
</question>
<answer><p>Element handlers are called on the closing tag: that is the first point at which
the entire element has been parsed, and the handler is called as soon as that happens.</p>
<p>This means that the handlers for inner elements are called before
	the handlers for the elements that contain them: here the handler on 'doc' will be
	called after the handlers on 'title' and 'elt'.</p>
<p>This example will show you in which order the handlers are called:</p>
<code><![CDATA[#!/usr/bin/perl -w -l
use strict;

use XML::Twig;

my $t= XML::Twig->new( twig_handlers => { '_all_' => sub { print "handler for ", $_->att( 'id'); } },
                       error_context => 1,
                     );
$t->parse( \*DATA);


__DATA__
<doc id="doc">
  <title id="title">title</title>
  <elt id="elt_1">
    <subelt id="subelt_1">subelt</subelt>
    <subelt id="subelt_2">subelt</subelt>
  </elt>
  <elt id="elt_2">element 2</elt>
</doc>]]></code>
</answer>
</q>

<q id="17">
<question>Any neat trick to increase the performance of XML::Twig?</question>
<answer><p>Tom Anderson from tomacorp released an interesting article: 
	<a href="http://tomacorp.com/perl/xml/saxvstwig.html">Performance Comparison
		Between SAX XML::Filter::Dispatcher and XML::Twig</a>. He notes:</p>
<blockquote><i>I learned an interesting performance optimization when writing 
		the anonymous subs for XML::Twig. These subs should not uselessly return
		a long string. Processing this string can increase processing time by 50%
		in this example. This is why the start_tag_handlers return the value 1</i>
</blockquote>
<p>Using this trick led to a 4x speedup on my first attempt at speeding up Tom's example!</p>
        <p>Thanks Tom!</p>
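        <p>Here is a small sketch of what the trick looks like in a handler. The <tt>%xml</tt> hash, the element
           names and the inline document are made up for the example, and how much time you gain (if any) will
           depend on your data and on your version of XML::Twig.</p>
        <code><![CDATA[#!/usr/bin/perl -w
use strict;

use XML::Twig;

my %xml;   # hypothetical store for the serialized elements

my $t= XML::Twig->new( twig_handlers =>
                        { elt => sub { # a sub returns the value of its last statement: without
                                       # the explicit return, that would be the (potentially
                                       # long) string assigned here
                                       $xml{$_->att( 'id')}= $_->sprint;
                                       return 1;
                                     },
                        },
                     );
$t->parse( '<doc><elt id="e1">text 1</elt><elt id="e2">text 2</elt></doc>');
print "$_: $xml{$_}\n" foreach sort keys %xml;
]]></code>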
</answer>
</q>

<q id="16">
<question><p>I seem to be having a spot of trouble getting XML::Twig 3.08 to compile
and install on a SuSE 8.1/RedHat 8.0 system.</p>
<p>Here is the result of <tt>make test</tt>:</p>
<code><![CDATA[make test
[...]
t/test_entities...........
undefined entity at line 4, column 13, byte 77:
<!DOCTYPE doc SYSTEM "t/dummy.dtd">
<doc>
  <elt1>toto &ent1;</elt1>
============^
  <elt2>tata &ent2;</elt2>
  <elt3>tutu &ent3;</elt3>
 at /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 185
t/test_entities...........dubious                                            
       Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-6
        Failed 6/6 tests, 0.00% okay
[...]
t/test_spaces.............dubious                                            
        Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-3
        Failed 3/3 tests, 0.00% okay
t/test_twig_roots.........ok
t/test_xpath_cond.........ok
Failed Test       Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/test_entities.t  255 65280     6    6 100.00%  1-6
t/test_spaces.t    255 65280     3    3 100.00%  1-3
Failed 2/18 test scripts, 88.89% okay. 9/400 subtests failed, 97.75% okay.
make: *** [test_dynamic] Error 29]]></code>
</question>
<answer><p>The problem is probably (I don't use those distributions) an incompatibility between XML::Twig and the
		version of the libexpat library that comes with RedHat 8.0 and SuSE 8.1.
If you upgrade to XML::Twig 3.08 and to libexpat 1.95.5, the problem should
go away.</p>
<p>You can get the latest version of libexpat on sourceforge: <a href="http://expat.sourceforge.net/">http://expat.sourceforge.net/</a></p></answer>
</q>

<q id="19">
		<question>I need to process XML documents. The problem is that there are several of them in the same
				stream, so the parser dies after the first one, with a message telling me that there is
				junk after the end of the document. Is there any way I could trick the parser into
				believing they are all part of a single document?</question>
		<answer><p>You can open the input file as a pipe, first <tt>echo</tt>-ing an open tag, then getting
				the input from wherever you get it, then <tt>echo</tt>-ing a close tag:</p>
<code><![CDATA[#!/usr/bin/perl -w
use strict;
 
use XML::Twig;
 
# here we have a very simple generator, but it could be any process that 
# generates a stream of XML documents 
my $xml_generator= q{echo '<doc>doc1</doc><doc>doc2</doc>'}; 
my $wrap= 'docs';

# this is where it all happens:
# the pipe at the end of the "file name" means that the name is a
# shell command, that will be executed then piped to the filehandle
open( IN, qq{echo '<$wrap>'; $xml_generator; echo '</$wrap>' |})
  or die "error opening xml_generator: $!";

my $i=1;
my $t= XML::Twig->new( twig_handlers => {
                         doc => sub { print "document $i: ", $_->sprint, "\n"; 
                                      $_[0]->purge; # to get the memory back
                                      $i++;
                                    }
                                        },
                     );
$t->parse( \*IN);
close IN or die "error during the execution of xml_generator: $!";					 
]]></code>
</answer>
</q>
		

<copyright>Copyright (c)2000-2002 Michel Rodriguez. All rights reserved. Permission is hereby granted to freely distribute this document provided that all credits and copyright notices are retained.</copyright>
</faq>





