XML, the Perl Way

Previous
3. Introduction to XML::Twig
Table of Content
Next
5. Data base integration

4. First Examples

4.1 Full-tree mode

4.1.1 Creating and navigating the twig

Now let's see our first code example. The purpose of this one is to reorder a list of elements on the value of an attribute.

The DTD is quite simple: stats.dtd

<!ELEMENT stats  (player+)>
<!ELEMENT player (name, g, ppg, rpg, apg, blk, blg?)>
<!ELEMENT name   (#PCDATA)>
<!ELEMENT g      (#PCDATA)>
<!ELEMENT ppg    (#PCDATA)>
<!ELEMENT rpg    (#PCDATA)>
<!ELEMENT apg    (#PCDATA)>
<!ELEMENT blk    (#PCDATA)>
<!ELEMENT blg    (#PCDATA)>

And the data is:

<?xml version="1.0"?>
<!DOCTYPE stats SYSTEM "stats.dtd">
<stats><player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg><rpg>3.4</rpg><apg>2.8</apg><blk>14</blk></player>
<player><name>Sprewell, Latrell</name><g>69</g><ppg>19.2</ppg><rpg>4.5</rpg><apg>4.0</apg><blk>15</blk></player>
<player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg><rpg>10.0</rpg><apg>1.0</apg><blk>68</blk></player>
</stats>

The complete XML data is in nba.xml, the file that the scripts below parse.

The script is ex1_1.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This first example shows how to create a twig, parse a file into it  #
#  get the root of the document, its children, access a specific child  #
#  and get the text of an element                                       #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $field= $ARGV[0] || 'ppg';
my $twig= new XML::Twig;

$twig->parsefile( "nba.xml");    # build the twig
my $root= $twig->root;           # get the root of the twig (stats)
my @players= $root->children;    # get the player list

                                 # sort it on the text of the field
my @sorted= sort {    $b->first_child( $field)->text 
                  <=> $a->first_child( $field)->text }
            @players;
                                 
print '<?xml version="1.0"?>';   # print the XML declaration
print '<!DOCTYPE stats SYSTEM "stats.dtd" []>';
print '<stats>';                 # then the root element start tag

foreach my $player (@sorted)     # the sorted list 
 { $player->print;               # print the xml content of the element 
   print "\n"; 
 }
print "</stats>\n";              # close the document

Note how we get the root of the twig using the root method, then use the children method to get the list of players.

The first_child method is used to navigate the twig. It accepts an optional parameter, the gi (generic identifier, that is, the element name) we are interested in; if the parameter is omitted, the first child, whatever its gi, is returned. Other navigation methods are last_child, prev_sibling, next_sibling and parent. They all return undef if no element is found.

The text method returns the... text of the element, including the text of all elements included in it, without any tags. Other methods used to retrieve the content of an element include print, which prints the element from its start tag to its end tag, both included, along with the content (and tags) of all included elements, and sprint, which returns the string that print prints and accepts an optional parameter which excludes the element's own tags when true.
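
As a quick illustration of the navigation and text methods just described, here is a minimal sketch. It parses a small inline document (made up for this example, not the real nba.xml) with parse instead of parsefile, and all variable names are illustrative only. XML::Twig->new is the same constructor call as new XML::Twig.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# a tiny inline document, made up for this sketch (not the real nba.xml)
my $xml= '<stats><player><name>Ewing, Patrick</name><g>49</g><blk>68</blk></player></stats>';

my $twig= XML::Twig->new;
$twig->parse( $xml);                               # parse a string instead of a file

my $player= $twig->root->first_child( 'player');   # first player child of stats
my $first = $player->first_child;                  # no gi given: the first child, whatever it is
my $g     = $first->next_sibling;                  # the element right after name
my $last  = $player->last_child;                   # the last child of player
my $none  = $player->first_child( 'ppg');          # no ppg child here: undef

my $first_gi   = $first->gi;                       # name
my $parent_gi  = $last->parent->gi;                # player
my $all_text   = $player->text;                    # text of all the children, no tags
my $with_tags  = $g->sprint;                       # <g>49</g>
my $inner_only = $g->sprint( 1);                   # 49

print "first child: $first_gi, parent of last child: $parent_gi\n";
print "text: $all_text\n";
print "sprint: $with_tags, sprint( 1): $inner_only\n";
print "ppg child: ", ( defined $none ? 'found' : 'undef'), "\n";
```

Checking the return value for undef, as in the last line, is what keeps a chained call such as first_child( 'ppg')->text from dying on a player that lacks the element.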

4.1.2 Modifying the twig

Another example, in which we will create new elements: our statistics include the total number of blocks for each player, but in order to find out the best blocker in our selection we want the number of blocks per game, and we want to store it in the document (conveniently the DTD allows for an optional blg element).

Here is ex1_2.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example shows how to create, and paste elements                 #
#  It creates a new element named blg, for each player                  #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $twig= new XML::Twig;

$twig->parsefile( "nba.xml");    # build the twig
my $root= $twig->root;           # get the root of the twig (stats)
my @players= $root->children;    # get the player list

                                 
foreach my $player (@players)     
 { my $g  = $player->first_child( 'g')->text;    # get the text of g            
   my $blk= $player->first_child( 'blk')->text;  # get the text of blk
   my $blg= sprintf( "%2.3f", $blk/$g);          # compute blg
   my $eblg= new XML::Twig::Elt( 'blg', $blg);   # create the element
   $eblg->paste( 'last_child', $player);         # paste it in the document   
 }

$twig->print;                    # note that we lose the extra returns

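The paste method accepts 4 different position arguments: first_child (the default) and last_child, which paste the element as the first or last child of the reference element, and before and after, which paste it as the previous or next sibling of the reference element.

You can omit first_child and just write $elt->paste( $ref). What you can't do is paste an element that already belongs to a document: that will cause a fatal error, so cut the element first if you want to move it.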

An important feature of the paste method is that it is called on the element being pasted: $child->paste( $parent) and not the other way around.
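
The four positions accepted by paste (first_child, last_child, before and after) can be sketched as follows. The element names first, last, pre and post are made up for this illustration, and the document is a small inline string rather than nba.xml.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $twig= XML::Twig->new;
$twig->parse( '<player><g>49</g><blk>68</blk></player>');

my $player= $twig->root;
my $g     = $player->first_child( 'g');

# paste is always called on the new element, the reference element is the argument
XML::Twig::Elt->new( 'first', 'a')->paste( $player);                # first_child is the default
XML::Twig::Elt->new( 'last',  'b')->paste( 'last_child', $player);  # after the current last child
XML::Twig::Elt->new( 'pre',   'c')->paste( 'before', $g);           # previous sibling of g
XML::Twig::Elt->new( 'post',  'd')->paste( 'after',  $g);           # next sibling of g

my $pasted= $player->sprint;
print "$pasted\n";
# <player><first>a</first><pre>c</pre><g>49</g><post>d</post><blk>68</blk><last>b</last></player>
```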

Note that the output is now generated by the print method, instead of regular print statements, and that the extra line returns that we had inserted in the file have disappeared. We will see a little later how to keep them around.

4.2 Twig handlers

Another way to accomplish the same task, a more "twig-ish" way, would be to set a handler on the player element. A handler is attached to an element name through the twig_handlers option when the twig is created. It is a subroutine that will be called every time an element with that name has been completely parsed. It is called with 2 parameters: the twig itself and the element.

Note that the handler is called as soon as the element is completely parsed. That means that the handler will be called when the end tag for that element is parsed. A somewhat surprising consequence of that is that if you set twig handlers on nested elements, the handlers on the inner elements will be called before the handlers on the outer elements.
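
The calling order is easy to check with a small sketch. The handlers below only record the gi of the element they receive; the @order array and the inline document are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @order;                                         # gis, in the order handlers are called

my $twig= XML::Twig->new(
            twig_handlers =>
              { stats  => sub { push @order, $_[1]->gi },   # outermost element
                player => sub { push @order, $_[1]->gi },
                name   => sub { push @order, $_[1]->gi },   # innermost element
              }
                        );

$twig->parse( '<stats><player><name>Ewing, Patrick</name></player></stats>');

print join( ' ', @order), "\n";                    # name player stats
```

The name end tag is seen first, so its handler runs first; the stats handler runs last, once the whole document has been parsed.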

Here is ex1_3.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example shows how to use the twig_handlers option               #
#  It creates a new element named blg, for each player                  #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $twig= new XML::Twig( 
                twig_handlers =>                 # player will be called
                  { player => \&player }         # when each player element
                       );                        # has been parsed

$twig->parsefile( "nba.xml");    # build the twig
$twig->print;                    # print it

sub player
 { my( $twig, $player)= @_;                      # handlers params are always
                                                 # the twig and the element

   my $g  = $player->first_child( 'g')->text;    # get the text of g            
   my $blk= $player->first_child( 'blk')->text;  # get the text of blk
   my $blg= sprintf( "%2.3f", $blk/$g);          # compute blg
   my $eblg= new XML::Twig::Elt( 'blg', $blg);   # create the element
   $eblg->paste( 'last_child', $player);         # paste it in the document   
 }

This is basically similar to the previous example, except the interesting code is in the handler instead of being in the loop. It gets more interesting in the next section though...

4.3 The flush and purge methods

4.3.1 The flush method

Now in the previous examples the whole document was being loaded, then printed. This is not very memory efficient, especially as once a player has been updated it is never used again.

Hence the use of the flush method. The flush method just dumps the twig that has been parsed so far. It takes care of printing the proper closing tags when needed and deleting the printed elements, thus allowing the memory to be reused for the rest of the processing. It does not delete the parents of the current element (but might delete most of their children), so they are still available when navigating the twig.

Here is ex1_4.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example shows how to use the flush method                       #
#  It creates a new element named blg, for each player                  #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $twig= new XML::Twig( 
                twig_handlers =>                  # player will be called
                  { player => \&player }          # when each player element
                       );                         # has been parsed

$twig->parsefile( "nba.xml");                     # build the twig
$twig->flush;                                     # flush the end of the twig  

sub player
  { my( $twig, $player)= @_;                      # handlers params are always
                                                  # the twig and the element

    my $g  = $player->first_child( 'g')->text;    # get the text of g
    my $blk= $player->first_child( 'blk')->text;  # get the text of blk
    my $blg= sprintf( "%2.3f", $blk/$g);          # compute blg
    my $eblg= new XML::Twig::Elt( 'blg', $blg);   # create the element
    $eblg->paste( 'last_child', $player);         # paste it in the document

    $twig->flush;                                 # flush the twig so far   
 }

Still very similar to the previous example, except that instead of printing the whole twig at the end of the processing the calls to flush at the end of player ensure that each player element stays in memory for just as long as it is needed.

Note: as of XML::Twig 3.23, there is no longer any need to call flush one last time after the document is completely parsed. If the document was flushed, then it will be "auto-flushed" (to the same filehandle used for the first flush) after the parse.
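
Since flush can be given a filehandle, the output does not have to go to STDOUT. The following is a minimal sketch under two assumptions: XML::Twig 3.23 or later (so the end of the document is auto-flushed to the same filehandle), and an in-memory filehandle standing in for a real output file. The inline data and the $buffer and $out names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# inline data made up for this sketch (two players, only the fields we need)
my $xml= '<stats><player><g>49</g><blk>68</blk></player>'
       . '<player><g>69</g><blk>14</blk></player></stats>';

my $buffer= '';                                    # receives the flushed XML
open( my $out, '>', \$buffer)                      # in-memory filehandle, stands in for a file
  or die "cannot open in-memory filehandle: $!";

my $twig= XML::Twig->new( twig_handlers => { player => \&player });
$twig->parse( $xml);                               # assumed: end of document is auto-flushed (3.23+)
close $out;

print "$buffer\n";

sub player
  { my( $twig, $player)= @_;
    my $g  = $player->first_child( 'g')->text;
    my $blk= $player->first_child( 'blk')->text;
    my $blg= sprintf( "%2.3f", $blk/$g);           # blocks per game
    XML::Twig::Elt->new( 'blg', $blg)->paste( 'last_child', $player);
    $twig->flush( $out);                           # print to $out and free what was parsed so far
  }
```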

4.3.2 The purge method

The flush method is useful if you want to output the modified document. But you might not always want that. Suppose you just want to output the leader in a category:

Here is ex1_5.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example shows how to use the purge method                       #
#  It outputs the name of the leader in a statistical category          #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $leader_name;
my $leader_score=0;


my $field= $ARGV[0] || 'ppg';
                                              # create the twig
my $twig= new XML::Twig( twig_handlers => { player => \&player } ); 

$twig->parsefile( "nba.xml");                 # parse the twig
                                              # print the result
print "Leader in $field: $leader_name ($leader_score $field)\n";

sub player
  { my( $twig, $player)= @_;                      
                                              # get the score
    my $score= $player->first_child( $field)->text;    
    if( $score > $leader_score)               # if it's the highest
      { $leader_score= $score;                # store the information
        $leader_name= $player->first_child( 'name')->text;
      }
    $twig->purge;                             # delete the twig so far   
 }

Very simple, yet very memory efficient. You still get the advantage of local tree-processing, having access to the whole player sub-tree, while not having to pay the price of loading the whole document in memory.

But wait! There's more...

4.4 The twig_roots option

Actually in the previous example we build the complete twig for each player element, even though we are really only interested in the name and one of the sub-elements. It's OK as the XML file we are working on is not too big, but it can be a problem, both in terms of speed and memory, for bigger files. Fortunately XML::Twig offers a way to build the twig only for those elements we are interested in.

The twig_roots option, set when the twig is created, gives a list (well, actually a hash) of elements for which the twig will be built. Other elements will be ignored. The result is a twig that includes the root of the document (we need a root for the tree in any case) and the twig_roots elements as children of that root. For each element in the twig_roots list the whole sub-tree is built.

Here is ex1_6.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example shows how to use the twig_roots option                  #
#  It outputs the name of the leader in a statistical category          #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $leader_name;
my $leader_score=0;


my $field= $ARGV[0] || 'ppg';
                                              # twig will be created only
                                              # for name and $field elements
my $twig= new XML::Twig( twig_roots    => { 'name' => 1, $field => 1 },
                                              # handler will be called for
                                              # $field elements
                         twig_handlers => { $field => \&field } ); 

$twig->parsefile( "nba.xml");                 # parse the twig
                                              # print the result
print "Leader in $field: $leader_name ($leader_score $field)\n";

sub field
  { my( $twig, $field)= @_;                      
                                              # get the score
    my $score= $field->text;    
    if( $score > $leader_score)               # if it's the highest
      { $leader_score= $score;                # store the information
        $leader_name= $field->prev_sibling( 'name')->text;
      }
    $twig->purge;                             # delete the twig so far   
 }

The twig actually built (when looking for the leader in ppg) is <stats><name>Houston, Allan</name><ppg>20.1</ppg><name>Sprewell, Latrell</name><ppg>19.2</ppg>...</stats>. The script doesn't spend memory storing useless information on other stats, nor time building the twig for those stats.
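
To look at the reduced twig directly, one can parse with twig_roots and no handler, then inspect what is left under the root. This is only a sketch on a small inline document made up for the occasion; the @gis and $built names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# two players, inline data made up for this sketch
my $xml= '<stats><player><name>Houston, Allan</name><g>69</g><ppg>20.1</ppg></player>'
       . '<player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg></player></stats>';

# build the twig only for name and ppg elements, no handler, no purge
my $twig= XML::Twig->new( twig_roots => { name => 1, ppg => 1 });
$twig->parse( $xml);

my @gis  = map { $_->gi } $twig->root->children;   # what ended up under the root
my $built= $twig->root->sprint;                    # the twig as it was built

print "children of the root: @gis\n";              # name ppg name ppg
print "$built\n";
```

The player and g elements never make it into the twig, which is why the field handler in ex1_6.pl has to use prev_sibling rather than parent to reach the name.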

4.5 The twig_print_outside_roots option

Now suppose all we want to do is remove a statistical category from the document. Ideally we would like to build as little of the twig as possible, using the twig_roots option, but we also want most of the document to be output as is. twig_print_outside_roots to the rescue! When that option is set at twig creation, anything outside of the twig_roots elements is simply printed.

Here is ex1_7.pl.

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example shows how to use the twig_print_outside_roots option    #
#  It deletes a statistical category from the document                  #
#                                                                       #
#########################################################################

use strict;
use XML::Twig;

my $field= $ARGV[0] || 'ppg';
                                              # twig will be created only
                                              # for $field elements
my $twig= new XML::Twig( twig_roots    => { $field => 1 },
                                              # print all other elements asis
                         twig_print_outside_roots => 1, 
                         twig_handlers => { $field => \&field } ); 

$twig->parsefile( "nba.xml");                 # parse the twig

sub field
  { my( $twig, $field)= @_;                      
    $field->cut;    
 }

Note the use of the cut method, which just removes the element from the twig. It is also possible to use delete instead of cut. The difference is that cut keeps the element around (so it can be for example pasted somewhere else), while delete destroys it (and frees up the memory it used).
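
The difference between the two can be sketched in full-tree mode. The leaders element and the inline document below are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $twig= XML::Twig->new;
$twig->parse( '<stats><player><name>Ewing, Patrick</name><g>49</g><ppg>14.6</ppg></player>'
            . '<leaders><title>Leaders</title></leaders></stats>');

my $root   = $twig->root;
my $player = $root->first_child( 'player');
my $leaders= $root->first_child( 'leaders');

my $ppg= $player->first_child( 'ppg');
$ppg->cut;                                         # out of the twig, but the element still exists
$ppg->paste( 'last_child', $leaders);              # so it can be pasted somewhere else

$player->first_child( 'g')->delete;                # gone for good, its memory is freed

my $after= $root->sprint;
print "$after\n";
# <stats><player><name>Ewing, Patrick</name></player><leaders><title>Leaders</title><ppg>14.6</ppg></leaders></stats>
```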

And of course, as There's More Than One Way To Do It, here is a really short script that does the same thing, just in a lazier way (and actually a slightly faster but more memory-intensive one).

Here is ex1_8.pl.

#!/bin/perl -w

use strict;
use XML::Twig;

my $field= $ARGV[0] || 'ppg';
my $twig= new XML::Twig( twig_roots => { $field => 1 },
                         twig_print_outside_roots => 1); 

$twig->parsefile( "nba.xml");                 # parse the twig

Figuring out how it works is left as an exercise for the reader (hint: twig_print_outside_roots does just what its name suggests, no more).

4.6 A simple HTML+ converter

Now with what we've learned so far we are just a couple of additional tricks away from building a simple "HTML+" converter. The + here means that we can add extra inline elements to an HTML document, provided of course that the HTML document is a well-formed XML instance (and I admit this can be hard to achieve).

So here is xml2html1.pl. It runs on the html_plus.xml file and includes itself in the output: html_plus.html

#!/bin/perl -w

#########################################################################
#                                                                       #
#  This example can be used as a framework to create other HTML+        #
#  converters                                                           #
#  It maps various specific elements to common html ones, takes care of #
#  empty elements that would not be displayed in an old browser, and    #
#  finally allows simple inclusion of outside files in the document     #
#                                                                       #
#########################################################################


use strict;
use XML::Twig;


my $t= new XML::Twig
         ( TwigRoots =>
             { example     => \&example,    # include a file 
                                            # convert to an html tag
               method      => \&method,     # convert to tt and create
                                            # link to doc
               tag         => sub { make(@_, 'tt') },
               code        => sub { make(@_, 'tt') },
               package     => sub { make(@_, 'b') },
               option      => sub { make(@_, 'b') },
               br          => \&empty,      # we need those for the html
               hr          => \&empty,      # to work in old browsers
             },
            TwigPrintOutsideRoots => 1,     # just print the rest
          );

if( $ARGV[0]) { $t->parsefile( $ARGV[0]); } # process the twig
else          { $t->parse( \*STDIN);      }

exit;

sub empty                                   
  { my( $t, $empty)= @_;                    
    print "<" . $empty->gi . ">";           # just print the tag html style
  }

sub make                                          
  { my( $t, $elt, $new_gi)= @_;
    $elt->set_gi( $new_gi);                 # change the tag gi
    $elt->print;                            # don't forget to print it
  }

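sub method
  { my( $t, $method)= @_;
    $method->set_gi( 'tt');                 # convert the element to tt
    my $a= $method->insert( 'a');           # wrap its content in a link
    my $class= $method->att( 'class');
    my $item= lc $method->text;
    $method->del_att( 'class');
    $a->set_att( href => "$class\_$item");  # link to the doc
    $method->print;                         # don't forget to print it
  }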

sub example                                 # generate a link and include the file
  { my( $t, $example)= @_;

    my $file= $example->text;               # first get the included file

    $example->set_gi( 'p');                 # replace the example by a paragraph
    my $a= $example->insert( 'a');          # insert a link in the paragraph
    $a->set_att( href => $file);            # set the href attribute

    $example->print;                        # print the paragraph

    open( my $fh, '<', $file)               # open the file
      or die "cannot open file $file: $!"; 
    local $/;                               # slurp all of it
    my $text= <$fh>;
    close $fh;

    $text=~ s/&/&amp;/g;                    # replace special characters (& first)
    $text=~ s/</&lt;/g ;                    
    $text=~ s/"/&quot;/g;

    print "<pre>$text</pre>";               # print the example


    
  }


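This script uses several methods we have not seen yet: gi and set_gi get and set the gi of an element; insert creates a new element as the only child of the current element, the former content of the element becoming the content of the new one; att, set_att and del_att get, set and delete an attribute.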

Also note the neat trick (thanks to Clark Cooper for this one) that consists of setting the handler to an anonymous sub that just adds an extra parameter to the usual ones: sub { make(@_, 'tt') }.
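
Here is the trick in isolation, as a minimal sketch that works on the whole tree and prints the result at the end. The rename_elt sub, the inline paragraph and the choice of target tags are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $twig= XML::Twig->new(
            twig_handlers =>
              { tag    => sub { rename_elt( @_, 'tt') },   # @_ is ( $twig, $elt)
                option => sub { rename_elt( @_, 'b')  },   # same sub, different extra argument
              }
                        );

$twig->parse( '<p>The <tag>player</tag> element and the <option>twig_roots</option> option</p>');

my $converted= $twig->root->sprint;
print "$converted\n";
# <p>The <tt>player</tt> element and the <b>twig_roots</b> option</p>

sub rename_elt                                     # hypothetical helper, shared by both handlers
  { my( $t, $elt, $new_gi)= @_;
    $elt->set_gi( $new_gi);                        # change the gi of the element
  }
```

One shared sub then serves any number of element names, and the mapping from source element to target tag stays visible in the twig_handlers hash.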

4.7 Setting handlers for elements in context

An additional option is to set handlers not for elements but for elements in a given context: instead of giving just the gi of the element you can use an XPath-like expression in the twig_handlers (as well as in the twig_roots) argument.

A valid path can be of the form /root/elt1/elt2 for a complete path to the element, or elt1/elt2 for a partial path.

Note that this path is given in the original document, not in the current twig.
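
A small sketch shows how a complete path and a partial path select different title elements. The inline document and the two arrays used to collect the titles are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my( @doc_titles, @section_titles);                 # filled in by the handlers

my $twig= XML::Twig->new(
            twig_handlers =>
              { '/doc/title'    => sub { push @doc_titles,     $_[1]->text },  # complete path
                'section/title' => sub { push @section_titles, $_[1]->text },  # partial path
              }
                        );

$twig->parse( '<doc><title>Guide</title>'
            . '<section><title>Intro</title></section>'
            . '<section><title>Usage</title></section></doc>');

print "document title: @doc_titles\n";             # Guide
print "section titles: @section_titles\n";         # Intro Usage
```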

So if we want to convert the simple document we saw in the XML examples we would write the conversion as in ex1_9.pl.

#!/bin/perl -w

use strict;
use XML::Twig;

my $twig= new XML::Twig( twig_handlers => 
                { doc            => \&doc,
                 '/doc/title'    => \&doc_title,     # full path
                 'section/title' => \&section_title, # partial path
                  section        => \&section,
                }
                       ); 

$twig->parsefile( "simple_doc.xml");                  # parse the twig
$twig->print;                                         # print the modified twig


sub doc_title
  { my( $t, $doc_title)= @_;
    $doc_title->set_gi( 'h1');                        # just change the tag to h1
  }

sub section_title
  { my( $t, $section_title)= @_;
    $section_title->set_gi( 'h2');                    # just change the tag to h2
  }

sub section
  { my( $t, $section)= @_;
    $section->erase;                                  # erase just erases the tags
  }

sub doc
  { my( $t, $doc)= @_;
    $doc->set_gi( 'html');                            # set the gi to html
    my $doc_title= $doc->first_child( 'h1')->text;    # the title is now a h1 element
    $doc->insert( 'body');                            # create the body
    my $header= new XML::Twig::Elt( 'head');          # create the head element
    $header->paste( $doc);                            # paste it 
    my $title= $header->insert( 'title');             # insert the title
    $title->set_text( $doc_title);                    # with the appropriate content
  }



When we process the doc element the title has already been processed, so we have to look for an h1 child.

We also use two new methods here: erase removes the element and pastes all of its children as children of the element parent. The effect on the output is that the tag has been erased from the document. set_text sets the textual content of the element.
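
Both methods can be tried outside of any handler, on a fully loaded twig. This is a minimal sketch on an inline document made up for the occasion.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $twig= XML::Twig->new;
$twig->parse( '<doc><section><title>Intro</title><p>Hello</p></section></doc>');

my $section= $twig->root->first_child( 'section');
my $title  = $section->first_child( 'title');

$section->erase;                                   # title and p become children of doc
$title->set_text( 'Introduction');                 # replace the text of the title

my $erased= $twig->root->sprint;
print "$erased\n";
# <doc><title>Introduction</title><p>Hello</p></doc>
```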

