Simple XML Transformation with Perl

by Michel Rodriguez
Boardwatch Magazine

After looking at ways to create XML using Perl in a previous column, this month I will look at two easy ways to process existing XML files.

Since Perl is the most popular CGI language, it should come as no surprise that it offers many ways to process XML: no fewer than 14 different modules are available for XML transformation. From XML::DOM to XML::XSLT and from XML::Parser to XML::Twig, it is a challenge to figure out which one(s) to use. This column will not cover all of them, nor tell you how to choose between them. Instead it will focus on two that offer simple interfaces to XML files and that will appeal to most beginning users. They might not be the most powerful tools around, but they are definitely the simplest ones.

Why two, then? Because they cover the two main ways to process XML: one is event-oriented, processing the file as it is being parsed, while the other is tree-oriented, first loading the file into memory and then processing it.

XML::PYX

My absolute favorite tool for extracting information from an XML document is XML::PYX. PYX is really simple, fits very well with the "Perl Way" and does not require the user to know much about XML.

Most of the time XML::PYX is not even used as a module per se. It comes with four tools: pyx (the original), pyxv (which validates the XML file against a DTD), pyxhtml (which reads HTML files) and pyxw (which writes a PYX flow as an XML file). Together they allow an XML file to be transformed through a pipeline such as pyx original.xml | my_script | pyxw > target.xml.

The pyx tool does not even need to be the Perl one; the Python and Java-based PYX implementations can be used instead. XML::PYX parses an XML file and outputs a very simple, line-oriented format with just the bare, essential information. A simple Perl script, often a one-liner, can then take this output and filter it.

Here is the PYX output format (slightly simplified):

The PYX format

First char	Event	Line Format
(	Start Tag	`(<tag>`
)	End Tag	`)<tag>`
A	Attribute	`A<attribute> <value>`
_	Text	`_<text>`

So a (simple) XML file such as <doc class="simple"><p>My document</p></doc> would be output as:

(doc
Aclass simple
(p
_My document
)p
)doc
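Because each PYX line carries its event type in its first character, a consumer only has to look at that character and treat the rest of the line as data. The following is a minimal sketch of that idea, assuming the simplified four-event format from the table above. The parse_pyx_line helper and the event names it returns are illustrative and are not part of XML::PYX.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one PYX line into an event name and its data.
# Only the four line types of the simplified table are handled.
sub parse_pyx_line {
    my ($line) = @_;
    chomp $line;
    my ( $type, $rest ) = $line =~ /^(.)(.*)$/s or return;
    return ( 'start', $rest ) if $type eq '(';
    return ( 'end',   $rest ) if $type eq ')';
    return ( 'text',  $rest ) if $type eq '_';
    if ( $type eq 'A' ) {
        # the attribute name and its value are separated by the first space
        my ( $name, $value ) = split / /, $rest, 2;
        return ( 'attribute', $name, $value );
    }
    return;    # any other line type is ignored in this sketch
}

# The PYX flow shown above, one line per array element.
my @pyx = ( "(doc\n", "Aclass simple\n", "(p\n", "_My document\n", ")p\n", ")doc\n" );

# Turn every line into an event and print one event per line.
my @events = map { [ parse_pyx_line($_) ] } @pyx;
print join( ' ', @$_ ), "\n" for @events;
```

In a real pipeline the array would be replaced by a loop over standard input, fed by pyx.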

A typical use of XML::PYX would be to count all the tags in an XML document:

pyx doc.xml | perl -n -e '$nb{$1}++ if( m/^\((.*)$/);
  END { print "$_ used $nb{$_} time(s)\n" for keys %nb; }'

It goes through the file using Perl's -n option, grabs the tag name from every line that starts with a '(' and increments a value in a hash {tag => nb_tag}. At the end of the file the hash content is printed. Easy, isn't it?

Despite its apparent simplicity, PYX can be used for some quite powerful processing, such as basic tag translation: using a conversion hash to go from one XML vocabulary (a nice way to say a set of tags) to another.

For example, an XML document like this:

<html>
  <head><title>Users</title></head>
  <body>
    <h1>Users</h1>
    <users>
      <user><login>jsmith</login><fullname>John Smith</fullname></user>
      <user><login>jdoe</login><fullname>John Doe</fullname></user>
    </users>
  </body>
</html>

could easily be converted into a proper HTML document by the following command:

pyx users.xml | perl -pe 'BEGIN { %html= (users=>"table", user=>"tr", login=>"td", fullname=>"td"); }
  s/^([()])(.*)$/$1.($html{$2}||$2)/e;' | pyxw

This script just gets the PYX flow from the file, initializes a conversion hash (%html, {source_tag => target_tag}), replaces the tags in lines generated by open or close tags, starting either with '(' or ')', and outputs the modified PYX flow, which will then be written back as XML by the pyxw tool.

It is, of course, possible to enhance it, for instance by reading the tag conversion table from a parameter file, by outputting several tags for a single input tag, by processing attributes and so forth. This example shows just how easy it can be to write a simple XML transformation script.
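One of these enhancements, producing several output tags for a single input tag, can be sketched as follows. This is only an illustration under stated assumptions: the conversion table is hard-coded rather than read from a parameter file, and translate_pyx_line and the choice of a nested b tag are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Conversion table: each source tag maps to a list of target tags.
# It is hard-coded here; it could just as well be read from a parameter file.
my %html = (
    users    => ['table'],
    user     => ['tr'],
    login    => [ 'td', 'b' ],    # one input tag, two nested output tags
    fullname => ['td'],
);

# Hypothetical helper: translate one PYX line into one or more PYX lines.
sub translate_pyx_line {
    my ($line) = @_;
    if ( $line =~ /^([()])(.*)$/ ) {
        my ( $type, $tag ) = ( $1, $2 );
        # tags missing from the table are kept as they are
        my @targets = @{ $html{$tag} || [$tag] };
        # closing tags must come out in the reverse order of the opening ones
        @targets = reverse @targets if $type eq ')';
        return map { "$type$_\n" } @targets;
    }
    return $line;    # attribute and text lines pass through unchanged
}

my @pyx = ( "(user\n", "(login\n", "_jsmith\n", ")login\n", ")user\n" );
print map { translate_pyx_line($_) } @pyx;
```

One limitation of this sketch: attribute lines that follow an opening tag end up attached to the innermost generated tag, so real attribute handling would need more care.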

XML::PYX is event-oriented, which means it processes the file as it is being read and parsed. This processing model is both memory and speed-efficient, but can be a little bit more difficult to use than the tree model described below.
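The extra difficulty comes from having to remember where you are in the document, since each line is seen only once. Here is a minimal sketch of that bookkeeping, assuming the goal is to collect the text of every login element from the users document above. The extract_text helper is hypothetical, and the PYX lines are held in an array only to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the text of every element with the given
# name from a list of PYX lines. Nested elements of the same name are
# not handled in this sketch.
sub extract_text {
    my ( $wanted, @lines ) = @_;
    my @found;
    my $inside = 0;     # true while we are between "(wanted" and ")wanted"
    my $buffer = '';
    for my $line (@lines) {
        chomp( my $l = $line );
        if    ( $l eq "($wanted" )            { $inside = 1; $buffer = ''; }
        elsif ( $l eq ")$wanted" )            { $inside = 0; push @found, $buffer; }
        elsif ( $inside && $l =~ /^_(.*)/ )   { $buffer .= $1; }
    }
    return @found;
}

# PYX flow for the two user elements (whitespace-only text lines left out).
my @pyx = map { "$_\n" } (
    '(user', '(login', '_jsmith', ')login', '(fullname', '_John Smith', ')fullname', ')user',
    '(user', '(login', '_jdoe',   ')login', '(fullname', '_John Doe',   ')fullname', ')user',
);

print "$_\n" for extract_text( 'login', @pyx );
```

Only one flag and one buffer are kept in memory, however large the document is, which is where the memory efficiency of the event model comes from.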

XML::Simple

XML::Simple uses a tree model. It parses the entire XML file and loads it into a tree structure in memory. The drawbacks of this approach are, of course, that loading the XML file can take up a lot of memory, and there will be a delay before the program can start outputting results, which can be annoying for CGI scripts. But this is the price to pay to gain the power of being able to access the entire document at once, extract information from anywhere, change it, update it and even output the updated XML.

Loading the document is done using the XMLin function, and outputting an updated version is done with the XMLout function.
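A minimal round trip might look like the sketch below. The server document and the new port value are made up for illustration; XMLin is given a string of XML instead of a file name, and the RootName option of XMLout is used so that the output keeps the original root element name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# A small, hypothetical document; XMLin accepts a file name or,
# as here, a string containing XML.
my $xml = <<'XML';
<server name="www" port="80">
  <alias>web</alias>
  <alias>intranet</alias>
</server>
XML

my $server = XMLin($xml);

# attributes are plain hash fields, repeated elements an array reference
print "$server->{name} listens on port $server->{port}\n";
print "aliases: @{ $server->{alias} }\n";

# update the in-memory structure
$server->{port} = 8080;    # value chosen for illustration

# write it back; without RootName the root element would be named <opt>
my $updated = XMLout( $server, RootName => 'server' );
print $updated;
```

Attribute order and indentation in the output are decided by XMLout and may differ from the input.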

XML::Simple is a very popular module for simple XML. It does not work properly for complex files, typically document-oriented XML, because it does not cope well with mixed content, where text and tags are mixed inside the same element (for example the sentence "This is mixed content" with one of its words wrapped in a tag). It does, however, make it really easy to deal with data-oriented XML.

Here is a simple XML document, which could be a configuration file for a tool:

<config dir="/usr/local/etc" log="/usr/local/log">
  <user id="user1"><group>root</group><group>webadmin</group></user>
  <user id="user2"><group>staff</group><group>webadmin</group></user>
</config>

Reading this file using XML::Simple creates the following structure in memory:

{ 'dir'   => '/usr/local/etc',
  'log'   => '/usr/local/log',
  'user'  => { 'user1' => { 'group' => ['root', 'webadmin'] },
               'user2' => { 'group' => ['staff', 'webadmin'] }
             }
}
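Once the data is in this form, it is navigated with ordinary hash and array dereferencing. In the sketch below the structure is written out literally, copied from the dump above, so that it runs without the XML file; in practice XMLin would return it. The users_in_group helper is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The structure shown above, written out literally for this sketch.
my $config = {
    dir  => '/usr/local/etc',
    log  => '/usr/local/log',
    user => {
        user1 => { group => [ 'root',  'webadmin' ] },
        user2 => { group => [ 'staff', 'webadmin' ] },
    },
};

# attributes of the root element are plain hash fields
print "config dir: $config->{dir}\n";

# users are keyed on their id attribute; groups are an array reference
for my $id ( sort keys %{ $config->{user} } ) {
    my @groups = @{ $config->{user}{$id}{group} };
    print "$id: @groups\n";
}

# Hypothetical helper: sorted ids of all users belonging to a given group.
sub users_in_group {
    my ( $config, $group ) = @_;
    my @ids = grep {
        my $id = $_;
        grep { $_ eq $group } @{ $config->{user}{$id}{group} }
    } keys %{ $config->{user} };
    return sort @ids;
}

print "webadmin: ", join( ', ', users_in_group( $config, 'webadmin' ) ), "\n";
```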

Here is a more complex example, using a file that describes mailing list data:

<mldata type="internal">
  <title>Mailing Lists Data</title>
  <list name="tech"><title>Technical</title>
    <member id="jdeere" admin="1"/>
    <member id="jsmith"/>
    <member id="jbrown"/>
  </list>
  <list name="biz"><title>Business and Managers</title>
    <member id="sadams" admin="1"/>
    <member id="jsmith"/>
    <member id="jdeere"/>
  </list>
  <user id="jdeere"><fullname>John Deere</fullname><email>jdeere@isp.com</email></user>
  <user id="jsmith"><fullname>John Smith</fullname><email>jsmith@isp.com</email></user>
  <user id="jbrown"><fullname>John Brown</fullname><email>jbrown678@aol.com</email></user>
  <user id="sadams"><fullname>Sam Adams</fullname><email>sam.ad@msn.com</email></user>
</mldata>

This file can then be processed by a simple script like:

#!/usr/bin/perl -w
use strict;

use XML::Simple;

# load the lists document
my $mldata= XMLin( './lists.xml');

# title and type are just fields of mldata
print "$mldata->{title} ($mldata->{type})\n";

# list and user info are references to the list and user information
my $lists= $mldata->{list};
my $users= $mldata->{user};

# each value of %$lists is a reference to the list data
foreach my $list (values %$lists)
  { # the list title is a field of $list
    print "$list->{title} List:\n";
    # members is a hash keyed on member ids
    my $members= $list->{member};

    foreach my $id (keys %$members)
      { # print a leading star for admins
        if( exists $members->{$id}->{admin})
          { print "* "; }
        else
          { print "  "; }

        print "$id ";
        # print user info from the users data
        print "$users->{$id}->{fullname} $users->{$id}->{email}\n";
      }
    # blank line between lists
    print "\n";
  }

which will generate a clean list of mailing list members (the order of lists and members may vary, since Perl hashes are unordered):

Mailing Lists Data (internal)
Technical List:
  jbrown John Brown jbrown678@aol.com
  jsmith John Smith jsmith@isp.com
* jdeere John Deere jdeere@isp.com

Business and Managers List:
* sadams Sam Adams sam.ad@msn.com
  jsmith John Smith jsmith@isp.com
  jdeere John Deere jdeere@isp.com

It is really easy, especially when it comes to retrieving user information from member ids.

The key to using XML::Simple is to either dump the structure built from the XML file using Data::Dumper or just to look at it under the Perl debugger. Once you understand what happens to your XML and how it is loaded in memory, it becomes very easy to manipulate.

XML::Simple offers many (not so simple) options to really tweak how to load the XML, so reading the documentation that comes with the module is, of course, recommended.
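As an illustration of both points, the sketch below dumps the same document loaded twice, once with the defaults and once with two of the module's options, ForceArray and KeyAttr. The document is a cut-down variant of the configuration file above in which user2 has a single group, chosen because that is a case where the defaults can surprise.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# user2 has a single <group> element
my $xml = <<'XML';
<config dir="/usr/local/etc">
  <user id="user1"><group>root</group><group>webadmin</group></user>
  <user id="user2"><group>staff</group></user>
</config>
XML

# Defaults: a lone <group> is loaded as a plain string, not an array.
my $default = XMLin($xml);

# ForceArray makes <group> an array reference however many there are;
# KeyAttr states which attribute is used as the hash key for <user>.
my $tuned = XMLin( $xml, ForceArray => ['group'], KeyAttr => { user => 'id' } );

$Data::Dumper::Indent   = 1;    # compact output
$Data::Dumper::Sortkeys = 1;    # stable key order
print Dumper( $default, $tuned );
```

Comparing the two dumps shows why code that loops over @{ $user->{group} } is safer with ForceArray set for that element.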

Conclusion

There are many XML tools available in Perl, as well as in other languages. Their simplicity and the way they "hide" the complexity of XML make PYX and XML::Simple good candidates for Perl programmers who want to use XML.

Have fun with them.

Resources

Pyxie, the original article on XML.com: www.xml.com/pub/2000/03/15/feature/index.html

Pyxie Perfect, describes the Perl and Java implementations: www.xml.com/pub/2000/03/22/pyxie/index.html

XML::Simple, the author site: web.co.nz/~grantm/cpan/

Perl XML modules documentation: http://theoryx5.uwinnipeg.ca/mod_perl/cpan_search?request=cat;catinfo=1120

Note: this article was published in 2000 in Boardwatch magazine. More recent articles about XML and especially Perl & XML can be found on www.xmltwig.com