XML, the Perl Way

Exploring XML Examples

by Michel Rodriguez
Boardwatch Magazine

Let's start with a buzzword: XML is a meta-language, a language used to define languages. HTML, or, more accurately, XHTML, is one of those languages, one amongst an unlimited number of potential languages. Each of those languages corresponds to a type of document.

An important part of designing an XML system consists of designing the types of documents stored in it or exchanged with other systems. This column lists some of the most common types an implementor is likely to use and gives examples and advice on how to design them, from configuration files to database tables and text documents.

DTDs

The common way to define those languages is through a document type definition (DTD). Other ways include Microsoft's XML-Data Reduced (XDR) and the soon-to-be-released XML Schema from the W3C.

DTDs are inherited from SGML. They use a specific (non-XML) syntax to express which elements can be included in a document, their attributes and how they can be combined.

W3C Schemas, as well as XDR, are expressed in a somewhat verbose XML syntax and add to basic DTD features the option to define precisely the type of content of an element or attribute. Microsoft announced it will support W3C Schema as soon as it is available and will stop developing XDR schemas, while continuing to support them.

As DTDs are the most common syntax to define document types at the moment, they will be used in the examples in this column. Note that tools exist to convert DTDs to XDR and will exist to convert them to W3C Schemas, and GUI-based tools are likely to appear as XML usage becomes more common.

A DTD is not technically necessary. It is always possible to use just well-formed (DTD-less) XML, especially in simple cases when it is only used as a storage format but will not be exchanged with others. A DTD is also not the only documentation that should be written for a project. A description of the content of elements and attributes will certainly help recipients of the XML documents or maintainers of the system make sense of it.
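As a minimal sketch of what "just well-formed" means in practice, the Python snippet below (standard library only, not the Perl tools this column is usually about) asks a non-validating parser whether a document parses at all. The helper name is_well_formed is an assumption made for this illustration; no DTD is consulted at any point.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML; no DTD is involved."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: well-formed, even though no DTD describes them.
print(is_well_formed("<config><email>webadmin@mycomp.com</email></config>"))

# The email element is never closed: not well-formed.
print(is_well_formed("<config><email>webadmin@mycomp.com</config>"))
```

A check like this only says that the tags are balanced and properly nested. Whether the right elements are present is exactly what a DTD, or project documentation, has to cover.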

Configuration files

A common use for XML is to store configuration files for various tools. The advantages of using XML instead of a "home-brewed" format are that the format is highly portable between platforms, self-documenting (provided sensible element names are used), easily extensible and easy to process with standard tools.

A (very) simple configuration file, for a tool that would just transfer files from a source directory to a target directory and then send a report to a specified e-mail address, could look like:

<?xml version="1.0"?>
<config>
  <source-dir>/home/webadmin</source-dir>
  <target-dir>/opt/web/infotree</target-dir>
  <email>webadmin@mycomp.com</email>
</config>

For which the DTD would be:

<!ELEMENT config (source-dir, target-dir, email)>
<!ELEMENT source-dir (#PCDATA)>
<!ELEMENT target-dir (#PCDATA)>
<!ELEMENT email      (#PCDATA)>

This DTD defines the four elements config, source-dir, target-dir and email. A config element must contain exactly one of each of the other three, in that order. #PCDATA means Parsed Character Data, or, in layman's terms, just text, so source-dir, target-dir and email just contain text.

Note that, as usual, XML is case-sensitive, so the ELEMENT keyword must be uppercase.

Such a format is nicely human-readable. It is easy to extend it with new parameters, such as a header for the report, which will be ignored by older versions of the tool but used by newer ones. And tools like the XML::Simple Perl module will load it seamlessly into a data structure in memory.
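To make the "data structure in memory" idea concrete, here is a rough Python sketch using only the standard library. It is not XML::Simple, just an approximation of the same idea. The report-header element and the KNOWN_PARAMETERS tuple are assumptions added for the illustration: they stand for a newer parameter and for what an older version of the tool would understand.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """<?xml version="1.0"?>
<config>
  <source-dir>/home/webadmin</source-dir>
  <target-dir>/opt/web/infotree</target-dir>
  <email>webadmin@mycomp.com</email>
  <report-header>Nightly transfer</report-header>
</config>"""

# Parameters an older (hypothetical) version of the tool knows about.
KNOWN_PARAMETERS = ("source-dir", "target-dir", "email")


def load_config(xml_text):
    """Return a dict of the known parameters; unknown elements are skipped."""
    root = ET.fromstring(xml_text)
    config = {}
    for child in root:
        if child.tag in KNOWN_PARAMETERS:
            config[child.tag] = (child.text or "").strip()
    return config


config = load_config(CONFIG_XML)
print(config["source-dir"], "->", config["target-dir"])
```

The older tool simply never looks at report-header, which is what makes this kind of extension painless. A newer version would add the name to its list of known parameters.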

Relational tables

Another common practice is to use XML to exchange tables between incompatible database management systems, or just to make it easier to display them.

Here, the natural way to encode a table in XML would be just to set each row as an element, and each field as a sub-element of the row element.

A typical table would then look like this:

<?xml version="1.0"?>
<stats>
  <player><id>GRICE</id><name>Rice, Glen</name><ppg>16.1</ppg></player>
  <player><id>TKUKOC</id><name>Kukoc, Toni</name><ppg>15.1</ppg></player>
  <player><id>JROSE</id><name>Rose, Jalen</name><ppg>17.8</ppg></player>
</stats>

And for which the DTD would look like:

<!ELEMENT stats (player+)> 
<!ELEMENT player (id, name, ppg)> 
<!ELEMENT id (#PCDATA)> 
<!ELEMENT name (#PCDATA)> 
<!ELEMENT ppg (#PCDATA)>

The + sign means that the stats element includes one or more player elements. Other modifiers include ? (optional) and * (zero or more), just as in regular expressions.
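The regular-expression analogy can be taken quite literally. The Python sketch below is an illustration, not a validator: it writes the two content models by hand as regular expressions over the names of an element's children and checks a document against them. The CONTENT_MODELS table and the children_match helper are assumptions made up for this example.

```python
import re
import xml.etree.ElementTree as ET

STATS_XML = """<stats>
  <player><id>GRICE</id><name>Rice, Glen</name><ppg>16.1</ppg></player>
  <player><id>TKUKOC</id><name>Kukoc, Toni</name><ppg>15.1</ppg></player>
</stats>"""

# Hand-written regex counterparts of the DTD content models.
# Each child name is followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "stats": r"(player )+",     # DTD: (player+)
    "player": r"id name ppg ",  # DTD: (id, name, ppg)
}


def children_match(element):
    """Check the sequence of child names against the element's content model."""
    child_names = "".join(child.tag + " " for child in element)
    return re.fullmatch(CONTENT_MODELS[element.tag], child_names) is not None


root = ET.fromstring(STATS_XML)
print(children_match(root))                      # the stats element itself
print(all(children_match(p) for p in root))      # every player element
```

An empty stats element fails the first pattern, just as it would fail (player+), and a player whose children come in a different order fails the second one.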

An alternative would be to set the id element as an attribute of the player, which would make it slightly easier to process using standard XML tools. A player element would then look like:

<player id="GRICE"><name>Rice, Glen</name><ppg>16.1</ppg></player>

And the DTD would become:

<!ELEMENT stats (player+)>
<!ELEMENT player (name, ppg)>
<!ATTLIST player id      ID #REQUIRED>
<!ELEMENT name   (#PCDATA)>
<!ELEMENT ppg     (#PCDATA)>

Here the attribute declaration indicates that the id attribute is an ID, an attribute whose value must be unique in the document, and that this attribute must be present on every player element. Other types of attributes include CDATA (just plain text), NMTOKEN (a single name-like token), IDREF (a reference to the ID of another element) and choices, for example (guard | forward | center).
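A validating parser enforces these two constraints by itself. With a non-validating one, such as Python's standard xml.etree.ElementTree, they have to be checked by hand. The sketch below does that for player elements only, on a deliberately broken sample: the duplicated GRICE id and the "Nobody" row are invented for the illustration, and check_player_ids is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Deliberately invalid sample: one repeated id and one player without an id.
STATS_XML = """<stats>
  <player id="GRICE"><name>Rice, Glen</name><ppg>16.1</ppg></player>
  <player id="TKUKOC"><name>Kukoc, Toni</name><ppg>15.1</ppg></player>
  <player id="GRICE"><name>Rose, Jalen</name><ppg>17.8</ppg></player>
  <player><name>Nobody</name><ppg>0.0</ppg></player>
</stats>"""


def check_player_ids(xml_text):
    """Return (list of repeated ids, number of players without an id)."""
    seen = set()
    duplicates = []
    missing = 0
    for player in ET.fromstring(xml_text).findall("player"):
        player_id = player.get("id")
        if player_id is None:
            missing += 1                  # breaks #REQUIRED
        elif player_id in seen:
            duplicates.append(player_id)  # breaks ID uniqueness
        else:
            seen.add(player_id)
    return duplicates, missing


duplicates, missing = check_player_ids(STATS_XML)
print(duplicates, missing)
```

Note that real ID uniqueness is document-wide: it covers every attribute declared as ID, on any element, not just the players checked here.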

A document of that type could contain the entire table or just a part of it, typically the result of a query. It can then be easily edited, fed into a different database system or processed with XML tools into an HTML table, which can then be displayed in a browser.
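As an illustration of that last step, the following Python sketch (standard library only) turns the attribute-based version of the table into an HTML table. The function name stats_to_html and the choice of column headers are assumptions made for this example. An XSLT stylesheet or a Perl module would do the same job.

```python
import xml.etree.ElementTree as ET
from html import escape

STATS_XML = """<stats>
  <player id="GRICE"><name>Rice, Glen</name><ppg>16.1</ppg></player>
  <player id="JROSE"><name>Rose, Jalen</name><ppg>17.8</ppg></player>
</stats>"""


def stats_to_html(xml_text):
    """Render each player element as one row of an HTML table."""
    root = ET.fromstring(xml_text)
    rows = ["<tr><th>id</th><th>name</th><th>ppg</th></tr>"]
    for player in root.findall("player"):
        cells = (player.get("id"), player.findtext("name"), player.findtext("ppg"))
        # escape() protects against <, > and & appearing in the data.
        rows.append("<tr>" + "".join(f"<td>{escape(c or '')}</td>" for c in cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


print(stats_to_html(STATS_XML))
```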

Text documents

Of course, XML is also used as a more powerful alternative to HTML for generic documents.

An easy way to add new elements to HTML is to use the XHTML DTD, an XML version of the HTML 4.0 DTD. A W3C working group is currently working on how to formally modularize and extend the DTD, but in the meantime it is already possible to do it by just copying and modifying the DTD, which can be found at www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd.

Adding new elements to the XHTML DTD can be a little tricky, as it is a heavily modular DTD, written using entities (entities are the XML equivalent of macros), which makes it look quite different from the simple DTDs described in the first two sections. It makes it easy, though, to add the element everywhere it needs to be in one fell swoop.

The most important content models in XHTML are:

A new type of heading should go in the heading entity:

<!ENTITY % heading "h1|h2|h3|h4|h5|h6|my-heading">
<!ELEMENT my-heading   %Inline;>
<!ATTLIST my-heading align  (left|center|right) #IMPLIED>

A price element that can be inserted in the flow of text would be defined by adding it to the inline entity:

<!ENTITY % inline "a | %special; | %fontstyle; | %phrase; | %inline.forms; | price">
<!ELEMENT price    (#PCDATA)*>

Creating XML documents based on the XHTML DTD avoids re-inventing the same old paragraph, list and table structures and lets designers focus on the important elements that are specific to the system they are building.

Elements or attributes?

This question has to be treated here, as it is the most frequently asked by XML document designers. Which one should be used to model the various components of a document, elements or attributes?

There is, of course, no hard rule, or easy answer. As shown in the database example, a piece of information can be stored either in an element or an attribute. Attributes are less verbose than elements. They are also unique (an element cannot have two attributes with the same name), unordered (the order in which they are written is meaningless) and unstructured (an attribute value is just a string).

So data represented by an attribute should satisfy those requirements. And that's it! Choosing between attributes and elements is then mostly a question of personal preference.
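These three properties are easy to observe with any XML parser. The short Python sketch below uses the standard library for that purpose. The position attribute and the team element are assumptions borrowed loosely from the earlier examples, purely for illustration.

```python
import xml.etree.ElementTree as ET

# Unordered: the same attributes written in a different order are equivalent.
a = ET.fromstring('<player id="GRICE" position="forward"/>')
b = ET.fromstring('<player position="forward" id="GRICE"/>')
same_attributes = a.attrib == b.attrib

# Unique: repeating an attribute name is a well-formedness error.
try:
    ET.fromstring('<player id="GRICE" id="JROSE"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

# Elements, by contrast, can repeat, keep their order and nest.
team = ET.fromstring("<team><player>Rice</player><player>Rose</player></team>")
repeated = [p.text for p in team.findall("player")]

# Unstructured: an attribute value always comes back as a plain string.
value_type = type(a.get("id"))

print(same_attributes, duplicate_accepted, repeated, value_type)
```

So a value that can occur several times, whose position matters or that has internal structure belongs in an element. Anything else can go either way.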

More information on the subject can be found in Robin Cover's SGML/XML: Using Elements and Attributes at www.oasis-open.org/cover/elementsAndAttrs.html.

Conclusion

Other kinds of document types can, of course, be needed. Documents with much richer semantics (yippee, another buzzword!) than basic HTML fall into this category, for example instructions to perform a specific task that include conditions to be fulfilled before the next step is displayed. So do XML documents too large to be displayed as a single HTML page. There are plenty of cases where the design of the document involves more than the simple cases described above.

Designing document types for a complex system is a complex task! It should not be taken lightly and should be dealt with using the same kind of methodology that applies to designing any complex information system.

An excellent book, written for SGML but still very much relevant to XML document-type design, is Developing SGML DTDs: From Text to Model to Markup, by Eve Maler and Jeanne El Andaloussi (Prentice Hall, 1996, ISBN: 0-13-309881-8).

Also note that a good number of DTDs have been created by various standards committees, trade organizations, companies and individuals: DocBook is used for technical manuals, XMLNews for news articles, XML-RPC to let processes communicate over HTTP. They can usually be re-used, if you can find them, which will be the theme of the next column!

And a last bit of advice: One of the biggest strengths of XML is its flexibility and the ease of adding new element types to documents. This makes it really easy to build a system incrementally, starting from a simple DTD, testing it, then adding new features as they are required.

Note: this article was published in 2000 in Boardwatch magazine. More recent articles about XML and especially Perl & XML can be found on www.xmltwig.com