XML, the Perl Way

XML::Twig Quick Reference

This is a quick list of the most useful features of XML::Twig. Some seldom-used methods and options have been omitted, so if you are using the module and cannot find a way to do something, refer to the complete documentation by running perldoc XML::Twig or by going to xmltwig.com.

Conventions used in this document: arguments whose names start with opt_ are optional; (3.00) denotes methods added in version 3.00 of XML::Twig.

Twig Options

Options set when creating the twig can be written either in a Java-like style, UglyOptionName, or in a more Perl-ish style, a_cool_option_name. They are normalized before being used.
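
As a minimal sketch of the two spellings, the following creates two twigs with the same option written both ways and checks that they serialize a document identically. The sample document and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, only for illustration.
my $xml = '<doc><para>hello</para></doc>';

# The same option, written in both styles.
my $perlish = XML::Twig->new( pretty_print => 'indented' );
my $javaish = XML::Twig->new( PrettyPrint  => 'indented' );

$_->parse($xml) for $perlish, $javaish;

# Both twigs should produce the same output.
my $same = $perlish->sprint eq $javaish->sprint;
print $same ? "same output\n" : "different output\n";
```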

| Option | Value | Description |
| --- | --- | --- |
| twig_handlers | $handlers | $handlers is a ref to a hash expression => sub_ref, where expression is an XPath-like expression that triggers a call to the subroutine referenced by sub_ref. The subroutine receives 2 arguments: the twig itself and the element; $_ is also set to the element. The subroutine is called once the element is completely parsed. |
| twig_roots | $handlers | $handlers is a ref to a hash expression => sub_ref or 1. A twig is built only for the elements for which the expression is true. If the value is a sub_ref, it is called with the twig and the element as arguments. Elements outside the twig roots are ignored (or printed if twig_print_outside_roots is set); elements inside the twig roots are included in the twig and can trigger twig_handlers. |
| twig_print_outside_roots | true or false value | Can only be used if twig_roots is also used. If set to a true value, all parts of the document that are not inside the twig roots are printed. |
| start_tag_handlers | $handlers | $handlers is a ref to a hash expression => sub_ref, where expression is an XPath-like expression that triggers a call to the subroutine referenced by sub_ref. The subroutine receives 2 arguments: the twig itself and the element. It is called as soon as the start tag for the element has been parsed, so the element contains only its attributes, not its sub-elements. |
| keep_encoding | true or false value | Keeps the original encoding of the document. |
| pretty_print | 'nsgmls', 'nice', 'indented', 'record' or 'record_c' | 'nsgmls' is kind of ugly but safe. 'nice' and 'indented' look better but can produce invalid, though still well-formed, XML (the document no longer conforms to its DTD but it is still XML). 'record' and 'record_c' are nice for record-oriented documents. |
| empty_tags | 'html' | By default empty tags are displayed as <empty/>. Setting this option makes them display as <empty /> (an extra space is added before the /) so HTML browsers display them properly. |
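
The following sketch shows twig_handlers and twig_roots side by side, assuming a small made-up document. The handler collects the text of every title; the second twig is built only for para elements, so nothing else ends up under its root.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, only for illustration.
my $xml = '<doc><title>Intro</title><para>First</para><title>Next</title></doc>';

# twig_handlers: the sub is called once each title is completely parsed.
my @titles;
my $twig = XML::Twig->new(
    twig_handlers => {
        title => sub {
            my ( $t, $elt ) = @_;    # the twig and the element
            push @titles, $elt->text;
        },
    },
);
$twig->parse($xml);
print "title: $_\n" for @titles;

# twig_roots: the twig is built only for para elements.
my $partial = XML::Twig->new( twig_roots => { para => 1 } );
$partial->parse($xml);
my @kept = map { $_->gi } $partial->root->children;
print "kept: @kept\n";
```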

Twig Methods

Note that some of the methods only make sense when used in a handler: purge, finish, finish_print for example, while others should only be used once the document is completely parsed: print for example.
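
To illustrate a method that only makes sense inside a handler, here is a minimal sketch that calls purge after each record has been processed. The log/record document is an assumption made for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $count = 0;

my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ( $t, $record ) = @_;
            $count++;       # process the record here
            $t->purge;      # then release the memory used so far
        },
    },
);

# Hypothetical sample document, only for illustration.
$twig->parse('<log><record>a</record><record>b</record><record>c</record></log>');

print "processed $count records\n";
```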

| Method | Arguments | Description |
| --- | --- | --- |
| parse | $string or \*OPEN_FILEHANDLE | Parse a document from a string or from an open filehandle. |
| parsefile | $filename | Parse a document from a file. |
| print | $opt_filehandle | Print the entire document (use only after the parse!), optionally to a filehandle. |
| sprint | | Return the entire document as a string. |
| safe_parse | $string or \*OPEN_FILEHANDLE | Similar to parse, except that it wraps the parsing in an eval block. It returns the twig on success and 0 on failure (the twig object also contains the parsed twig). $@ contains the error message on failure. |
| safe_parsefile | $filename | Same as safe_parse, for a file name. |
| flush | $opt_filehandle | Print the document so far and release the memory for all output elements. Don't forget to flush one last time after the parsing is done, to output the end of the document. |
| purge | | Same as flush except that the twig is not printed. Releases as much memory as possible by purging all closed elements. |
| root | | Return the root element of the twig. |
| first_elt | $opt_gi | Return the first ($opt_gi) element in the twig. |
| get_xpath | $xpath, $opt_offset | Return the list of elements selected by the $xpath expression. If $opt_offset is given, return just one element, the one at that offset in the list. |
| finish | | Unset all handlers and finish parsing the document as fast as possible. |
| finish_print | | Unset all handlers and finish parsing the document as fast as possible, printing the rest of the document as-is. |
| dispose | | Release the memory used by the twig. Use it if your script uses a lot of twigs (3.00). |
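
A short sketch of the most common twig methods, assuming a made-up document held in a string. It parses the string, looks at the root and the first title, serializes the twig, and then shows safe_parse returning a false value on a malformed document.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Hypothetical sample document, only for illustration.
$twig->parse('<doc><section><title>One</title></section></doc>');

my $root_gi     = $twig->root->gi;                    # tag of the root element
my $first_title = $twig->first_elt('title')->text;    # text of the first title
my $as_string   = $twig->sprint;                      # whole document as a string

print "$root_gi: $first_title\n";

# safe_parse wraps the parse in an eval instead of dying.
my $ok = XML::Twig->new->safe_parse('<doc><unclosed></doc>');
print "parse failed: $@\n" unless $ok;
```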

Element Methods

Elements

| Method | Arguments | Description |
| --- | --- | --- |
| print | $opt_filehandle, $opt_pretty_print_style | Print the element. |
| sprint | $opt_no_enclosing_tag | Return the element as a string, including its tags. If $opt_no_enclosing_tag is true, the enclosing tags are omitted (equivalent to xml_string in 3.00). XML base entities are escaped. |
| gi | | Return the gi (the tag) of the element. Equivalent to the tag method (3.00). |
| set_gi | $gi | Set the gi (the tag) of the element to $gi. Equivalent to the set_tag method (3.00). |
| text | | Return the text of the element (without any tags; the text is not XML-escaped). |
| trimmed_text | | Return the trimmed text of the element (without any tags; the text is not XML-escaped): leading and trailing whitespace is removed and all consecutive spaces are collapsed to a single one. |
| new | $opt_gi, $opt_atts, @opt_content | Create a new element. $opt_atts is a ref to a hash of attributes (a la CGI.pm); @opt_content is a list of strings and elements used as the children of the element. |
| parse | $string, %args | Create a new element from $string. %args is a hash with the arguments used to create the twig containing the element. |
| set_text | $text | Set the text of the element. |
| set_content | $opt_atts, @content or $opt_atts, '#EMPTY' | Set the content of the element. $opt_atts is a ref to a hash of attributes (a la CGI.pm); @content is a list of elements and strings; '#EMPTY' creates an empty element. |
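
As an illustration of the element methods above, this sketch builds an element from scratch and reads it back. The tag names, attribute and text are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Build <para class="intro"> with a string and a nested element as content.
my $bold = XML::Twig::Elt->new( b => 'world' );
my $para = XML::Twig::Elt->new( 'para', { class => 'intro' }, 'Hello ', $bold );

my $text  = $para->text;         # text only, no tags
my $xml   = $para->sprint;       # with the enclosing para tags
my $inner = $para->sprint(1);    # without the enclosing para tags

# trimmed_text strips and collapses whitespace.
my $title   = XML::Twig::Elt->new( title => "  XML   the\n Perl way  " );
my $trimmed = $title->trimmed_text;

# Rename the element.
$para->set_gi('p');

print "$text\n$xml\n$inner\n$trimmed\n", $para->gi, "\n";
```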
Attributes

| Method | Arguments | Description |
| --- | --- | --- |
| att | $att | Get the value of the $att attribute, or undef. |
| atts | | Return a reference to a hash containing the attributes of the element. |
| set_att | $att, $value | Set the value of attribute $att. |
| set_atts | $atts_ref | Set the attributes of the element using the hash referenced by $atts_ref. |
| del_att | $att | Delete the $att attribute. |
| del_atts | | Delete all of the attributes of the element. |
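
The attribute methods can be sketched as follows, using a hypothetical img element created just for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical element, only for illustration.
my $elt = XML::Twig::Elt->new( 'img', { src => 'a.png' } );

$elt->set_att( alt => 'logo' );        # add an attribute

my $src     = $elt->att('src');        # existing attribute
my $missing = $elt->att('width');      # undef: no such attribute

$elt->del_att('src');                  # remove one attribute

# atts returns a hash reference with the remaining attributes.
my @names = sort keys %{ $elt->atts };

print "src was $src, remaining attributes: @names\n";
```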
Cut'n Paste

| Method | Arguments | Description |
| --- | --- | --- |
| cut | | Cut the element from the tree. |
| paste | $opt_position, $ref_elt | Paste the element before, after, as first_child (default) or last_child of $ref_elt. |
| move | $opt_position, $ref_elt | Same as paste, but cut the element before pasting it. |
| replace | $ref | Replace $ref by the element in the tree. |
| copy | | Return a "deep" copy of the element. |
| delete | | Cut the element from the tree and delete it. |
| cut_children | | Cut all children of the element; returns the list of children. |
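
A minimal sketch of cutting, pasting, copying and deleting, assuming a made-up three-item list. The comments track the order of the items after each step.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Hypothetical sample document, only for illustration.
$twig->parse('<list><item>a</item><item>b</item><item>c</item></list>');
my $root = $twig->root;

# Move the first item to the end: a b c becomes b c a.
my $first = $root->first_child('item');
$first->cut;
$first->paste( last_child => $root );

# Paste a modified deep copy after it: b c a d.
my $copy = $first->copy;
$copy->set_text('d');
$copy->paste( after => $first );

# Delete the item that is now first: c a d.
$root->first_child('item')->delete;

my @order = map { $_->text } $root->children('item');
print "@order\n";
```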
Navigation

| Method | Arguments | Description |
| --- | --- | --- |
| first_child | $opt_gi | Return the first ($opt_gi) child of the element. |
| last_child | $opt_gi | Return the last ($opt_gi) child of the element. |
| prev_sibling | $opt_gi | Return the previous ($opt_gi) sibling of the element. |
| next_sibling | $opt_gi | Return the next ($opt_gi) sibling of the element. |
| parent | $opt_gi | Return the ($opt_gi) parent of the element. |
| children | $opt_gi | Return the list of ($opt_gi) children of the element. |
| descendants | $opt_gi | Return the list of ($opt_gi) descendants of the element. |
| ancestors | $opt_gi | Return the list of ($opt_gi) ancestors of the element. |
| get_xpath | $xpath, $opt_offset | Return the list of elements selected by the $xpath expression. If $opt_offset is given, return just one element, the one at that offset in the list. |
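
The navigation methods can be tried on a small tree. This is only a sketch; the doc/chapter/title/para structure is an assumption made for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Hypothetical sample document, only for illustration.
$twig->parse(
    '<doc><chapter><title>T1</title><para>p1</para><para>p2</para></chapter></doc>'
);

my $chapter = $twig->root->first_child('chapter');
my $title   = $chapter->first_child('title');

my $next      = $title->next_sibling('para')->text;           # first para after the title
my $last      = $chapter->last_child('para')->text;           # last para of the chapter
my @paras     = map { $_->text } $chapter->children('para');  # all para children
my $parent_gi = $title->parent->gi;                           # tag of the parent
my @ancestors = map { $_->gi } $title->ancestors;             # from the parent up to the root
my @all_paras = $twig->root->descendants('para');             # at any depth under the root

print "next: $next, last: $last, parent: $parent_gi\n";
print "paras: @paras; ancestors: @ancestors\n";
```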

Note: starting with XML::Twig 3.00.10, $opt_gi can be a gi, #ELT (in which case any "real" element is returned), #TEXT (in which case any "text" element, PCDATA or CDATA, is returned), a regexp applied to the gi of elements, or a code reference applied to the element.

Twig Specials

| Method | Arguments | Description |
| --- | --- | --- |
| field | $opt_gi | Return the text of the first ($opt_gi) child of the element. |
| prefix | $string | Prefix the element with $string. |
| suffix | $string | Suffix the element with $string. |
| insert | @gi | For each $gi in @gi, insert an element $gi as the only child of the element; all original children of the element become children of the new element. Returns the innermost element: $table->insert( 'tr', 'td', 'p'); creates a single tr, a single td nested in it, a p nested in the td, and returns the p element. |
| wrap_in | @gi | Wrap the element in elements from @gi and return the outer element: $p->wrap_in( 'td', 'tr', 'table'); puts $p in a table with a single tr and a single td and returns the table element. |
| erase | | Cut the element and paste its children in its place, as if the tag had been erased from the document. |
| in | $parent | Return true if the element is in the element $parent. |
| in_context | $gi, $opt_level | Return true if the element is included in an element whose gi is $gi, optionally within $opt_level levels. The returned value is the innermost including element $gi. |
| inherit_att | $att, @opt_gi | Return the value of an attribute inherited from parent tags. The first value found by looking at the element, then in turn at each of its ancestors (in @opt_gi), is returned. |
| level | $opt_gi | Return the depth of the element in the twig (root is 0). If $opt_gi is given, only ancestors of the given type are counted. |
| next_elt | $opt_root, $opt_gi | Return the next ($opt_gi) element (the next element found in the document after the opening tag of the element). If $opt_root is used, undef is returned when the next element is not under the element $opt_root, so you can use my $elt= $subtree_root; while( $elt= $elt->next_elt( $subtree_root)) { my_process( $elt); } to loop through all elements in $subtree_root. |
| path | | Return a string showing the path to the element, XPath style: /doc/section/title |
| remove_cdata | | Remove all CDATA markers in the element. Useful when you have HTML-in-a-CDATA-section in a document that you want to ignore during processing, but that you might want to output as markup when converting to HTML. |
| simplify | same arguments as XMLin in XML::Simple | (experimental in 3.10) Generate a data structure similar to the one generated by XML::Simple's XMLin for an element. |
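
A few of these methods can be sketched on a made-up person record. The element names, including the identity wrapper, are hypothetical and chosen only for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Hypothetical sample document, only for illustration.
$twig->parse('<doc><person><name>Ann</name><age>30</age></person></doc>');

my $person   = $twig->root->first_child('person');
my $name_elt = $person->first_child('name');

my $name  = $person->field('name');    # text of the first name child
my $path  = $name_elt->path;           # XPath-style path to the element
my $level = $name_elt->level;          # depth, the root being 0

# Wrap the name in a new element, then erase the age tag (its text stays).
$name_elt->wrap_in('identity');
$person->first_child('age')->erase;

my $result = $twig->sprint;
print "$name $path $level\n$result\n";
```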

XPath-like Syntax

XPath-like syntax is used in 2 places: to trigger handlers and in the get_xpath method.

handler triggers

This table describes the types of pseudo-XPath expressions that can trigger handlers. Expression types are listed from highest to lowest priority. If several expressions match, their handlers are stacked and called in turn until one of them returns a false value.

Getting only one handler to be triggered for each element is generally regarded as a good way to keep one's sanity...

Convention: in the expressions below, root, gi, child_gi, elt, subelt, att, val and foo are variable parts; everything else is literal.

| Syntax | Description |
| --- | --- |
| _all_ | Always triggers the handler (even if a previous handler returns a false value). |
| *[@att] | Triggers if the attribute att exists for the element. |
| *[@att='val'] | Triggers if the attribute att exists for the element and is equal to val (a string comparison is performed, not a numeric one). |
| gi[string()="foo"] | Triggers the handler if the gi of the element is gi and its text is foo; the text is the result of the element's text method. Cannot be used for twig_roots and start_tag_handlers. |
| gi[string(child_gi)="foo"] | Triggers the handler if the gi of the element is gi and the text of one of its direct child_gi children is foo. Cannot be used for twig_roots and start_tag_handlers. |
| gi[string()=~ /foo/] | Triggers the handler if the gi of the element is gi and its text matches foo; the i, m, s and o modifiers can be used to modify the regexp. Cannot be used for twig_roots and start_tag_handlers. |
| gi[string(child_gi)=~ /foo/] | Triggers the handler if the gi of the element is gi and the text of one of its direct child_gi children matches foo. Cannot be used for twig_roots and start_tag_handlers. |
| gi[@att] | Triggers the handler for gi elements with an attribute att. |
| gi[@att="val"] | Triggers the handler for gi elements with an attribute att whose value is val. |
| /root/elt/subelt | Triggers the handler for elements matching this exact path, starting from the root. |
| elt/subelt | Triggers the handler for elements matching this path. |
| elt | Triggers the handler for all elt elements. |
| _default_ | Triggers the handler if no other handler has been triggered. |
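
As a sketch of how these triggers look in practice, the following registers one handler on an attribute condition and one on an exact path. The document and the collected lists are made up for illustration; the handlers use $_, which is set to the element.

```perl
use strict;
use warnings;
use XML::Twig;

my ( @warnings, @doc_titles );

my $twig = XML::Twig->new(
    twig_handlers => {
        # only para elements whose type attribute is "warning"
        'para[@type="warning"]' => sub { push @warnings, $_->text },
        # only the title directly under the root, not nested titles
        '/doc/title' => sub { push @doc_titles, $_->text },
    },
);

# Hypothetical sample document, only for illustration.
$twig->parse(
    '<doc><title>Main</title><section><title>Sub</title>'
  . '<para type="warning">Hot</para><para>Plain</para></section></doc>'
);

print "warnings: @warnings\ntitles: @doc_titles\n";
```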

get_xpath method

Summary of the syntax:

| Syntax | Description |
| --- | --- |
| gi | Selects gi elements. |
| gi[1] | Selects the first gi element. Any integer, positive or negative, can be used; negative integers count from the last element. |
| gi[last()] | Selects the last gi element. |
| gi[@att] | Selects the gi elements which have an attribute att. |
| gi[@att="val"] | Selects the gi elements with an att attribute equal to val. |
| gi[att1="val1" and att2="val2"] | Selects the gi elements for which both conditions are true. |
| gi[att1="val1" or att2="val2"] | Selects the gi elements for which at least one of the conditions is true. |
| gi[string()="toto"] | Selects gi elements whose text (as per the text method) is toto. |
| gi[string()=~/regexp/] | Selects gi elements whose text matches regexp. |
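
A minimal sketch of get_xpath with some of these expressions, assuming a made-up document with three para elements. The offset argument counts from 0.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Hypothetical sample document, only for illustration.
$twig->parse(
    '<doc><para type="note">n1</para>'
  . '<para type="warning">w1</para><para type="warning">w2</para></doc>'
);
my $root = $twig->root;

my @all      = $root->get_xpath('para');                     # every para child
my @warnings = $root->get_xpath('para[@type="warning"]');    # filtered on an attribute
my ($first)  = $root->get_xpath('para[1]');                  # first para child

# With an offset, a single element is returned.
my $second_warning = $root->get_xpath( 'para[@type="warning"]', 1 );

print scalar(@all), " paras, ", scalar(@warnings), " warnings\n";
print "first: ", $first->text, ", second warning: ", $second_warning->text, "\n";
```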

In addition:

Examples

para selects the para element children of the current element
* selects all element children of the current element
para[1] selects the first para child of the current element
para[last()] selects the last para child of the current element
*/para selects all para grandchildren of the current element
/doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
chapter//para selects the para element descendants of the chapter element children of the current element
//para selects all the para descendants of the document root and thus selects all para elements in the same document as the current element
//olist/item selects all the item elements in the same document as the current element that have an olist parent
.//para selects the para element descendants of the current element
.. selects the parent of the current element
para[@type="warning"] selects all para children of the current element that have a type attribute with value warning
employee[@secretary and @assistant] selects all the employee children of the current element that have both a secretary attribute and an assistant attribute
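
Three of the path forms above can be compared on a small tree. This is a sketch under the assumption of a made-up document with para elements at different depths; the comments list the text of the elements each expression is expected to select.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Hypothetical sample document, only for illustration.
$twig->parse(
    '<doc><chapter><para>a</para><section><para>b</para></section></chapter>'
  . '<para>c</para></doc>'
);
my $chapter = $twig->root->first_child('chapter');

# para children of the current element: a
my @children = map { $_->text } $chapter->get_xpath('para');

# para descendants of the current element: a b
my @descendants = map { $_->text } $chapter->get_xpath('.//para');

# all para elements in the document: a b c
my @everywhere = map { $_->text } $chapter->get_xpath('//para');

print "children: @children\n";
print "descendants: @descendants\n";
print "everywhere: @everywhere\n";
```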