Ways to Rome 2 - Kourallinen Dollareita
last updated Thu Aug 21 18:16:07 2003
Introduction
This article follows the original Ways to Rome article. It shows how to process XML data using some of the numerous XML modules on CPAN.
The task for this article is to process data-oriented XML. Data-oriented, as opposed to text-oriented XML, is XML that essentially contains data, often extracted from a data base, or resulting from the serialization of an object, or just configuration data for a piece of software. A lot of XMl modules are specifically oriented towards processing data-oriented XML, which is typicaly easier to process than text-oriented XML (it is usually more regular, doesn't include mixed-content...).
Specification of the problem
Finvoice
The data to process will be XML invoices. Those invoices follow the Finvoice specification. Finvoice is the electronic invoice of the Finnish Bankers' Association. It is one of the few publically available standard way of exchanging invoices between a seller and a buyer. Visa also has published the Visa XML Invoice Specification which is quite a bit more complex than the Finvoice one, so I stuck to the simpler one. SAP also has a specification for invoices, but it does not seem to be public. Note that none of my emails asking for information and support to Finvoice, Visa and SAP was ever answered, so don't expect much support in the XML invoice World ;--(
The Finvoice DTD is quite comprehensive and look quite usable for general invoicing.
I found that it includes a couple of content model that in my experience are not optimal for processing, where elements that belong together are not wrapped in a containing element:
<! ELEMENT Finvoice (SellerPartyDetails, SellerOrganisationUnitNumber, SellerContactPersonName?,
SellerCommunicationDetails?, SellerInformationDetails?, InvoiceRecipientPartyDetails?,
BuyerPartyDetails, BuyerOrganisationUnitNumber?, BuyerContactPersonName?,
BuyerCommunicationDetails?, DeliveryPartyDetails?, DeliveryDetails?, InvoiceDetails,
PaymentStatusDetails?, PartialPaymentDetails*, VirtualBankBarcode?, InvoiceUrlText?,
InvoiceRow+, SpecificationDetails?, EpiDetails)>
|
I would much rather have this as:
<! ELEMENT Finvoice (SellerInfo, BuyerInfo, DeliveryInfo?, InvoiceDetails, PaymentDetails,
InvoiceRows, SpecificationDetails?, EpiDetails)>
|
with the sub-elements containing related data. I especially do not like the fact that InvoiceRow is repeated in the "main" Finvoice element, which I don't think is good practice.
Overall the Finvoice DTD is quite simple, most elements have non-contextual names: names that identify them uniquely without needing to know the context. For example there is a SellerStreetName and a BuyerStreetName. While this makes the DTD harder to maintain (if you change the address model you need to change it in several places), it also makes processing quite easier. The only, pretty annoying, exceptions to this rule are the CountryCode and CountryName, which are used in the SellerPostalAddressDetails and BuyerPostalAddressDetails. This is quite a pain because it forces the code to deal with the context. Using non-contextual element names is a valid design option, but not being consistent about it gives you the worst of both worlds: the DTD is hard to maintain and the code has to deal with the context.
In the end Finvoice is quite a convenient DTD and understanding it and using it was not too difficult.
It should be noted that, as for much "standard" DTDs in my experience, it is quite flexible. It allows for exemple invoice rows to be either highly structured (with a lot of optional elements) or... just plain text. This of course increases the complexity for the developer (but makes the DTD usable in the Real World).
A few words about the invoices
These invoice are an interesting class of document: they are clearly data-oriented: thhey have no mixed content and they are very structured. But on the other hand the order of elements is quite significant, at least for display purposes, which is a feature usually found in document-oriented XML. So as long as the task consists solely in checking them and extracting information from them they can be considered as strictly data-oriented, but if we need to modify them, they they should be dealt with as documents.
Additional characteristics of the invoices: they are quite small, and I would hope a single company wouldn't receive several thousands of then every minute. This allows us to load each invoice in memory, and not to have to worry too much about performance (especially as the cost of processing the XML is likely to be small compared to the time spend talking to the data base). This explains the lack of benchmark in this article (at least the initial version).
The task
The task that the various examples will perform is a typical application of what should be done with such data: process invoices, coming in separate files, check them (check that they reference an existing purchase order, that they are addressed to the right company...) and if they are ok store some of the data in a data base (created with DBI and DBD::SQLite for easy set-up). For extra credit the examples can output documents that do not pass the checks, in the original form plus the error messages as new XML elements at the beginning of the document. Note that this is quite arbitrary, and is there mostly to show how to create new elements with each module. A "real" application would likely use a different method to report errors.
Potential problems not dealt with in these examples are:
- Encoding problems: no matter what the encoding of the original documents is (it is usually ISO-8859-1, at least in the examples provided by Finvoice) they are stored as UTF-8 in the DB. This is actually a sane decision to make if your DB supports it, but it might be hard to do if you are dealing with a legacy system. In this case use of Text::Iconv or Encode should solve the problem
The Examples
XML::Simple
XML::Simple is extremely convenient for loading XML data into a Perl data structure, using XMLin, and then forgetting that it was ever in XML.
The code was very easy to write. I used YAML during the debugging phase to dump the content of the $xml structure, so it was really easy to see where was the data I needed.
Note the 2 options for XMLin:
-
forcearray => [ qw(InvoiceRow)]
- this will cause
XMLinto load the InvoiceRow's in an array, even if there is only one in the document (without that option a single InvoiceRow would be turned into a hash value instead of an array, which would have forced me to test it). -
forcecontent => 1
- despite the documentation for XML::Simple stating that this option is rarely used I found it made it easier and safer to write the code: all text content is stored in the
contentfield of a hash. This makes accessing the values a little uglier (you have to add an extra->{content}in the expression), but in fact it saved me treating differently elements that had attributes and elements which didn't, plus it's an extra security if new attributes are added to some elements of the DTD, or for optional attributes, that would cause the generated data structure to be different depending on them being present or not (for example inOrderedQuantitytheQuantityUnitCodeattribute is optional). If you use XML::Simple for data where the DTD is simpler and where attributes are always set then you might want not to use this option. In this case I found it convenient.
Finally note that XML::Simple cannot output a modified document, as it looses the order of elements in the document (once elements are assigne to a hash their order is lost), so $CAN_OUTPUT is set to 0 and the error messages are output to the console, rather than creating a new document.
This might or might not be a problem, depending on your specific case.
The complete example is in wtr2_simple
1 #!/usr/bin/perl -w
2 use strict;
3
4 use XML::Simple;
5
6 use FindBin qw($Bin);
7 use lib $Bin;
8 use wtr2_base;
9
10 my $DEBUG=0;
11
12 init_db();
13
14 # XML::Simple cannot output properly the document
15 # the order of the elements will be lost
16 my $CAN_OUTPUT= 0;
17
18 my @files= @ARGV || (<$dir{invoices}/*.xml>);
19
20 foreach my $file (@files)
21 { my $xml= XMLin( $file, forcearray => [ qw(InvoiceRow)], forcecontent => 1);
22
23 my $errors= check_invoice( $xml);
24
25 if( !@$errors)
26 { store_invoice( $xml); }
27 else
28 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
29 if( $CAN_OUTPUT)
30 { my $rejected_file= rejected( $file);
31 print "adding errors in $rejected_file\n" if( $DEBUG);
32 add_errors( $xml, $errors);
33 output_doc_to_check( $rejected_file, $xml);
34 }
35 }
36 }
37
38 exit;
39
40 sub check_invoice
41 { my( $xml)= @_;
42 my $errors=[]; # array ref, holds the error messages
43
44 check_buyer( $xml->{BuyerPartyDetails}->{BuyerPartyIdentifier}->{content},
45 $xml->{BuyerPartyDetails}->{BuyerOrganisationName}->{content},
46 $errors
47 );
48 check_po( $xml->{InvoiceDetails}->{OrderIdentifier}->{content}, $errors);
49
50 my @rows= @{$xml->{InvoiceRow}};
51
52 reset_default_row_id();
53
54 foreach my $row( @rows)
55 { # this does not cope well with broken row numbers
56
57 my $row_id= $row->{RowIdentifier}->{content} || default_row_id();
58
59 print "checking row $row_id\n" if $DEBUG;
60
61 check_qtty( $row_id,
62 $row->{DeliveredQuantity}->{content},
63 $row->{DeliveredQuantity}->{QuantityUnitCode},
64 $row->{OrderedQuantity}->{content},
65 $row->{OrderedQuantity}->{QuantityUnitCode},
66 $errors
67 );
68 }
69
70 return $errors;
71 }
72
73 sub store_invoice
74 { my( $xml)= @_;
75 print "storing invoice $xml->{InvoiceDetails}->{InvoiceNumber}->{content}\n";
76
77 # build the various data structures
78 my $data;
79
80 my $invoice = $xml->{InvoiceDetails};
81 $data->{invoice} = { number => $invoice->{InvoiceNumber}->{content},
82 date => $invoice->{InvoiceDate}->{content},
83 po => $invoice->{OrderIdentifier}->{content},
84 amount_no_tax => $invoice->{InvoiceTotalVatExcludedAmount}->{content},
85 tax => $invoice->{InvoiceTotalVatAmount}->{content},
86 amount => $invoice->{InvoiceTotalVatIncludedAmount}->{content},
87 payment_status => $xml->{PaymentStatusDetails}->{PaymentStatusCode}->{content},
88 };
89
90 my $seller = $xml->{SellerPartyDetails};
91 $data->{seller} = { identifier => $seller->{SellerPartyIdentifier}->{content},
92 name => $seller->{SellerOrganisationName}->{content},
93 tax_code => $seller->{SellerOrganisationTaxCode}->{content},
94 };
95
96 my $address = $xml->{SellerPartyDetails}->{SellerPostalAddressDetails};
97 $data->{address} = { street => $address->{SellerStreetName}->{content},
98 town => $address->{SellerTownName}->{content},
99 zip => $address->{SellerPostCodeIdentifier}->{content},
100 country_code => $address->{CountryCode}->{content},
101 po_box => $address->{SellerPostOfficeBoxIdentifier}->{content},
102 };
103
104 $data->{contact} = { name => $xml->{SellerContactPersonName}->{content},
105 phone => $xml->{SellerCommunicationDetails}->{SellerPhoneNumberIdentifier}->{content},
106 email => $xml->{SellerCommunicationDetails}->{SellerEmailaddressIdentifier}->{content},
107 };
108
109 $data->{invoicerow} ||= [];
110 reset_default_row_id();
111
112 foreach my $invoicerow (@{$xml->{InvoiceRow}})
113 { push @{$data->{invoicerow}},
114 { row_id => $invoicerow->{RowIdentifier}->{content} || default_row_id(),
115 sku => $invoicerow->{ArticleIdentifier}->{content},
116 name => $invoicerow->{ArticleName}->{content},
117 qty => $invoicerow->{DeliveredQuantity}->{content},
118 qty_unit => $invoicerow->{DeliveredQuantity}->{QuantityUnitCode},
119 unit_price => $invoicerow->{UnitPriceAmount}->{content},
120 amount_no_tax => $invoicerow->{RowVatExcludedAmount}->{content},
121 tax => $invoicerow->{RowVatAmount}->{content},
122 amount => $invoicerow->{RowAmount}->{content},
123 }
124 }
125
126
127 store_all( $data);
128 }
|
XML::Smart
I wrote this XML::SMART example after having written the XML::Simple one and I must say I was impressed by how compatible it was: it took me a good 10 minutes to modufy the wtr2_simple code to get this one: basically understanding that I needed to work with <$xml-{Finvoice} >> instead of directly the original $xml object, and removing all the extra ->{content} calls as XML::Smart lets me happily get the content of an element using $row->{DeliveredQuantity} and the value of an attribute with $row->{DeliveredQuantity}->{QuantityUnitCode}. Very neat.
In 10 more minutes I could add the errors, by just unshifting the newly created container in the document object.
One big annoyance with this module: I had to remove the -w switch as I kept getting warnings. This should be in the TODO list of the author (along with the unnecessary attacks against XML::Simple in the docs, the module can stand on its own and certainly does not need them).
The complete example is in wtr2_smart
1 #!/usr/bin/perl
2 use strict;
3
4 use XML::Smart;
5
6 use FindBin qw($Bin);
7 use lib $Bin;
8 use wtr2_base;
9
10 init_db();
11
12 my $DEBUG=0;
13
14 my $CAN_OUTPUT= 1;
15
16 my @files= @ARGV || (<$dir{invoices}/*.xml>);
17
18 foreach my $file (@files)
19 { my $doc= XML::Smart->new( $file);
20 my $xml= $doc->{Finvoice};
21
22 my $errors= check_invoice( $xml);
23
24 if( !@$errors)
25 { store_invoice( $xml); }
26 else
27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
28 if( $CAN_OUTPUT)
29 { my $rejected_file= rejected( $file);
30 print "adding errors in $rejected_file\n" if( $DEBUG);
31 add_errors( $doc, $errors);
32 output_doc_to_check( $rejected_file, $doc);
33 }
34 }
35 }
36
37 exit;
38
39 sub check_invoice
40 { my( $xml)= @_;
41 my $errors=[]; # array ref, holds the error messages
42
43 check_buyer( $xml->{BuyerPartyDetails}->{BuyerPartyIdentifier},
44 $xml->{BuyerPartyDetails}->{BuyerOrganisationName},
45 $errors
46 );
47 check_po( $xml->{InvoiceDetails}->{OrderIdentifier}, $errors);
48
49 my @rows= @{$xml->{InvoiceRow}};
50
51 reset_default_row_id();
52
53 foreach my $row( @rows)
54 { # this does not cope well with broken row numbers
55
56 my $row_id= $row->{RowIdentifier} || default_row_id();
57
58 print "checking row $row_id\n" if $DEBUG;
59
60 check_qtty( $row_id,
61 $row->{DeliveredQuantity},
62 $row->{DeliveredQuantity}->{QuantityUnitCode},
63 $row->{OrderedQuantity},
64 $row->{OrderedQuantity}->{QuantityUnitCode},
65 $errors
66 );
67 }
68
69 return $errors;
70 }
71
72 sub store_invoice
73 { my( $xml)= @_;
74 print "storing invoice $xml->{InvoiceDetails}->{InvoiceNumber}\n";
75
76 # build the various data structures
77 my $data;
78
79 my $invoice = $xml->{InvoiceDetails};
80 $data->{invoice} = { number => $invoice->{InvoiceNumber},
81 date => $invoice->{InvoiceDate},
82 po => $invoice->{OrderIdentifier},
83 amount_no_tax => $invoice->{InvoiceTotalVatExcludedAmount},
84 tax => $invoice->{InvoiceTotalVatAmount},
85 amount => $invoice->{InvoiceTotalVatIncludedAmount},
86 payment_status => $xml->{PaymentStatusDetails}->{PaymentStatusCode},
87 };
88
89 my $seller = $xml->{SellerPartyDetails};
90 $data->{seller} = { identifier => $seller->{SellerPartyIdentifier},
91 name => $seller->{SellerOrganisationName},
92 tax_code => $seller->{SellerOrganisationTaxCode},
93 };
94
95 my $address = $xml->{SellerPartyDetails}->{SellerPostalAddressDetails};
96 $data->{address} = { street => $address->{SellerStreetName},
97 town => $address->{SellerTownName},
98 zip => $address->{SellerPostCodeIdentifier},
99 country_code => $address->{CountryCode},
100 po_box => $address->{SellerPostOfficeBoxIdentifier},
101 };
102
103 $data->{contact} = { name => $xml->{SellerContactPersonName},
104 phone => $xml->{SellerCommunicationDetails}->{SellerPhoneNumberIdentifier},
105 email => $xml->{SellerCommunicationDetails}->{SellerEmailaddressIdentifier},
106 };
107
108 $data->{invoicerow} ||= [];
109 reset_default_row_id();
110
111 foreach my $invoicerow (@{$xml->{InvoiceRow}})
112 { push @{$data->{invoicerow}},
113 { row_id => $invoicerow->{RowIdentifier} || default_row_id(),
114 sku => $invoicerow->{ArticleIdentifier},
115 name => $invoicerow->{ArticleName},
116 qty => $invoicerow->{DeliveredQuantity},
117 qty_unit => $invoicerow->{DeliveredQuantity}->{QuantityUnitCode},
118 unit_price => $invoicerow->{UnitPriceAmount},
119 amount_no_tax => $invoicerow->{RowVatExcludedAmount},
120 tax => $invoicerow->{RowVatAmount},
121 amount => $invoicerow->{RowAmount},
122 }
123 }
124
125
126 store_all( $data);
127 }
128
129
130 sub add_errors
131 { my( $doc, $error_messages)= @_;
132 my $errors= { error => [@$error_messages]};
133 unshift @$doc, $errors;
134 return $doc;
135 }
136
137 sub output_doc_to_check
138 { my( $file, $doc)= @_;
139 open( FILE, ">$file") or die "cannot create file to check $file: $!";
140 print FILE $doc->data;
141 close FILE;
142 }
|
XML::Twig
An XML::Twig version based on the XML::Simple code
This code uses XML::Twig to process the invoices
As this code was written after the wtr2_simple example, it was very, very easy to write. I blatantly cheated ;--): it uses XML::Twig simplify method, which generates the same data structure as XML::Simple for a document (or an element) (available only in XML::Twig 3.10). So once the XML document is parsed in memory a call to that method gives me a data structure that can be processed with exactly the same code as in wtr2_simple.
The original document is still in memory though, which makes it possible to update it with the error messages and output it.
Note that in this example the only modification of the original document is the addition of the error messages. If I had needed to change data within the document before outputing it (for example to link the error messages to the place where th error occurs), then I would have had to access it using XML::Twig methods, see wtr2_twig.
The complete example is in wtr2_twig_simple
Here are the parts that differ from wtr2_simple:
...
20 { my $doc= XML::Twig->new( pretty_print => 'indented')->parsefile( $file);
21 my $xml= $doc->simplify( forcearray => [ qw(InvoiceRow)], forcecontent => 1);
...
|
A "native" XML::Twig version (in full-tree mode)
This code uses XML::Twig to process the invoices. It uses the full-tree mode: load the entire XML document through the parsefile method and then process it.
It uses mostly navigation to access the information, the first_child and field methods. New elements are created using the insert_new_elt method.
This was easy to write, but you would expect so, as I wrote XML::Twig and I am obviously quite familiar with it ;--)
The complete example is in wtr2_twig
1 #!/usr/bin/perl -w
2 use strict;
3
4 use XML::Twig;
5
6 use FindBin qw($Bin);
7 use lib $Bin;
8 use wtr2_base;
9
10 init_db();
11
12 my $DEBUG=0;
13
14 # XML::Twig can output the updated document, whith the error messages
15 my $CAN_OUTPUT= 1;
16
17 my @files= @ARGV || (<$dir{invoices}/*.xml>);
18
19 foreach my $file (@files)
20 { my $doc= XML::Twig->new( pretty_print => 'indented')->parsefile( $file);
21
22 my $errors= check_invoice( $doc);
23
24 if( !@$errors)
25 { store_invoice( $doc); }
26 else
27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
28 if( $CAN_OUTPUT)
29 {my $rejected_file= rejected( $file);
30 print "adding errors in $rejected_file\n" if( $DEBUG);
31 add_errors( $doc, $errors);
32 output_doc_to_check( $rejected_file, $doc);
33 }
34 };
35 }
36
37 exit;
38
39 sub check_invoice
40 { my( $doc)= @_;
41 my $root= $doc->root;
42 my $errors=[]; # array ref, holds the error messages
43
44 check_buyer( $root->first_child( 'BuyerPartyDetails')->field( 'BuyerPartyIdentifier'),
45 $root->first_child( 'BuyerPartyDetails')->field( 'BuyerOrganisationName'),
46 $errors
47 );
48 check_po( $root->first_child( 'InvoiceDetails')->field( 'OrderIdentifier'), $errors);
49
50 my @rows= $root->children( 'InvoiceRow');
51
52 reset_default_row_id();
53
54 foreach my $row( @rows)
55 { # this does not cope well with broken row numbers
56
57 my $row_id= $row->field( 'RowIdentifier') || default_row_id();
58
59 print "checking row $row_id\n" if $DEBUG;
60
61 my $DeliveredQuantity = $row->first_child( 'DeliveredQuantity');
62 my $OrderedQuantity = $row->first_child( 'OrderedQuantity');
63 my $delivered_qty = $DeliveredQuantity ? $DeliveredQuantity->text : 0;
64 my $delivered_unit = $DeliveredQuantity ? $DeliveredQuantity->att( 'QuantityUnitCode') : '';
65 my $ordered_qty = $OrderedQuantity ? $OrderedQuantity->text : 0;
66 my $ordered_unit = $OrderedQuantity ? $OrderedQuantity->att( 'QuantityUnitCode') : '';
67
68 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
69 }
70
71 return $errors;
72 }
73
74 sub store_invoice
75 { my( $doc)= @_;
76 my $root= $doc->root;
77 print "storing invoice " . $root->first_child( 'InvoiceDetails')->field('InvoiceNumber') . "\n";
78
79 # build the various data structures
80 my $data;
81
82 my $invoice = $root->first_child( 'InvoiceDetails');
83 $data->{invoice} = { number => $invoice->field( 'InvoiceNumber'),
84 date => $invoice->field( 'InvoiceDate'),
85 po => $invoice->field( 'OrderIdentifier'),
86 amount_no_tax => $invoice->field( 'InvoiceTotalVatExcludedAmount'),
87 tax => $invoice->field( 'InvoiceTotalVatAmount'),
88 amount => $invoice->field( 'InvoiceTotalVatIncludedAmount'),
89 payment_status => $root->first_child( 'PaymentStatusDetails')
90 ->field( 'PaymentStatusCode'),
91 };
92
93 my $seller = $root->first_child( 'SellerPartyDetails');
94 $data->{seller} = { identifier => $seller->field( 'SellerPartyIdentifier'),
95 name => $seller->field( 'SellerOrganisationName'),
96 tax_code => $seller->field( 'SellerOrganisationTaxCode'),
97 };
98
99 my $address = $root->first_child( 'SellerPartyDetails')
100 ->first_child( 'SellerPostalAddressDetails');
101 $data->{address} = { street => $address->field( 'SellerStreetName'),
102 town => $address->field( 'SellerTownName'),
103 zip => $address->field( 'SellerPostCodeIdentifier'),
104 country_code => $address->field( 'CountryCode'),
105 po_box => $address->field( 'SellerPostOfficeBoxIdentifier'),
106 };
107
108 $data->{contact} = { name => $root->field( 'SellerContactPersonName'),
109 phone => $root->first_child( 'SellerCommunicationDetails')
110 ->field( 'SellerPhoneNumberIdentifier'),
111 email => $root->first_child( 'SellerCommunicationDetails')
112 ->field( 'SellerEmailaddressIdentifier'),
113 };
114
115 $data->{invoicerow} ||= [];
116 reset_default_row_id();
117
118 foreach my $invoicerow ($root->children( 'InvoiceRow'))
119 { my $DeliveredQuantity= $invoicerow->first_child( 'DeliveredQuantity');
120 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->att( 'QuantityUnitCode') : '';
121
122 push @{$data->{invoicerow}},
123 { row_id => $invoicerow->field( 'RowIdentifier') || default_row_id(),
124 sku => $invoicerow->field( 'ArticleIdentifier'),
125 name => $invoicerow->field( 'ArticleName'),
126 qty => $invoicerow->field( 'DeliveredQuantity'),
127 qty_unit => $qty_unit,
128 unit_price => $invoicerow->field( 'UnitPriceAmount'),
129 amount_no_tax => $invoicerow->field( 'RowVatExcludedAmount'),
130 tax => $invoicerow->field( 'RowVatAmount'),
131 amount => $invoicerow->field( 'RowAmount'),
132 }
133 }
134
135
136 store_all( $data);
137 }
138
139 sub add_errors
140 { my( $doc, $error_messages)= @_;
141 my $errors= $doc->root->insert_new_elt( first_child => 'errors');
142 foreach my $message (@$error_messages)
143 { $errors->insert_new_elt( last_child => error => $message); }
144 return $doc;
145 }
146
147 sub output_doc_to_check
148 { my( $file, $doc)= @_;
149 open( FILE, ">$file") or die "cannot create file to check $file: $!";
150 $doc->print( \*FILE);
151 close FILE;
152 }
|
XML::DOM
This code uses XML::DOM to process the invoices
I have never liked the DOM. The Object Model is good, very complete and solid, but the API is very Java oriented (camels are good as Perl mascottes, not as veryLongMethodNames), and at least at level 1 (XML::DOM is a level 1 DOM implementation) quite weak and indeed dangerous.
Having gotten this out of my system...
Writing the DOM example wasn't particulary hard. It was long and quite painful, but more boring than difficult.
A minor annoyance, that could actually be a blessing in a different context: every time I needed to access an attribute for an optional element (DeliveredQuantity for example) I had to check the existence of the element, or calling getAttribute on undef (the non existent element) would cause the script to die. XML::Simple and the likes, by contrast, would let me access non existent hash values in the Perl data structure without complaining. Having to check is a pain in a short script like this one, but could be very useful in a bigger project, as it is always better to have the code die with a bang than fail silently (and cause errors later).
I wrote a little layer on top of it, with the first_child, children and text functions, that basically ensure that when I get a child I get the proper one, and not an extra whitespace, comment or other, that would happen to be in the XML.
The complete example is in wtr2_dom
1 #!/usr/bin/perl -w
2 use strict;
3
4 use XML::DOM;
5
6 use FindBin qw($Bin);
7 use lib $Bin;
8 use wtr2_base;
9
10 init_db();
11
12 my $DEBUG=0;
13
14 my $CAN_OUTPUT= 1;
15
16 my @files= @ARGV || (<$dir{invoices}/*.xml>);
17
18 foreach my $file (@files)
19 { my $parser = new XML::DOM::Parser;
20 my $doc = $parser->parsefile ( $file);
21
22 my $errors= check_invoice( $doc);
23
24 if( !@$errors)
25 { store_invoice( $doc); }
26 else
27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
28 if( $CAN_OUTPUT)
29 { my $rejected_file= rejected( $file);
30 print "adding errors in $rejected_file\n" if( $DEBUG);
31 add_errors( $doc, $errors);
32 output_doc_to_check( $rejected_file, $doc);
33 }
34 };
35 }
36
37 exit;
38
39 sub check_invoice
40 { my( $doc)= @_;
41 my $root= $doc->getDocumentElement; # so we really work with the root element
42 my $errors=[]; # array ref, holds the error messages
43
44 my $BuyerPartyDetails = first_child( $root, 'BuyerPartyDetails');
45 my $BuyerPartyIdentifier = first_child( $BuyerPartyDetails, 'BuyerPartyIdentifier');
46 my $BuyerOrganisationName = first_child( $BuyerPartyDetails, 'BuyerOrganisationName');
47 check_buyer( text( $BuyerPartyIdentifier),
48 text( $BuyerOrganisationName),
49 $errors
50 );
51 my $InvoiceDetails = first_child( $root, 'InvoiceDetails');
52 my $OrderIdentifier = first_child( $InvoiceDetails, 'OrderIdentifier');
53 check_po( text( $OrderIdentifier), $errors);
54
55 my @rows= children( $root, 'InvoiceRow');
56
57 reset_default_row_id();
58
59 foreach my $row ( @rows)
60 { # this does not cope well with broken row numbers
61
62 my $row_id= text( first_child( $row, 'RowIdentifier')) || default_row_id();
63
64 print "checking row $row_id\n" if $DEBUG;
65
66 my $DeliveredQuantity= first_child( $row, 'DeliveredQuantity');
67 my $OrderedQuantity = first_child( $row, 'OrderedQuantity');
68
69 my $delivered_qty = text( $DeliveredQuantity);
70 my $delivered_unit = $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
71 my $ordered_qty = text( $OrderedQuantity);
72 my $ordered_unit = $OrderedQuantity ? $OrderedQuantity->getAttribute( 'QuantityUnitCode') : '';
73
74 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
75 }
76
77 return $errors;
78 }
79
80 sub store_invoice
81 { my( $doc)= @_;
82 my $root= $doc->getDocumentElement; # so we really work with the root element
83
84 print "storing invoice ", text( first_child( first_child( $root, 'InvoiceDetails'), 'InvoiceNumber')), "\n";
85
86 # build the various data structures
87 my $data;
88
89 my $invoice = first_child( $root, 'InvoiceDetails');
90 $data->{invoice} = { number => text( first_child( $invoice, 'InvoiceNumber')),
91 date => text( first_child( $invoice, 'InvoiceDate')),
92 po => text( first_child( $invoice, 'OrderIdentifier')),
93 amount_no_tax => text( first_child( $invoice, 'InvoiceTotalVatExcludedAmount')),
94 tax => text( first_child( $invoice, 'InvoiceTotalVatAmount')),
95 amount => text( first_child( $invoice, 'InvoiceTotalVatIncludedAmount')),
96 payment_status => text( first_child( first_child( $root, 'PaymentStatusDetails'), 'PaymentStatusCode')),
97 };
98
99 my $seller = first_child( $root, 'SellerPartyDetails');
100 $data->{seller} = { identifier => text( first_child( $seller, 'SellerPartyIdentifier')),
101 name => text( first_child( $seller, 'SellerOrganisationName')),
102 tax_code => text( first_child( $seller, 'SellerOrganisationTaxCode')),
103 };
104
105 my $SellerPartyDetails = first_child( $root, 'SellerPartyDetails');
106 my $address = first_child( $SellerPartyDetails, 'SellerPostalAddressDetails');
107 $data->{address} = { street => text( first_child( $address, 'SellerStreetName')),
108 town => text( first_child( $address, 'SellerTownName')),
109 zip => text( first_child( $address, 'SellerPostCodeIdentifier')),
110 country_code => text( first_child( $address, 'CountryCode')),
111 po_box => text( first_child( $address, 'SellerPostOfficeBoxIdentifier')),
112 };
113
114 my $contact = first_child( $root, 'SellerCommunicationDetails');
115 $data->{contact} = { name => text( first_child( $root, 'SellerContactPersonName')),
116 phone => text( first_child( $contact, 'SellerPhoneNumberIdentifier')),
117 email => text( first_child( $contact, 'SellerEmailaddressIdentifier')),
118 };
119
120 $data->{invoicerow} ||= [];
121 reset_default_row_id();
122
123 foreach my $invoicerow ( children($root, 'InvoiceRow'))
124 { # need to check that the DeliveredQuantity element is present before getting its attribute
125 my $DeliveredQuantity= first_child( $invoicerow, 'DeliveredQuantity');
126 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
127
128 push @{$data->{invoicerow}},
129 { row_id => text( first_child( $invoicerow, 'RowIdentifier')) || default_row_id(),
130 sku => text( first_child( $invoicerow, 'ArticleIdentifier')),
131 name => text( first_child( $invoicerow, 'ArticleName')),
132 qty => text( $DeliveredQuantity),
133 qty_unit => $qty_unit,
134 unit_price => text( first_child( $invoicerow, 'UnitPriceAmount')),
135 amount_no_tax => text( first_child( $invoicerow, 'RowVatExcludedAmount')),
136 tax => text( first_child( $invoicerow, 'RowVatAmount')),
137 amount => text( first_child( $invoicerow, 'RowAmount')),
138 }
139 }
140
141
142 store_all( $data);
143 }
144
145 sub add_errors
146 { my( $doc, $error_messages)= @_;
147 my $root= $doc->getDocumentElement;
148 my $errors= $doc->createElement( 'errors');
149 $root->insertBefore ( $errors, $root->getFirstChild);
150 foreach my $message (@$error_messages)
151 { my $error= $doc->createElement( 'error');
152 $errors->appendChild( $error);
153 # those 2 lines could be replaced by $error->addText( $message)
154 # which is not in the DOM spec
155 my $text= $doc->createTextNode( $message);
156 $error->appendChild( $text);
157 }
158 return $doc;
159 }
160
161 sub output_doc_to_check
162 { my( $file, $doc)= @_;
163 open( FILE, ">$file") or die "cannot create file to check $file: $!";
164 print FILE $doc->toString;
165 close FILE;
166 }
167
168 # this is very important,this ensures that we get the proper child
169 # never use XML::DOM's getFirstChild method directly!
170 sub first_child
171 { my( $node, $tag)= @_;
172 my $child= $node->getFirstChild or return undef;
173 while( $child && ($child->getNodeName ne $tag) )
174 { $child= $child->getNextSibling; }
175 return $child;
176 }
177
178 sub children
179 { my( $node, $tag)= @_;
180 my @children;
181 my $child= first_child( $node, $tag) or return undef;
182 push @children, $child;
183 while( $child= $child->getNextSibling)
184 { push @children, $child if( $child->getNodeName eq $tag); }
185 return @children;
186 }
187
188 # node must include only text (and comments)
189 sub text
190 { my( $node)= @_;
191 unless( $node) { return undef; }
192 my $text='';
193 foreach my $child ($node->getChildNodes)
194 { if( $child->getNodeName eq '#text')
195 { $text.= $child->getData; }
196 }
197 return $text;
198 }
|
XML::EasyOBJ
This code uses XML::EasyOBJ to process the invoices
XML::EasyOBJ is built on top of XML::DOM, and mostly allows for easy navigation by letting you write $invoice= $doc->InvoiceDetails->InvoiceNumber to get the invoice number element, and $invoice->getString to get its value.
The fact that tag names (from the document) are used as method names in the code feels a little weird, and it can make name collision possible (XML::EasyOBJ lets you rename methods though, so you can deal with this problem). It makes it really easy to work with these invoices though, the code was easy to write and
The bad news here is that XML::EasyOBJ mostly helps you accessing the data. If you want to create new elements you have to use the DOM methods.
Overall XML::EasyOBJ feels like a very convenient layer on top of the DOM, which lets you easily navigate and access the data in the DOM (making easy things easy), and lets you use native DOM methods for advanced (making harder things possible).
The complete example is in wtr2_easyobj
1 #!/usr/bin/perl -w
2 use strict;
3
4 use XML::EasyOBJ;
5
6 use FindBin qw($Bin);
7 use lib $Bin;
8 use wtr2_base;
9
10 init_db();
11
12 my $DEBUG=0;
13
14 my $CAN_OUTPUT= 0;
15
16 my @files= @ARGV || (<$dir{invoices}/*.xml>);
17
18 foreach my $file (@files)
19 { my $doc = XML::EasyOBJ->new( $file);
20
21 my $errors= check_invoice( $doc);
22
23 if( !@$errors)
24 { store_invoice( $doc); }
25 else
26 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
27 if( $CAN_OUTPUT)
28 { my $rejected_file= rejected( $file);
29 print "adding errors in $rejected_file\n" if( $DEBUG);
30 my $dom= $doc->getDomObj->getOwnerDocument; # we need to use the DOM for this
31 add_errors( $dom, $errors);
32 output_doc_to_check( $rejected_file, $dom);
33 }
34 };
35 }
36
37 exit;
38
39 sub check_invoice
40 { my( $doc)= @_;
41 my $errors=[]; # array ref, holds the error messages
42
43 my $BuyerPartyDetails = $doc->BuyerPartyDetails;
44 my $BuyerPartyIdentifier = $BuyerPartyDetails->BuyerPartyIdentifier;
45 my $BuyerOrganisationName = $BuyerPartyDetails->BuyerOrganisationName;
46 check_buyer( $BuyerPartyIdentifier->getString,
47 $BuyerOrganisationName->getString,
48 $errors
49 );
50 my $InvoiceDetails = $doc->InvoiceDetails;
51 my $OrderIdentifier = $InvoiceDetails->OrderIdentifier;
52 check_po( $OrderIdentifier->getString, $errors);
53
54 my @rows= $doc->InvoiceRow;
55
56 reset_default_row_id();
57
58 foreach my $row ( @rows)
59 { # this does not cope well with broken row numbers
60
61 my $row_id= $row->RowIdentifier->getString || default_row_id();
62
63 print "checking row $row_id\n" if $DEBUG;
64
65 my $DeliveredQuantity= $row->DeliveredQuantity;
66 my $OrderedQuantity = $row->OrderedQuantity;
67
68 my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ('','','','');
69 if( $DeliveredQuantity)
70 { $delivered_qty = $DeliveredQuantity->getString;
71 $delivered_unit = $DeliveredQuantity->getAttr( 'QuantityUnitCode');
72 }
73 if( $OrderedQuantity)
74 { $ordered_qty = $OrderedQuantity->getString;
75 $ordered_unit = $OrderedQuantity->getAttr( 'QuantityUnitCode');
76 }
77
78 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
79 }
80
81 return $errors;
82 }
83
84 sub store_invoice
85 { my( $doc)= @_;
86
87 print "storing invoice ", $doc->InvoiceDetails->InvoiceNumber->getString, "\n";
88
89 # build the various data structures
90 my $data;
91
92 my $invoice = $doc->InvoiceDetails;
93 $data->{invoice} = { number => $invoice->InvoiceNumber->getString,
94 date => $invoice->InvoiceDate->getString,
95 po => $invoice->OrderIdentifier->getString,
96 amount_no_tax => $invoice->InvoiceTotalVatExcludedAmount->getString,
97 tax => $invoice->InvoiceTotalVatAmount->getString,
98 amount => $invoice->InvoiceTotalVatIncludedAmount->getString,
99 payment_status => $doc->PaymentStatusDetails->PaymentStatusCode->getString,
100 };
101
102 my $seller = $doc->SellerPartyDetails;
103 $data->{seller} = { identifier => $seller->SellerPartyIdentifier->getString,
104 name => $seller->SellerOrganisationName->getString,
105 tax_code => $seller->SellerOrganisationTaxCode->getString,
106 };
107
108 my $address = $doc->SellerPartyDetails->SellerPostalAddressDetails;
109 $data->{address} = { street => $address->SellerStreetName->getString,
110 town => $address->SellerTownName->getString,
111 zip => $address->SellerPostCodeIdentifier->getString,
112 country_code => $address->CountryCode->getString,
113 po_box => $address->SellerPostOfficeBoxIdentifier->getString,
114 };
115
116 my $contact = $doc->SellerCommunicationDetails;
117 $data->{contact} = { name => $doc->SellerContactPersonName->getString,
118 phone => $contact->SellerPhoneNumberIdentifier->getString,
119 email => $contact->SellerEmailaddressIdentifier->getString,
120 };
121
122 $data->{invoicerow} ||= [];
123 reset_default_row_id();
124
125 foreach my $invoicerow ( $doc->InvoiceRow)
126 { # need to check that the DeliveredQuantity element is present before getting its attribute
127 my $DeliveredQuantity= $invoicerow->DeliveredQuantity;
128 my( $qty, $qty_unit)= ('','');
129 if( $DeliveredQuantity)
130 { $qty = $DeliveredQuantity->getString;
131 $qty_unit = $DeliveredQuantity->getAttr( 'QuantityUnitCode');
132 }
133
134 push @{$data->{invoicerow}},
135 { row_id => $invoicerow->RowIdentifier->getString || default_row_id(),
136 sku => $invoicerow->ArticleIdentifier->getString,
137 name => $invoicerow->ArticleName->getString,
138 qty => $qty,
139 qty_unit => $qty_unit,
140 unit_price => $invoicerow->UnitPriceAmount->getString,
141 amount_no_tax => $invoicerow->RowVatExcludedAmount->getString,
142 tax => $invoicerow->RowVatAmount->getString,
143 amount => $invoicerow->RowAmount->getString,
144 }
145 }
146
147
148 store_all( $data);
149 }
150
151 sub add_errors
152 { my( $doc, $error_messages)= @_;
153 my $root= $doc->getDocumentElement;
154 my $errors= $doc->createElement( 'errors');
155 $root->insertBefore ( $errors, $root->getFirstChild);
156 foreach my $message (@$error_messages)
157 { my $error= $doc->createElement( 'error');
158 $errors->appendChild( $error);
159 # those 2 lines could be replaced by $error->addText( $message)
160 # which is not in the DOM spec
161 my $text= $doc->createTextNode( $message);
162 $error->appendChild( $text);
163 }
164 return $doc;
165 }
166
167 sub output_doc_to_check
168 { my( $file, $doc)= @_;
169 open( FILE, ">$file") or die "cannot create file to check $file: $!";
170 print FILE $doc->toString;
171 close FILE;
172 }
|
XML::LibXML
An XML::LibXML version based on the XML::DOM code
This code uses XML::LibXML to process the invoices
It is directly derived from the XML::DOM code in wtr2_dom. It only differs where XML::DOM and XML::LibXML DOM methods or constants have different names:
XML::DOM XML::LibXML
getNodeName nodeName method #text text constant returned by nodeName/getNodeName
This was very easy to write, but does not take advantage of XML::LibXML's best feature: its support for XPath. See wtr2_libxml for a slightly different version, made a lot safer by using XPath queries instead of navigation methods (like getFirstChild) to access the data.
This exemple is here just to show how easy it is to port code from XML::DOM (which IMHO should be deprecated) to XML::LibXML, and then take advantage of XML::LibXML more powerful features.
The complete example is in wtr2_libxml_dom
Here are the parts that differ from wtr2_dom:
...
19 { my $parser = new XML::LibXML;
20 my $doc = $parser->parse_file ( $file);
...
173 while( $child && ($child->nodeName ne $tag) )
...
184 { push @children, $child if( $child->nodeName eq $tag); }
...
194 { if( $child->nodeName eq 'text')
...
|
A "native" XML::LibXML version
This code uses XML::LibXML to process the invoices. It uses a lot of XML::LibXML specific methods instead of using the DOM. This makes for nicer and safer code. XML::LibXML is a Perl wrapper on top of the libxml2 library, an XML/XPath/DOM/RelaxNG/... library written by Daniel Veillard for the Gnome project.
Instead of using navigation methods (getFirstChild) it relies mostly on findnodes and findvalue, which use XPath to select nodes to access.
The error message is built as text and then parsed using parse_xml_chunk. While this might not be the best method I found it really nice to use.
Overall the code was quite easy to write, all the extra goodies provided by XML::LibXML compared to the DOM are really useful and make it much easier to write compact and safe code.
My main gripe with XML::LibXML is that it is often unstable, as the Perl module tries to keep up with the development of the librarie. The problem is that libxml2 is so widely used that I find that it is often upgraded by unrelated software, which can then cause trouble to code that uses XML::LibXML.
The complete example is in wtr2_libxml
1 #!/usr/bin/perl -w
2 use strict;
3
4 use XML::LibXML;
5
6 use FindBin qw($Bin);
7 use lib $Bin;
8 use wtr2_base;
9
10 init_db();
11
12 my $DEBUG=0;
13
14 my $CAN_OUTPUT= 1;
15
16 my @files= @ARGV || (<$dir{invoices}/*.xml>);
17
18 foreach my $file (@files)
19 { my $parser = new XML::LibXML;
20 my $doc = $parser->parse_file ( $file);
21
22 my $errors= check_invoice( $doc);
23
24 if( !@$errors)
25 { store_invoice( $doc); }
26 else
27 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
28 if( $CAN_OUTPUT)
29 { my $rejected_file= rejected( $file);
30 print "adding errors in $rejected_file\n" if( $DEBUG);
31 add_errors( $parser, $doc, $errors); # the parser is passed so it can be re-used
32 output_doc_to_check( $rejected_file, $doc);
33 }
34 };
35 }
36
37 exit;
38
39 sub check_invoice
40 { my( $doc)= @_;
41 my $errors=[]; # array ref, holds the error messages
42
43 check_buyer( $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerPartyIdentifier'),
44 $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerOrganisationName'),
45 $errors
46 );
47 check_po( $doc->findvalue( '/Finvoice/InvoiceDetails/OrderIdentifier'), $errors);
48
49 my @rows= $doc->findnodes( '/Finvoice/InvoiceRow');
50
51 reset_default_row_id();
52
53 foreach my $row ( @rows)
54 { # this does not cope well with broken row numbers
55
56 my $row_id= $row->findvalue( 'RowIdentifier') || default_row_id();
57
58 print "checking row $row_id\n" if $DEBUG;
59
60 my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ( '','','','');
61
62 if( my $DeliveredQuantity= $row->findnodes( 'DeliveredQuantity')->[0])
63 { $delivered_qty = $DeliveredQuantity->textContent;
64 $delivered_unit = $DeliveredQuantity->getAttribute( 'QuantityUnitCode');
65 }
66 if( my $OrderedQuantity= $row->findnodes( 'OrderedQuantity')->[0])
67 { $ordered_qty = $OrderedQuantity->textContent;
68 $ordered_unit = $OrderedQuantity->getAttribute( 'QuantityUnitCode');
69 }
70
71 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
72 }
73
74 return $errors;
75 }
76
77 sub store_invoice
78 { my( $doc)= @_;
79
80 my $invoice_number= $doc->findvalue( '/Finvoice/InvoiceDetails/InvoiceNumber');
81 print "storing invoice $invoice_number\n";
82
83 # build the various data structures
84 my $data;
85
86 my $invoice = $doc->findnodes( '/Finvoice/InvoiceDetails')->[0];
87 $data->{invoice} = { number => $invoice->findvalue( 'InvoiceNumber'),
88 date => $invoice->findvalue( 'InvoiceDate'),
89 po => $invoice->findvalue( 'OrderIdentifier'),
90 amount_no_tax => $invoice->findvalue( 'InvoiceTotalVatExcludedAmount'),
91 tax => $invoice->findvalue( 'InvoiceTotalVatAmount'),
92 amount => $invoice->findvalue( 'InvoiceTotalVatIncludedAmount'),
93 payment_status => $doc->findvalue( '/Finvoice/PaymentStatusDetails/PaymentStatusCode'),
94 };
95
96 my $seller = $doc->findnodes( '/Finvoice/SellerPartyDetails')->[0];
97 $data->{seller} = { identifier => $seller->findvalue( 'SellerPartyIdentifier'),
98 name => $seller->findvalue( 'SellerOrganisationName'),
99 tax_code => $seller->findvalue( 'SellerOrganisationTaxCode'),
100 };
101
102 my $address = $doc->findnodes( '/Finvoice/SellerPartyDetails/SellerPostalAddressDetails')->[0];
103 $data->{address} = { street => $address->findvalue( 'SellerStreetName'),
104 town => $address->findvalue( 'SellerTownName'),
105 zip => $address->findvalue( 'SellerPostCodeIdentifier'),
106 country_code => $address->findvalue( 'CountryCode'),
107 po_box => $address->findvalue( 'SellerPostOfficeBoxIdentifier'),
108 };
109
110 my $contact = $doc->findnodes( '/Finvoice/SellerCommunicationDetails')->[0];
111 $data->{contact} = { name => $doc->findvalue( '/Finvoice/SellerContactPersonName'),
112 phone => $contact->findvalue( 'SellerPhoneNumberIdentifier'),
113 email => $contact->findvalue( 'SellerEmailaddressIdentifier'),
114 };
115
116 $data->{invoicerow} ||= [];
117 reset_default_row_id();
118
119 foreach my $invoicerow ( $doc->findnodes( '/Finvoice/InvoiceRow'))
120 { # need to check that the DeliveredQuantity element is present before getting its attribute
121 my $DeliveredQuantity= $invoicerow->getChildrenByTagName( 'DeliveredQuantity')->[0];
122 my $qty = $DeliveredQuantity ? $DeliveredQuantity->textContent : '';
123 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
124
125 push @{$data->{invoicerow}},
126 { row_id => $invoicerow->findvalue( 'RowIdentifier') || default_row_id(),
127 sku => $invoicerow->findvalue( 'ArticleIdentifier'),
128 name => $invoicerow->findvalue( 'ArticleName'),
129 qty => $qty,
130 qty_unit => $qty_unit,
131 unit_price => $invoicerow->findvalue( 'UnitPriceAmount'),
132 amount_no_tax => $invoicerow->findvalue( 'RowVatExcludedAmount'),
133 tax => $invoicerow->findvalue( 'RowVatAmount'),
134 amount => $invoicerow->findvalue( 'RowAmount'),
135 }
136 }
137
138 store_all( $data);
139 }
140
141 sub add_errors
142 { my( $parser, $doc, $error_messages)= @_;
143 my $root= $doc->documentElement();
144 # here I chose to build the error messages as text and then parse them
145 my $chunk= "\n <errors>\n " . join( "\n ", map { "<error>$_</error>" } @$error_messages) . "\n </errors>";
146 my $errors= $parser->parse_xml_chunk( $chunk );
147 $root->insertBefore ( $errors, $root->getFirstChild);
148 return $doc;
149 }
150
151 sub output_doc_to_check
152 { my( $file, $doc)= @_;
153 open( FILE, ">$file") or die "cannot create file to check $file: $!";
154 print FILE $doc->toString;
155 close FILE;
156 }
|
XML::XPath version
This code uses XML::XPath. Accessing the data, the check_invoice and store_invoice functions, is very similar to the code using XML::LibXML (not surprisingly as the 2 modules were both written by Matt Sergeant ;--) The only difference are the way to create the document object and the name of the method used to get the text of an element (string_value instead of textContent).
Creating the errors element proved a little more challenging.
The complete example is in wtr2_xpath
1 #!/usr/bin/perl -w
2 use strict;
3
4 use XML::XPath;
5 use XML::XPath::XMLParser;
6
7
8 use FindBin qw($Bin);
9 use lib $Bin;
10 use wtr2_base;
11
12 init_db();
13
14 my $DEBUG=0;
15
16 my $CAN_OUTPUT= 1;
17
18 my @files= @ARGV || (<$dir{invoices}/*.xml>);
19
20 my $doc; # note that putting the my _in_ the loop causes an Out of Memory
21 # error after just a few documents
22
23 foreach my $file (@files)
24 { $doc= XML::XPath->new( filename => $file);
25
26 my $errors= check_invoice( $doc);
27
28 if( !@$errors)
29 { store_invoice( $doc); }
30 else
31 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
32 if( $CAN_OUTPUT)
33 { my $rejected_file= rejected( $file);
34 print "adding errors in $rejected_file\n" if( $DEBUG);
35 add_errors( $doc, $errors);
36 output_doc_to_check( $rejected_file, $doc);
37 }
38 };
39 }
40
41 exit;
42
43 sub check_invoice
44 { my( $doc)= @_;
45 my $errors=[]; # array ref, holds the error messages
46
47 check_buyer( $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerPartyIdentifier'),
48 $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerOrganisationName'),
49 $errors
50 );
51 check_po( $doc->findvalue( '/Finvoice/InvoiceDetails/OrderIdentifier'), $errors);
52
53 my @rows= $doc->findnodes( '/Finvoice/InvoiceRow');
54
55 reset_default_row_id();
56
57 foreach my $row ( @rows)
58 { # this does not cope well with broken row numbers
59
60 my $row_id= $row->findvalue( 'RowIdentifier') || default_row_id();
61
62 print "checking row $row_id\n" if $DEBUG;
63
64 my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ( '','','','');
65
66 if( my $DeliveredQuantity= $row->findnodes( 'DeliveredQuantity')->[0])
67 { $delivered_qty = $DeliveredQuantity->string_value;
68 $delivered_unit = $DeliveredQuantity->getAttribute( 'QuantityUnitCode');
69 }
70 if( my $OrderedQuantity= $row->findnodes( 'OrderedQuantity')->[0])
71 { $ordered_qty = $OrderedQuantity->string_value;
72 $ordered_unit = $OrderedQuantity->getAttribute( 'QuantityUnitCode');
73 }
74
75 check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
76 }
77
78 return $errors;
79 }
80
81 sub store_invoice
82 { my( $doc)= @_;
83
84 my $invoice_number= $doc->findvalue( '/Finvoice/InvoiceDetails/InvoiceNumber');
85 print "storing invoice $invoice_number\n";
86
87 # build the various data structures
88 my $data;
89
90 my $invoice = $doc->findnodes( '/Finvoice/InvoiceDetails')->[0];
91 $data->{invoice} = { number => $invoice->findvalue( 'InvoiceNumber'),
92 date => $invoice->findvalue( 'InvoiceDate'),
93 po => $invoice->findvalue( 'OrderIdentifier'),
94 amount_no_tax => $invoice->findvalue( 'InvoiceTotalVatExcludedAmount'),
95 tax => $invoice->findvalue( 'InvoiceTotalVatAmount'),
96 amount => $invoice->findvalue( 'InvoiceTotalVatIncludedAmount'),
97 payment_status => $doc->findvalue( '/Finvoice/PaymentStatusDetails/PaymentStatusCode'),
98 };
99
100 my $seller = $doc->findnodes( '/Finvoice/SellerPartyDetails')->[0];
101 $data->{seller} = { identifier => $seller->findvalue( 'SellerPartyIdentifier'),
102 name => $seller->findvalue( 'SellerOrganisationName'),
103 tax_code => $seller->findvalue( 'SellerOrganisationTaxCode'),
104 };
105
106 my $address = $doc->findnodes( '/Finvoice/SellerPartyDetails/SellerPostalAddressDetails')->[0];
107 $data->{address} = { street => $address->findvalue( 'SellerStreetName'),
108 town => $address->findvalue( 'SellerTownName'),
109 zip => $address->findvalue( 'SellerPostCodeIdentifier'),
110 country_code => $address->findvalue( 'CountryCode'),
111 po_box => $address->findvalue( 'SellerPostOfficeBoxIdentifier'),
112 };
113
114 my $contact = $doc->findnodes( '/Finvoice/SellerCommunicationDetails')->[0];
115 $data->{contact} = { name => $doc->findvalue( '/Finvoice/SellerContactPersonName'),
116 phone => $contact->findvalue( 'SellerPhoneNumberIdentifier'),
117 email => $contact->findvalue( 'SellerEmailaddressIdentifier'),
118 };
119
120 $data->{invoicerow} ||= [];
121 reset_default_row_id();
122
123 foreach my $invoicerow ( $doc->findnodes( '/Finvoice/InvoiceRow'))
124 { # need to check that the DeliveredQuantity element is present before getting its attribute
125 my $DeliveredQuantity= $invoicerow->findnodes( 'DeliveredQuantity')->[0];
126 my $qty = $DeliveredQuantity ? $DeliveredQuantity->string_value : '';
127 my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
128
129 push @{$data->{invoicerow}},
130 { row_id => $invoicerow->findvalue( 'RowIdentifier') || default_row_id(),
131 sku => $invoicerow->findvalue( 'ArticleIdentifier'),
132 name => $invoicerow->findvalue( 'ArticleName'),
133 qty => $qty,
134 qty_unit => $qty_unit,
135 unit_price => $invoicerow->findvalue( 'UnitPriceAmount'),
136 amount_no_tax => $invoicerow->findvalue( 'RowVatExcludedAmount'),
137 tax => $invoicerow->findvalue( 'RowVatAmount'),
138 amount => $invoicerow->findvalue( 'RowAmount'),
139 }
140 }
141
142 store_all( $data);
143 }
144
145 sub add_errors
146 { my( $doc, $error_messages)= @_;
147 my $errors= $doc->createNode( '/Finvoice/errors');
148 foreach my $message (@$error_messages)
149 { my $error= XML::XPath::Node::Element->new( 'error');
150 $errors->appendChild( $error);
151 $doc->setNodeText( '/Finvoice/errors/error[last()]' => $message, );
152 }
153 return $doc;
154 }
155
156 sub output_doc_to_check
157 { my( $file, $doc)= @_;
158 open( FILE, ">$file") or die "cannot create file to check $file: $!";
159 print FILE $doc->findnodes( '/')->[0]->toString;
160 close FILE;
161 }
|
XML::PYX
An XML::PYX version based on the XML::Simple code
This example uses XML::PYX in a kinda devious way: it reimplements a simplified version of XML::Simple's XMLin that will work for the class of documents we are processing. The code is then exactly the same as the one for wtr2_simple.
Note that this version of XMLin does not offer any option besides a limited version of forcearray. It will not for exemple use keyattr.
I guess this was pretty easy to write as it worked the first time I run it, much to my surprise I should say ;--)
Generally It illustrates the fact that it often pays to write a layer on top of generic modules to adapt them to your specific needs (please, just don't release it on CPAN! Chances are that it is really specific to your problem or to the way you like to code, and it will only add to the already considerable confusion in the XML namespace).
The complete example is in wtr2_pyx_simple
Here are the call to XMLin and the XMLin sub
...
17 { # XMLin just re-implements a simplified version of XML::Simple's XMLin
18 # that will work for this specific class of document
19 my $xml= XMLin( $file, forcearray => [ qw(InvoiceRow)]);
...
126 sub XMLin
127 { my( $file, %options)=@_;
128
129 # get the 'forcearray' elements in a convenient hash
130 my %forcearray= map { $_ => 1} @{$options{forcearray}};
131
132 my @current; # the stack of hashes or array in the data structure
133 # $current[0] is the overall structure (the root of the document),
134 # $current[1] is the first_level child currently open
135 # ...
136 # $current[-1] is the current element
137 open( IN, "pyx $file | ") or die "cannot open pyx $file: $!";
138 while( <IN>)
139 { if( m{^\((.*)$}) # open element
140 { my $tag= $1;
141 my $current={}; # will be filled later;
142 if( @current)
143 { # normal (non-root) element
144 if( $forcearray{$1}) { $current[-1]->{$1} ||= []; push @{$current[-1]->{$1}}, $current; }
145 else { $current[-1]->{$1} = $current; }
146 }
147 push @current, $current;
148 }
149 elsif( m{^\)(.*)$}) # close element (except when only the root is left)
150 { pop @current unless( @current == 1); }
151 elsif( m{^-\\n$}) # empty line, skip
152 { }
153 elsif( m{^-(.*)$}) # content, assign to the current element content
154 { $current[-1]->{content}= $1; }
155 elsif( m{^A(\w*) (.*)$}) # attribute, assign to a field in the current element
156 { $current[-1]->{$1}= $2; }
157 }
158
159 # note that in case there is an error during the parsing it will show when
160 # closing the file, so you need to check the result of close
161 close IN or die "error processing pyx $file: $!";
162
163 return $current[0];
164 }
...
|
A more advanced XML::PYX version, still based on the XML::Simple code
The previous code might seem naive (and useless), but it can easily be improved and thus get extra credits:
This example uses XML::PYX and the XMLin layer on top of it, but stores the initial PYX flow so it can later update it with the error messages.
The code is very similar to wtr2_pyx_simple, but the PYX flow if first stored in $pyx, before being passed to XMLin. If errors are found, $pyx is used to add the errors (add_errors looks for the end of the root (Finvoice) start tag, builds the PYX flow for the errors element and insert it after the root. The pyx flow is then output using pyxw which writes it back as XML.
Creating the PYX for the error element is not really elegant, but it is not that much of a pain either. A simple layer a-la-XML::Writer could make it even easier, but wasn't deemed necessary here.
The complete example is in wtr2_pyx_simple_plus
Here are the interesing bits:
... 19 my $pyx= pyx_in( $file); 20 my $xml= XMLin( $pyx, forcearray => [ qw(InvoiceRow)]); ... 30 add_errors( $pyx, $errors); 31 output_doc_to_check( $rejected_file, $pyx); ... 127 sub add_errors 128 { my( $pyx, $error_messages)= @_; 129 # first get the root element 130 my $root_index=0; 131 while($_=$pyx->[$root_index]) 132 { if( m{^\(Finvoice$}) 133 { # found the root element 134 # now skip attributes 135 while( $pyx->[$root_index+1]=~ m{^A}) { $root_index++; } 136 last; 137 } 138 $root_index++; 139 } 140 141 my $lf= "-\\n\n"; # a line feed in pyx 142 # build the error messages 143 my $messages= $lf 144 . "- \n(errors\n" 145 . join( "", map { "$lf- \n(error\n-$_\n)error\n" } @$error_messages) 146 . $lf 147 . "- \n)errors\n"; 148 149 # now insert the messages 150 splice( @$pyx, $root_index+1, 0, $messages); 151 152 return $pyx; 153 } 154 155 sub output_doc_to_check 156 { my( $file, $pyx)= @_; 157 open( FILE, "| pyxw > $file") or die "cannot create file to check $file: $!"; 158 print FILE @$pyx; 159 close FILE; 160 } 161 162 163 sub pyx_in 164 { my( $file)= @_; 165 open( IN, "pyx $file | ") or die "cannot open pyx $file: $!"; 166 my @pyx= <IN>; 167 # note that in case there is an error during the parsing it will show when 168 # closing the file, so you need to check the result of close 169 close IN or die "error processing pyx $file: $!"; 170 return \@pyx; 171 } 172 ... |
XML::SAX::Base
This code uses SAX to extract the data from the invoices. It parses the invoice and extract the relevant data into a Perl data structure that is then used to check the invoice and update the data base.
The first problem to solve when using SAX is that the content of elements can be broken in different calls to the characters handler. So I needed to buffer the content. Luckily enough, Robin Berjon's XML::Filter::BufferText does just that!
So I used a SAX machine (using SAX::Machines) to pipe the 2 handlers, first XML::Filter::BufferText, then my own handler: wtr2_handler. Note that SAX::Machines takes care of
wtr2_handler extracts all the information needed to check the invoice, then store it in the data base. The resulting data (returned by the end_document handler) is then used by check_invoice and store_invoice.
As this is something that is likely to be quite common and as there are few SAX modules that do this, I decided to go generic: I created a small language to describe how to extract the data and store it in my custom data structure.
The idea is to give an element name (no namespaces are used in this DTD, so there is no need to get fancy) and associate an action to it. Actions can be associated with the start of an element or with its content. At the start of an element it is possible to store attributes or to create new sub-records, for repeatable data in the document, such as InvoiceRow
The content of an element can be stored, either as top-level data, for non-repeatable data, or in a sub-record, for repeatable data.
The easiest way I found to parse these actions was to use Getopt::Long Overall this is slightly overkill for this problem, but could be re-used in other cases, so I thought it would be worth it to show it here.
In order to know in which element the parser is from the characters handler I used a stack of element names: the start_element handler pushes the current element name on the stack and the end_element handler pops it. This is the only way to get access to the parent name, needed for the --parent option.
Overall the code was quite a pain to write, especially as the default parser, XML::LibXML::SAX::Parser had a problem during my tests, as once again I had upgraded libxml2 but not the Perl module. The hardest part was designing a way to express what I wanted to extract from the XML document and how to store it, without resorting with one of those long lists of ifs that I find make code such a pain to maintain.
The complete example is in wtr2_sax_base
1 #!/usr/bin/perl -w
2 use strict;
3
4 #use diagnostics;
5
6 use XML::Filter::BufferText; # to buffer all character events
7 use XML::SAX::Machines qw(:all); # to pipe the 2 SAX handlers
8
9 use FindBin qw($Bin);
10 use lib $Bin;
11 use wtr2_base;
12
13 my $DEBUG=0;
14
15 init_db();
16
17 my $CAN_OUTPUT= 0;
18
19 my @files= @ARGV || (<$dir{invoices}/*.xml>);
20
21 foreach my $file (@files)
22 { my $handler= wtr2_handler->new();
23
24 my $pipeline = Pipeline( XML::Filter::BufferText->new(),
25 $handler,
26 );
27 my $data = $pipeline->parse_uri ( $file);
28
29 my $errors= check_invoice( $data);
30
31 if( !@$errors)
32 { store_invoice( $data); }
33 else
34 { print "ERROR in $file\n ", join( "\n ", @$errors), "\n";
35 if( $CAN_OUTPUT)
36 { my $rejected_file= rejected( $file);
37 print "adding errors in $rejected_file\n" if( $DEBUG);
38 add_errors( $pipeline, $data, $errors); # the parser is passed so it can be re-used
39 output_doc_to_check( $rejected_file, $data);
40 }
41 };
42 }
43
44 sub check_invoice
45 { my( $data)= @_;
46 my $errors=[]; # array ref, holds the error messages
47
48 check_buyer( $data->{BuyerPartyIdentifier},
49 $data->{BuyerOrganisationName},
50 $errors
51 );
52 check_po( $data->{OrderIdentifier}, $errors);
53
54 my @rows= @{$data->{InvoiceRow}};
55
56 reset_default_row_id();
57
58 foreach my $row( @rows)
59 { # this does not cope well with broken row numbers
60
61 my $row_id= $row->{RowIdentifier} || default_row_id();
62
63 print "checking row $row_id\n" if $DEBUG;
64
65 check_qtty( $row_id,
66 $row->{DeliveredQuantity},
67 $row->{DeliveredQuantityUC},
68 $row->{OrderedQuantity},
69 $row->{OrderedQuantityUC},
70 $errors
71 );
72 }
73
74 return $errors;
75 }
76
77
78 sub store_invoice
79 { my( $xml_data)= @_;
80 print "storing invoice $xml_data->{InvoiceNumber}\n";
81
82 # build the various data structures
83 my $data;
84
85 $data->{invoice} = { number => $xml_data->{InvoiceNumber},
86 date => $xml_data->{InvoiceDate},
87 po => $xml_data->{OrderIdentifier},
88 amount_no_tax => $xml_data->{InvoiceTotalVatExcludedAmount},
89 tax => $xml_data->{InvoiceTotalVatAmount},
90 amount => $xml_data->{InvoiceTotalVatIncludedAmount},
91 payment_status => $xml_data->{PaymentStatusCode},
92 };
93
94 $data->{seller} = { identifier => $xml_data->{SellerPartyIdentifier},
95 name => $xml_data->{SellerOrganisationName},
96 tax_code => $xml_data->{SellerOrganisationTaxCode},
97 };
98
99 $data->{address} = { street => $xml_data->{SellerStreetName},
100 town => $xml_data->{SellerTownName},
101 zip => $xml_data->{SellerPostCodeIdentifier},
102 country_code => $xml_data->{CountryCode},
103 po_box => $xml_data->{SellerPostOfficeBoxIdentifier},
104 };
105
106 $data->{contact} = { name => $xml_data->{SellerContactPersonName},
107 phone => $xml_data->{SellerPhoneNumberIdentifier},
108 email => $xml_data->{SellerEmailaddressIdentifier},
109 };
110
111 $data->{invoicerow} ||= [];
112 reset_default_row_id();
113
114 foreach my $invoicerow (@{$xml_data->{InvoiceRow}})
115 { push @{$data->{invoicerow}},
116 { row_id => $invoicerow->{RowIdentifier} || default_row_id(),
117 sku => $invoicerow->{ArticleIdentifier},
118 name => $invoicerow->{ArticleName},
119 qty => $invoicerow->{DeliveredQuantity},
120 qty_unit => $invoicerow->{DeliveredQuantityUC},
121 unit_price => $invoicerow->{UnitPriceAmount},
122 amount_no_tax => $invoicerow->{RowVatExcludedAmount},
123 tax => $invoicerow->{RowVatAmount},
124 amount => $invoicerow->{RowAmount},
125 }
126 }
127
128 store_all( $data);
129 }
130
131
132
133 package wtr2_handler;
134
135 use base qw(XML::SAX::Base);
136
137 use Getopt::Long; # to process actions associated with SAX events
138
139 # all those could be stored as part of the parser object
140 # but IMHO this implies potential name collision
141
142 my( $content, $start);
143
144 BEGIN {
145
146 # declare which element content and attributes we want to store
147 # note that this works fine because elements are not re-used, the element
148 # name always gives enough information to figure out what to do with it
149
150 # note that the little languages used in the action part depends on the fact
151 # that - is not a valid characters at the start of an XML identifier
152
153 # actions for start_element handler
154 #
155 # actions format is:
156 # actions : action (';' action)* # only one used here
157 # action : command options+
158 # command : ('--create' | '--store_att' <attribute name>)
159 # options : option+
160 # option : --in <sub_record> # sub_record must be have been defined
161 # --as <field_name> # name of the field in the hash
162 # # (defaults to the attribute name)
163 # --parent <name> # trigger only when in proper parent
164 # --create only supports the -as option (not used)
165
166 $start = { # elements that create a new sub_record for repeated content
167 InvoiceRow => '--create',
168 # attributes to be stored
169 # format is #store_att
170 DeliveredQuantity => '--store_att QuantityUnitCode --in InvoiceRow --as DeliveredQuantityUC',
171 OrderedQuantity => '--store_att QuantityUnitCode --in InvoiceRow --as OrderedQuantityUC',
172 };
173
174
175 # actions for characters handler
176 # actions format is:
177 # actions : action (';' action)* # only one used here
178 # action : command options+
179 # command : ('--store')
180 # options : option+
181 # option : --in <sub_record> # sub_record must be have been defined
182 # --as <field_name> # not used here
183 # --parent <name> # trigger only when in proper parent
184
185 $content= { # content that needs to be stored once per invoice
186 BuyerPartyIdentifier => '--store',
187 BuyerOrganisationName => '--store',
188 InvoiceNumber => '--store',
189 InvoiceDate => '--store',
190 OrderIdentifier => '--store',
191 InvoiceTotalVatExcludedAmount => '--store',
192 InvoiceTotalVatAmount => '--store',
193 InvoiceTotalVatIncludedAmount => '--store',
194 PaymentStatusCode => '--store',
195 SellerPartyIdentifier => '--store',
196 SellerOrganisationName => '--store',
197 SellerOrganisationTaxCode => '--store',
198 SellerStreetName => '--store',
199 SellerTownName => '--store',
200 SellerPostCodeIdentifier => '--store',
201 CountryCode => '--store --parent SellerPostalAddressDetails',
202 SellerPostOfficeBoxIdentifier => '--store',
203 SellerContactPersonName => '--store',
204 SellerPhoneNumberIdentifier => '--store',
205 SellerEmailaddressIdentifier => '--store',
206 # repeated content
207 RowIdentifier => '--store --in InvoiceRow',
208 ArticleIdentifier => '--store --in InvoiceRow',
209 ArticleName => '--store --in InvoiceRow',
210 DeliveredQuantity => '--store --in InvoiceRow',
211 OrderedQuantity => '--store --in InvoiceRow',
212 UnitPriceAmount => '--store --in InvoiceRow',
213 RowVatExcludedAmount => '--store --in InvoiceRow',
214 RowVatAmount => '--store --in InvoiceRow',
215 RowAmount => '--store --in InvoiceRow',
216 };
217
218 }
219
220 # very complex new! it's a hash so we can add the state data needed
221 sub new
222 { my $class = shift;
223 my $self= bless {}, $class;
224 }
225
226 # reset data for each invoice
227 sub start_document
228 { $_[0]->{data} = {}; # stored data
229 $_[0]->{context} = []; # element name stack
230 }
231
232 #
233 sub start_element
234 { my( $p, $elt)= @_;
235
236 my $name= $elt->{Name};
237
238 # store the context
239 push @{$p->{context}}, $name;
240
241 # process actions for the relevant elements
242 if( my $actions= $start->{$name})
243 { my @actions= split /;/, $actions; # actions are ; separated (not used here)
244 foreach my $action (@actions)
245 { # use Getopt::Long to parse the action
246 local @ARGV= split /\s+/, $action;
247 my %options;
248 GetOptions( \%options, "create", "store_att=s", "in=s", "as=s", "parent=s");
249
250 if( $options{parent} and ($p->{context}->[-2] ne $options{parent})) { next; }
251
252 # process each type of action
253 if( $options{create})
254 { # create a new sub_record
255 if( $options{store_att}) { die "can't use --store_att and --create in $name => '$action'\n"; }
256 if( $options{in}) { die "can't use --in with --create in $name => '$action'\n"; }
257 $options{as} ||= $name; # the sub_record name defaults to the element name
258 $p->{data}->{$name} ||= [];
259 push @{$p->{data}->{$name}}, {};
260 }
261 elsif( my $att= $options{store_att})
262 { # store an attribute
263 my $att_clarkian= "{}".$att; # attributes are indexed using the clarkian notation
264 my $value= $elt->{Attributes}->{$att_clarkian}->{Value};
265 store( $p->{data}, $options{in}, $options{as} || $name, $value);
266 }
267 else
268 { die "no valid start action found in $name => '$action'\n"; }
269 }
270 }
271 }
272
273
274 sub characters
275 { my( $p, $characters)= @_;
276 my $name= $p->{context}->[-1];
277 if( my $actions= $content->{$name})
278 { my @actions= split /;/, $actions;
279 foreach my $action (@actions)
280 { local @ARGV= split /\s+/, $action;
281 my %options;
282 GetOptions( \%options, "store", "in=s", "as=s", "parent=s");
283 if( $options{parent} and ($p->{context}->[-2] ne $options{parent})) { next; }
284 if( $options{store})
285 { store( $p->{data}, $options{in}, $options{as} || $name, $characters->{Data}); }
286 else
287 { die "no valid contentt action found in $name => '$action'\n"; }
288 }
289 }
290 }
291
292 sub end_element { pop @{$_[0]->{context}}; }
293
294 sub end_document { return $_[0]->{data}; }
295
296 sub store
297 { my( $data, $in, $as, $value)= @_;
298 if( my $sub_record= $in)
299 { # create it in a sub_record
300 $data->{$sub_record}->[-1]->{$as}= $value;
301 }
302 else
303 { # create at top-level
304 $data->{$as}= $value;
305 }
306 }
307
308 1;
|
Conclusions
Overall the code was quite easy to write with most modules.
I was surprised by how much common code I could re-use from one example to the other, due to modules using standard API (DOM/SAX) and to a lot of modules being written by the same author (as with XML::XPath / XML::LibXML) (not to mention the author of the article blatantly cheating and stealing method names, if not code, from other modules to add them to XML::Twig ;--).
Modules that simply slurp the XML into a Perl data structure were the easiest to use, sometimes at the cost of making it harder to output a modified document. XML::Smart looked good here, but it is a very recent module and will need to be tested some more before I can recommend it. Tree-based modules were a little harder to use (why can't we call methods on undef! that would save us quite a few test), but allowed for more control over modified XML output. Streaming is not really adapted to this problem, and requires a the bigger layer on top of the basic modules, so I would not really advise using it here.
Benchmark
All modules handled the load pretty well, except for XML::XPath which gives seemingly random Out of memory! errors.
Tested on 206 invoices
| Module | Version | Timing (benchmark) | Factor |
|---|---|---|---|
| XML::Simple | 2.08 | 7.43 | 100 |
| XML::Smart | 1.3.1 | 9.71 | 130 |
| XML::Twig | 3.10 | 8.74 | 117 |
| XML::Twig (alt) | � | 10.96 | 147 |
| XML::DOM | 1.42 | 8.97 | 120 |
| XML::EasyOBJ | 1.12 | 8.35 | 112 |
| XML::LibXML | 1.55 | 3.72 | 50 |
| XML::LibXML (alt) | � | 4.53 | 60 |
| XML::XPath | 1.13 | 24.86 | 334 |
| XML::PYX | 0.07 | 21.41 | 288 |
| XML::PYX (alt) | � | 22.53 | 303 |
| XML::SAX::Base | 1.04 | 11.66 | 156 |
benchmark environment: perl 5.008 on linux - Thu Aug 21 17:11:39 2003
Running the examples
Requirements
You will need to install the following modules: DBI, DBD::SQLite, Memoize, plus of course the various XML modules required by each example.
Files:
A tar file with everything you need
- all files (.tar.gz): everything you need to run wtr2, stored in the proper directory. run_tests init will run all the tests, provided you have all the modules installed.
Finvoice files
See Finvoice: Technical files to download the complete package.
- Finvoice.dtd: the Finvoice DTD,
- Finvoice.xsl: the Finvoice XSLT stylesheet,
- example invoice 1: named basic_invoice.xml in Finvoice technical package,
- example invoice 2: named all_details.xml in Finvoice technical package
- example invoice 3
The data base definition and creation file
- invoice_db.def: used by the various examples to reset the DB
A common package
- wtr2_base.pm: contains common functions used by the various examples, see the html doc or the text doc
The various examples
- wtr2_simple: XML::Simple version (doc),
- wtr2_smart: XML::SMART version, very similar to the XML::Simple one, but more powerful (it outputs the document with the errors) (doc)
- wtr2_twig_simple: XML::Twig version derived from the XML::Simple one (doc)
- wtr2_twig: XML::Twig full-tree version (doc),
- wtr2_dom: XML::DOM version, (doc),
- wtr2_easyobj: XML::EasyOBJ version, (doc),
- wtr2_libxml_dom: XML::LibXML version using strictly the same code as the XML::DOM example. (doc),
- wtr2_libxml: XML::LibXML version using XML::LibXML specific methods (doc),
- wtr2_xpath: XML::XPath version (doc),
- wtr2_pyx_simple: XML::PYX version built using the code for XML::Simple, and re-implementing XMLin on top of PYX (doc, run with perl wtr2_pyx_simple invoice_?.xml)
- wtr2_sax_base: a basic SAX example (doc),
Templates for writing examples
- wtr2_template_tree: a template convenient for tree-oriented examples (doc),