Perl XML FAQ 1.1

by Jonathan Eisenzopf

Credits

Thanks to Clark Cooper, Matthew Sergeant, Enno Derksen, Ken MacLeod, Rob Cameron, and Asakura Hiroshi for their contributions to this FAQ.

Overview

This FAQ contains information related to using and manipulating XML with Perl. Please direct all corrections and additions to [email protected]. This FAQ can be found on the Web at http://www.perlxml.com/faq/perl-xml-faq.html. Asakura Hiroshi has created a Japanese translation of this FAQ which is available at http://db-www.aist-nara.ac.jp/xml/perl-xml-faq-j.html. Information in this FAQ is primarily based on discussions and information transmitted to the Perl XML email list. To join, send an email to [email protected] with the message: SUBSCRIBE Perl-XML.

This FAQ was generated using a small Perl script and an XML file. The script can be slurped from http://www.perlxml.com/faq/xmlfaq.pl. The XML source is located at http://www.perlxml.com/faq/perl-xml-faq.xml. To generate the Perl XML FAQ, run perl xmlfaq.pl perl-xml-faq.xml which prints the HTML to STDOUT.

Q1: What is XML anyway?

The eXtensible Markup Language, or XML, is a simplified subset of SGML developed by the World Wide Web Consortium. Unlike the limited tag set offered by HTML, XML allows authors to define tags based upon the logical structure of their documents. A good introduction to XML is available at http://www.xml.com/xml/pub/98/10/guide0.html. Additional information about XML is available from the World Wide Web Consortium at http://www.w3.org/XML.

Q2: Is there an XML parser for Perl?

Yes, there are several, but the most popular one is the XML::Parser module. It was originally developed by Larry Wall and is now maintained by Clark Cooper. The module is a Perl wrapper around Expat, a non-validating parser written in C by James Clark. The module can be found on any CPAN server or on Clark's home page at http://www.netheaven.com/~coopercc/xmlparser/intro.html. The distribution includes Expat, so you don't have to worry about installing it separately. More information on Expat is available at http://www.jclark.com/xml/expat.html. Clark Cooper has also written a nice intro to XML::Parser which is available at http://www.xml.com/xml/pub/98/09/xml-perl.html.

In some cases, you may want to use regular expressions to manipulate XML. REX, written by Rob Cameron, is a fairly complete shallow parser written in Perl. Information on REX can be found at http://www.cs.sfu.ca/~cameron/REX.html.

Q3: What version of Perl do I need to use XML::Parser?

You should have Perl 5.004 or greater installed. If you require UTF-8 encoding support, you will need version 5.005_52 or greater. A good resource for Perl in general is http://www.perl.com. To get the latest Perl distribution, visit http://www.perl.com/CPAN and select the closest CPAN mirror site.

Q4: Can I use XML::Parser to validate XML against a DTD?

No. XML::Parser is built on top of Expat, a non-validating parser, so a DOCTYPE declaration will be checked for syntax but will not be used to validate the XML. However, Earl Hood has written perlSGML, which is a collection of Perl scripts and libraries for parsing and manipulating SGML. It includes the ability to process DTDs. Check http://www.oac.uci.edu/indiv/ehood/perlSGML.html for more information. You may also try the SGMLspm module from David Megginson, available on CPAN at http://www.perl.com/CPAN/authors/David_Megginson, which is a library for parsing the output from James Clark's NSGMLS parser. You can find the SP toolkit, which includes NSGMLS, at http://www.jclark.com/sp/index.htm.

Q5: How do I install XML::Parser for Win32?

XML::Parser is available in ActiveState's package repository. To install it, type 'ppm install XML-Parser' on the command line. If ActiveState's version is out of date, Matt Sergeant usually offers the newest version of the XML::Parser module. To install his version, type 'ppm install --location=http://www.fastnetltd.ndirect.co.uk/Perl/packages XML-Parser'.

Q6: Is XML::Parser object oriented?

In short, yes. XML::Parser is a factory object that creates instances of XML::Parser::Expat as needed.
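
As a rough illustration of the object-oriented interface, the sketch below creates one parser object, registers a Start handler, and counts the elements in a document. The count_elements sub and the sample document are made up for this example; only XML::Parser->new, the Handlers option, and parse() come from the module itself.

```perl
use strict;
use warnings;
use XML::Parser;

# Count start tags in an XML string. The sub name and the sample
# document below are illustrative only.
sub count_elements {
    my ($xml) = @_;
    my %count;

    # XML::Parser->new returns a reusable parser object. Each parse()
    # call works through an XML::Parser::Expat instance, which is
    # passed to every handler as its first argument.
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element, %attrs) = @_;
                $count{$element}++;
            },
        },
    );
    $parser->parse($xml);
    return \%count;
}

my $counts = count_elements('<faq><q/><q/><a/></faq>');
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Because the Expat instance is created per parse, the same $parser object can be reused for any number of documents.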

Q7: Is XML::Parser based on the SAX API?

Nope. XML::Parser is based on Expat, a non-validating parser written in C by James Clark. However, Eric Prud'hommeaux has developed a preliminary implementation, W3C::SAX::XmlParser, which can be found at http://www.w3.org/1999/02/26-modules/. Note that this implementation is not final since Ken MacLeod is currently working on a standardized Perl-SAX interface.

Q8: Is there a DOM module for Perl?

Yes. Enno Derksen, in collaboration with Clark Cooper, wrote a DOM module for Perl that's implemented as a subclass of XML::Parser. It complies with the DOM Level 1 specification. You will need XML::Parser version 2.16 or greater installed for it to work. The module is available at your local CPAN archive, or at http://www.erols.com/enno/dom/. There is also a DOM module available at http://www2.ann.ne.jp/~kojun/TmpL1/ that uses its own XML parser.

Q9: What about XSL?

Because XSL is still in flux, there hasn't been a huge amount of interest in creating an XSL engine in Perl. So, if you want to see one, make your voice heard on the Perl-XML mailing list.

Q10: Where are the XML modules on CPAN?

If you're having problems finding XML::Parser or other modules on a CPAN archive, try the following URL: http://www.perl.com/CPAN-local/modules/by-category/11_String_Lang_Text_Proc/XML/. A list of all Perl/XML modules with descriptions is available at http://www.perlxml.com/modules/perl-xml-modules.html.

Q11: What is the difference between XML::DOM and XML::Grove? Which one should I use?

XML::DOM is an implementation of the World Wide Web Consortium's (W3C) Document Object Model (DOM) Level 1. DOM defines an interface for working with an XML tree, and XML::DOM is an implementation of DOM that works with an in-memory tree of XML nodes. All DOM implementations (in Perl or other languages) use a similar interface, and code written using one DOM implementation should work with other DOM implementations. This portability allows you to pick a DOM implementation that has the features you need (memory usage, implementation language, database, etc.).

XML::Grove uses Perl hashes and arrays to store XML objects, allowing you to use regular Perl hash and array functions to work with the tree of XML nodes. XML::Grove is based on "property sets" as described in the International Organization for Standardization (ISO) HyTime and DSSSL standards.

Using XML::DOM will allow you to more easily port your code from or to other languages or use other DOM modules (XML::DOM is the only implementation currently available to be used from Perl). XML::Grove has a simpler, more Perlish interface. Briefly reading the XML::DOM and XML::Grove pod documentation may help you choose which module to use. Many modules work with both DOM and groves, but you should check the module documentation for compatibility issues.

Q12: Why does XML::Parser always croak when I use an XML declaration at the top of an XML file?

The XML declaration in your XML file is probably incorrect. Contrary to what some people seem to think, XML is case sensitive, and that includes the XML declaration.

The declaration must be lowercase and contain the version number (also lower case). It should look like this:
<?xml version='1.0'?>

You may optionally specify the character encoding and declare whether the document is standalone:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>

NOTE: You can use single or double quotes for attribute values in the XML declaration.
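
To see the case sensitivity for yourself, a small test along these lines may help. It is only a sketch: parses_ok is a helper invented for this example, and the two declarations differ only in the case of "xml".

```perl
use strict;
use warnings;
use XML::Parser;

# Return 1 if the parser accepts the document, 0 if it dies.
# parses_ok is a hypothetical helper, not part of XML::Parser.
sub parses_ok {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

# Same document, once with a lowercase and once with an uppercase
# declaration.
for my $decl ("<?xml version='1.0'?>", "<?XML version='1.0'?>") {
    my $verdict = parses_ok("$decl<doc/>") ? 'accepted' : 'rejected';
    print "$decl $verdict\n";
}
```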

Q13: When using XML::Parser I get: Can't find method "read" in module FileHandle

This error usually occurs when using XML::Parser in conjunction with DBI. This is not a bug in XML::Parser or DBI, but a bug in Perl itself. You should upgrade to DBI version 1.05 or greater or simply load the FileHandle module like: use FileHandle;

Q14: How can I catch errors returned from XML::Parser without exiting the code?

Normally, the XML::Parser module dies as soon as it finds malformed XML. This is, in fact, the way XML parsers should behave. There are cases, however, where you may want to handle the error without exiting the program. In these cases, you can enclose the call to the parse() or parsefile() method in an eval block like:
eval { $p->parse($xml) };
or like:
eval { $p->parsefile($filename) };

If an error occurs, eval puts the error message into the $@ variable. Below is a short script that parses an XML file. It encloses the parsefile() call in an eval block and then prints the error message if an error occurred.

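use strict;
use warnings;
use XML::Parser;

# Require the name of an existing file as the first argument.
die "Usage: catch_error.pl <file>\n" unless $ARGV[0] && -e $ARGV[0];

my $p = XML::Parser->new();
# parsefile() dies on malformed XML; eval traps the message in $@.
eval { $p->parsefile($ARGV[0]) };
print "Caught error: $@\n" if $@;
print "Done.\n";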

Q15: XML::Parser seems to be converting my text to UTF-8. Is there a way to maintain the original encoding?

Yes, the original_string method, which is available in version 2.19 or later, returns strings in their original encoding. The only drawback is that it will disable entity expansion. Also, you cannot use this method if you are using the XML::Parser::ExpatNB object, which was added in version 2.22.
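
The difference can be seen by collecting character data both ways in a Char handler. This is a minimal sketch, assuming an ISO-8859-1 sample document; text_both_ways and the sample text are made up for the example, while original_string is the method described above.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect character data twice: as passed to the handler (UTF-8, with
# entities expanded) and as returned by original_string (bytes as they
# appear in the document, with entities left unexpanded).
# text_both_ways is a hypothetical helper name.
sub text_both_ways {
    my ($xml) = @_;
    my ($decoded, $original) = ('', '');
    my $parser = XML::Parser->new(
        Handlers => {
            Char => sub {
                my ($xp, $data) = @_;
                $decoded  .= $data;
                $original .= $xp->original_string;
            },
        },
    );
    $parser->parse($xml);
    return ($decoded, $original);
}

# Sample document in ISO-8859-1; \xE9 is e-acute in that encoding.
my $xml = "<?xml version='1.0' encoding='ISO-8859-1'?>"
        . "<p>caf\xE9 &amp; tea</p>";
my ($decoded, $original) = text_both_ways($xml);
print "entity left unexpanded: ",
      ($original =~ /&amp;/ ? "yes" : "no"), "\n";
```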

Q16: Is it possible to read in several documents from a stream?

You can read multiple documents from a stream by using the parse_start method in place of parse or parsefile. Each call to parse_start creates a new instance of XML::Parser::ExpatNB. Feed a document to that object by making successive calls to the parse_more method, then call the parse_done method to signal that you are done processing the document.
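
A sketch of that calling sequence follows. It assumes the caller already knows where one document ends and the next begins, and it simulates the stream with lists of string chunks; parse_chunked_documents and the sample data are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Parse several documents, each supplied as a list of chunks, as if
# the chunks had arrived one at a time from a socket or pipe.
# Returns the name of each document's root element.
sub parse_chunked_documents {
    my (@documents) = @_;    # each item: array ref of string chunks
    my @roots;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                # depth is 0 while the root start tag is handled
                push @roots, $element if $expat->depth == 0;
            },
        },
    );

    for my $chunks (@documents) {
        my $nb = $parser->parse_start;       # new XML::Parser::ExpatNB
        $nb->parse_more($_) for @$chunks;    # feed data as it arrives
        $nb->parse_done;                     # end of this document
    }
    return @roots;
}

my @roots = parse_chunked_documents(
    [ '<a><x', '/></a>' ],
    [ '<b>te', 'xt</b>' ],
);
print "root elements: @roots\n";
```

Chunks may split the document anywhere, even in the middle of a tag, since the parser buffers incomplete input until the next parse_more call.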

Q17: How can I filter out extraneous whitespace whilst processing text?

You can filter out the whitespace in your text handler:

sub text {
   my ($xp, $data) = @_;

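   # %ignorable_whitespace lists elements whose whitespace-only text
   # can be dropped. Testing for "no non-whitespace character" avoids
   # /m, which would also match a chunk that merely has a blank line.
   return if ($ignorable_whitespace{$xp->current_element}
              and $data !~ /\S/);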
   # Rest of processing
   ...
}
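
For context, here is one way the handler might be wired into a complete script. This is a sketch under assumptions: the %ignorable_whitespace table, the element names in it, and the collect_text helper are all made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical table: elements that contain only child elements, so
# whitespace directly inside them is just indentation.
my %ignorable_whitespace = map { $_ => 1 } qw(list);

# Return the text chunks of a document, minus ignorable whitespace.
sub collect_text {
    my ($xml) = @_;
    my @text;
    my $parser = XML::Parser->new(
        Handlers => {
            Char => sub {
                my ($xp, $data) = @_;
                # Skip whitespace-only chunks inside listed elements
                return if $ignorable_whitespace{ $xp->current_element }
                      and $data !~ /\S/;
                push @text, $data;
            },
        },
    );
    $parser->parse($xml);
    return @text;
}

my @text = collect_text(
    "<list>\n  <item>one</item>\n  <item> </item>\n</list>"
);
print scalar(@text), " text chunk(s) kept\n";
```

Note that the single space inside the second item element is kept, because item is not listed in the table.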

Q18: I get a 'duplicate attribute' error message from XML::Parser 2.20 when the parser sees an element with multiple attributes.

This is a bug in version 2.20 of the XML::Parser module. Try upgrading to a newer version.

Q19: I'm getting strange errors from XML::Parser even though my XML is well-formed.

If you're using the Perl distribution that came with Red Hat Linux 5.2, you will want to upgrade to a newer version of Perl. Red Hat accidentally included a buggy version in their 5.2 Linux distribution.

Q20: Is there a module available for parsing RDF?

Yes, Eric Prud'hommeaux has developed W3C::Rdf::RdfParser, which relies on the Perl implementation of SAX mentioned in question #7. It's available at http://www.w3.org/1999/02/26-modules/.

Q21: How can I speed up the performance of the XML::Parser module when processing documents on a Web server?

For starters, if you are using Apache, you should probably install mod_perl, available at http://perl.apache.org. This will eliminate the time it normally takes to load the Perl interpreter and any modules you are using. If you still need more speed, you might consider using Data::Dumper or Storable to dump the data structure that XML::Parser builds (for example, the tree returned by its Tree style) to disk. Loading the dump eliminates the time required to re-parse an XML document that has not changed.
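
One possible shape for such a cache is sketched below, using Storable and the Tree style. The load_tree helper and the file names in the usage note are hypothetical, and the sketch assumes that comparing file modification times is good enough to detect a changed document.

```perl
use strict;
use warnings;
use XML::Parser;
use Storable qw(store retrieve);

# Return the parse tree for $xml_file, reusing a Storable dump in
# $cache_file when the dump is at least as new as the XML source.
# load_tree is a hypothetical helper, not part of XML::Parser.
sub load_tree {
    my ($xml_file, $cache_file) = @_;

    # Index 9 of stat() is the last-modified time.
    if (-e $cache_file
        and (stat $cache_file)[9] >= (stat $xml_file)[9]) {
        return retrieve($cache_file);
    }

    # Cache missing or stale: parse again and refresh the dump.
    my $tree = XML::Parser->new(Style => 'Tree')->parsefile($xml_file);
    store($tree, $cache_file);
    return $tree;
}
```

A call such as load_tree('perl-xml-faq.xml', 'perl-xml-faq.cache') would parse the file the first time and read the dump on later calls. The first item of the returned array reference is the name of the root element.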