Perl for XML Processing
This page covers how to process XML properly using Perl and recommended modules from CPAN (the Comprehensive Perl Archive Network).
Table of Contents
- Technologies of Interest
- Web Pages about Perl and XML
- What to Avoid
- Modules for Dealing with Specific Grammars
Technologies of Interest
XML-LibXML
XML-LibXML is the de facto standard for XML processing in Perl. It is a comprehensive CPAN module built on the libxml2 library that provides a DOM (Document Object Model) interface, SAX (a stream parser), a pull parser, and XPath support; XSLT support is available through the companion XML-LibXSLT distribution. XML-LibXML has good reference documentation and is actively maintained. The Perl XML::LibXML by Example site provides a tutorial suitable for beginners.
Before using this library, make sure you understand XML namespaces and how they interact with the DOM and the XML-LibXML API. A common surprise is that an unprefixed name in an XPath expression does not match elements that belong to a default namespace.
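As a minimal sketch of the namespace issue, the following parses a small document and queries it with and without a registered prefix. The namespace URI, the element names, and the `lib` prefix are made up for this illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# A small document whose elements live in a default namespace.
# The namespace URI and element names are hypothetical.
my $xml = <<'EOF';
<?xml version="1.0" encoding="UTF-8"?>
<library xmlns="http://example.com/ns/library">
  <book id="b1"><title>Programming Perl</title></book>
  <book id="b2"><title>Learning Perl</title></book>
</library>
EOF

my $doc = XML::LibXML->load_xml( string => $xml );

# An unprefixed name test only matches elements in no namespace,
# so this finds nothing even though <book> elements are present.
my @unqualified = $doc->findnodes('//book');

# Bind a prefix of our own choosing to the namespace URI and use it in the query.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( lib => 'http://example.com/ns/library' );
my @titles = map { $_->textContent } $xpc->findnodes('//lib:book/lib:title');

print scalar(@unqualified), " unqualified matches\n";
print "$_\n" for @titles;
```

The prefix used in the query does not have to match any prefix in the document; only the namespace URI matters.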
XPath
XPath is an XML-related technology (though not itself written in XML syntax) for locating nodes in XML documents using a compact syntax. It is available through XML::LibXML; avoid the old, slow, and largely unmaintained XML-XPath CPAN distribution.
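To give a feel for the syntax, here is a short sketch that runs a few XPath expressions through XML::LibXML. The inventory document and its attribute names are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data for the queries below.
my $doc = XML::LibXML->load_xml( string => <<'EOF' );
<inventory>
  <item sku="a1" price="3.50">Pencil</item>
  <item sku="b2" price="12.00">Notebook</item>
  <item sku="c3" price="1.25">Eraser</item>
</inventory>
EOF

# findnodes() returns the nodes selected by a path with a predicate.
my @cheap = map { $_->textContent } $doc->findnodes('/inventory/item[@price < 5]');

# findvalue() evaluates an expression and returns its string value.
my $count = $doc->findvalue('count(//item)');
my $name  = $doc->findvalue('//item[@sku="b2"]');

print "Cheap items: @cheap\n";
print "Item count: $count\n";
print "SKU b2 is: $name\n";
```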
Another useful module is HTML-Selector-XPath, which converts CSS-style selectors to XPath and provides functionality similar to that of JavaScript libraries such as jQuery. For example, you can write selector_to_xpath('ul.myclass a') to find all a elements inside a ul element with a CSS class of myclass.
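The selector mentioned above could be used roughly as follows. This is a sketch that assumes HTML-Selector-XPath is installed; the markup is invented and is parsed as well-formed XML without a namespace to keep the example small.

```perl
use strict;
use warnings;
use XML::LibXML;
use HTML::Selector::XPath qw(selector_to_xpath);

# Hypothetical, well-formed markup with two lists.
my $doc = XML::LibXML->load_xml( string => <<'EOF' );
<html><body>
<ul class="myclass"><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>
<ul class="other"><li><a href="/three">Three</a></li></ul>
</body></html>
EOF

# Convert the CSS selector into an XPath expression, then run it with XML::LibXML.
my $xpath = selector_to_xpath('ul.myclass a');
my @hrefs = map { $_->getAttribute('href') } $doc->findnodes($xpath);

print "XPath: $xpath\n";
print "$_\n" for @hrefs;
```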
To learn about XPath, consult the following resources:
Custom XPath Functions
XML::LibXML allows the programmer to register custom XPath functions, written in Perl, to extend what XPath expressions can do. For more information, see XML::LibXML::XPathContext.
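As an illustration, XPath 1.0 has no built-in function for lower-casing a string, so one can be supplied from Perl. The function name `lc` and the sample data below are choices made for this sketch, not part of any standard.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data with inconsistent capitalisation.
my $doc = XML::LibXML->load_xml( string => <<'EOF' );
<items>
  <item name="Apple"/>
  <item name="banana"/>
  <item name="APRICOT"/>
</items>
EOF

my $xpc = XML::LibXML::XPathContext->new($doc);

# Register a Perl callback under the XPath function name "lc".
# The argument is stringified first, because it may arrive as an object
# rather than a plain scalar.
$xpc->registerFunction( 'lc', sub {
    my ($value) = @_;
    return lc("$value");
} );

# Case-insensitive prefix match using the custom function.
my @names = map { $_->getAttribute('name') }
    $xpc->findnodes('//item[starts-with(lc(string(@name)), "ap")]');

print "$_\n" for @names;
```

Passing `string(@name)` rather than the bare attribute node keeps the callback simple, since it then only has to deal with a single string value.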
XSLT
XSLT stands for Extensible Stylesheet Language Transformations and is a language for transforming XML documents into other XML documents or into other formats such as HTML or plain text. Perl has good support for version 1.0 of XSLT through the XML-LibXSLT distribution.
(Please avoid using XML-XSLT, which is old and largely unmaintained. Use XML-LibXSLT instead.)
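A minimal sketch of a transformation with XML-LibXSLT might look like this. The source document and the stylesheet are invented for the example; the stylesheet turns each greeting element into a line of plain text.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical source document.
my $source = XML::LibXML->load_xml( string => <<'EOF' );
<greetings>
  <greeting>Hello</greeting>
  <greeting>World</greeting>
</greetings>
EOF

# An XSLT 1.0 stylesheet that emits one line of text per <greeting>.
my $style_doc = XML::LibXML->load_xml( string => <<'EOF' );
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/greetings">
    <xsl:for-each select="greeting">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# Compile the stylesheet once; it can then be applied to many documents.
my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet($style_doc);

# transform() returns a result document; serialise it with the stylesheet
# so that the xsl:output settings are honoured.
my $result = $stylesheet->transform($source);
my $text   = $stylesheet->output_as_chars($result);

print $text;
```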
For more about XSLT, see the following links. Note that XSLT makes extensive use of XPath, so you should learn XPath first.
- The Zvon XSLT Tutorial - with many examples.
- The Zvon XSLT Reference - provides a useful reference.
- Interactive XSLT tester - it seems that the best results are achieved with the Mozilla Firefox browser, because Google Chromium/Google Chrome and Opera do not handle XML namespaces with XSLT well.
Web Pages about Perl and XML
The Perl XML Project Home Page
Their Frequently Asked Questions List (FAQ)
What to Avoid
XML-Simple
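Despite its name, XML-Simple is not simple to use correctly, and it takes the wrong approach to dealing with XML: it converts documents into nested Perl data structures whose shape depends on both the input and the parser options. Its own documentation strongly discourages its use in new code. Please avoid using it, and look at XML-LibXML for an easy and fast alternative.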
Parsing XML Using Regular Expressions
You should also avoid parsing XML with regular expressions. XML's grammar is not regular, and constructs such as nested elements, comments, CDATA sections, and entity references are difficult to handle correctly with them. Instead, use a parser. For more information see:
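A small illustration of the problem, as a sketch with invented sample data: a naive pattern picks up a commented-out element and misses the real ones, while a parser returns the decoded text of each actual title element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document containing a comment, a CDATA section,
# attributes split across lines, and an entity reference.
my $xml = <<'EOF';
<doc>
  <!-- <title>Old draft title</title> -->
  <title lang="en"><![CDATA[Tom & Jerry]]></title>
  <title
     lang="en">Ben &amp; Jerry</title>
</doc>
EOF

# Naive regular expression: it matches the commented-out element
# and misses both real ones because they carry attributes.
my @by_regex = $xml =~ m{<title>(.*?)</title>}g;

# A parser skips the comment, unwraps CDATA, and decodes entities.
my $doc       = XML::LibXML->load_xml( string => $xml );
my @by_parser = map { $_->textContent } $doc->findnodes('/doc/title');

print "Regex found:  $_\n" for @by_regex;
print "Parser found: $_\n" for @by_parser;
```

The pattern could be patched for each of these cases, but every patch adds complexity and still leaves other legal XML constructs unhandled.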
Modules for Dealing with Specific Grammars
In addition to generic XML parsers and manipulators, there are many specialised modules on CPAN for dealing with specific XML grammars. Many of them reside under the XML:: namespace. Some prominent examples include:
- XML-RSS - manipulate RSS (Really Simple Syndication) 0.9, 0.91, 1.0 and 2.0.
- XML-Atom - manipulate Atom feeds. (Atom is an alternative syndication format.)
- XML-Feed - generate, parse, mix and match web feeds (Atom or RSS).
- OpenOffice-OODoc - manipulate OpenOffice.org-like ODF (OpenDocument format) files.
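As an example of what such a grammar-specific module offers, the following sketch uses XML-RSS to build a small RSS 2.0 feed and read it back. It assumes XML-RSS is installed; the feed title, URLs, and item are placeholders.

```perl
use strict;
use warnings;
use XML::RSS;

# Build a feed. All titles and URLs here are placeholders.
my $rss = XML::RSS->new( version => '2.0' );
$rss->channel(
    title       => 'Example Feed',
    link        => 'http://example.com/',
    description => 'A feed generated for illustration',
);
$rss->add_item(
    title       => 'First post',
    link        => 'http://example.com/posts/1',
    description => 'The first entry in the feed',
);
my $feed_xml = $rss->as_string;

# Parse the generated feed again and list its items.
my $parsed = XML::RSS->new;
$parsed->parse($feed_xml);
my @titles = map { $_->{title} } @{ $parsed->{items} };

print "$_\n" for @titles;
```

The module hides the differences between RSS versions, so the calling code works with channels and items rather than with raw elements.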