Text Parsing in Perl
Introduction
Perl has a well-deserved reputation as a good language for parsing text, and its name is even commonly expanded as "Practical Extraction and Report Language". However, many beginners are tempted to use regular expressions exclusively, even for parsing the most complex texts (a la "If all you have is a hammer, everything starts to look like a nail."). This should be avoided, and here we give some more options.
With What to Parse Stuff?
If you're going to parse HTML, don't use regular expressions, and instead look at Perl HTML-parsing modules (also see an older link). The canonical modules for that are HTML-Parser, which has built-in support for handling many of the irregularities of HTML in the wild, and XML-LibXML's HTML support. Those should generally not be used directly. Instead look at one of their abstractions:
HTML-TreeBuilder-LibXML - HTML::TreeBuilder and XPath compatible interface using libxml.
HTML::TreeBuilder (and other modules in HTML::Tree).
HTML-TokeParser-Simple - an event-based pull parser that is useful for very large HTML documents.
Another useful module is HTML-Selector-XPath, which converts CSS-style selectors to XPath and provides functionality similar to that offered by JavaScript libraries such as jQuery. So you can, for example, write
selector_to_xpath('ul.myclass a')
to find all a elements inside a ul element with a CSS class of myclass.
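As a rough sketch of how the generated XPath might be used, it can be handed to an XPath-aware tree builder. The example below assumes that HTML::TreeBuilder::XPath is installed (it is not one of the modules listed above; HTML-TreeBuilder-LibXML advertises a compatible interface), and the HTML snippet is made up for illustration.

```perl
use strict;
use warnings;

use HTML::Selector::XPath qw(selector_to_xpath);
use HTML::TreeBuilder::XPath;    # assumption: installed from CPAN

# Made-up sample document for illustration.
my $html = <<'END_HTML';
<html><body>
<ul class="myclass">
<li><a href="/one">One</a></li>
<li><a href="/two">Two</a></li>
</ul>
<ul><li><a href="/other">Other</a></li></ul>
</body></html>
END_HTML

# Convert the CSS selector to an XPath expression.
my $xpath = selector_to_xpath('ul.myclass a');

# Build the tree and collect the href attribute of every matching link.
my $tree  = HTML::TreeBuilder::XPath->new_from_content($html);
my @hrefs = map { $_->attr('href') } $tree->findnodes($xpath);
$tree->delete;    # HTML::Tree objects should be freed explicitly

print "$_\n" for @hrefs;
```

Only the links inside the list carrying the myclass class should be reported; the link in the second list is skipped.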
In order to parse XML, look at our dedicated page about XML processing.
Comma-separated values (CSV) files should be parsed using Text-CSV_XS, a fast, tried-and-tested module that handles most of the edge cases and irregularities found in CSV files in the wild.
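To illustrate why a dedicated module helps, here is a minimal sketch that parses a single made-up record containing an embedded comma and doubled quotes, both of which would trip up a naive split on commas. It assumes Text::CSV_XS is installed from CPAN.

```perl
use strict;
use warnings;

use Text::CSV_XS;

# binary => 1 allows arbitrary bytes inside quoted fields.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create CSV parser: " . Text::CSV_XS->error_diag;

# A made-up record with an embedded comma and doubled quotes.
my $line = '42,"Smith, John","He said ""hi"""';

$csv->parse($line)
    or die "Could not parse line: " . $csv->error_diag;
my @fields = $csv->fields;

# Three fields: 42 / Smith, John / He said "hi"
print "$_\n" for @fields;
```

For whole files, the module's getline method reads one record at a time from a filehandle and also copes with fields that span several lines.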
JSON should be parsed using JSON-MaybeXS, or possibly using an event-based, incremental, JSON parser.
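A minimal sketch of decoding a JSON document with JSON-MaybeXS follows. The JSON text is made up for illustration, and the example assumes the module is installed from CPAN.

```perl
use strict;
use warnings;

use JSON::MaybeXS qw(decode_json);

# Made-up JSON text for illustration.
my $json_text = '{"name":"Sophie","languages":["Perl","C"],"active":true}';

# decode_json expects UTF-8 encoded bytes and returns a Perl data structure:
# JSON objects become hash references and JSON arrays become array references.
my $data = decode_json($json_text);

print "Name: $data->{name}\n";
print "Languages: @{ $data->{languages} }\n";
print "Active\n" if $data->{active};    # JSON true is true in boolean context
```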
In order to parse URLs/URIs (Uniform Resource Locators/Identifiers), one should use the “URI” collection of CPAN modules.
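As a small sketch, the URI module can split an address into its components without any hand-written regular expressions. The address below is made up, and the example assumes URI is installed from CPAN.

```perl
use strict;
use warnings;

use URI;

# Made-up address for illustration.
my $uri = URI->new('http://www.example.com:8080/docs/page.html?lang=perl&q=parse#top');

print "Scheme:   ", $uri->scheme,   "\n";
print "Host:     ", $uri->host,     "\n";
print "Port:     ", $uri->port,     "\n";
print "Path:     ", $uri->path,     "\n";
print "Fragment: ", $uri->fragment, "\n";

# query_form returns the query string as a list of key/value pairs.
my %params = $uri->query_form;
print "$_ = $params{$_}\n" for sort keys %params;
```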
In order to parse and analyse file and directory paths, you should use the modules recommended in our "Files and Directories" page.
Advanced Parsing Techniques
Parser Generators
Many grammars, such as those of most programming languages, involve constructs like balanced brackets or operator precedence. For these grammars, which are called context-free, regular expressions will not be enough, and you may opt to use a parser generator. Some notable parser generators in Perl include:
Regexp-Grammars - a more modern version of Parse-RecDescent by the same author that only works on perl-5.10.x and above.
Parser-MGC - allows one to build simple recursive-descent parsers by using methods and closures.
Marpa-XS - a parser generator that aims to fully parse all context-free grammars. See also Marpa-PP for its pure-Perl and slower version.
Parse-Yapp - old and unmaintained, but may still be good enough.
What a parser generator does is generate a parser for your language that can then yield an "abstract syntax tree (AST)" that will allow you to process valid texts of this language as a human would understand them.
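To make the idea of an AST concrete, here is a small hand-written recursive-descent sketch for arithmetic expressions. It does not use any of the modules listed above; it only imitates, by hand, the kind of parser a generator would produce. The grammar, the sub names and the node layout are all made up for illustration, and the \G and /gc constructs it relies on are covered in the next section.

```perl
use strict;
use warnings;

# Hypothetical grammar:
#   expr   := term   ( ('+' | '-') term   )*
#   term   := factor ( ('*' | '/') factor )*
#   factor := NUMBER | '(' expr ')'
# An AST node is either a plain number or [ $operator, $left, $right ].
# Every sub takes a reference to the input and advances its pos().

sub parse_expr {
    my ($src) = @_;
    my $node = parse_term($src);
    while ($$src =~ m{\G\s*([-+])}gc) {
        my $op = $1;    # save before recursing, since $1 will change
        $node = [ $op, $node, parse_term($src) ];
    }
    return $node;
}

sub parse_term {
    my ($src) = @_;
    my $node = parse_factor($src);
    while ($$src =~ m{\G\s*([*/])}gc) {
        my $op = $1;
        $node = [ $op, $node, parse_factor($src) ];
    }
    return $node;
}

sub parse_factor {
    my ($src) = @_;
    if ($$src =~ m{\G\s*(\d+)}gc) {
        return $1 + 0;
    }
    if ($$src =~ m{\G\s*\(}gc) {
        my $node = parse_expr($src);
        $$src =~ m{\G\s*\)}gc
            or die "Expected ')' at offset " . (pos($$src) // 0) . "\n";
        return $node;
    }
    die "Unexpected input at offset " . (pos($$src) // 0) . "\n";
}

# Walk the tree recursively to compute the value of the expression.
sub evaluate {
    my ($node) = @_;
    return $node unless ref $node;
    my ($op, $left, $right) = @$node;
    my ($x, $y) = (evaluate($left), evaluate($right));
    return $op eq '+' ? $x + $y
         : $op eq '-' ? $x - $y
         : $op eq '*' ? $x * $y
         :              $x / $y;
}

my $input = '2 * (3 + 4) - 5';
my $ast   = parse_expr(\$input);
$input =~ m{\G\s*\z}gc
    or die "Trailing input at offset " . pos($input) . "\n";

print evaluate($ast), "\n";    # prints 9
```

The tree built for this input is [ '-', [ '*', 2, [ '+', 3, 4 ] ], 5 ]: the brackets and the precedence of '*' over '-' are encoded in its shape, which is something a single regular expression cannot capture.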
Incremental Extraction in Regular Expressions Using \G and /g
Sometimes, you'll find that writing everything in one regular expression would be very hard and you'd like to parse a string incrementally, step by step. For that, Perl offers the pos() function, which gets or sets the position in a string where the last /g match ended. One can make good use of it with the \G regular expression escape, which anchors a match at the current pos(), the /g modifier, which advances pos() after each successful match, and the /c modifier, which keeps pos() from being reset when a match fails.
Here's an example:
use strict;
use warnings;

# String with names inside square brackets
my $string = "Hello [Peter] , [Sophie] and [Jack] are here.";

pos($string) = 0;
while ($string =~ m{\G.*?\[([^\]]+)\]}cg)
{
    my $name = $1;
    print "Found name $name .\n";
}
This example is a bit contrived, but should be illustrative enough.
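A slightly less contrived use of the same tools is a small tokenizer, where several patterns are tried in turn at the current position. This is where /c earns its keep: a failed alternative must not reset pos(), or the next alternative would start again from the beginning of the string. The token names and the input below are made up for illustration.

```perl
use strict;
use warnings;

my $input = 'total = price * 3 + tax';
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    # Try each token pattern at the current position. Thanks to /c,
    # a failed attempt leaves pos() untouched for the next one.
    if ($input =~ m{\G\s+}gc) {
        next;    # skip whitespace
    }
    elsif ($input =~ m{\G([A-Za-z_]\w*)}gc) {
        push @tokens, [ IDENT => $1 ];
    }
    elsif ($input =~ m{\G(\d+)}gc) {
        push @tokens, [ NUMBER => $1 ];
    }
    elsif ($input =~ m{\G([-+*/=])}gc) {
        push @tokens, [ OP => $1 ];
    }
    else {
        die "Unexpected character at offset " . pos($input) . "\n";
    }
}

# Prints: IDENT(total) OP(=) IDENT(price) OP(*) NUMBER(3) OP(+) IDENT(tax)
print join(' ', map { sprintf '%s(%s)', @$_ } @tokens), "\n";
```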