Text Parsing in Perl

Introduction

Perl has a well-deserved reputation as a good language for parsing text, and its name is commonly expanded as "Practical Extraction and Report Language". However, many beginners are tempted to use regular expressions exclusively, even for parsing the most complex texts (à la "If all you have is a hammer, everything starts to look like a nail."). This should be avoided, and here we give some more options.

With What to Parse Stuff?

Advanced Parsing Techniques

Parser Generators

Many grammars, such as those of most programming languages, involve constructs like balanced brackets or operator precedence. Such grammars describe context-free languages, for which regular expressions alone will usually not be enough, and you may opt to use a parser generator. Some notable parser generators in Perl include:

  1. Parse-RecDescent

  2. Regexp-Grammars - a more modern version of Parse-RecDescent by the same author that only works on perl-5.10.x and above.

  3. Parser-MGC - allows one to build simple recursive-descent parsers by using methods and closures.

  4. Marpa-XS - a parser generator that aims to fully parse all context-free grammars. See also Marpa-PP for its pure-Perl and slower version.

  5. Parse-Yapp - old and has long been unmaintained, but may still be good enough.

A parser generator generates a parser for your language. That parser can then yield an "abstract syntax tree" (AST), which lets you process valid texts of the language according to their structure, as a human would understand them.
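To make the idea of an AST more concrete, here is a minimal hand-written sketch of the kind of recursive-descent parser that such tools produce or help you write. It does not use any of the modules listed above. The tiny grammar (numbers, "+", "*" and parentheses) and all the subroutine names are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical grammar for this sketch:
#   expr   := term   ( '+' term   )*
#   term   := factor ( '*' factor )*
#   factor := NUMBER | '(' expr ')'

# Parse $text and return an AST made of nested array references:
#   ['num', 3], ['+', $left, $right] or ['*', $left, $right].
sub parse_expr_string
{
    my ($text) = @_;

    die "Unexpected character in '$text'\n" if $text =~ m{[^0-9+*()\s]};

    # Split the input into number, operator and parenthesis tokens.
    my @tokens = $text =~ m{(\d+|[+*()])}g;

    my $ast = _parse_expr(\@tokens);
    die "Trailing tokens: @tokens\n" if @tokens;

    return $ast;
}

# "+" binds more loosely than "*", so it is handled at the outer level.
sub _parse_expr
{
    my ($tokens) = @_;

    my $node = _parse_term($tokens);
    while (@$tokens and $tokens->[0] eq '+')
    {
        shift @$tokens;
        $node = ['+', $node, _parse_term($tokens)];
    }
    return $node;
}

sub _parse_term
{
    my ($tokens) = @_;

    my $node = _parse_factor($tokens);
    while (@$tokens and $tokens->[0] eq '*')
    {
        shift @$tokens;
        $node = ['*', $node, _parse_factor($tokens)];
    }
    return $node;
}

# Parentheses recurse back into _parse_expr, which is what lets the
# parser handle arbitrarily deep balanced brackets.
sub _parse_factor
{
    my ($tokens) = @_;

    my $token = shift @$tokens;
    die "Unexpected end of input\n" if !defined $token;

    if ($token eq '(')
    {
        my $node  = _parse_expr($tokens);
        my $close = shift @$tokens;
        die "Expected ')'\n" if !defined($close) or $close ne ')';
        return $node;
    }

    die "Expected a number but got '$token'\n" if $token !~ m{\A\d+\z};
    return ['num', $token];
}

# Walk the AST and compute its value.
sub evaluate
{
    my ($node) = @_;

    my ($type, $left, $right) = @$node;
    return $left if $type eq 'num';
    return $type eq '+'
        ? evaluate($left) + evaluate($right)
        : evaluate($left) * evaluate($right);
}

print evaluate(parse_expr_string("2 + 3 * (4 + 1)")), "\n";    # prints 17
```

Once the text has been turned into a tree, operator precedence and nesting are already resolved, so later code such as evaluate() only needs to walk the tree. With a parser generator you would normally describe the grammar declaratively instead of writing these subroutines by hand.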

Incremental Extraction in Regular Expressions Using \G and /g

Sometimes you'll find that writing everything as one regular expression would be very hard, and you'd like to parse a string incrementally, step by step. For that, Perl offers the pos() function, which gets or sets the position where the last /g match on a string left off. You can make good use of it with the \G regular expression escape, which anchors a match at that position, and with the /g and /c regex modifiers (/c keeps pos() unchanged when a match fails).

Here's an example:

use strict;
use warnings;

# String with names inside square brackets
my $string = "Hello [Peter] , [Sophie] and [Jack] are here.";

pos($string) = 0;
while ($string =~ m{\G.*?\[([^\]]+)\]}cg)
{
    my $name = $1;
    print "Found name $name .\n";
}

This example is a bit contrived, but should be illustrative enough.
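A somewhat less contrived use of the same technique is a small tokenizer, where several patterns are tried in turn at the current position. The following is only a sketch: the tokenize() subroutine and the token type names (NUMBER, IDENT, OP) are made up for this illustration.

```perl
use strict;
use warnings;

# Split a simple expression into [type, text] tokens.
sub tokenize
{
    my ($input) = @_;

    my @tokens;

    pos($input) = 0;
    while (pos($input) < length($input))
    {
        # Thanks to \G, each branch tries to match exactly at pos($input).
        # With /c a failed match leaves pos() alone, so the next branch
        # starts from the same offset.
        if ($input =~ m{\G\s+}gc)
        {
            next;    # Skip whitespace.
        }
        elsif ($input =~ m{\G(\d+)}gc)
        {
            push @tokens, ['NUMBER', $1];
        }
        elsif ($input =~ m{\G([A-Za-z_]\w*)}gc)
        {
            push @tokens, ['IDENT', $1];
        }
        elsif ($input =~ m{\G([-+*/()=])}gc)
        {
            push @tokens, ['OP', $1];
        }
        else
        {
            die "Unexpected character at offset " . pos($input) . "\n";
        }
    }

    return @tokens;
}

print join(" ", map { join(":", @$_) } tokenize("total = (price + 42) * 3")), "\n";
# prints: IDENT:total OP:= OP:( IDENT:price OP:+ NUMBER:42 OP:) OP:* NUMBER:3
```

Without the /c modifier, the first failed match would reset pos() and the later branches would no longer be anchored at the right place in the string.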
