One of Perl's original applications was text processing (see section A Brief History of Perl). So far, we have seen how easy manipulation of scalar and list data is in Perl, but we have yet to explore the core of Perl's text processing: regular expressions. To remedy that, this chapter is devoted completely to regular expressions.
Regular expressions are a concept borrowed from automata theory. They provide a way to describe a "language" of strings.
The term language, when used in the sense borrowed from automata theory, can be a bit confusing. A language in automata theory is simply some (possibly infinite) set of strings. Each string (which may be empty) is a sequence of characters drawn from a fixed, finite set. In our case, this set will be all the possible ASCII characters.
When we write a regular expression, we are writing a description of some set of possible strings. For the regular expression to be useful, the set of strings that it defines should have some meaning to us.
Regular expressions give us a great deal of power to do pattern matching on text documents. We can use the regular expression syntax to write a succinct description of an entire, possibly infinite, class of strings that fit our specification. In addition, anyone else who understands the description language of regular expressions can easily read our description and determine what set of strings we want to match. Regular expressions are a universal notation for describing such sets of strings.
When we discuss regular expressions, we discuss "matching". If a regular expression "matches" a given string, then that string is in the class we described with the regular expression. If it does not match, then the string is not in the desired class.
We can start our discussion of regular expressions by considering the simplest operators, which can actually be used to create all possible regular expressions. All the other regular expression operators can be reduced to combinations of these simple operators.
In regular expressions, generally, a character matches itself. The only exceptions are regular expression special characters. To match one of these special characters, you must put a \ before the character. For example, the regular expression abc matches the set of strings that contain abc somewhere in them. Since * happens to be a regular expression special character, the regular expression \* matches any string that contains the * character.
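As a minimal sketch of both rules, the following Perl fragment tests a few sample strings with the =~ match operator, which is introduced later in this chapter. The helper names contains_abc and contains_star are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains the characters abc in a row, else 0.
sub contains_abc {
    my ($text) = @_;
    return $text =~ /abc/ ? 1 : 0;
}

# Hypothetical helper: 1 if $text contains a literal * character, else 0.
# The backslash strips * of its special meaning.
sub contains_star {
    my ($text) = @_;
    return $text =~ /\*/ ? 1 : 0;
}

foreach my $text ('xxabcxx', 'a*b', 'ab') {
    printf "%-8s abc=%d star=%d\n", $text, contains_abc($text), contains_star($text);
}
```

Only the first sample contains abc, and only the second contains a literal *.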
As we mentioned, * is a regular expression special character. The * is used to indicate that zero or more of the previous character should be matched. Thus, the regular expression a* will match any string that contains zero or more a's. Note that since a* will match any string with zero or more a's, a* will match all strings, since all strings (including the empty string) contain at least zero a's. So, a* is not a very useful regular expression.
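To see why a* is of little use on its own, here is a small sketch that counts how many sample strings it matches. The helper name count_a_star_matches and the sample strings are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many of the given strings a* matches.
sub count_a_star_matches {
    my @strings = @_;
    my $count = 0;
    foreach my $string (@strings) {
        $count++ if $string =~ /a*/;
    }
    return $count;
}

my @samples = ('', 'b', 'aaa', 'xyz');
print count_a_star_matches(@samples), " of ", scalar(@samples), " samples matched\n";
# prints "4 of 4 samples matched"
```

Every sample matches, including the empty string and the strings with no a in them at all.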
A more useful regular expression might be baa*. This regular expression will match any string that has a b followed by one or more a's. Thus, the set of strings we are matching are those that contain ba, baa, baaa, etc. In other words, we are looking to see if there is any "sheep speech" hidden in our text.
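A short sketch of the sheep-speech test, assuming a hypothetical helper named has_sheep_speech:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains a b followed by at least one a.
sub has_sheep_speech {
    my ($text) = @_;
    return $text =~ /baa*/ ? 1 : 0;
}

foreach my $text ('the sheep said baaa', 'abba', 'b', 'aaa') {
    printf "%-20s %d\n", $text, has_sheep_speech($text);
}
```

Note that abba counts as a match, because it contains ba. The lone b and the run of a's with no b in front do not match.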
The next special character we will consider is the . character. The . will match any single character (in Perl, by default, any character except a newline). As an example, consider the regular expression a.c. This regular expression will match any string that contains an a and a c with exactly one character, whatever it is, in between. Thus, strings that contain abc, acc, amc, etc. are all in the class of strings that this regular expression matches.
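The following sketch tries a.c against a few strings. The helper name matches_a_dot_c is hypothetical. The last case relies on a Perl-specific detail: without extra modifiers, the . does not match a newline.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains a, then any one character, then c.
sub matches_a_dot_c {
    my ($text) = @_;
    return $text =~ /a.c/ ? 1 : 0;
}

foreach my $text ('abc', 'a c', 'ac', 'abbc') {
    printf "%-5s %d\n", $text, matches_a_dot_c($text);
}

# A newline between a and c is not matched by the . by default.
print "newline case: ", matches_a_dot_c("a\nc"), "\n";
```

The strings ac and abbc fail because the . stands for exactly one character, neither zero nor two.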
The | special character is equivalent to an "or" in regular expressions. This character is used to give a choice. So, the regular expression abc|def will match any string that contains either abc or def.
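A brief sketch of the choice operator, with a hypothetical helper named has_abc_or_def:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains abc or contains def, else 0.
sub has_abc_or_def {
    my ($text) = @_;
    return $text =~ /abc|def/ ? 1 : 0;
}

foreach my $text ('xxabcxx', 'undefined', 'abef') {
    printf "%-10s %d\n", $text, has_abc_or_def($text);
}
```

The string abef fails: the choice is between the whole sequence abc and the whole sequence def, not between the single characters c and d.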
Sometimes, within regular expressions, we want to group things together. Doing this allows building of larger regular expressions based on smaller components. The ()'s are used for grouping. For example, if we want to match any string that contains abc or def, zero or more times, surrounded by an xx on either side, we could write the regular expression xx(abc|def)*xx. This applies the * character to everything that is in the parentheses. Thus, we can match strings such as xxabcxx, xxabcdefxx, etc.
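The grouping example can be sketched as follows. The helper name matches_grouped is an assumption made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains xx, then zero or more
# repetitions of abc or def, then xx again.
sub matches_grouped {
    my ($text) = @_;
    return $text =~ /xx(abc|def)*xx/ ? 1 : 0;
}

foreach my $text ('xxabcxx', 'xxabcdefxx', 'xxxx', 'xxabxx') {
    printf "%-12s %d\n", $text, matches_grouped($text);
}
```

The string xxxx matches because the group may repeat zero times. The string xxabxx does not, since ab is neither abc nor def.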
Sometimes, we want to apply the regular expression from a defined point. In other words, we want to anchor the regular expression so that it is not permitted to match anywhere in the string, but only from a certain point.
The anchor operators allow us to do this. When we start a regular expression with a ^, it anchors the regular expression to the beginning of the string. This means that whatever the regular expression starts with must be matched at the beginning of the string. For example, ^aa* will not match strings that merely contain one or more a's; rather, it matches strings that start with one or more a's.
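To make the difference concrete, this sketch compares the anchored expression with its unanchored counterpart. Both helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text starts with one or more a's.
sub starts_with_a {
    my ($text) = @_;
    return $text =~ /^aa*/ ? 1 : 0;
}

# Hypothetical helper: 1 if $text contains one or more a's anywhere.
sub contains_a {
    my ($text) = @_;
    return $text =~ /aa*/ ? 1 : 0;
}

foreach my $text ('aab', 'baa') {
    printf "%-4s anchored=%d unanchored=%d\n",
        $text, starts_with_a($text), contains_a($text);
}
```

Both strings contain a's, but only aab has them at the beginning.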
We can also use the $ at the end of the regular expression to anchor it at the end of the string. If we applied this to our last regular expression, we have ^aa*$, which now matches only those strings that consist of one or more a's and nothing else. This makes it clear that the regular expression cannot just look anywhere in the string; rather, the regular expression must be able to match the entire string exactly, or it will not match at all.
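A sketch of the fully anchored expression follows, using a hypothetical helper named is_all_a. The final case shows one Perl-specific exception to "the entire string exactly": the $ also matches just before a newline that ends the string.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text consists of one or more a's and nothing else.
sub is_all_a {
    my ($text) = @_;
    return $text =~ /^aa*$/ ? 1 : 0;
}

foreach my $text ('a', 'aaaa', 'aab', 'baa', '') {
    printf "'%s' %d\n", $text, is_all_a($text);
}

# Perl-specific: a single trailing newline is tolerated by $.
print "trailing newline: ", is_all_a("aaa\n"), "\n";
```

Lines read from a file usually end in a newline, so this behavior of $ is often convenient when testing input line by line.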
In most cases, you will want to anchor a regular expression to the start of the string, the end of the string, or both. Using a regular expression without some sort of anchor can produce confusing and strange results. However, it is occasionally useful.
Now that you are familiar with some of the basics of regular expressions, you probably want to know how to use them in Perl. Doing so is very easy. There is an operator, =~, that you can use to match a regular expression against scalar variables. Regular expressions in Perl are placed between two forward slashes (i.e., //). The whole $scalar =~ // expression will evaluate to 1 if a match occurs, and to a false value (the empty string, not undef) if it does not.
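The value of a match expression can be inspected directly, as in this small sketch. The helper name result_of_match and the sample strings are assumptions. Note that a failed match yields a false value that is defined (an empty string) rather than undef.

```perl
use strict;
use warnings;

# Hypothetical helper: return the scalar value that the match
# expression itself evaluates to.
sub result_of_match {
    my ($text) = @_;
    my $result = ($text =~ /express/);
    return $result;
}

my $hit  = result_of_match('regular expressions');
my $miss = result_of_match('sheep speech');

printf "hit:  '%s'\n", $hit;
printf "miss: '%s' (defined: %s)\n", $miss, defined($miss) ? 'yes' : 'no';
```

Because both results are ordinary true and false values, the expression can be used directly as the condition of an if or while statement.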
Consider the following code sample:
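use strict;

# Read standard input one line at a time until end of file.
# The variable is declared with my, as use strict requires.
while ( defined(my $currentLine = <STDIN>) ) {
    # Keep only lines that begin with "JMS speaks:" or "RMS speaks:".
    if ($currentLine =~ /^(J|R)MS speaks:/) {
        print $currentLine;
    }
}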
This code will go through each line of the input and print only those lines that start with "JMS speaks:" or "RMS speaks:".
Writing out regular expressions can be problematic. For example, if we want a regular expression that matches any single digit, we have to write:
(0|1|2|3|4|5|6|7|8|9)
It would be terribly annoying to have to write such things out. So, Perl gives an incredible number of shortcuts for writing regular expressions. These are largely syntactic sugar, since we could write out regular expressions in the same way we did above. However, that is too cumbersome.
For example, for ranges of values, we can use the brackets, []'s. So, for our digit expression above, we can write [0-9]. In fact, it is even easier in Perl, because \d will match that very same thing.
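As a quick check, the sketch below tries all three spellings against every ASCII character and counts the disagreements. The helper name count_digit_disagreements is hypothetical. The check is deliberately limited to ASCII; on Unicode text, Perl's \d can also match digits from other scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: count the ASCII characters (codes 0..127) on which
# the three ways of writing "a single digit" disagree.
sub count_digit_disagreements {
    my $disagreements = 0;
    foreach my $code (0 .. 127) {
        my $char     = chr($code);
        my $longhand = $char =~ /^(0|1|2|3|4|5|6|7|8|9)$/ ? 1 : 0;
        my $range    = $char =~ /^[0-9]$/ ? 1 : 0;
        my $shortcut = $char =~ /^\d$/ ? 1 : 0;
        $disagreements++ unless $longhand == $range && $range == $shortcut;
    }
    return $disagreements;
}

print "disagreements: ", count_digit_disagreements(), "\n";
# prints "disagreements: 0"
```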
There are lots of these kinds of shortcuts. They are listed in the `perlre' online manual and in many other places, so there is no need to list them again here.
However, as you learn about all the regular expression shortcuts, remember that they can all be reduced to the original operators we discussed above. They are simply short ways of saying things that can be built with regular characters, *, (), and |.
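To illustrate the reduction, this sketch compares two shortcuts with their longhand forms on a handful of strings. The + shortcut ("one or more") and the bracketed set [abc] are taken from the `perlre' manual and are not otherwise covered in this chapter. The helper name and the sample strings are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if each shortcut and its longhand form give the
# same answer on every string supplied, else 0.
sub shortcuts_agree_with_longhand {
    my @strings = @_;
    foreach my $string (@strings) {
        # ba+ is short for baa*
        my $plus_short  = $string =~ /ba+/  ? 1 : 0;
        my $plus_long   = $string =~ /baa*/ ? 1 : 0;
        # x[abc]x is short for x(a|b|c)x
        my $class_short = $string =~ /x[abc]x/   ? 1 : 0;
        my $class_long  = $string =~ /x(a|b|c)x/ ? 1 : 0;
        return 0 unless $plus_short == $plus_long
                     && $class_short == $class_long;
    }
    return 1;
}

my @samples = ('', 'b', 'ba', 'xbaaa', 'xax', 'xdx');
print "agree: ", shortcuts_agree_with_longhand(@samples), "\n";
```

Agreement on a few samples is only a spot check, not a proof, but it shows the kind of equivalence the text describes.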