The Art of

Chapter Fifteen

Table of Content

Chapter Sixteen (Part 2) 

16.1 - An Introduction to Formal Language (Automata) Theory
16.1.1 - Machines vs. Languages
16.1.2 - Regular Languages - Regular Expressions - Nondeterministic Finite State Automata (NFAs) - Converting Regular Expressions to NFAs - Converting an NFA to Assembly Language - Deterministic Finite State Automata (DFAs) - Converting a DFA to Assembly Language
16.1.3 - Context Free Languages
16.1.4 - Eliminating Left Recursion and Left Factoring CFGs
16.1.5 - Converting REs to CFGs
16.1.6 - Converting CFGs to Assembly Language
16.1.7 - Some Final Comments on CFGs
16.1.8 - Beyond Context Free Languages
16.2 - The UCR Standard Library Pattern Matching Routines
16.3 - The Standard Library Pattern Matching Functions
16.3.1 - Spancset
16.3.2 - Brkcset
16.3.3 - Anycset
16.3.4 - Notanycset
16.3.5 - MatchStr
16.3.6 - MatchiStr
16.3.7 - MatchToStr
16.3.8 - MatchChar
16.3.9 - MatchToChar
16.3.10 - MatchChars
16.3.11 - MatchToPat
16.3.12 - EOS
16.3.13 - ARB
16.3.14 - ARBNUM
16.3.15 - Skip
16.3.16 - Pos
16.3.17 - RPos
16.3.18 - GotoPos
16.3.19 - RGotoPos
16.3.20 - SL_Match2
16.4 - Designing Your Own Pattern Matching Routines
16.5 - Extracting Substrings from Matched Patterns
16.6 - Semantic Rules and Actions
16.7 - Constructing Patterns for the MATCH Routine
16.8 - Some Sample Pattern Matching Applications
16.8.1 - Converting Written Numbers to Integers
16.8.2 - Processing Dates
16.8.3 - Evaluating Arithmetic Expressions
16.8.4 - A Tiny Assembler
16.8.5 - The "MADVENTURE" Game
16.9 - Laboratory Exercises
16.9.1 - Checking for Stack Overflow (Infinite Loops)
16.9.2 - Printing Diagnostic Messages from a Pattern
Copyright 1996 by Randall Hyde All rights reserved.

Duplication other than for immediate display through a browser is prohibited by U.S. Copyright Law.
This material is provided on-line as a beta-test of this text. It is for the personal use of the reader only. If you are interested in using this material as part of a course please contact

Supporting software and other materials are available via anonymous ftp from See the "/pub/pc/ibmpcdir" directory for details. You may also download the material from "Randall Hyde's Assembly Language Page" at URL:

This document does not contain the laboratory exercises programming assignments exercises or chapter summary. These portions were omitted for several reasons: either they wouldn't format properly they contained hyperlinks that were too much work to resolve they were under constant revision or they were not included for security reasons. Such omission should have very little impact on the reader interested in learning this material or evaluating this document.

This document was prepared using Harlequin's Web Maker 2.2 and Quadralay's Webworks Publisher. Since HTML does not support the rich formatting options available in Framemaker this document is only an approximation of the actual chapter from the textbook.

If you are absolutely dying to get your hands on a version other than HTML you might consider having the UCR Printing a Reprographics Department run you off a copy on their Xerox machines. For details please read the following EMAIL message I received from the Printing and Reprographics Department:

Hello Again Professor Hyde

Dallas gave me permission to take orders for the Computer Science 13 Manuals. We would need to take charge card orders. The only cards we take are: Master Card Visa and Discover. They would need to send the name numbers expiration date type of card and authorization to charge $95.00 for the manual and shipping also we should have their phone number in case the company has any trouble delivery. They can use my e-mail address for the orders and I will process them as soon as possible. I would assume that two weeks would be sufficient for printing packages and delivery time.

I am open to suggestions if you can think of any to make this as easy as possible.

Thank You for your business
Kathy Chapman Assistant
Printing and Reprographics University of California Riverside (909) 787-4443/4444

We are currently working on ways to publish this text in a form other than HTML (e.g. Postscript PDF Frameviewer hard copy etc.). This however is a low-priority project. Please do not contact Randall Hyde concerning this effort. When something happens an announcement will appear on "Randall Hyde's Assembly Language Page." Please visit this WEB site at for the latest scoop.

Redesigned 10/2000 with "MS FrontPage 98" using 17" monitor 1024x768
(c) 2000 BIRCOM Entertainment'95

The last chapter covered character strings and various operations on those strings. A very typical program reads a sequence of strings from the user and compares the strings to see if they match. For example DOS' COMMAND.COM program reads command lines from the user and compares the strings the user types to fixed strings like "COPY" "DEL" "RENAME" and so on. Such commands are easy to parse because the set of allowable commands is finite and fixed. Sometimes however the strings you want to test for are not fixed; instead they belong to a (possibly infinite) set of different strings. For example if you execute the DOS command "DEL *.BAK" MS-DOS does not attempt to delete a file named "*.BAK". Instead it deletes all files which match the generic pattern "*.BAK". This of course is any file which contains four or more characters and ends with ".BAK". In the MS-DOS world a string containing characters like "*" and "?" are called wildcards; wildcard characters simply provide a way to specify different names via patterns. DOS' wildcard characters are very limited forms of what are known as regular expressions; regular expressions are very limited forms of patterns in general. This chapter describes how to create patterns that match a variety of character strings and write pattern matching routines to see if a particular string matches a given pattern.

16.1 An Introduction to Formal Language (Automata) Theory

Pattern matching despite its low-key coverage is a very important topic in computer science. Indeed pattern matching is the main programming paradigm in several programming languages like Prolog SNOBOL4 and Icon. Several programs you use all the time employ pattern matching as a major part of their work. MASM for example uses pattern matching to determine if symbols are correctly formed expressions are proper and so on. Compilers for high level languages like Pascal and C also make heavy use of pattern matching to parse source files to determine if they are syntactically correct. Surprisingly enough an important statement known as Church's Hypothesis suggests that any computable function can be programmed as a pattern matching problem. Of course there is no guarantee that the solution would be efficient (they usually are not) but you could arrive at a correct solution. You probably wouldn't need to know about Turing machines (the subject of Church's hypothesis) if you're interested in writing say an accounts receivable package. However there many situations where you may want to introduce the ability to match some generic patterns; so understanding some of the theory of pattern matching is important. This area of computer science goes by the stuffy names of formal language theory and automata theory. Courses in these subjects are often less than popular because they involve a lot of proofs mathematics and well theory. However the concepts behind the proofs are quite simple and very useful. In this chapter we will not bother trying to prove everything about pattern matching. Instead we will accept the fact that this stuff really works and just apply it. Nonetheless we do have to discuss some of the results from automata theory so without further ado...

16.1.1 Machines vs. Languages

You will find references to the term "machine" throughout automata theory literature. This term does not refer to some particular computer on which a program executes. Instead this is usually some function that reads a string of symbols as input and produces one of two outputs: match or failure. A typical machine (or automaton ) divides all possible strings into two sets - those strings that it accepts (or matches) and those string that it rejects. The language accepted by this machine is the set of all strings that the machine accepts. Note that this language could be infinite finite or the empty set (i.e. the machine rejects all input strings). Note that an infinite language does not suggest that the machine accepts all strings. It is quite possible for the machine to accept an infinite number of strings and reject an even greater number of strings. For example it would be very easy to design a function which accepts all strings whose length is an even multiple of three. This function accepts an infinite number of strings (since there are an infinite number of strings whose length is a multiple of three) yet it rejects twice as many strings as it accepts. This is a very easy function to write. Consider the following 80x86 program that accepts all strings of length three (we'll assume that the carriage return character terminates a string):

MatchLen3       proc    near
getc                    ;Get character #1.
cmp     al
cr          ;Zero chars if EOLN.
je      Accept
getc                    ;Get character #2.
cmp     al
je      Failure
getc                    ;Get character #3.
cmp     al
jne     MatchLen3
Failure:        mov     ax
0           ;Return zero to denote failure.

Accept:         mov     ax
1           ;Return one to denote success.
MatchLen3       endp

By tracing through this code you should be able to easily convince yourself that it returns one in ax if it succeeds (reads a string whose length is a multiple of three) and zero otherwise.

Machines are inherently recognizers. The machine itself is the embodiment of a pattern. It recognizes any input string which matches the built-in pattern. Therefore a codification of these automatons is the basic job of the programmer who wants tomatch some patterns.

There are many different classes of machines and the languages they recognize. From simple to complex the major classifications are deterministic finite state automata (which are equivalent to nondeterministic finite state automata ) deterministic push down automata nondeterministic push down automata and Turing machines. Each successive machine in this list provides a superset of the capabilities of the machines appearing before it. The only reason we don't use Turing machines for everything is because they are more complex to program than say a deterministic finite state automaton. If you can match the pattern you want using a deterministic finite state automaton you'll probably want to code it that way rather than as a Turing machine.

Each class of machine has a class of languages associated with it. Deterministic and nondeterministic finite state automata recognize the regular languages. Nondeterministic push down automata recognize the context free languages. Turing machines can recognize all recognizable languages. We will discuss each of these sets of languages and their properties in turn.

16.1.2 Regular Languages

The regular languages are the least complex of the languages described in the previous section. That does not mean they are less useful; in fact patterns based on regular expression are probably more common than any other. Regular Expressions

The most compact way to specify the strings that belong to a regular language is with a regular expression. We shall define recursively a regular expression with the following rules:

Note that <empty set> and are not the same. The empty set is a regular language that does not accept any strings including strings of length zero. If a regular language is denoted by {} then it accepts exactly one string the string of length zero. This latter regular language accepts something the former does not.

The three rules above provide our basis for a recursive definition. Now we will define regular expressions recursively. In the following definitions assume that r s and t are any valid regular expressions.

These operators following the normal associative and distributive laws and exhibit the following precedences:

Highest:		(r)
Kleene Star
Lowest:		Alternation/Union


	(r | s) t = rt | st

rs* = r(s*)

r  t - s = r  (t - s)

r  t - s = (r  t) - s

Generally we'll use parenthesis to avoid any ambiguity

Although this definition is sufficient for an automata theory class there are some practical aspects to this definition that leave a little to be desired. For example to define a regular expression that matches a single alphabetic character you would need to create something like (a | b | c | ... | y | z ). Quite a lot of typing for such a trivial character set. Therefore we shall add some notation to make it easier to specify regular expressions.

With the notational baggage out of the way it's time to discuss how to actually use regular expressions as pattern matching specifications. The following examples should give a suitable introduction.

Identifiers: Most programming languages like Pascal or C/C++ specify legal forms for identifiers using a regular expression. Expressed in English terms the specification is something like "An identifier must begin with an alphabetic character and is followed by zero or more alphanumeric or underscore characters." Using the regular expression (RE) syntax described in this section an identifier is


Integer Consts: A regular expression for integer constants is relatively easy to design. An integer constant consists of an optional plus or minus followed by one or more digits. The RE is

(+ | - | ) [0-9]+

Note the use of the empty string () to make the plus or minus optional.

Real Consts: Real constants are a bit more complex but still easy to specify using REs. Our definition matches that for a real constant appearing in a Pascal program - an optional plus or minus following by one or more digits; optionally followed by a decimal point and zero or more digits; optionally followed by an "e" or an "E" with an optional sign and one or more digits:

(+ | - | ) [0-9]+ ( "." [0-9]* | ) (((e | E) (+ | - | ) [0-9]+) | )

Since this RE is relatively complex we should dissect it piece by piece. The first parenthetical term gives us the optional sign. One or more digits are mandatory before the decimal point the second term provides this. The third term allows an optional decimal point followed by zero or more digits. The last term provides for an optional exponent consisting of "e" or "E" followed by an optional sign and one or more digits.

Reserved Words: It is very easy to provide a regular expression that matches a set of reserved words. For example if you want to create a regular expression that matches MASM's reserved words you could use an RE similar to the following:

( mov | add | and | | mul )

Even: The regular expression ( )* matches all strings whose length is a multiple of two.

Sentences: The regular expression:

(* " "* )* run ( " "+ ( * " "+ | )) fast (" " * )*

matches all strings that contain the separate words "run" followed by "fast" somewhere on the line. This matches strings like "I want to run very fast" and "run as fast as you can" as well as "run fast."

While REs are convenient for specifying the pattern you want to recognize they are not particularly useful for creating programs (i.e. "machines") that actually recognize such patterns. Instead you should first convert an RE to a nondeterministic finite state automaton or NFA. It is very easy to convert an NFA into an 80x86 assembly language program; however such programs are rarely efficient as they might be. If efficiency is a big concern you can convert the NFA into a deterministic finite state automaton (DFA) that is also easy to convert to 80x86 assembly code but the conversion is usually far more efficient.

Chapter Fifteen

Table of Content

Chapter Sixteen (Part 2) 

Chapter Sixteen: Pattern Matching (Part 1)
29 SEP 1996