The Art of
ASSEMBLY LANGUAGE PROGRAMMING

Chapter Sixteen (Part 5)

Table of Content

Chapter Sixteen (Part 7) 

CHAPTER SIXTEEN:
PATTERN MATCHING (Part 6)
16.4 - Designing Your Own Pattern Matching Routines
16.5 - Extracting Substrings from Matched Patterns
16.4 Designing Your Own Pattern Matching Routines

Although the UCR Standard Library provides a wide variety of matching functions there is no way to anticipate the needs of all applications. Therefore you will probably discover that the library does not support some particular pattern matching function you need. Fortunately it is very easy for you to create your own pattern matching functions to augment those available in the UCR Standard Library. When you specify a matching function name in the pattern data structure the match routine calls the specified address using a far call and passing the following parameters:

es:di- Points at the next character in the input string. You should not look at any characters before this address. Furthermore you should never look beyond the end of the string (see cx below).

ds:si- Contains the four byte parameter found in the matchparm field.

cx- Contains the last position plus one in the input string you're allowed to look at. Note that your pattern matching routine should not look beyond location es:cx or the zero terminating byte; whichever comes first in the input string.

On return from the function ax must contain the offset into the string (di's value) of the last character matched plus one if your matching function is successful. It must also set the carry flag to denote success. After your pattern matches the match routine might call another matching function (the one specified by the next pattern field) and that function begins matching at location es:ax.

If the pattern match fails then you must return the original di value in the ax register and return with the carry flag clear. Note that your matching function must preserve all other registers.

There is one very important detail you must never forget with writing your own pattern matching routines - ds does not point at your data segment it contains the H.O. word of the matchparm parameter. Therefore if you are going to access global variables in your data segment you will need to push ds load it with the address of dseg and pop ds before leaving. Several examples throughout this chapter demonstrate how to do this.

There are some obvious omissions from (the current version of) the UCR Standard Library's repertoire. For example there should probably be matchtoistr matchichar and matchtoichar pattern functions. The following example code demonstrates how to add a matchtoistr (match up to a string doing a case insensitive comparison) routine.

                .xlist

include         stdlib.a
includelib      stdlib.lib
matchfuncs
.list

dseg            segment para public 'data'

TestString      byte    "This is the string 'xyz' in it"
cr
lf
0

TestPat         pattern {matchtoistr
xyz}
xyz             byte    "XYZ"
0

dseg            ends


cseg            segment para public 'code'
assume  cs:cseg
ds:dseg

; MatchToiStr-  Matches all characters in a string up to
and including
the
;               specified parameter string. The parameter string must be
;               all upper case characters. This guy matches string using
;               a case insensitive comparison.
;
; inputs:
;               es:di-  Source string
;               ds:si-  String to match
;               cx-     Maximum match position
;
; outputs:
;               ax-     Points at first character beyond the end of the
;                       matched string if success
contains the initial DI
;                       value if failure occurs.
;               carry-  0 if failure
1 if success.

MatchToiStr     proc    far
pushf
push    di
push    si
cld

; Check to see if we're already past the point were we're allowed
; to scan in the input string.

cmp     di
cx
jae     MTiSFailure

; If the pattern string is the empty string
always match.

cmp     byte ptr ds:[si]
0
je      MTSsuccess

; The following loop scans through the input string looking for
; the first character in the pattern string.

ScanLoop:       push    si
lodsb                   ;Get first char of string

dec     di
FindFirst:      inc     di              ;Move on to next (or 1st) char.
cmp     di
cx          ;If at cx
then we've got to
jae CantFind1st ; fail.

mov     ah
es:[di]     ;Get input character.
cmp     ah
'a'         ;Convert input character to
jb      DoCmp           ; upper case if it's a lower
cmp     ah
'z'         ; case character.
ja      DoCmp
and     ah
5fh
DoCmp:          cmp     al
ah          ;Compare input character against
jne     FindFirst       ; pattern string.


; At this point
we've located the first character in the input string
; that matches the first character of the pattern string. See if the
; strings are equal.

push    di              ;Save restart point.

CmpLoop:        cmp     di
cx          ;See if we've gone beyond the
jae     StrNotThere     ; last position allowable.
lodsb                   ;Get next input character.
cmp     al
0           ;At the end of the parameter
je      MTSsuccess2     ; string? If so
succeed.

inc     di
mov     ah
es:[di]     ;Get the next input character.
cmp     ah
'a'         ;Convert input character to
jb      DoCmp2          ; upper case if it's a lower
cmp     ah
'z'         ; case character.
ja      DoCmp2
and     ah
5fh
DoCmp2:         cmp     al
ah          ;Compare input character against
je      CmpLoop
pop     di
pop     si
jmp     ScanLoop

StrNotThere:    add     sp
2           ;Remove di from stack.
CantFind1st:    add     sp
2           ;Remove si from stack.
MTiSFailure:    pop     si
pop     di
mov     ax
di          ;Return failure position in AX.
popf
clc                     ;Return failure.
ret

MTSSuccess2:    add     sp
2           ;Remove DI value from stack.
MTSSuccess:     add     sp
2           ;Remove SI value from stack.
mov     ax
di          ;Return next position in AX.
pop     si
pop     di
popf
stc                     ;Return success.
ret
MatchToiStr     endp

Main            proc
mov     ax
dseg
mov     ds
ax
mov     es
ax
meminit

lesi    TestString
ldxi    TestPat
xor     cx
cx
match
jnc     NoMatch
print
byte    "Matched"
cr
lf
0
jmp     Quit

NoMatch:        print
byte    "Did not match"
cr
lf
0

Quit:           ExitPgm
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             db      1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       db      16 dup (?)
zzzzzzseg       ends
end     Main
16.5 Extracting Substrings from Matched Patterns

Often simply determining that a string matches a given pattern is insufficient. You may want to perform various operations that depend upon the actual information in that string. However the pattern matching facilities described thus far do not provide a mechanism for testing individual components of the input string. In this section you will see how to extract portions of a pattern for further processing.

Perhaps an example may help clarify the need to extract portions of a string. Suppose you are writing a stock buy/sell program and you want it to process commands described by the following regular expression:

(buy | sell) [0-9]+ shares of (ibm | apple | hp | dec)

While it is easy to devise a Standard Library pattern that recognizes strings of this form calling the match routine would only tell you that you have a legal buy or sell command. It does not tell you if you are to buy or sell who to buy or sell or how many shares to buy or sell. Of course you could take the cross product of (buy | sell) with (ibm | apple | hp | dec) and generate eight different regular expressions that uniquely determine whether you're buying or selling and whose stock you're trading but you can't process the integer values this way (unless you willing to have millions of regular expressions). A better solution would be to extract substrings from the legal pattern and process these substrings after you verify that you have a legal buy or sell command. For example you could extract buy or sell into one string the digits into another and the company name into a third. After verifying the syntax of the command you could process the individual strings you've extracted. The UCR Standard Library patgrab routine provides this capability for you.

You normally call patgrab after calling match and verifying that it matches the input string. Patgrab expects a single parameter - a pointer to a pattern recently processed by match. Patgrab creates a string on the heap consisting of the characters matched by the given pattern and returns a pointer to this string in es:di. Note that patgrab only returns a string associated with a single pattern data structure not a chain of pattern data structures. Consider the following pattern:

PatToGrab       pattern {matchstr
str1
0
Pat2}
Pat2            pattern {matchstr
str2}
str1            byte    "Hello"
0
str2            byte    " there"
0

Calling match on PatToGrab will match the string "Hello there". However if after calling match you call patgrab and pass it the address of PatToGrab patgrab will return a pointer to the string "Hello".

Of course you might want to collect a string that is the concatenation of several strings matched within your pattern (i.e. a portion of the pattern list). This is where calling the sl_match2 pattern matching function comes in handy. Consider the following pattern:

Numbers         pattern {sl_match2
FirstNumber}
FirstNumber     pattern {anycset
digits
0
OtherDigs}
OtherDigs       pattern {spancset
digits}

This pattern matches the same strings as

Numbers         pattern {anycset
digits
0
OtherDigs}
OtherDigs       pattern {spancset
digits}

So why bother with the extra pattern that calls sl_match2? Well as it turns out the sl_match2 matching function lets you create parenthetical patterns. A parenthetical pattern is a pattern list that the pattern matching routines (especially patgrab) treat as a single pattern. Although the match routine will match the same strings regardless of which version of Numbers you use patgrab will produce two entirely different strings depending upon your choice of the above patterns. If you use the latter version patgrab will only return the first digit of the number. If you use the former version (with the call to sl_match2) then patgrab returns the entire string matched by sl_match2 and that turns out to be the entire string of digits.

The following sample program demonstrates how to use parenthetical patterns to extract the pertinent information from the stock command presented earlier. It uses parenthetical patterns for the buy/sell command the number of shares and the company name.

                .xlist
include                 stdlib.a
includelib              stdlib.lib
matchfuncs
.list

dseg            segment para public 'data'

; Variables used to hold the number of shares bought/sold
a pointer to
; a string containing the buy/sell command
and a pointer to a string
; containing the company name.

Count           word    0
CmdPtr          dword   ?
CompPtr         dword   ?

; Some test strings to try out:

Cmd1            byte    "Buy 25 shares of apple stock"
0
Cmd2            byte    "Sell 50 shares of hp stock"
0
Cmd3            byte    "Buy 123 shares of dec stock"
0
Cmd4            byte    "Sell 15 shares of ibm stock"
0
BadCmd0         byte    "This is not a buy/sell command"
0

; Patterns for the stock buy/sell command:
;
; StkCmd matches buy or sell and creates a parenthetical pattern
; that contains the string "buy" or "sell".

StkCmd          pattern {sl_match2
buyPat
0
skipspcs1}

buyPat          pattern {matchistr
buystr
sellpat}
buystr          byte    "BUY"
0

sellpat         pattern {matchistr
sellstr}
sellstr         byte    "SELL"
0

; Skip zero or more white space characters after the buy command.

skipspcs1       pattern {spancset
whitespace
0
CountPat}

; CountPat is a parenthetical pattern that matches one or more
; digits.

CountPat        pattern {sl_match2
Numbers
0
skipspcs2}
Numbers         pattern {anycset
digits
0
RestOfNum}
RestOfNum       pattern {spancset
digits}

; The following patterns match " shares of " allowing any amount
; of white space between the words.

skipspcs2       pattern {spancset
whitespace
0
sharesPat}

sharesPat       pattern {matchistr
sharesStr
0
skipspcs3}
sharesStr       byte    "SHARES"
0

skipspcs3       pattern {spancset
whitespace
0
ofPat}

ofPat           pattern {matchistr
ofStr
0
skipspcs4}
ofStr           byte    "OF"
0

skipspcs4       pattern {spancset
whitespace
0
CompanyPat}

; The following parenthetical pattern matches a company name.
; The patgrab-available string will contain the corporate name.

CompanyPat      pattern {sl_match2
ibmpat}

ibmpat          pattern {matchistr
ibm
applePat}
ibm             byte    "IBM"
0

applePat        pattern {matchistr
apple
hpPat}
apple           byte    "APPLE"
0

hpPat           pattern {matchistr
hp
decPat}
hp              byte    "HP"
0

decPat          pattern {matchistr
decstr}
decstr          byte    "DEC"
0

include stdsets.a
dseg            ends

cseg            segment para public 'code'
assume  cs:cseg
ds:dseg

; DoBuySell-    This routine processes a stock buy/sell command.
;               After matching the command
it grabs the components
;               of the command and outputs them as appropriate.
;               This routine demonstrates how to use patgrab to
;               extract substrings from a pattern string.
;
;               On entry
es:di must point at the buy/sell command
;               you want to process.

DoBuySell       proc    near
ldxi StkCmd
xor     cx
cx
match
jnc     NoMatch

lesi    StkCmd
patgrab
mov     word ptr CmdPtr
di
mov     word ptr CmdPtr+2
es

lesi    CountPat
patgrab
atoi                    ;Convert digits to integer
mov     Count
ax
free                    ;Return storage to heap.

lesi    CompanyPat
patgrab
mov     word ptr CompPtr
di
mov     word ptr CompPtr+2
es

printf
byte    "Stock command: %^s\n"
byte    "Number of shares: %d\n"
byte    "Company to trade: %^s\n\n"
0
dword   CmdPtr
Count
CompPtr

les     di
CmdPtr
free
les     di
CompPtr
free
ret

NoMatch:                print
byte    "Illegal buy/sell command"
cr
lf
0
ret
DoBuySell       endp


Main            proc
mov     ax
dseg
mov     ds
ax
mov     es
ax

meminit

lesi    Cmd1
call    DoBuySell
lesi    Cmd2
call    DoBuySell
lesi    Cmd3
call    DoBuySell
lesi    Cmd4
call    DoBuySell
lesi    BadCmd0
call    DoBuySell

Quit:           ExitPgm
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             db      1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       db      16 dup (?)
zzzzzzseg       ends
end     Main


Sample program output:
Stock command: Buy
Number of shares: 25
Company to trade: apple

Stock command: Sell
Number of shares: 50
Company to trade: hp

Stock command: Buy
Number of shares: 123
Company to trade: dec

Stock command: Sell
Number of shares: 15
Company to trade: ibm

Illegal buy/sell command

Chapter Sixteen (Part 5)

Table of Content

Chapter Sixteen (Part 7) 

Chapter Sixteen: Pattern Matching (Part 6)
29 SEP 1996