The Art of
ASSEMBLY LANGUAGE PROGRAMMING

CHAPTER FIFTEEN:
STRINGS AND CHARACTER SETS (Part 3)
15.2 - Character Strings
15.2.1 - Types of Strings
15.2.2 - String Assignment
15.2.3 - String Comparison
15.2 Character Strings

Since you'll encounter character strings more often than other types of strings, they deserve special attention. The following sections describe character strings and various types of string operations.

15.2.1 Types of Strings

At the most basic level, the 80x86's string instructions operate only upon arrays of characters. However, since most string data types contain an array of characters as a component, the 80x86's string instructions are handy for manipulating that portion of the string.

Probably the biggest difference between a character string and an array of characters is the length attribute. An array of characters contains a fixed number of characters. Never any more, never any less. A character string, however, has a dynamic run-time length, that is, the number of characters contained in the string at some point in the program. Character strings, unlike arrays of characters, have the ability to change their size during execution (within certain limits, of course).

To complicate things even more, there are two generic types of strings: statically allocated strings and dynamically allocated strings. Statically allocated strings are given a fixed, maximum length at program creation time. The length of the string may vary at run-time, but only between zero and this maximum length. Most systems allocate and deallocate dynamically allocated strings in a memory pool when using strings. Such strings may be any length (up to some reasonable maximum value). Accessing such strings is less efficient than accessing statically allocated strings. Furthermore, garbage collection[5] may take additional time. Nevertheless, dynamically allocated strings are much more space efficient than statically allocated strings and, in some instances, accessing dynamically allocated strings is faster as well. Most of the examples in this chapter will use statically allocated strings.

A string with a dynamic length needs some way of keeping track of this length. While there are several possible ways to represent string lengths, the two most popular are length-prefixed strings and zero-terminated strings. A length-prefixed string consists of a single byte or word that contains the length of that string. Immediately following this length value are the characters that make up the string. Assuming the use of byte prefix lengths, you could define the string "HELLO" as follows:

HelloStr        byte    5,"HELLO"

Length-prefixed strings are often called Pascal strings since this is the type of string variable supported by most versions of Pascal[6].
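Combining the two ideas above, a statically allocated, length-prefixed string variable might be reserved as in the following sketch. The names MaxLen and StaticStr, and the 80-character limit, are hypothetical choices made only for illustration:

```asm
MaxLen          =       80              ;Hypothetical maximum length.
StaticStr       byte    0               ;Current length; string starts out empty.
                byte    MaxLen dup (?)  ;Room for up to MaxLen characters.
```

The length byte may take on any value from zero to MaxLen at run-time, but the storage reserved for the characters never changes.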

Another popular way to specify string lengths is to use zero-terminated strings. A zero-terminated string consists of a string of characters terminated with a zero byte. These types of strings are often called C-strings since they are the type used by the C/C++ programming language. The UCR Standard Library, since it mimics the C standard library, also uses zero-terminated strings.
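For comparison with the length-prefixed definition of "HELLO", a zero-terminated version could be declared as follows. The label HelloZStr is a hypothetical name used only for this illustration:

```asm
HelloZStr       byte    "HELLO",0       ;Five characters plus a zero terminator.
```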

Pascal strings are much better than C/C++ strings for several reasons. First, computing the length of a Pascal string is trivial. You need only fetch the first byte (or word) of the string and you've got the length of the string. Computing the length of a C/C++ string is considerably less efficient. You must scan the entire string (e.g., using the scasb instruction) for a zero byte. If the C/C++ string is long, this can take a long time. Furthermore, C/C++ strings cannot contain the NUL character (a zero byte), since that byte marks the end of the string. On the other hand, C/C++ strings can be any length, yet require only a single extra byte of overhead. Pascal strings, however, can be no longer than 255 characters when using only a single length byte. For strings longer than 255 bytes, you'll need two bytes to hold the length for a Pascal string. Since most strings are less than 256 characters in length, this isn't much of a disadvantage.
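The difference in effort can be seen in the following sketch. It assumes PasStr is a length-prefixed string with a byte prefix, CStr is a zero-terminated string, ds addresses PasStr, and es addresses CStr; both variable names are hypothetical:

```asm
; Pascal string: the length is simply the first byte.

                mov     cl, PasStr      ;Fetch the length byte.
                mov     ch, 0           ;Extend len to 16 bits.

; C/C++ string: scan for the zero byte, then work out the length.

                lea     di, CStr
                mov     cx, 0ffffh      ;Scan for as long as it takes.
                mov     al, 0           ;Scan for a zero.
                cld
        repne   scasb
                neg     cx              ;Convert count to a positive #.
                dec     cx              ;Because we started with -1, not 0.
                dec     cx              ;Don't count the zero byte.
```

The first sequence takes two instructions regardless of the string's length. The second must examine every character of the string before it can produce the length in cx.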

An advantage of zero-terminated strings is that they are easy to use in an assembly language program. This is particularly true of strings that are so long they require multiple source code lines in your assembly language programs. Counting up every character in a string is so tedious that it's not even worth considering. However, you can write a macro which will easily build Pascal strings for you:

PString         macro   String
                local   StringLength, StringStart
                byte    StringLength
StringStart     byte    String
StringLength    =       $-StringStart
                endm
                 .
                 .
                 .
                PString "This string has a length prefix"

As long as the string fits entirely on one source line, you can use this macro to generate Pascal style strings.

Common string functions like concatenation, length, substring, index, and others are much easier to write when using length-prefixed strings. So we'll use Pascal strings unless otherwise noted. Furthermore, the UCR Standard Library provides a large number of C/C++ string functions, so there is no need to replicate those functions here.

15.2.2 String Assignment

You can easily assign one string to another using the movsb instruction. For example, if you want to assign the length-prefixed string String1 to String2, use the following:

; Presumably, ES and DS are set up already

                lea     si, String1
                lea     di, String2
                mov     ch, 0           ;Extend len to 16 bits.
                mov     cl, String1     ;Get string length.
                inc     cx              ;Include length byte.
        rep     movsb

This code increments cx by one before executing movsb because the length byte contains the length of the string exclusive of the length byte itself.

Generally, string variables can be initialized to constants by using the PString macro described earlier. However, if you need to set a string variable to some constant value, you can write a StrAssign subroutine which assigns the string immediately following the call. The following procedure does exactly that:

                include         stdlib.a
                includelib      stdlib.lib

cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg, es:dseg, ss:sseg

; String assignment procedure

MainPgm         proc    far
                mov     ax, seg dseg
                mov     ds, ax
                mov     es, ax

                lea     di, ToString
                call    StrAssign
                byte    "This is an example of how the "
                byte    "StrAssign routine is used",0
                nop
                ExitPgm
MainPgm         endp

StrAssign       proc    near
                push    bp
                mov     bp, sp
                pushf
                push    ds
                push    si
                push    di
                push    cx
                push    ax
                push    di              ;Save again for use later.
                push    es
                cld

; Get the address of the source string

                mov     ax, cs
                mov     es, ax
                mov     di, 2[bp]       ;Get return address.
                mov     cx, 0ffffh      ;Scan for as long as it takes.
                mov     al, 0           ;Scan for a zero.
        repne   scasb                   ;Compute the length of string.
                neg     cx              ;Convert length to a positive #.
                dec     cx              ;Because we started with -1, not 0.
                dec     cx              ;skip zero terminating byte.

; Now copy the strings

                pop     es              ;Get destination segment.
                pop     di              ;Get destination address.
                mov     al, cl          ;Store length byte.
                stosb

; Now copy the source string.

                mov     ax, cs
                mov     ds, ax
                mov     si, 2[bp]
        rep     movsb

; Update the return address and leave:

                inc     si              ;Skip over zero byte.
                mov     2[bp], si

                pop     ax
                pop     cx
                pop     di
                pop     si
                pop     ds
                popf
                pop     bp
                ret
StrAssign       endp

cseg            ends

dseg            segment para public 'data'
ToString        byte    255 dup (0)
dseg            ends

sseg            segment para stack 'stack'
                word    256 dup (?)
sseg            ends
                end     MainPgm

This code uses the scas instruction to determine the length of the string immediately following the call instruction. Once the code determines the length, it stores this length into the first byte of the destination string and then copies the text following the call to the string variable. After copying the string, this code adjusts the return address so that it points just beyond the zero terminating byte. Then the procedure returns control to the caller.

Of course, this string assignment procedure isn't very efficient, but it's very easy to use. Setting up es:di is all that you need to do to use this procedure. If you need fast string assignment, simply use the movs instruction as follows:

; Presumably, DS and ES have already been set up.

                lea     si, SourceString
                lea     di, DestString
                mov     cx, LengthSource
        rep     movsb
                 .
                 .
                 .
SourceString    byte    LengthSource-1
                byte    "This is an example of how the "
                byte    "StrAssign routine is used"
LengthSource    =       $-SourceString

DestString      byte    256 dup (?)

Using in-line instructions requires considerably more setup (and typing!), but it is much faster than the StrAssign procedure. If you don't like the typing, you can always write a macro to do the string assignment for you.
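One way such a macro might look is sketched below. The name CopyPStr and its parameters Dest and Src are hypothetical. The sketch assumes that both operands are byte variables holding length-prefixed strings, that ds and es already address the proper segments, and that the destination has room for the source. It also clears the direction flag, which the in-line sequence above leaves to the caller:

```asm
; CopyPStr- Hypothetical macro that copies the length-prefixed
;           string Src to Dest. Destroys SI, DI, and CX.

CopyPStr        macro   Dest, Src
                lea     si, Src
                lea     di, Dest
                mov     cl, Src         ;Get string length.
                mov     ch, 0           ;Extend len to 16 bits.
                inc     cx              ;Include length byte.
                cld                     ;Copy in the forward direction.
        rep     movsb
                endm
                 .
                 .
                 .
                CopyPStr DestString, SourceString
```

Because the macro fetches the length byte at run-time, it works for string variables whose length changes, not just for constants whose length is known at assembly time.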

15.2.3 String Comparison

Comparing two character strings was already beaten to death in the section on the cmps instruction. Other than providing some concrete examples, there is no reason to consider this subject any further.

Note: all the following examples assume that es and ds are pointing at the proper segments containing the destination and source strings.

Comparing Str1 to Str2:

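                lea     si, Str1+1      ;Point at the first character of
                lea     di, Str2+1      ; each string (skip length bytes).

; Get the minimum length of the two strings.

                mov     al, Str1
                mov     cl, al
                cmp     al, Str2
                jb      CmpStrs
                mov     cl, Str2

; Compare the two strings. If the minimum length is zero, cmpsb
; does not execute and the flags still reflect the length
; comparison above.

CmpStrs:        mov     ch, 0
                cld
        repe    cmpsb
                jne     StrsNotEqual

; If CMPS thinks they're equal, compare their lengths
; just to be sure.

                cmp     al, Str2
StrsNotEqual: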

At label StrsNotEqual, the flags will contain all the pertinent information about the ranking of these two strings. You can use the conditional jump instructions to test the result of this comparison.
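As a sketch of what might follow the StrsNotEqual label, the code below branches on the result using the unsigned conditional jumps, since both character codes and length bytes are unsigned values. The target labels Str1IsGreater and Str1IsLess are hypothetical:

```asm
StrsNotEqual:
                ja      Str1IsGreater   ;Str1 > Str2
                jb      Str1IsLess      ;Str1 < Str2

; If neither jump is taken, the two strings are equal.
```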


[5] Reclaiming unused storage.

[6] At least those versions of Pascal which support strings.

28 SEP 1996