Objects First


The C Pre-processor

We've now met and used a number of the C pre-processor's directives, so it's time to look at all of them.

History

Early C compilers were actually built from a number of separate programs:

Stages in compiling x.c
(Filenames are those used by Unix compilers)
Module          Purpose
Pre-processor   Pre-process text for input to the compiler
Compiler        Compile pre-processed text into intermediate code
Code generator  Generate assembler code for a target machine
Assembler       Compile assembler code into object modules
Linker          Link object modules into executable code

Many variations of this scheme are possible and will be found in existing compilers.

In fact, a lot of C's early popularity may have derived from this structure, as the first two programs were written in C and were common to all machines. Generating a compiler for a new architecture was simply a matter of writing a new code generator, assembler and linker. This sped up the process considerably and meant that C compilers quickly became available for new architectures.

Unix compilers (as well as some other commonly used compilers, eg the Free Software Foundation's GNU compiler, gcc) still use separate programs for the various phases. However, other modern GUI-based development systems bundle the various compiler modules differently - or into one large monolithic program - thereby gaining some small speed increases and reducing programmers' caffeine intakes by not providing as many excuses for coffee breaks.

Whatever their actual structure, all C compilers provide the first two phases in a logical sense, ie the result of your compilation is as if your program has been passed first through the pre-processor and then through the compiler itself.

Pre-processor actions

The pre-processor simply processes text; it knows virtually nothing about C syntax. It scans the input program text, looking for directives: lines on which the first non-blank character is a '#'. We've already encountered #include, #define and #ifdef. When it meets one of these directives, it takes the appropriate action: including another file to be processed, defining a symbol, etc. It acts entirely on the program text and will happily pre-process text which may be complete gibberish to the C compiler. The Unix calendar program makes use of the fact that the C pre-processor is a separate program and uses the pre-processor alone to massage personal calendar files for use by the rest of the program. In a normal compilation, the pre-processed text is then passed on to the C compiler itself, which checks the C syntax and, if it likes it, consumes it, translating it into some intermediate code which is output, ready for input to the next phase.

Directives

#include

A #include directive reads another file into the program at the point at which it is placed. It effectively merges two input files into a single output file for the compiler.

There are two variants of #include:

#include <stdio.h>

#include "abc.h"
The first, in which the file name appears in angle brackets, < >, is generally used for "standard" include files, eg specification files for ANSI library functions, such as stdio.h, stdlib.h, time.h, etc. The pre-processor looks for the file in a standard list of directories. This list is set for your compiler and may vary from system to system: on Unix systems, it will usually include /usr/include. On other systems, the list of directories to be searched will usually be set when the compiler is installed. However, it's generally possible to edit it, so that, for example, you can add your own libraries of functions or classes to those required by the ANSI standard. A little bit of searching through the various option menus will usually uncover the list for your system.

In the second form, in which the filename appears in quotes, " ", the directory containing the current source file is searched first; if the file is not found there, compilers generally fall back to the standard list of directories. You would use this form for including specification files for classes that you have written for this program. The file name is usually interpreted according to the rules of the host operating system, so that

#include "/u/userx/classZ.h"
will be acceptable to a Unix operating system and
#include "C:classZ.h"
will be acceptable to lesser systems.

#define

#define allows us to define symbolic constants and also macros. When using #define, remember that the pre-processor simply processes text: it has little knowledge of C syntax. Thus when it encounters #define it enters a new symbol into its substitution table along with a text string to replace the symbol with in the body of the program. Examples of #define directives:
#define EPSILON 1.0e-5
#define MAX_COUNT 500
#define PRINT_LINE printf("----------");
#define LONG_SUBSTITUTION if(a>b){ printf("a>b"); } \
                          else { printf("a<=b"); }
Note the use of the back-slash (\) to extend the text string that will be substituted over multiple lines. As the examples show, #define can be used to define arbitrary program fragments in addition to its normal use for defining constants.
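As a minimal sketch of how such symbolic constants might be used (the function names nearly_equal and clamp_count are my own, purely for illustration), note that the pre-processor replaces each symbol with its text before the compiler ever sees the code:

```c
#define EPSILON   1.0e-5
#define MAX_COUNT 500

/* Hypothetical helper: returns 1 if a and b differ by less than EPSILON. */
int nearly_equal(double a, double b)
{
    double d = a - b;
    if (d < 0.0)
        d = -d;             /* absolute value of the difference */
    /* After pre-processing, the next line reads: return d < 1.0e-5; */
    return d < EPSILON;
}

/* Hypothetical helper: limits n to at most MAX_COUNT. */
int clamp_count(int n)
{
    /* After pre-processing: return (n > 500) ? 500 : n; */
    return (n > MAX_COUNT) ? MAX_COUNT : n;
}
```

With these definitions, clamp_count(750) yields 500, and changing the limit later means editing one #define rather than hunting for every 500 in the program.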

Macros with #define

#define can also create macros with arguments, for example:
#define PRINT(x)      printf(#x " = %d",x)
#define SQR(x)        (x*x)
#define CUBE(x)       (x*x*x)
#define PER_CENT(x,y) (x*100.0/y)
The pre-processor takes the actual arguments to the macro and substitutes them for the formals as it expands the macro. In PRINT, the # operator turns the argument into a string literal; it is only recognised outside quotes, and the compiler then joins the adjacent string literals into one. Thus:

Macro call           becomes
SQR(z)               (z*z)
PRINT(max)           printf("max" " = %d",max)
PER_CENT(p,q)        (p*100.0/q)
PER_CENT(a+b,c+d)    (a+b*100.0/c+d)
SQR(a/b)             (a/b*a/b)
The last two examples demonstrate a common trap in the definition of C macros. Since the pre-processor knows nothing about expression syntax, it just blindly performs string substitutions: in the PER_CENT case, replacing all x's with a+b and y's with c+d. Because of the precedence rules (which we don't want to bother to remember anyway!), the resulting expanded expression will almost certainly not produce the intended result! (And since I don't remember the precedence rules, I don't know whether the last expression is correct or not, but as I'll show, there is no need for me to remember this detail anyway!)

In order to obtain the desired results reliably, macro arguments which form part of expressions should be enclosed in parentheses.

#define PRINT(x)      printf(#x " = %d",(x))
#define SQR(x)        ((x)*(x))
#define CUBE(x)       ((x)*(x)*(x))
#define PER_CENT(x,y) ((x)*100.0/(y))
Now the expansions produce:

Macro call           becomes
SQR(z)               ((z)*(z))
PRINT(max)           printf("max" " = %d",(max))
PER_CENT(p,q)        ((p)*100.0/(q))
PER_CENT(a+b,c+d)    ((a+b)*100.0/(c+d))
SQR(a/b)             ((a/b)*(a/b))
and we obtain the intended result - whatever the precedence rules might be!
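The difference can be checked with a small experiment. This is only a sketch: the _BAD macro names and the wrapper functions are hypothetical, introduced here so that both versions of each macro can live in one file.

```c
/* Unparenthesised definitions: arguments are pasted in as raw text. */
#define SQR_BAD(x)        (x*x)
#define PER_CENT_BAD(x,y) (x*100.0/y)

/* Parenthesised definitions. */
#define SQR(x)            ((x)*(x))
#define PER_CENT(x,y)     ((x)*100.0/(y))

/* Expands to (a+b*a+b), which the compiler reads as a + (b*a) + b. */
int sqr_sum_bad(int a, int b)  { return SQR_BAD(a+b); }

/* Expands to ((a+b)*(a+b)). */
int sqr_sum(int a, int b)      { return SQR(a+b); }

/* Expands to (a/b*a/b), evaluated left to right as ((a/b)*a)/b. */
int sqr_quot_bad(int a, int b) { return SQR_BAD(a/b); }

/* Expands to ((a/b)*(a/b)). */
int sqr_quot(int a, int b)     { return SQR(a/b); }

/* Expands to (a+b*100.0/c+d), ie a + (b*100.0/c) + d. */
double per_cent_bad(double a, double b, double c, double d)
{
    return PER_CENT_BAD(a+b, c+d);
}

/* Expands to ((a+b)*100.0/(c+d)). */
double per_cent(double a, double b, double c, double d)
{
    return PER_CENT(a+b, c+d);
}
```

For example, sqr_sum_bad(1, 2) gives 5 where sqr_sum(1, 2) gives 9, and with integer division sqr_quot_bad(5, 2) gives 5 where sqr_quot(5, 2) gives 4. So the a/b case does go wrong for integers, even though the two forms agree mathematically.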

Symbols within macros are also expanded by the pre-processor. So if I want to assign symbolic names to a set of sequential integers, eg the names for states of a state machine, then I can write:

#define RESET     0
#define IDLE      (RESET+1)
#define MEM_WAIT  (IDLE+1)
#define MEM_READ  (MEM_WAIT+1)
....
The expansion produced by the pre-processor for MEM_READ is
(((0+1)+1)+1)
which gives us the desired value of 3. The reason for doing this, rather than explicitly assigning 0, 1, 2, .. to the various states, is that it makes maintenance of the program easier. To insert a new state into the middle of the sequence, the new line is simply added and the one following it changed. If I had explicitly numbered them, I would have had to change all the following lines! Of course, I could also have used an enum and saved myself all the bother, but, unfortunately, there are occasions when the actual values need to be defined in some sequence!
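To illustrate the maintenance argument, here is a sketch in which a new state has been inserted into the sequence. The state BUS_REQ and the function state_name are hypothetical, invented for this example only.

```c
#define RESET     0
#define IDLE      (RESET+1)
#define BUS_REQ   (IDLE+1)      /* newly inserted (hypothetical) state */
#define MEM_WAIT  (BUS_REQ+1)   /* the only existing line that changed */
#define MEM_READ  (MEM_WAIT+1)  /* untouched, but its value is now 4   */

/* Returns a printable name for a state value, or "UNKNOWN". */
const char *state_name(int state)
{
    switch (state) {
    case RESET:    return "RESET";
    case IDLE:     return "IDLE";
    case BUS_REQ:  return "BUS_REQ";
    case MEM_WAIT: return "MEM_WAIT";
    case MEM_READ: return "MEM_READ";
    default:       return "UNKNOWN";
    }
}
```

Each macro expands to a constant expression such as ((((0+1)+1)+1)+1), so the symbols can be used as case labels just as literal integers could.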

#undef

Sometimes, one wants to re-define a symbol that is already defined. To do this, it is necessary to undefine it with #undef first:
#undef SYMBOL_X
#define SYMBOL_X  NEW_VALUE_FOR_X
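If the #undef is omitted and the new definition differs from the old one, a C compiler will emit a diagnostic (an error or a warning, depending on the compiler) like
"Attempt to re-define macro SYMBOL_X"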

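A small sketch of re-definition in action follows; BUFFER_SIZE and the two function names are hypothetical. The point it tries to show is that each use of a symbol is replaced by whatever definition is in force at that point in the text.

```c
#define BUFFER_SIZE 64

/* Compiled while BUFFER_SIZE is 64. */
int small_buffer_size(void)
{
    return BUFFER_SIZE;
}

#undef  BUFFER_SIZE
#define BUFFER_SIZE 256

/* Compiled after the re-definition, so BUFFER_SIZE is now 256. */
int large_buffer_size(void)
{
    return BUFFER_SIZE;
}
```

Here small_buffer_size() returns 64 and large_buffer_size() returns 256, even though both bodies contain exactly the same text.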
#ifdef, #ifndef, #else, #endif

One can use symbols to control the conditional compilation of sections of a program. A common use of this is to allow code needed for debugging only to be included or omitted from the program using compiler directives:
#ifdef TRACE_MODE
  printf("x = %d, y = %d, z = %d, q = %d\n", x, y, z, q );
#endif
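In context, the tracing code might look like the following sketch. The functions add_traced and trace_enabled are hypothetical. Here TRACE_MODE is defined in the source itself; Unix compilers generally also let you define it on the command line with an option such as -DTRACE_MODE, so that the source need not be edited.

```c
#include <stdio.h>

#define TRACE_MODE   /* remove this line to omit the tracing code */

/* Adds two numbers, reporting the operands when tracing is compiled in. */
int add_traced(int x, int y)
{
    int sum = x + y;
#ifdef TRACE_MODE
    fprintf(stderr, "x = %d, y = %d, sum = %d\n", x, y, sum);
#endif
    return sum;
}

/* Reports whether this file was compiled with tracing enabled. */
int trace_enabled(void)
{
#ifdef TRACE_MODE
    return 1;
#else
    return 0;
#endif
}
```

When TRACE_MODE is not defined, the fprintf line never reaches the compiler at all, so the tracing costs nothing in the final program.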
Another common use is to prepare programs for a number of target architectures:
#ifdef ALPHA
typedef long int64;
typedef int int32;
typedef short int16;
#else
#ifdef SUN
typedef long long int64;
typedef int int32;
typedef short int int16;
#else
typedef struct {int msw, lsw; } int64;
typedef int int32;
typedef short int int16;
#endif  /* SUN */
#endif  /* ALPHA */
which shows a possible way of writing portable code to deal with the differing interpretations of int on a variety of machines. When the word length is critical, int64, int32 or int16 are used rather than long, int or short.

Note that #ifdef, #ifndef and #else can be nested if needed. Don't forget to include the right number of terminating #endif's - one for every #ifdef or #ifndef.

#ifndef X can be read "if X is not defined .." and can be followed by #else if needed.
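One common use of #ifndef is to supply a default value for a symbol that may or may not have been defined earlier. The sketch below uses hypothetical names (TABLE_SIZE, TABLE_SIZE_IS_DEFAULT and the two functions) and also shows the #else branch.

```c
#ifndef TABLE_SIZE
#define TABLE_SIZE 128            /* default: nobody defined it earlier */
#define TABLE_SIZE_IS_DEFAULT 1
#else
#define TABLE_SIZE_IS_DEFAULT 0   /* an earlier definition is in force  */
#endif

int table[TABLE_SIZE];

/* Number of elements in the table. */
int table_size(void)
{
    return (int)(sizeof table / sizeof table[0]);
}

/* 1 if the default size was used, 0 if TABLE_SIZE was already defined. */
int table_size_is_default(void)
{
    return TABLE_SIZE_IS_DEFAULT;
}
```

Compiled as it stands, table_size() returns 128. If TABLE_SIZE had been defined before this point, the #ifndef branch would be skipped and the earlier value used instead.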

#if

#if is rather similar to #ifdef except that it can be followed by a simple expression (in terms of symbols already defined) which the pre-processor evaluates. The standard C rule is followed: if the expression evaluates to 0, it is considered false, otherwise it is treated as true.
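A brief sketch of #if follows, assuming a hypothetical symbol DEBUG_LEVEL and hypothetical function names. Each expression is evaluated by the pre-processor, not at run time.

```c
#define DEBUG_LEVEL 2

/* 2 > 1 is true, so the first branch is compiled. */
int verbose_enabled(void)
{
#if DEBUG_LEVEL > 1
    return 1;
#else
    return 0;
#endif
}

/* A bare non-zero value counts as true. */
int checks_enabled(void)
{
#if DEBUG_LEVEL
    return 1;
#else
    return 0;
#endif
}

/* 2 - 2 evaluates to 0, which counts as false. */
int zero_is_false(void)
{
#if DEBUG_LEVEL - 2
    return 1;
#else
    return 0;
#endif
}
```

With DEBUG_LEVEL set to 2, the first two functions return 1 and the third returns 0. Only the selected branch of each #if is ever seen by the compiler.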

Key terms

Unix
An operating system developed at Bell Laboratories in the 1970's. One of the first operating systems to support multiple users on interactive terminals, it has grown out of its early reputation as an academic toy (because most of its early users were in universities) and is now offered on most large computer systems.
assembler code
Assembler code (also called assembly language) is a text representation of the instructions actually executed by a particular computer. It is the lowest level at which it is practical to program any computer. It used to be popular to use assembler code to gain the maximum performance from a particular computer, but modern optimising compilers are so good that this is now generally unnecessary (or at least unprofitable - in the sense that the benefit is small compared to the effort required). Assembler code is not portable from one machine architecture to another (eg an Intel Pentium to a Motorola PowerPC). An assembler translates assembler code into binary codes which the machine can execute directly (usually after some address adjustments performed by a linker).
gcc
gcc, the GNU C compiler, is a portable C compiler built by the Free Software Foundation which is available on many machines. It is free software, freely available under the GNU General Public License.

© John Morris, 1998