Normalization of the Token Stream

When a program is scanned, the source character stream is broken up into tokens. Many tokens differ in form yet have the same meaning. For example, the three tokens 125, 000125, and +125 all denote the same number, 125.
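As a rough illustration of this point, the following Python sketch reduces the three numeric literals above to a common normal form. The function name `normalize_numeric_literal` is hypothetical, and the sketch assumes that a numeric literal is an optionally signed run of decimal digits; the actual lexical rules of the language may differ.

```python
def normalize_numeric_literal(text: str) -> int:
    """Reduce a numeric literal to its normal form (a plain integer).

    Assumption: the literal is an optionally signed sequence of
    decimal digits, so Python's int() can serve as the normalizer.
    """
    return int(text, 10)


# The three literals from the text differ in form...
literals = ["125", "000125", "+125"]

# ...but they all normalize to the same value.
values = [normalize_numeric_literal(t) for t in literals]
all_equal = len(set(values)) == 1
```

After normalization, comparing two numbers is a plain equality test, which would not be true of the literals themselves.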

This is why the description of the lexical structure of programs uses terms such as "numeric literal" and "word literal" rather than "number" and "word".

Moreover, a token such as a "character string literal" represents a sequence of characters rather than a single syntax entity.

Thus, when describing the syntax, we assume that the token stream has been "normalized": each token has been reduced to its normal form, so that different tokens always represent different entities.

In addition, we assume that each character string literal has been broken up into a sequence of separate tokens, each representing a single character.

The above enables us to describe the syntax in terms of syntax "entities" rather than "representations of syntax entities".

Here is the correspondence between the source tokens and the normalized tokens:

    CharacterStringLiteral  ==>  Character1 Character2 ... CharacterN
    WordLiteral             ==>  Word
    NumericLiteral          ==>  Number
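The correspondence can be sketched in Python as a small normalization pass over the token stream. Everything here is an assumption made for illustration: the `(kind, text)` representation of source tokens, the classes `Character`, `Word`, and `Number`, and the function `normalize` are hypothetical names. The sketch also assumes that a character string literal is delimited by single quotes and contains no escape sequences, that a numeric literal is an optionally signed decimal integer, and that the text of a word literal is already its normal form.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class Character:
    value: str


@dataclass(frozen=True)
class Word:
    value: str


@dataclass(frozen=True)
class Number:
    value: int


Normalized = Union[Character, Word, Number]


def normalize(tokens: Iterable[Tuple[str, str]]) -> Iterator[Normalized]:
    """Turn (kind, text) source tokens into normalized tokens."""
    for kind, text in tokens:
        if kind == "CharacterStringLiteral":
            # Assumption: the literal is wrapped in single quotes and
            # has no escapes; each enclosed character becomes a token.
            for ch in text[1:-1]:
                yield Character(ch)
        elif kind == "WordLiteral":
            # Assumption: the word's text is already its normal form.
            yield Word(text)
        elif kind == "NumericLiteral":
            # Assumption: optionally signed decimal digits.
            yield Number(int(text, 10))
        else:
            raise ValueError(f"unknown token kind: {kind}")


source = [
    ("CharacterStringLiteral", "'AB'"),
    ("WordLiteral", "Foo"),
    ("NumericLiteral", "000125"),
]
result = list(normalize(source))
```

Under these assumptions, one character string literal expands into as many `Character` tokens as it has characters, while each word or numeric literal yields exactly one token.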

The character symbols obtained by scanning a program should not be confused with the characters appearing in the source text of the program. For example, scanning the three source characters 'A' produces a single character symbol.