Scanner Specifications (Abbreviated)
The specifications for writing the scanner are outlined below.
Background
The background of the specifications is the theory of formal languages and automata that form the foundation of scanning.
Regular Expressions
Teams are given a general description of μPascal and an understanding of what tokens are. They are then required to identify all tokens in μPascal. Next, they are to write a regular expression for each token. Finally, they are given formal specifications for the tokens of μPascal as regular expressions for comparison with the regular expressions they derived. Thus, after making their own attempts, all teams begin coding of the scanner with a standard set of regular expressions for the tokens of μPascal.
Finite State Automata
Teams then review the theory and practice of finite state automata so that they can implement the regular expressions that stand of the tokens of μPascal as finite state automata.
Implementing the Scanner
The Dispatcher
A dispatcher (itself a finite state automaton) is to do the following:
precondition: the source file pointer is either pointing at the first character of the file (when the file is first opened) or is pointing to the first character in the file after the most recently scanned lexeme that matched a token.
- skip white space (where white space is a regular expression defined in the μPascal specifications)
- peek at the first non-white space character without consuming it
- call the finite state automaton that is responsible for scanning for tokens that start with that character
The Finite State Automata for Each Token
The specifications for each implemented FSA are the same:
- precondition:
- the file pointer is pointing to the first character of the current token being scanned
- postcondition:
- If a token was successfully matched:
- instance variable token has been set to the currently scanned token
- instance variable lexeme contains the lexeme associated with the scanned token
- instance variable lineNumber has been set to the source code line that contains the token scanned
- instance variable columnNumber that is the column number of the start of the current token scanned
- the file pointer is pointing to the first character that follows the lexeme associated with the currently scanned token
- If no token was matched:
- instance variable token is set to a special error token
- instance variable lexeme is undefined
- instance variable lineNumber has been set to the source code line on which the error was encountered
- instance variable columnNumber has been set to the column indicated by the file pointer when the FSA was called.
- the file pointer is incremented by one from its initial position when this FSA was called.
- some leeway was allowed in handling special errors, such as run-on-string and run-on-comment
- If a token was successfully matched:
In addition:
- each token FSA is to always return the token that is matched by the longest possible lexeme
- a standard approach for implementing each FSA must be chosen:
- EITHER: A loop that runs until a lexeme for the token is matched. On each pass through the loop
- in an outer switch statement, select the current state of the FSA
- the current state is represented by another switch statement that sets the next state based on the current input character.
- OR the FSA is implemented with goto statements such that
- each state is represented with a label
- a goto statement is associated with every state that transfers execution to the label of the state indicated by the current state and the current input symbol.
- EITHER: A loop that runs until a lexeme for the token is matched. On each pass through the loop