Tokenizer
A tokenizer converts an input string into tokens. It's very useful when you need to parse text.
This tokenizer is used by the BruteForce language written by El Muerte TDS; for more information, see his developer journal (Developer Journals/El Muerte TDS).
class Tokenizer extends Object;

const NEWLINE = 10;

var private array<string> buffer; // the input buffer
var private byte c;               // holds the current char
var private int linenr;           // current line in the input buffer
var private int pos;              // position on the current line

var int verbose; // for debugging

enum tokenType
{
  TT_None,
  TT_Literal,
  TT_Identifier,
  TT_Integer,
  TT_Float,
  TT_String,
  TT_Operator,
  TT_EOF,
};
This tokenizer recognizes 8 different token types:
- TT_None
- this token is never assigned, but used as a default
- TT_Literal
- the token is a literal, e.g. ( or ). Most tokenizers just use the ASCII value of the literal as the token, but because of limitations in UScript we use a separate token type instead
- TT_Identifier
- an identifier is a string which begins with a letter or underscore, followed by zero or more alphanumeric characters or underscores. Regular expression:
Identifier ::= [A-Za-z_][A-Za-z0-9_]*
- TT_Integer
- a natural number. Negative numbers are not supported because a leading minus sign cannot be told apart from the '-' operator, so you have to keep that in mind when you define your grammar. Regular expression:
Integer ::= [0-9]+
- TT_Float
- a number with a decimal point. Regular expression:
Float ::= [0-9]+\.[0-9]*
- TT_String
- a string of characters enclosed in double quotes; literal double quotes inside the string need to be escaped with a backslash. Regular expression:
String ::= "([^"\\]|\\.)*"
- TT_Operator
- an operator: =, ==, >, >=, ... Regular expression:
Operator ::= [-=+<>*/!]+
- TT_EOF
- the end of file
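var private tokenType curTokenType; // holds the current token
var private string curTokenString;  // holds the current string representation

/** Create a tokenizer */
function Create(array<string> buf)
{
  buffer.length = 0;
  buffer = buf;
  linenr = 0;
  pos = 0;
  c = 0;
  // reset the token state, otherwise a stale TT_EOF from a previous
  // buffer makes _nextToken() return immediately
  curTokenType = TT_None;
  curTokenString = "";
}
Call this to initialize the tokenizer with a new buffer; each array element is treated as one line of input.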
/** returns the string representation of the current token */
function string tokenString()
{
  return curTokenString;
}

/** returns the type of the current token */
function tokenType currentToken()
{
  return curTokenType;
}
We don't want anybody writing to our variables, so they are private and these functions provide read-only access to their values.
/** retrieves the next token */
function tokenType nextToken()
{
  return _nextToken();
}
Gets the next token in the buffer. This calls the private _nextToken() for the real processing.
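A minimal usage sketch, assuming the Tokenizer class compiles as listed on this page. The class name TokenizerDemo and the function Demo are hypothetical and not part of the original source. It is written as a subclass only so that enum names such as TT_EOF are in scope; from an unrelated class you would probably need dependson(Tokenizer) or a similar arrangement.

```unrealscript
// Hypothetical demo class, not part of the original tokenizer.
class TokenizerDemo extends Tokenizer;

/** Tokenize a small two-line buffer and log every token */
function Demo()
{
  local array<string> lines;

  // one array element per line of input
  lines[0] = "x = x - 1";
  lines[1] = "name = \"Bob\"";

  Create(lines);

  // nextToken() returns the type of the token it just read
  while (nextToken() != TT_EOF)
  {
    log(currentToken() @ tokenString(), 'TokenizerDemo');
  }
}
```

For this buffer the loop should see the identifier x, the operator =, the identifier x, the operator -, the integer 1, then the identifier name, the operator = and the string Bob (without the surrounding quotes) before nextToken() returns TT_EOF.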
/* Private functions */

private function tokenType _nextToken()
{
  local int tokenPos, endPos;

  skipBlanks();
  if (curTokenType == TT_EOF) return curTokenType;
  tokenPos = pos;
  // identifier: [A-Za-z_][A-Za-z0-9_]*
  if (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95))
  {
    pos++;
    c = _c();
    while (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95) || ((c >= 48) && (c <= 57)))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Identifier;
  }
  // number: [0-9]+(\.[0-9]*)?
  else if ((c >= 48) && (c <= 57))
  {
    pos++;
    c = _c();
    while ((c >= 48) && (c <= 57))
    {
      pos++;
      c = _c();
    }
    if (c == 46) // .
    {
      pos++;
      c = _c();
      while ((c >= 48) && (c <= 57))
      {
        pos++;
        c = _c();
      }
      endPos = pos;
      curTokenType = TT_Float;
    }
    else
    {
      endPos = pos;
      curTokenType = TT_Integer;
    }
  }
  // string: "..." with backslash escapes
  else if (c == 34)
  {
    pos++;
    c = _c();
    while (true)
    {
      if (c == 34) break;
      if (c == 92) // escape char, skip one char
      {
        pos++;
      }
      if (c == NEWLINE)
      {
        Warn("Unterminated string @"@linenr$","$pos);
        assert(false);
      }
      pos++;
      c = _c();
    }
    tokenPos++; // exclude the opening quote
    endPos = pos;
    pos++;      // step over the closing quote
    curTokenType = TT_String;
  }
  // operator: [-+*/=><!]+
  else if ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62))
  {
    pos++;
    c = _c();
    while ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Operator;
  }
  // literal: any other single character
  else
  {
    pos++;
    endPos = pos;
    curTokenType = TT_Literal;
  }
  // make up result
  if (linenr >= buffer.length) // EOF break
  {
    curTokenType = TT_EOF;
    curTokenString = "";
  }
  else
  {
    curTokenString = Mid(buffer[linenr], tokenPos, endPos-tokenPos);
  }
  if (verbose > 0) log(curTokenType@curTokenString, 'Tokenizer');
  return curTokenType;
}

/** Skip all characters with ascii value < 33 (32 is space) */
private function skipBlanks()
{
  c = _c();
  while (c < 33)
  {
    if (c == NEWLINE)
    {
      linenr++;
      pos = 0;
      if (linenr >= buffer.length) // EOF break
      {
        curTokenType = TT_EOF;
        curTokenString = "";
        return;
      }
    }
    else pos++;
    c = _c();
  }
}
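skipBlanks skips all characters considered whitespace: in this case all ASCII control characters and the space (every character with an ASCII value below 33). When it reaches the end of a line it moves on to the next line of the buffer, and it sets TT_EOF once the buffer is exhausted.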
/** returns the current char */
private function byte _c(optional int displacement)
{
  local string t;

  t = Mid(buffer[linenr], pos+displacement, 1);
  if (t == "") return NEWLINE; // empty string is a newline
  return Asc(t);
}
This function reads the current character. Because we can't simply increment a read pointer like you normally would, we have to extract the current character from the current line and convert it to its ASCII value for easier processing. Reading past the end of a line returns NEWLINE.
defaultproperties
{
  verbose=0
}
Issues
Escape characters in strings
Escaped characters inside strings are accepted by this tokenizer but not unescaped: the backslashes are left in the token string. For example:
"a string with \"double quotes\""
will be returned as:
a string with \"double quotes\"
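One way to deal with this is to unescape the token string after reading a TT_String token. The following is a minimal sketch under that assumption; the class TokenizerUtil and the function Unescape are hypothetical and not part of the original tokenizer. It simply drops each backslash and keeps the character that follows it, so it handles \" and \\ but does not translate sequences such as \n.

```unrealscript
// Hypothetical helper class, not part of the original tokenizer.
class TokenizerUtil extends Object;

/** Remove the escaping backslashes from a TT_String token string */
static function string Unescape(string s)
{
  local string result, ch;
  local int i;

  i = 0;
  while (i < Len(s))
  {
    ch = Mid(s, i, 1);
    if ((ch == "\\") && (i + 1 < Len(s)))
    {
      // drop the backslash and keep the following character as-is
      i++;
      ch = Mid(s, i, 1);
    }
    result = result $ ch;
    i++;
  }
  return result;
}
```

It could be called as class'TokenizerUtil'.static.Unescape(t.tokenString()), where t is your Tokenizer instance.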
Negative numbers
Negative numbers are not supported by this tokenizer: instead of a Number '-123' you will get an Operator '-' followed by a Number '123'. This is because the tokenizer cannot tell the difference between the operator '-' and a leading minus sign in the input. For example:
x = x - 1
and
x = -1
So when parsing your code, keep in mind that a number can be preceded by a '-' (a prefix operator).
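A parser built on top of the tokenizer could handle the leading minus itself. The sketch below is one possible approach, not the method used by BruteForce; the class NumberParser and the function ParseSignedNumber are hypothetical names. It is written as a subclass so that the enum names are in scope, and it assumes nextToken() has already been called once so that a current token exists.

```unrealscript
// Hypothetical parser fragment, not part of the original source.
class NumberParser extends Tokenizer;

/** Parse a number that may be preceded by a '-' operator */
function float ParseSignedNumber()
{
  local float sign, value;

  sign = 1.0;
  if ((currentToken() == TT_Operator) && (tokenString() == "-"))
  {
    sign = -1.0;
    nextToken(); // move on to the number itself
  }
  if ((currentToken() != TT_Integer) && (currentToken() != TT_Float))
  {
    Warn("Number expected, got:" @ tokenString());
    return 0.0;
  }
  value = float(tokenString());
  nextToken(); // consume the number
  return sign * value;
}
```

Note that the tokenizer reads operator characters greedily, so this only sees a lone '-' when it is separated from other operator characters: "x = -1" yields the operators '=' and '-', but "x=-1" yields the single operator '=-'.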