Tokenizer

A tokenizer converts an input string into tokens. It is very useful when you need to parse text.

This tokenizer is used by the BruteForce language written by El Muerte TDS; for more information, see his developer journal ([Developer Journals/El Muerte TDS]).

class Tokenizer extends Object;

const NEWLINE = 10;

var private array<string> buffer;   // the input buffer
var private byte c;                 // holds the current char
var private int linenr;             // current line in the input buffer
var private int pos;                // position on the current line

var int verbose;                    // for debugging

enum tokenType 
{
  TT_None,
  TT_Literal,
  TT_Identifier,  
  TT_Integer,
  TT_Float,
  TT_String,
  TT_Operator,
  TT_EOF,
};

This tokenizer recognizes eight token types:

TT_None: this token is never assigned, but is used as the default value.
TT_Literal: the token is a literal, e.g. ( or ). Most tokenizers just use the ASCII value of the literal, but because of limitations in UScript we use a separate token type instead.
TT_Identifier: an identifier is a string that begins with a letter or underscore, followed by zero or more alphanumeric characters or underscores. Regular expression: Identifier ::= [A-Za-z_][A-Za-z0-9_]*
TT_Integer: a natural number. Negative numbers are not supported because a leading minus sign cannot be told apart from the '-' operator, so keep that in mind when you define your grammar. Regular expression: Integer ::= [0-9]+
TT_Float: a number with a decimal point. Regular expression: Float ::= [0-9]+\.[0-9]*
TT_String: a string of characters enclosed in double quotes. Literal double quotes need to be escaped with a backslash. Regular expression: String ::= "([^"\\]|\\.)*"
TT_Operator: an operator: =, ==, >, >=, ... Regular expression: Operator ::= [-=+<>*/!]+
TT_EOF: the end of file.
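
The token classes above can be tried out with a small sketch. It is written in Python rather than UnrealScript so that it can be run on its own, and it is not part of the Tokenizer class: the name classify and the pattern table are made up for this illustration. It assumes that identifiers may contain upper-case letters (as the code further down accepts them) and that a backslash escapes the next character inside a string. Unlike the real tokenizer, it keeps the surrounding quotes in the string lexeme.

```python
import re

# One pattern per token class, tried in the same order as the tokenizer's
# if/else chain. The float pattern comes before the integer pattern so that
# "3.14" is not cut off after the "3".
TOKEN_PATTERNS = [
    ("TT_Identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("TT_Float", re.compile(r"[0-9]+\.[0-9]*")),
    ("TT_Integer", re.compile(r"[0-9]+")),
    ("TT_String", re.compile(r'"(?:[^"\\]|\\.)*"')),
    ("TT_Operator", re.compile(r"[-=+<>*/!]+")),
]


def classify(text):
    """Return (token type, lexeme) for the token at the start of text."""
    if text == "":
        return ("TT_EOF", "")
    for name, pattern in TOKEN_PATTERNS:
        match = pattern.match(text)
        if match:
            return (name, match.group(0))
    # Anything else is a single-character literal such as ( or ).
    return ("TT_Literal", text[0])
```

For example, classify(">= 1") yields ("TT_Operator", ">=") and classify("(x)") yields ("TT_Literal", "(").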

var private tokenType curTokenType; // holds the current token
var private string curTokenString;  // holds the current string representation

/**
  Create a tokenizer
*/
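function Create(array<string> buf)
{
  buffer.length = 0;
  buffer = buf;
  linenr = 0;
  pos = 0;
  c = 0;
  // reset the token state, otherwise a previous TT_EOF would end the new buffer immediately
  curTokenType = TT_None;
  curTokenString = "";
}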

Call this to initialize the tokenizer with a new buffer.

/**
  returns the string representation of the current token
*/
function string tokenString()
{
  return curTokenString;
}

/**
  returns the type of the current token
*/
function tokenType currentToken()
{
  return curTokenType;
}

We don't want anybody writing to our variables, so we provide accessor functions to read their values.

/**
  retrieves the next token
*/
function tokenType nextToken()
{
  return _nextToken();
}

Gets the next token in the buffer. This calls the private _nextToken(), which does the real processing.

/* Private functions */

private function tokenType _nextToken()
{
  local int tokenPos, endPos;
  skipBlanks();
  if (curTokenType == TT_EOF) return curTokenType; 
  tokenPos = pos;
  // identifier: [A-Za-z_][A-Za-z0-9_]*
  if (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95))
  {
    pos++;
    c = _c();
    while (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95) || ((c >= 48) && (c <= 57)))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Identifier;
  }
  // number: [0-9]+(\.[0-9]*)?
  else if ((c >= 48) && (c <= 57))
  {
    pos++;
    c = _c();
    while ((c >= 48) && (c <= 57))
    {
      pos++;
      c = _c();
    }
    if (c == 46) // .
    {
      pos++;
      c = _c();
      while ((c >= 48) && (c <= 57))
      {
        pos++;
        c = _c();
      }
      endPos = pos;
      curTokenType = TT_Float;
    }
    else {
      endPos = pos;
      curTokenType = TT_Integer;
    }
  }
  // string: "([^"\\]|\\.)*" (the surrounding quotes are not part of the token string)
  else if (c == 34)
  {
    pos++;
    c = _c();
    while (true)
    {
      if (c == 34) break;
      if (c == 92) // escape char skip one char
      {
        pos++;
      }
      if (c == NEWLINE)
      {
        Warn("Unterminated string @"@linenr$","$pos);
        assert(false);
      }
      pos++;
      c = _c();
    }
    tokenPos++;
    endPos = pos;
    pos++;
    curTokenType = TT_String;
  }
  // operator: [-+*/=<>!]+
  else if ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62))
  {
    pos++;
    c = _c();
    while ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Operator;
  }
  else { // literal: any other single character
    pos++;
    endPos = pos;
    curTokenType = TT_Literal;
  }
  // make up result
  if (linenr >= buffer.length) // EOF break
  {
    curTokenType = TT_EOF; 
    curTokenString = "";
  }
  else {
    curTokenString = Mid(buffer[linenr], tokenPos, endPos-tokenPos);
  }
  if (verbose > 0) log(curTokenType@curTokenString, 'Tokenizer');
  return curTokenType;
}

/**
  Skip all characters with ascii value < 33 (32 is space)
*/
private function skipBlanks()
{  
  c = _c();
  while (c < 33)
  {
    if (c == NEWLINE)
    {
      linenr++;
      pos = 0;
      if (linenr >= buffer.length) // EOF break
      {
        curTokenType = TT_EOF; 
        curTokenString = "";
        return;
      }
    }
    else pos++;
    c = _c();
  }
}

skipBlanks skips all characters considered whitespace, in this case all ASCII control characters and the space.

/**
  returns the current char
*/
private function byte _c(optional int displacement)
{
  local string t;
  t =  Mid(buffer[linenr], pos+displacement, 1);
  if (t == "") return NEWLINE; // empty string is a newline
  return Asc(t);
}

This function reads the current character. Because we can't simply advance a read pointer as you normally would, we extract the current character from the current line and convert it to its ASCII value for easier processing. Reading past the end of a line returns NEWLINE.
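
The same lookup can be sketched in Python to make the end-of-line behaviour explicit. This is an illustrative port, not the class code: current_char is a made-up name, positions are assumed to be non-negative, and the explicit check on linenr is an addition, since Python raises an error when indexing past the end of a list.

```python
NEWLINE = 10  # same value as the NEWLINE constant in the class


def current_char(buffer, linenr, pos, displacement=0):
    """Return the character code at (linenr, pos + displacement).

    Reading past the end of a line, or past the last line, yields NEWLINE.
    """
    if linenr >= len(buffer):
        return NEWLINE
    start = pos + displacement
    chunk = buffer[linenr][start:start + 1]
    if chunk == "":
        return NEWLINE  # an empty result marks the end of the line
    return ord(chunk)
```

With buffer ["ab"], positions 0 and 1 give 97 and 98, and position 2 gives NEWLINE.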

defaultproperties
{
  verbose=0
}

Issues

Escape characters in strings

Escaped characters are accepted by this tokenizer but not unescaped: the backslashes are left in the token string.

"a string with \"double quotes\"" will be returned as:

a string with \"double quotes\"
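
One way to deal with this is to unescape the token string after the tokenizer returns it. The following Python sketch shows the idea; unescape is a hypothetical helper that is not part of the Tokenizer class, and it assumes that a backslash simply stands for the character that follows it (no special sequences such as \n).

```python
def unescape(token_string):
    """Drop each backslash and keep the character that follows it."""
    result = []
    i = 0
    while i < len(token_string):
        ch = token_string[i]
        if ch == "\\" and i + 1 < len(token_string):
            i += 1  # skip the backslash, keep the escaped character
            ch = token_string[i]
        result.append(ch)
        i += 1
    return "".join(result)
```

Applied to the token string above, this gives: a string with "double quotes"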

Negative numbers

Negative numbers are not supported by this tokenizer. Instead of an Integer '-123' you will get an Operator '-' followed by an Integer '123'. This is because the tokenizer cannot tell the difference between the operator '-' and a leading minus sign in the input. For example:

x = x - 1 and x = -1

So when parsing your code, keep in mind that a number can be preceded by a '-' (a prefix operator).
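
A parser could handle this by merging a '-' into the following number whenever the '-' cannot be a binary operator. The Python sketch below is only one possible approach, not something the tokenizer or BruteForce is known to do. The function name and the (type, string) pair representation are assumptions made for this illustration, as is the rule that a '-' is a sign when it comes first or directly follows another operator or an opening parenthesis.

```python
def fold_unary_minus(tokens):
    """Merge a leading '-' operator into the number that follows it.

    tokens is a list of (token type, string) pairs. A '-' counts as a sign
    when it is the first token or directly follows an operator or "(".
    """
    result = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        is_minus = kind == "TT_Operator" and text == "-"
        has_number = (
            i + 1 < len(tokens)
            and tokens[i + 1][0] in ("TT_Integer", "TT_Float")
        )
        # True when the previous token is an operand, so '-' is binary.
        after_operand = bool(result) and not (
            result[-1][0] == "TT_Operator" or result[-1] == ("TT_Literal", "(")
        )
        if is_minus and has_number and not after_operand:
            next_kind, next_text = tokens[i + 1]
            result.append((next_kind, "-" + next_text))
            i += 2
        else:
            result.append((kind, text))
            i += 1
    return result
```

With this rule the tokens of x = x - 1 are left unchanged, while in x = -1 the last two tokens become a single Integer '-1'. Note that the tokenizer groups consecutive operator characters, so x=-1 written without a space produces the single operator '=-' and is not covered by this sketch.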
