
Tokenizer

A tokenizer converts an input string into tokens. It is very useful when you need to parse text.

This tokenizer is used by the BruteForce language written by El Muerte TDS. For more information, see his developer journal at [Developer Journals/El Muerte TDS].

class Tokenizer extends Object;

const NEWLINE = 10;

var private array<string> buffer;   // the input buffer
var private byte c;                 // holds the current char
var private int linenr;             // current line in the input buffer
var private int pos;                // position on the current line

var int verbose;                    // for debugging

enum tokenType 
{
  TT_None,
  TT_Literal,
  TT_Identifier,  
  TT_Integer,
  TT_Float,
  TT_String,
  TT_Operator,
  TT_EOF,
};

This tokenizer defines 8 token types.

TT_None
this type is never assigned to a token; it only serves as the default value
TT_Literal
the token is a literal, e.g. ( or ). Most tokenizers simply use the ASCII value of the literal as the token type, but because of limitations in UScript this separate type is used instead
TT_Identifier
an identifier is a string that begins with a letter or an underscore, followed by zero or more alphanumeric characters or underscores. Regular expression: Identifier ::= [A-Za-z_][A-Za-z0-9_]*
TT_Integer
a natural number. Negative numbers are not supported because a leading minus sign would conflict with the '-' operator, so keep that in mind when you define your grammar. Regular expression: Integer ::= [0-9]+
TT_Float
a number with a decimal point. Regular expression: Float ::= [0-9]+\.[0-9]*
TT_String
a string of characters enclosed in double quotes; literal double quotes inside the string need to be escaped with a backslash. Regular expression: String ::= "([^"\\]|\\.)*"
TT_Operator
an operator: =, ==, >, >=, ... Regular expression: Operator ::= [-=+<>*/!]+
TT_EOF
the end of the input buffer
var private tokenType curTokenType; // holds the current token
var private string curTokenString;  // holds the current string representation

/**
  Create a tokenizer
*/
function Create(array<string> buf)
{
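  buffer.length = 0;
  buffer = buf;
  linenr = 0;
  pos = 0;
  c = 0;
  curTokenType = TT_None; // reset so a reused tokenizer does not stay at TT_EOF
  curTokenString = "";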
}

Call this to initialize the tokenizer with a new buffer.

/**
  returns the string representation of the current token
*/
function string tokenString()
{
  return curTokenString;
}

/**
  returns the type of the current token
*/
function tokenType currentToken()
{
  return curTokenType;
}

We don't want anybody writing to our variables, so they are private and these functions provide read-only access to their values.

/**
  retrieves the next token
*/
function tokenType nextToken()
{
  return _nextToken();
}

Gets the next token in the buffer. This calls the private _nextToken() for the real processing.
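
The public functions above can be combined into a simple read loop. The following is a minimal sketch, not part of the original class: the class name TokenizerDemo, the function DumpTokens, and the dependson(Tokenizer) declaration are assumptions made for illustration. Depending on the engine version, the enum values may also resolve without dependson.

```unrealscript
// Hypothetical demo class, not part of the original tokenizer.
class TokenizerDemo extends Object dependson(Tokenizer);

// Tokenizes two lines of input and writes every token to the log.
function DumpTokens()
{
  local Tokenizer t;
  local array<string> lines;

  // each array element is one line of input
  lines[0] = "x = x - 1";
  lines[1] = "y = 2.5";

  t = new class'Tokenizer';
  t.Create(lines);

  // nextToken() advances and returns the type of the new current token
  while (t.nextToken() != TT_EOF)
  {
    log(t.currentToken() @ t.tokenString(), 'TokenizerDemo');
  }
}
```

For this input the loop should see an identifier, an operator, an identifier, an operator and an integer for the first line, followed by an identifier, an operator and a float for the second line, before nextToken() returns TT_EOF.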

/* Private functions */

private function tokenType _nextToken()
{
  local int tokenPos, endPos;
  skipBlanks();
  if (curTokenType == TT_EOF) return curTokenType; 
  tokenPos = pos;
  // identifier: [A-Za-z_][A-Za-z0-9_]*
  if (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95))
  {
    pos++;
    c = _c();
    while (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95) || ((c >= 48) && (c <= 57)))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Identifier;
  }
  // number: [0-9]+(\.[0-9]*)?
  else if ((c >= 48) && (c <= 57))
  {
    pos++;
    c = _c();
    while ((c >= 48) && (c <= 57))
    {
      pos++;
      c = _c();
    }
    if (c == 46) // .
    {
      pos++;
      c = _c();
      while ((c >= 48) && (c <= 57))
      {
        pos++;
        c = _c();
      }
      endPos = pos;
      curTokenType = TT_Float;
    }
    else {
      endPos = pos;
      curTokenType = TT_Integer;
    }
  }
  // string: "([^"\\]|\\.)*" (a backslash escapes the next character)
  else if (c == 34)
  {
    pos++;
    c = _c();
    while (true)
    {
      if (c == 34) break;
      if (c == 92) // escape char skip one char
      {
        pos++;
      }
      if (c == NEWLINE)
      {
        Warn("Unterminated string @"@linenr$","$pos);
        assert(false);
      }
      pos++;
      c = _c();
    }
    tokenPos++;
    endPos = pos;
    pos++;
    curTokenType = TT_String;
  }
  // operator: [-+*/=<>!]+
  else if ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62))
  {
    pos++;
    c = _c();
    while ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Operator;
  }
  else {
    // literal: any other single character
    pos++;
    endPos = pos;
    curTokenType = TT_Literal;
  }
  // make up result
  if (linenr >= buffer.length) // EOF break
  {
    curTokenType = TT_EOF; 
    curTokenString = "";
  }
  else {
    curTokenString = Mid(buffer[linenr], tokenPos, endPos-tokenPos);
  }
  if (verbose > 0) log(curTokenType@curTokenString, 'Tokenizer');
  return curTokenType;
}

/**
  Skip all characters with ascii value < 33 (32 is space)
*/
private function skipBlanks()
{  
  c = _c();
  while (c < 33)
  {
    if (c == NEWLINE)
    {
      linenr++;
      pos = 0;
      if (linenr >= buffer.length) // EOF break
      {
        curTokenType = TT_EOF; 
        curTokenString = "";
        return;
      }
    }
    else pos++;
    c = _c();
  }
}

skipBlanks skips all characters considered whitespace, in this case all ASCII control characters and the space.

/**
  returns the current char
*/
private function byte _c(optional int displacement)
{
  local string t;
  t =  Mid(buffer[linenr], pos+displacement, 1);
  if (t == "") return NEWLINE; // empty string is a newline
  return Asc(t);
}

This function is used to read the current character. Because we can't simply advance a read pointer as you normally would, we need to extract the current character from the current line and convert it to its ASCII value for easier processing.

defaultproperties
{
  verbose=0
}

Issues

Escape characters in strings

Escaped characters are accepted by this tokenizer, but the escape sequences are not resolved: the backslashes remain in the token string.

"a string with \"double quotes\"" will be returned as:

a string with \"double quotes\"
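
If the unescaped text is needed, the caller can post-process the token string. The following is a minimal sketch under the assumption that a backslash simply makes the next character literal; the class name StringUtil and the function Unescape are hypothetical and not part of the tokenizer. It does not translate sequences such as \n into control characters.

```unrealscript
// Hypothetical helper class, not part of the original tokenizer.
class StringUtil extends Object;

// Removes each escaping backslash and keeps the character that follows it,
// so \" becomes " and \\ becomes \.
static function string Unescape(string s)
{
  local string result, ch;
  local int i;

  i = 0;
  while (i < Len(s))
  {
    ch = Mid(s, i, 1);
    // 92 is the ASCII value of the backslash
    if ((Asc(ch) == 92) && (i + 1 < Len(s)))
    {
      i++;               // skip the backslash
      ch = Mid(s, i, 1); // take the escaped character as-is
    }
    result = result $ ch;
    i++;
  }
  return result;
}
```

A parser could call class'StringUtil'.static.Unescape(t.tokenString()) whenever currentToken() returns TT_String. The surrounding double quotes are already excluded from the token string by the tokenizer.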

Negative numbers

Negative numbers are not supported by this tokenizer. Instead of a Number '-123' you will get an Operator '-' followed by a Number '123'. This is because the tokenizer cannot tell the difference between the operator '-' and a leading minus sign in the input. For example:

x = x - 1 and x = -1

So when parsing your code, keep in mind that a number can be preceded by a '-' (a prefix operator).
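
One way to handle this on the parser side is to check for a '-' operator before reading a number. The following is a minimal sketch; the class name NumberParser, the function ParseSignedNumber, and the dependson(Tokenizer) declaration are assumptions for illustration and not part of the BruteForce code.

```unrealscript
// Hypothetical parser fragment, not part of the original tokenizer.
class NumberParser extends Object dependson(Tokenizer);

// Reads an optionally negated number starting at the current token.
// Returns false if the current token does not start a number.
function bool ParseSignedNumber(Tokenizer t, out float value)
{
  local bool bNegative;

  // a lone '-' directly in front of a number is treated as a sign
  if ((t.currentToken() == TT_Operator) && (t.tokenString() == "-"))
  {
    bNegative = true;
    t.nextToken();
  }
  if ((t.currentToken() != TT_Integer) && (t.currentToken() != TT_Float))
  {
    return false;
  }
  value = float(t.tokenString());
  if (bNegative) value = -value;
  t.nextToken(); // move past the number
  return true;
}
```

Note that the tokenizer consumes operator characters greedily, so input such as x=-1 without a space produces the single Operator token '=-', which this helper would not recognize as a minus sign. A grammar that allows such input would need to split operator tokens itself.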
