-----------------------------------
wtd
Tue Nov 16, 2004 9:53 pm

[Regex-tut] Finding Variables in Turing
-----------------------------------
Disclaimer

The code shown here is not a fool-proof way of parsing Turing code.  It can be fooled by code that does tricky things like sticking comment characters in strings.

Where do we start?

We start by reading in a Turing source file.  First we have to open the file for reading.

turing_file = File.open("test.t", File::RDONLY)

Then, we have to read in the entire file as a big string.

contents = turing_file.read

And since we've got the string, now, we can close the file.

turing_file.close

Eliminating comments

The next step is to eliminate any comments in the code, as we don't want to find variable declarations in comments.

To get rid of text that matches a given pattern, we'll use the gsub method of our "contents" string.  This method takes a regular expression, and a replacement string.  Anywhere it finds a match, it replaces that text with the replacement string.

In turing, a line comment is anything that follows a %, so a regular expression to match a line comments would looking like:

/ % .* $ /x

Thus, the gsub call to eliminate such comments looks like:

without_line_comments = contents.gsub(/ % .* $ /x, "")

Eliminating multi-line comments

Multi-line comments, like so:

/*
   hello
   world
*/

Present a problem.  Regular expressions, as seen so far, only match a single line.  The . character matches everything except newlines, so a single regular expression can't span multiple lines.

We've already seen the x modifier, and how it allows use of spaces inside regular expressions.  The m modifier allows . to match newlines, thus giving regular expressions the ability to span lines.

A regular expression to find a /* */ style comment on one line:

/ \/\* .*? \/\* /x

And now, to apply this to multiple lines:

/ \/\* .*? \/\* /mx

So eliminating multi-line comments looks like:

without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "")

Note: the backslashes were necessary because / and * have meaning in a regular expression on their own.

Matching a variable or constant declaration

For simplicity, we'll only be matching variables or constants declared as one of the built-in types:

intrealbooleancharstring

Variable names in Turing begin with either a lower or uppercase letter.  Following that you can have lower or uppercase letters, numbers, or underscores.  As a result, the regular expression to match one looks like:

[a-zA-Z][a-zA-Z0-9_]*

We can make this a little more concise by using the \d shortcut, which is equivalent to:

0-9

[a-zA-Z][a-zA-Z\d_]*

Of course, these names are preceded by either "var" or "const" and at least one space, so...

(var|const)\s+[a-zA-Z][a-zA-Z\d_]*

And just as we want to capture "var" or "const", we want to capture the name:
 
(var|const)\s+([a-zA-Z][a-zA-Z\d_]*)

Now, all of this has to be followed by a colon and then the type of the variable.  The colon may have whitespace around it, but it may not.  And of course, want to capture the type.

(var|const)\s+([a-zA-Z][a-zA-Z\d_]*)\s*:\s*(int|real|boolean|char|string)

Variable declarations come one their own lines, so we should modify this slightly:

^\s*(var|const)\s+([a-zA-Z][a-zA-Z\d_]*)\s*:\s*(int|real|boolean|char|string)\s*$

The problem

The problem with this regex is that it only matches variable declarations where one variable is being declared.  We need a way to match something llike:

var a, b, c, d : int

So, we need to look for the pattern.  There's at least one variable being declared.  Let's say that one variable in this case is "a".  Remove "a" from the list and you're left with:

, b, c, d

Clearly each possible variable is preceded by a comma.  To match this pattern, we can use:

(?:\s*,\s*[a-zA-Z][a-zA-Z\d_]*)*

Or, broken down:

(?:             # the start of the group
   \s*          # zero or more saces
   ,            # a comma
   \s*          # zero or more saces
   [a-zA-Z]     # match the first character in the variable name
   [a-zA-Z\d_]* # any following characters in the variable name
)*              # match this whole group zero or more times

So, we can reincorporate this into the bigger regex:

^\s*(var|const)\s+([a-zA-Z][a-zA-Z\d_]*(?:\s*,\s*[a-zA-Z][a-zA-Z\d_]*)*)\s*:\s*(int|real|boolean|char|string)\s*$

Refactoring

So, how do we make this more readable?

Well, in another thread on Ruby I demonstrated that strings can be "interpolated" into other strings.  If I have a "name" variable, I can interpolate it into a greeting string.

name = "Bob"
greeting = "Hello, #{name}"

It shouldn't be terribly surprising then, that the same can work with regular expressions.

greeting_word = /[hH]ello/
pattern = /#{greeting_word}, Bob/

So, we can factor out repeated bits of the regular expression and give them names.

variable = /[a-zA-Z][a-zA-Z\d_]*/
^\s*(var|const)\s+(#{variable}(?:\s*,\s*#{variable})*)\s*:\s*(int|real|boolean|char|string)\s*$

Factoring further:

variable = /[a-zA-Z][a-zA-Z\d_]*/
var_list = /#{variable}(?:\s*,\s*#{variable})*/
^\s*(var|const)\s+(#{var_list})\s*:\s*(int|real|boolean|char|string)\s*$

And a bit further:

variable  = /[a-zA-Z][a-zA-Z\d_]*/
var_list  = /#{variable}(?:\s*,\s*#{variable})*/
types     = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$

This last change is particularly profound.  With it, the list of types to match can grow just by adding the type name to the array, rather than editing the regular expression directly.

Testing it

So, let's put this up against a sample variable declaration.

variable  = /[a-zA-Z][a-zA-Z\d_]*/
var_list  = /#{variable}(?:\s*,\s*#{variable})*/
types     = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
"var a, b, c : int" =~ /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$

$1 now contains:

"var"

$2 contains:

"a, b, c"

And $3 contains:

"int"

Taking the variable name list apart

So now we have a variable list, "a, b, c", but it's all one big string.  Fortunately, we can use a regex and a method of that string to get each name.

name_list = $2.dup
names = name_list.split(/\s*,\s*/)

Scanning the whole source

Ruby strings have a method called "scan", which takes a reglar expression and return an array containing all of the matches.  We can use this, and the regex we've constructed for finding variable declarations to find all variable declarations in the source.

The code

So, let's recap the code thus far:

turing_file = File.open("test.t", File::RDONLY)
contents = turing_file.read
turing_file.close

without_line_comments = contents.gsub(/ % .* $ /x, "")
without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "")

variable  = /[a-zA-Z][a-zA-Z\d_]*/
var_list  = /#{variable}(?:\s*,\s*#{variable})*/
types     = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
declaration = /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$/

all_declarations = without_comments.scan(declaration)

And we run it with the following test.t:

var hello : real
var world : int
var foo, bar, hw : string

We end up with:

[
   ["var", "hello", "real"], 
   ["var", "world", "int"], 
   ["var", "foo, bar, hw", "string"]
]

Getting one list

Now, we have something that's a lot simpler than what we started with, but  it's still not a list of variables.  To get that we'll have to separate out the variable names.

turing_file = File.open("test.t", File::RDONLY)
contents = turing_file.read
turing_file.close

without_line_comments = contents.gsub(/ % .* $ /x, "")
without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "")

variable  = /[a-zA-Z][a-zA-Z\d_]*/
var_list  = /#{variable}(?:\s*,\s*#{variable})*/
types     = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
declaration = /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$/

variable_list = {}

without_comments.scan declaration do |var_or_const, names, type|
   names.split(/\s*,\s*/).each do |name|
      variable_list[name] = { "var or const" => var_or_const, "type" => type }
   end
end

And now "variable_list" looks like:

{
   "world" => {
      "var or const" => "var", 
      "type"         => "int"
   }, 
   "hw"    => {
      "var or const" => "var", 
      "type"         => "string"
   }, 
   "foo"   => { 
      "var or const" => "var", 
      "type"         => "string"
   }, 
   "hello" => {
      "var or const" => "var", 
      "type"         => "real"
   }, 
   "bar"   => {
      "var or const" => "var", 
      "type"         => "string"
   }
}

That's all for now, folks

Well, it feels like I should call it quits at this point.  Future tutorials may work on identifying arrays and custom data types.