wtd
|
Posted: Tue Nov 16, 2004 9:53 pm Post subject: [Regex-tut] Finding Variables in Turing |
|
|
Disclaimer
The code shown here is not a fool-proof way of parsing Turing code. It can be fooled by code that does tricky things like sticking comment characters in strings.
Where do we start?
We start by reading in a Turing source file. First we have to open the file for reading.
code: | turing_file = File.open("test.t", File::RDONLY) |
Then, we have to read in the entire file as a big string.
code: | contents = turing_file.read |
And since we've got the string, now, we can close the file.
Eliminating comments
The next step is to eliminate any comments in the code, as we don't want to find variable declarations in comments.
To get rid of text that matches a given pattern, we'll use the gsub method of our "contents" string. This method takes a regular expression, and a replacement string. Anywhere it finds a match, it replaces that text with the replacement string.
In turing, a line comment is anything that follows a %, so a regular expression to match a line comments would looking like:
Thus, the gsub call to eliminate such comments looks like:
code: | without_line_comments = contents.gsub(/ % .* $ /x, "") |
Eliminating multi-line comments
Multi-line comments, like so:
Present a problem. Regular expressions, as seen so far, only match a single line. The . character matches everything except newlines, so a single regular expression can't span multiple lines.
We've already seen the x modifier, and how it allows use of spaces inside regular expressions. The m modifier allows . to match newlines, thus giving regular expressions the ability to span lines.
A regular expression to find a /* */ style comment on one line:
And now, to apply this to multiple lines:
code: | / \/\* .*? \/\* /mx |
So eliminating multi-line comments looks like:
code: | without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "") |
Note: the backslashes were necessary because / and * have meaning in a regular expression on their own.
Matching a variable or constant declaration
For simplicity, we'll only be matching variables or constants declared as one of the built-in types:
Variable names in Turing begin with either a lower or uppercase letter. Following that you can have lower or uppercase letters, numbers, or underscores. As a result, the regular expression to match one looks like:
code: | [a-zA-Z][a-zA-Z0-9_]* |
We can make this a little more concise by using the \d shortcut, which is equivalent to:
code: | [a-zA-Z][a-zA-Z\d_]* |
Of course, these names are preceded by either "var" or "const" and at least one space, so...
code: | (var|const)\s+[a-zA-Z][a-zA-Z\d_]* |
And just as we want to capture "var" or "const", we want to capture the name:
code: | (var|const)\s+([a-zA-Z][a-zA-Z\d_]*) |
Now, all of this has to be followed by a colon and then the type of the variable. The colon may have whitespace around it, but it may not. And of course, want to capture the type.
code: | (var|const)\s+([a-zA-Z][a-zA-Z\d_]*)\s*:\s*(int|real|boolean|char|string) |
Variable declarations come one their own lines, so we should modify this slightly:
code: | ^\s*(var|const)\s+([a-zA-Z][a-zA-Z\d_]*)\s*:\s*(int|real|boolean|char|string)\s*$ |
The problem
The problem with this regex is that it only matches variable declarations where one variable is being declared. We need a way to match something llike:
code: | var a, b, c, d : int |
So, we need to look for the pattern. There's at least one variable being declared. Let's say that one variable in this case is "a". Remove "a" from the list and you're left with:
Clearly each possible variable is preceded by a comma. To match this pattern, we can use:
code: | (?:\s*,\s*[a-zA-Z][a-zA-Z\d_]*)* |
Or, broken down:
code: | (?: # the start of the group
\s* # zero or more saces
, # a comma
\s* # zero or more saces
[a-zA-Z] # match the first character in the variable name
[a-zA-Z\d_]* # any following characters in the variable name
)* # match this whole group zero or more times |
So, we can reincorporate this into the bigger regex:
code: | ^\s*(var|const)\s+([a-zA-Z][a-zA-Z\d_]*(?:\s*,\s*[a-zA-Z][a-zA-Z\d_]*)*)\s*:\s*(int|real|boolean|char|string)\s*$ |
Refactoring
So, how do we make this more readable?
Well, in another thread on Ruby I demonstrated that strings can be "interpolated" into other strings. If I have a "name" variable, I can interpolate it into a greeting string.
code: | name = "Bob"
greeting = "Hello, #{name}" |
It shouldn't be terribly surprising then, that the same can work with regular expressions.
code: | greeting_word = /[hH]ello/
pattern = /#{greeting_word}, Bob/ |
So, we can factor out repeated bits of the regular expression and give them names.
code: | variable = /[a-zA-Z][a-zA-Z\d_]*/
^\s*(var|const)\s+(#{variable}(?:\s*,\s*#{variable})*)\s*:\s*(int|real|boolean|char|string)\s*$ |
Factoring further:
code: | variable = /[a-zA-Z][a-zA-Z\d_]*/
var_list = /#{variable}(?:\s*,\s*#{variable})*/
^\s*(var|const)\s+(#{var_list})\s*:\s*(int|real|boolean|char|string)\s*$ |
And a bit further:
code: | variable = /[a-zA-Z][a-zA-Z\d_]*/
var_list = /#{variable}(?:\s*,\s*#{variable})*/
types = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$ |
This last change is particularly profound. With it, the list of types to match can grow just by adding the type name to the array, rather than editing the regular expression directly.
Testing it
So, let's put this up against a sample variable declaration.
code: | variable = /[a-zA-Z][a-zA-Z\d_]*/
var_list = /#{variable}(?:\s*,\s*#{variable})*/
types = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
"var a, b, c : int" =~ /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$ |
$1 now contains:
$2 contains:
And $3 contains:
Taking the variable name list apart
So now we have a variable list, "a, b, c", but it's all one big string. Fortunately, we can use a regex and a method of that string to get each name.
code: | name_list = $2.dup
names = name_list.split(/\s*,\s*/) |
Scanning the whole source
Ruby strings have a method called "scan", which takes a reglar expression and return an array containing all of the matches. We can use this, and the regex we've constructed for finding variable declarations to find all variable declarations in the source.
The code
So, let's recap the code thus far:
code: | turing_file = File.open("test.t", File::RDONLY)
contents = turing_file.read
turing_file.close
without_line_comments = contents.gsub(/ % .* $ /x, "")
without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "")
variable = /[a-zA-Z][a-zA-Z\d_]*/
var_list = /#{variable}(?:\s*,\s*#{variable})*/
types = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
declaration = /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$/
all_declarations = without_comments.scan(declaration) |
And we run it with the following test.t:
code: | var hello : real
var world : int
var foo, bar, hw : string |
We end up with:
code: | [
["var", "hello", "real"],
["var", "world", "int"],
["var", "foo, bar, hw", "string"]
] |
Getting one list
Now, we have something that's a lot simpler than what we started with, but it's still not a list of variables. To get that we'll have to separate out the variable names.
code: | turing_file = File.open("test.t", File::RDONLY)
contents = turing_file.read
turing_file.close
without_line_comments = contents.gsub(/ % .* $ /x, "")
without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "")
variable = /[a-zA-Z][a-zA-Z\d_]*/
var_list = /#{variable}(?:\s*,\s*#{variable})*/
types = ["int", "real", "boolean", "char", "string"]
type_list = types.join("|")
declaration = /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$/
variable_list = {}
without_comments.scan declaration do |var_or_const, names, type|
names.split(/\s*,\s*/).each do |name|
variable_list[name] = { "var or const" => var_or_const, "type" => type }
end
end |
And now "variable_list" looks like:
code: | {
"world" => {
"var or const" => "var",
"type" => "int"
},
"hw" => {
"var or const" => "var",
"type" => "string"
},
"foo" => {
"var or const" => "var",
"type" => "string"
},
"hello" => {
"var or const" => "var",
"type" => "real"
},
"bar" => {
"var or const" => "var",
"type" => "string"
}
} |
That's all for now, folks
Well, it feels like I should call it quits at this point. Future tutorials may work on identifying arrays and custom data types. |
|
|