Computer Science Canada [Regex-tut] Finding Variables in Turing
Author: wtd [ Tue Nov 16, 2004 9:53 pm ]
Post subject: [Regex-tut] Finding Variables in Turing
Disclaimer
The code shown here is not a foolproof way of parsing Turing code. It can be fooled by code that does tricky things like sticking comment characters in strings.
Where do we start?
We start by reading in a Turing source file. First we have to open the file for reading.
Then, we have to read in the entire file as a big string.
And since we've got the string now, we can close the file.
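As a minimal sketch, the three steps above might look like this in Ruby. The file name test.t, its one-line contents, and the variable name input_file are assumptions for illustration; "contents" is the name used later in the text. The sample file is written to a temp directory first so the snippet runs on its own.

```ruby
require "tmpdir"

# Hypothetical sample file so the snippet is self-contained.
path = File.join(Dir.tmpdir, "test.t")
File.write(path, "var a : int % a counter\n")

# Open the file for reading.
input_file = File.open(path, "r")

# Read the entire file in as one big string.
contents = input_file.read

# We have the string now, so the file can be closed.
input_file.close
```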
Eliminating comments
The next step is to eliminate any comments in the code, as we don't want to find variable declarations in comments. To get rid of text that matches a given pattern, we'll use the gsub method of our "contents" string. This method takes a regular expression and a replacement string. Anywhere it finds a match, it replaces that text with the replacement string. In Turing, a line comment is anything that follows a %, so a regular expression to match a line comment would look like:
Thus, the gsub call to eliminate such comments looks like:
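Here's a sketch of what the pattern and the gsub call could look like. The sample string and the variable name line_comment are made up for illustration, and the exact pattern is an assumption based on the description above.

```ruby
# Hypothetical sample source with one line comment.
contents = "var a : int % a counter\nput a\n"

# % starts the comment, and .* matches everything up to the end of the
# line, because . does not match a newline by default.
line_comment = /%.*/

# gsub returns a new string with every match replaced by "".
contents = contents.gsub(line_comment, "")

p contents
```

Note that the space before the % is left behind, which is harmless for what follows.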
Eliminating multi-line comments
Multi-line comments, delimited by /* and */, present a problem. Regular expressions, as seen so far, only match within a single line. The . character matches everything except newlines, so a pattern that relies on . can't match across a line break. We've already seen the x modifier, and how it allows use of spaces inside regular expressions. The m modifier allows . to match newlines, thus giving regular expressions the ability to span lines. A regular expression to find a /* */ style comment on one line:
And now, to apply this to multiple lines:
So eliminating multi-line comments looks like:
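A sketch of these steps, under a couple of assumptions: the sample source is made up, and the lazy .*? is my choice so that the match stops at the first closing */ rather than swallowing everything up to the last one.

```ruby
# Hypothetical source with a comment that spans two lines.
source = "var a : int\n/* this comment\n   spans two lines */\nvar b : real\n"

# On one line: \/\* and \*\/ are the escaped delimiters, and .*? matches
# as little as possible between them.
one_line = /\/\*.*?\*\//

# With the m modifier, . also matches newlines, so the match can span lines.
multi_line = /\/\*.*?\*\//m

# The one-line version finds nothing here, because the comment wraps.
p source =~ one_line

contents = source.gsub(multi_line, "")
p contents
```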
Note: the backslashes were necessary because / and * have meaning of their own here: / delimits the regular expression literal, and * is a quantifier.
Matching a variable or constant declaration
For simplicity, we'll only be matching variables or constants declared as one of the built-in types:
Variable names in Turing begin with either a lower or uppercase letter. Following that you can have lower or uppercase letters, numbers, or underscores. As a result, the regular expression to match one looks like:
We can make this a little more concise by using the \d shortcut, which is equivalent to:
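Both forms of the name pattern could be written like this. The variable names and the sample string are hypothetical.

```ruby
# A letter first, then any mix of letters, digits, or underscores.
name_long = /[a-zA-Z][a-zA-Z0-9_]*/

# The same thing, with \d standing in for [0-9].
name_short = /[a-zA-Z][a-zA-Z\d_]*/

# String#[] with a regex returns the first matching piece of the string.
matched = "total_2 := 5"[name_short]
p matched
```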
Of course, these names are preceded by either "var" or "const" and at least one space, so...
And just as we want to capture "var" or "const", we want to capture the name:
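As a sketch, the keyword and the name can each get their own pair of parentheses. The sample declaration and the variable names are made up for illustration.

```ruby
# Group 1 captures "var" or "const"; \s+ demands at least one space;
# group 2 captures the name.
declaration_start = /(var|const)\s+([a-zA-Z][a-zA-Z\d_]*)/

match = declaration_start.match("const limit : int")

keyword    = match[1]
identifier = match[2]
p [keyword, identifier]
```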
Now, all of this has to be followed by a colon and then the type of the variable. The colon may or may not have whitespace around it. And of course, we want to capture the type.
Variable declarations come on their own lines, so we should modify this slightly:
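Put together, a single-variable declaration pattern might look like the sketch below. The list of types (int, real, string, boolean) is an assumption, and I've used the x modifier so the pieces can be spaced out.

```ruby
single_declaration = /
  ^ \s*                     # start of the line, optional indentation
  (var|const) \s+           # group 1: the keyword
  ([a-zA-Z][a-zA-Z\d_]*)    # group 2: the name
  \s* : \s*                 # the colon, with or without whitespace
  (int|real|string|boolean) # group 3: the type, from an assumed list
  \s* $                     # nothing else on the line
/x

spaced = single_declaration.match("var count : int")
tight  = single_declaration.match("  const pi:real  ")

p spaced.captures
p tight.captures
```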
The problem
The problem with this regex is that it only matches declarations where a single variable is being declared. We need a way to match something like:
So, we need to look for the pattern. There's at least one variable being declared. Let's say that one variable in this case is "a". Remove "a" from the list and you're left with:
Clearly each possible variable is preceded by a comma. To match this pattern, we can use:
Or, broken down:
So, we can reincorporate this into the bigger regex:
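A sketch of the bigger regex with the comma pattern folded in. Two assumptions: the type list is my guess, and the repeated part uses a non-capturing (?: ) group so that the whole name list stays in group 2 and the type stays in group 3.

```ruby
declaration = /
  ^ \s*
  (var|const) \s+                          # group 1, the keyword
  (                                        # group 2, the whole name list
    [a-zA-Z][a-zA-Z\d_]*                   #   the first name
    (?: \s* , \s* [a-zA-Z][a-zA-Z\d_]* )*  #   then zero or more of comma plus name
  )
  \s* : \s*
  (int|real|string|boolean)                # group 3, the type
  \s* $
/x

many = declaration.match("var a, b, c : int")
one  = declaration.match("const x : real")

p many.captures
p one.captures
```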
Refactoring
So, how do we make this more readable? Well, in another thread on Ruby I demonstrated that strings can be "interpolated" into other strings. If I have a "name" variable, I can interpolate it into a greeting string.
It shouldn't be terribly surprising then, that the same can work with regular expressions.
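A small sketch of both kinds of interpolation. The greeting text and the keyword variable are made up for illustration.

```ruby
# Interpolating a string into a string.
name = "wtd"
greeting = "Hello, #{name}!"
p greeting

# Interpolating one regular expression into another works the same way.
keyword = /var|const/
pattern = /(#{keyword})\s+/

found = pattern.match("const limit : int")[1]
p found
```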
So, we can factor out repeated bits of the regular expression and give them names.
Factoring further:
And a bit further:
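One way the fully factored version could end up, as a sketch. Every variable name here is hypothetical, and the contents of the types array are an assumption.

```ruby
# The types we're willing to match (an assumed list).
types = %w(int real string boolean)

name      = /[a-zA-Z][a-zA-Z\d_]*/
separator = /\s*,\s*/
name_list = /#{name}(?:#{separator}#{name})*/

# join builds the alternation int|real|string|boolean from the array.
type = /#{types.join("|")}/

declaration = /^\s*(var|const)\s+(#{name_list})\s*:\s*(#{type})\s*$/

captures = declaration.match("const x, y : real").captures
p captures
```

An interpolated regex is wrapped in a non-capturing group, so the only capturing groups are the three pairs of parentheses written out in the last pattern.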
This last change is particularly profound. With it, the list of types to match can grow just by adding the type name to the array, rather than editing the regular expression directly.
Testing it
So, let's put this up against a sample variable declaration.
$1 now contains:
$2 contains:
And $3 contains:
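A sketch of such a test. The name list "a, b, c" comes from the text that follows, while the keyword and type in the sample declaration, and the type list in the regex, are assumptions.

```ruby
declaration = /
  ^ \s* (var|const) \s+
  ( [a-zA-Z][a-zA-Z\d_]* (?: \s* , \s* [a-zA-Z][a-zA-Z\d_]* )* )
  \s* : \s* (int|real|string|boolean) \s* $
/x

# =~ sets $1, $2, $3 to the captured groups of the match.
"var a, b, c : int" =~ declaration

kind, names, type = $1, $2, $3

p kind   # the keyword
p names  # the whole name list as one string
p type   # the type
```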
Taking the variable name list apart
So now we have a variable list, "a, b, c", but it's all one big string. Fortunately, we can use a regex and a method of that string to get each name.
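Two ways this could be done, as a sketch. I don't know which method the original used, so both split and scan are shown.

```ruby
names = "a, b, c"

# split cuts the string wherever the separator pattern matches.
by_split = names.split(/\s*,\s*/)

# scan collects every piece that looks like a name.
by_scan = names.scan(/[a-zA-Z][a-zA-Z\d_]*/)

p by_split
p by_scan
```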
Scanning the whole source
Ruby strings have a method called "scan", which takes a regular expression and returns an array containing all of the matches. We can use this, and the regex we've constructed for finding variable declarations, to find all variable declarations in the source.
The code
So, let's recap the code thus far:
And we run it with the following test.t:
We end up with:
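As a sketch of how the pieces fit together, the following inlines a small made-up Turing program in place of test.t and prints the array that scan returns. The program text, the type list, and all the variable names other than "contents" are assumptions.

```ruby
# Made-up Turing source standing in for test.t.
contents = <<SOURCE
% A small test program
var a, b, c : int
/* const hidden : real
   var alsoHidden : string */
const title : string   % trailing comment
var done : boolean
put a
SOURCE

# Strip line comments, then multi-line comments.
contents = contents.gsub(/%.*/, "")
contents = contents.gsub(/\/\*.*?\*\//m, "")

types     = %w(int real string boolean)
name      = /[a-zA-Z][a-zA-Z\d_]*/
name_list = /#{name}(?:\s*,\s*#{name})*/
type      = /#{types.join("|")}/

declaration = /^\s*(var|const)\s+(#{name_list})\s*:\s*(#{type})\s*$/

# With capturing groups in the regex, scan returns one array per match.
declarations = contents.scan(declaration)
p declarations
```

For this sample, each entry of the result holds the keyword, the name list as a single string, and the type; the two declarations inside the /* */ comment are not picked up.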
Getting one list
Now, we have something that's a lot simpler than what we started with, but it's still not a list of variables. To get that we'll have to separate out the variable names.
And now "variable_list" looks like:
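A sketch of that last step, starting from a hypothetical scan result of the shape described above (keyword, name list, type).

```ruby
# Hypothetical result of scanning a source file.
declarations = [
  ["var",   "a, b, c", "int"],
  ["const", "title",   "string"],
  ["var",   "done",    "boolean"]
]

# Split each name list into its names, then flatten the nested arrays
# into a single list.
variable_list = declarations.map { |kind, names, type|
  names.split(/\s*,\s*/)
}.flatten

p variable_list
```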
That's all for now, folks
Well, it feels like I should call it quits at this point. Future tutorials may work on identifying arrays and custom data types.