----------------------------------- wtd Tue Nov 16, 2004 9:53 pm [Regex-tut] Finding Variables in Turing ----------------------------------- Disclaimer The code shown here is not a fool-proof way of parsing Turing code. It can be fooled by code that does tricky things like sticking comment characters in strings. Where do we start? We start by reading in a Turing source file. First we have to open the file for reading. turing_file = File.open("test.t", File::RDONLY) Then, we have to read in the entire file as a big string. contents = turing_file.read And since we've got the string, now, we can close the file. turing_file.close Eliminating comments The next step is to eliminate any comments in the code, as we don't want to find variable declarations in comments. To get rid of text that matches a given pattern, we'll use the gsub method of our "contents" string. This method takes a regular expression, and a replacement string. Anywhere it finds a match, it replaces that text with the replacement string. In turing, a line comment is anything that follows a %, so a regular expression to match a line comments would looking like: / % .* $ /x Thus, the gsub call to eliminate such comments looks like: without_line_comments = contents.gsub(/ % .* $ /x, "") Eliminating multi-line comments Multi-line comments, like so: /* hello world */ Present a problem. Regular expressions, as seen so far, only match a single line. The . character matches everything except newlines, so a single regular expression can't span multiple lines. We've already seen the x modifier, and how it allows use of spaces inside regular expressions. The m modifier allows . to match newlines, thus giving regular expressions the ability to span lines. A regular expression to find a /* */ style comment on one line: / \/\* .*? \/\* /x And now, to apply this to multiple lines: / \/\* .*? \/\* /mx So eliminating multi-line comments looks like: without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "") Note: the backslashes were necessary because / and * have meaning in a regular expression on their own. Matching a variable or constant declaration For simplicity, we'll only be matching variables or constants declared as one of the built-in types: intrealbooleancharstring Variable names in Turing begin with either a lower or uppercase letter. Following that you can have lower or uppercase letters, numbers, or underscores. As a result, the regular expression to match one looks like: [a-zA-Z][a-zA-Z0-9_]* We can make this a little more concise by using the \d shortcut, which is equivalent to: 0-9 [a-zA-Z][a-zA-Z\d_]* Of course, these names are preceded by either "var" or "const" and at least one space, so... (var|const)\s+[a-zA-Z][a-zA-Z\d_]* And just as we want to capture "var" or "const", we want to capture the name: (var|const)\s+([a-zA-Z][a-zA-Z\d_]*) Now, all of this has to be followed by a colon and then the type of the variable. The colon may have whitespace around it, but it may not. And of course, want to capture the type. (var|const)\s+([a-zA-Z][a-zA-Z\d_]*)\s*:\s*(int|real|boolean|char|string) Variable declarations come one their own lines, so we should modify this slightly: ^\s*(var|const)\s+([a-zA-Z][a-zA-Z\d_]*)\s*:\s*(int|real|boolean|char|string)\s*$ The problem The problem with this regex is that it only matches variable declarations where one variable is being declared. We need a way to match something llike: var a, b, c, d : int So, we need to look for the pattern. There's at least one variable being declared. Let's say that one variable in this case is "a". Remove "a" from the list and you're left with: , b, c, d Clearly each possible variable is preceded by a comma. To match this pattern, we can use: (?:\s*,\s*[a-zA-Z][a-zA-Z\d_]*)* Or, broken down: (?: # the start of the group \s* # zero or more saces , # a comma \s* # zero or more saces [a-zA-Z] # match the first character in the variable name [a-zA-Z\d_]* # any following characters in the variable name )* # match this whole group zero or more times So, we can reincorporate this into the bigger regex: ^\s*(var|const)\s+([a-zA-Z][a-zA-Z\d_]*(?:\s*,\s*[a-zA-Z][a-zA-Z\d_]*)*)\s*:\s*(int|real|boolean|char|string)\s*$ Refactoring So, how do we make this more readable? Well, in another thread on Ruby I demonstrated that strings can be "interpolated" into other strings. If I have a "name" variable, I can interpolate it into a greeting string. name = "Bob" greeting = "Hello, #{name}" It shouldn't be terribly surprising then, that the same can work with regular expressions. greeting_word = /[hH]ello/ pattern = /#{greeting_word}, Bob/ So, we can factor out repeated bits of the regular expression and give them names. variable = /[a-zA-Z][a-zA-Z\d_]*/ ^\s*(var|const)\s+(#{variable}(?:\s*,\s*#{variable})*)\s*:\s*(int|real|boolean|char|string)\s*$ Factoring further: variable = /[a-zA-Z][a-zA-Z\d_]*/ var_list = /#{variable}(?:\s*,\s*#{variable})*/ ^\s*(var|const)\s+(#{var_list})\s*:\s*(int|real|boolean|char|string)\s*$ And a bit further: variable = /[a-zA-Z][a-zA-Z\d_]*/ var_list = /#{variable}(?:\s*,\s*#{variable})*/ types = ["int", "real", "boolean", "char", "string"] type_list = types.join("|") ^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$ This last change is particularly profound. With it, the list of types to match can grow just by adding the type name to the array, rather than editing the regular expression directly. Testing it So, let's put this up against a sample variable declaration. variable = /[a-zA-Z][a-zA-Z\d_]*/ var_list = /#{variable}(?:\s*,\s*#{variable})*/ types = ["int", "real", "boolean", "char", "string"] type_list = types.join("|") "var a, b, c : int" =~ /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$ $1 now contains: "var" $2 contains: "a, b, c" And $3 contains: "int" Taking the variable name list apart So now we have a variable list, "a, b, c", but it's all one big string. Fortunately, we can use a regex and a method of that string to get each name. name_list = $2.dup names = name_list.split(/\s*,\s*/) Scanning the whole source Ruby strings have a method called "scan", which takes a reglar expression and return an array containing all of the matches. We can use this, and the regex we've constructed for finding variable declarations to find all variable declarations in the source. The code So, let's recap the code thus far: turing_file = File.open("test.t", File::RDONLY) contents = turing_file.read turing_file.close without_line_comments = contents.gsub(/ % .* $ /x, "") without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "") variable = /[a-zA-Z][a-zA-Z\d_]*/ var_list = /#{variable}(?:\s*,\s*#{variable})*/ types = ["int", "real", "boolean", "char", "string"] type_list = types.join("|") declaration = /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$/ all_declarations = without_comments.scan(declaration) And we run it with the following test.t: var hello : real var world : int var foo, bar, hw : string We end up with: [ ["var", "hello", "real"], ["var", "world", "int"], ["var", "foo, bar, hw", "string"] ] Getting one list Now, we have something that's a lot simpler than what we started with, but it's still not a list of variables. To get that we'll have to separate out the variable names. turing_file = File.open("test.t", File::RDONLY) contents = turing_file.read turing_file.close without_line_comments = contents.gsub(/ % .* $ /x, "") without_comments = without_line_comments.gsub(/ \/\* .*? \/\* /mx, "") variable = /[a-zA-Z][a-zA-Z\d_]*/ var_list = /#{variable}(?:\s*,\s*#{variable})*/ types = ["int", "real", "boolean", "char", "string"] type_list = types.join("|") declaration = /^\s*(var|const)\s+(#{var_list})\s*:\s*(#{type_list})\s*$/ variable_list = {} without_comments.scan declaration do |var_or_const, names, type| names.split(/\s*,\s*/).each do |name| variable_list[name] = { "var or const" => var_or_const, "type" => type } end end And now "variable_list" looks like: { "world" => { "var or const" => "var", "type" => "int" }, "hw" => { "var or const" => "var", "type" => "string" }, "foo" => { "var or const" => "var", "type" => "string" }, "hello" => { "var or const" => "var", "type" => "real" }, "bar" => { "var or const" => "var", "type" => "string" } } That's all for now, folks Well, it feels like I should call it quits at this point. Future tutorials may work on identifying arrays and custom data types.