Today Google Labs introduced Code Search. It’s a nifty tool for searching publicly available source code, that is, whatever Google can reach, including the contents of .tar and .zip archives. It works on actual source-code files, but not on the code snippets that are often embedded in posts.
The best part is the support for Regular Expression searches. Our resident guru wtd has written an excellent series on the use of Regular Expressions.
“Regular expressions, in practical terms, are a means of performing pattern matching against strings.
Rather than simply comparing two strings for equality, or even comparing ranges of them, regular expressions allow programmers a way to relatively construct powerful, truly flexible patterns. They allow us to not just make simple comparisons, but to identify whole classes of strings.”
So by searching for ^var .*:array I can find out who’s declaring array variables, regardless of the variable’s name. A properly constructed regexp makes it easy to check whether a distinctive line of code has been taken from another source, even if it was cosmetically changed. The flexibility allows for a fair bit of logic to go into the regexp pattern, and wtd explores how it could be used to parse files and identify the variable names they use.
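To make the pattern concrete, here is a minimal sketch in Python's `re` module. The sample source text, the function names, and the looser second pattern are all made up for illustration; Code Search's own regexp dialect may differ in details.

```python
import re

# The pattern from the post. It wants "var " at the very start of the line
# and ":array" with no space after the colon.
STRICT = re.compile(r"^var .*:array")

# A looser variant (an assumption, not from the post) that tolerates
# indentation and extra spaces, i.e. cosmetic changes.
LOOSE = re.compile(r"^\s*var\s+\w+\s*:\s*array")

# Hypothetical source text, invented for this example.
SAMPLE = """\
var scores :array 1 .. 10 of int
var total : int
var names:array 1 .. 5 of string
put "var x :array"
   var   grid : array 1 .. 3 of int
"""

def find_lines(pattern, source):
    """Return every line of `source` that the compiled pattern matches."""
    return [line for line in source.splitlines() if pattern.search(line)]

def strict_matches(source):
    return find_lines(STRICT, source)

def loose_matches(source):
    return find_lines(LOOSE, source)

if __name__ == "__main__":
    print(strict_matches(SAMPLE))
    print(loose_matches(SAMPLE))
```

The strict pattern finds the `scores` and `names` declarations whatever the identifier is, but misses the indented, re-spaced `grid` line. The looser pattern catches that one too, which is the kind of cosmetic change a copied line might go through.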
Perhaps this will mean an accessible solution to plagiarism problems in computer science classes, as it is way too easy to copy code.