Optimizing your Turing code from the compiler's perspective

Foundry · **Posted:** Tue Feb 14, 2017 8:00 pm

Optimizing your Turing code

One of the major issues that people have with Turing is that it's slow. And frankly, when compared with compiled languages like C, Rust, or Haskell, it is.
Being an interpreted language in its reference implementation with no JIT compilation abilities, Turing can be hideously slow under certain circumstances. Unfortunately, many beginner or intermediate Turing programmers make seemingly innocent blunders that have rather large implications on the performance of their software.
Let's take a look at these three blocks of code:

Turing:

type Vector3i:
record
x, y, z: int
end record

var myVec: Vector3i
myVec.x := 100
myVec.y := 45
myVec.z := 250

var myVecArray: array 1..10 of Vector3i
for i: 1..10
myVecArray(i).x := i
myVecArray(i).y := i * 2
myVecArray(i).z := i
end for

Turing:

type Vector3i:
record
x, y, z: int
end record

var myVec: Vector3i := init(100, 45, 250)

var myVecArray: array 1..10 of Vector3i := init(
init(1, 2, 1), init(2, 4, 2),
init(3, 6, 3), init(4, 8, 4),
init(5, 10, 5), init(6, 12, 6),
init(7, 14, 7), init(8, 16, 8),
init(9, 18, 9), init(10, 20, 10)
)

Turing:

type Vector3i:
record
x, y, z: int
end record

const myVec: Vector3i := init(100, 45, 250)

const myVecArray: array 1..10 of Vector3i := init(
init(1, 2, 1), init(2, 4, 2),
init(3, 6, 3), init(4, 8, 4),
init(5, 10, 5), init(6, 12, 6),
init(7, 14, 7), init(8, 16, 8),
init(9, 18, 9), init(10, 20, 10)
)

Taking a cursory glance at them, you might realize that these all do pretty much the same thing when shown without any additional context. And superficially, that is correct! But when you run the code, what does it actually compile down to? To find out, you'll need to grab the last library you'll ever use: turing.reflect. Its Opcode introspection capabilities will come in handy in figuring out exactly what your code is doing when it runs.

The first snippet

As I mentioned above, the reference implementation of the Turing language is interpreted. When you run your Turing code, the Turing compiler translates your source code not into machine language as a C compiler would, but instead into interpreter opcodes. Each opcode represents a single instruction for the Turing interpreter to execute.
Using turing.reflect's function reflection capacities, we can look at the opcodes that the interpreter generates for any given block of code. The first block of code, for example, generates something like this:

Turing:

LOCATELOC, 12,
PUSHADDR1, 0, 41212648,
UNINIT,
LOCATELOC, 12,
PUSHINT1, 100,
ASNINT,
LOCATELOC, 12,
FIELD,
ADDINT,
PUSHINT1, 45,
ASNINT,
LOCATELOC, 12,
FIELD,
ADDREAL,
PUSHINT2, 250,
ASNINT,
LOCATELOC, 132,
PUSHADDR1, 764, 41212684,
UNINIT,
PUSHVAL1,
PUSHINT1, 10,
PUSHVAL1,
LOCATELOC, 148,
FOR,
SUBNAT,
LOCATELOC, 132,
LOCATELOC, 148,
FETCHINT4,
VSUBSCRIPT, 12, 1, 10,
LOCATELOC, 148,
FETCHINT4,
ASNINT,
LOCATELOC, 132,
LOCATELOC, 148,
FETCHINT4,
VSUBSCRIPT, 12, 1, 10,
FIELD,
ADDINT,
LOCATELOC, 148,
FETCHINT4,
PUSHINT1, 2,
MULINT,
ASNINT,
LOCATELOC, 132,
LOCATELOC, 148,
FETCHINT4,
VSUBSCRIPT, 12, 1, 10,
FIELD,
ADDREAL,
LOCATELOC, 148,
FETCHINT4,
ASNINT,
LOCATELOC, 148,
ENDFOR,
SIGNAL

That's quite a mouthful, huh.
Each word there is a mnemonic for a single interpreter instruction. Anything that is shown as a raw number is an operand for that instruction, telling it how to perform its job. I'm aware that there is a non-trivial number of individual opcodes present in the above snippet, so I'll leave matching individual blocks of opcodes to source code representations as an exercise to the user. A list of all interpreter opcodes is also present in turing.reflect's Opcodes file for anyone curious.
Suffice to say, this monstrosity is what the Turing interpreter generates when it compiles the first code snippet shown above. Dumbed down, this code is allocating blocks of memory on the stack and writing data to offsets from those root block addresses one by one. Every bit of information needs to be initialized manually whenever this code is hit.
As you can see, writing code like this doesn't seem terribly efficient. Let's see if we can't clean some of this up.

The second snippet

The code in the second snippet compiles down to this:

Turing:

PUSHADDR1, 0, 40491764,
LOCATELOC, 12,
ASNNONSCALARINV, 12,
PUSHADDR1, 756, 40491804,
LOCATELOC, 132,
ASNNONSCALARINV, 120

Wow. That's a whole lot more concise!
What happened here is that the compiler, seeing that it could initialize the data for each variable at compile-time, went ahead and did just that. The addresses in memory that you can see after each PUSHADDR1 instruction actually point to the fully initialized form of the variable, allocated on the heap. Whenever the second block of code above is reached, a copy of that fully initialized variable is made and pushed onto the stack.
Each ASNNONSCALARINV (roughly read as "assign non-scalar in variable" in my mind) instruction is copying a given number of bytes from a source destination to a target destination, specified by the PUSHADDR1 and LOCATELOC instructions respectively.
In a nutshell, this is a lot more efficient than the opcodes generated by the first block of code because the actual initialization of each variable is done by fast C code instead of slow Turing code. But can it be any faster?

The third snippet

The code in the third snippet compiles down to this:

Turing:

No, you're not missing something here. Absolutely no instructions are generated!
Changing each var declaration to be const signals to the compiler that you will never be changing the contents of those variables, allowing the only things that the instructions generated in block two do (pushing a copy of the fully initialized form of each variable stored on the heap to the stack) to be optimized out completely. Any code that accesses either of those variables will operate directly with the fully initialized data stored on the heap.
Now, you won't always be able to mark things as const, but in many cases it is feasible to do so. As you can see, there can be quite a performance benefit from doing so when you can.

So what does this mean for me?
In conclusion, observing what the interpreter is doing with your source code can offer incredibly valuable insight into the performance of your applications. Making relatively simple changes to your code can end up paying dividends. If you're interested in seeing some more comparisons of how different styles of writing code can have performance implications, feel free to let me know!

Insectoid · **Posted:** Wed Feb 15, 2017 11:06 am

Neat stuff, I didn't know about the implementation difference between init() and manual assignment. That said, 99% of the time when someone's code runs slow, it's because their algorithm is awful, not because of Turing's interpreter. People have investigated similar stuff (View.Update vs View.UpdateArea, for example) but it's never had any significant effect on program performance. The vast majority of projects regulate speed with delay or delaysincelast as it is.

Are there any performance changes if you compile to an exe? I was never sure if that option actually compiled the code or if it just bundled the interpreter with the opcodes. There are subtle behaviour differences between the two so they aren't identical but I really don't know what's going on under the hood. Note that Turing 4.1.1 cannot compile exes, so you'll need 4.1 instead. OpenTuring may have fixed that, I don't know.

It's awesome to see someone really digging into Turing. I'm pretty sure the staff of this site plus a couple users/former users are only Turing masters left on earth so seeing this pop up out of the blue is a treat. It's not fair to claim your library is the only one we'll ever use though, given it's probably the only one ever published (certainly the first in ten years).

	Computer Science Canada Programming C, C++, Java, PHP, Ruby, Turing, VB Username: Password: Register
Wiki Blog Search Turing Chat Room Members