Comments on the Mill CPU from OotB Computing
do 01 augustus 2013 — Raphael ‘kena’ Poss
Ivan Godard from Out-of-the-Box Computing is currently touring the US talking about his Mill CPU architecture, based on a new machine model called the “Belt”. I just finished watching the 2nd talk on the machine model and its realization in the processor’s micro-architecture.
OotB’s contribution is a machine model (the “Belt”) where functional units (FUs) produce values at the head of a “conveyor belt” and consume them from arbitrary positions relative to the head. The Mill CPU is a particular architecture that implements the Belt. To enable instruction-level parallelism (ILP), the Mill fixes operation latencies and lets the compiler schedule instructions. The Mill also supports multiple pipelines and VLIW issue. OotB (through Ivan) proposes to deal with forward compatibility (different latencies in subsequent processor generations) by re-generating code on each CPU from a common intermediate representation (IR), eg LLVM.
What makes this stuff worth some attention: this guy and his company are executing a cool idea that has been around since the 1970’s: using stack machines  as a basic abstract model instead of Turing/register machines.
(Don’t listen to him when he says the idea is “new”. It isn’t. But his execution of it is fresh, interesting and worth listening to).
|||Stack machines are a theoretical model that is as powerful as Turing machines but is quite differently defined. This is exciting stuff, there have been nearly no hardware CPUs using stack machines as basic model until now!|
For reference, most hardware CPUs built so far are derivative from Turing machines, or more accurately register machines which are really an optimization of Turing machines with constant-time cell addressing. Some notable hardware CPUs that use stack machines as model are the 8087 FPU (which is still visible in the x86 ISA, although Intel has been using RISC internally for newer models) and embedded Forth processors.
The important aspect is that software has been using stack machines as a basic machine abstraction for ages already (ever since Algol: eg C, C++, ML, LLVM, JVM all these use stack machine as primary execution model). Most of the compiler research has been how to map the software stack-based model into a hardware register-based model; of course since there were two different abstractions, the translation was never optimal.
A general-purpose hardware CPU using stack machines as model could greatly increase the efficiency of mapping software to hardware. This is why this stuff is exciting.
However past the excitement I see a few problems already:
For now the available materials skip the question of memory entirely. This is a killer point nowadays, since the “main” problem of computer architects is how to deal with memory efficiently (the “memory wall”). So far I can see, if he keeps traditional memory semantics, the only option is to stall the entire processor if a memory unit is not available. This is a traditional problem of stack machines.
That said, memory will be the topic of a later talk from Ivan so we can just wait and see.
However, if I was him, I would pull a rabbit out of my hat and introduce futures for memory accesses: an instruction to load data from memory does not place the data directly on the “belt” but instead a token that is “good for” the actual value. Then only when the token is actually consumed by a FU start waiting on the load. This way you can tolerate access latencies of a few cycles without having to stall everything. Or maybe he will present something else entirely. I am personally very curious: if he finds something good, he would solve a tremendous problem, but many people before him have tried to crack this egg with no success.
Ivan argues in his talks that his 64-item belt is cheaper than the 300+-registers needed in superscalar designs. However the 300+ registers are useful to enable large ILP, so as to tolerate large latencies especially to memory. Unless he pulls a rabbit from his hat, the Mill’s 64-item belt will not tolerate as much latencies so it is really an unfair comparison.
The Mill requires fixed timings for all operations. The fact they must be fixed is problematic, since processors nowadays often encounter variable latencies: not only from cache misses, also from sharing of hardware units between different pipes (or even cores) in a parallel implementation.
This problem is arguably specific to his choice of a VLIW-based architecture, not the choice of stack machines as model. However I think it will make it problematic to integrate in large general-purpose designs.
So all in all, the Mill is indeed likely to produce something “different” with new trade-offs between performance, efficiency, area on chip and energy usage.
From the architecture perspective, there does not seem to be any silver bullet: performance is mainly obtained by VLIW techniques, which are fragile in contexts where operations have variable latencies, eg. memory or shared FUs. The “belt” seems cheap but how much more latency it can tolerate than a superscalar processor of similar size is left to be seen. The machine model is not “revolutionarily new”, stack machines have been around for a long time already.
However, even if the Mill’s performance ends up being merely similar to (instead of much better than) register-based designs at some similar area&energy budget, it will still be an interesting contribution, because it will make the translation from programming languages more straightforward: faster compilation times, smaller code size. Just these would be big news.
Edit (Aug 2nd 2013): an earlier version was comparing the Belt to theoretical queue machines, which was erroneous. The closer model was the stack machine. Thanks for Ivan for pointing out the discrepancy.
Comment (Aug 11th 2013): from “MW” (wishes to remain anonymous):
In you post on the Mill CPU from OotB Computing you include a quote: “This is exciting stuff, there have been nearly no hardware CPUs using stack machines as basic model until now!” This is historically inaccurate.
The Burroughs Large System machines were a very successful stack architecture. They have existed through multiple generations from 1961 until at least 2005.
The Burroughs Large Systems Group designed large mainframes using stack machine instruction sets with dense instruction syllables [NB 1] and 48-bit data words. The first such design was the B5000 in 1961. It was optimized for running ALGOL 60 extremely well, using simple compilers. It evolved into the B5500. Subsequent major redesigns include the B6500/B6700 line and its successors, and the separate B8500 line. … The first of the Burroughs large systems was the B5000. Designed in 1961, it was a second-generation computer using discrete transistor logic and magnetic core memory. The successor machines followed the hardware development trends to re-implement the architecture in new logic over the next 25 years, with the B5500, B6500, B5700, B6700, B7700, B6800, B7800, and finally the Burroughs A series. After a merger in which Burroughs acquired Sperry Corporation and changed its name to Unisys, the company continued to develop new machines based on the MCP CMOS ASIC. These machines were the Libra 100 through the Libra 500, With the Libra 590 being announced in 2005.
Besides their stack architecture, these processors were programed in variants of Algol. There was no assembly language support. They had virtual memory before IBM. These machines were truly ahead of their time.
I would also like to point out that the famous Dutch computer scientist Edsger Dijkstra was deeply involved with both Algol and Burroughs. He worked on the first Algol-60 compiler, and became a Burroughs Fellow in the early 1980s. He got a ACM Turing award, and shortly after his passing had another award named after him. Given your location in the Netherlands you should have some familiarity with his contributions to the computer science community.
Sadly, I find you lack of historical knowledge to the norm in the computer community. This is unfortunate, because instead of making progress, the same mistakes are repeated over and over again.
I hope I have been able to supply you with some additional information about computer architecture, and hopefully encourage you to invest some effort to find out about the rich and useful history of computing.