Is there a plain text version of the overview on the page you linked? The current version is very pretty as a homepage, but a short white paper would help put the focus on the information.
Personally, I found that the slide-ins and wobbly text made it difficult to keep my place. I tried Readability, but it didn't do well on it. Scrolling to the bottom to let the glitz run its course, and then scrolling back up to the top to read helped in the end.
There are a lot, and I would recommend reading through this page... the basic differences from Epiphany (and the same can be said about most others) are that we designed from the start for very high memory/IO bandwidth (both on and off chip), stuck to RISC, and focused on floating point (ours is one of the few architectures in this area that is fully IEEE compliant at this high a performance-per-watt ratio).
Have you ever heard of Epiphany[1]? They claim to achieve 70 GFLOPS/Watt. The processor also seems to be fairly fast, and they manage to put 4096 cores on a chip, though there has been no activity recently. Maybe you could find some collaboration points with them.
I also wish you good luck, and recommend a change of focus:
First figure out what kind of real-world problems this architecture can handle, and how to handle them best. The architecture will perform very well if the number of operations that can be done on a set of data fitting into the scratchpad memory is large (so that communication overhead is small), or if the problem can be mapped to the grid in a way that only requires communication with neighbours. However, I would assume that typical real-world big-data problems don't fulfill these requirements, and that problems which do fulfill them also run well on classical architectures. As soon as you start to need a lot of data transfer, only the cores close to the border of the grid will be able to work as they get data; the ones at the border are busy forwarding data to and from memory, and the innermost ones just wait...
Therefore, before investing a lot of effort, money, and time into hardware that many others are also building in very similar ways, and then asking the community to find out how to use it, spend the energy on innovative ideas for actually using such architectures efficiently - i.e. languages, profiling and debugging tools, ...
That's where the real innovation is needed and where there is a lot of room for improvement.
And if you want to stay with hardware, it is probably much easier to convince investors if you can show, in a simulation or FPGA prototype, that 2 or 3 real-world applications with very bad performance on classical machines can achieve a substantial boost on your architecture...
Unlike GPUs and other SIMD accelerators, Neo's MIMD processor design leverages independent program counters and instruction registers in each core to allow different operations to be performed in parallel on separate pieces of data.[1]
Other than the grid interconnect, how does the architecture differ from Xeon Phi? In particular, what allows it to get such dramatically higher power efficiency? I'd have guessed that Intel's smaller process would make it difficult to match.
[1] It feels awkward that this sentence is included twice on such a short page.
Thanks for that... didn't notice that we had a repeated section (I just updated it with the proper text for that section)
As for comparison to the Phi... it is the fact that our actual core size (and thus the whole chip) is MUCH smaller. The Phi's cores are actually based on the original Pentium architecture (they are just downsized P54Cs) with added AVX instructions. Contrary to popular belief, they are not full x86-64 cores.
Intel has not released the official die size for the Phi, but has said it is about ~5 billion transistors (at a 22nm process), and independent "guesstimates" have pegged the die at 600-700mm2 (there is one place that says 350mm2, but that is false). For the top of the line 61 core Xeon Phi, it uses 300W, with a theoretical peak performance of 1.2TFLOP of double precision performance. That gets you to about 4GFLOP/Watt.
In comparison, our entire chip will be under 100 million gates, with each core (excluding memory) being around 100k gates. At a 28nm process, our core size (without memory) is a little under 0.1mm2. Our theoretical peak performance per compute chip is 256 GFLOPs double precision, while it should use around 3 Watts, giving us a performance-per-watt ratio of ~85 GFLOPs/Watt.
Intel has even said that their next generation Xeon Phi, made on their 14nm process, will be at 14 to 16 GFLOPs/Watt. At SC14 last week, they made a soft announcement that the following generation, on a 10nm process, will arrive in the ~2018 timeframe, and that is estimated at only around 25 GFLOPs/Watt.
The bottom line is Intel is just following Moore's law, and is sticking to big and complex systems, which while retaining legacy compatibility, kill you when it comes to efficiency.
Interesting, and a great answer. The usual estimates I've seen put the power efficiency overhead of the legacy x86 layer to be small enough to not be a major factor (http://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa-...). But I suppose as you shrink the core smaller and smaller, that small mostly-fixed difference becomes a greater and greater part of the total power budget.
I discount most people who continue the RISC/CISC debate to this day (as Intel forfeited in ~2006... all modern x86 processors are actually RISC processors at the lowest level; they just decode x86 instructions and translate them into internal RISC micro-ops).
Then again, I don't think most popular "RISC" CPUs are very RISC-like (does that make me a RISC hipster?). While it makes things general purpose, there is no reason for ARM processors to be OoO, do branch prediction/speculative execution, or any more of these crazy things... it reduces efficiency in the long run.
Take a look at this for instance: http://chip-architect.com/news/2013_core_sizes_768.jpg ... an AMD CPU and an ARM chip are virtually the same size at the same process node. I find it insane that one of our cores is a bit less than 1/5th the size of a Cortex-A7, and can do more FLOPs than it. Then again, we have focused on doing exactly that, but still.
I think it's fair, despite long-time prejudices, to think of the x86 architecture as midway between RISC and CISC: most instructions (push etc. being the notable exceptions) make a single memory reference, and no addressing mode does a memory indirect or has addressing side effects (auto-increments, etc.). That means instructions are retryable (they never need to be restarted in the middle after a page fault) with a single dcache/TLB port.
Intel's insistence on x86 and dragging old designs around just isn't sustainable in the long run, especially once silicon gets to its absolute physical limits and Intel loses its traditional process technology advantage for a while (until a new material is introduced). Knights Landing may be a temporary reversal of the killer-micro order, but I give a higher chance to IBM+NVIDIA winning the next round, maybe followed by an architecture like yours.
My first thought was how this compares to Adapteva's Epiphany architecture (best known for its use in their crowd-funded Parallella boards), so I was happy to see this project was inspired by that one. Andreas Olofsson from Adapteva has tweeted often about how difficult it is to get funding as a silicon start-up, though. I think the statement "it's a tough sell" will prove to be a colossal understatement. But these are impressive kids and it's a great concept - I wish them luck.
I was working with the Parallella boards for a long time... but there are fundamental flaws in the architecture (missing instructions, it is only 32-bit, not IEEE 754-2008 compliant, etc.) that make the Epiphany architecture not really suitable for the markets we are trying to address. When we decided to go out and make our own, we knew there were going to be a lot of difficulties, but the idea is that if we are going to do a startup and give it our all, why not try to do something big?
Because we are using mostly open source tools for our development (such as Chisel: chisel.eecs.berkeley.edu), our development time has decreased and productivity has had a huge boost compared to if we were just writing Verilog. We also made the decision to not use off-the-shelf IP, which, while more difficult to verify, we think makes for a much better system. Compare this to the Epiphany implementations, which, while having a custom ISA and basic core components, used an off-the-shelf ARM interconnect, an off-the-shelf ARM memory compiler, and many other things chosen to minimize development time and verification. While our approach is a bit more difficult upfront, since we are keeping our components very simple, our verification is nowhere near as complex as a "normal" processor's. Plus we don't have to pay $500k-$1m+ in upfront licensing fees.
Are they licensing anyone's IP for the interconnect or CPU? What's the bandwidth of the interconnect? Is it packet-oriented? How does fair routing work?
Custom ISA that we have developed... we developed it in parallel to RISC-V (before they released their public 2.0 ISA), but we have diverged a bit... we are a static 64-bit ISA, have no options for VLIW expandability, and our FPU (and thus its instructions) is able to do two single precision IEEE 754-2008 FLOPs per cycle, and one double precision per cycle. We have also added a set of DMA instructions.
We currently have 128KByte dual-ported SRAM per core (which is physically part of the core, and not a giant array somewhere else on die). It has single cycle latency to the core's registers and to the Network on Chip router.
The on chip mesh network is custom 128 bit wide going core to core. The router can do a read or write to SRAM per cycle AND allows a passthrough to another core in the same cycle.
Our chip-to-chip interconnect is a custom 64 bit (72 lane) parallel interface allowing 48GB/s. There are two of these (unidirectional) interfaces per side, giving you a total of 8 of these interfaces per chip.
I realized I didn't fully answer the chip-to-chip interconnect questions... it is an extremely simple parallel interface (that I would not even call a "bus", as that would imply it has a lot more control logic than it has). Instead of having a full serdes per lane, our solution is to use a very simple latch and buffer, along with PLLs on each chip, to get a MUCH (50-70%) smaller and more power-efficient point-to-point connection between chips. As such, it is not packet based, and we are really just focusing on moving a 64 bit word per cycle.
I asked this previously without seeing that you had answered it here. So you're claiming to have a parallel interface without a SERDES running at 64-bits at 6 Gb/s? How are you maintaining bit alignment between lanes? Do you have any in-line signal conditioning or anything (CTLE, DFE, etc.)? Parallel interfaces are rarely run faster than 1 Gb/s, so 6 Gb/s sounds unlikely.
Relevant paper (by Intel Research, actually)... there are quite a few differences (we are keeping it a lot simpler on the tx and rx ends, but that is some of our secret sauce... I can talk about it offline if you are really interested)
Going over 50 cm of Twinax is great and all, but that's the ideal environment. When you guys put 16 of these things on a board, routing that PCB and getting good signal integrity is not going to be fun, even after you throw in pre-emphasis and so on.
Getting anything approximating Intel's work, assuming you are rolling all your own IP, is pretty ambitious (and probably a good deal more work than your Core design). Even at only 6 Gbit/s (faster than PCIe Gen2, btw). Just curious - have you ever taped out a chip on a modern process before?
We aren't trying to reach their 128GB/s number as listed in that paper, only 48GB/s... and our design is much simpler than theirs. In our design, each pin is simply a buffer and a latch, synchronized with all the other pins in the interface by a PLL... much easier to implement and run than a serializer for even just a single pin.
I myself have not taped out something on a modern process, but have advisors who have. My co founder and I do have nanofab experience, so we do understand the physical complexities of fabrication first hand.
Ah, so the idea is to have 64 parallel bits coming in at 6 Gbits/s, along with a slower clock that you multiply up to 6 GHz and use to sample the inputs? That will be quite tricky to get working 1) without any analog signal conditioning on the inputs (or outputs), and 2) without inter-bit skew making it impossible to meet timing on your inputs. Best of luck to you, but the interface alone sounds likely to be problematic.
That's the basic idea, but we can run it at 3GHz if we do DDR, or 1.5GHz doing QDR. There is some extra magic there which I don't want to talk publicly about just yet ;)
The biggest problem (even with our solutions for skew and crosstalk) is just the number of pins/traces on the board, but that's not unsolvable... nothing a ~10 layer PCB can't solve.
I wonder what kind of penalties there are for transmitting data to neighboring nodes. Same for receiving. If, for example, every node receives data from a neighboring node, does a single fused multiply-add, and transmits the result to a neighboring node, how many FLOPS do you get out of the whole thing?
How big a chunk of computation do you need to do in a node for this to be effective?
Our theoretical double precision rate (based on running at 1GHz) is 1 GFLOP/s per core, but that is based on doing an add or a multiply. In the case of just doing FMAs, you would be doing 1.5 GFLOPs/s. This gives you a (theoretical peak) 256 to 384 double precision GFLOPs per chip. Our bandwidth between cores is 16GB/s, while the required bandwidth for sustaining 1 GFLOP/s is 8GB/s.
Our single precision numbers are actually double that of double precision (compared to a GPU or most other SIMD systems, which have independent FP32 and FP64 FPUs, we have a single combined unit). While our ISA is pure 64 bit, we have packed 32 bit FPU instructions for doing two single precision FLOPs per cycle.
I forgot to add why the separate FP32 and FP64 floating point units are "bad" (from our PoV)... having the separate units just increase complexity, and NVIDIA GPUs, for instance, have 4x the number of FP32 units compared to FP64 units, thus making their single precision numbers (which are what they typically advertise) 4x that of their double precision numbers.
So how many can I do if for each FMA I also receive 16 bytes of data from a neighboring node and send 16 bytes of data to a neighboring node? Or is the data transfer free - is the neighboring node memory mapped? If so, how does synchronization work?
Edit: didn't notice same was asked before too. Regardless, how many FMAs can be executed in the scenario I gave, also sending 16 bytes and receiving 16 bytes for each FMA?
The superscalar design means the Load/Store unit can operate independently from the FMA and DMA units, i.e. an FMA operation on register operands will not interfere with other operations elsewhere on the chip.
Synchronization should be handled in a dataflow-driven program design. Any sort of mutex/semaphore/etc. will have to be software defined or interrupt-driven.
In regards to neighboring node being memory mapped... are you asking about another compute chip or another node (with the 16 compute chips + GaMMU)? All of the compute chips in a grid are part of the same flat global memory map, and have DMA capabilities between each other. Once you get out of the compute grid (that is managed by the GaMMU), then that is a separate memory space, but can be accessed through some other layer through something like MPI, for instance.
I guess this is one of those cases where you just need to get your hands dirty to really understand it.
Can you do reasonably efficient [arbitrary size] fixed point arithmetic on your hardware? Do you have 64-bit add with carry and 64-bit multiply with 128 bit results? I'm interested in 64.64 and 128.128. Needed operations are add, sub and multiply.
I think compute grids like these could very well be an important part of computing in the future, maybe even the most important part. Ever since when I first saw an article about transputer. Grids or VLIWs, sadly software is always the pain point. I wish you luck, please get this working and right.
First of all, good luck for you guys.
I've worked in a start-up company similar to your company.
We developed a 256-cores RISC processor, only shared memory was used between all the cores instead of a mash-up of a memory block for each core and DMA for transactions.
How do you intend to synchronize work between the different cores? How a compiler will abstract away the memory synchronizations? Which programming language is going to be used? So many questions as this is such a complex area in computing...
From my personal experience of over 5 years developing such chip in a start-up company, the cost of production will probably be a huge obstacle. Good luck!
Thanks! If you don't mind me asking, what was the name of the company?
Technically, you don't need to synchronize the cores... it's a MIMD/MPMD system that is not in lockstep. It is up to the programmer (with help from the compiler) to make sure you don't do anything too stupid ;) As for the programming languages, C and Fortran are the big ones to us. We hope once our LLVM backend is improved, you'll be able to run anything you like on the cores themselves. As for the programming model, the three we like the best are CSP (Go), Actor (Erlang), and PGAS (C, Fortran, Chapel, a few other research ones). If you're familiar with SHMEM, that's the closest thing we can think of currently.
When it comes to cost, we're trying to develop as much ourselves to reduce licensing costs. We've been able to do a pretty good job (if I do say so myself) as just two people so far with no capital. Fabrication costs are the killer, with it being about ~$500k per shuttle run, and $5m-$7m for a mask set when we actually go to full production.
http://plurality.com/ - the website is not very good and the company is dead. You can read more information on Wikipedia: https://en.wikipedia.org/wiki/Plurality_%28company%29
All our cores shared the same memory for both data and instructions, and we based our synchronization of workload on hardware instead of software. It yielded such a huge speedup in execution time that most companies simply dismissed our results as fake. :)
Choosing a programming language is crucial - we went with C and a declarative language for tasks. Today (4 years after we closed the company) I am not sure whether it was the best decision. The simpler the parallel definitions in code, the better. Programmers get confused easily.
I'm the founder and CEO of REX... Check out our website for a brief overview (http://rexcomputing.com) and feel free to ask questions here!
Impressive for your age. But how is this any different from the millions of identical proposals that never got anywhere? Packing plenty of ALUs with some sort of basic network hasn't really worked out in real life. The first block diagram in the article is so basic it's concerning.
Apologies for being critical, I wish you the best.
The article is incorrect in saying that the (very simplistic... it's supposed to just show the basic components per core, not how it actually functions) diagram for a single core is the whole chip... our chip design has 256 of those cores (so in total, 256 64-bit ALUs and 256 64-bit FPUs). We have a custom mesh network that allows for atomic operations from any core to any adjacent core, and DMA operations that allow for a read or write to any core's local scratchpad memory from any other scratchpad memory.
In comparison to other architectures, we have chosen to stick to RISC, instead of some crazy VLIW or very-long-pipeline scheme. In doing this, we limit compiler complexity while still having a very simple/efficient core design, thus hopefully keeping every core's pipeline full and free of hazards. The idea is that we just want a bunch of very simple and focused SPMD cores, so that we can have a MIMD/MPMD chip.
We are currently fixing the bugs on our single core FPGA demo, and hope to have our full 256 core cycle accurate simulator done by ~January/February. We want to release that (and our currently very early compilers) to the public ASAP.
I should also point out that it isn't just a bunch of ALUs or FPUs connected by a network... each core has its own full execution logic, making each core a full 5-stage superscalar RISC core that can independently operate on its own instruction and its own piece of data.
This looks pretty exciting.
How much is the scratchpad memory in each of the processors?
128K per core, so 32MB per chip. This is really just a fabrication/size limitation... we are planning for 28nm right now, and don't want our memory to use (too much) more power than our actual compute logic ;)
If we can go to 14/16nm in the future, we are planning 512 and 1024 core versions with different amounts of memory depending on if you are memory or compute bound.
This tripped me up a bit
>Local scratchpad memories are physically addressed as part of a flat global address space.
So from the programmer's perspective each core will have a block of the address space, i.e. 0-255, 256-511, 512-767, 768-1023, etc.?
Or is there address translation between units? Or if a thread is just built to arbitrarily execute on a unit, will it have to pre-process its position for name space translation?
Also, is there memory locking in the local scratchpad? (I may be reaching.)
At the physical level, each core would have a 128K block in ascending order, but the idea is that our compilers would abstract that away from the programmer, and allocate memory based on what core needs it, placing it in that core's scratchpad, or as close as possible. The DMA capabilities abstract it when actually accessing memory, as it is just a series of (very simple) comparators that get the request to the right block of memory.
So the scratchpad isn't so much programmer controlled as just a L1 cache level?
I would not call it a cache, as it is: 1) physically addressed, 2) free of complex control logic (TLBs, CAMs, additional flags), and 3) not simply replicating something higher up in a memory hierarchy.
A programmer will have full access to be able to handle memory however they want, but we want to be able to build out the tools to allow for a programmer to treat it similarly to an L1 cache (that is part of a shared memory space... that has different access times)
Interesting. I look forward to playing around with one.
So the scratchpad memory contains instructions as well as data?
@wglb: That is the plan for now. An instruction cache is possible, but we don't see the need to add extra complexity.
The article was unclear about your open source plans. Will you be opening your HDL, and under what license?
So our current initial plans are to open source the ISA through the Open Compute Project (http://opencompute.org), using their reciprocal license. The HPC group within OCP that I co-lead has plans for HDL and RTL level submissions, but we are not sure what the timeframe or what would be included from REX (yet).
The business case for open sourcing the ISA is that we want others to be making compatible chips. As a small startup with a new architecture, it would be GREAT if others were to make competing chips, as it would only further the architecture and software ecosystem, making it more competitive with the existing market incumbents.
To us, our floating point unit is our "secret sauce", but as an open source enthusiast, I don't want that to be locked up forever. My general idea right now is that we want to be able to tape out our first chip (at least the prototypes for it), and will open source the HDL for the non "secret sauce" parts of it. As we get onto further generations, I do want to open source the full design of our previous chips for free/open use.
Thanks for the answer. Another quick question. One of the limitations of Parallella is the off-chip memory bandwidth. Will you be integrating a third-party PHY?
Our chip-to-chip bandwidth is currently 48GB/s per parallel interface, and we have 8 of those per chip (2 per side), giving us an aggregate bandwidth of 384GB/s.
The Epiphany 4's interfaces are 1.5GB/s serial (12.5Gb serdes), and there are 4 of those per chip, giving you an aggregate bandwidth of 6GB/s.
If you can spend a bit more power (or wait for 14/16nm process), we think it is possible to double that to 96GB/s per interface. Using some more exotic methods (which are purely in an idea stage right now, not tested, and a couple of years away at best), we think it is possible to get that up to 128GB/s per interface.
Can you provide some more details on your parallel interface? As an ASIC designer, these numbers sound extremely fishy for a non-serdes interface. A 128-bit parallel interface (which is a lot of pins, especially if you need 8 links) would need to be running at 3 GHz, for instance, to hit that. Is it differential/single-ended? Source-synchronous? How big of a package are you planning?
(Copied from another comment on this page):
Relevant paper (by Intel Research, actually)... there are quite a few differences (we are keeping it a lot simpler on the tx and rx ends, but that is some of our secret sauce... I can talk about it offline if you are really interested) http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=648778....
FYI, That link doesn't work, but the one you posted here: https://news.ycombinator.com/item?id=8659653 does.
Is there a plain text version of the overview on the page you linked? The current version is very pretty as a homepage, but a short white paper would help put the focus on the information.
Personally, I found that the slide-ins and wobbly text made it difficult to keep my place. I tried Readability, but it didn't do well on it. Scrolling to the bottom to let the glitz run its course, and then scrolling back up to the top to read helped in the end.
Very impressive.
How does this compare to other multi-core processors / architectures e.g. epiphany http://www.adapteva.com/introduction/
There are a lot of differences, and I would recommend reading through this page... the basics versus Epiphany (and the same can be said about most others) are that we designed from the start for very high memory/IO bandwidth (both on and off chip), stuck to RISC, and focused on floating point (we are one of the few architectures in this area that is fully IEEE compliant at this high a performance-per-watt ratio).
Hi, trsohmers,
have you ever heard of Epiphany[1]? They claim to achieve 70 GFLOPS/Watt. The processor also seems to be fairly fast, and they manage to put 4096 cores on a chip, though there has been no activity recently. Maybe you could find some collaboration points with them.
[1] http://www.adapteva.com/epiphany-multicore-intellectual-prop...
I also wish you good luck and recommend a change of focus: first figure out what kind of real-world problems this architecture can handle and how to do so best. The architecture will perform very well if the number of operations that can be done on a set of data fitting into the scratchpad memory is large, so that communication overhead is small, or if the problem can be mapped to the grid in a way that requires communication only with neighbours. However, I would assume that typical real-world big data problems don't fulfill these requirements, and problems that do fulfill them also run well on classical architectures. As soon as you start to need a lot of data transfer, only the cores close to the border of the grid will be able to work, since they are the ones that get data; the cores at the border are busy forwarding data to and from memory, and the innermost cores just wait.

Therefore, before investing a lot of effort, money, and time into hardware that many others are also building in very similar ways, and then asking the community to find out how to use it, spend the energy on innovative ideas for actually using such architectures efficiently: languages, profiling and debug tools, and so on. That's where the real innovation is needed and where there is a lot of room for improvement. And if you want to stay with hardware, it is probably much easier to convince investors if you can show, in a simulation or FPGA prototype, that two or three real-world applications with very bad performance on classical machines achieve a substantial boost on your architecture.
Unlike GPUs and other SIMD accelerators, Neo's MIMD processor design leverages independent program counters and instruction registers in each core to allow different operations to be performed in parallel on separate pieces of data.[1]
Other than the grid interconnect, how does the architecture differ from Xeon Phi? In particular, what allows it to get such dramatically higher power efficiency? I'd have guessed that Intel's smaller process would make it difficult to match.
[1] It feels awkward that this sentence is included twice on such a short page.
Thanks for that... didn't notice that we had a repeated section (I just updated it with the proper text for that section)
As for comparison to the Phi... it comes down to the fact that our actual core size (and thus the whole chip) is MUCH smaller. The Phi's cores are actually based on the original Pentium architecture (they are just die-shrunk P54Cs) with added AVX instructions. Contrary to popular belief, they are not full x86-64 cores.
Intel has not released the official die size for the Phi, but has said it is about ~5 billion transistors (at a 22nm process), and independent "guesstimates" have pegged the die at 600-700mm2 (there is one place that says 350mm2, but that is false). For the top of the line 61 core Xeon Phi, it uses 300W, with a theoretical peak performance of 1.2TFLOP of double precision performance. That gets you to about 4GFLOP/Watt.
In comparison, our entire chip will be under 100 million gates, with each core (excluding memory) being around 100k gates. At a 28nm process, our core size (without memory) is a little under 0.1mm2. Our theoretical peak performance per compute chip is 256 GFLOPs double precision while using around 3 Watts, giving us a performance-per-watt ratio of ~85GFLOPs/Watt.
Intel has even said that their next-generation Xeon Phi, made on their 14nm process, will be at 14 to 16GFLOPs/Watt. At SC14 last week, they made a soft announcement that the following generation, on a 10nm process, will arrive in the ~2018 timeframe and is estimated at only around 25GFLOPs/Watt.
The bottom line is that Intel is just following Moore's law and sticking to big, complex systems, which, while retaining legacy compatibility, kill you when it comes to efficiency.
Interesting, and a great answer. The usual estimates I've seen put the power efficiency overhead of the legacy x86 layer to be small enough to not be a major factor (http://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa-...). But I suppose as you shrink the core smaller and smaller, that small mostly-fixed difference becomes a greater and greater part of the total power budget.
I discount most people who continue the RISC/CISC debate to this day (Intel forfeited in ~2006... all modern x86 processors are actually RISC processors at the lowest level; they just decode x86 instructions and translate them into Intel's RISC microcode).
Then again, I don't think most popular "RISC" CPUs are very RISC-like (does that make me a RISC hipster?). While making things general purpose, there is no reason for ARM processors to be out-of-order, do branch prediction/speculative execution, or any more of these crazy things... it reduces efficiency in the long run.
Take a look at this for instance: http://chip-architect.com/news/2013_core_sizes_768.jpg ... an AMD core and an ARM core are virtually the same size at the same process node. I find it insane that one of our cores is a bit less than 1/5th the size of a Cortex-A7 and can do more FLOPs than it. Then again, we have focused on exactly that, but still.
Your .1 mm^2 was without memory, right? I don't know the exact numbers, but to be fair, the ratio does become somewhat closer when you discount the area on the A7 used for memory: http://www.arm.com/images/Single_Cortex-A7_core_layout_image...
That is correct, but even with memory we should only be around .2mm^2 to .3mm^2, while having 4x the SRAM as the Cortex-A7.
I think it's fair, despite long-time prejudices, to think of the x86 architecture as midway between RISC and CISC: most instructions (push etc. being the notable exceptions) make a single memory reference, and no addressing mode does a memory indirect or has addressing side effects (auto-increments etc.). That means instructions are retryable (they never need to be restarted in the middle after a page fault) with a single dcache/TLB port.
Intel's insistence on x86 and dragging old designs around just isn't sustainable in the long run, especially once silicon gets to its absolute physical limits and Intel loses its traditional process technology advantage for a while (until a new material is introduced). Knights Landing may be a temporary reversal of the killer-micro order, but I give a higher chance to IBM+NVIDIA winning the next round, maybe followed by an architecture like yours.
My first thought was how this compares to Adapteva's Epiphany architecture (best known for its use in their crowd-funded Parallella boards), so I was happy to see this project was inspired by that one. Andreas Olofsson from Adapteva has tweeted often about how difficult it is to get funding as a silicon startup, though. I think the statement "it's a tough sell" will prove to be a colossal understatement. But these are impressive kids and it's a great concept - I wish them luck.
I was working with the Parallella boards for a long time... but there are fundamental flaws in the architecture (missing instructions, it is only 32-bit, not IEEE 754-2008 compliant, etc.) that make the Epiphany architecture not really suitable for the markets we are trying to address. When we decided to go out and make our own, we knew there were going to be a lot of difficulties, but the idea is that if we are going to do a startup and give it our all, why not try to do something big?
Because we are using mostly open source tools for our development (such as Chisel: chisel.eecs.berkeley.edu), our development time has decreased and productivity has gotten a huge boost compared to just writing Verilog. We also made the decision not to use off-the-shelf IP, which, while more difficult to verify, we think makes for a much better system. Compare this to the Epiphany implementations, which, while having a custom ISA and basic core components, used an off-the-shelf ARM interconnect, an off-the-shelf ARM memory compiler, and many other things chosen to minimize development time and verification. While our approach is a bit more difficult up front, since we are keeping our components very simple, our verification is nowhere near as complex as a "normal" processor's. Plus we don't have to pay $500k-$1m+ in upfront licensing fees.
Does anyone know what the actual CPUs are? It mentions it has 64 registers? My guess is ARM / MIPS, based upon: http://en.wikipedia.org/wiki/Processor_register
How much scratch memory is there? Is it SRAM?
Are they licensing anyone's IP for the interconnect or CPU? What's the bandwidth of the interconnect? Is it packet-oriented? How does fair routing work?
It's a custom ISA that we have developed... we developed it in parallel with RISC-V (before they released their public 2.0 ISA), but we have diverged a bit... we are a static 64-bit ISA with no VLIW expandability; our FPU (and thus its instructions) can do two single precision IEEE 754-2008 FLOPs per cycle, and one double precision per cycle. We have also added a set of DMA instructions.
We currently have 128KByte dual-ported SRAM per core (which is physically part of the core, and not a giant array somewhere else on die). It has single cycle latency to the core's registers and to the Network on Chip router.
The on chip mesh network is custom 128 bit wide going core to core. The router can do a read or write to SRAM per cycle AND allows a passthrough to another core in the same cycle.
Our chip-to-chip interconnect is a custom 64 bit (72 lane) parallel interface allowing 48GB/s. There are two of these (unidirectional) interfaces per side, giving you a total of 8 of these interfaces per chip.
I realized I didn't fully answer the chip-to-chip interconnect questions... it is an extremely simple parallel interface (that I would not even call a "bus", as that would imply a lot more control logic than it has). Instead of having a full serdes per lane, our solution is to use a very simple latch and buffer, along with PLLs on each chip, for a MUCH (50-70%) smaller and more power efficient point-to-point connection between chips. As such, it is not packet based, and we are really just focusing on moving a 64-bit word per cycle.
I asked this previously without seeing that you had answered it here. So you're claiming to have a parallel interface without a SERDES running 64 bits wide at 6 Gb/s per lane? How are you maintaining bit alignment between lanes? Do you have any in-line signal conditioning or anything (CTLE, DFE, etc.)? Parallel interfaces are rarely run faster than 1 Gb/s, so 6 Gb/s sounds unlikely.
Relevant paper (by Intel Research, actually)... there are quite a few differences (we are keeping it a lot simpler on the tx and rx ends, but that is some of our secret sauce... I can talk about it offline if you are really interested)
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=648778...
Going over 50 cm of Twinax is great and all, but that's the ideal environment. When you guys put 16 of these things on a board, routing that PCB and getting good signal integrity is not going to be fun, even after you throw in pre-emphasis and so on.
Getting anything approximating Intel's work, assuming you are rolling all your own IP, is pretty ambitious (and probably a good deal more work than your Core design). Even at only 6 Gbit/s (faster than PCIe Gen2, btw). Just curious - have you ever taped out a chip on a modern process before?
We aren't trying to reach their 128GB/s number as listed in that paper, only 48GB/s... and our design is much simpler than theirs. In our design, each pin is simply a buffer and a latch, synchronized with all the other pins in the interface by a PLL... much easier to implement and run than a serializer for even a single pin.
I myself have not taped out anything on a modern process, but I have advisors who have. My cofounder and I do have nanofab experience, so we understand the physical complexities of fabrication firsthand.
Ah, so the idea is to have 64 parallel bits coming in at 6 Gbits/s, along with a slower clock that you multiply up to 6 GHz and use to sample the inputs? That will be quite tricky to get working 1) without any analog signal conditioning on the inputs (or outputs), and 2) without inter-bit skew making it impossible to meet timing on your inputs. Best of luck to you, but the interface alone sounds likely to be problematic.
That's the basic idea, but we can run it at 3GHz if we do DDR, or 1.5GHz doing QDR. There is some extra magic there which I don't want to talk publicly about just yet ;)
The biggest problem (even with our solutions for skew and crosstalk) is just the number of pins/traces on the board, but that's not unsolvable... nothing a ~10 layer PCB can't solve.
I wonder what kind of penalties there are for transmitting data to neighboring nodes, and likewise for receiving. If, for example, every node receives data from a neighboring node, does a single fused multiply-add, and transmits the result to a neighboring node, how many FLOPS do you get out of the whole thing?
How big do the chunks of computation in a node need to be for this to be effective?
Our theoretical double precision performance (running at 1GHz) is 1 GFLOP/s per core, based on doing an add or a multiply each cycle. In the case of just doing FMAs, you would be doing 1.5 GFLOP/s. This gives you a theoretical peak of 256 to 384 double precision GFLOPs per chip. Our bandwidth between cores is 16GB/s, while the bandwidth required to sustain 1 GFLOP/s is 8GB/s.
Our single precision numbers are actually double our double precision numbers (unlike a GPU or most other SIMD systems, which have independent FP32 and FP64 FPUs, we have a single combined unit). While our ISA is purely 64-bit, we have packed 32-bit FPU instructions for doing two single precision FLOPs per cycle.
I forgot to add why separate FP32 and FP64 floating point units are "bad" (from our PoV)... having separate units just increases complexity, and NVIDIA GPUs, for instance, have 4x the number of FP32 units compared to FP64 units, making their single precision numbers (which are what they typically advertise) 4x their double precision numbers.
So how many can I do if, for each FMA, I also receive 16 bytes of data from a neighboring node and send 16 bytes to a neighboring node? Or is the data transfer free? Is the neighboring node memory mapped? If so, how does synchronization work?
Edit: didn't notice the same was asked before. Regardless, how many FMAs can be executed in the scenario I gave, sending 16 bytes and receiving 16 bytes for each FMA?
The superscalar design means the load/store unit can operate independently from the FMA and DMA units, i.e. an FMA operation on register operands will not interfere with other operations elsewhere on the chip.
Synchronization should be handled in a dataflow-driven program design. Any sort of mutex/semaphore/etc. will have to be software defined or interrupt-driven.
In regards to neighboring node being memory mapped... are you asking about another compute chip or another node (with the 16 compute chips + GaMMU)? All of the compute chips in a grid are part of the same flat global memory map, and have DMA capabilities between each other. Once you get out of the compute grid (that is managed by the GaMMU), then that is a separate memory space, but can be accessed through some other layer through something like MPI, for instance.
I guess this is one of those cases where you just need to get your hands dirty to really understand it.
Can you do reasonably efficient [arbitrary size] fixed-point arithmetic on your hardware? Do you have 64-bit add with carry and 64-bit multiply with a 128-bit result? I'm interested in 64.64 and 128.128. The operations needed are add, sub, and multiply.
I think compute grids like these could very well be an important part of computing in the future, maybe even the most important part; I've thought so ever since I first saw an article about the transputer. Grids or VLIWs, sadly software is always the pain point. I wish you luck; please get this working and right.