


 [ 33 posts ]  Go to page 1, 2, 3  Next
 Noc - Network on chip 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Attempting to learn by doing here. I'd like to create an on-chip network based around a small CPU such as the one found in the eight-bit challenge. The network should be somewhat more sophisticated than my previous one, which used a parallel bus in a linear arrangement.
I think a simple grid arrangement of possibly up to 64 cores in an 8x8 grid would work. I have a few questions:
1) How to route messages around (hardware and software) ?
2) How to interface to the rest of the system ?
My naïve setup:
Each core would be connected to a router.
The network router would be a pair of specialized serial UARTs (one each for the X and Y directions) that transmit asynchronously. A network message could consist of routing information (16 bytes?) and a message payload (16 bytes plus a CRC?). That's about 260 bits. Each byte of the routing information records a vector of down and to-the-right movements. If it takes more than 16 vectors to traverse an 8x8 grid, the route is assumed invalid.
To interface to system components (keyboard, display, disk) several of the cores could be dedicated to servicing a single device.

_________________
Robert Finch http://www.finitron.ca


Sat Jun 10, 2017 2:39 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1647
Hmm, it feels that with as few destinations as 8x8, one or maybe two bytes of header ought to be enough to get to where you're going. Are your endpoints for a packet the cores themselves, or one of many processes running on the cores?

If each row and column is labelled in the natural way, then a packet not in the right row can move down, and one not in the right column can move right. If both possibilities are valid, then it can do either.

(I'm assuming the 8x8 array wraps around)

BTW one way to arrange wrap around is like this:
Code:
   +->  1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 --+
   |                                           |
   +------<-------------------------<----------+

But another way which avoids a long link is like this:

Code:
   +->  1 ------> 2 ------> 3 ------> 4 -->----+
   |                                           |
   +------<--8 <-------7 <-------6 <-------5 <-+
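One common way to express that folding is to interleave the two halves of the logical ring into a single physical row, so that every logical neighbour sits at most two physical slots away and no single long wrap link is needed. A minimal sketch (my own 0-based numbering; the function name is illustrative):

```c
#include <assert.h>

/* Physical slot (0..n-1, left to right) of logical ring node i in a
 * folded ring of n nodes (n even).  Nodes 0..n/2-1 occupy the even
 * slots left to right; nodes n/2..n-1 occupy the odd slots right to
 * left, so each pair of logical neighbours is at most two physical
 * slots apart. */
int folded_slot(int i, int n)
{
    return (i < n / 2) ? 2 * i : 2 * (n - 1 - i) + 1;
}
```

For n = 8 the left-to-right physical order comes out as 0, 7, 1, 6, 2, 5, 3, 4 (that is, 1, 8, 2, 7, 3, 6, 4, 5 in the diagram's 1-based numbering).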



Sat Jun 10, 2017 5:18 am

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
Multiprocessing has been an interest of mine but not high enough on the list to have spent any time on it. What examples of others' work have you looked into, to learn from? I am reminded of Samuel Falvo's related post at http://forum.6502.org/viewtopic.php?p=7998#p7998 and following regarding an X*Y matrix of computers in a multiprocessing system.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Sat Jun 10, 2017 7:06 am WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 206
Location: Huntsville, AL
I think that you can do what you want with a single byte, and only six bits of the byte required.

Since 2^6 = 64, the processor ID can be segmented to match a rectangular array of processors: in an 8 x 8 grid, one group of three bits identifies the row number and the other the column number. A single byte would precede each packet. On receipt of this byte, the UART would examine the row and column fields; if the fields don't match its ID, the UART would queue the packet for transmittal, first in the row direction and then in the column direction. On initialization, each processor would establish the row/column of its UARTs. The processors on the edges would only have two valid connections.

For example, to transmit a packet from processor #0 (6'b000_000) to processor #63 (6'b111_111): processor #0 has two ports, N and W, which are not connected. Let's say the packet is transmitted first down the column until the row # matches, and then along the row. So processor #0 would transmit the packet to processor #8 (6'b001_000) using the S UART, which would forward it to processor #16 (6'b010_000), which would forward it to processor #24 (6'b011_000), etc., until it arrives on the N port of processor #56 (6'b111_000). Since the row # now matches, processor #56 sends the packet out its E UART port to processor #57 (6'b111_001). This process proceeds along row 7 until the packet reaches the last processor in the row, the destination.
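That walk-through is plain dimension-order routing: match the row number first by moving along the column, then the column number by moving along the row. A sketch of the per-node decision, with my own direction names and assuming a flat grid without wrap-around links:

```c
#include <assert.h>

enum dir { DIR_N, DIR_S, DIR_E, DIR_W, DIR_LOCAL };

/* IDs are 6 bits: row in bits 5:3, column in bits 2:0.
 * Correct the row first, then the column. */
enum dir next_hop(int cur, int dst)
{
    int crow = (cur >> 3) & 7, ccol = cur & 7;
    int drow = (dst >> 3) & 7, dcol = dst & 7;

    if (crow < drow) return DIR_S;   /* move down the column */
    if (crow > drow) return DIR_N;
    if (ccol < dcol) return DIR_E;   /* then along the row */
    if (ccol > dcol) return DIR_W;
    return DIR_LOCAL;                /* packet has arrived */
}

/* Count the hops a packet takes under this policy. */
int hops(int src, int dst)
{
    int cur = src, n = 0;
    for (;;) {
        switch (next_hop(cur, dst)) {
        case DIR_S: cur += 8; break;
        case DIR_N: cur -= 8; break;
        case DIR_E: cur += 1; break;
        case DIR_W: cur -= 1; break;
        case DIR_LOCAL: return n;
        }
        n++;
    }
}
```

This reproduces the example route: #0 to #63 takes 7 south hops followed by 7 east hops, 14 in total.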

If the connections are modified so that all four UARTs are connected, then the number of processors that a packet needs to traverse can be drastically reduced. For example, if processor #0 North connects to processor #56 South, then the initial transfer could have been made from processor #0 to processor #56, and then along the row from processor #56 to processor #63. Instead of taking 14 hops to go from processor #0 to processor #63, the new route would only require 8 hops.

I suppose those 32 spare UARTs along the outside of the grid could also be folded back into the grid. For example, if the corners connect to the middle processor on the opposite side of the grid, then it would be possible to route from processor #0 (out its North port) to processor #63 by sending the packet first to processor #60 and then along the row to processor #63. Making the outside edges connect in this way complicates the routing decision somewhat, but probably reduces the number of hops a packet is required to make.

Congestion control and re-routing is a matter best left to another day, or to a simulation.

_________________
Michael A.


Sat Jun 10, 2017 3:26 pm

Joined: Fri May 05, 2017 7:39 pm
Posts: 22
Methinks the organization of that network and its interconnects is application-dependent:
- if you have, or cannot avoid, a lot of communication between different processors, then an x-y matrix or an x-y-z cube (... hypercubes) might be advantageous.
- if you have a high workload for each processor and less cross-communication, then a linear arrangement is a neat solution.

The latter, the so-called "farm", was my favorite choice when playing with Transputers and rendering "Apfelmännchen" (the Mandelbrot set) - what else :roll:

Each transputer had a workload_in channel connected to the previous transputer's workload_out channel, and a result_out channel connected to the previous transputer's result_in channel. The last one had its workload_out channel connected to its own result_in channel to close the path.

The top-level processor pushed workloads downstream until a workload came back on the result path - which means all transputers have work. As each transputer fetches two workloads (one for immediate processing, one to process next without waiting), the top-level processor then only needs to wait until a result package arrives; the next workload can then be issued, keeping the farm busy.

Pretty easy to setup and maintain :)
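The priming policy above amounts to simple bookkeeping: with two buffered workloads per transputer, the master can have at most 2 x N jobs in flight before it must wait for a result. A toy model (names are mine; this is the arithmetic, not real channels):

```c
#include <assert.h>

/* Each transputer buffers two workloads (one being processed, one
 * queued), so at most 2 * n_workers jobs can be in flight at once. */
int farm_capacity(int n_workers) { return 2 * n_workers; }

/* Number of workloads the master pushes downstream before it first
 * has to wait for a result to come back on the result path. */
int farm_prime(int n_workers)
{
    int in_flight = 0;
    while (in_flight < farm_capacity(n_workers))
        in_flight++;              /* issue another workload */
    return in_flight;
}
```

With eight workers, for instance, the master issues 16 jobs before its first blocking wait; after that it is one result in, one new workload out.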


Sat Jun 10, 2017 5:06 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Thanks, it's obvious to me now that so many bytes for routing aren't required. I was thinking along the lines of recording the nodes traversed, but only a delta really has to be maintained.

There are only two bits required for each routing choice (down/same or right/same), so if room is provided for 16 choices (double the maximum distance between nodes), that's only 32 bits. Six more bits are required for each of the target and source node IDs, but they'll be stored in bytes, so that's another 16 bits. It might be possible to get away with something like a 96-bit packet then (48-bit header + 48-bit payload). I'm not sure what would go in the payload, but it's probably at least one address pointer (24 bits?) plus one data byte and a control byte.
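Those numbers suggest a layout something like the following. This is a hypothetical sketch of the 96-bit packet; the field names, the delta encoding, and the exact split are my guesses from the figures above, not a settled format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 96-bit packet: a 48-bit header (16 two-bit routing
 * deltas plus byte-wide source/destination node IDs) and a 48-bit
 * payload (24-bit address pointer, data byte, control byte, with
 * one spare byte left over). */
typedef struct {
    uint32_t route;      /* 16 x 2-bit deltas, choice 0 in the low bits */
    uint8_t  src, dst;   /* 6-bit node IDs stored in full bytes */
    uint8_t  addr[3];    /* 24-bit address pointer */
    uint8_t  data, ctrl, spare;
} packet_t;

/* Routing choice k (0..15), e.g. 00 = stay, 01 = right, 10 = down. */
unsigned route_delta(uint32_t route, int k)
{
    return (route >> (2 * k)) & 3u;
}
```

Keeping the deltas in one 32-bit word means a router can shift the field right by two as it consumes its own choice and forwards the rest.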
Quote:
If the connections are modified so that all four UARTs are connected

I was going to have just two UARTs in a node, each with only a single direction of transmit/receive: receive from the previous node, transmit to the next node. A standard UART with receive FIFOs will be used, and the processing core is going to have to do a lot of work managing the network. The UART may be bigger than the processing core, resource-wise. I started to code a custom UART but then realized there wouldn't be enough room in the FPGA for a large number of nodes.

There are 136,000 LUTs to play with; at 500 LUTs or less per core, that's 272 cores that would fit. However, assuming a node is actually 3x larger than the core, only 90 nodes would fit. I'm going to try for 64, but that might have to be reduced. I coded up a single node, but the synthesis tools won't tell me how large it is. It reports 113 LUTs (which can't be right). When I open the elaborated design it shows all the components present. Adding up the number of cells reported by elaboration gives about 700 for a node.

Quote:
Congestion control and re-routing is a matter best left to another day, or to a simulation

I am going to get things working first, but I want the room reserved in the packet for routing deltas. I don't want to change the packet later.

Quote:
Multiprocessing has been an interest of mine but not high enough on the list to have spent any time on it. What examples of others' work have you looked into, to learn from?
Multiprocessing struck me as the current state of the art; having read about Google's computers, I thought maybe I should spend more time on it. I've not yet looked into others' work in a big way. I've read a few articles - there was one in BYTE magazine using 8048s, I think - and there was the university project BigEd pointed out.

_________________
Robert Finch http://www.finitron.ca


Sat Jun 10, 2017 6:17 pm WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1647
You'll find a lot of transputer-related material at http://wotug.org/parallel/

I'm thinking about the UARTs. A UART moves bytes from one place to another, using only one data wire, and allowing the two ends to have different clocks. But on-chip, you have the same clock. I think a byte-wide bus will be a cheaper way to move a byte. It's more wires, but then your links are local. And it's less complexity at each end.


Sat Jun 10, 2017 6:36 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Quote:
I'm thinking about the UARTs. A UART moves bytes from one place to another, using only one data wire,
I changed the UART to a non-standard one that transmits/receives 128-bit packets rather than bytes. It also takes care of routing the messages now, offloading the task from the CPU core. The UART is really simple; I don't know if I could make one transmitting/receiving byte values any simpler. It turns out the UART could be made a little beefier, as there are LUT resources to support it. Almost all the block RAM in the device is used, but only half the LUTs are used. A byte-wide transmit/receive might also have too much bandwidth, which would go unused.
Quote:
But on-chip, you have the same clock.
Yes, it's the same clock (33MHz for now). The clock is used directly as the 16x baud clock, so that gives a rate of about 2 Mbit/s. This could easily be doubled or quadrupled by using an 8x or 4x baud clock. The packet rate is about 15kHz. I'm not sure how fast the packets can be processed by the CPU core; 15kHz leaves only about 2,200 clock cycles between packets.
Been working on the router. The router takes care of aging the packets, filters out messages not intended for the node, filters in global broadcast messages, and propagates messages. Messages older than 64 hops disappear automatically. It does not yet record the message route.
The system, as tried for the first time in the FPGA, doesn't work at all. The text display doesn't even appear correctly.
This is all mainly academic, as the cores don't have enough memory to do a lot of work. There's only a single 4kiB block RAM for each core (16kiB ROM). It might be possible to trade off ROM for more RAM. The boot node requires more ROM, but the other nodes should be able to execute code from RAM; they only need message-receiver code in ROM.
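The router's filtering rules reduce to a small per-packet decision. A sketch, where the broadcast address value and the idea of carrying an explicit hop count in the packet are my assumptions:

```c
#include <assert.h>

enum action { PKT_ACCEPT, PKT_FORWARD, PKT_ACCEPT_AND_FORWARD, PKT_DROP };

#define BROADCAST_ID 0xFF   /* assumed global-broadcast address */
#define MAX_AGE      64     /* packets older than 64 hops disappear */

/* Per the rules above: age out stale packets, keep packets addressed
 * to this node, keep-and-propagate global broadcasts, and forward
 * everything else toward its destination. */
enum action route_packet(int dst, int my_id, int age)
{
    if (age >= MAX_AGE)      return PKT_DROP;
    if (dst == BROADCAST_ID) return PKT_ACCEPT_AND_FORWARD;
    if (dst == my_id)        return PKT_ACCEPT;
    return PKT_FORWARD;
}
```

Dropping on age before anything else is a cheap way to guarantee that a misrouted or looping packet eventually disappears from the network.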

_________________
Robert Finch http://www.finitron.ca


Sun Jun 11, 2017 1:33 pm WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 206
Location: Huntsville, AL
Interesting concept. One thing to consider is the difference allowed in the frequencies of the reference clocks of communicating FPGAs. If there's no communication between devices, then the issue is moot. However, if you intend to allow communication from one device to another, then I think you need to consider the effect of extending the synchronization period from 10 (8,N,1) to 130 (128,N,1) bit times.

robfinch wrote:
There's only a single 4kiB block ram for each core. (16kiB rom)
Aren't we getting a bit politically correct here? :D All computers constructed using base-10 memory addressing of which I am aware have long been obsolete: see ENIAC, IBM 650, etc.

_________________
Michael A.


Sun Jun 11, 2017 2:56 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1647
(Careful now, or we'll have a Decimal Computer Challenge!)


Sun Jun 11, 2017 3:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Quote:
I think you need to consider the effect of extending the synchronization period from 10 (8,N,1) to 130
I had not thought of that. As it is, it would require clocks accurate to better than 0.1%. Back to the drawing board, maybe. I could change those 130 bits to 160 bits (10 bits for each byte) to allow start/stop bits for every byte, but I think I'll leave it as-is for now. The on-chip network is likely to remain a custom design of some sort. I could use the board's onboard Ethernet PHY to interface to other boards, but I haven't been able to get an Ethernet controller working yet.
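Rough arithmetic behind that figure, assuming the receiver resynchronizes only on the start edge and must still sample inside the last bit cell: the accumulated drift over the whole frame must stay under about half a bit time, so the allowed mismatch shrinks as the frame grows.

```c
#include <assert.h>

/* Worst-case fractional clock mismatch an async frame of n bits can
 * tolerate if the receiver resamples only at the start edge and must
 * land within half a bit time at the centre of the last bit. */
double max_clock_mismatch(int frame_bits)
{
    return 0.5 / (frame_bits - 0.5);
}
```

That gives roughly 5% for a 10-bit frame but under 0.4% for a 130-bit frame, and that budget is shared between the two ends (and eroded by the 16x sampling granularity), so a per-clock figure around 0.1% is in the right ballpark.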

Test #2 coming up. I hope to at least see an LED light up indicating the boot node is working. If things go perfectly there should be an 8x8 display of characters on-screen indicating that all the nodes reset. The likelihood of that working is almost zero. Test #1 didn't work at all, but I found and fixed numerous problems.

_________________
Robert Finch http://www.finitron.ca


Sun Jun 11, 2017 7:15 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
After several more tries and fixes, the LED lights up indicating the boot node at least got to the reset point. If one node works then likely all the nodes work (they're all identical). There doesn't appear to be a working network yet. The next step is to generate an on-screen display showing which nodes got the reset message. This relies on the network component working.
I goofed and didn't realize it. The system is running at 50MHz; I'd only planned on trying it at 33MHz. 50MHz x 64 cores is 3.2GHz. At four clocks per instruction (memory read time), that's maybe 800 MIPS.

_________________
Robert Finch http://www.finitron.ca


Tue Jun 13, 2017 12:54 am WWW

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
After re-reading the whole topic again, I'm still not sure, or have found conflicting answers about, whether you want lots of networked processors on one FPGA, or you want a grid of FPGAs. If they're all on the same one, I agree that parallel connections seem better, since the point of serial is to minimize the number of traces between ICs or the number of conductors in wires or the size and complexity of the connectors, which don't enter into the picture if you're not going off-chip anyway. But if it is a grid of FPGAs, and you do use serial, why not use synchronous serial? It seems so much simpler than asynchronous (UART) since you don't have the timing requirements. It also allows faster bit rates. For reference, some SPI devices approach 100MHz bit rates.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Wed Jun 14, 2017 2:49 am WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 206
Location: Huntsville, AL
For the more current FPGAs, many serializer/deserializer (serdes) blocks are commonly available on the parts themselves. They work very well and generally operate at speeds ranging from 1.25 Gbps up to 6 Gbps. They are commonly used for inter-FPGA wiring. If my understanding is correct, ultra-high-speed serial interconnects are being used in the high-end FPGAs to connect the multiple dies that comprise the packaged product. The serdes on most current FPGAs today are the ones used to implement PCIe interfaces to/from the FPGA, or SGMII interfaces to 10/100/1000BaseT PHYs and switches. My only problem with the commonly available serdes blocks on FPGAs today is that they normally don't operate below 600 Mbps; to use them below that bit rate, bit replication or a similar technique is required.

For Rob's on chip NOC, a UART interface is still useful. He can just use the interface, and actually perform the transfer in a bit-parallel form. For inter-FPGA communications, he can still use a UART interface, and then time multiplex multiple channels in a bit-parallel manner onto a single high speed serdes bidirectional channel.

_________________
Michael A.


Wed Jun 14, 2017 3:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Quote:
After re-reading the whole topic again, I'm still not sure, or have found conflicting answers about, whether you want lots of networked processors on one FPGA, or you want a grid of FPGAs.
I want both! It's starting with processors on one FPGA; if things go well it'll expand onto other FPGAs.
Asynchronous serial is in use right now because I don't think the processing cores will be able to process at a high enough rate to make use of additional bandwidth. I suppose a faster network with extra bandwidth wouldn't hurt. I am trying to get it working at about 3 Mbit/s. The UART transmit/receive is only about 150 LUTs (the router component is much larger). The FPGA can support much higher bandwidth.

Quote:
For the more current FPGAs, many serializer/deserializer (serdes) are commonly available on the parts themselves
I was looking at the FPGA's serdes resources for connecting more boards. One crazy idea I had was to use the board's HDMI interfaces as a connection between boards rather than for display, the reason being that the board vendor supplies working RGB-to-HDMI / HDMI-to-RGB converter sample cores, which I figure might make a good starting place. The HDMI interface makes use of serdes resources at a decent rate, and the converter pair could be used as a serial device. It'd be transmitting/receiving 24 bits at a time, which could be used for multiple channels.

For software, a modified version of TinyBasic is currently incorporated into each processing core. TinyBasic uses only about 5kB of ROM, and it was already a working piece of software for the Butterfly core. In theory the primary node should be able to push BASIC code out to any worker nodes, but getting it to run isn't coded yet.

_________________
Robert Finch http://www.finitron.ca


Wed Jun 14, 2017 11:07 am WWW