


 Noc - Network on chip 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
A recent endeavor of mine is to build a networked system. I don't have much practical experience in that regard.
I tried (briefly) to find some NoC information, but detailed info was not easily found, so I figured I'd just make something up myself.
Any tips would be welcome.

Here's my starting point. I've chosen a 128-bit packet size, transmitted as parallel data around a ring. I plan on using a network of RISC-V-compatible cores.

Bit(s)      Field
127         GB
126..123    RID
121..118    TID
117         ACK
116..109    AGE
107..0      Payload Area

For a system-interface access the payload area carries: We | Sel4 | Addr32 | Data32


GB – global broadcast
- when this bit is set, all receivers should pay attention to the packet.
RID – ID of the intended receiver
TID – ID of the transmitter
ACK – acknowledges receipt of a packet
AGE – age of the packet in ring cycles
- when a packet gets too old, it is automatically deleted.
ID #0 means the packet is empty (no receiver).
IDs #1 to #4 are used by the nodes.
ID #15 is the system controller – it takes care of interfacing to the outside-world I/O and of aging the packets as they travel around the ring.
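
To keep myself honest about the layout, here's a little C model of the header packing. It's just a software sketch of the encoding in the table above, not anything from the hardware, and the helper names are mine:

Code:
#include <stdint.h>

/* A 128-bit ring packet modeled as four 32-bit words; w[0] holds bits 31..0. */
typedef struct { uint32_t w[4]; } packet_t;

/* Read bits hi..lo of the packet, MSB first. */
static unsigned get_bits(const packet_t *p, int hi, int lo) {
    unsigned v = 0;
    for (int b = hi; b >= lo; b--)
        v = (v << 1) | ((p->w[b >> 5] >> (b & 31)) & 1);
    return v;
}

/* Write bits hi..lo of the packet from the low bits of v. */
static void set_bits(packet_t *p, int hi, int lo, unsigned v) {
    for (int b = lo; b <= hi; b++, v >>= 1) {
        p->w[b >> 5] &= ~(1u << (b & 31));
        p->w[b >> 5] |= (v & 1u) << (b & 31);
    }
}

/* Field positions per the table above. */
#define PKT_GB(p)   get_bits((p), 127, 127)
#define PKT_RID(p)  get_bits((p), 126, 123)
#define PKT_TID(p)  get_bits((p), 121, 118)
#define PKT_ACK(p)  get_bits((p), 117, 117)
#define PKT_AGE(p)  get_bits((p), 116, 109)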

Attachment:
FinNoc.png

_________________
Robert Finch http://www.finitron.ca


Mon Apr 04, 2016 8:14 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
There must be some commonality with the concerns of conventional networks - Token Ring springs to mind. It must have had some advantages and drawbacks which are now well known. Conventional (local) networks are often serial, because wires and connectors are cheaper and there's no bus skew to worry about. An on-chip network could be parallel... but 128 bits sounds very wide to me! It's hard to know whether that will cost too much in utilisation. You could try fitting it (I'm assuming you're on FPGA) even before you have any idea of the protocol.

Whatever your packet size, it seems likely there'll be a higher level of protocol which sometimes needs to send multi-packet messages, so you'll need to accommodate that.

Back in the day, it was popular to try to prove things about a given network: that it's deadlock-free, maintains throughput, is fair, is efficient. Your packet ageing idea looks like it sorts out endlessly recirculating packets. I don't know what other pitfalls there might be.
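
Presumably the controller's ageing rule boils down to something like this - a toy C model on my part, with MAX_AGE and the struct invented:

Code:
#include <stdint.h>

/* One ring slot, reduced to the two header fields the rule needs. */
typedef struct { uint8_t rid, age; } slot_t;

enum { RID_EMPTY = 0, MAX_AGE = 255 };  /* MAX_AGE invented; AGE is 8 bits */

/* Run by the system controller each time a slot passes through it. */
void age_slot(slot_t *s) {
    if (s->rid == RID_EMPTY)
        return;                 /* nothing to age in an empty slot */
    if (s->age == MAX_AGE) {
        s->rid = RID_EMPTY;     /* too old: delete by marking empty */
        s->age = 0;
    } else {
        s->age++;               /* one more trip around the ring */
    }
}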


Mon Apr 04, 2016 12:01 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Just a thought - starvation. If your ring is 1-2-3-4-1, and 1 is sending a lot of data to 4, will it starve out nodes 2 and 3? If 1 sees a message come back consumed, it can fill it with the next payload. But that stops 2 and 3 from ever seeing an empty packet.

But if 1 has to let its own packet go past to give the others a chance, the throughput from 1 to 4 drops to a half: alternately we see one and then zero payloads go around.
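
In toy C terms, the "let it go past" rule might look like this - only a sketch of the policy, with all the names invented, not your hardware:

Code:
#include <stdbool.h>
#include <stdint.h>

/* After a node's own packet comes back consumed, it frees the slot but  */
/* may not refill it immediately, so an empty slot travels on and nodes  */
/* 2 and 3 get a look-in.                                                */
typedef struct { uint8_t rid, tid; } slot_t;   /* rid == 0: empty slot */

typedef struct {
    uint8_t id;          /* this node's id                    */
    uint8_t dest;        /* where our pending payload goes    */
    bool    pending;     /* payload waiting in the TX buffer  */
    bool    must_skip;   /* we just freed a slot: let it pass */
} node_t;

/* Called as each slot passes the node; returns true on transmit. */
bool node_step(node_t *n, slot_t *s) {
    if (s->rid != 0 && s->tid == n->id) {   /* our packet came back  */
        s->rid = 0;                         /* free the slot...      */
        n->must_skip = true;                /* ...but don't reuse it */
        return false;
    }
    if (s->rid == 0) {                      /* an empty slot       */
        if (n->must_skip) {                 /* forfeit this one    */
            n->must_skip = false;
            return false;
        }
        if (n->pending) {                   /* claim it            */
            s->rid = n->dest;
            s->tid = n->id;
            n->pending = false;
            return true;
        }
    }
    return false;
}

With this rule the 1-to-4 stream alternates one full and one empty slot, which is exactly the halving.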


Tue Apr 05, 2016 2:53 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I had wondered about the starvation effect. But I think it might be okay, because packets have to be built by the CPU cores in software and there's no DMA to memory for transferring packets, so there will probably be at least a few clock cycles between transmits; the software is bound to take some cycles to execute. I added a 63-entry receive FIFO in the net controller to handle the case where packets are received faster than the software can process them. But there's only a single transmit buffer.
Another issue is handling interrupts. As it stands the CPU core doesn't support interrupts; it has to poll the net controller. I think a special interrupt message will be put onto the network when there's a need. One or two of the nodes may be dedicated to processing interrupt messages, so they would just sit in a loop waiting for an int message to appear.
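
In C terms a dedicated interrupt node would just do something like this. A sketch only: NOCC_STAT is the net controller status register in my memory map, but the RX register address, ready bit, and message code here are placeholders I haven't pinned down:

Code:
#include <stdint.h>

#define NOCC_STAT (*(volatile uint32_t *)0xFFD8001C)
#define NOCC_RXLO (*(volatile uint32_t *)0xFFD80010)  /* placeholder */
#define RX_READY  0x200u                              /* placeholder */
#define INT_MSG   0x1u                                /* placeholder */

void handle_interrupt(uint32_t payload);  /* real handler lives elsewhere */

/* Spin forever, pulling packets and dispatching interrupt messages. */
void int_server(void) {
    for (;;) {
        while ((NOCC_STAT & RX_READY) == 0)
            ;                             /* poll: no interrupts in the core */
        uint32_t payload = NOCC_RXLO;     /* pop the receive FIFO            */
        if ((payload & 0xFu) == INT_MSG)  /* is it an interrupt message?     */
            handle_interrupt(payload);
    }
}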

The way I have the nodes set up right now, the CPU can only execute instructions from ROM, which can't be updated. So I've been thinking of using an interpretive language that could be stored in RAM as data, for a ROM-resident interpreter to execute.

Though only four are shown in the diagram, I might be able to fit as many as six nodes (12 CPU cores) on the board.

_________________
Robert Finch http://www.finitron.ca


Wed Apr 06, 2016 2:55 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
That sounds like a capacious board!


Wed Apr 06, 2016 7:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I managed to get this little piece of RISC-V code almost working. It displays a single solitary zero on the screen, in the proper colors of grey on green, so data is being transmitted through the network. It's supposed to display a sequence of 2,4,6,8 on successive lines according to the node number; it looks like the node number is zero for some reason. Note the RISC-V ISA was modified slightly (in a compatible way) to allow the use of 32-bit constants.

Code:
TXTROWS      EQU      1
TXTCOLS      EQU      4
TEXTSCR      EQU      $FFD00000
NOCC_PKTLO     EQU       $FFD80000
NOCC_PKTMID    EQU       $FFD80004
NOCC_PKTHI     EQU       $FFD80008
NOCC_TXPULSE   EQU       $FFD80018
NOCC_STAT      EQU       $FFD8001C
CPU_INFO       EQU       $FFD90000

NormAttr    EQU      8

   org      0x2000
   jmp      start
start:
  lw    r4,CPU_INFO      ; figure out which core we are
  andi  r4,r4,#15
  slti  r4,r4,#2
  beq   r4,r0,.0002      ; not core #1 (wasn't less than 2)
  lw    r4,NOCC_STAT     ; get which node we are
  srli  r4,r4,#16        ; extract bit field
  andi  r4,r4,#15
  or    r3,r4,r0         ; move to r3
  ldi   r1,#$1000001F    ; select write cycle to main system
  ldi   r2,#$FFDC0600    ; LEDs
  jal   r31,xmitPacket
 
  ldi   r5,#336          ; number of bytes per screen line
  mul   r4,r4,r5         ; r4 = node number * bytes per screen line
  addi  r2,r4,#$FFD00000 ; add in screen base address r2 = address
  ldi   r1,#$1000001F    ; target system interface for word write cycle
  ori   r3,r3,#%000111000_110110110_000011_0000    ; grey on green text
  jal   r31,xmitPacket
.0001:                   ; hang the cpu
  beq   r0,r0,.0001
  ; Here do processing for the second CPU
.0002:
  beq   r0,r0,.0002

xmitPacket:
  ; first wait until the transmitter isn't busy
.0001:
  lw    r4,NOCC_STAT
  andi  r4,r4,#$100      ; bit 8 is xmit status
  bne   r4,r0,.0001
  ; Now transmit packet
  sw    r1,NOCC_PKTHI    ; set high order packet word
  sw    r2,NOCC_PKTMID   ; set middle packet word
  sw    r3,NOCC_PKTLO    ; and set low order packet word
  sw    r0,NOCC_TXPULSE  ; and send the packet
  jal   [r31]            ; return to caller via r31


_________________
Robert Finch http://www.finitron.ca


Thu Apr 07, 2016 10:11 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have the Noc system working much better. It is able to properly display data from multiple nodes on the screen; the simple 2,4,6,8 node display test worked. I'm working on getting a bitmap test program going, so access to the DRAM is required. I noticed that memory access is really inefficient when multiple CPUs go through a single DRAM port.

Currently there is only a single system interface node, which provides access to all the I/O and common memory in the system.
The problem is that the DRAM reads in eight-word bursts, which isn't very efficient when multiple nodes access the DRAM randomly. For instance, on one cycle node 1 might want to read address $1000; the next cycle node 2 might want address $2000. So node 1's buffered data is useless and gets dumped so that node 2's data can be loaded.
One solution is to use multiple DRAM ports with multiple system interface nodes.
The benefit of multiple system interface nodes is that network traffic might be reduced, since the sysnodes remove packets that are intended for them; latency of access to the system devices could also be reduced.
Another solution is to use a multi-way cache on a single DRAM port.
I'm leaning towards the multi-way cache and keeping the rest of the system simple.

_________________
Robert Finch http://www.finitron.ca


Wed May 04, 2016 4:59 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
It does sound like you need to hold at least 8 words per node near the DRAM port. Is it enough to have 8 buffers rather than a full multiway cache?


Wed May 04, 2016 5:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have modified the controller to use a separate buffer for each node. It's simpler than a multi-way cache, but now the controller needs to know the node number. That's not a big deal, since the node number is available as part of the network packet received by the system node.
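
In software terms the buffering amounts to something like this - a C sketch of the idea, not the actual Verilog. The sizes match the eight-word bursts and four nodes; the names are made up:

Code:
#include <stdint.h>
#include <stdbool.h>

#define NODES  4    /* nodes #1..#4; index by node id - 1 */
#define BURST  8    /* DRAM reads are eight-word bursts   */

/* One burst buffer per node: each node's last burst stays resident, so  */
/* interleaved requests from different nodes no longer evict each other  */
/* the way a single shared buffer did.                                   */
typedef struct {
    uint32_t tag;            /* burst-aligned address of the held data */
    bool     valid;
    uint32_t data[BURST];
} burst_buf_t;

static burst_buf_t buf[NODES];

extern void dram_burst_read(uint32_t addr, uint32_t out[BURST]);

uint32_t dram_read(unsigned node, uint32_t addr) {
    burst_buf_t *b = &buf[node - 1];           /* node id comes from the packet */
    uint32_t tag = addr & ~(uint32_t)(BURST * 4 - 1);
    if (!b->valid || b->tag != tag) {          /* miss: fetch a fresh burst */
        dram_burst_read(tag, b->data);
        b->tag = tag;
        b->valid = true;
    }
    return b->data[(addr >> 2) & (BURST - 1)]; /* word within the burst */
}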

The bitmapped graphics test doesn't appear to work; the screen is blank. It could be that data isn't being written to the DRAM properly. To test that theory I've wired part of the bitmap data input to a constant pixel value. That should tell me whether the controller is actually trying to read data from memory or whether something else is wrong: if the controller is reading from memory, part of the display should appear as stripes of the constant color.

_________________
Robert Finch http://www.finitron.ca


Wed May 04, 2016 7:09 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
For interest, I just came across the ZMOB, an architecture for very many Z80s to pass messages within one system. See
http://www.dtic.mil/dtic/tr/fulltext/u2/a081346.pdf
Each Z80 has RAM and boot ROM and access to a communication port. The network is parallel, maybe 40 bits wide, so each message spends one or two clock ticks at each node. The idea is to allow several messages in transit at once. If a message gets around the whole loop without finding its destination, the sending node can take it off.
Attachment:
ZMOB-network.png

It seems the system was actually built at a scale of 128 CPUs.
"In the late 1970's, Chuck Rieger and Mark Weiser built ZMOB, an early parallel computer based on commodity microprocessors. The system consisted of 128 Z-80 processors."


Mon Jul 04, 2016 9:27 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Very interesting, thanks for the link.

_________________
Michael A.


Tue Jul 05, 2016 1:49 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Thanks for the reference. It provides a potential application for the network. They make use of the fact that the Z80 is much slower than the network mail slots, so network access appears almost immediate to the Z80s. The Noc's network works slightly differently because the cores are just as fast as the network.

I've got almost the same thing but with a technology update: 75 MHz RISC-V cores rather than 4 MHz Z80s, and a slightly different mail slot. The Noc doesn't use an index pulse; instead it matches the source address against the node address to know which slot is "owned". The parallel bus is wider, which allows global resource addressing in the messages. The Noc also has a special node dedicated to accessing global resources; in ZMOB that role is played by the host computer.
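
The ownership test itself is about as simple as it gets (a sketch):

Code:
#include <stdbool.h>

/* No index pulse: a slot is "owned" (reusable by this node) when the   */
/* transmitter id in the passing packet matches our own node id.        */
bool slot_owned(unsigned pkt_tid, unsigned my_node_id) {
    return pkt_tid == my_node_id;
}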

I've been working on adding interrupt capability and support for different operating modes to the cores.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 06, 2016 4:54 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I just discovered that the source register and address register for store instructions were swapped in the RISC-V core. It worked because they were also swapped in the assembler. I guess it could be called a feature, but it'd be incompatible.
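
For reference, the standard RV32I store is sw rs2,imm(rs1): rs1 carries the base address and rs2 the data. A quick C sketch of the S-type encoding shows where each register field lands:

Code:
#include <stdint.h>

/* Encode a standard RV32I SW instruction: sw rs2, imm(rs1). */
uint32_t encode_sw(unsigned rs1_base, unsigned rs2_src, int32_t imm) {
    uint32_t i = (uint32_t)imm;
    return ((i >> 5) & 0x7F) << 25   /* imm[11:5]          */
         | (rs2_src & 0x1F)  << 20   /* rs2: data source   */
         | (rs1_base & 0x1F) << 15   /* rs1: address base  */
         | 0x2u              << 12   /* funct3 = 010 (SW)  */
         | (i & 0x1F)        << 7    /* imm[4:0]           */
         | 0x23u;                    /* STORE opcode       */
}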

_________________
Robert Finch http://www.finitron.ca


Fri Aug 26, 2016 5:31 am

Joined: Tue Jan 15, 2013 5:43 am
Posts: 189
Amusing. In human affairs it's axiomatic that "two wrongs don't make a right." But computers are 100% OK with it!

_________________
http://LaughtonElectronics.com


Fri Aug 26, 2016 2:31 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The RISC-V-compatible core has been rewritten with an extension supporting 64-bit instructions that carry 32-bit immediate constants. The I-cache was also modified to be 4-way set associative. Now running the system in simulation hangs the PC big time: everything locks up, the task manager can't be activated, and it takes a power off and on to get the PC working again. The only change was the 4-way set-associative cache; it was simulating fine before this. I guess the simulator doesn't like the 4-way I-cache.
Hopefully it will run in the FPGA hardware.
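
For what it's worth, the lookup side of a 4-way set-associative cache is simple enough. Here's a generic C model; the geometry numbers are illustrative, not my actual I-cache parameters:

Code:
#include <stdint.h>
#include <stdbool.h>

#define WAYS  4
#define SETS  64    /* illustrative set count      */
#define LINE  16    /* illustrative bytes per line */

/* The set index comes from the address; all four ways of that set are  */
/* tag-compared (in parallel in hardware, sequentially here).           */
typedef struct { uint32_t tag; bool valid; } line_t;
static line_t cache[SETS][WAYS];

bool icache_hit(uint32_t addr, unsigned *way_out) {
    uint32_t set = (addr / LINE) % SETS;
    uint32_t tag = addr / (LINE * SETS);
    for (unsigned w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way_out = w;
            return true;
        }
    }
    return false;
}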

_________________
Robert Finch http://www.finitron.ca


Sat Sep 03, 2016 3:13 am