            | Author | Message |  
			| robfinch 
					Joined: Sat Feb 02, 2013 9:40 am
 Posts: 2405
 Location: Canada
   | A recent endeavor of mine is to build a networked system. I don't have much practical experience in that regard. I tried (briefly) to find some NoC information, but detailed info was not easily found, so I figured I'd just make something up myself. Any tips would be welcome. Here's my starting point. I've chosen a 128-bit packet size which is transmitted as parallel data around a ring. I plan on using a network of RISC-V compatible cores.

 Bits:   127 | 126-123 | 121-118 | 117 | 116-109 | 107-0
 Fields: GB  | RID     | TID     | ACK | AGE     | Payload Area
 Fields: GB  | RID     | TID     | ACK | AGE     | We, Sel4, Addr32, Data32

 GB – global broadcast; if this bit is set, all receivers should pay attention to the packet.
 RID – id of intended receiver
 TID – id of transmitter
 ACK – acknowledge receipt of packet
 AGE – age of the packet in ring cycles; if the packet gets too old it is automatically deleted.

 ID #0 means the packet is empty (no receiver)
 ID #1 to #4 are used by nodes
 ID #15 is the system controller – it takes care of interfacing to the outside world I/O and aging the packets as they travel around the ring.
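To make the layout concrete, here is a minimal Python sketch that packs and unpacks the fields into a 128-bit integer. The header positions follow the bit numbers listed above, read as inclusive ranges. The positions of We, Sel4, Addr32 and Data32 inside the payload are not given in the post, so packing them upward from bit 0 is purely an assumption, as are the names `FIELDS`, `pack` and `unpack`.

```python
# Each entry is (lowest bit, width in bits).
FIELDS = {
    "gb":   (127, 1),   # global broadcast
    "rid":  (123, 4),   # receiver id, bits 126-123
    "tid":  (118, 4),   # transmitter id, bits 121-118
    "ack":  (117, 1),   # acknowledge
    "age":  (109, 8),   # age in ring cycles, bits 116-109
    # Payload sub-fields: placement below is an assumption, not from the post.
    "we":   (68, 1),
    "sel":  (64, 4),
    "addr": (32, 32),
    "data": (0, 32),
}


def pack(**values):
    """Build a 128-bit packet from named field values (unnamed fields are 0)."""
    word = 0
    for name, value in values.items():
        low, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def unpack(word):
    """Split a 128-bit packet back into a dict of field values."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in FIELDS.items()
    }


# Node 2 asks the system controller (ID 15) to write a word.
pkt = pack(rid=15, tid=2, we=1, sel=0xF, addr=0xFFD00000, data=0x30)
fields = unpack(pkt)
```

An all-zero word has RID 0, which matches the "ID #0 means the packet is empty" rule.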
 _________________Robert Finch   http://www.finitron.ca 
 
 |  
			| Mon Apr 04, 2016 8:14 am |   |  
		|  |  
			| BigEd 
					Joined: Wed Jan 09, 2013 6:54 pm
 Posts: 1841
   | There must be some commonality with the concerns of conventional networks - token ring springs to mind. It must have had some advantages and drawbacks which are now well known. Conventional (local) networks are often serial, because wires and connectors are cheaper and there's no bus skew to worry about. An on-chip network could be parallel... but 128 bits sounds very wide to me! It's hard to know whether that will cost too much in utilisation. You could try the fitting (I'm assuming you're on FPGA) even before you have any idea of the protocol.
 Whatever your packet size, it seems likely there'll be a higher level of protocol which sometimes needs to send multi-packet messages, so you'll need to accommodate that.
 
 Back in the day, it was popular to try to prove things about a given network: that it's deadlock-free, would maintain throughput, is fair, is efficient. Your packet ageing idea looks like it sorts out endlessly recirculating packets. I don't know what other pitfalls there might be.
 
 
 |  
			| Mon Apr 04, 2016 12:01 pm |  |  
		|  |  
			| BigEd 
   | Just a thought - starvation. If your ring is 1-2-3-4-1, and 1 is sending a lot of data to 4, will it starve out nodes 2 and 3? If 1 sees a message come back consumed, it can fill it with the next payload. But that stops 2 and 3 from ever seeing an empty packet.
 But if 1 has to let its own packet go past to give the others a chance, the throughput from 1 to 4 drops to a half: alternately we see one and then zero payloads go around.
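The two cases above can be tried out in a toy model. The sketch below is only an illustration under assumed rules, not a description of any real controller: a single slot circulates 1-2-3-4, every sender always has data queued for node 4, node 4 empties the slot on delivery, and the "polite" rule makes a node sit out the lap after one on which it sent. The function name `simulate` and its parameters are made up for this example.

```python
def simulate(laps, senders, polite, ring=(1, 2, 3, 4), dest=4):
    """Count packets delivered to `dest` per sender over `laps` trips of one slot."""
    delivered = {n: 0 for n in senders}
    last_sent_lap = {n: None for n in senders}
    slot = None  # None means empty, otherwise the id of the node that filled it
    for lap in range(laps):
        for node in ring:
            if slot is not None and node == dest:
                # Destination consumes the payload and frees the slot.
                delivered[slot] += 1
                slot = None
            elif slot is None and node in senders:
                last = last_sent_lap[node]
                if polite and last is not None and lap - last < 2:
                    continue  # let the empty slot go past once
                slot = node
                last_sent_lap[node] = lap
    return delivered


greedy_all = simulate(100, senders={1, 2, 3}, polite=False)
polite_one = simulate(100, senders={1}, polite=True)
polite_all = simulate(100, senders={1, 2, 3}, polite=True)
```

With the greedy rule, nodes 2 and 3 never get a slot; with the polite rule and node 1 alone, its throughput halves. In this toy model the skip rule by itself still leaves node 3 without a slot when nodes 1 and 2 both have traffic queued, so a real design would likely need something stronger.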
 
 
 |  
			| Tue Apr 05, 2016 2:53 pm |  |  
		|  |  
			| robfinch 
   | I had wondered about the starvation effect. But I think it might be okay, because the packets have to be built by each core's CPU in software, and there's no DMA to memory to transfer packets. So there will probably be at least a few clock cycles between each transmit; the software's bound to take some cycles to execute. I added a 63-entry receive FIFO in the net controller to handle the case where packets are received faster than the software can process them. But there's only a single transmit buffer.
 Another issue is handling interrupts. As it is now, the CPU core doesn't support interrupts; it has to poll the net controller. I think a special interrupt message will be put onto the network when there's a need. One or two of the nodes may be dedicated to processing interrupt messages, so they would just sit in a loop waiting for an int message to appear.
 
 The way I have the nodes set up right now, the CPU can only execute instructions from ROM, which can't be updated. So I've been thinking of using an interpretive language which could be stored in the RAM as data, for a ROM interpreter to execute.

 Though only four are shown in the diagram, I might be able to fit as many as six nodes (12 CPU cores) on the board.
 
 |  
			| Wed Apr 06, 2016 2:55 am |   |  
		|  |  
			| BigEd 
   | That sounds like a capacious board! 
 
 |  
			| Wed Apr 06, 2016 7:48 am |  |  
		|  |  
			| robfinch 
   | I managed to get this little piece of RISC-V code almost working. It displays a single solitary zero on the screen in the proper colors of grey on green, so there's data being transmitted through the network. It's supposed to display a sequence of 2,4,6,8 on successive lines according to the node number. It looks like the node number is zero for some reason. Note the RISC-V ISA was modified slightly (in a compatible way) to allow the use of 32-bit constants.
 Code:
 TXTROWS      EQU      1
 TXTCOLS      EQU      4
 TEXTSCR      EQU      $FFD00000
 NOCC_PKTLO     EQU       $FFD80000
 NOCC_PKTMID    EQU       $FFD80004
 NOCC_PKTHI     EQU       $FFD80008
 NOCC_TXPULSE   EQU       $FFD80018
 NOCC_STAT      EQU       $FFD8001C
 CPU_INFO       EQU       $FFD90000
 
 NormAttr    EQU      8
 
 org      0x2000
 jmp      start
 start:
 lw    r4,CPU_INFO      ; figure out which core we are
 andi  r4,r4,#15
 slti  r4,r4,#2
 beq   r4,r0,.0002      ; not core #1 (wasn't less than 2)
 lw    r4,NOCC_STAT     ; get which node we are
 srli  r4,r4,#16        ; extract bit field
 andi  r4,r4,#15
 or    r3,r4,r0         ; move to r3
 ldi   r1,#$1000001F    ; select write cycle to main system
 ldi   r2,#$FFDC0600    ; LEDs
 jal   r31,xmitPacket
 
 ldi   r5,#336          ; number of bytes per screen line
 mul   r4,r4,r5         ; r4 = node number * bytes per screen line
 addi  r2,r4,#$FFD00000 ; add in screen base address r2 = address
 ldi   r1,#$1000001F    ; target system interface for word write cycle
 ori   r3,r3,#%000111000_110110110_000011_0000    ; grey on green text
 jal   r31,xmitPacket
 .0001:                   ; hang the cpu
 beq   r0,r0,.0001
 ; Here do processing for the second CPU
 .0002:
 beq   r0,r0,.0002
 
 xmitPacket:
 ; first wait until the transmitter isn't busy
 .0001:
 lw    r4,NOCC_STAT
 andi  r4,r4,#$100      ; bit 8 is xmit status
 bne   r4,r0,.0001
 ; Now transmit packet
 sw    r1,NOCC_PKTHI    ; set high order packet word
 sw    r2,NOCC_PKTMID   ; set middle packet word
 sw    r3,NOCC_PKTLO    ; and set low order packet word
 sw    r0,NOCC_TXPULSE  ; and send the packet
 jal   [r31]
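
Reading the listing above, one thing stands out, offered as a guess rather than a confirmed diagnosis: xmitPacket uses r4 as its scratch register for the busy-wait, and the loop only exits once r4 is zero. The caller then multiplies r4 by 336 to get the screen line, so after the first call (the LED write) the line offset would come out as zero for every node. The Python sketch below just traces that register flow; `xmit_packet`, `screen_address` and the `save_r4` switch are hypothetical names for illustration, and it does not model the network or the displayed character.

```python
BYTES_PER_LINE = 336      # from the listing: bytes per screen line
TEXTSCR = 0xFFD00000      # from the listing: screen base address


def xmit_packet(regs):
    """Model only the side effect on r4 of the busy-wait in xmitPacket."""
    # lw r4,NOCC_STAT / andi r4,r4,#$100 / bne r4,r0,.0001
    # The loop falls through only when the masked value is zero.
    regs["r4"] = 0


def screen_address(node, save_r4):
    """Return the address the second packet would target for a given node."""
    regs = {"r4": node}
    regs["r3"] = regs["r4"]          # or r3,r4,r0
    saved = regs["r4"]
    xmit_packet(regs)                # first packet: LEDs
    if save_r4:
        regs["r4"] = saved           # hypothetical fix: preserve r4 across the call
    regs["r4"] = regs["r4"] * BYTES_PER_LINE          # mul r4,r4,r5
    return (regs["r4"] + TEXTSCR) & 0xFFFFFFFF        # addi r2,r4,#$FFD00000


as_posted = [screen_address(n, save_r4=False) for n in (1, 2, 3, 4)]
with_save = [screen_address(n, save_r4=True) for n in (1, 2, 3, 4)]
```

In this model `as_posted` contains the screen base address four times, while `with_save` gives each node its own line. Using a different scratch register inside xmitPacket would have the same effect as saving r4.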
 
 
 
 |  
			| Thu Apr 07, 2016 10:11 am |   |  
		|  |  
			| robfinch 
   | I have the NoC system working much better. It is able to properly display data from multiple nodes on the screen. The simple 2,4,6,8 node display test worked. I'm working on getting a bitmap test program working, so access to the DRAM is required. I noticed that memory access is really inefficient when multiple CPUs go through a single DRAM port.
 Currently there is only a single system interface node that allows access to all the I/O and common memory in the system.
 The problem is that the DRAM reads in eight-word bursts, which isn't very efficient when multiple nodes access the DRAM randomly. For instance, during one cycle node 1 might want to read address $1000. The next cycle node 2 might want access to address $2000. So node 1's buffered data is useless and gets dumped so that node 2's data can be loaded.
 One solution is to use multiple DRAM ports with multiple system interface nodes.
 The benefit of multiple system interface nodes is that network traffic might be reduced as the sysnodes remove packets that are intended for them. Also, the latency for access to the system devices could be reduced.
 Another solution is to use a multi-way cache on a single DRAM port.
 I'm leaning towards using the multi-way cache and keeping the rest of the system simple.
 
 |  
			| Wed May 04, 2016 4:59 am |   |  
		|  |  
			| BigEd 
   | It does sound like you need to hold at least 8 words per node near the DRAM port. Is it enough to have 8 buffers rather than a full multiway cache? 
 
 |  
			| Wed May 04, 2016 5:04 am |  |  
		|  |  
			| robfinch 
   | I have modified the controller to use a separate buffer for each node. It's simpler than a multi-way cache. But now the controller needs to know the node number. Not a big deal, since the node number is available as part of the network packet received by the system node.
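
As a rough illustration of why per-node buffers help, the Python sketch below counts DRAM bursts for an interleaved read pattern, once with a single shared burst buffer and once with one buffer per node. It assumes 32-bit words, eight-word bursts aligned on 32-byte boundaries, and reads only; `count_bursts` and the access pattern are made up for the example and do not reflect the actual controller logic.

```python
BURST_WORDS = 8   # DRAM reads in eight-word bursts
WORD_BYTES = 4    # assumed 32-bit words


def count_bursts(accesses, per_node):
    """Count bursts needed to serve (node, byte_address) reads in order."""
    buffers = {}  # buffer key -> burst-aligned base address currently held
    bursts = 0
    for node, addr in accesses:
        key = node if per_node else "shared"
        base = addr - addr % (BURST_WORDS * WORD_BYTES)
        if buffers.get(key) != base:
            # Buffer miss: its contents are dumped and a new burst is fetched.
            buffers[key] = base
            bursts += 1
    return bursts


# Node 1 walks upward from $1000 and node 2 from $2000, taking turns.
accesses = []
for i in range(8):
    accesses.append((1, 0x1000 + WORD_BYTES * i))
    accesses.append((2, 0x2000 + WORD_BYTES * i))

shared = count_bursts(accesses, per_node=False)
separate = count_bursts(accesses, per_node=True)
```

For this pattern the shared buffer is refilled on every access, while the per-node buffers need one burst each.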
 The bitmapped graphics test doesn't appear to work; the screen is blank. It could be that data isn't being written to the DRAM properly. To test the theory, I've wired up part of the bitmap data input to a constant pixel value. That should tell me if the controller is actually trying to read data from memory or something else is wrong. If the controller is actually reading from memory, part of the display should appear as stripes of the constant color.
 
 |  
			| Wed May 04, 2016 7:09 pm |   |  
		|  |  
			| BigEd 
   | For interest, I just came across the ZMOB, an architecture for very many Z80s to pass messages within one system. See http://www.dtic.mil/dtic/tr/fulltext/u2/a081346.pdf
 Each Z80 has RAM and boot ROM and access to a communication port. The network is parallel, maybe 40 bits wide, so each message spends one or two clock ticks at each node. The idea is to allow several messages in transit at once. If a message gets around the whole loop without finding its destination, the sending node can take it off.
 Attachment: ZMOB-network.png
 It seems the system was actually built at a scale of 128 CPUs. "In the late 1970's, Chuck Rieger and Mark Weiser built ZMOB, an early parallel computer based on commodity microprocessors. The system consisted of 128 Z-80 processors."
 
 
 |  
			| Mon Jul 04, 2016 9:27 pm |  |  
		|  |  
			| MichaelM 
					Joined: Wed Apr 24, 2013 9:40 pm
 Posts: 213
 Location: Huntsville, AL
   | Very interesting, thanks for the link. _________________
 Michael A.
 
 
 |  
			| Tue Jul 05, 2016 1:49 pm |  |  
		|  |  
			| robfinch 
   | Thanks for the reference. It provides a potential application for the network. They are making use of the fact that the Z80 is much slower than the network mail slots, so that network access appears almost immediate to the Z80s. The NoC's network works slightly differently because the cores are just as fast as the network.
 I’ve got almost the same thing but with a technology update: 75 MHz RISC-V cores rather than 4 MHz Z80s, and a slightly different mail slot. The NoC doesn’t use an index pulse. Instead it matches the source address against the node address to know which slot is “owned”. The parallel bus is wider, so it allows global resource addressing in the messages. The NoC also has a special node dedicated to accessing global resources; in ZMOB this is provided by the host computer.
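
The ownership test can be sketched in a few lines of Python. This is only a guess at how a node might treat a passing packet, built from the RID, TID and ACK fields described earlier in the thread: the receiver sets ACK and forwards the packet, and the transmitter recognises its own TID when the packet comes back and either reuses or releases the slot. The exact hardware rules are not spelled out in the posts, so `Packet`, `node_step` and the queue arguments are all assumptions.

```python
from dataclasses import dataclass

EMPTY_ID = 0  # ID #0 means the packet is empty


@dataclass
class Packet:
    rid: int = EMPTY_ID   # intended receiver
    tid: int = EMPTY_ID   # transmitter
    ack: bool = False
    payload: int = 0


def node_step(node_id, pkt, tx_queue, rx_fifo):
    """Handle one packet passing node_id and return the packet to forward."""
    if pkt.rid == node_id and not pkt.ack:
        rx_fifo.append(pkt.payload)   # take a copy and acknowledge
        pkt.ack = True
        return pkt
    owned = pkt.tid == node_id        # source id matches: this slot is ours
    if (owned or pkt.rid == EMPTY_ID) and tx_queue:
        rid, payload = tx_queue.pop(0)
        return Packet(rid=rid, tid=node_id, payload=payload)
    if owned:
        return Packet()               # nothing to send: release the slot
    return pkt                        # someone else's packet: pass it on


# Walk one packet from node 2 to node 4 and back around the ring.
rx4 = []
p = node_step(2, Packet(), [(4, 0xAB)], [])   # node 2 fills an empty slot
p = node_step(3, p, [], [])                   # node 3 passes it on
p = node_step(4, p, [], rx4)                  # node 4 receives and acks
p = node_step(1, p, [(4, 0x01)], [])          # node 1 may not reuse it
p = node_step(2, p, [], [])                   # node 2 sees its own TID, releases
```

Note that under these assumed rules only the owner can reuse an acknowledged slot, which is exactly the situation the starvation question earlier in the thread is about.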
 
 I’ve been working on adding interrupt capability / supporting different operating modes to the cores.
 
 |  
			| Wed Jul 06, 2016 4:54 pm |   |  
		|  |  
			| robfinch 
   | I just discovered that the source register and address register for store instructions were swapped in the RISC-V core. It worked because they were also swapped around in the assembler. I guess it could be a feature, but it'd be incompatible.
 
 |  
			| Fri Aug 26, 2016 5:31 am |   |  
		|  |  
			| Dr Jefyll 
					Joined: Tue Jan 15, 2013 5:43 am
 Posts: 189
   | Amusing. In human affairs it's axiomatic that "two wrongs don't make a right." But computers are 100% OK with it!
 
 |  
			| Fri Aug 26, 2016 2:31 pm |   |  
		|  |  
			| robfinch 
   | The RISC-V compatible core has been rewritten with an extension to support 64-bit instructions for 32-bit immediate constants. The I-cache was also modified to be 4-way set associative. Now running the system in simulation hangs the PC big time. Everything locks up and Task Manager can't be activated. It requires a power off and on to get the PC working again. It happens just with the change to a 4-way set associative cache; it was simulating fine before this. I guess the simulator doesn't like the 4-way I-cache. Hopefully it will run in the FPGA hardware.
 
 |  
			| Sat Sep 03, 2016 3:13 am |   |  
 
	