AnyCPU
http://anycpu.org/forum/

DSD7
http://anycpu.org/forum/viewtopic.php?f=23&t=331
Page 7 of 7

Author:  BigEd [ Tue May 16, 2017 3:28 pm ]
Post subject:  Re: DSD7

Interesting choice - why use high order bits to select the set? Feels like that would spread things around less. (Unless I'm confused...)

Author:  robfinch [ Tue May 16, 2017 8:11 pm ]
Post subject:  Re: DSD7

Quote:
why use high order bits to select the set?


Using high order bits likely doesn’t make a difference, especially in an FPGA, but using them might allow the same cam tag ram to be active for longer periods of time. I think higher order address bits switch less frequently than lower order ones. The sram for the cache doesn’t care which bits it’s fed as long as it’s consistent for reads and writes. It’s maybe cognitively harder to envision the use of the higher order bits since it does spread the storage around, but I don’t think it matters to the operation.
This is pure speculation on my part (I have yet to research it). In other words, it’s an attempt to lower the frequency of cam switches. Less frequent switching means lower power. Having more sets may also reduce power, as only 1/16 of the tag ram is active rather than ¼.
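
A rough way to see the effect being described is to count how often the selected set changes over an address stream. The sketch below is only an illustration: the 32-bit address width, 16 sets, 32-byte lines, and the sequential fetch stream are all assumptions, not details taken from the DSD7 design.

```python
def set_index(addr, line_bits=5, set_bits=4, addr_bits=32, use_high=True):
    """Pick the set number from either the top address bits or the
    low-order index bits just above the line offset (assumed widths)."""
    mask = (1 << set_bits) - 1
    if use_high:
        return (addr >> (addr_bits - set_bits)) & mask
    return (addr >> line_bits) & mask


def count_set_switches(addrs, **kw):
    """Count how often consecutive accesses land in a different set."""
    sets = [set_index(a, **kw) for a in addrs]
    return sum(1 for prev, cur in zip(sets, sets[1:]) if prev != cur)


# Hypothetical stream: 1024 sequential 4-byte fetches starting at 0xFFFC0000.
stream = [0xFFFC0000 + 4 * i for i in range(1024)]
high_switches = count_set_switches(stream, use_high=True)
low_switches = count_set_switches(stream, use_high=False)
print(high_switches, low_switches)  # prints: 0 127
```

With this stream the high-order selection never changes set, while the low-order selection changes set at every line boundary. Real code with jumps across distant regions would behave differently.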

Author:  BigEd [ Tue May 16, 2017 8:27 pm ]
Post subject:  Re: DSD7

Ok, thanks, that's a reason I hadn't thought of at all.

Author:  robfinch [ Thu May 18, 2017 7:05 am ]
Post subject:  Re: DSD7

The core is now using a fully associative L1 cache based on a cam tag memory. The L1 cache now takes three cycles to update rather than one because the cam takes two cycles then the cache itself takes another cycle. Previously the tags could be updated in the same cycle as the cache memory. So it’s two more clock cycles, but only on cache misses. Changing the cache to full associativity reduces the miss rate. With only a 2kB cache the miss rate is pretty good running only the BIOS.
hit rate = 29229/(29229+56) * 100 = 99.8%
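
The same arithmetic as a short Python check, using only the counts quoted above. The extra-cycle estimate assumes every miss pays exactly the two additional update cycles mentioned in the post.

```python
hits, misses = 29229, 56
accesses = hits + misses

# Hit rate as a percentage of all cache accesses.
hit_rate = hits / accesses * 100

# Assumption: each miss costs two more cycles than the old one-cycle update.
extra_cycles = misses * 2
extra_per_access = extra_cycles / accesses

print(f"hit rate = {hit_rate:.1f}%")
print(f"extra cycles = {extra_cycles} ({extra_per_access:.4f} per access)")
```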
I had to write the cam code myself (so it’s clean room implementation). I couldn’t find any on the web which surprised me. I’m not sure it’s the best implementation. (No, it doesn't use hordes of comparators).
The next step will be to use cam memories for the L2 cache.
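
For readers unfamiliar with the idea, here is a purely behavioral sketch of a fully associative cache. A Python dict stands in for the cam (tag in, slot out). It says nothing about how the actual cam is built; the 64 lines of 32 bytes (2kB total) and the round-robin replacement are assumptions made for the example.

```python
class FullyAssociativeCache:
    """Behavioral model only: any line may live in any slot."""

    def __init__(self, num_lines=64, line_bytes=32):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tag_to_slot = {}              # stands in for the cam lookup
        self.slot_tags = [None] * num_lines
        self.next_victim = 0               # assumed round-robin replacement
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit; on a miss, fill a slot and return False."""
        tag = addr // self.line_bytes
        if tag in self.tag_to_slot:
            self.hits += 1
            return True
        self.misses += 1
        slot = self.next_victim
        old_tag = self.slot_tags[slot]
        if old_tag is not None:
            del self.tag_to_slot[old_tag]  # evict the previous occupant
        self.slot_tags[slot] = tag
        self.tag_to_slot[tag] = slot
        self.next_victim = (slot + 1) % self.num_lines
        return False


# Hypothetical workload: four passes over a 1024-byte loop, 4 bytes per fetch.
cache = FullyAssociativeCache()
for _ in range(4):
    for addr in range(0, 1024, 4):
        cache.access(addr)
print(cache.hits, cache.misses)  # prints: 992 32
```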

Author:  robfinch [ Fri May 19, 2017 12:55 pm ]
Post subject:  Re: DSD7

2017/05/18
Working on increasing the stability of the system and resolving all the missed timing constraints. The big one missing timing is the multi-port memory controller. It tops out at about 75MHz operation, but it is clocked by a signal from the DDR3 app controller which is a fixed 100MHz. The frequency the DDR controller runs at can’t be altered very easily so I broke the multi-port controller into two separate state machines, one operating at 100MHz and the other much slower at the system clock rate. Still no go. The high-speed state machine still isn’t fast enough. And there’s a bug in the controller now. I waded through about a half dozen web queries on how to lower the frequency of the controller. I’m not the only one who’s run into this problem.
Even though it failed timing the simpler controller did run the ram memory test successfully (sometimes). It’s just not reliable enough.
The L2 instruction cache is now a 16kB, 32way, 16 set associative cache utilizing cam memories for the tags. While it worked fine in simulation, testing the cache in hardware reveals that it’s a monster that affects timing. 20,000 LUTs are required to implement the cam memory for L2 and there are timing errors all over the design now. For the next unstep I will be switching the L2 cache to 16kB, 4way set associative that uses regular comparators rather than cam memories. The lower associativity isn’t as important to the larger cache.
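
The geometry of the two L2 configurations can be worked out from the sizes quoted above. This is only a sketch: it assumes the 4-way version keeps the same line size as the 32-way version, which the post does not state.

```python
def line_bytes(total_bytes, ways, sets):
    """Line size implied by capacity, associativity, and set count."""
    return total_bytes // (ways * sets)


def num_sets(total_bytes, ways, line_size):
    """Number of sets implied by capacity, associativity, and line size."""
    return total_bytes // (ways * line_size)


L2_BYTES = 16 * 1024

# 16kB, 32-way, 16 sets: each line comes out at 32 bytes (256 bits).
cam_line = line_bytes(L2_BYTES, ways=32, sets=16)
cam_line_bits = cam_line * 8

# 16kB, 4-way, same line size (assumed): 128 sets, 4 tag compares per lookup.
four_way_sets = num_sets(L2_BYTES, ways=4, line_size=cam_line)

print(cam_line, cam_line_bits, four_way_sets)  # prints: 32 256 128
```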
2017/05/19
Adding neural network accelerator instructions to the instruction set. Having had a look at: http://anycpu.org/forum/viewtopic.php?f=17&t=380 It looks like neural nets may be the wave of the future/present. The first instruction is nnmac (for neural network multiply-accumulate). nnmac performs seven parallel floating point multiplies, then sums the results together using an adder tree. The output of the instruction may be cycled back as an adder input to allow more than seven inputs to be processed for a neuron. The output is a sigmoid activation level.
The reason to process seven inputs at a time was that the data will fit into 256 bit cache lines. (7, 32 bit floats) So the whole cache line can be used to load weights or input data.
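
A behavioral sketch of the nnmac idea in Python, for illustration only. It assumes the value cycled back into the adder tree is the running sum and that the sigmoid is applied once at the end; the function names and the 14-input example are hypothetical, and Python floats are 64-bit rather than the 32-bit floats used in hardware.

```python
import math


def adder_tree(values):
    """Sum values pairwise, level by level, like a hardware adder tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])      # odd element passes through
        level = nxt
    return level[0]


def nnmac(weights, inputs, acc=0.0):
    """Seven parallel multiplies, summed together with an accumulator input."""
    assert len(weights) == len(inputs) == 7
    products = [w * x for w, x in zip(weights, inputs)]
    return adder_tree(products + [acc])


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Seven 32-bit floats occupy 224 bits, so they fit in one 256-bit cache line.
assert 7 * 32 <= 256

# Hypothetical neuron with 14 inputs, processed as two chunks of seven.
weights = [0.5] * 14
inputs = [1.0] * 14
total = nnmac(weights[:7], inputs[:7])
total = nnmac(weights[7:], inputs[7:], acc=total)
activation = sigmoid(total)
print(total, activation)
```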
Attachment: NNMAC.png

The L2 Icache was switched to 16kB, 4-way set associative. But now the system is busted back to the clear screen level.

Author:  BigEd [ Fri May 19, 2017 5:29 pm ]
Post subject:  Re: DSD7

20000 LUTs under the sea... I wonder how big a module can be and still be reasonable in implementation. My eight-hour place and route of the Elf was, I think, unreasonable.

Author:  robfinch [ Sat May 20, 2017 11:23 am ]
Post subject:  Re: DSD7

Large distributed RAMs are slow timing wise because of all the cascaded LUTs and the tools will put a lot of effort into trying to improve the timing. A large cache isn't really suitable to be made from distributed RAM because of the slow timing. However switching to block RAMs for the cache cost clock cycles in access time anyway. It seems like one just can't win.
It took about 2.5 hours to place and route my design with cams made out of LUT ram. After changing back to Block RAMs in the cache, the design took only 1hr to P&R. In some instances I've taken to using the vendor's tools to generate RAM components rather than have them inferred automatically. It seems to do a better job.

Quote:
I wonder how big a module can be and still be reasonable in implementation

That consideration may be up to the designer what they think is reasonable.
I've been thinking I would not want a larger FPGA now because of the time it takes to build a system. Once the FPGA is full it might take all day to build. Kinda puts the kibosh on interactive development. It may be more rewarding to use a smaller FPGA / design.

Author:  BigEd [ Sat May 20, 2017 11:54 am ]
Post subject:  Re: DSD7

Do you use any floorplanning guidance? It's possible that doing that would improve the P&R runtime and might also improve the result.

Author:  robfinch [ Sat May 20, 2017 11:50 pm ]
Post subject:  Re: DSD7

2017/05/20
Quote:
Do you use any floorplanning guidance?

I’ve never floor-planned a design yet. I’ve viewed floor-planning as one of the last stages to improving a design. It’s an optimization process. One has to know what all the components of the design are and the components have to be fairly stable. Otherwise one could be stuck in a loop re-floor-planning the design as changes are made. One can work to make a design FPGA friendly (for instance using 6 input functions) while coding.

Continue to tweak the ICache. Modified the cache line length that is stored to 240 bits from 256 bits to reduce the cache size and save some resources. Only 120 of 128 bits in a hexibyte value are used for instructions, instructions being 40 bits in size. Six instructions can be stored in only 240 bits.
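
The packing arithmetic can be checked with a few lines of Python. The pack and unpack helpers are hypothetical; in particular, placing instruction 0 in the lowest 40 bits is an assumption about bit ordering, not something stated above.

```python
INSTR_BITS = 40
HEXIBYTE_BITS = 128
STORED_LINE_BITS = 240
MASK = (1 << INSTR_BITS) - 1

# Three 40-bit instructions fit in a 128-bit hexibyte, using 120 bits.
instrs_per_hexibyte = HEXIBYTE_BITS // INSTR_BITS
used_bits = instrs_per_hexibyte * INSTR_BITS
# Two hexibytes per 256-bit line give six instructions in 240 bits.
instrs_per_line = STORED_LINE_BITS // INSTR_BITS
saved_fraction = (256 - STORED_LINE_BITS) / 256


def pack_line(instrs):
    """Pack six 40-bit instructions into one 240-bit value (assumed order)."""
    assert len(instrs) == instrs_per_line
    line = 0
    for i, ins in enumerate(instrs):
        line |= (ins & MASK) << (i * INSTR_BITS)
    return line


def unpack_line(line):
    """Recover the six 40-bit instructions from a 240-bit value."""
    return [(line >> (i * INSTR_BITS)) & MASK for i in range(instrs_per_line)]


sample = [0x1234567890 + i for i in range(6)]
packed = pack_line(sample)
print(instrs_per_hexibyte, used_bits, instrs_per_line, saved_fraction)
```

Dropping the 16 unused bits trims each stored line by 6.25%.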

Spent the morning researching and creating a high-speed sigmoid function evaluator for the neural network accelerator. It uses the simple table lookup method. There’s no PWL (piece-wise linear approximation) to it, so it’s a stepping function. This will tend to limit accuracy in the neuron.
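
To give a feel for the accuracy limit of a plain table lookup, here is a small Python sketch. The table size (256 entries), the input range (-8 to 8), and the midpoint sampling are all assumptions chosen for the example; the actual evaluator's parameters are not given above.

```python
import math

TABLE_BITS = 8                      # hypothetical: 256-entry table
X_MIN, X_MAX = -8.0, 8.0            # hypothetical input range
ENTRIES = 1 << TABLE_BITS
STEP = (X_MAX - X_MIN) / ENTRIES

# Each entry holds the sigmoid sampled at the midpoint of its interval.
TABLE = [1.0 / (1.0 + math.exp(-(X_MIN + (i + 0.5) * STEP)))
         for i in range(ENTRIES)]


def sigmoid_lut(x):
    """Step-function sigmoid: no interpolation between table entries."""
    if x <= X_MIN:
        return TABLE[0]
    if x >= X_MAX:
        return TABLE[-1]
    index = int((x - X_MIN) / STEP)
    return TABLE[min(index, ENTRIES - 1)]   # guard against rounding at the top


def sigmoid_exact(x):
    return 1.0 / (1.0 + math.exp(-x))


# Measure the worst-case error on a fine grid across and beyond the range.
grid = [-10.0 + i * 0.001 for i in range(20001)]
max_err = max(abs(sigmoid_lut(x) - sigmoid_exact(x)) for x in grid)
print(f"max error = {max_err:.5f}")
```

The error is largest near zero, where the sigmoid is steepest, and shrinks as the table grows or if linear interpolation is added.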
A single neuron in the accelerator uses about 7,000 LUTs, 4 BRAMS, and 21 DSP slices. I hope the design can evaluate 7 or 8 neurons in parallel depending on room in the FPGA. That’s roughly 100 floating point ops being calculated in parallel. It does take about 40-50 clock cycles for a neuron calculation percolating through the MAC and sigmoid in the neuron.

Author:  robfinch [ Mon May 22, 2017 12:59 am ]
Post subject:  Re: DSD7

Built the system with the neural network accelerator included. The accelerator models seven neurons at a time. The accelerator is accessed as an array of registers. Load the weights and input values with a LDWA instruction then store off the sigmoid results about 75 clocks later with the STSIG instruction. It uses up about 50,000 LUTs. There's still room in the FPGA. But there's still a lot of devices to get working (Ethernet, graphics accelerator, ).
I found out why the floating point test portion of the bootrom wasn't working. It seems I renamed a clock signal used by the floating point unit so that it was no longer being clocked. With all my tweaking I've busted the system somehow so that it now hangs after the first LED display. I'm guessing it's a problem with the cache.
Code:
+----------------------------+-------+-------+-----------+-------+
|          Site Type         |  Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| Slice LUTs*                | 83164 |     0 |    134600 | 61.79 |
|   LUT as Logic             | 80779 |     0 |    134600 | 60.01 |
|   LUT as Memory            |  2385 |     0 |     46200 |  5.16 |
|     LUT as Distributed RAM |   724 |     0 |           |       |
|     LUT as Shift Register  |  1661 |     0 |           |       |
| Slice Registers            | 43106 |     0 |    269200 | 16.01 |
|   Register as Flip Flop    | 43072 |     0 |    269200 | 16.00 |
|   Register as Latch        |    34 |     0 |    269200 |  0.01 |
| F7 Muxes                   |  2497 |     0 |     67300 |  3.71 |
| F8 Muxes                   |   998 |     0 |     33650 |  2.97 |
+----------------------------+-------+-------+-----------+-------+
+-------------------+------+-------+-----------+-------+
|     Site Type     | Used | Fixed | Available | Util% |
+-------------------+------+-------+-----------+-------+
| Block RAM Tile    | 99.5 |     0 |       365 | 27.26 |
|   RAMB36/FIFO*    |   85 |     0 |       365 | 23.29 |
|     RAMB36E1 only |   85 |       |           |       |
|   RAMB18          |   29 |     0 |       730 |  3.97 |
|     RAMB18E1 only |   29 |       |           |       |
+-------------------+------+-------+-----------+-------+
+----------------+------+-------+-----------+-------+
|    Site Type   | Used | Fixed | Available | Util% |
+----------------+------+-------+-----------+-------+
| DSPs           |  179 |     0 |       740 | 24.19 |
|   DSP48E1 only |  179 |       |           |       |
+----------------+------+-------+-----------+-------+
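
As a sanity check, the Util% column and the accelerator estimate can be recomputed from the numbers quoted in this thread. The only assumption is that seven neurons scale linearly from the earlier single-neuron figures (about 7,000 LUTs, 4 BRAMs, and 21 DSP slices each).

```python
def util_percent(used, available):
    """Utilization as a percentage, rounded as in the report."""
    return round(100.0 * used / available, 2)


# (used, available, reported Util%) taken from the report above.
report = {
    "Slice LUTs": (83164, 134600, 61.79),
    "LUT as Logic": (80779, 134600, 60.01),
    "LUT as Memory": (2385, 46200, 5.16),
    "Slice Registers": (43106, 269200, 16.01),
    "Block RAM Tile": (99.5, 365, 27.26),
    "DSPs": (179, 740, 24.19),
}
mismatches = [name for name, (used, avail, pct) in report.items()
              if abs(util_percent(used, avail) - pct) > 0.005]

# A RAMB18 counts as half a tile: 85 + 29 / 2 = 99.5 tiles.
bram_tiles = 85 + 29 / 2

# Linear scaling of the single-neuron figures to seven neurons (assumed).
NEURONS = 7
est_luts = NEURONS * 7000
est_brams = NEURONS * 4
est_dsps = NEURONS * 21
print(mismatches, bram_tiles, est_luts, est_brams, est_dsps)
```

The 49,000 LUT estimate lines up with the "about 50,000 LUTs" observed for the accelerator.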

Author:  robfinch [ Tue May 23, 2017 5:00 pm ]
Post subject:  Re: DSD7

2017/05/23
Decided to take a break from DSD and work on the FT832 system. FT832 is a 65832 backwards compatible processing core. It needed to be ported to the current FPGA board. After several fixes it’s almost working. It gets to the BIOS prompt, and interrupts are running but when I press a key an equals sign begins marching across the screen. I’m reasonably certain that it’s the keyboard that’s sending the character repeatedly. I think this occurs because the keyboard needs to be reset so I’m in the process of adding keyboard reset code. This didn’t seem to be required for the old board.

Author:  robfinch [ Sat Jul 01, 2017 6:07 am ]
Post subject:  Re: DSD7

Back to working on this project but I'm not having much luck today. DSD9 doesn't synthesize; the entire machine locks up, forcing a hard power off and on. I wish they could make the toolset so it doesn't lock up the whole machine when it runs into a problem. DSD7 is approaching the four hour mark to place and route. It looks like the toolset can't P & R the design even though it's using up only about 15% of the FPGA's capacity. Project too complex error again. Time to try sprinkling some more FFs around.

All times are UTC