 DSD7 
Author Message

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Interesting choice - why use high order bits to select the set? Feels like that would spread things around less. (Unless I'm confused...)


Tue May 16, 2017 3:28 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
why use high order bits to select the set?


Using high order bits likely doesn’t make a difference, especially in an FPGA, but using them might allow the same cam tag ram to be active for longer periods of time. I think higher order address bits switch less frequently than lower order ones. The sram for the cache doesn’t care which bits it’s fed as long as it’s consistent for reads and writes. It’s maybe cognitively harder to envision the use of the higher order bits since it does spread the storage around, but I don’t think it matters to the operation.
Pure speculation by me (I have yet to research this). In other words it’s an attempt to lower the frequency of cam switches. Less frequent switching means lower power. Having more sets may also reduce power, since only 1/16 of the tag ram is active rather than ¼.
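To put a rough number on the "higher order bits switch less often" idea, here is a small sketch that counts how often the selected set changes over a sequential fetch stream. The address stream, the 32-bit address width, the 32-byte line and the 16-set geometry are all assumptions made for illustration; none of them come from the actual core.

```python
def set_index(addr, line_bits, set_bits, addr_bits, use_high):
    """Pick the set-select bits from the top or the bottom of the address."""
    mask = (1 << set_bits) - 1
    if use_high:
        return (addr >> (addr_bits - set_bits)) & mask
    return (addr >> line_bits) & mask


def count_switches(addrs, **geometry):
    """Count how many times the selected set changes between accesses."""
    idx = [set_index(a, **geometry) for a in addrs]
    return sum(1 for a, b in zip(idx, idx[1:]) if a != b)


# Hypothetical stream: 4096 sequential 4-byte fetches starting at 0x10000.
addrs = [0x10000 + 4 * i for i in range(4096)]
geom = dict(line_bits=5, set_bits=4, addr_bits=32)

low_switches = count_switches(addrs, use_high=False, **geom)
high_switches = count_switches(addrs, use_high=True, **geom)
print("low-order select switches: ", low_switches)
print("high-order select switches:", high_switches)
```

With low-order selection the set changes at every line boundary, while with high-order selection a stream like this stays in one set the whole time. The flip side is that code confined to a small address range would then compete for a single set, which is the "spread things around less" concern raised in the question.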

_________________
Robert Finch http://www.finitron.ca


Tue May 16, 2017 8:11 pm
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Ok, thanks, that's a reason I hadn't thought of at all.


Tue May 16, 2017 8:27 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The core is now using a fully associative L1 cache based on a cam tag memory. The L1 cache now takes three cycles to update rather than one, because the cam takes two cycles and then the cache itself takes another cycle. Previously the tags could be updated in the same cycle as the cache memory. So it’s two more clock cycles, but only on cache misses. Changing the cache to full associativity reduces the miss rate. With only a 2kB cache the hit rate is pretty good running only the BIOS.
hit rate = 29229/(29229+56) * 100 = 99.8%
I had to write the cam code myself (so it’s clean room implementation). I couldn’t find any on the web which surprised me. I’m not sure it’s the best implementation. (No, it doesn't use hordes of comparators).
The next step will be to use cam memories for the L2 cache.
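For anyone who wants to play with the idea in software, below is a minimal behavioural sketch of a fully associative cache where a dictionary stands in for the cam tag match. It is not the author's HDL. The 32-byte line size, the round-robin replacement and the looping access pattern are assumptions made for illustration.

```python
class CamCache:
    """Fully associative cache model: any line may live in any slot."""

    def __init__(self, num_lines, line_bytes):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.cam = {}                      # tag -> slot, stands in for the CAM match
        self.slot_tag = [None] * num_lines
        self.next_victim = 0               # round-robin replacement (an assumption)
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up an address; fill a slot on a miss. Returns True on a hit."""
        tag = addr // self.line_bytes
        if tag in self.cam:
            self.hits += 1
            return True
        self.misses += 1
        slot = self.next_victim
        old_tag = self.slot_tag[slot]
        if old_tag is not None:
            del self.cam[old_tag]          # evict whatever was in the slot
        self.cam[tag] = slot
        self.slot_tag[slot] = tag
        self.next_victim = (slot + 1) % self.num_lines
        return False

    def hit_rate(self):
        """Hit rate as a percentage, same formula as in the post."""
        total = self.hits + self.misses
        return 100.0 * self.hits / total if total else 0.0


# 2kB cache with hypothetical 32-byte lines -> 64 lines.
cache = CamCache(num_lines=64, line_bytes=32)
for _ in range(10):                        # run a 1kB loop ten times
    for addr in range(0, 1024, 4):
        cache.access(addr)
print(f"hit rate = {cache.hit_rate():.1f}%")
```

The hit-rate formula is the same one used above: hits / (hits + misses) * 100.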

_________________
Robert Finch http://www.finitron.ca


Thu May 18, 2017 7:05 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/05/18
Working on increasing the stability of the system and resolving all the missed timing constraints. The big one missing timing is the multi-port memory controller. It tops out at about 75MHz operation, but it is clocked by a signal from the DDR3 app controller which is a fixed 100MHz. The frequency the DDR controller runs at can’t be altered very easily so I broke the multi-port controller into two separate state machines, one operating at 100MHz and the other much slower at the system clock rate. Still no go. The high-speed state machine still isn’t fast enough. And there’s a bug in the controller now. I waded through about a half dozen web queries on how to lower the frequency of the controller. I’m not the only one who’s run into this problem.
Even though it failed timing the simpler controller did run the ram memory test successfully (sometimes). It’s just not reliable enough.
The L2 instruction cache is now a 16kB, 32-way, 16-set associative cache utilizing cam memories for the tags. While it worked fine in simulation, testing the cache in hardware reveals that it’s a monster that affects timing. 20,000 LUTs are required to implement the cam memory for L2 and there are timing errors all over the design now. For the next unstep I will be switching the L2 cache to a 16kB, 4-way set associative cache that uses regular comparators rather than cam memories. The lower associativity isn’t as important to the larger cache.
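As a sanity check on the two L2 geometries, here is a quick calculation sketch. It assumes 32-byte (256-bit) lines, which matches the cache line width mentioned elsewhere in this thread but is not stated explicitly for the L2 cache.

```python
def num_sets(cache_bytes, ways, line_bytes):
    """Number of sets = capacity / (ways * line size)."""
    sets, remainder = divmod(cache_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


CACHE_BYTES = 16 * 1024
LINE_BYTES = 32            # assumption: 256-bit lines

cam_sets = num_sets(CACHE_BYTES, ways=32, line_bytes=LINE_BYTES)
cmp_sets = num_sets(CACHE_BYTES, ways=4, line_bytes=LINE_BYTES)
print("32-way:", cam_sets, "sets")   # 16 sets
print(" 4-way:", cmp_sets, "sets")   # 128 sets
```

Both layouts hold the same 512 lines. The difference is how many tags have to be matched per lookup: 32 for the cam version versus 4 comparators for the conventional one.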
2017/05/19
Adding neural network accelerator instructions to the instruction set. Having had a look at: http://anycpu.org/forum/viewtopic.php?f=17&t=380 It looks like neural nets may be the wave of the future/present. The first instruction is nnmac (for neural network multiply-accumulate). nnmac performs seven parallel floating point multiplies, then sums the results together using an adder tree. The output of the instruction may be cycled back as an adder input to allow more than seven inputs to be processed for a neuron. The output is a sigmoid activation level.
The reason to process seven inputs at a time is that the data will fit into 256-bit cache lines (seven 32-bit floats occupy 224 bits). So a whole cache line can be used to load weights or input data.
Attachment:
NNMAC.png
NNMAC.png [ 23.34 KiB | Viewed 11485 times ]
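Here is a rough software model of what nnmac is described as doing: seven multiplies, an adder tree, and a feedback input so that a neuron with more than seven inputs can be handled in several passes. This is only a sketch. In particular it assumes the value fed back is the running sum before the sigmoid is applied, and the function names are made up for illustration.

```python
import math


def adder_tree(values):
    """Sum pairwise, level by level, the way a hardware adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)              # pad odd levels with a zero leaf
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def nnmac(weights, inputs, acc=0.0):
    """Seven parallel multiplies, summed together with the feedback input."""
    assert len(weights) == len(inputs) == 7
    products = [w * x for w, x in zip(weights, inputs)]
    return adder_tree(products + [acc])    # 7 products + feedback = 8 leaves


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron(weights, inputs):
    """Process the inputs seven at a time, cycling the sum back in."""
    acc = 0.0
    for i in range(0, len(weights), 7):
        w = list(weights[i:i + 7])
        x = list(inputs[i:i + 7])
        w += [0.0] * (7 - len(w))          # pad the last group with zeros
        x += [0.0] * (7 - len(x))
        acc = nnmac(w, x, acc)
    return sigmoid(acc)


print(neuron([0.5] * 10, [1.0] * 10))      # 10 inputs -> two nnmac passes
```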

The L2 Icache was switched to 16kB, 4-way set associative. But now the system is busted back to the clear screen level.

_________________
Robert Finch http://www.finitron.ca


Fri May 19, 2017 12:55 pm
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
20000 LUTs under the sea... I wonder how big a module can be and still be reasonable in implementation. My eight-hour place and route of the Elf was, I think, unreasonable.


Fri May 19, 2017 5:29 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Large distributed RAMs are slow timing-wise because of all the cascaded LUTs, and the tools will put a lot of effort into trying to improve the timing. A large cache isn't really suitable to be made from distributed RAM because of the slow timing. However, switching to block RAMs for the cache costs clock cycles in access time anyway. It seems like one just can't win.
It took about 2.5 hours to place and route my design with cams made out of LUT ram. After changing back to block RAMs in the cache, the design only took 1 hr to P&R. In some instances I've taken to using the vendor's tools to generate RAM components rather than have them inferred automatically. It seems to do a better job.

Quote:
I wonder how big a module can be and still be reasonable in implementation

That consideration may be up to the designer what they think is reasonable.
I've been thinking I would not want a larger FPGA now because of the time it takes to build a system. Once the FPGA is full it might take all day to build. Kinda puts the kibosh on interactive development. It may be more rewarding to use a smaller FPGA / design.

_________________
Robert Finch http://www.finitron.ca


Sat May 20, 2017 11:23 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Do you use any floorplanning guidance? It's possible that doing that would improve the P&R runtime and might also improve the result.


Sat May 20, 2017 11:54 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/05/20
Quote:
Do you use any floorplanning guidance?

I’ve never floor-planned a design yet. I’ve viewed floor-planning as one of the last stages to improving a design. It’s an optimization process. One has to know what all the components of the design are and the components have to be fairly stable. Otherwise one could be stuck in a loop re-floor-planning the design as changes are made. One can work to make a design FPGA friendly (for instance using 6 input functions) while coding.

Continuing to tweak the ICache. Modified the cache line length that is stored to 240 bits from 256 bits to reduce the cache size and save some resources. Only 120 of the 128 bits in a hexibyte value are used for instructions, instructions being 40 bits in size (three per hexibyte). Six instructions can be stored in only 240 bits.
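The packing arithmetic can be sketched as follows. The bit ordering (first instruction in the low bits) and the position of the 8 unused bits (at the top of each 128-bit value) are assumptions made for illustration; the post does not say how the fields are laid out.

```python
INSN_BITS = 40
MASK40 = (1 << INSN_BITS) - 1
MASK120 = (1 << 120) - 1


def pack_line(insns):
    """Pack six 40-bit instructions into one 240-bit stored line."""
    assert len(insns) == 6
    line = 0
    for i, insn in enumerate(insns):
        assert 0 <= insn <= MASK40
        line |= insn << (i * INSN_BITS)
    return line


def unpack_line(line):
    """Recover the six 40-bit instructions from a 240-bit line."""
    return [(line >> (i * INSN_BITS)) & MASK40 for i in range(6)]


def strip_padding(hexibytes):
    """Keep the 120 used bits of each 128-bit value (two values per line)."""
    line = 0
    for i, value in enumerate(hexibytes):
        line |= (value & MASK120) << (i * 120)
    return line


insns = [0xFFFFFFFFFF, 1, 2, 3, 4, 0xABCDEF0123]
line = pack_line(insns)
print(line.bit_length() <= 240, unpack_line(line) == insns)
```

Dropping the 8 unused bits from each hexibyte saves 16 of every 256 stored bits, or 6.25% of the cache data storage.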

Spent the morning researching and creating a high-speed sigmoid function evaluator for the neural network accelerator. It uses the simple table lookup method. There’s no PWL (piece-wise linear) approximation to it, so it’s a step function. This will tend to limit accuracy in the neuron.
A single neuron in the accelerator uses about 7,000 LUTs, 4 BRAMS, and 21 DSP slices. I hope the design can evaluate 7 or 8 neurons in parallel depending on room in the FPGA. That’s roughly 100 floating point ops being calculated in parallel. It does take about 40-50 clock cycles for a neuron calculation percolating through the MAC and sigmoid in the neuron.
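To get a feel for how much accuracy a pure table lookup gives up, here is a small sketch of a stepped sigmoid table with no interpolation. The table size (1024 entries) and the input range (-8 to 8) are hypothetical; the post does not give the real table dimensions.

```python
import math

TABLE_SIZE = 1024                  # hypothetical table depth
X_MIN, X_MAX = -8.0, 8.0           # hypothetical input range
STEP = (X_MAX - X_MIN) / TABLE_SIZE


def exact_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Each entry holds the sigmoid sampled at the centre of its step.
TABLE = [exact_sigmoid(X_MIN + (i + 0.5) * STEP) for i in range(TABLE_SIZE)]


def sigmoid_lut(x):
    """Stepped lookup: every x inside one step returns the same value."""
    if x <= X_MIN:
        return TABLE[0]
    if x >= X_MAX:
        return TABLE[-1]
    return TABLE[int((x - X_MIN) / STEP)]


def max_error(samples=10000):
    """Worst absolute error of the lookup over the table's input range."""
    worst = 0.0
    for k in range(samples):
        x = X_MIN + (X_MAX - X_MIN) * k / samples
        worst = max(worst, abs(exact_sigmoid(x) - sigmoid_lut(x)))
    return worst


print("worst-case error:", max_error())
```

The error is largest near x = 0, where the sigmoid is steepest (slope 0.25), so it is bounded by roughly a quarter of half the step width. A PWL stage would reduce that error without needing a deeper table.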

_________________
Robert Finch http://www.finitron.ca


Sat May 20, 2017 11:50 pm
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Built the system with the neural network accelerator included. The accelerator models seven neurons at a time. The accelerator is accessed as an array of registers. Load the weights and input values with an LDWA instruction, then store off the sigmoid results about 75 clocks later with the STSIG instruction. It uses up about 50,000 LUTs. There's still room in the FPGA. But there are still a lot of devices to get working (Ethernet, graphics accelerator, ...).
I found out why the floating point test portion of the bootrom wasn't working. It seems I renamed a clock signal used by the floating point unit so that it was no longer being clocked. With all my tweaking I've busted the system somehow so that it now hangs after the first LED display. I'm guessing it's a problem with the cache.
Code:
+----------------------------+-------+-------+-----------+-------+
|          Site Type         |  Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| Slice LUTs*                | 83164 |     0 |    134600 | 61.79 |
|   LUT as Logic             | 80779 |     0 |    134600 | 60.01 |
|   LUT as Memory            |  2385 |     0 |     46200 |  5.16 |
|     LUT as Distributed RAM |   724 |     0 |           |       |
|     LUT as Shift Register  |  1661 |     0 |           |       |
| Slice Registers            | 43106 |     0 |    269200 | 16.01 |
|   Register as Flip Flop    | 43072 |     0 |    269200 | 16.00 |
|   Register as Latch        |    34 |     0 |    269200 |  0.01 |
| F7 Muxes                   |  2497 |     0 |     67300 |  3.71 |
| F8 Muxes                   |   998 |     0 |     33650 |  2.97 |
+----------------------------+-------+-------+-----------+-------+
+-------------------+------+-------+-----------+-------+
|     Site Type     | Used | Fixed | Available | Util% |
+-------------------+------+-------+-----------+-------+
| Block RAM Tile    | 99.5 |     0 |       365 | 27.26 |
|   RAMB36/FIFO*    |   85 |     0 |       365 | 23.29 |
|     RAMB36E1 only |   85 |       |           |       |
|   RAMB18          |   29 |     0 |       730 |  3.97 |
|     RAMB18E1 only |   29 |       |           |       |
+-------------------+------+-------+-----------+-------+
+----------------+------+-------+-----------+-------+
|    Site Type   | Used | Fixed | Available | Util% |
+----------------+------+-------+-----------+-------+
| DSPs           |  179 |     0 |       740 | 24.19 |
|   DSP48E1 only |  179 |       |           |       |
+----------------+------+-------+-----------+-------+

_________________
Robert Finch http://www.finitron.ca


Mon May 22, 2017 12:59 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/05/23
Decided to take a break from DSD and work on the FT832 system. FT832 is a 65832 backwards compatible processing core. It needed to be ported to the current FPGA board. After several fixes it’s almost working. It gets to the BIOS prompt, and interrupts are running but when I press a key an equals sign begins marching across the screen. I’m reasonably certain that it’s the keyboard that’s sending the character repeatedly. I think this occurs because the keyboard needs to be reset so I’m in the process of adding keyboard reset code. This didn’t seem to be required for the old board.

_________________
Robert Finch http://www.finitron.ca


Tue May 23, 2017 5:00 pm
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Back to working on this project but I'm not having much luck today. DSD9 doesn't synthesize; the entire machine locks up, forcing a hard power off and on. I wish they could make the toolset so it doesn't lock up the whole machine when it runs into a problem. DSD7 is approaching the four hour mark to place and route. It looks like the toolset can't P&R the design even though it's using up only about 15% of the FPGA's capacity. Project too complex error again. Time to try sprinkling some more FF's around.

_________________
Robert Finch http://www.finitron.ca


Sat Jul 01, 2017 6:07 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I fired up the ole system builder and rebuilt the DSD9 system. I had to modify it a little bit as the last time it was built on a different workstation and the project directory was moved. After building DSD9 I realized I built it for the wrong FPGA board. So I've kicked off another build, this time for the current board.
I added a quick immediate float format for float instructions, the same idea as for g-core.
IIRC I was able to get some software running on DSD9.
I note the set of SoC peripherals is seriously out of date, but they work fine.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 24, 2020 5:17 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added octa-byte load and store instructions (LDO, STO), mainly to deal with 64-bit peripherals. The DSD9 native word size is 80 bits, resulting in deci-byte load and store instructions (LDD, STD). The DSD9 SoC was originally composed of only 32-bit peripherals so the 64-bit instructions weren’t needed. Updated the memory controller to the most recent version. Updated the assembler and compiler for a newer version of Visual Studio.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 25, 2020 4:45 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Found an error in the assembler’s 128-bit arithmetic. A signed shift right was being used for a shift right operation where it should have been an unsigned shift. This did not affect prior versions as the right shift was not being used.
Constants larger than 64 bits weren’t being encoded properly.
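The difference between the two shifts is easy to show with a 128-bit value held as an unsigned bit pattern. This is only an illustration of the general signed versus unsigned right shift behaviour, not the assembler's actual code, and the function names are made up.

```python
MASK128 = (1 << 128) - 1


def lsr128(value, n):
    """Logical (unsigned) shift right: zeros enter from the top."""
    return (value & MASK128) >> n


def asr128(value, n):
    """Arithmetic (signed) shift right: the sign bit is replicated."""
    value &= MASK128
    if value >> 127:                 # bit 127 set: treat as negative
        value -= 1 << 128
    return (value >> n) & MASK128


big = 1 << 127                       # a constant with the top bit set
print(hex(lsr128(big, 64)))          # only bit 63 is set
print(hex(asr128(big, 64)))          # bits 63..127 are all set
```

The two shifts agree whenever bit 127 is clear, so a bug like this only shows up for constants that use the top bit.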

_________________
Robert Finch http://www.finitron.ca


Sun Jan 26, 2020 3:34 am
Profile WWW