


nvio

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
The project has been renamed “NVIO”. It now runs up to 75 instructions correctly. It’s slow going clock cycle-by-cycle in simulation, but a lot of bugs have been worked out, most of them introduced by splitting the instruction window. There’s a little bit of indirection in dealing with indexes into the buffers: each buffer contains indexes into the other. The Sieve of Eratosthenes is being used to debug the core.
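For reference, the debug workload itself is simple; a host-side sketch in Python (the core, of course, runs it as NVIO assembly):

```python
def sieve(limit):
    """Classic Sieve of Eratosthenes: return all primes below `limit`.
    A small, branchy, memory-touching loop like this makes a handy
    debug workload for an out-of-order core."""
    flags = [True] * limit
    flags[0:2] = [False, False]          # 0 and 1 are not prime
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            for j in range(i * i, limit, i):
                flags[j] = False         # mark multiples composite
    return [i for i, f in enumerate(flags) if f]
```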

_________________
Robert Finch http://www.finitron.ca


Fri Jun 14, 2019 3:24 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Worked on the floating-point multiplier-adder and normalizer. The normalizer is now pipelined better; it went from two stages to eight.
The idea of using a single floating-point unit containing all the fp operations is under scrutiny. It would be better for performance if some of the units were separated out. The FMA can accept new input every clock cycle, so it’s a bit of a shame to have a combined fp unit wait for the entire latency to expire before accepting another input. Because the FMA has such a large latency, one idea is to also have a separate multiplier and a separate adder with lower latency. Normalized, rounded output from the FMA takes about 26 clock cycles; for an adder by itself, the latency is about 18 clock cycles.
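The throughput argument can be put in a toy model; the 26-cycle latency figure is from above, and whether the unit accepts a new op every cycle is the only variable:

```python
def finish_times(n_ops, latency, pipelined):
    """Completion cycle of each of n_ops back-to-back operations for a
    unit with the given latency. Pipelined: a new op issues every cycle.
    Blocking: the next op waits for the previous one to finish."""
    times = []
    issue = 0
    for _ in range(n_ops):
        times.append(issue + latency)
        issue = issue + 1 if pipelined else issue + latency
    return times
```

With a 26-cycle FMA, three pipelined ops finish at cycles 26, 27, 28; the same three ops through a blocking unit finish at 26, 52, 78.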



Sat Jun 15, 2019 3:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
The core used a system of indirect indexes to make it possible to queue instructions in the dispatch buffer even when a re-order buffer entry wasn’t available at queue time. This required swapping dispatch buffer ids and re-order buffer ids around, and it almost worked. The trouble is that it was too complex a system. Once a re-order buffer entry became available, the core had to go back and patch up entries in the dispatch buffer with re-order buffer ids, and dependency resolution had to deal with two sets of ids depending on the queue state of the instruction. So, scrap that idea. The simpler solution: just refuse to queue a new instruction unless a re-order buffer entry is available, and add more re-order buffer entries to help avoid stalls at the queue stage. Instructions can’t execute anyway unless there is a place to store the result. The dispatch buffer size and the re-order buffer size should correspond because the throughput is consistent from one to the other. It may be possible to make the re-order buffer slightly smaller than the dispatch buffer, since dispatch slots are lost to branch misses and cache misses.
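The simpler policy amounts to making allocation itself the stall condition: if no entry is free, nothing queues, and no deferred id patch-up is ever needed. A toy Python model of the idea (names and sizes are illustrative, not the actual rtl):

```python
class Rob:
    """Minimal re-order buffer model: allocate() returns None (stall)
    when full, so the front end simply refuses to queue rather than
    queuing with a placeholder id to fix up later."""
    def __init__(self, size):
        self.size = size
        self.entries = []
        self.next_id = 0

    def allocate(self):
        if len(self.entries) >= self.size:
            return None                  # no entry free: stall the queue stage
        rid = self.next_id
        self.next_id += 1
        self.entries.append(rid)
        return rid

    def retire(self):
        # retire in program order from the head of the buffer
        return self.entries.pop(0) if self.entries else None
```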



Mon Jun 17, 2019 3:20 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
I finally got enough of the emulator implemented to run the boot code up to the point where the simulation crashes. It proves that there is a hardware issue, as the emulator runs past the crash point with no problems. So, I have to find the hardware bug somewhere in the first 150 instructions executed. Found one bug already: the data strobe signal wasn’t being delayed enough through the tlb.

Added two addressing modes to the core: indexed indirect with post-increment, and indexed indirect with pre-decrement. LDO $r1,[$r2++,$r3*8] will load a 64-bit value into r1 from the address r2 + r3*8, then add 8 to r2. r3 can be r0, which gives a simpler [$r2++] mode. Also added load and store multiple registers operations. These are a little different: the first register to load or store is specified in the instruction, then the remaining registers are specified in a bitmask. One of the registers may then be loaded or stored twice at different memory locations if desired. It worked out this way in the interest of keeping the hardware simple and fast. When an instruction is fetched, the register fields are available immediately, so the instruction may begin executing. Registers specified in the bitmask are not available until after the next clock cycle, because the bitmask has to be decoded.
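The register selection for load/store multiple can be sketched like this; the field layout is a guess for illustration, not the actual NVIO encoding:

```python
def ldm_regs(first_reg, mask, nregs=32):
    """Registers transferred by a load/store-multiple: the explicitly
    encoded first register, then every register whose bit is set in the
    mask. Note the first register may appear again via the mask, giving
    the 'loaded twice at different addresses' behaviour described above."""
    regs = [first_reg]                   # available immediately at fetch
    for r in range(nregs):               # mask regs need a decode cycle
        if mask & (1 << r):
            regs.append(r)
    return regs
```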



Tue Jun 18, 2019 4:05 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Loads were not working: the defined constant for a load was incorrect, so the push-constant instruction was being treated as a load. This caused the stack pointer to become zeroed out during a push operation.
Finally got the core to run for more than a few instructions and tried a couple of different configurations, measuring clocks per instruction. Note that memory access dominates, since it’s taking about 12 clock cycles per access.
a) CPI = 6.6 for 1-way scalar out-of-order: 1 ALU, 1 AGEN, 1 Mem channel
b) CPI = 2.6 for 2-way superscalar out-of-order: 2 ALU, 2 AGEN, 2 Mem channels
I tried a 3-way configuration but it crashed after about 30 µs. Going 2 ways more than doubled performance.
What about the fabulous 'less than one clock per instruction'? An average CPI of 2.6 shows that the core is hiding a lot of the memory access time.
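To put a rough number on the latency hiding: if every memory access stalled the pipeline for its full 12 cycles, CPI would look more like the model below. The memory-instruction fraction used here is an assumed figure, purely for illustration:

```python
def serial_cpi(base_cpi, mem_fraction, mem_latency):
    """CPI if every memory access stalled for its full latency with no
    overlap at all. Comparing this against the measured 2.6 indicates
    how much of the 12-cycle access time the core is hiding.
    mem_fraction (share of instructions touching memory) is a guess."""
    return base_cpi + mem_fraction * mem_latency
```

With an assumed 30% memory-instruction mix, a fully serialized machine would sit near CPI 4.6, so a measured 2.6 implies a good deal of overlap.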



Wed Jun 19, 2019 4:15 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1277
> Going 2 ways more than doubled performance
A great result!

> An average CPI of 2.6 shows that the core is hiding a lot of the memory access time.
Indeed, and impressive!


Wed Jun 19, 2019 8:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
The core was sometimes queuing the same instruction twice on a cache miss. For many instructions it doesn’t matter if they execute twice, but for many others it does. Found it because a push instruction was being executed twice, leaving the stack messed up. The bug was in the ip increment during a cache miss. The core was then reconfigured for a minimal version (4 queue entries) and it crashes on a bad return address after executing about 200 instructions. I was kinda surprised by the crash, because the core has been run up to 4,000+ instructions with more queue entries.
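The fix amounts to tying queuing and the ip increment to the same fetch-complete condition, so a miss can never queue the instruction again at the same ip. A toy sketch (the 4-byte instruction step is an assumption, not the actual NVIO instruction size):

```python
def step_fetch(ip, icache_hit):
    """One fetch cycle: on a cache miss, hold the ip and queue nothing;
    on a hit, queue the instruction exactly once and advance.
    Returns (new_ip, queue_this_cycle)."""
    if not icache_hit:
        return ip, False       # miss: ip must not advance, no re-queue
    return ip + 4, True        # hit: queue once and move on
```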

Added a text screen display to the emulator. It’s now possible to see output from the software running in the emulator.



Thu Jun 20, 2019 2:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
There was a problem in the assembler that caused the upper 16 bits of addresses to be truncated; it had to do with conversions to 128-bit integers.
There was also a bracket in the wrong place in the rtl code which caused the upper 16 bits of addresses to be truncated. Having two errors with similar effects at the same time was confusing: I found and fixed the assembler error first, then found out things still didn’t work.
Fixing the bracket’s position changed the pattern of execution. A small tweak like that can cause the number of clock cycles required for processing to increase by one, which shifts everything in the buffers and can hide an issue.
The crash in the three-way configuration was fixed along the way. The core now seems to run in the minimum configuration, which is the configuration that will fit into the FPGA. It has been run for 1 ms, or about 3,000 instructions, so it’s just about ready to try in an FPGA.



Fri Jun 21, 2019 3:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Discovered there were multiple lines of almost the same data in the data cache. On every write update, a line was being assigned randomly even if the data was already present in the cache, rather than updating the current line. The random line picker needed to be disabled for write hits.
Also discovered that reads and writes to the same data cache line were occurring at the same time. The address overlap detection logic for memory issue needed to be altered to match when the cache line is the same, rather than only when the entire address is the same.
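The corrected line-selection policy, as a Python sketch (a tiny linear tag search standing in for the real tag compare):

```python
import random

def pick_line(tags, tag, nlines=4):
    """Choose a data-cache line for a write: on a write hit, reuse the
    matching line; only on a miss pick a random victim. Using the random
    picker even on hits is what produced the duplicate-line bug above."""
    for i, t in enumerate(tags):
        if t == tag:
            return i                       # write hit: update existing line
    return random.randrange(nlines)        # miss: random replacement
```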

Started working on another project, this time to interface a CmodA7 to a 65C816 cpu to create a two “chip” system.



Sat Jun 22, 2019 6:02 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Found out why placing the rom at the input of the L1 cache maybe isn’t a good idea. It’s great for performance, but the rom also has to be placed at the input of the data cache; otherwise it wouldn’t be possible to read data from the rom using load and store instructions. Since the data cache is dual-ported, that means the rom needs to be triple-ported to support the two data ports and an instruction port. It starts to use a lot of block ram.



Sun Jun 23, 2019 3:37 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1277
Could you use some macros (or something) to change your tables of data into load immediates to then store the static data into RAM? That is, lose a bit of density but simplify the machine. Presuming that there would never be a very huge amount of static data in the ROM.
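The suggestion could look something like the generator below, which expands a static table into load-immediate/store pairs that rebuild the data in RAM at startup. The LDI/STO mnemonics and operand syntax here are placeholders, not necessarily the real assembler's:

```python
def table_to_stores(label, data, word_size=8):
    """Turn a static data table (bytes) into assembler lines that
    rebuild it in RAM with immediates: one load-immediate plus one
    store per word. Loses density but removes the need to read the
    rom through the data path."""
    lines = []
    for i in range(0, len(data), word_size):
        chunk = data[i:i + word_size].ljust(word_size, b'\0')
        word = int.from_bytes(chunk, 'little')
        lines.append(f"    LDI  $r1,#{word:#x}")
        lines.append(f"    STO  $r1,{label}+{i}")
    return lines
```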


Sun Jun 23, 2019 6:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Quote:
Could you use some macros (or something) to change your tables of data into load immediates to then store the static data into RAM? That is, lose a bit of density but simplify the machine. Presuming that there would never be a very huge amount of static data in the ROM.

I had not thought of trying that. It’s tempting, but there’s a lot of code that I didn’t actually write that contains tables for things like ASCII character classification. (The rom is about 160k, IIRC.) That might mean re-writing a lot of code, and there are other things, like performing a rom checksum, that couldn’t be done that way. The other thought I had was to dual-port the instruction cache and detect when an instruction is trying to load from the i-cache. The issue then is that only one load at a time would be allowed, reducing performance; the data cache allows two loads to occur at the same time.

Building the soc through to a bitstream reveals the size to be 132k LUTs, just barely within the 136k LUT capacity of the target device. A review of the utilization report shows that the data cache was using about 10x as many LUTs as it should. The L1 data cache must be built out of LUT ram in order to get single-cycle performance, and it looks like the tools aren’t able to synthesize the LUT memory very efficiently. I had to use the distributed memory generator in the IP core generator to create a ram. Since this isn’t parameterized, I had to generate a ram of the maximum size that might be used. It takes longer to use the ip core generator in this case than it does to code the ram by hand, and the result isn’t as flexible. The size of the data cache before using the ip core generator: 56,000 LUTs. After recoding using the generator: 900 LUTs, about 62x smaller.

The core has been run in an FPGA now. It crashed shortly after updating the LEDs on the board. Timing wasn’t 100% met so I’m guessing a timing issue is present. The clock frequency for the core is being reduced to 10MHz to see if that makes a difference.



Mon Jun 24, 2019 3:59 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Same result running at 10MHz, so I’m rebuilding the system with a logic analyzer to see what’s going on.
A high-speed uart (921k baud) with 4kB fifos was added to the soc.



Tue Jun 25, 2019 5:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Something I had not thought of was that the rom can have the appearance of being modified, since it’s loaded into the data cache and lines in the data cache are both readable and writeable. A read-only bit needs to be added to the data cache.
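The read-only bit amounts to a per-line write guard; a minimal sketch:

```python
class DLine:
    """Data-cache line with the read-only bit proposed above: lines
    filled from rom are marked read_only and writes to them are
    refused, so rom contents can never appear modified."""
    def __init__(self, data, read_only=False):
        self.data = data
        self.read_only = read_only

    def write(self, data):
        if self.read_only:
            return False       # rom-backed line: write rejected
        self.data = data
        return True
```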

Decided to come up with a uart core that’s register-compatible with a 6551. The 6551 registers are widened to 32 bits to support more features; the low-order eight bits of the registers are compatible with the 6551. It should be possible to use the core as an eight-bit core by grounding the upper 24 input data bits and the corresponding byte lane selects, with some loss of features. The baud rate generator uses harmonic synthesis to generate the baud clock, and the baud rate table is set up for a 100MHz input clock.
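Assuming "harmonic synthesis" here means a DDS-style phase accumulator, the baud table entries would be computed something like this (the 32-bit accumulator width and 16x oversampling are assumptions, not known details of the core):

```python
def baud_increment(baud, clk_hz=100_000_000, acc_bits=32, oversample=16):
    """Phase-accumulator increment for a DDS-style baud generator:
    each input clock, add this value to an acc_bits-wide accumulator;
    the accumulator's carry/MSB then ticks at baud * oversample on
    average, giving an arbitrarily fine baud rate from a fixed clock."""
    return round(baud * oversample * 2 ** acc_bits / clk_hz)
```

The average output rate is exact to within the rounding of one increment, which for a 32-bit accumulator is far tighter than the few-percent tolerance async serial needs.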

Some serial I/O routines were coded along with a port of a hex downloader.

An attempt is being made to get simulation running to the point of serial transmit for the startup message. Found some issues with the assembler. One was swapped fields for opcode and function in indexed loads.



Thu Jun 27, 2019 5:35 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Debugging time! Fixed a host of errors that prevented the core from working properly. The branch target override bit in the instruction pointer module needed to refer to the ip, not the delayed ip. One fix. A typo prevented the second target register from being invalidated on queue: “Rd” was specified where “Rd2” was needed. In the data cache, on a write hit, the line number used in the cache wasn’t being set properly.
The assembler was outputting double the stack increment for a return instruction.
Shrank the sequence numbers down to a minimal size. Decided to switch the core back to using sequence numbers rather than branch tags, as there was an issue with the branch tag logic that cropped up after it had worked successfully for a long run of instructions. I couldn’t identify where the bug was, so rather than spend a lot of time debugging, the core was simply switched to sequence numbering. It’s a little more logic but a lot easier to understand and debug.
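Once the sequence numbers are shrunk to a minimal size, the age comparison has to tolerate wrap-around; a sketch, with the 4-bit width being a guess at "minimal":

```python
def is_older(a, b, bits=4):
    """True if sequence number a is older than b under modular
    wrap-around: a is older when the forward distance from a to b is
    less than half the number space (the usual serial-number compare).
    Requires that no more than half the space is live at once."""
    half = 1 << (bits - 1)
    dist = (b - a) & ((1 << bits) - 1)   # forward distance mod 2**bits
    return a != b and dist < half
```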
The core now makes use of dual result busses to allow: auto-increment / auto-decrement addressing, pop, link, and unlink stack instructions, and potentially other instructions, like returning both the quotient and remainder of a divide operation.



Sat Jun 29, 2019 2:55 am