AnyCPU
http://anycpu.org/forum/

nvio
http://anycpu.org/forum/viewtopic.php?f=23&t=606
Page 3 of 9

Author:  robfinch [ Fri Jun 14, 2019 3:24 am ]
Post subject:  Re: rtfItanium / NVIO

The project has been renamed “NVIO”. The core now runs up to 75 instructions correctly. It is slow going clock-cycle-by-clock-cycle in simulation, but a lot of bugs have been worked out, most of them introduced by splitting the instruction window. There's a little bit of indirection in dealing with indexes into the buffers: each buffer contains indexes into the other. The Sieve of Eratosthenes is being used as the test program to debug the core.

Author:  robfinch [ Sat Jun 15, 2019 3:08 am ]
Post subject:  Re: NVIO

Worked on the floating-point multiplier-adder (FMA) and normalizer. The normalizer is now pipelined better; it went from two stages to eight stages.
The idea of using a single floating-point unit containing all the fp operations is under scrutiny. It would be better for performance if some of the units were separated out. The FMA itself can accept new input every clock cycle, so it’s a bit of a shame to have the combined fp unit wait for the entire latency to expire before accepting another input. Because the FMA has such a large latency, one idea is to also have a separate multiplier and a separate adder with lower latency. Normalized, rounded output from the FMA takes about 26 clock cycles; for an adder by itself, the latency is about 18 clock cycles.
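To picture the throughput-versus-latency point, here is a minimal Verilog sketch, not the actual RTL, of a fully pipelined unit: a new operation can be accepted every clock, and a shift register tracks when each result emerges. The latency value follows the figures above; all signal names are illustrative.

Code:
// One busy bit per cycle of latency; fma_issue is an illustrative name.
parameter FMA_LAT = 26;               // a separate adder would be ~18
reg [FMA_LAT-1:0] fma_busy;
always @(posedge clk)
  fma_busy <= {fma_busy[FMA_LAT-2:0], fma_issue};  // accept one op per clock
wire fma_result_valid = fma_busy[FMA_LAT-1];       // emerges 26 clocks later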

Author:  robfinch [ Mon Jun 17, 2019 3:20 am ]
Post subject:  Re: NVIO

The core used a system of indirect indexes to make it possible to queue instructions in the dispatch buffer even if a re-order buffer entry wasn’t available at queue time. This system required swapping dispatch buffer ids and re-order buffer ids around, and it almost worked. The issue is that it was too complex a system. Once a re-order buffer entry became available, the core had to go back and patch up entries in the dispatch buffer with re-order buffer ids, and dependency resolution had to deal with two sets of ids depending on the queue state of the instruction. So, scrap that idea. The simpler solution: just refuse to queue a new instruction unless there is a re-order buffer entry available, and add more re-order buffer entries to help avoid stalls at the queue stage. Instructions can’t execute anyway unless there is a place to store the result. The dispatch buffer size and the re-order buffer size should correspond because the throughput is consistent from one to the other. It may be possible to make the re-order buffer slightly smaller than the dispatch buffer, since dispatch slots are lost to branch misses and cache misses.
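As a sketch, the simplified policy reduces to a single queue condition; the names here are illustrative, not the core's actual signals.

Code:
// Queue only when a re-order buffer entry is free, so the dispatch id
// and rob id can be paired once at queue time and never patched later.
wire rob_full  = (rob_count == ROB_ENTRIES);
wire can_queue = dispatch_slot_valid & ~rob_full;
always @(posedge clk)
  if (can_queue) begin
    dispatch_rid[dispatch_tail] <= rob_tail;   // paired at queue time
    rob_tail <= rob_tail + 1;
  end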

Author:  robfinch [ Tue Jun 18, 2019 4:05 am ]
Post subject:  Re: NVIO

I finally got enough of the emulator implemented to run the boot code up to the point where the simulation crashes. It proves that there is a hardware issue, as the emulator runs past the crash point with no problems. So, I have to find the hardware bug somewhere in the first 150 instructions executed. Found one bug where the data strobe signal wasn’t being delayed enough through the tlb.

Added two addressing modes to the core: indexed indirect with post-increment, and indexed indirect with pre-decrement. LDO $r1,[$r2++,$r3*8] will load a 64-bit value into r1 from the address r2 + r3*8, then add 8 to r2. r3 could be r0, which gives a simpler [$r2++] mode. Also added load and store multiple registers operations. These are a little different in that the first register to load or store is specified in the instruction, and the remaining registers are specified in a bitmask. One of the registers may then be loaded or stored twice at different memory locations if desired. It worked out this way in the interest of keeping the hardware simple and fast: when an instruction is fetched, the register fields are available immediately, so the instruction may begin executing, while the registers specified in the bitmask are not available until after the next clock cycle because the bitmask has to be decoded.
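A sketch of the address arithmetic and the bitmask decode, with illustrative names (the actual RTL may differ):

Code:
// Post-increment indexed: $r2 + $r3*8, with rb written back afterwards.
wire [63:0] ea = rfo_rb + (rfo_ri << sc);   // sc==3 for the *8 scale
// rb writeback value: rfo_rb + size (e.g. +8 for LDO).
// For load/store multiple, isolating the lowest set bit of the mask
// picks the next register to transfer -- one reason the bitmask costs
// an extra decode cycle:
wire [63:0] next1hot = mask & (~mask + 64'd1);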

Author:  robfinch [ Wed Jun 19, 2019 4:15 am ]
Post subject:  Re: NVIO

Loads were not working. The defined constant for a load opcode was incorrect, so the push-constant instruction was being treated as a load. This caused the stack pointer to become zeroed out during a push operation.
Finally got the core to run for more than a few instructions and tried a couple of different configurations, measuring the clocks per instruction (CPI). Note that memory access dominates, since it’s taking about 12 clock cycles per access.
a) CPI = 6.6 for 1-way scalar out-of-order: 1 ALU, 1 AGEN, 1 Mem channel
b) CPI = 2.6 for 2-way superscalar out-of-order: 2 ALU, 2 AGEN, 2 Mem channels
I tried a 3-way configuration, but it crashed after about 30 us. Going 2 ways more than doubled performance.
What about the fabulous 'less than one clock per instruction'? Not yet, but an average CPI of 2.6 shows that the core is hiding a lot of the memory access time.

Author:  BigEd [ Wed Jun 19, 2019 8:08 am ]
Post subject:  Re: rtfItanium

> Going 2 ways more than doubled performance
A great result!

> An average CPI of 2.6 shows that the core is hiding a lot of the memory access time.
Indeed, and impressive!

Author:  robfinch [ Thu Jun 20, 2019 2:57 am ]
Post subject:  Re: rtfItanium

The core was sometimes queuing the same instruction twice on a cache miss. For many instructions it doesn’t matter if they execute twice, but for many others it does. This was found because a push instruction was being executed twice, leaving the stack messed up. The bug was in the ip increment during a cache miss. The core was then reconfigured for a minimal version (4 queue entries), and it crashes on a bad return address after executing about 200 instructions. I was kinda surprised by the crash because the core had been run up to about 4,000+ instructions with more queue entries.
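The shape of the fix, as a sketch rather than the actual code: hold the ip still across a miss so the refilled slot queues exactly once. Signal names are illustrative.

Code:
// Advance ip only on a cache hit; on a miss it holds, so the same
// instruction cannot be queued a second time when the line arrives.
always @(posedge clk)
  if (ihit & ~stall)
    ip <= ip + fetch_len;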

Added a text screen display to the emulator. It’s now possible to see output from the software running in the emulator.

Author:  robfinch [ Fri Jun 21, 2019 3:12 am ]
Post subject:  Re: rtfItanium

There was a problem in the assembler that caused the upper 16 bits of addresses to be truncated. It had to do with conversions to 128-bit integers.
There was also a bracket in the wrong place in the rtl code which caused the upper 16 bits of addresses to be truncated. Having two errors with similar effects at the same time was confusing. I found and fixed the assembler error first, then found out things still didn’t work.
Fixing the bracket’s position changed the pattern of execution.
Even a small tweak can increase the number of clock cycles required for processing by one, which shifts everything in the buffers and can hide the issue.
The crash in the three-way configuration was fixed along the way. The core now seems to run in the minimum configuration, which is the one that will fit into the FPGA. It has been run for 1 ms, or about 3,000 instructions. So, it’s just about ready to try in an FPGA.
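Purely for illustration (this is not the actual line), here is one way a misplaced bracket in Verilog silently truncates the upper 16 bits of an address:

Code:
wire [15:0] seg;
wire [31:0] offset, ea;
// Bracket one token too far right: the add happens inside the
// concatenation, widening it to 48 bits, and the 32-bit assignment
// then silently drops the top 16 bits.
assign ea = {seg, 16'h0000 + offset};
// Intended grouping:
//   assign ea = {seg, 16'h0000} + offset;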

Author:  robfinch [ Sat Jun 22, 2019 6:02 am ]
Post subject:  Re: nvio

Discovered there were multiple lines of almost the same data in the data cache. On every write update, a line was being assigned randomly, even if the data was already present in the cache, rather than updating the existing line. The random line picker needed to be disabled for updates that were write hits.
Also discovered that reads and writes to the same data cache line were occurring at the same time. The address overlap detection logic for memory issue needed to be altered to match when the cache line was the same, rather than only when the entire address was the same.
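In sketch form, the widened check just ignores the offset bits within a line; the names and line size are illustrative.

Code:
// Treat any access to the same cache line as a conflict, serializing
// the read against the pending write.
localparam LINE = 6;   // log2(line size in bytes)
wire same_line   = (rd_adr[AMSB:LINE] == wr_adr[AMSB:LINE]);
wire rd_issue_ok = ~(wr_pending & same_line);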

Started working on another project, this time to interface a CmodA7 to a 65C816 cpu to create a two “chip” system.

Author:  robfinch [ Sun Jun 23, 2019 3:37 am ]
Post subject:  Re: nvio

Found out why placing the rom at the input of the L1 instruction cache maybe isn’t a good idea. It’s great for performance, but the rom also has to be placed at the input of the data cache, otherwise it wouldn’t be possible to read data from the rom using load and store instructions. Since the data cache is dual-ported, that means the rom needs to be triple-ported to support the two data ports and an instruction port. It starts to use a lot of block ram.

Author:  BigEd [ Sun Jun 23, 2019 6:16 am ]
Post subject:  Re: nvio

Could you use some macros (or something) to change your tables of data into load immediates to then store the static data into RAM? That is, lose a bit of density but simplify the machine. Presuming that there would never be a very huge amount of static data in the ROM.

Author:  robfinch [ Mon Jun 24, 2019 3:59 am ]
Post subject:  Re: nvio

Quote:
Could you use some macros (or something) to change your tables of data into load immediates to then store the static data into RAM? That is, lose a bit of density but simplify the machine. Presuming that there would never be a very huge amount of static data in the ROM.

I had not thought of trying that. It’s tempting, but there’s a lot of code that I didn’t actually write that contains tables for things like ascii character classification (the rom is about 160k IIRC). That might mean rewriting a lot of code, and there are other things, like performing a rom checksum, that couldn’t be done that way. The other thought I had was to dual-port the instruction cache and detect when an instruction is trying to load from the i-cache. The issue then is that only one load at a time would be allowed, reducing performance; the data cache allows two loads to occur at the same time.

Building the soc through to a bitstream reveals the size to be 132k LUTs, just barely within the 136k-LUT capacity of the target device. A review of the utilization report shows that the data cache appears to be using about 10x as many LUTs as it should. The L1 data cache must be built out of LUT ram in order to get single-cycle performance, and it looks like the tools aren’t able to synthesize the LUT memory very efficiently from my code. I had to use the distributed memory generator in the IP core generator to create a ram. Since this isn’t parameterized, I had to generate a ram of the maximum size that might be used. It takes longer to use the IP core generator in this case than it does to code the ram by hand, and the result isn’t as flexible. The size of the data cache before using the generator: 56,000 LUTs. After recoding using the generator: 900 LUTs, about 62x smaller.
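For contrast, a hand-coded distributed (LUT) ram in Verilog looks like the sketch below, using the Xilinx ram_style hint; whether the tools map it efficiently is exactly the problem described above. Sizes are illustrative, not the cache's actual geometry.

Code:
(* ram_style = "distributed" *)
reg [63:0] mem [0:511];
always @(posedge clk)
  if (we)
    mem[wa] <= din;             // synchronous write port
wire [63:0] dout = mem[ra];     // asynchronous read: single-cycle data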

The core has been run in an FPGA now. It crashed shortly after updating the LEDs on the board. Timing wasn’t 100% met, so I’m guessing a timing issue is present. The clock frequency for the core is being reduced to 10MHz to see if that makes a difference.

Author:  robfinch [ Tue Jun 25, 2019 5:15 am ]
Post subject:  Re: nvio

Same result running at 10MHz, so I’m rebuilding the system with a logic analyzer to see what’s going on.
A high-speed uart (921k baud) with 4kB fifos was added to the soc.

Author:  robfinch [ Thu Jun 27, 2019 5:35 am ]
Post subject:  Re: nvio

Something I had not thought of was that the rom can have the appearance of being modified since it’s loaded into the data cache and lines in the data cache are both readable and writeable. There needs to be a read-only bit added to the data cache.
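A sketch of the guard, with illustrative names: a per-line read-only bit set at fill time blocks write hits to rom-backed lines.

Code:
reg rdonly [0:LINES-1];
always @(posedge clk)
  if (line_fill)
    rdonly[fill_line] <= adr_is_rom;        // mark lines filled from rom
wire wr_ok = wr_hit & ~rdonly[hit_line];    // refuse writes to rom lines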

Decided to come up with a uart core that’s register-compatible with a 6551. The 6551 registers are widened to 32 bits to support more features; the low-order eight bits of the registers are compatible with the 6551. It should be possible to use the core as an eight-bit core by grounding the upper 24 input data bits and the corresponding byte lane selects, with some loss of features. The baud rate generator uses harmonic synthesis to generate the baud clock, and the baud rate table is set up for a 100MHz input clock.
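Assuming “harmonic synthesis” here means an accumulator-style (DDS) rate generator, a minimal sketch for the 921k rate looks like this; the constant is derived from the 100MHz clock mentioned above.

Code:
// 921,600 baud with 16x oversampling wants a 14.7456MHz tick.
// Increment = round(14.7456e6 / 100e6 * 2^24) = 2473901.
reg [23:0] acc;
always @(posedge clk100)
  acc <= acc + 24'd2473901;
wire baud_x16 = acc[23];   // square wave averaging 14.7456 MHz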

Some serial I/O routines were coded along with a port of a hex downloader.

An attempt is being made to get simulation running to the point of the serial transmit of the startup message. Found some issues with the assembler; one was swapped opcode and function fields in indexed loads.

Author:  robfinch [ Sat Jun 29, 2019 2:55 am ]
Post subject:  Re: nvio

Debugging time! Debugged a host of errors that prevented the core from working properly. The branch target override bit in the instruction pointer module needed to refer to the ip, not the delayed ip: one fix. A typo prevented the second target register from invalidating on a queue: “Rd” was specified where it needed to be “Rd2”. In the data cache, on a write hit, the line number used in the cache wasn’t being set properly.
The assembler was outputting double the stack increment for a return instruction.
Decided to switch the core back to using sequence numbers rather than branch tags, as there was an issue with the branch tag logic that cropped up after it had worked successfully for a long run of instructions. I couldn’t identify where the bug was, so rather than spend a lot of time debugging, the core was simply switched back to sequence numbering, with the sequence numbers shrunk down to a minimal size. It’s a little bit more logic, but a lot easier to understand and debug.
The core now makes use of dual result busses to allow auto-increment / auto-decrement addressing; pop, link, and unlink stack instructions; and potentially other instructions, like returning both the quotient and remainder of a divide operation.
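In sketch form (illustrative names), the dual busses are simply two register-file write ports retiring in the same clock:

Code:
// e.g. for LDO $r1,[$r2++,$r3*8]: port 0 writes the loaded data to r1,
// port 1 writes the incremented base address back to r2.
always @(posedge clk) begin
  if (wb0_valid) regs[wb0_Rt] <= wb0_res;
  if (wb1_valid) regs[wb1_Rt] <= wb1_res;
end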
