 g-core 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I've been working on this core in parallel with the uop6502 core: changing g-core, then porting the changes back to rtf65004 (the micro-op 6502).

In the FPGA the core hangs at the first jump instruction: the pipeline advance signal isn’t true, causing a pipeline stall. Pipeline advance should only be false if there were not enough queue entries free to queue a new instruction, so for some reason instructions aren’t leaving the queue properly. I have to verify this theory.

Added floating-point to the core. Decided to use a Goldschmidt divider.
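For reference, the Goldschmidt iteration can be sketched behaviorally in a few lines (this is just the textbook algorithm in Python, not the core’s RTL; the function name is mine):

```python
# Textbook Goldschmidt division, behaviorally: scale numerator and
# denominator by the same factor f = 2 - d each step, driving d toward
# 1 and n toward the quotient. Convergence is quadratic.
def goldschmidt_divide(n, d, iterations=5):
    # Normalize the (positive) divisor into [0.5, 1); scaling both
    # terms by a power of two leaves the quotient unchanged.
    while d >= 1.0:
        n, d = n / 2.0, d / 2.0
    while d < 0.5:
        n, d = n * 2.0, d * 2.0
    for _ in range(iterations):
        f = 2.0 - d
        n, d = n * f, d * f
    return n

print(goldschmidt_divide(355.0, 113.0))   # ~3.1415929..., i.e. 355/113
```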

_________________
Robert Finch http://www.finitron.ca


Thu Jan 09, 2020 6:29 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
It might be handy to have an overview of what these various cores are - I can see you are spinning plates, and it can certainly be interesting to see progress on different fronts, but I feel I'm lacking context.

For example, I might ask: What is each core called, what is it implementing, what characterises the design choices in this core, how big is it.


Thu Jan 09, 2020 9:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I can see where it would be confusing if one was trying to follow along. I hope this helps.

Here’s a list of the most recent cores (oldest to newest):

Thor – 90?k LUTs – 64-bit datapath, predicated instructions; learned a lot of architecture working on this core. (FPGA LUT budget 100k)

DSD9 – ? LUTs – 80-bit datapath, classic overlapped pipeline, testbed for fp; got as far as being able to dump fp results on-screen.

FT64 – ? LUTs – 64-bit datapath, 2?-way superscalar (intended as an improvement in many ways over the Thor core; stopped working on it in July 2019). (FPGA LUT budget 100k) (First use of compressed instruction sets.)

cs01 – <15k LUTs – 32-bit RISC-V-compatible core; thinking of educational uses.

nvio – ? LUTs – 80-bit datapath, 3-way superscalar; wanted 80-bit fp. (FPGA LUT budget 200k, experimenting with a larger budget.) (First design wider than 2-way superscalar.) Branch tags.

nvio3 – 500k+ LUTs – 128-bit datapath, 4-way superscalar (shelved as too big; what if I were to include everything…), 256-bit wide vector register set (first use of wide vector registers).

rtf65004 – 75k? LUTs – 64-bit datapath, micro-ops for the front-end, 2-way; wanted to see if micro-ops could offer performance improvements, and a learning curve for the micro-op approach. Uses a status register for compares and branching (first time I used a status register in a superscalar design). Perceptron branch predictor.

Gambit (g-core) – 100k LUTs – 52-bit datapath, 2-way, improved micro-op engine (later removed). Improved timing over earlier cores; 50%? faster. Separate register files for link, compare results, fp, and integer. Goldschmidt divider, with results caching for reciprocal and reciprocal square root.


Most of the cores inherit work from earlier ones. I’ve made the cores more modular over time, reducing dependence on the workings of any particular core. Much of the “engine” can now simply be copied from one project to the next, even though the programming model is different. It’s been a progression of learning for me. There is also an aspect of project obfuscation at play.

I’d like to take another crack at issue logic having seen some schedulers, so there may be yet another core in the future (it would be an attempt to make the scheduling smaller and more efficient).



Fri Jan 10, 2020 3:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Had to put an issue blocker in to prevent back-to-back issues in consecutive clock cycles on the same memory channel. It takes a clock cycle for the ram state machine to update to busy, and during this clock cycle another access on the same memory port must be prevented. Found during simulation that the core was skipping over updates to the LED.
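The blocker logic is roughly this (a toy Python model of the idea as I understand it, not the actual RTL; the names are invented):

```python
# Toy model of the issue blocker: the ram state machine only shows
# "busy" one cycle after an access is issued, so a second issue on the
# same channel must be blocked for that cycle even though the channel
# still reads as idle.
class MemChannel:
    def __init__(self):
        self.busy = False
        self.issued_last_cycle = False

    def can_issue(self):
        # Block when busy OR when an access went out last cycle and
        # the busy flag has not caught up yet.
        return not self.busy and not self.issued_last_cycle

    def issue(self):
        assert self.can_issue()
        self.issued_last_cycle = True

    def tick(self, access_done=False):
        # One clock later the state machine catches up.
        if self.issued_last_cycle:
            self.busy = True
            self.issued_last_cycle = False
        if access_done:
            self.busy = False

ch = MemChannel()
ch.issue()
print(ch.can_issue())   # False: blocked even though busy is still False
```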

The core gets a couple of instructions further in the FPGA now. It’s a bit mysterious why, as I didn’t modify anything except debugging signals.

When floating-point was added it hardly changed the size of the core at all. I guess the floating-point (<3,000 LUTs?) is small relative to the rest of the core, so I’ve decided to add more floating-point operations than originally planned.
Added reciprocal and reciprocal square root estimate functions that can return a result within a single clock cycle if the value was previously calculated and cached; otherwise the result is returned in five clock cycles. The result caches are small 64-entry direct-mapped caches. Careful placement of reciprocal functions in code will make the best use of the caches.
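A behavioral sketch of the result cache (Python; the post only specifies 64 entries and direct mapping, the class and field names are mine):

```python
# Sketch of a 64-entry direct-mapped result cache for the reciprocal
# and reciprocal-square-root estimates: a hit returns in one cycle,
# a miss recomputes in five and fills the cache.
class ResultCache:
    def __init__(self, entries=64):
        self.entries = entries
        self.tags = [None] * entries
        self.values = [0.0] * entries

    def lookup(self, operand_bits):
        idx = operand_bits % self.entries    # direct-mapped index
        if self.tags[idx] == operand_bits:   # tag match on the full operand
            return self.values[idx]          # hit: single-cycle result
        return None                          # miss: recompute (five cycles)

    def fill(self, operand_bits, result):
        idx = operand_bits % self.entries
        self.tags[idx] = operand_bits
        self.values[idx] = result
```

Two operands whose low bits collide map to the same entry and evict each other, which is why, as noted above, careful placement of reciprocal ops determines how well the cache works.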

Latest hang: the exception miss signal was never cleared after being set. This led to a perpetual miss state.



Fri Jan 10, 2020 4:00 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Great list - thanks! It's useful to see these as a progression of experiments to learn or explore certain tactics.


Fri Jan 10, 2020 8:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Put the thinking cap on and added results caching for all fp operations, then removed the caching code specific to the reciprocal functions.

Slowed the clock down to 25 MHz to make it easier to meet timing. One benefit of a slower clock is that lower-latency floating-point ops can be used: a slower clock allows more work to be done per clock cycle, which means all the work required to support fp can be done in fewer clock cycles. For instance, the normalizer optimized for a high-frequency clock takes eight clock cycles; for a low-frequency clock this is reduced to two.

Moved the detection of prior queue conditions from the issue stage to the queue stage. It’s easier to detect prior conditions at queue time because, by definition, all current entries in the queue must be prior to the one being queued. For instance, the presence of a previous sync instruction in the queue must be known when an instruction is about to execute. Previously this detection was done during the issue stage, meaning there could be instructions both before and after the sync instruction. Moving the logic to the queue stage results in simpler logic, but more state needs to be recorded in the queue: for a given queue entry there are now state bits indicating a prior sync, fsync, memdb, memsb, or path-changing instruction.
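The scheme can be sketched like this (a Python toy model of the enqueue-time scan, not the actual queue logic; only the prior-sync bit is shown):

```python
# Toy model of enqueue-time "prior condition" detection: everything
# already in the queue is by definition prior to the entry being queued,
# so a single scan at enqueue replaces the ordering check at issue time.
def enqueue(queue, instr):
    entry = {
        "instr": instr,
        # Sticky bit: true if any earlier entry is (or itself saw) a sync.
        "prior_sync": any(e["instr"] == "sync" or e["prior_sync"]
                          for e in queue),
    }
    queue.append(entry)
    return entry

q = []
enqueue(q, "add")
enqueue(q, "sync")
print(enqueue(q, "load")["prior_sync"])   # True: a sync is queued ahead of it
```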

In the process of moving the prior-condition logic another bug was found in the mem issue logic, though it was unlikely to manifest: memdb/memsb were indexed as [n] when they should have been indexed as [heads[n]]. This would only affect cases where memory synchronization primitives were in use.



Sat Jan 11, 2020 4:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Put an option on the fpu to omit several instructions which makes the core smaller. Some of the less frequently used fpu instructions don’t need to be available on every fpu in the machine. These instructions include: FDIV, FSQRT, ITOF, FTOI and TRUNC. There are other instructions that aren’t frequently used, but they don’t add significantly to the size of the core.

Still haven't managed to get LEDs to light in the FPGA. For some reason the address to store to is formed incorrectly, and since the address is invalid the core hangs waiting for a response from the memory system. I was stymied for a while wondering why nothing displayed in the ILA, until I remembered that the clock supplied has to be faster than the cable clock.



Sun Jan 12, 2020 4:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Played with memory management today. Copied the pmmu from FT64 and modified it suitably. Set up a test bench to ensure that adding the pmmu wouldn’t affect operation, at least at the machine operating level. The pmmu automatically walks page tables to find an address translation. It has a 512-entry, eight-way associative TLB to store translations. It adds two clock cycles to every memory access, but I believe this is typical for an mmu.
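A behavioral sketch of the TLB-plus-walker arrangement (Python; only the 512-entry/eight-way shape comes from the post — the page size and replacement policy here are assumptions):

```python
# Behavioral sketch of a 512-entry, eight-way set-associative TLB in
# front of a page-table walker. PAGE_SHIFT (8 KiB pages) and the
# eviction policy are invented for the example.
PAGE_SHIFT = 13                   # assumed page size, 8 KiB
SETS = 64                         # 512 entries / 8 ways

class TLB:
    def __init__(self, page_table):
        self.page_table = page_table             # models the walked tables: vpn -> ppn
        self.sets = [dict() for _ in range(SETS)]

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        ways = self.sets[vpn % SETS]             # set selected by low vpn bits
        if vpn not in ways:                      # TLB miss: walk the page tables
            ways[vpn] = self.page_table[vpn]
            if len(ways) > 8:                    # crude eviction at 8 ways
                ways.pop(next(iter(ways)))
        return (ways[vpn] << PAGE_SHIFT) | offset
```

In the real core the walk and the TLB lookup each cost clock cycles; here they are just dictionary lookups.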

Finally got some output on LEDs in the FPGA, but I cheated a little bit: modified the code to use a different instruction sequence to build the LED I/O address. There is still something amiss, but it hasn’t been hit yet in simulation. It could be anything from argument forwarding to a pipeline bug of some sort.



Mon Jan 13, 2020 7:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The return stack buffer (RSB) is on the critical timing path now. It’s being fed from the fetch stage for updates, and using decoded fetch-stage signals is particularly bad for timing; a lot has to happen before the signals can be decoded. Moving the updates to the decode stage should have little impact on operation and improve the timing.

Worked mostly on the pmmu today. I've found the PTEs of contemporary cpus to be quite spartan. I note that in some operating systems multiple aliased page tables are set up so that additional information needed by the OS can be stored in the page table. For instance, the virtual address is often required in addition to the physical one. I've tried to include everything the OS might need in a single PTE. This makes the PTEs larger than typical, but they are also more functional. Page directory entries are half the size of page table entries, since they only serve to locate PTEs. The first half of a PTE is the same as a PDE.
Attachment: PTE.png (g-core PTEs)



Tue Jan 14, 2020 3:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Put some elbow grease into the sequence number reset. The sequence numbers need to be reset periodically, which was being triggered by an interrupt. This has been switched to an automatic process that doesn’t require an interrupt, though it acts a bit like one. Measuring the timing, it took about 25 µs to reset the sequence numbers, and they need to be reset every 300 ms. This is about 0.008% overhead. While the sequence numbers are being reset, the queue is fed NOP instructions and the program counter is frozen at its current value. I’m not sure why it took so long to reset the numbers; I measured something like 700 clock cycles. All the core has to do is execute about 60 NOPs. It shouldn’t take that long; it shouldn’t be any worse than a divide or square root.
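The overhead figure checks out, and 700 clocks at the 25 MHz clock mentioned earlier works out to about 28 µs, consistent with the ~25 µs measurement:

```python
# Sanity-check the numbers quoted above.
reset_time = 25e-6            # ~25 us to reset the sequence numbers
period = 300e-3               # reset needed every 300 ms
overhead = reset_time / period
print(f"{overhead:.3%}")      # 0.008%

cycles = 700                  # measured reset duration in clocks
print(cycles / 25e6)          # 2.8e-05 s, i.e. ~28 us at 25 MHz
```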

Tested the add, subtract, and multiply modules for the floating-point. It took some searching to find the test vector generator. Test vectors for the 52-bit floating-point routines were generated from 64-bit results: the 52-bit format is the same as the 64-bit format except that it has 12 fewer bits in the mantissa, so all that was done for test vectors was to dump 64-bit values with the low 12 mantissa bits truncated off. One gotcha found in testing was that structured variables didn’t work; an assignment to a structured variable type resulted in it having the value ‘X’, so I had to back out all the changes I recently made for the neat structured variables.
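The truncation trick can be sketched in Python (assuming IEEE-754 doubles; the helper names are mine):

```python
import struct

# Sketch of the test-vector trick: the 52-bit format is the 64-bit IEEE
# format with 12 fewer mantissa bits, so 64-bit reference values can be
# reused by truncating the low 12 mantissa bits.
def to_52bit_vector(x):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits >> 12             # drop the low 12 mantissa bits

def from_52bit_vector(bits52):
    return struct.unpack("<d", struct.pack("<Q", bits52 << 12))[0]

print(from_52bit_vector(to_52bit_vector(1.5)))  # 1.5 (exactly representable)
```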

Started work again on FT64, this time version 10. It’s going to be loosely based on the RISC-V architecture, but some things will be different, making it an innovative design.



Wed Jan 15, 2020 4:56 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I imagine that the usual way to deal with sequence numbers is to let them roll over. If you're really (really) careful about how you do your comparisons, you can make it so there's no abrupt transition, no ambiguity, and no danger of collision. The transputer documentation said something about this, I think, because in the case of timers (with, I think, signed values) you want to be able to say 'x is later than y' or 'time t has passed' without getting into knots about timer overflow. Instead of making 32 bit timers and hoping, you could have 16 bit timers and be confident. (If 16 is enough for your purposes.)

If your oldest object is no older than 127 ticks, it might be that an 8 bit ID is always enough, so long as you get the maths right.

I say this without having worked it through!
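To make the rollover-safe comparison concrete, here is the standard modular trick sketched in Python (assuming 8-bit unsigned sequence numbers, as in the example above):

```python
# The standard modular-arithmetic trick: subtract modulo 2^bits and
# test the "sign" bit of the difference. As long as two live values
# are never more than half the range apart (127 for 8 bits), the
# comparison is unambiguous even across a wrap.
def is_later(a, b, bits=8):
    mask = (1 << bits) - 1
    half = 1 << (bits - 1)
    return a != b and ((a - b) & mask) < half

print(is_later(2, 250))   # True: 2 is "later" than 250 across the wrap
```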


Wed Jan 15, 2020 8:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I tried to find a document on the web dealing with sequence number overflow but couldn’t find one.

I don’t believe there is a way to do it mathematically. Timer overflow can be handled by reading the timer twice (to check before and after), but timer overflow is different from sequence number overflow. Anyway, I think I have it working, but maybe not in the most efficient fashion.

Back onto the sequence numbering again, this time removing the reset code and replacing it with different code. Now a check is made to see whether the MSBs of all the sequence numbers are set; if they are, the MSBs are all cleared to zero. Since they are unsigned numbers the ordering is maintained. It’s a little trickier than it sounds because the sequence number is used in several places and the MSB has to be conditionally masked off in all of them. The sequence number size is set to one more bit than the number of bits needed to represent a queue entry number. Since small queues are in use the sequence number is small, currently six bits.
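The MSB-clearing step can be sketched like this (Python; the six-bit width is from the post, the rest is my reconstruction):

```python
# Sketch of the MSB-clearing reset: once every live sequence number has
# its MSB set, clearing all the MSBs preserves the unsigned ordering
# while pulling the values back down into range.
MSB = 1 << 5                  # bit 5 of a six-bit sequence number

def maybe_reset(seqnums):
    if all(n & MSB for n in seqnums):       # reduce across all the MSBs
        return [n & ~MSB for n in seqnums]  # just an 'and' gate per MSB
    return seqnums

nums = [33, 35, 40]           # all three have bit 5 set
print(maybe_reset(nums))      # [1, 3, 8]: same relative order
```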

The pipeline stalls if there are fewer than four empty entries available in the queue. The issue is that at the fetch stage it isn’t known how many instructions will queue two stages later, and it’s conceivable that the queue might fill up if everything in the pipeline queues. So a conservative estimate is made of the maximum number of instructions that might queue. The consequence is that the queue can’t really be smaller than about eight entries, and with a queue that small it stalls quite often.
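The stall rule reduces to a simple margin check (a sketch; the margin of four and the eight-entry queue come from the post, the function name is mine):

```python
# Sketch of the conservative stall rule: stall fetch when fewer than
# four queue slots are free, since up to four in-flight instructions
# might still queue two stages later.
def fetch_stalls(queue_size, used, margin=4):
    return (queue_size - used) < margin

print(fetch_stalls(8, 5))     # True: only 3 of 8 entries are free
```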



Fri Jan 17, 2020 5:08 am

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
robfinch wrote:
I don’t believe there is a way to do it mathematically. Timer overflow can be handled by reading the timer twice (to check before and after), but timer overflow is different than the sequence number overflow. Anyway, I think I got it working, but maybe not in the most efficient fashion.


I have absolutely no experience of CPU design other than what I've picked up in tinkering with my own project, so I have no clear understanding of precisely how you're using the sequence numbers (i.e. as an index into a buffer, or just to ensure things happen in the right order?), so this might be a silly idea, but...:
Would it be possible to avoid the overflow issues by using a relative rather than an absolute sequence number, effectively tracking each instruction's 'age', incrementing it as it passes through the pipeline?


Fri Jan 17, 2020 12:54 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
I have absolutely no experience of CPU design other than what I've picked up in tinkering with my own project, so I have no clear understanding of precisely how you're using the sequence numbers (i.e. as an index into a buffer, or just to ensure things happen in the right order?), so this might be a silly idea, but...:
Would it be possible to avoid the overflow issues by using a relative rather than an absolute sequence number, effectively tracking each instruction's 'age', incrementing it as it passes through the pipeline?
The biggest use for sequence numbers is to determine where instructions fall relative to a branch, because instructions after a branch sometimes need to be invalidated. Another use is determining whether there was a sync instruction prior to the current instruction. So it has to do with order.
I had a system similar to what you suggest set up, but wasn’t able to get it to work reliably. Numbers were assigned to instructions incrementally; then, as instructions were retired, all the assigned numbers were decremented to keep the relative order. For some reason the system would sometimes decrement a number and make it negative, which, since the numbers are unsigned, made it a larger number. This wreaked havoc on the ordering. Another issue with that system was that it required a decrementer for each number in the queue, more logic than necessary.
The current system just zeros out the top bit of all the numbers, so it is just an ‘and’ gate on the MSB. It also needs a wide ‘and’ reduction to determine when all of the top bits are 1. I had to run several simulations working out bugs before things worked correctly.



Sat Jan 18, 2020 4:54 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked mostly on documentation for a new project today. I'll have to find an older project and reuse the project name, calling it a different version.
I haven’t been having much luck getting g-core to run in an FPGA. It runs for a few instructions then dies on an invalid memory address. The invalid address appears to have been formed by instructions not executing properly; the last wyde of the address is ‘dea8’, which looks like part of ‘dead’, the value output by the alu for invalid instructions.
Anyway, I came up with the following for floating-point operations using immediate constants. This was inspired by the VAX’s 6-bit floating-point literal. If a VAX can do it with only six bits available, surely the same thing could be done with more bits available.
Attachment: FPImmediates.png (FP immediate format)
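To illustrate the general VAX-style idea (the actual g-core layout is in the attached figure, which isn’t reproduced here; the 4-bit exponent / 4-bit fraction split and the excess-7 bias below are invented purely for the example):

```python
# Hypothetical FP-immediate decoder showing the VAX-style idea of
# expanding a tiny exponent/fraction literal into a full-width float.
# Field widths and bias are invented; see the figure for the real format.
def decode_fp_imm(imm8):
    exp = (imm8 >> 4) & 0xF       # assumed 4-bit exponent, bias 7
    frac = imm8 & 0xF             # assumed 4-bit fraction, implied leading 1
    return (1.0 + frac / 16.0) * 2.0 ** (exp - 7)

print(decode_fp_imm(0x70))        # 1.0 (exponent 0, fraction 0)
```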



Sat Jan 18, 2020 5:06 am