 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Wrote a precision event timer core (PET) to replace the PIT core currently in use. It has many features of the HPET but lacks legacy support and FSB interrupt messaging. Registers are laid out differently. When using arrays of registers, I put the array first to simplify address decoding. That is followed by registers that are common or general purpose in nature.

The master counter and comparators are limited to 48 bits, as that is sufficient to time intervals up to a year in length given a 10 MHz reference clock. It also conserves hardware compared to using full 64-bit components. The number of bytes used for the timers is recorded in the capabilities register.
The PET component may have up to 32 timers; for the test system only 8 are implemented. This is also noted in the capabilities register.
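Roughly, the decode idea looks like the SystemVerilog sketch below. It is not the actual PET core; the offsets, field positions, and the 16-byte timer slot size are assumptions made for illustration, with the timer array placed at the bottom of the address space so selecting a timer is a simple bit slice.
Code:
// Hypothetical PET register decode sketch: the timer/comparator array sits at
// the bottom of the address space so array selection is a simple bit slice,
// followed by the common registers (capabilities, master counter, ...).
module pet_regdec #(
  parameter NTIMER = 8           // up to 32 timers supported in this sketch
)(
  input  logic [9:0]  adr_i,     // register offset within the PET block
  output logic        timer_sel, // access hits the timer array
  output logic [4:0]  timer_no,  // which timer (simple bit slice, no adder)
  output logic        cap_sel,   // capabilities register
  output logic        cnt_sel    // 48-bit master counter
);
  // Assumed layout: 32 timer slots of 16 bytes each occupy offsets 0..$1FF,
  // common registers start at offset $200.
  assign timer_sel = ~adr_i[9] && (adr_i[8:4] < NTIMER);
  assign timer_no  = adr_i[8:4];
  assign cap_sel   = adr_i[9:0] == 10'h200;
  assign cnt_sel   = adr_i[9:0] == 10'h210;
endmodule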


Running Thor2022 in post-synthesis simulation reveals it does the same thing as it does in real hardware. It jumps abruptly to address $0000C0. I should be able to track this down eventually as the simulation environment is easier to debug than the real FPGA.
I found one place where a signal was being assigned a ‘Z’ value instead of a zero. I am not sure why synthesis would do this. The signal is a seven-bit signal and is assigned a seven-bit value, but the top bit of the assigned value is zero. The tools have a tendency to propagate ‘Z’s into ‘X’s, and that messes things up.

_________________
Robert Finch http://www.finitron.ca


Sat Aug 13, 2022 3:02 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Both the ip and next_ip signals are transitioning at the same time to $0000C0, which should not be possible AFAIK. Next_ip is supposed to be generated from ip by adding the instruction length. After the clock edge next_ip is loaded into ip. I think the transfer to $0000C0 is due to an exception occurring.

Set up the stack pointer register to be multi-faceted. There is a separate stack pointer for each operating mode plus one for interrupts. That makes five stack pointers, all addressable via the same register code. A stack pointer selector register, architecturally invisible, was added. Some code was added at the decode stage to select the appropriate register for register code 31. The REX instruction, which allows redirecting exceptions to a lower level, has code to prevent the interrupt stack pointer from being switched away. If an interrupt routine is active and there is a redirect to a lower operating mode, most likely it is desired to keep the interrupt stack. Otherwise, the stack pointer would be set to the one corresponding to the operating mode.

Asynchronous interrupts due to external causes cause the interrupt stack pointer to be selected. Other exceptions which switch the operating mode to machine mode cause the machine mode stack pointer to be selected. From there a REX instruction to a lower operating mode may cause the stack pointer for that mode to be selected.
With the stack pointer automatically switched for interrupts it should be relatively painless to write an interrupt service routine. On entry a handful of registers can be pushed right away. The PUSH instruction may push up to four registers, which is plenty for many small interrupt routines.
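As a rough illustration of the stack pointer switching (not the actual Thor2022 decode), selecting one of the five stack pointers through register code 31 could look like the sketch below; the selector encoding and signal names are assumed.
Code:
// Hypothetical sketch of per-mode stack pointer selection at decode time.
// Register code 31 is redirected to one of five physical stack pointers
// according to an architecturally invisible selector register.
module sp_select(
  input  logic [2:0]  sp_sel,        // 0..3 = operating-mode SPs, 4 = interrupt SP
  input  logic [63:0] sp_q [0:4],    // the five physical stack pointers
  input  logic [4:0]  Ra,            // register code from the instruction
  input  logic [63:0] gpr_val,       // value read from the normal register file
  output logic [63:0] arg
);
  always_comb
    if (Ra == 5'd31)
      arg = sp_q[sp_sel];            // substitute the selected stack pointer
    else
      arg = gpr_val;                 // any other register code: use the GPR
endmodule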

_________________
Robert Finch http://www.finitron.ca


Sun Aug 14, 2022 5:58 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 665
Does your MMU switch as well as your stacks?


Sun Aug 14, 2022 8:40 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Quote:
Does your MMU switch as well as your stacks?
No. There is no explicit code to switch things in the MMU. The MMU / data cache will automatically process requests for data as they come along. The TLB has a locked way that could be used to hold pointers to the interrupt stack area to improve performance.

Got past the $0000C0 crash; I do not know how. I marked several signals for debug, then it started working in simulation, so I tried it in the FPGA, and it gets further along before crashing. The $0000C0 transfer was caused by an exception, I think. Dumping the exception cause register showed it had the value $00F1 in it. Nowhere in the code is $F1 loaded into the register. Now the core is not performing memory store operations or the RTS instruction.

Got rid of the 3-input AND, OR, and XOR operations and replaced them with REM, REMU, and REMSU operations. There are already short instruction forms for AND, OR, and XOR with two operands; the three operand forms would be largely redundant. The instruction set was missing opcodes for remainder operations so those were added.
The divide component was modified to cache operations, operands, and results. So, if the same divide operation as the cached one is performed, the cached result is returned in only four clock cycles. This is useful when both the divide and remainder of the same values are needed.
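A minimal sketch of the divide cache idea, assuming a single cache entry, 64-bit operands, and a signed/unsigned flag standing in for the operation code: remember the operands together with both quotient and remainder so a matching request can be answered without re-running the divider.
Code:
// Minimal single-entry divide cache sketch: remember the last (a, b, signed?)
// request together with its quotient and remainder, so a following DIV/REM of
// the same operands can be answered without re-running the divider.
module div_cache(
  input  logic        clk,
  input  logic        wr,            // divider finished: record the result
  input  logic [63:0] a, b,
  input  logic        signed_op,
  input  logic [63:0] quo_i, rem_i,
  output logic        hit,
  output logic [63:0] quo_o, rem_o
);
  logic        v;
  logic [63:0] ca, cb, cquo, crem;
  logic        csigned;

  initial v = 1'b0;

  always_ff @(posedge clk)
    if (wr) begin
      v       <= 1'b1;
      ca      <= a;
      cb      <= b;
      csigned <= signed_op;
      cquo    <= quo_i;
      crem    <= rem_i;
    end

  assign hit   = v && (ca == a) && (cb == b) && (csigned == signed_op);
  assign quo_o = cquo;
  assign rem_o = crem;
endmodule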

Got rid of the hexi-byte load and store operations and used the opcodes for vector load and store operations. Thor2022 is primarily a 64-bit machine; I went crazy a while ago and made Thor2022 a 128-bit machine, but for most cases that is not necessary, so I switched it back to 64-bit. 128-bit data can be processed in vector registers using pairs of elements.

Worked on the vector store operations. Vector stores store one element at a time to the memory queue to make it easy to compress the vector during storage. Storing to the queue only takes one clock cycle per element, while the physical store operation may take dozens of clock cycles to store to the DDRAM. Storing to the queue should not be a bottleneck.
Stores support a scatter operation in addition to a flat store via the use of a vector register as an index register.

Vector load gathering instructions work similarly to stores. They submit requests to the bus interface unit for each element. The BIU builds up the vector value and returns it in a single response. Ordinary vector loads need only submit a single request to the queue.
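A sketch of the per-element address generation implied above (not the Thor2022 BIU code): a flat access uses the base plus the element offset, while the scatter/gather form adds the corresponding element of an index vector register. Lane count and element size are assumptions.
Code:
// Hypothetical element address generation for vector stores/gathers.
// Flat access: base + element_number * element_size.
// Scatter/gather: base + index_vector[element] (index held in a vector reg).
module vec_agen #(
  parameter NLANES = 2,
  parameter ELSIZE = 8                      // bytes per element
)(
  input  logic        indexed,              // 1 = scatter/gather form
  input  logic [63:0] base,
  input  logic [63:0] vindex [0:NLANES-1],  // index vector register
  output logic [63:0] ea     [0:NLANES-1]   // effective address per element
);
  always_comb
    for (int n = 0; n < NLANES; n++)
      ea[n] = indexed ? base + vindex[n]
                      : base + n * ELSIZE;
endmodule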

_________________
Robert Finch http://www.finitron.ca


Mon Aug 15, 2022 4:53 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Duh, got rid of the opcodes for vector load and store. I must have been asleep when I assigned them. Any load or store instruction may be flagged as a vector operation already. It was just a matter of supporting them with code. Spent a bunch of time writing code to support the vector load and store operations.

Forgot to add code to suppress an RTS miss when the address is predicted correctly. This resulted in a double return causing a crash. Amazingly the return address predictor seems to work.
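The suppression amounts to comparing the predicted return address with the actual one and redirecting only on a mismatch. A simplified SystemVerilog sketch, with all signal names assumed:
Code:
// Sketch: treat an RTS as mispredicted only when the return-address-stack
// prediction differs from the actual target; a correctly predicted return
// must not trigger a second redirect (which caused the double-return crash).
module rts_check(
  input  logic        is_rts,
  input  logic        ret_valid,          // actual return address is known
  input  logic [63:0] predicted_ret_adr,  // from the return-address predictor
  input  logic [63:0] actual_ret_adr,     // from the link register / stack
  output logic        rts_redirect,
  output logic [63:0] rts_redirect_adr
);
  always_comb begin
    rts_redirect     = is_rts && ret_valid &&
                       (predicted_ret_adr != actual_ret_adr);
    rts_redirect_adr = actual_ret_adr;
  end
endmodule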

The core is running up to a couple of hundred instructions in the FPGA now. It is crashing because the SP is getting corrupted. Just working through all the bugs.

ATM pondering the stack alignment. It is currently 16-byte aligned, but it may be better to reduce it to 8 bytes for my small system to conserve RAM.

_________________
Robert Finch http://www.finitron.ca


Tue Aug 16, 2022 4:43 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
A completing load operation for a stomped-on instruction was setting the result and execute flag in the reorder buffer. This led to cases where the instruction was erroneously marked as executed. The issue is that a multi-cycle operation may have started speculatively, then subsequently got stomped on by a branch miss, but was still returning its result to a reorder buffer entry that may have been reused.
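One way to guard against this, sketched below with assumed names, is to tag each outstanding request with the reorder buffer entry's current tag and drop the response if the entry has since been stomped on or reallocated.
Code:
// Sketch: qualify a returning load result before writing it into the reorder
// buffer. The response carries the ROB index and the tag the entry had when
// the request was issued; if the entry was stomped or reused, discard it.
module rob_wb_guard #(
  parameter ROB = 16
)(
  input  logic                    resp_v,
  input  logic [$clog2(ROB)-1:0]  resp_ndx,
  input  logic [7:0]              resp_tag,            // tag captured at issue time
  input  logic [7:0]              rob_tag  [0:ROB-1],  // current tag of each entry
  input  logic                    rob_stomp[0:ROB-1],
  output logic                    wb_ok
);
  assign wb_ok = resp_v
              && (rob_tag[resp_ndx] == resp_tag)
              && !rob_stomp[resp_ndx];
endmodule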

Woes having to do with memory operations when the queue is full. If the queue is full the instruction needs to be executed again, with the core retrying until the queue has room.

The core is hanging ATM because it runs out of available buffers due to memory requests that are outstanding without responses. Somehow memory requests are being lost in the BIU.

_________________
Robert Finch http://www.finitron.ca


Wed Aug 17, 2022 5:54 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
The lost memory requests were due to the queue loading into the wrong slot if a read and write of the queue were occurring at the same time. 'Off by one'.
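The classic pitfall is computing the write slot from a head pointer or count that the simultaneous read is adjusting in the same cycle. A sketch of a queue with separate read and write pointers that handles a same-cycle push and pop, with depth and width assumed:
Code:
// Sketch of a memory queue that pushes and pops in the same cycle without the
// 'off by one' slot error: separate read/write pointers, count adjusted by
// the net of the two operations.
module memq #(
  parameter DEPTH = 8, parameter W = 64
)(
  input  logic         clk, rst,
  input  logic         push, pop,
  input  logic [W-1:0] din,
  output logic [W-1:0] dout,
  output logic         full, empty
);
  logic [W-1:0] q [0:DEPTH-1];
  logic [$clog2(DEPTH)-1:0] wp, rp;
  logic [$clog2(DEPTH):0]   cnt;

  always_ff @(posedge clk)
    if (rst) begin
      wp <= '0; rp <= '0; cnt <= '0;
    end
    else begin
      if (push && !full)  begin q[wp] <= din; wp <= wp + 1'b1; end
      if (pop  && !empty) rp <= rp + 1'b1;
      cnt <= cnt + (push && !full) - (pop && !empty);
    end

  assign dout  = q[rp];
  assign full  = cnt == DEPTH;
  assign empty = cnt == 0;
endmodule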

Added load and store history buffers to make debugging easier. Also added a record of the instruction pointer to memory requests and responses. All of this just for debugging in simulation.
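A load/store history buffer can be as simple as the circular buffer sketched below, recording the instruction pointer, address, and data of each memory operation; the field set and depth are assumptions.
Code:
// Sketch of a load/store history buffer for simulation debug: a small
// circular buffer recording the instruction pointer, address and data of each
// memory operation as it issues.
module ls_history #(
  parameter DEPTH = 64
)(
  input  logic        clk,
  input  logic        memop_v,     // a load or store issued this cycle
  input  logic        is_store,
  input  logic [63:0] ip, adr, dat
);
  typedef struct packed {
    logic        st;
    logic [63:0] ip, adr, dat;
  } hist_t;

  hist_t buffer [0:DEPTH-1];
  logic [$clog2(DEPTH)-1:0] wp = '0;

  always_ff @(posedge clk)
    if (memop_v) begin
      buffer[wp] <= '{st: is_store, ip: ip, adr: adr, dat: dat};
      wp <= wp + 1'b1;
    end
endmodule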

The core currently crashes because of a stale value for the link register.

Issues with fetching correct register values today. There is a valid (busy) bit for the register file which indicates when the register is valid. There is also register source tracking. It is a bit of a mess because I have been experimenting with pipelining these bits to get a higher fmax. Something is broken now, and a sanity check is failing. ‘arg missing’ is the message, which indicates the instruction is the oldest instruction in the queue yet does not have all its arguments, something that should be impossible.

_________________
Robert Finch http://www.finitron.ca


Thu Aug 18, 2022 4:16 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Busted the core pretty good after getting it almost working. Decided to re-work the front-end to use single-entry buffers for instruction fetch, decompress, and decode. The buffers form an in-order pipeline that feeds the reorder buffer. Previously these stages were being placed in the reorder buffer. Processing them that way is inefficient, requiring more multiplexors and decoders. With the front-end pulled out of the reorder buffer there are more entries available for out-of-order execution. Hopefully the core will also be smaller.
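A single-entry stage buffer of the kind described might look like the sketch below, using an assumed valid/ready handshake; chaining three of them gives the in-order fetch, decompress, and decode pipeline feeding the reorder buffer.
Code:
// Sketch of one single-entry pipeline buffer with valid/ready handshaking.
// Chaining instances gives an in-order fetch -> decompress -> decode front
// end that feeds the reorder buffer.
module stage_buf #(
  parameter W = 128                 // width of the instruction bundle
)(
  input  logic         clk, rst,
  input  logic         in_v,
  output logic         in_rdy,
  input  logic [W-1:0] in_d,
  output logic         out_v,
  input  logic         out_rdy,
  output logic [W-1:0] out_d
);
  always_ff @(posedge clk)
    if (rst)
      out_v <= 1'b0;
    else if (in_rdy && in_v) begin
      out_v <= 1'b1;
      out_d <= in_d;
    end
    else if (out_rdy)
      out_v <= 1'b0;

  // Can accept a new item when empty or when the current one is leaving.
  assign in_rdy = !out_v || out_rdy;
endmodule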

Funny bug: the jump instructions were storing the return value in the loop count register for jumps that do not store a return address.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 19, 2022 4:40 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Started to clean up the instruction decoder. It had grown to about 1000 LOC with some dead code. Moved a few of the decodes out to smaller modules. The synthesizer performs much better when dealing with smaller modules.

Got rid of the dedicated vector mask instructions and instead made the R3 register format instructions accept six-bit register codes. The R3 format had unused opcode bits available. The mask registers are now available as register codes 32 to 39. The R2 format supports only five-bit register codes so the normal GPRs are available to R2 format instructions.
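The read-port mapping implied above, sketched with an assumed port structure: a six-bit register code from an R3-format instruction selects a GPR for codes 0 to 31 and a vector mask register for codes 32 to 39.
Code:
// Sketch: a six-bit register specifier from an R3-format instruction selects
// either a GPR (codes 0..31) or one of eight vector mask registers (32..39).
module regread6(
  input  logic [5:0]  Rn,
  input  logic [63:0] gpr [0:31],
  input  logic [63:0] vm  [0:7],
  output logic [63:0] val
);
  always_comb
    if (!Rn[5])
      val = gpr[Rn[4:0]];
    else
      val = vm[Rn[2:0]];       // codes 32..39 map onto mask regs 0..7
endmodule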

The core was skipping micro-code instructions when there was a cache miss or no open buffer to store an instruction in.

_________________
Robert Finch http://www.finitron.ca


Sat Aug 20, 2022 11:41 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Many fixes later…. The core runs for about 1,000 instructions then craps out after a subroutine return. Not sure what the issue is yet. It is returning to the correct address but then the instruction pointer gets off track somehow. It also missed updating the LEDs. The core can be seen submitting a request to memory to update the LEDs and getting a response back, but there does not seem to be any bus activity to update the LEDs.

_________________
Robert Finch http://www.finitron.ca


Sun Aug 21, 2022 4:13 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Added several 64-bit opcodes. These support a 30-bit immediate and six-bit register spec fields. The six-bit fields allow access to specially designated registers in the register file.

Issues with memory operations not being performed in order. I tried to manage this using sequence numbers but hit upon using an ordering buffer instead. An ordering buffer is simpler: operating as a FIFO, it shifts by one entry and then stores the queue index of the most recently decoded memory operation in the lowest slot. The order buffer is searched from oldest to newest entry for valid memory operations. Memory ordering issues appear to be fixed.
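A rough sketch of the ordering buffer as described: a shift FIFO of queue indices with the newest entry in the lowest slot, searched from the oldest end when picking the next memory operation. Entry count and widths are assumed, and removal of completed entries is omitted for brevity.
Code:
// Sketch of the memory-ordering buffer: on each decoded memory op the buffer
// shifts up by one and records the queue (ROB) index in the lowest slot, so
// the highest valid slot is always the oldest outstanding memory operation.
module mem_order_buf #(
  parameter ENTRIES = 8, parameter NDX = 4
)(
  input  logic           clk, rst,
  input  logic           dec_mem_v,   // a memory op was decoded this cycle
  input  logic [NDX-1:0] dec_ndx,     // its queue index
  output logic [NDX-1:0] oldest_ndx,
  output logic           oldest_v
);
  logic [NDX-1:0] ndx [0:ENTRIES-1];
  logic           v   [0:ENTRIES-1];

  always_ff @(posedge clk)
    if (rst)
      for (int i = 0; i < ENTRIES; i++) v[i] <= 1'b0;
    else if (dec_mem_v) begin
      for (int i = ENTRIES-1; i > 0; i--) begin
        ndx[i] <= ndx[i-1];
        v[i]   <= v[i-1];
      end
      ndx[0] <= dec_ndx;
      v[0]   <= 1'b1;
    end

  // Search from the oldest (highest) slot downwards.
  always_comb begin
    oldest_v   = 1'b0;
    oldest_ndx = '0;
    for (int i = ENTRIES-1; i >= 0; i--)
      if (v[i] && !oldest_v) begin
        oldest_v   = 1'b1;
        oldest_ndx = ndx[i];
      end
  end
endmodule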

The data cache was not updating odd lines properly. This led to missing data for some fetches from the cache. The core works a little bit better all the time.

The lack of an LED display update is due to the TLB not being updated properly, which appears to be a software issue. The LED display address is then left unmapped and causes a TLB miss, which is currently ignored.

_________________
Robert Finch http://www.finitron.ca


Mon Aug 22, 2022 6:18 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Reduced the size of the cause code from 16 to 12 bits. There is not really a need for a larger cause code, and it was wasting hardware.

Stuck at the moment on a simulation issue which is probably a race condition of some sort. Simulation is adding the instruction length onto the IP one cycle too soon in one case after running for thousands of cycles. With the IP messed up the program crashes.

With just two vector lanes and only partially implemented, the core is about 95,000 LUTs. This design is really too big; a full multi-core design could easily be 10 million LUTs.

_________________
Robert Finch http://www.finitron.ca


Tue Aug 23, 2022 4:58 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
The Thor2022 scheduler component is driving me crazy. When included in the top module it is now reporting as being over 100,000 LUTs in size, totally ridiculous and blowing the LUT budget. If I synthesize the module by itself, it reports as being 51 LUTs, which I think is the proper size. So, I am experimenting to try to find out where the difference comes from.

Maybe time to re-install the software.

_________________
Robert Finch http://www.finitron.ca


Thu Aug 25, 2022 3:56 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 665
Can you have different types of optimization for different modules?
It sounds like it is duplicating part of your logic blocks for speed, but what do I know, *yawn*, after midnight.
Ben.


Thu Aug 25, 2022 6:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2205
Location: Canada
Quote:
Can you have different types of optimization for different modules?
It sounds like it is duplicating part of your logic blocks for speed, but what do I know, *yawn*, after midnight.
Ben.
It would be a ridiculous amount of duplication. 100,000 LUTs instead of 50?

Shelved Thor2022 for now in favour of yet another project - rfPhoenix. Maybe I will be able to spot what is going on if I leave it for a while. I have been staring at it a lot lately.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 26, 2022 4:03 am WWW