 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Did a lot of work on the Thor2024 specifications document. Changed from 16-bit instruction parcels to 32-bit parcels. Also changed the alignment of instructions to byte alignment from 16-bit alignment.

Stuck on Thor2023 in simulation. A data bus is not being loaded properly. The load is delayed by several cycles with ‘X’s prior to the load. There are no ‘X’s as inputs AFAIK. I do not know what is causing the delay. But it causes the wrong data to be output during a store operation.

_________________
Robert Finch http://www.finitron.ca


Tue May 16, 2023 6:28 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on the design of Thor2024 some more. Up to 300 pages of specs now, with more yet to be included. Just added the float exception trigger, enable, disable, and clear instructions. These fit in with the generic IRQ generating instruction.

Branch instructions may be either 32 or 64 bits in size. 32-bit branches only support comparing two registers and branching to a 12-bit target displacement. 64-bit branches add the option of storing a return address in a link register, branching to an address in a target register, and 40-bit branch target displacements. The larger displacement may be handy for randomizing the address of code in a large virtual address space.

Also supported is a three-way branch, BGL, for less than, greater than, or equal. The three-way branch has two 20-bit displacement fields for the less-than and greater-than targets. If the operands are equal, execution continues with the next instruction.
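
As an illustration of the three-way branch described above, here is a minimal behavioural sketch in Python. It is not taken from the Thor2024 spec: the function names, the assumption that both 20-bit displacements are signed byte offsets relative to the branch's own address, and the assumption that BGL occupies 64 bits (8 bytes) are all hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def bgl_next_pc(pc, a, b, disp_lt, disp_gt, insn_bytes=8):
    """Return the next PC for a three-way branch comparing a with b.

    Assumptions (not from the source): displacements are signed, in bytes,
    and relative to the address of the branch; the branch is 8 bytes long.
    """
    if a < b:
        return pc + sign_extend(disp_lt, 20)   # less-than target
    if a > b:
        return pc + sign_extend(disp_gt, 20)   # greater-than target
    return pc + insn_bytes                     # equal: fall through
```

Only two targets need encoding because the equal case simply falls through to the next instruction.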

While 32-bit instruction parcels are in use, code may be byte aligned. The jump and branch instructions support a byte-aligned target.

_________________
Robert Finch http://www.finitron.ca


Wed May 17, 2023 5:31 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Shelving Thor again as it requires a larger FPGA to do it justice. I may be able to obtain a larger FPGA. A "free" toolset for larger FPGAs was pointed out on GitHub: https://github.com/openXC7

_________________
Robert Finch http://www.finitron.ca


Tue May 23, 2023 4:26 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided I was crazy to shelve Thor and start yet another project once I started looking at developing software for rfx32. Instead, a trimmed-down version of Thor is being implemented, based on copying most of the rfx32 code, but using the latest Thor ISA. Timing for Thor is somewhat slower since 64 bits are being used instead of 32. The tools indicate the max clock rate is about 45.5 MHz, so the system is being built to run from a 40 MHz clock. According to the tools the path through the divider is the longest one, which strikes me as a bit strange since the divider is a simple sequentially clocked radix-2 divider. I would have expected the 64-bit multiplier to be the slowest path. I suppose I could try breaking up the divide into finer stages, but it would then take more clock cycles. I also suspect it will be challenging to get a 64-bit machine working beyond 50 MHz in the FPGA.
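
For readers unfamiliar with the term, a sequential radix-2 divider produces one quotient bit per step. The sketch below is a generic restoring-division model in Python, not the actual Thor RTL; the function name and the returned step count are illustrative only. It does show that every step contains a full-width compare and subtract.

```python
def radix2_divide(dividend, divisor, width=64):
    """Unsigned restoring division, one quotient bit per iteration.

    Returns (quotient, remainder, steps). Each loop iteration stands in for
    one clocked step of a simple sequential divider.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    quotient = 0
    remainder = 0
    steps = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Full-width compare and subtract in every step.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        steps += 1
    return quotient, remainder, steps
```

Splitting the compare and the subtract into separate states shortens the per-step path at the cost of more clocks per divide.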

_________________
Robert Finch http://www.finitron.ca


Thu May 25, 2023 3:44 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Managed to edge past 50 MHz timing by pipelining the multiplier so it now takes four clock cycles, and adding states to the divider, which now takes about 140 clocks. The trick will be to maintain the timing as the core is improved.

_________________
Robert Finch http://www.finitron.ca


Thu May 25, 2023 7:56 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Does the multiplier match memory timing well for indexing like foo[a,b]?


Fri May 26, 2023 3:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Does the multiplier match memory timing well for indexing like foo[a,b]?
It would be better if the multiply were faster, as the memory reference cannot complete until after the multiply is done. If possible, the compiler will convert multiplies into shifts, which are single cycle. Multiply should take about the same length of time as a cache access.
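
To make the multiply-to-shift conversion concrete, here is a small Python sketch of address computation for foo[a,b]. The helper names and the row-major layout are assumptions for illustration; it is not the compiler's actual code.

```python
def scale(value, factor):
    """Scale value by factor, using a shift when factor is a power of two.

    Returns (result, operation) so the caller can see which path was taken.
    """
    if factor > 0 and factor & (factor - 1) == 0:
        return value << (factor.bit_length() - 1), "shift"
    return value * factor, "multiply"


def element_address(base, a, b, cols, elem_size):
    """Address of foo[a,b] for a row-major array (layout is an assumption)."""
    row_offset, row_op = scale(a, cols)
    byte_offset, elem_op = scale(row_offset + b, elem_size)
    return base + byte_offset, (row_op, elem_op)
```

With 16 columns of 8-byte elements both scalings become shifts; with 10 columns the row scaling needs a real multiply, so the memory reference waits on the multiplier.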

_________________
Robert Finch http://www.finitron.ca


Fri May 26, 2023 4:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Moved the inline ALU code out to its own module. The code was duplicated in the top module; now there are two instances of the same module. This should make it easier to manage in the future.

Made some of the code more generic in nature, accepting a parameter for the number of queue entries.

Added predicated execution of instructions where applicable. Decided to not support predicated instruction execution for flow control operations. Predication has use for vector operations and for very short sequences of instructions; otherwise, it is better to branch. It is possible to branch around flow control instructions based on the value of a predicate register if needed. Predicating branches would cost too many branch displacement bits.

_________________
Robert Finch http://www.finitron.ca


Fri May 26, 2023 4:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added the ATOM modifier, which has somewhat dubious operation due to the need to apply the mask immediately at the fetch stage. ATOM is automatic interrupt control over a range of instructions. The ATOM modifier sets the minimum interrupt level for the next eight instructions. The interrupt level can be set separately for each instruction. The master interrupt mask still applies. For instance, setting the interrupt level to ‘7’ for instructions will ensure that only non-maskable interrupts are recognized. However, due to the current implementation the first instruction after the ATOM always has interrupts masked to level 7.
A bitmask from the ATOM instruction is stored in a buffer which shifts as instructions are queued. Postfix instructions do not count as instructions. Since it would be possible to disable interrupts for an extended period of time if a long sequence of postfix instructions were coded, an exception will occur if more than four postfix instructions in a row are used.
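
The shifting mask buffer can be modelled roughly as follows. This Python sketch rests on several assumptions that are not in the post: a 3-bit level per slot, the first instruction's level in the low bits, and a postfix inheriting the level of the instruction it follows. The class and method names are hypothetical, and the implementation quirk of always masking the first instruction to level 7 is not modelled.

```python
class PostfixLimitError(Exception):
    """Raised when more than four postfix instructions appear in a row."""


class AtomBuffer:
    LEVEL_BITS = 3        # assumption: 3 bits per instruction (levels 0..7)
    SLOTS = 8             # ATOM covers the next eight instructions
    MAX_POSTFIX_RUN = 4   # a fifth postfix in a row raises an exception

    def __init__(self):
        self.mask = 0
        self.current_level = 0
        self.postfix_run = 0

    def load(self, levels):
        """Load the per-instruction levels, first instruction first."""
        if len(levels) != self.SLOTS:
            raise ValueError("ATOM needs exactly eight levels")
        self.mask = 0
        for slot, level in enumerate(levels):
            self.mask |= (level & 0x7) << (slot * self.LEVEL_BITS)

    def queue(self, is_postfix=False):
        """Return the minimum interrupt level for the instruction queued."""
        if is_postfix:
            # Postfixes do not consume a slot, so the buffer does not shift.
            self.postfix_run += 1
            if self.postfix_run > self.MAX_POSTFIX_RUN:
                raise PostfixLimitError("too many postfix instructions")
            return self.current_level
        self.postfix_run = 0
        self.current_level = self.mask & 0x7
        self.mask >>= self.LEVEL_BITS   # shift as the instruction is queued
        return self.current_level
```

Once the eight slots have shifted out the mask is zero, so later instructions see level 0 and only the master interrupt mask applies.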

Thinking about getting rid of the PRED modifier. It sounds simple on paper, but implementing it is challenging and most instructions already have a predicate register spec field. Unlike the ATOM modifier, it is critical that it be applied to the correct instructions. For instance, if the ATOM modifier is off by an instruction then interrupts may be disabled for an extra clock cycle. This is probably non-critical. If the PRED modifier is off by an instruction then an instruction may be executed or elided that should not be. It may be necessary to surround the instructions covered by a PRED with NOP ramps. The PRED modifier is detected at instruction queue time after decode, but it affects which predicate register is read for the instruction in the instruction fetch stage.

Used register code #63 to specify that a postfix immediate is used for the operand instead of a register value. So, there are now only 63 general purpose registers. Even with 63 registers the register file is looking cramped. Some of the argument registers are shared with predicate registers.

Squeezed a rounding mode field into FP instructions. More bits than needed were being used to specify the FP function. So, some were traded off to allow a rounding mode field.

_________________
Robert Finch http://www.finitron.ca


Sat May 27, 2023 2:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Spent part of today coding hardware table walkers for both hierarchical and hash page tables. The hash page table is really fast as it is made from block RAM rather than going out to main memory. Set the page size at 64kB so the block RAM usage can be limited to about 1/6 the FPGA memory. The hash table uses wide memory to allow searching eight entries in parallel. It is also simple enough that it is clocked at double the CPU clock rate. The hierarchical table walker is a little more complex. It acts as a bus master, triggered by a TLB miss.
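
A rough software model of the hash page table lookup, for illustration only. The 64kB page size and the eight-entry search come from the post; the hash function, the bucket count, the entry format, and all names are hypothetical. The loop over eight ways stands in for what the hardware does in parallel with wide memory.

```python
PAGE_SHIFT = 16      # 64kB pages, as in the post
BUCKET_WAYS = 8      # eight entries searched per lookup


class HashPageTable:
    def __init__(self, num_buckets=1024):
        # num_buckets is an arbitrary illustrative size.
        self.num_buckets = num_buckets
        self.buckets = [[None] * BUCKET_WAYS for _ in range(num_buckets)]

    def _index(self, vpn):
        # Hypothetical hash: fold upper page-number bits onto the lower ones.
        return (vpn ^ (vpn >> 10)) % self.num_buckets

    def insert(self, vaddr, ppn):
        """Map the page containing vaddr to physical page ppn."""
        vpn = vaddr >> PAGE_SHIFT
        bucket = self.buckets[self._index(vpn)]
        for way, entry in enumerate(bucket):
            if entry is None or entry[0] == vpn:
                bucket[way] = (vpn, ppn)
                return True
        return False  # bucket full; a real design needs an overflow policy

    def translate(self, vaddr):
        """Return the physical address, or None on a miss."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        for entry in self.buckets[self._index(vpn)]:
            if entry is not None and entry[0] == vpn:
                return (entry[1] << PAGE_SHIFT) | offset
        return None
```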

_________________
Robert Finch http://www.finitron.ca


Sun May 28, 2023 9:51 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Milestone: Executed the first instruction for Thor2024 in simulation today. Just a NOP.
Milestone: got LED output in simulation.

First pass at assembler written. Using Fibonacci again to test.

Had to put code in to back up the PC by two instructions if there was a cache miss. Since the core is pipelined, it increments the PC by two instructions before it knows there was a cache miss. I should maybe try registering the miss address rather than using a subtractor. I wonder which has better timing?
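
The two options can be compared behaviourally with the Python sketch below. It says nothing about which has better timing in the FPGA; it only shows that both recover the same miss address for sequential fetch. The names, the pipeline depth of two, and the fixed instruction size used in the test are assumptions.

```python
def miss_pc_by_subtract(pc_now, insn_bytes, depth=2):
    """Option 1: back the PC up with a subtractor."""
    return pc_now - depth * insn_bytes


class MissAddressRegister:
    """Option 2: carry the fetch PC down the pipeline in registers."""

    def __init__(self, depth=2):
        self.history = [None] * depth   # one register per pipeline stage

    def clock(self, fetch_pc):
        """Advance one cycle; return the PC that was fetched `depth` cycles ago."""
        oldest = self.history[0]
        self.history = self.history[1:] + [fetch_pc]
        return oldest
```

In this model the registered version does not depend on the PC having advanced by a fixed amount, since it simply replays the address that was actually fetched.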

_________________
Robert Finch http://www.finitron.ca


Mon May 29, 2023 3:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Dealing with a complicated pipelining issue today. At the fetch stage instructions are copied to a fetch buffer. Copying the instructions to the buffer occurs a clock cycle after the cache is accessed. Accessing the cache takes a clock cycle. The program counter is one ahead of the fetch copy. So, when there is a miss there may still be valid data in the pipeline that needs to be copied to the fetch buffer. For a miss, the PC has already incremented to the next address, so the PC needs to be backed up to the miss address. It starts to get complicated when the fetch buffer cannot be loaded yet because the previous instructions have not queued. The data has to be held in the pipeline until the fetch buffer is ready to be loaded. Add to that a branch miss occurring at the same time and it seems to turn into a real mess.
I have not hit the right combination of logic yet. Either the same instructions are queued multiple times, or instructions are skipped over and not queued.

I think I have this solved now, except that instructions in the branch shadow are being executed when they should not be.

The core is running in sim until it hits the first branch now.

_________________
Robert Finch http://www.finitron.ca


Tue May 30, 2023 7:36 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the branch shadow execution of instructions fixed. Relatively easy, by tracking the last instruction or two that was queued at branch time and stomping on them if there is a branch.
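
A toy Python model of the stomping idea, assuming a shadow of at most two instructions queued after the branch. The class and field names are hypothetical and the real core tracks this in hardware rather than by searching the queue.

```python
class InstructionQueue:
    def __init__(self):
        self.entries = []   # oldest first

    def enqueue(self, pc):
        self.entries.append({"pc": pc, "stomped": False})

    def branch_taken(self, branch_pc, shadow=2):
        """Stomp on up to `shadow` entries queued after the taken branch."""
        # Find the most recently queued copy of the branch.
        for idx in range(len(self.entries) - 1, -1, -1):
            if self.entries[idx]["pc"] == branch_pc:
                break
        else:
            raise ValueError("branch not in queue")
        for entry in self.entries[idx + 1 : idx + 1 + shadow]:
            entry["stomped"] = True

    def executable(self):
        """PCs of the entries that are still allowed to execute."""
        return [e["pc"] for e in self.entries if not e["stomped"]]
```

For example, after queuing 37h, 3Ch and 41h and then taking the branch at 37h, only 37h and the instructions queued from the branch target remain executable.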
Issues executing the first loop to store to the screen. Sometimes the same store is happening twice and other times a store operation is dropped. Not sure if this is just how instructions are executing in general or if it’s the store operation.
Code:
02:000000000000001E 9303000000             23:    mov t3,r0
02:0000000000000023 0403000002             24:    ldi t2,16384
                                           25: .st1:
02:0000000000000028 57023800007C0000       26:    sto t0,txtscreen[r0+t3]
02:0000000000000030 00FD
02:0000000000000032 84E3400000             27:    add t3,t3,8
02:0000000000000037 28E830F8FF             28:    blt t3,t2,.st1
                                           29:
02:000000000000003C 9303000000             30:    mov t3,r0
02:0000000000000041 0403400100             31:    ldi t2,40


Instructions at 3Ch and 41h fetched after the branch due to pipelining are being correctly stomped on and not executed. The branch does loop backwards and the register in the register file can be seen incrementing by eight due to the add instruction.
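
One way to pin down duplicated or dropped stores is to compare a simulation trace against a reference model of the loop. The Python sketch below mirrors the listing above (lines 23 to 28); the trace format, a plain list of store offsets, and the function names are assumptions.

```python
from collections import Counter


def expected_stores(limit=16384, step=8):
    """Offsets the .st1 loop should store to, in order."""
    offsets = []
    t3 = 0                      # mov t3,r0
    t2 = limit                  # ldi t2,16384
    while True:
        offsets.append(t3)      # sto t0,txtscreen[r0+t3]
        t3 += step              # add t3,t3,8
        if not t3 < t2:         # blt t3,t2,.st1
            break
    return offsets


def check_store_trace(trace, expected):
    """Return (duplicated, missing) offsets for an observed store trace."""
    seen = Counter(trace)
    duplicated = sorted(o for o, n in seen.items() if n > 1)
    missing = sorted(o for o in expected if o not in seen)
    return duplicated, missing
```

With the values in the listing the loop should perform 2048 stores, the last one at offset 16376.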

_________________
Robert Finch http://www.finitron.ca


Wed May 31, 2023 12:49 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Playing with CORDIC, which is used to calculate sine and cosine. Cannot quite get it to work. It looks like the inverse gain calculation may be wrong. If an angle of 0 degrees is input, the sine output is zero, which is correct, but the cos output is nuts. For various angles the ratio of sine to cos looks close to correct, but the values are nuts.
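
For reference, a floating-point Python model of rotation-mode CORDIC is sketched below. It is not the hardware implementation, which presumably works in fixed point; the function name and iteration count are arbitrary. The ratio of sine to cosine does not depend on the gain, which is consistent with the ratio looking right while the magnitudes are off.

```python
import math


def cordic_sin_cos(angle, iterations=32):
    """Rotation-mode CORDIC; angle in radians, within about [-pi/2, pi/2].

    Returns (sin, cos). Floating point is used for clarity only.
    """
    # Inverse gain: product of 1/sqrt(1 + 2^(-2i)) over all iterations.
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Start with x preloaded with the inverse gain so no final scaling is needed.
    x, y, z = k, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return y, x
```

The inverse gain converges to roughly 0.60725, so an input angle of 0 should give a cosine very close to 1.0 once x is preloaded with that constant.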

_________________
Robert Finch http://www.finitron.ca


Fri Jun 02, 2023 9:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got it working now. Next, to make an FP sin / cos module.

_________________
Robert Finch http://www.finitron.ca


Fri Jun 02, 2023 11:52 am