


 Thor Core / FT64 

Joined: Tue Jan 15, 2013 10:11 am
Posts: 114
Location: Norway/Japan
MichaelM wrote:
Rob:

I'm trying to follow along, but for the life of me, I can't translate FPP. What does the acronym refer to in your posts above?
That would be a preprocessor, like CPP, I guess. Not sure what the F means... not Fortran, for sure! :)


Thu Oct 18, 2018 6:48 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
MichaelM wrote:
Rob:

I'm trying to follow along, but for the life of me, I can't translate FPP. What does the acronym refer to in your posts above?


Finch's Pre Processor, perhaps? See
viewtopic.php?p=2769#p2769


Thu Oct 18, 2018 11:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I've forgotten what the 'F' initially stood for. I wanted to distinguish it from the 'C' pre-processor because I wasn't sure it was 100% compatible. I just call it 'Finch's' myself.


I started version seven of the core. I decided to fix the position of the target register field, which means all the bits in the instructions are shuffled around, so the assembler needed to be updated. The compiler was updated too, to remove the increment-and-branch instruction code. Two instructions, increment-and-branch and decrement-and-branch, have been dropped because there isn't a good way to encode them. The motivation for the change is the hope that a slightly simpler instruction organization will reduce the 'design is too congested' issue.

_________________
Robert Finch http://www.finitron.ca


Thu Oct 18, 2018 6:53 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Well, I kept coming up with Floating Point Processor. I knew that wasn't right based on how the posts were moving. I don't think I would ever have put two and two together like Tor did. Thanks. :D

_________________
Michael A.


Fri Oct 19, 2018 12:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Putting the target register spec in a fixed position did help the tools. There are only a couple of muxes now in the register spec path. One forces the target register to zero for instructions that don't have a target. It may be possible to avoid this mux in the future by making use of the write enable signal to the register file RAM. Another mux is in the 'B' register spec, forcing the register read to the link register for the return instruction. This mux could be avoided by not allowing the return instruction to update the stack pointer. Still thinking about this one.
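A minimal sketch of the two muxes described above (field positions, the link register number, and the signal names are my own assumptions, not taken from the FT64 source):
Code:
   // Hypothetical illustration only: force the target spec to r0 when the
   // instruction has no target, and force the 'B' read port to the link
   // register for RET.
   wire [4:0] LR = 5'd29;                          // assumed link register number
   wire [4:0] Rt = has_target ? ir[12:8]  : 5'd0;  // mux #1: no target -> r0
   wire [4:0] Rb = is_ret     ? LR        : ir[17:13];  // mux #2: RET reads LR

Using the register-file write enable instead of the first mux would just mean leaving Rt as decoded and gating the write.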
In the test system, a LED was connected to the LED circuit select line to see if at least a circuit select for the LEDs was being generated. Yup, according to the LED there is a circuit select, but there is still no output on the LEDs. I've checked and triple-checked the data lines, select lines, and addresses; everything appears to be in order. And it works in simulation.
The GPU screen output is still black after a couple of minor fixes, but I just found a bug in the data load path. Data was being loaded from external data only, and not from the GPU's ROM memory where the constants are stored. Easy to fix; two hours later to test.

_________________
Robert Finch http://www.finitron.ca


Fri Oct 19, 2018 3:25 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Tried adding additional registers on the cpu input and output data paths. It was possible to get away with adding registration on the output path because the mmu delays the cycle-active signal by two cycles. So, delaying the data by a cycle shouldn't have an impact; it's still one cycle ahead of the bus-active signal, which should be plenty of setup time for the data. To register the input path, the ack signal had to be registered as well, so there is one more clock cycle for the cpu to access external memory.
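Roughly what the extra registration might look like (a sketch only; the dat_o/dat_i/ack_i names follow common Wishbone practice and are assumptions about the actual FT64 bus interface):
Code:
   // Sketch: register the outgoing data and the incoming data/ack.
   // The mmu already delays the cycle-active signal by two clocks, so the
   // registered dat_o is still a cycle ahead of the bus-active signal.
   always @(posedge clk) begin
      dat_o   <= core_dat;   // output path: one extra register stage
      dat_i_r <= dat_i;      // input path: registered data ...
      ack_r   <= ack_i;      // ... and registered ack (one more cycle of latency)
   end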
I wired up the leds to display a constant value when a write by the cpu takes place. That worked, so I know the leds are correctly connected to the I/O port. But when the leds are connected to a latched version of the data bus, none of them light up. It's as if a zero is being written to the leds instead of the correct data. Hence my experimentation with additional output registering on the data bus.
I may have to look into a debug facility like chipscope, which can look at signals inside the FPGA.

_________________
Robert Finch http://www.finitron.ca


Sat Oct 20, 2018 3:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I identified all the timing errors through the timing report. They were almost exclusively in the multi-port memory controller, which was using a 100MHz clock. According to the timing report the memory controller isn't quite fast enough; it seems to top out at about 77MHz. So, I changed most of the clocking from the internal mem_ui_clk (100MHz) to an external 40MHz clock.

Changed the relative addressing of branches to displacement-plus-offset (DPO) addressing. Using an offset in addition to a relative displacement allowed an adder to be removed from each place the branch displacement is calculated (five spots = five adders). It also shortened the remaining adders by eight bits. With DPO addressing the lower eight bits of the address come directly from the target field in the instruction without being added to the pc. Above bit eight the addressing is relative, so addresses are effectively broken into 256-byte pages. This adds some complexity to the assembler and linker, but simplifies the generated hardware.
In terms of hardware, the following:
Code:
   br_target <= pc + {{20{ir[31]}},ir[31:23],ir[17],2'b0};   // Simple displacement

becomes:
Code:
   br_target[31:8] <= pc[31:8] + {{20{ir[31]}},ir[31:28]};   // page relative displacement
   br_target[7:0] <= {ir[27:23],ir[17],2'b00};      // plus offset

The adder and carry chain now cross only 24 bits instead of 31 bits.

I also changed where the memory addresses generated by the core are stored. It used to be that once an address was generated it was dumped back into one of the argument slots in the re-order buffer. The argument slot has a fairly hefty multiplexor on its input. So instead I added another field to the re-order buffer to store the memory address in. This should reduce the size of the multiplexor required.
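A sketch of the kind of change described (the names iq_ma, agen_done, agen_id, and agen_addr are hypothetical, not necessarily the actual FT64 identifiers):
Code:
   // Dedicated per-entry memory address field in the re-order buffer,
   // written directly by the address generator instead of going back
   // through the heavily-muxed argument slot.
   reg [63:0] iq_ma [0:QENTRIES-1];
   always @(posedge clk)
      if (agen_done)
         iq_ma[agen_id] <= agen_addr;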

_________________
Robert Finch http://www.finitron.ca


Sun Oct 21, 2018 3:20 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Curious! Is that a new invention or is displacement plus offset seen in any previous machine out there?


Sun Oct 21, 2018 8:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Curious! Is that a new invention or is displacement plus offset seen in any previous machine out there?

I've not seen it used before, combining two different address modes at the same time. I used it in an earlier machine a few years ago. It's probably considered a bad idea by designers, but I can't imagine that someone hasn't thought of it before in the interest of performance. It's a bit confusing to deal with, and it may be that it doesn't make enough difference to be worth using.

Fired up the ol' logic analyzer. Chipscope is easier to use than I thought it might be. You have to feed it a fast-enough clock though. I started with a 10MHz clock and that wasn't fast enough; 20MHz didn't work either, so it's now connected to a 40MHz clock. This is a multiple of the system clock, so it should work okay. So far I haven't found anything out. It could take a while switching the probes around to find out which signals are hanging the system up. It's not as easy as switching real probes, and it takes hours to rebuild the system.

I received a book yesterday on processor design with a focus on superscalars. I'm just reading through it and finding things that are consistent with what I've learned from other sources. Reading up on pre-decoding, I realized there are a few signals that could be pre-decoded in FT64 that might help with timing. These signals include the register file write signal, the memory load operation decode, and the 'source automatically valid' signals. These signals are used between the fetch and queue parts of the core and at the moment are decoded inline between the stages. They could be decoded earlier and then fed forward as registered signals.
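As an illustration of the idea (a sketch with assumed opcode constants and signal names, not the actual FT64 decode):
Code:
   // Pre-decode at fetch time and carry the results forward as registers,
   // instead of decoding inline between the fetch and queue stages.
   // OP_BRANCH, OP_STORE, OP_LOAD, OP_LDI are assumed opcode constants.
   always @(posedge clk)
      if (fetch_valid) begin
         pre_rfwr <= !(opcode==OP_BRANCH || opcode==OP_STORE); // writes register file
         pre_load <=  (opcode==OP_LOAD);                       // memory load
         pre_srcv <=  (opcode==OP_LDI);                        // sources automatically valid
      end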

There are so many instructions that write to the register file, and so many load operations, that I've toyed with the idea of a new ISA that simply has the write and load specifiers as dedicated bits in the instruction opcode. That way no pre-decoders are required (the decode is moved to software), but it requires a wider opcode (at least five extra bits). Pre-decode bits would end up being stored in the cache anyway, so there's no real additional cost there.

_________________
Robert Finch http://www.finitron.ca


Mon Oct 22, 2018 4:11 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
DPO addressing isn't as bad as it might seem. The displacement portion is just the difference between the upper bits of the target address and the current address. The offset portion is just the lower bits of the target address. The following sample calculation shows what's used in the assembler.
Code:
      disp = (val >> 8LL) - ((code_address + 4LL) >> 8LL);
      offset = val & 0xffLL;

“val” is the target address. The addressing mode is well hidden by the assembler. The compiler doesn’t have to know about it.
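As a made-up example, a branch located at $1234 with a target of $1180 gives disp = ($1180 >> 8) - ($1238 >> 8) = $11 - $12 = -1 and offset = $80; the hardware then adds -1 to the upper bits of the pc and takes $80 directly as the low byte of the target.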

_________________
Robert Finch http://www.finitron.ca


Mon Oct 22, 2018 4:27 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Oh, rehooking ChipScope sounds painful.

Could you give a reference or link for that book? Thanks!


Mon Oct 22, 2018 7:33 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The book is:

Modern Processor Design – Fundamentals of Superscalar Processors
John Paul Shen, Mikko H. Lipasti
Waveland Press, Inc.
http://www.Waveland.com

It rates well on Amazon.com.

_________________
Robert Finch http://www.finitron.ca


Tue Oct 23, 2018 4:39 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I found out which line of code the core is hanging on. It's the branch instruction at address F…FC095E, about the 12th instruction to execute. The program counter was dumped to the logic analyzer and can be seen sequencing through the instructions until it gets to the branch.
Code:
                           ClearTxtScreen:
FFFFFFFFFFFC0938 04 80 90 00                          ldi      r4,#$0024
FFFFFFFFFFFC093C 55 00 10 18 70 FF                    sb      r4,LEDS
FFFFFFFFFFFC0942 44 20 00 00 40 FF                    ldi      r1,#$FFFFFFFFFFD00000   ; text screen address
FFFFFFFFFFFC0948 04 40 60 00                          ldi      r2,#24      ; number of chars 2480 (48x35)
FFFFFFFFFFFC094C 27 63 40 00 49 60 80 00              ldi      r3,#$00000080FFFF0020
FFFFFFFFFFFC0954 FC FF                       
                           .cts1:
FFFFFFFFFFFC0956 24 81 0C 00                          sw      r3,[r1]
FFFFFFFFFFFC095A 81 04                                add      r1,r1,#8
FFFFFFFFFFFC095C A2 0F                                sub      $r2,$r2,#1
FFFFFFFFFFFC095E 30 22 03 05                          bne      $r2,$r0,.cts1
FFFFFFFFFFFC0962 80 20                                ret

If I had to guess, I'd say it's stalled because the instruction queue is full and no more instructions can queue. This could happen if an earlier instruction didn't finish properly. I'm wondering if I should include logic to unstick the core if this happens: say, after 200 clock cycles with nothing happening, flush the instruction queue and raise an exception.
Now to move the probes to verify the hypothesis.
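Something along these lines could work for the unstick logic (purely a sketch; the signal names are made up):
Code:
   // Watchdog: if nothing commits for 200 clocks, flush the queue and
   // raise an exception so the core can recover instead of hanging.
   reg [7:0] stall_cnt = 8'd0;
   always @(posedge clk)
      if (commit_v)                     // progress: reset the watchdog
         stall_cnt <= 8'd0;
      else if (stall_cnt == 8'd200) begin
         flush_iq    <= 1'b1;           // dump the stuck queue
         stall_fault <= 1'b1;           // signal an exception (handled elsewhere)
         stall_cnt   <= 8'd0;
      end
      else
         stall_cnt <= stall_cnt + 8'd1;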

_________________
Robert Finch http://www.finitron.ca


Wed Oct 24, 2018 3:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I hope to try varying the queue size slightly to check if the processor hangs due to a full queue.

Finally got around to finishing off the totally parameterized queue size. The queue size can now be set to a value between 10 and 127. One caveat: there is only memory issue logic for the first 10 slots. Issue logic for the other types of instructions is automatically generated based on the queue size.
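The generated issue logic amounts to something like the following (a sketch; iq_v, iq_args_valid, and iq_out are assumed names for the queue-entry status bits):
Code:
   // One copy of the could-issue test is generated per queue entry.
   wire [QENTRIES-1:0] could_issue;
   genvar n;
   generate
      for (n = 0; n < QENTRIES; n = n + 1) begin : gIssue
         assign could_issue[n] = iq_v[n] & iq_args_valid[n] & ~iq_out[n];
      end
   endgenerate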

The core will speculate past any number of branches that can be contained in the queue. Many superscalars have a fixed limit to the number of branches they will speculate past, for instance three or four. This occurs because a two-bit counter is used to count the branch instructions, and the instructions following a branch are assigned the count value. This is how the machine knows which instructions to remove from the queue when a branch miss occurs.

FT64 is simpler in that a counter is not used just to count the branch instructions. For v7, a five- to eight-bit count (depending on queue size) is recorded for every instruction queued. When an instruction commits to the machine state, all the following queue entries have their counts decremented by the count of the instruction just committed. When an instruction is queued, its count is set to the next count above the highest one found in the queue. This mechanism keeps the count within a few values of the number of queue entries, guaranteeing the count will never overflow. The counts also remain in sequence, so it's possible to tell the order of instructions. To remove instructions from the queue on a branch miss, all the instructions with a higher count than the branch are removed. Previously FT64 relied on a 32-bit instruction counter that had to be periodically reset; there's no need for a special counter reset in v7 of the core. The comparator logic for the count should be greatly reduced due to the smaller count size.
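A sketch of the sequence-count handling described above (assumed names and an 8-bit count; not lifted from the FT64 source):
Code:
   reg [7:0] iq_sn [0:QENTRIES-1];      // per-entry sequence count
   integer i;
   always @(posedge clk) begin
      if (queue_en)                     // new entry: next count above the highest in use
         iq_sn[tail] <= max_sn + 8'd1;
      if (commit_en)                    // commit: renormalize the remaining entries
         for (i = 0; i < QENTRIES; i = i + 1)
            if (iq_v[i])
               iq_sn[i] <= iq_sn[i] - iq_sn[head];
      if (branchmiss)                   // miss: squash entries younger than the branch
         for (i = 0; i < QENTRIES; i = i + 1)
            if (iq_sn[i] > iq_sn[missid])
               iq_v[i] <= 1'b0;
   end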

_________________
Robert Finch http://www.finitron.ca


Thu Oct 25, 2018 5:00 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Played mostly with a TLB today. The TLB from Thor was ported over to FT64. Since I'm not having much luck getting the system built with a larger queue size, I decided to try the opposite: building the system with a smaller queue size. I had to add some generate statements to strip out some of the issue logic for the smaller queue. The queue size may now be set as low as six entries, and there's the potential for an even smaller queue size.
As suspected, with a smaller queue size the core locked up sooner. The probes have been moved to verify that the commit signal isn't being generated. The mystery is why the problem doesn't show up in simulation.

_________________
Robert Finch http://www.finitron.ca


Sat Oct 27, 2018 6:56 am