 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Found a bunch of signals that were not connected. Fixing this did not affect the synthesis. It is still omitting the data cache and related logic.
Sketched out a 48-bit ISA for rfPhoenix2. GPU style vector processor.
Done a lot of work on the compiler lately.

_________________
Robert Finch http://www.finitron.ca


Sat Apr 29, 2023 8:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the core executing the first couple of instructions of Fibonacci in simulation. This is a bit of a feat considering that memory access is through a shared TLB. There is currently too much activity showing up on the bus. It looks like reads, at least, do not stop accessing the bus at the proper time. They seem to alternate on and off.

Still have not figured out why the data cache logic is elided from the synthesized system. I was hoping that it would become clear by running simulations. If they can be made to run to the point of performing a load or store operation then maybe the issue will be illuminated.

I had to konk the assembler over the head a few times to get it to output relative addresses for branches. I thought I had the code in place, but it did not work as expected. I finally put in a bit of a kludge that works.

_________________
Robert Finch http://www.finitron.ca


Sun Apr 30, 2023 2:36 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Managed to free up an opcode at the root level by moving several oddball instructions over to the LOADZ opcode. LOADZ had a bit available to select the oddball instructions. Oddball instructions were the load address, LA, cache control, CACHE, and pointer store, STPTR operations. There are now two free opcodes at the root level.

Added cache-ability options to the load and store instructions. Two bits are dedicated to specifying the cache policy to use for the operation. Wondering now how to specify these in a high-level language. There is partial support in C with the volatile keyword.
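
As a point of reference, here is a minimal sketch of what standard C offers. volatile forces every access to actually be performed, which is in the spirit of an uncached access, but it has no way to pick one of several cache policies. The helper names are hypothetical and are not part of the Thor toolchain.

```c
#include <stdint.h>

/* Hypothetical helpers: volatile makes the compiler emit every access,
   but it cannot choose among the cache policies the ISA encodes. */
static inline void store_every_time(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;      /* the store is not elided or merged by the compiler */
}

static inline uint32_t load_every_time(const volatile uint32_t *reg)
{
    return *reg;       /* the location is re-read on every call */
}
```

Selecting a specific policy per access would probably need a compiler extension such as an attribute or intrinsic; that part is speculation, not something the post describes.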

Got around to running some simulations. All the extra bus activity was due to bus snooping that should not have been happening. Snoops were being triggered for every bus cycle. As a result many of the valid bus cycles got aborted then restarted. Got a core to run up to the point of performing a memory store operation. Currently the store operation hangs the core; it is locked up in one state.

_________________
Robert Finch http://www.finitron.ca


Mon May 01, 2023 3:52 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Issues with the propagation of the transaction id, tid, on the system bus were resolved. The tid is used to keep track of the virtual address sent out by the cache controller. The address is stored in a table according to the tid so that the address does not need to be passed around the entire system.
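
The tid table idea can be sketched as a small software model in C. The table size and function names below are assumptions made for illustration; the actual hardware is not shown in the post.

```c
#include <assert.h>
#include <stdint.h>

#define TID_COUNT 16u   /* assumed number of outstanding transaction ids */

static uint64_t tid_vaddr[TID_COUNT];   /* virtual address saved per tid */

/* Record the virtual address when the request goes out on the bus. */
static void tid_issue(unsigned tid, uint64_t vaddr)
{
    assert(tid < TID_COUNT);
    tid_vaddr[tid] = vaddr;
}

/* A response carries only the tid; look the address back up. */
static uint64_t tid_complete(unsigned tid)
{
    assert(tid < TID_COUNT);
    return tid_vaddr[tid];
}
```

Only the small tid has to travel with the transaction; the wide address stays in the table next to the cache controller.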

_________________
Robert Finch http://www.finitron.ca


Wed May 03, 2023 4:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Finally got synthesis to build with the data cache included. The bare bones dual core system using a sequential CPU is about 30k LUTs. It includes integer CPUs, a MMU, timer, and interrupt controller. With FPU added it is likely to double in size.

Been running some simulations with limited success. Something is still amiss with the data access. It is taking 50-60 cycles when it should take 5 or 6.

Cleaned up the code for the instruction cache. There was an issue where data was unavailable for instructions when it should have been available. This happened when an instruction postfix crossed a cache line boundary. Half of the postfix did not show up in the cache, leading to a bad store address. I tried switching the cache load to load 512 bits on a miss instead of 256 bits and that seems to have fixed that issue. The length of an instruction cache line is only 256 bits because they are used in pairs. So, it loads two cache lines on a miss now.
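
The boundary case can be checked with a little address arithmetic. This is a sketch in C that assumes byte addressing, a 32-byte (256-bit) line, and that the second line fetched on a miss is the next sequential one; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ILINE_BYTES 32u   /* 256-bit instruction cache line, per the post */

/* True when the bytes [addr, addr+len) spill into the next cache line.
   len must be non-zero. */
static bool crosses_line(uint64_t addr, unsigned len)
{
    return (addr / ILINE_BYTES) != ((addr + len - 1) / ILINE_BYTES);
}

/* Start address of the line following the one that contains addr. */
static uint64_t next_line(uint64_t addr)
{
    return (addr / ILINE_BYTES + 1) * ILINE_BYTES;
}
```

For instance, an 8-byte instruction plus postfix starting at offset 28 needs bytes from both the line at 0 and the line at 32, which is the case that loading a pair of lines covers.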

_________________
Robert Finch http://www.finitron.ca


Fri May 05, 2023 6:21 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Is there any advantage to having an option for the compiler to align things with the cache line?
I am thinking of short code segments of looping code, while(*a++ = *b++); for example.


Fri May 05, 2023 7:20 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is there any advantage to having an option for the compiler to align things with the cache line?
I am thinking of short code segments of looping code, while(*a++ = *b++); for example.

I am not sure, but I think there would be in some cases. Much of the time though the data / instructions will be in the cache, so then I think it does not make a difference. Suppose the short loop spans a cache line. The only time performance would be different is the first time in the loop if both cache lines are not loaded. If the loop is short enough and it is a pipelined processor, then it may be possible to get the instructions to execute in a pipeline loop so that it does not even go to the cache.
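
To put numbers on the alignment question, here is a rough C sketch that counts how many cache lines a short loop touches and how much padding would align it. The line size is passed in as a parameter, and the function names are assumptions for illustration only.

```c
#include <stdint.h>

/* Number of cache lines touched by `size` bytes of code starting at `start`.
   size and line_bytes must be non-zero. */
static unsigned lines_spanned(uint64_t start, unsigned size, unsigned line_bytes)
{
    uint64_t first = start / line_bytes;
    uint64_t last  = (start + size - 1) / line_bytes;
    return (unsigned)(last - first + 1);
}

/* Padding bytes a compiler would insert to start the loop on a line boundary. */
static unsigned pad_to_line(uint64_t start, unsigned line_bytes)
{
    return (unsigned)((line_bytes - start % line_bytes) % line_bytes);
}
```

With 32-byte lines, a 16-byte loop at offset 24 spans two lines; padding it by 8 bytes brings it down to one, which only matters on the first pass if both lines are not yet loaded.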

Changed the register file access in Thor to reduce the footprint and make storing / loading groups of registers more efficient. The register file is now 64 bits wide instead of 128 bits, but registers can still be viewed as 128-bit because the upper half of each register is stored in the upper half of the register file. The lower or upper 64 bits of registers can be stored or loaded in groups of eight. When working with 64-bit values and operations this is twice the number of registers in a single load / store operation. Part of the register file was unused. In fact there is enough room to support an alternate register file because of the way the LUTs are mapped for register file usage.
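
The split layout can be modeled in C as below. The register count of 64 is an assumption (the post does not state it), and the type and function names are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 64u   /* assumed register count; not stated in the post */

/* 64-bit-wide storage: lower halves live in [0, NREGS),
   upper halves in [NREGS, 2*NREGS). */
static uint64_t regfile[2 * NREGS];

typedef struct { uint64_t lo, hi; } reg128;

static void write128(unsigned r, reg128 v)
{
    assert(r < NREGS);
    regfile[r] = v.lo;
    regfile[NREGS + r] = v.hi;
}

static reg128 read128(unsigned r)
{
    assert(r < NREGS);
    reg128 v = { regfile[r], regfile[NREGS + r] };
    return v;
}

/* A 64-bit operation only touches the lower half. */
static uint64_t read64(unsigned r)
{
    assert(r < NREGS);
    return regfile[r];
}
```

Because each half is a separate 64-bit entry, a group load or store can move only the lower halves when the program works with 64-bit values.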

Memory indirect addressing is being added to Thor. It is used when accessing shared library variables and code. Memory indirect jumps are possible, as in JSR [myrout@got(pc,r48)]. Variables are accessible with load and store memory indirect addressing, as in LOAD.H [myvar@got(pc,r48)]. Register r48 is used to point to the GOT relative to the PC.
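
In C terms, a GOT access is one extra level of indirection: fetch an address from a table, then load or call through it. The sketch below uses myvar and myrout as stand-ins for shared library symbols; the table layout and helper names are hypothetical.

```c
#include <stdint.h>

static int32_t myvar = 42;                           /* stand-in library variable */
static int32_t myrout(int32_t x) { return x + 1; }   /* stand-in library routine */

/* Hypothetical GOT: one slot per symbol, normally filled in by the loader. */
static void *got_data[] = { &myvar };
static int32_t (*got_code[])(int32_t) = { myrout };

/* Like a memory indirect load: read the address from the GOT, then load through it. */
static int32_t load_via_got(unsigned slot)
{
    return *(int32_t *)got_data[slot];
}

/* Like a memory indirect JSR: read the target from the GOT, then call it. */
static int32_t call_via_got(unsigned slot, int32_t arg)
{
    return got_code[slot](arg);
}
```

The memory indirect addressing mode folds the table read and the final access into a single instruction.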

_________________
Robert Finch http://www.finitron.ca


Sat May 06, 2023 7:24 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Compiler:
I experimented with adding GOT addressing to the compiler. Without GOT relocations programs are limited to static linkages. I think I have most of the GOT capability in place, but am not sure if the linker will merge the GOT tables from different translation units.

Simulation:
It looks like there is an issue with a bit error occurring in the instruction cache for the second core. A couple of bits in the cache line are set to one that should not be. This causes core two to go off on a tangent executing invalid instructions, eventually hanging. The first core is not executing branches; it runs through to the end of the Fibonacci program, then halts waiting for the instruction cache to load. The instruction cache load fails and retries repeatedly. It loads 3/4 of the instruction data, but aborts on the last line. Then, because it was not successful, it tries again.

_________________
Robert Finch http://www.finitron.ca


Sun May 07, 2023 6:28 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided to add post-increment and pre-decrement addressing modes to the scaled indexed addressing.
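
For readers less familiar with these modes, they correspond to the C idioms *p++ and *--p. A minimal sketch using them as a stack push and pop follows; the function names are hypothetical, and the scaled indexed part of the addressing is not modeled.

```c
#include <stdint.h>

/* Pre-decrement store: move the pointer down first, then write (a push). */
static void push(uint64_t **sp, uint64_t value)
{
    *--(*sp) = value;
}

/* Post-increment load: read first, then move the pointer up (a pop). */
static uint64_t pop(uint64_t **sp)
{
    return *(*sp)++;
}
```

Having both modes means a push or pop needs no separate add or subtract on the pointer.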

Also tentatively added a repeat instruction to the ISA. Any instruction(s) may be repeated. The loop counter may be either incremented or decremented during the repeat and it is tested against an immediate value with one of eight signed test conditions. A sample follows.
Code:
   mov LC,r0
   repilt 1000,0
      store.h r0,0xfffc0000[r0+LC*]

The above sets a block of memory to zero.
repi indicates that the loop counter is incremented. The lt indicates to loop while less than. The loop runs 1,000 times and the number of instructions in the loop is one. There is a maximum of eight instructions in the loop.
The benefit of REP comes from eliminating an instruction from the loop. Only the instructions inside the loop are repeated.
REP uses a buffer to store REP state that is about 200 bits. This buffer needs to be saved and restored when interrupts occur.
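
A C model of what the sample does, as I read it: LC starts at zero, and the single store repeats while LC is less than 1000, with LC incremented on each pass. The element size and the function name are assumptions, since the scale factor is not visible in the sample.

```c
#include <stdint.h>

#define COUNT 1000   /* immediate limit from the sample */

/* Model of: mov LC,r0 / repilt 1000,0 / store ... [r0+LC*...] */
static void rep_clear(uint32_t *block)
{
    for (int64_t lc = 0; lc < COUNT; lc++)   /* "i" = increment, "lt" = while less than */
        block[lc] = 0;                        /* the one repeated instruction */
}
```

In the C version the compare, increment, and branch are separate steps in the loop; in the REP version they are carried by the repeat state, so only the store is fetched repeatedly.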

_________________
Robert Finch http://www.finitron.ca


Mon May 08, 2023 4:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
After some trials and tribulations with the simulator, the core starts fetching instructions much faster now. It gets a cache hit after 3 µs instead of 30 µs. For some reason the ack response processor was designed the same way as the request generator: it waited a random number of cycles after receiving a response before trying to process the next one. This of course caused it to miss the second response coming back from the RAM, leading to retry attempts.

The core still does not process correctly, ignoring branches.

_________________
Robert Finch http://www.finitron.ca


Tue May 09, 2023 5:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Packed the REP state into 128 bits by limiting the REP limit to a 32-bit compare. Since it is a signed comparison, that is a max count of 2^31-1.

Figured out an issue: the register file was not being updated because the register set selector was not set. This led simulation to assume an ‘X’ value. There were several other issues with the register file having to do with unset bits.
The latest set of fixes still did not get the branches including REP to work.

Figured out why REP was failing. The test program was not linked properly. I had been experimenting with GOT addressing and changed the makefile. The make was failing at an earlier point.

Got REP and branches working now. The core is also much faster on the bus now. A couple of issues with the bus interface fixed.

Now that I have got parts of it working, I have decided to shelve the design. It is not code-space efficient enough. Thor2022 had much better code space efficiency. I think the average instruction was under 32 bits. So, Thor2024 is starting from a combination of Thor2022 and Thor2023 and should have much better code density. It will be using a variable length instruction set with lengths varying in 16-bit parcels.

Reducing the number of registers to 32, but will have separate register files for integer and float operations. This saves 2 or 3 bits in every instruction. Also not using the register complement / negate bits and instead having more instructions. The 2024 version will be using postfix immediates. 2022 uses prefixes, but I think postfixes work better.

_________________
Robert Finch http://www.finitron.ca


Wed May 10, 2023 6:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Well, I un-shelved the core after spending most of the day documenting for Thor2024. Most of Thor2024’s instruction set is 32-bit. It will be a while though before it is finished enough to work on.

Just waiting for the system to build. I got it to work well enough that it should be able to light up some LEDs on the FPGA board, and hopefully clear the screen. The core seems to respect branches now and it is doing the Fibonacci loop.
Even though the core is just a sequential machine, with multiple cores running it should be able to hide some of the memory latency. While machine A is accessing memory, machine B is running other instructions.

The memory system is interesting because it is an asynchronous system. I have not used an asynchronous system for over 10 years. I find synchronous simpler, but it is slow as molasses.

Mulling over the loss of a graphics core due to a hard disk failure. I do not believe it, but I did not have it backed up anywhere that I can find. I had modified the opencore.org graphics accelerator to support a 128-bit bus with a bunch more graphics resolutions.

_________________
Robert Finch http://www.finitron.ca


Thu May 11, 2023 6:51 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Ouch - sorry to hear about your data loss.


Thu May 11, 2023 8:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Fixed an issue in the system having to do with colliding responses. They needed to be buffered so that only one at a time is fed back to the cpu.
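
The buffering fix amounts to putting a FIFO between the responders and the CPU. Here is a small C model of that idea; the depth, the use of a 64-bit value as the response payload, and the names are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 8u   /* assumed depth */

typedef struct {
    uint64_t slot[FIFO_DEPTH];
    unsigned head, count;
} resp_fifo;

/* Queue an arriving response; returns false when the buffer is full. */
static bool resp_push(resp_fifo *f, uint64_t resp)
{
    if (f->count == FIFO_DEPTH) return false;
    f->slot[(f->head + f->count) % FIFO_DEPTH] = resp;
    f->count++;
    return true;
}

/* Hand exactly one response to the CPU; returns false when empty. */
static bool resp_pop(resp_fifo *f, uint64_t *out)
{
    if (f->count == 0) return false;
    *out = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

Two responses that arrive close together are both queued, and the CPU side drains them one at a time in arrival order.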

Worked on the graphics accelerator, setting it up with a 128-bit bus interface. This style of graphics accelerator is a bit out of date, having been replaced by GPUs. The issue is that a GPU would use too many resources in the FPGA.

_________________
Robert Finch http://www.finitron.ca


Fri May 12, 2023 3:28 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The postfix immediates were simplified. This reduces the amount of logic required in the decoder. There are three sizes of postfixes rather than a single format that builds. The size must be decoded in the decoder anyway. Postfix size is reduced by a byte for 64-bit postfixes and by two bytes for 128-bit postfixes.

Coded-up load and store multiple registers. The code builds up a cache line worth of data then stores it, or loads a cache line and separates it into four register updates. So, there is only one memory access for every four registers.
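
A rough C model of the grouping: four registers are gathered into one line-sized buffer and written with a single access, and the load side does the reverse. The 64-bit word size and all names here are assumptions; the post only says there are four register updates per cache line.

```c
#include <stdint.h>
#include <string.h>

#define REGS_PER_LINE 4u   /* four register updates per line, per the post */

typedef struct { uint64_t word[REGS_PER_LINE]; } cache_line;  /* word size assumed */

/* Store multiple: gather four registers, then issue one line-wide write. */
static void store_group(cache_line *mem, const uint64_t regs[REGS_PER_LINE])
{
    cache_line line;
    memcpy(line.word, regs, sizeof line.word);
    *mem = line;                     /* models the single memory access */
}

/* Load multiple: one line-wide read, split into four register updates. */
static void load_group(const cache_line *mem, uint64_t regs[REGS_PER_LINE])
{
    cache_line line = *mem;          /* models the single memory access */
    memcpy(regs, line.word, sizeof line.word);
}
```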

Added a postfix instruction containing a register list bitmap.

Choosing which registers correspond to which mask bits was interesting. I decided not to use a direct map of bit position to register number; instead, the registers thought most likely to be stored and loaded come first in the bitmap.
Postfixes are treated as part of the instruction. With two large postfixes, an instruction can be up to 256 bits long (48 + 136 + 72).
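
The register list bitmap with a non-direct mapping can be sketched in C with a lookup table. The table contents, the 8-bit mask width, and the function name below are invented for illustration; the real ordering is not given in the post.

```c
#include <stdint.h>

/* Hypothetical ordering: bit 0 names the register expected to be saved
   most often. These register numbers are placeholders only. */
static const uint8_t bit_to_reg[8] = { 31, 30, 1, 2, 3, 4, 5, 6 };

/* Expand a mask into register numbers; returns how many were selected. */
static unsigned expand_mask(uint8_t mask, uint8_t regs_out[8])
{
    unsigned n = 0;
    for (unsigned bit = 0; bit < 8; bit++)
        if (mask & (1u << bit))
            regs_out[n++] = bit_to_reg[bit];
    return n;
}
```

With this placeholder table, a mask of 0x05 selects registers 31 and 1. Putting the commonly saved registers in the low bits means typical masks fit in the low bits of the bitmap.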

Got the core to set LEDs in simulation, so I tried synthesizing it and running it. Result: did not work, the LEDs were not set.

_________________
Robert Finch http://www.finitron.ca


Sat May 13, 2023 5:38 am