 ANY-1 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have discovered a new toy: elastic pipelines making use of fifos between stages. I had not tried this in my previous implementations, so I’ve decided to sidestep ANY1 momentarily to experiment with elastic pipelines. I will be experimenting with a simplified version of the nvio7 ISA.
Hopefully elastic pipelines will allow completed instructions to pile up behind memory operations in the pipeline, hiding some memory latency.
I would really like to get a good idea of how the CDC6600’s register scoreboarding worked.

_________________
Robert Finch http://www.finitron.ca


Sat May 15, 2021 6:41 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
robfinch wrote:
elastic pipelines making use of fifos between stages


Where did you find it? And how does it work?


Sat May 15, 2021 10:20 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
where did you find it? and how does it work?

I have read through this pdf (found while looking up info on CDC6600):

http://csg.csail.mit.edu/6.375/6_375_2016_www/resources/archbook_2015-08-25.pdf

It works basically the same as a pipeline with registers between stages, except the handshake between fifos is more local.
The document describes the BSV language, used for the examples.
I like the idea of using fifos. Small LUT-based fifos can store up to 64 instructions, so while a memory op is taking place the following instructions can queue up.
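Roughly, the behaviour can be modelled in software something like this (just a C sketch of the idea, not the actual Verilog; all the names are made up). Each stage only looks at its own input fifo and the fifo feeding the next stage, so the stall decision is purely local:

Code:
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 64          /* small LUT-based fifo between stages */

typedef struct {
    uint64_t data[FIFO_DEPTH];
    int head, tail, count;
} fifo_t;

static bool fifo_full(const fifo_t *f)  { return f->count == FIFO_DEPTH; }
static bool fifo_empty(const fifo_t *f) { return f->count == 0; }

static void fifo_push(fifo_t *f, uint64_t x) {
    f->data[f->tail] = x;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
}

static uint64_t fifo_pop(fifo_t *f) {
    uint64_t x = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return x;
}

/* One clock of one stage: the handshake is local - the stage only asks
 * "is my input fifo non-empty?" and "is the output fifo non-full?".
 * A slow stage (e.g. memory) simply stops popping, and work piles up
 * behind it in the fifo instead of freezing the stages upstream. */
static void stage_clock(fifo_t *in, fifo_t *out, uint64_t (*work)(uint64_t))
{
    if (!fifo_empty(in) && !fifo_full(out))
        fifo_push(out, work(fifo_pop(in)));
}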

One thing I am stuck on at the moment is how forwarding of results occurs. The document shows the use of a scoreboard to avoid RAW hazards.
The pipeline is stalled until the scoreboard indicates the register is not in use because the register file has been updated.
But I think this would conflict with a results-forwarding system. A regular pipeline would unstall once a result could be forwarded, which could be well before it is updated in the register file. I think without forwarding the pipeline would stall a lot of the time; the difference is one clock cycle on every stall.
Although in the doc a fifo is used for the scoreboard, I have instead set up an array of counters associated with the register file, indicating the number of outstanding updates to each register.
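The counter idea is roughly this (a minimal C sketch of the bookkeeping, not the RTL):

Code:
#include <stdbool.h>
#include <stdint.h>

#define NREGS 64

/* One counter per architectural register: the number of outstanding
 * (queued but not yet committed) updates to that register. */
static uint8_t pending[NREGS];

static void on_queue(int rd)  { pending[rd]++; }   /* new producer queued   */
static void on_commit(int rd) { pending[rd]--; }   /* register file updated */

/* A RAW hazard exists if any source register still has outstanding
 * updates; without forwarding this is where the pipeline stalls. */
static bool raw_hazard(int rs1, int rs2) {
    return pending[rs1] != 0 || pending[rs2] != 0;
}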

_________________
Robert Finch http://www.finitron.ca


Sat May 15, 2021 4:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have decided to stick with a simplified version of ANY1.

I have got a good chunk of the coding (about the first 1,000 lines) done now for a simplified version of ANY1.
Only the scalar integer instruction set is supported.
The core uses an elastic pipeline with a reorder buffer tacked on the tail end. This allows some instructions to execute out-of-order.

The memory system is decoupled from the core and operates on its own. It sits in a wait loop until there is an instruction cache miss or another memory request queued.

_________________
Robert Finch http://www.finitron.ca


Sun May 16, 2021 2:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Changed the JAL instruction so that the target address need only be eight-byte aligned instead of sixteen. Then modified the target register field to be two bits, allowing only ra1, ra2, or ra3 as the target register. Then moved the return address adjustment field over to bits 2 to 7 of the target register field. This increases the number of inline parameters possible. These changes also increase the range of the jump to 8TB.

Put a check in to execute a store operation only if it is at the head of the list of instructions. This prevents a speculative store instruction from executing before it is known whether prior instructions have exceptioned. If a prior instruction exceptioned, we do not want the store to execute, as there is likely no way to undo the store operation. The only way to ensure this is to make sure there are no prior instructions outstanding.

I am wondering what to do about exceptions. RISC-V has four different vector locations depending on the operating mode the processor was in when an exception occurred. It is up to software to examine the exception cause code and then decide what to do. I was thinking something similar would work for ANY1. There are five operating modes, so there could be a separate vector for each operating mode. I was also contemplating using a vector table associated with the cause code, but there would have to be five vector tables for exceptions.

Current core size is 27,000 LUTs with only a few instructions implemented.

_________________
Robert Finch http://www.finitron.ca


Mon May 17, 2021 5:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added a data cache. It works in a somewhat unusual fashion. If there is a cache miss on a load the data is loaded from memory and returned to the pipeline completing the load operation. At the end of the memory cycle the line containing the data is loaded into the cache. So, the requested data actually gets loaded twice on a cache miss, once to the pipeline and once to the cache. But loading the data cache line takes place in the background after the original request so some of that access is hidden.
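In pseudo-C the miss path is sequenced something like this (a toy single-line model, just to show the ordering; none of this is the actual state machine):

Code:
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 8
#define MEM_WORDS  1024

static uint64_t memory[MEM_WORDS];       /* toy backing store                */
static uint64_t line[LINE_WORDS];        /* one cache line, to keep it short */
static uint64_t line_tag;
static bool     line_valid;

/* On a miss the requested word goes straight back to the pipeline
 * (first access); the full line is then copied into the cache behind
 * the original request (second access), which hides part of the fill. */
static uint64_t dcache_load(uint64_t addr)
{
    uint64_t tag = addr / LINE_WORDS;
    if (line_valid && line_tag == tag)
        return line[addr % LINE_WORDS];  /* hit: data from the cache      */
    uint64_t data = memory[addr];        /* 1st: word to the pipeline     */
    memcpy(line, &memory[tag * LINE_WORDS], sizeof line); /* 2nd: line fill */
    line_tag = tag;
    line_valid = true;
    return data;
}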

Added a five-entry fully-associative victim cache to the instruction cache. This is the first time I have used a victim cache. The victim cache did not synthesize very well, leading to huge implementations taking a long time to synthesize and using tons of LUTs. I think the issue is the porting on the cache needed to swap cache lines around. It may be possible to implement the swap using more than one clock cycle.

Added support for multiply instructions. Multiply instructions come in two variations, fast and slow. The fast multiply is unsigned only, 24 x 16 bits, and takes only a single cycle to execute. Other multiply operations take four clock cycles to execute. The multiplier has its own queue so other instructions may be executing while a slow multiply is taking place.

More filling out of the instruction set. Many operations are relatively inexpensive to implement compared to other processor features.

_________________
Robert Finch http://www.finitron.ca


Tue May 18, 2021 2:25 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
robfinch wrote:
victim cache


what is "victim cache"? never heard about

robfinch wrote:
Added support for multiply instructions. Multiply instructions come in two variations, fast and slow. The fast multiply is unsigned only, 24 x 16 bits, and takes only a single cycle to execute. Other multiply operations take four clock cycles to execute. The multiplier has its own queue so other instructions may be executing while a slow multiply is taking place


How can the fast MUL take only 1 clock cycle? On a physical FPGA, DSP slices are the only pieces of hardware able to provide a fast unsigned MUL within 1 cycle, but they are 16x16-bit multipliers, hence if you need to multiply larger data you need to combine more slices along a hierarchical path:

uint16_t A,B,C,D;  (A:B) * (C:D) = ((A*C) << 32) + ((A*D + B*C) << 16) + (B*D)

Supposing you have four slices, the above takes 2..3 cycles to complete a 32x32 unsigned MUL (1 cycle for the slices, plus an extra 1..2 cycles for the adders).
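Written out in C, just to make the shifts explicit (not tied to any particular slice mapping):

Code:
#include <stdint.h>

/* 32x32 -> 64 unsigned multiply built from four 16x16 partial products,
 * the way it gets split across 16x16 multiplier slices. */
static uint64_t mul32x32(uint32_t x, uint32_t y)
{
    uint64_t A = x >> 16, B = x & 0xFFFF;    /* x = A:B */
    uint64_t C = y >> 16, D = y & 0xFFFF;    /* y = C:D */
    return ((A * C) << 32) + ((A * D + B * C) << 16) + B * D;
}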

I have recently implemented a MAC multiplier, A = A + (B * C). It takes 32 cycles to complete because its implementation is a simple "old school" algorithm. There is also a 5-cycle version, but it consumes more resources and its algorithm is more complex, hence I am not sure I want to synthesize it.
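For reference, the "old school" 32-cycle approach is essentially the textbook shift-add loop, one partial product per clock; in C it would look like this (a model of the general idea, not of my exact RTL):

Code:
#include <stdint.h>

/* Classic shift-add multiply-accumulate: A = A + (B * C), examining one
 * bit of C per iteration, i.e. roughly one clock per bit -> 32 cycles
 * for 32-bit operands. */
static uint64_t mac_shift_add(uint64_t acc, uint32_t b, uint32_t c)
{
    for (int i = 0; i < 32; i++) {
        if (c & 1)
            acc += (uint64_t)b << i;   /* add the shifted partial product */
        c >>= 1;
    }
    return acc;
}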


Tue May 18, 2021 8:33 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
There are a number of explanations of a victim cache on the web.
https://en.wikipedia.org/wiki/Victim_cache
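The gist is a handful of fully-associative entries that catch lines just evicted from the main cache, so a line that is needed again soon after eviction can be recovered without going to memory. A software sketch of the idea (not my RTL; the names are invented and the replacement policy here is just round-robin):

Code:
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64
#define VC_ENTRIES 5                 /* small and fully associative */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
} vc_entry_t;

static vc_entry_t vc[VC_ENTRIES];
static int        vc_next;           /* round-robin replacement pointer */

/* When the main cache evicts a line, the "victim" is parked here
 * instead of being discarded. */
static void vc_insert(uint64_t tag, const uint8_t *line)
{
    vc[vc_next].valid = true;
    vc[vc_next].tag   = tag;
    memcpy(vc[vc_next].data, line, LINE_BYTES);
    vc_next = (vc_next + 1) % VC_ENTRIES;
}

/* On a main-cache miss the victim cache is checked first; a hit here
 * avoids a memory access for a line that was only just evicted. */
static bool vc_lookup(uint64_t tag, uint8_t *line_out)
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].tag == tag) {
            memcpy(line_out, vc[i].data, LINE_BYTES);
            return true;
        }
    }
    return false;
}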

From the Xilinx Series7 Overview document:
Quote:
DSP slices with 25 x 18 multiplier, 48-bit accumulator, and pre-adder for high performance filtering, including optimized symmetric coefficient filtering

The multipliers are 25x18. But the ISA only specs 24x16. Easier to remember.

I have found the fast multiply useful for computing indexes into arrays, provided the array is not too large. There are sometimes several multiplies placed close together when calculating multi-dimensional array indexes.
I’ve also used it for calculating screen co-ordinates for a text or graphics display.

Have you looked at Karatsuba multiplies? Might take one less multiplier.
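For a 32x32 multiply it replaces one of the four partial-product multiplies with adds and subtracts; something like this (C sketch, names just for illustration - note the middle multiply becomes 17x17, which still fits in a 25x18 slice):

Code:
#include <stdint.h>

/* Karatsuba 32x32 -> 64 unsigned multiply: three multiplies instead of
 * four, at the cost of a few extra adders. */
static uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    uint64_t A = x >> 16, B = x & 0xFFFF;        /* x = A:B */
    uint64_t C = y >> 16, D = y & 0xFFFF;        /* y = C:D */
    uint64_t z2 = A * C;                         /* high product          */
    uint64_t z0 = B * D;                         /* low product           */
    uint64_t z1 = (A + B) * (C + D) - z2 - z0;   /* = A*D + B*C (17x17)   */
    return (z2 << 32) + (z1 << 16) + z0;
}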

_________________
Robert Finch http://www.finitron.ca


Tue May 18, 2021 6:01 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added support for divide operations.

It’s a huge core again, 73,000 LUTs. Much of the LUT usage is because all the RAM-type elements are being implemented using LUT RAMs and FFs. When synthesized it indicates zero block RAMs used. The instruction and data caches are both 16kB and made up of LUT RAM. At one point they were synthesizing to inferred block RAMs and the core was much smaller. The size of the caches and other structures like the branch target buffer may end up being reduced.

The fifos between stages of the core are 66 entries deep. With about five stages to the core in theory there could be 300+ instructions being worked on. The reorder buffer is 32 entries deep. In practice, not more than a handful of instructions are expected to be queued in the fifos.

_________________
Robert Finch http://www.finitron.ca


Wed May 19, 2021 4:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Fixed up the core a lot over the past day or so, fixing things like results forwarding, which wasn’t really happening. The core was stalling on hazards until the register file got updated in the commit stage, which is a lot of unnecessary stalling. The core now proceeds as soon as all arguments for the instruction are valid, the arguments coming from the register file, or from the reorder buffer as soon as a functional unit has calculated them. To do this, the scoreboard was changed to track the reorder buffer entry associated with a target register. A history file was also necessary to back out the associations on a branch miss.
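In outline the operand fetch now works something like this (a C sketch of the idea; the field and array names are invented):

Code:
#include <stdbool.h>
#include <stdint.h>

#define NREGS    64
#define ROB_SIZE 32

typedef struct {
    bool     finished;              /* functional unit has produced the result */
    uint64_t result;
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static uint64_t    regfile[NREGS];
static int         producer[NREGS];   /* rob entry that will write the reg,
                                          or -1 if the reg file copy is current */

static void producers_reset(void)     /* at reset, and restored from the
                                          history file on a branch miss */
{
    for (int i = 0; i < NREGS; i++)
        producer[i] = -1;
}

/* Read an operand: take it from the register file if no update is
 * outstanding, otherwise forward it from the reorder buffer as soon as
 * the functional unit has finished - no need to wait for commit. */
static bool read_operand(int r, uint64_t *value)
{
    int p = producer[r];
    if (p < 0)           { *value = regfile[r];    return true; }
    if (rob[p].finished) { *value = rob[p].result; return true; }
    return false;                      /* operand not ready: stall this one */
}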

Fixed the update of the branch target buffer. It was updating the target for the fetch stage ip when it should have been updating the ip associated with the new target.

Removed the register fetch stage from the core. Register fetch is now done by the decode stage. There are five stages in the core.
Instruction Fetch
Instruction Align
Decode / Register Fetch
Execute
Writeback

Most of the integer instruction set is coded. The size field is not respected and all operations are 64-bit.

Some simulation has been started. The program simply increments a register. There are issues with results forwarding: it is counting 1, 2, 2, 3, 2, 3, 2, …

_________________
Robert Finch http://www.finitron.ca


Fri May 21, 2021 3:49 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided to drop most of the elastic pipeline and just use the reorder buffer instead. There were issues with entries being dropped, and entries were really being manipulated in two places at the same time: the reorder buffer and the fifos.

_________________
Robert Finch http://www.finitron.ca


Sat May 22, 2021 4:47 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Thinking of the CDC6600 today.

The ISA is being modified to use tags, with instructions and data shortened to 60 bits instead of 64 so that the instruction or data plus tag fits into 64 bits. Modifying the ISA was relatively easy; there were already three unused bits in most instructions. Since the U2 field of the instructions was redundant, as the tag can supply this information, it was removed from the instruction. This means most instructions fit into 59 bits with no loss of functionality.
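Packed into a 64-bit word it looks something like this (C sketch; the exact tag width and position are my assumption, just for illustration):

Code:
#include <stdint.h>

/* A 60-bit datum plus a 4-bit tag packed into one 64-bit word. */
typedef struct {
    uint64_t value;                  /* low 60 bits used */
    unsigned tag;                    /* 4-bit type tag   */
} tagged_t;

static uint64_t pack(tagged_t t)
{
    return ((uint64_t)(t.tag & 0xF) << 60) | (t.value & 0x0FFFFFFFFFFFFFFFULL);
}

static tagged_t unpack(uint64_t w)
{
    tagged_t t = { w & 0x0FFFFFFFFFFFFFFFULL, (unsigned)(w >> 60) };
    return t;
}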

A tag applied to an instruction helps identify the operation required. Adding pointers is slightly different from adding integers, for instance.

Using tags means that there must be a case statement processed during the execution of every instruction.

The tags identify several primitive data types. There are no tag values for descriptors as this is not a descriptor machine. However, several tag values are reserved for future use.

When the following values are written to a fifo: 6,7,16,17 they are read back as: 6,16,17. The 7 is missing. This is causing the pipeline to get screwed up. The exact issue has not been determined yet.
[Attachment: Slide1.PNG — Branches in Reorder buffer]

_________________
Robert Finch http://www.finitron.ca


Mon May 24, 2021 4:54 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Wrote the first memory state as a task so that it could be invoked at the end of a memory cycle. This trimmed one clock cycle off every memory access. This first state is simply a busy-wait loop waiting for an instruction to dequeue.

Added memory key checking along with a key cache. Every page of memory has associated with it a protection key. Whenever memory is accessed the key for the page accessed is looked up from a cache of keys. Keys are 20 bits and stored in the low order bits of a 32-bit memory cell.
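The lookup and check amount to something like this (C sketch; the page size, key-cache organization and the exact-match rule are assumptions just for illustration):

Code:
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT        13      /* page size assumed for the sketch */
#define KEY_CACHE_ENTRIES 64
#define NPAGES            256     /* toy backing store */

typedef struct { bool valid; uint64_t page; uint32_t key; } keyc_entry_t;

static keyc_entry_t key_cache[KEY_CACHE_ENTRIES];
static uint32_t     key_table[NPAGES];  /* 20-bit key in the low bits of a
                                            32-bit cell, one per page */

/* Look up the key for the page containing vaddr, going to the key table
 * in memory only on a key-cache miss, then compare it with the key of
 * the running thread (simplified here to an exact match). */
static bool key_check(uint64_t vaddr, uint32_t thread_key)
{
    uint64_t page = (vaddr >> PAGE_SHIFT) % NPAGES;
    keyc_entry_t *e = &key_cache[page % KEY_CACHE_ENTRIES];
    if (!e->valid || e->page != page) {       /* key cache miss: refill */
        e->key   = key_table[page] & 0xFFFFF; /* keep the low 20 bits   */
        e->page  = page;
        e->valid = true;
    }
    return e->key == thread_key;
}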

Decided to set aside the tags for data, in such a way that they could be renewed later. They were just too retro. However, instructions are still tagged so the instruction knows which data type to process; this amounts to an additional field in the instruction.

Added support for the CHK instruction.

_________________
Robert Finch http://www.finitron.ca


Tue May 25, 2021 3:37 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Moved the assignment of reorder buffer entries from the instruction fetch stage to the decode stage, in preparation for supporting vector operations. Vector ops will be handled by queuing the decode output multiple times with a step number indicator. Since multiple reorder buffer entries are needed at decode queue time, the allocation was moved out of the fetch stage, which could allocate only a single entry because the type of instruction was not known until decode time.

There are simulation issues with the core now. Simulation quits with an error about too many iterations being processed and says to check the code. As far as I can tell the code should not be the cause. I traced things down to a specific line; removing the line causes the sim to work, but no useful output is produced. It is the cache line valid signal. If cache lines are not signalled as valid then no instructions will execute, as the core will be in a state of continuous miss. So, I tried setting the cache lines valid all the time while removing the line causing the issue, and voila: the issue remains, it just moved to a different line. I am now thinking it is some sort of simulator issue and not the code, which previously was simulating.

The execute portion of the core has been broken out into a separate module. There were quite a few signals to manage but the use of struct types helped.

Added the branch on bit set instruction. Works the same way as the other branches.

Repurposed the Sz4 field in the branch instructions as displacement bits. Branch displacement is now 23 bits, or ±4MB. My original thinking for the size field was that it could be used to allow comparisons of sub-word types like bytes, but this is probably not a common case. It is easy enough to zero-extend bytes in registers and then do a full-word compare.

Toying with the idea of modifying the ISA again to reduce the size of instructions. I like the instruction set, just not the size and ensuing entropy cost.

_________________
Robert Finch http://www.finitron.ca


Wed May 26, 2021 4:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Work has started on version two. Version two will have a much more compact instruction encoding. When all the options for an instruction cannot be specified in the instruction itself, an instruction prefix or modifier will be present to provide the additional information. Version two will use 32-bit instructions as a base; the majority of operations executed by the processor can fit into 32 bits. The first version of the core was great at supporting everything using a minimal number of instruction formats, but that meant many instructions had a lot of unused fields in them, and code density suffered.

The core has been adapted for ISA version two. It is good that it was not too far along before the major ISA changes.

Stuck on a simulator bug. Even after re-writing major portions of the core, the same sim message appears.

_________________
Robert Finch http://www.finitron.ca


Thu May 27, 2021 4:45 am