 Black Widow Core 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided to shelve Thor for a while to work on yet another project.

Embarked on a journey to develop yet another processing core, this time called the Black Widow, or BW for short. BW is a VLIW core using fixed 40-bit instructions with an 80-bit data path. Each instruction is conditionally executed based on a predicate. Up to three instructions may be executed in a clock cycle. It is a 64-register design with a unified register file.

There is, however, only a single memory op allowed per clock, and stores cannot be directly followed by loads, because stores take place in the writeback stage and loads take place in the execute stage. There would be a resource conflict between loads and stores if loads were allowed directly after stores. Performing a load is a two-step process. First the load is queued in the memory queue; this requires only a single clock cycle. Next a load check instruction, LDCHK, is executed to test whether the bus interface unit, BIU, has completed the load. If it has not, the whole machine stalls until the load is completed. Using a two-step load it is possible to issue the load and then continue with other instructions before performing the load check, so the load latency may be hidden. Stores do not require this subterfuge.
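
To make the latency hiding concrete, here is a minimal Python sketch of the two-step load. It is only a timing model under stated assumptions: the load latency is treated as a fixed number of cycles, each filler instruction is assumed to take one cycle, and the names `ldchk_stall` and `LATENCY` are made up for illustration.

```python
def ldchk_stall(issue_cycle: int, check_cycle: int, load_latency: int) -> int:
    """Stall cycles seen at LDCHK.

    issue_cycle:  cycle in which the load was queued
    check_cycle:  cycle in which LDCHK executes
    load_latency: cycles the BIU needs to complete the load (assumed fixed)
    """
    ready_cycle = issue_cycle + load_latency
    # The machine only stalls if LDCHK arrives before the data does.
    return max(0, ready_cycle - check_cycle)


LATENCY = 5  # assumed value, for illustration only

for filler in range(6):
    # LDCHK executes one cycle after the load, plus one cycle per filler instruction.
    stall = ldchk_stall(issue_cycle=0, check_cycle=1 + filler, load_latency=LATENCY)
    print(f"{filler} filler cycles -> {stall} stall cycles")
```

In this model every cycle of independent work placed between the load and the LDCHK removes one stall cycle, until the stall reaches zero.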

The core makes use of breaks between instructions to improve code density. When there is a break after an instruction, the following instruction will not execute until the next clock cycle. This is used to resolve dependencies.

Instruction fetch advances by five bytes for every instruction processed during the clock cycle. Branches may be performed from any slot and may target any slot. The instruction register is a sliding window into the instruction stream. It shifts by 40 bits for every instruction processed. Branches are always relative with the exception of the jump to register instruction.

The core makes use of constant *suffixes* to extend constant values. Constant values up to 80 bits may be formed. Having the constant follow the instruction is a lot easier to process than using a prefix. The instruction with suffixes is treated as one large instruction. The suffixes decode at the same time as the instruction, overriding the sign extension of values. It also means interrupts do not need to be locked out. Constant suffixes are treated as NOP operations since the predicate for a constant suffix is always zero.
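
A rough Python sketch of how suffixes could extend a constant is given below. The field widths are not stated in the post, so `BASE_IMM_BITS` and `SUFFIX_BITS` are purely hypothetical values picked so that one base immediate plus three suffixes adds up to 80 bits. The idea shown is only the general one: without a suffix the base immediate is sign extended, and each suffix supplies the bits that would otherwise have been sign extension.

```python
from typing import Sequence

BASE_IMM_BITS = 14   # hypothetical width of the immediate in the base instruction
SUFFIX_BITS = 22     # hypothetical payload width of one constant suffix
DATA_BITS = 80       # data path width from the post


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def build_constant(base_imm: int, suffixes: Sequence[int]) -> int:
    """Combine a base immediate with zero or more suffix payloads."""
    value = base_imm & ((1 << BASE_IMM_BITS) - 1)
    width = BASE_IMM_BITS
    for payload in suffixes:
        # Each suffix is placed above the bits gathered so far.
        value |= (payload & ((1 << SUFFIX_BITS) - 1)) << width
        width += SUFFIX_BITS
    # Sign extension now starts from the top of the last field supplied.
    return sign_extend(value, min(width, DATA_BITS))


print(build_constant(0x3FFF, []))    # base immediate alone, sign extended
print(build_constant(0x3FFF, [0]))   # one suffix overrides the sign extension
```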

_________________
Robert Finch http://www.finitron.ca


Sat Apr 23, 2022 3:34 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
So... these are not triplets of 40 bit instructions, in a 120 bit word... should I think of the machine as fetching 120 bits in a cycle, or just 40 bits?

Also, the idea of a sliding window... are instructions always going to complete in order, or might the middle instruction of three complete first? Or am I mistaken to be thinking about threes again? (I suppose I'm imagining a three-wide decode.)


Sat Apr 23, 2022 6:26 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
So... these are not triplets of 40 bit instructions, in a 120 bit word... should I think of the machine as fetching 120 bits in a cycle, or just 40 bits?

Also, the idea of a sliding window... are instructions always going to complete in order, or might the middle instruction of three complete first? Or am I mistaken to be thinking about threes again? (I suppose I'm imagining a three-wide decode.)

This is not an out-of-order machine, so instructions effectively complete in order. Yes, this is a three-wide decode. There can be up to three instructions that complete in the same clock cycle. Two 512-bit cache lines are fetched at the same time for a 1024-bit window. The low order instruction pointer, IP, bits are used to select a range of 240 bits from the cache line. 240 bits allows the last instruction fetched to use three constant suffixes. So, including constant suffixes, six instructions are fetched at once. But much of the time there are no constant suffixes, so only 120 bits are processed.
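
The window selection can be sketched in Python roughly as follows. This is an illustration, not the actual hardware: it assumes IP is a byte address, that memory is little-endian, and that a flat `bytes` object stands in for the cache. The names `fetch_window` and `split_slots` are invented for the example.

```python
LINE_BITS = 512
LINE_BYTES = LINE_BITS // 8   # 64 bytes per cache line
WINDOW_BITS = 240             # range selected from the two lines
INSTR_BITS = 40


def fetch_window(memory: bytes, ip: int) -> int:
    """Return the 240 bits starting at byte address ip."""
    line_base = ip & ~(LINE_BYTES - 1)
    # Two adjacent lines form the 1024-bit window.
    two_lines = memory[line_base:line_base + 2 * LINE_BYTES]
    wide = int.from_bytes(two_lines, "little")
    # The low-order IP bits select where the 240-bit range starts.
    offset_bits = (ip & (LINE_BYTES - 1)) * 8
    return (wide >> offset_bits) & ((1 << WINDOW_BITS) - 1)


def split_slots(window: int) -> list:
    """Split the 240-bit range into six 40-bit slots, lowest first."""
    mask = (1 << INSTR_BITS) - 1
    return [(window >> (i * INSTR_BITS)) & mask
            for i in range(WINDOW_BITS // INSTR_BITS)]


memory = bytes(range(256))
slots = split_slots(fetch_window(memory, 60))  # a fetch that crosses a line boundary
print([hex(s) for s in slots])
```

The largest offset within a line is 63 bytes and the range is 30 bytes long, so the range always ends inside the second line.
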
There can be breaks between the instructions, which force the following instructions to be processed in the next clock cycle. When there is a break after the first instruction, the second instruction becomes the first instruction of the next cycle and another third instruction is fetched. Instructions shift position. Instructions are not in fixed bundles. The breaks are needed so that the previous instruction's results can be forwarded to the next instruction. There also needs to be a break between a memory store and a following load. Memory ops always need a break before them to force them to be processed in the first slot.
I may be changing the machine width to 128-bits after some discussions about decimal floating-point on the comp.arch newsgroup.

I have the design far enough along now that I need to write an assembler for software.

_________________
Robert Finch http://www.finitron.ca


Sun Apr 24, 2022 5:06 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Thanks for the extra detail!

It's a minor terminological preference, but I think I wouldn't call this VLIW, as the wide aspect isn't reading fixed bundles. I think that had me a bit confused...

... so I think I'd call this some kind of three-wide machine: it can decode three instructions and complete three instructions, if that's possible. Or it can make a bit less progress, if that's what has to happen.

Could you perhaps say a bit more about "breaks between instructions" - is that a feature in the instruction stream, or a consequence of the limited execution hardware, or something else?


Sun Apr 24, 2022 7:54 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Can you have 3 instructions on the go at once? A loop like while(*a++=*b++); is how many cycles after you wait for
the memory cache to flush and fill? Could a simulation of the cache and pipeline help decide just where the wait states
are? Ben.


Mon Apr 25, 2022 1:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
There is a “break” indicator bit as the most significant bit of an instruction. So, it is part of the instruction stream. When it is set the following instructions will not execute until the next clock cycle. This is so that results from the prior instructions can be used for the following ones. If no results from prior instructions are needed then there does not need to be a break and instructions can execute in parallel.
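
As a rough illustration, the break bit could be decoded and used to group instructions into clock cycles like this. It is a simplified Python sketch: bit 39 is taken as the most significant bit of a 40-bit instruction, and the names `has_break` and `group_by_cycle` are made up. It ignores constant suffixes and the rule that memory ops go in the first slot.

```python
BREAK_BIT = 39   # most significant bit of a 40-bit instruction
MAX_ISSUE = 3    # up to three instructions per clock cycle


def has_break(instr: int) -> bool:
    """True if the instructions after this one must wait for the next clock."""
    return ((instr >> BREAK_BIT) & 1) == 1


def group_by_cycle(instrs):
    """Split an instruction stream into per-cycle groups."""
    groups, current = [], []
    for instr in instrs:
        current.append(instr)
        # A group ends at a break, or when all three slots are used.
        if has_break(instr) or len(current) == MAX_ISSUE:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


B = 1 << BREAK_BIT
# Six dummy encodings; the first and fifth carry a break.
stream = [1 | B, 2, 3, 4, 5 | B, 6]
print([len(g) for g in group_by_cycle(stream)])
```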

Breaks are similar to the break positions specified by the template bits of an instruction bundle for the Itanium. In the case of the BW there is no template so breaks are specified with the instruction. Software is relied on to insert breaks as needed.

Yes, VLIW is used incorrectly here in my opinion, thanks for pointing that out. I started out thinking VLIW but it ended up different. I will change that in the docs. A VLIW ISA typically has a mix of different types of instructions all bundled together as one unit. It can save on instruction decoding and reduce the size by having the slot position act as a decode indicator. The BGB poster on comp.arch is calling part of his ISA WEX instructions, for wide-execute. So maybe we can use a newly coined term: a WEX machine. It is really an in-order superscalar machine.
Quote:
while(*a++=*b++);

It should be possible to code this something like:
Code:
again:
  LDB t0,[a0];   # store and load cannot be done at the same time
  STB t0,[a1]
  ADD a0,a0,1
  ADD a1,a1,1
  CMP.EQ p2,p3,t0,0;   # predicate needs time to update before branch
p2: BRA again

The semi-colons after the instructions indicate where breaks are needed and it makes the machine execute sequentially. Where there are no semi-colons the instructions execute in parallel. In the above code the store, two adds and a compare could all execute in parallel. However, the machine can handle only three at once.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 25, 2022 10:00 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Ah, so it's a machine without interlocks, the instruction stream has to know what it can do, to some extent?


Mon Apr 25, 2022 5:26 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
With a short loop like that, could one not buffer N finished instructions from the pipeline and feed them
back to the input again, rather than deal with branching instructions?
In real life how many stalled cycles can the read from memory take?
Ben.


Mon Apr 25, 2022 6:47 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Ah, so it's a machine without interlocks, the instruction stream has to know what it can do, to some extent?
Yes, the instruction stream needs to indicate when instructions can execute.

Quote:
With a short loop like that, could one not buffer N finished instructions from the pipeline and feed them
back to the input again, rather than deal with branching instructions?
In real life how many stalled cycles can the read from memory take?
Ben.

Were you thinking of a pipeline loop mode? The branch instruction still needs to be present to indicate where the loop is. The loop could be unrolled a couple of times and predicates used to suppress unnecessary operations.
Un-cached memory operations could take about 15 or more clock cycles. Cached access is probably four to five clocks. A couple of clock cycles are used to convert from virtual to physical addresses. Memory access will slow the machine down quite a bit. Stores look like single-cycle operations because once the store is queued in the memory queue it is finished; the store takes place in the background. I have to think about this some more. What if there is an issue with the store?
I got the hypothetical code wrong. I forgot the load check instruction. An improved version is as follows:
Code:
again:
  LDB t0,[a0]
  ADD a0,a0,1   # increment while waiting for load
  LDCHK t0
  CMP.EQ p2,p3,t0,0; # force store to first slot
  STB t0,[a1]
  ADD a1,a1,1
p2: BRA again


Switching the core to 128-bit instead of 80-bit increased the size considerably more than 2x. 17,000 to 46,000 LUTs. But it should still fit in the FPGA.
The assembler is coming along. A modified version of the vasm PowerPC assembler is being used. The Black Widow has a lot fewer instructions at this stage.

_________________
Robert Finch http://www.finitron.ca


Tue Apr 26, 2022 4:02 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Found out there is not an easy way to modify the assembler to support predicates. It would be a lot of work to modify. So, since predicates are unnecessary with good branch prediction, I chose to drop them from the architecture. This gave more bits for other fields like the immediate constant. It also made the set instructions redundant with the compare instructions.

The assembler works well enough now to generate some test code.

A first simulation run reveals: nothing works.

_________________
Robert Finch http://www.finitron.ca


Wed Apr 27, 2022 3:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
This is the first time I implemented an even/odd line split instruction cache so there were some snags. Previously I replicated part of the next line on the current one to allow instructions to cross cache lines. But this would be a bit wasteful to do for a machine that can fetch up to five instructions at once. It would require almost doubling the size of the cache. So, it is more memory efficient to use even/odd lines. Splitting the cache output into even and odd lines allows instructions to cross over the end of a cache line without needing to replicate data. But it does require doubling up on the tags and valid bits.
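
A small Python sketch of the even/odd lookup follows. It is only a behavioural model with assumed details: 64-byte lines (the 512-bit line size mentioned earlier in the thread), lists standing in for the two banks, and no tag or valid checks. The names `bank_requests` and `fetch_two_lines` are invented.

```python
LINE_BYTES = 64  # assumed 512-bit cache line


def bank_requests(ip: int):
    """Return (even_index, odd_index, first_is_even) for a fetch at ip."""
    line = ip // LINE_BYTES
    if line % 2 == 0:
        # Current line is in the even bank, the next line is in the odd bank.
        return line // 2, line // 2, True
    # Current line is in the odd bank, the next line is the following even line.
    return (line + 1) // 2, line // 2, False


def fetch_two_lines(even_bank, odd_bank, ip: int) -> bytes:
    """Read both banks and return the two lines in address order."""
    even_idx, odd_idx, first_is_even = bank_requests(ip)
    even_line, odd_line = even_bank[even_idx], odd_bank[odd_idx]
    return even_line + odd_line if first_is_even else odd_line + even_line


lines = [bytes([i]) * LINE_BYTES for i in range(8)]
even_bank, odd_bank = lines[0::2], lines[1::2]
pair = fetch_two_lines(even_bank, odd_bank, 3 * LINE_BYTES + 5)
print(pair[0], pair[LINE_BYTES])  # line numbers of the two lines fetched
```

Both banks are read on every fetch with different indexes, which is why each bank needs its own tags and valid bits.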

Fixes, already: there was not enough forward bypassing for each lane. Bypassing needed to include results from other lanes.
The assembler needed to be updated to account for the removal of predicates. Some of the opcode fields shifted around.

Simulation currently hangs in a loop at about 5.8 us.

_________________
Robert Finch http://www.finitron.ca


Fri Apr 29, 2022 2:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Fixed the simulation hang at 5.8 us. Some more clocking of flip-flops was required to eliminate timing loops.

The startup branch instruction works now. Several instructions execute but constants are not coming through properly. More debugging required.

_________________
Robert Finch http://www.finitron.ca


Sat Apr 30, 2022 3:48 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Always good to hear about progress Rob - how many cycles of activity is that, and how many minutes of simulation does it take to get that far? (I hope it's minutes and not hours!)


Sat Apr 30, 2022 6:29 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Always good to hear about progress Rob - how many cycles of activity is that, and how many minutes of simulation does it take to get that far? (I hope it's minutes and not hours!)
Sim gets to 5.8 us almost instantly, so debugging is fast right now. I run sim out to about 10 us now, or about 100 clock cycles, and it only takes a couple of seconds. Or did you mean cycles of activity as in simulation trials? Maybe 10 to 20. IDK. As the core works better and better, sim takes longer and longer. That is partly why I shelved Thor for now: hour-long build cycles and 20 minutes waiting just for functional sim to start up, then another 10 mins or so waiting for sim to reach a fail point.

_________________
Robert Finch http://www.finitron.ca


Sun May 01, 2022 4:57 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Thanks Rob - yes, the turnaround time for you was what I was wondering about. It can easily become the limiting factor, as you note. It's not just the stretching out of each iteration, it's the loss of momentum as it's hard to keep mental context and momentum beyond a certain point.


Sun May 01, 2022 7:14 am