 RTF64 processor 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
What do you use for schematic drawing? It is in readable B&W.

I used KiCad. There's an option to plot to a .pdf file in monochrome.

_________________
Robert Finch http://www.finitron.ca


Wed Oct 28, 2020 12:20 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
I have some doubt that the PCB layout can be done well enough to obtain the kind of performance I am after. So, I spent some time mulling over the idea of using a lower-speed link. I will have to very carefully select the signals on the FPGA for routing to the buffers. But first I need to be sure about what I am doing.

Transmitting a packet clock using an entire lane uses 25% of the bandwidth just for the clock signal. I am considering ways to reduce that, including using more data lanes per clock. If seven data lanes were used with a single clock lane then the overhead would be cut to 12.5%. Another way to reduce the clock bandwidth requirement is to embed the clock in the data. The clock could be transmitted as a burst in the data stream. If the burst window is known it could be fed to a PLL to regenerate the entire clock. It may require an external PLL to do this. I’ve posted yet another question pertaining to the use of FPGA clock resources to create a clock from a burst signal. I would like to get the clock’s bandwidth usage down to low single digits of a percentage.
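The overhead figures above are just the ratio of clock lanes to total lanes; a quick sketch of that arithmetic:

```python
# Fraction of link bandwidth spent on a dedicated clock lane:
# clock_lanes / (data_lanes + clock_lanes).
def clock_overhead(data_lanes: int, clock_lanes: int = 1) -> float:
    return clock_lanes / (data_lanes + clock_lanes)

# 3 data lanes + 1 clock lane: a quarter of the bandwidth is clock
assert clock_overhead(3) == 0.25
# 7 data lanes + 1 clock lane: overhead drops to 12.5%
assert clock_overhead(7) == 0.125
```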

I came up with a second version of the xbus which has the extra signals for a video display built back in. When in debug mode, the xbus generates signals looking a lot like a video signal. It should be possible then to connect up a monitor and dump the xbus signalling directly to it.

Having a lot of fun with this.

_________________
Robert Finch http://www.finitron.ca


Thu Oct 29, 2020 3:39 am WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 203
Location: Huntsville, AL
Rob:

Been stopping in to see your progress. Always interesting to see the various design excursions that you try.

Seeing that you were considering using a PLL to extract the clock from an encoded (clock+data) lane, why don't you consider using an 8b/10b, or a more efficient variant, encoding for that job? Somewhere above you mentioned using a 12b/14b scheme. If I recall correctly, the 8b/10b scheme is a built-in primitive available on many of the SelectIO pins that have built-in serializers / deserializers. Using those SelectIO capabilities may cost a bit in bandwidth, but may get you to the promised land faster. Depending on the family, the SelectIO serializers / deserializers are capable of shifting at rates in the 1-2 Gbps range.

_________________
Michael A.


Thu Oct 29, 2020 12:48 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
If I recall correctly, the 8b/10b scheme is a built-in primitive available on many of the SelectIO pins that have built-in serializers / deserializers.
I am using the built-in serializers / deserializers. I would be trying to use the Gigabit transceivers, but the FPGA board I am using does not have them routed, probably because the signal quality going through connector pins wouldn't be good enough. So I am stuck with using the TMDS_33 I/O standard and DDR (most of the I/Os are wired for 3.3V). The I/O serializer supports up to 14 bits for DDR, so they are all made useful. 8b/10b can use a faster packet clock because the multiple is only five times instead of the seven times for a 12b/14b encoding. But a slower packet clock (57MHz) is also desirable as it is closer to what the cpu will be able to operate at. If the cpu could use the packet clock, that would be good. The packets have been designed around a 36-bit format (3 lanes of 12 bits). Using a 24-bit encoding would mean sending more packets. 36 bits works fairly well for transferring 32 bits of address or data at one time, and the format is simple. I have come up with other formats (48-bit, 84-bit, and 96-bit packets), but they aren't quite as efficient. The 8b/10b format is extremely popular and well known, so it is very tempting to use. There may be more debugging tools for 8b/10b.
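The packet-clock trade-off is simple division; a sketch assuming the 800 MHz per-lane bit rate mentioned later in the post:

```python
# Packet clock = per-lane bit rate / coded bits serialized per packet.
def packet_clock_mhz(bit_rate_mbps: float, coded_bits: int) -> float:
    return bit_rate_mbps / coded_bits

# 12b/14b at 800 Mbps: ~57 MHz packet clock (the slower, cpu-friendly rate)
assert round(packet_clock_mhz(800, 14), 1) == 57.1
# 8b/10b at the same bit rate: an 80 MHz packet clock
assert packet_clock_mhz(800, 10) == 80.0

# Payload efficiency of the two line codes
assert round(12 / 14, 3) == 0.857
assert 8 / 10 == 0.8
```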

The concept of keeping the DC balanced can be applied to a 12b/14b format just as for an 8b/10b. I have thought of encoding things using four 3b/4b encoders, but the serializers will not work with 16 bits.
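The DC-balance idea works the same way at any width: keep a running disparity (ones minus zeros sent so far) and, when a code word would push it further from zero, send a complemented alternate instead. A toy sketch of that principle (not the real 8b/10b or 3b/4b code tables):

```python
# Toy DC-balance encoder: track running disparity (rd) and complement a
# word whenever its own imbalance has the same sign as rd.
def encode_balanced(word: int, width: int, rd: int):
    ones = bin(word).count("1")
    d = 2 * ones - width              # this word's ones-minus-zeros count
    if d != 0 and (d > 0) == (rd > 0):
        word ^= (1 << width) - 1      # transmit the complement instead
        d = -d
    return word, rd + d

# A worst-case stream of unbalanced words stays DC balanced on the wire
rd = 0
for _ in range(16):
    _, rd = encode_balanced(0b1110, 4, rd)
    assert abs(rd) <= 2               # disparity never runs away
```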

Since this is for mother-board connections (poor man's PCI) and not telecom, I've toyed with some other ideas. Modern LVCTTL buffers can work upwards of 200MHz, which is about 1/4 the 800MHz bit rate possible with the current design. However, there could be twice as many pins available to transfer data if differential signals are not used. It may end up that the system runs at a much lower rate than planned.

Wrote a posit to integer module early in the morning, then had to fix the rounding for it. Ran some simulations of the cpu and fixed several pipeline bugs. The simulator keeps crashing with out-of-memory errors.

_________________
Robert Finch http://www.finitron.ca


Sat Oct 31, 2020 3:54 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
It occurs to me that the input buffers for the xbus are probably not required. The receive signals can go directly to the FPGA pins with a pullup resistor. As long as output buffers are driving the bus, things should work. This is good because the PCB layout is much simpler. Revising schematics now.

There is going to be just a single outgoing channel from the cpu board and three incoming channels. Things are going to be wired in a point-to-point fashion. A channel is a clock plus three data lanes. The bus can support up to eight channels, but I am going to set it up for only four. The high-speed serial channels do not require as many bus signals as a parallel bus. So, using the PC-AT bus is actually overkill.

Synthesis reveals that the posit divider is on the critical timing path. Not surprising as it does a cascade of NR iterations using large multipliers and adders. I had added some FF’s after the divider instance thinking that the tools would be able to retime things. Then I found out the global retiming flag must be set for synthesis, as that is the only way things get retimed by more than one level of FF’s (at least until I update the toolset). So, synthesis was run again, and nothing got retimed, although the log reveals that it does try to retime things. So, my next thought was that maybe the synthesizer could not retime across module boundaries, and maybe all the retiming FF’s need to be in the module where retiming is desired. So, synthesis was run again. The log does not indicate that the divider module got retimed. Time to adjust the log reporting depth.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 01, 2020 3:12 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Having moved the posit / fp multiplier / divider off the critical path, the next thing in line was the bitfield operations. I had coded them quickly without much thought to performance as other things would have a greater impact on timing. They are now fully pipelined. I am tempted to reduce the full pipelining as I suspect the ops are now over-pipelined, consuming too many clock cycles. It takes about seven clocks for a bitfield operation. Not great, but still two to three times faster than executing a series of instructions for an extract or insert. However, as bitfield ops are rarely used I am not going to spend much time on them. The integer multiplier is next on the critical path now; 2.5ns need to be shaved off. Then back to the integer remainder function. And on to the popcnt function.
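For reference, the shift-and-mask sequence that a bitfield extract or insert instruction collapses into a single op looks like this (a generic sketch, not RTF64 code):

```python
# Bitfield extract: pull `width` bits starting at bit `pos` out of a word.
def bf_extract(value: int, pos: int, width: int) -> int:
    return (value >> pos) & ((1 << width) - 1)

# Bitfield insert: drop `field` into that position, leaving other bits alone.
def bf_insert(value: int, field: int, pos: int, width: int) -> int:
    mask = ((1 << width) - 1) << pos
    return (value & ~mask) | ((field << pos) & mask)

assert bf_extract(0b10110100, 2, 3) == 0b101
assert bf_insert(0, 0b111, 4, 3) == 0b1110000
```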

While waiting for synthesis and implementation, a posit to quire format converter was written. It is interesting because it contains thousands of bits. The quire represents numbers as fixed point and allows converting to and from the posit format. A quire for 64-bit posits has over 2,000 bits. It is a bit large for an FPGA implementation. As written it requires a 2,000+ bit shift register that can shift left or right up to eight bits at a time. To convert a posit to a quire, the significand of the posit is centered in the quire, which has the point in the center. So, for values whose magnitude is close to one, little shifting is required. However, if the number is very large or very small it could take a lot of clock cycles to perform the shifting (e.g. 250).
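A sketch of that placement step, at toy sizes rather than the 2,000+ bit posit-64 quire (the real quire width and 8-bit shift rate are the only numbers taken from the post):

```python
# Toy quire: a wide fixed-point register with the binary point in the middle.
QUIRE_BITS = 64               # toy width; the real posit-64 quire is 2048+ bits
POINT = QUIRE_BITS // 2       # binary point sits at the centre

def to_quire(significand: int, exponent: int) -> int:
    # value = significand * 2**exponent, stored scaled by 2**POINT
    return significand << (POINT + exponent)

# magnitude near one (exponent 0): the significand lands at the point
assert to_quire(1, 0) == 1 << POINT
# an extreme exponent shifts the significand far from the centre...
assert to_quire(1, -20) == 1 << (POINT - 20)
# ...which is why a 2048-bit quire shifted 8 bits/clock can need ~256 cycles
assert 2048 // 8 == 256
```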

Adding lots of pipeline registers. Of course, if too many pipeline registers are added it kind of defeats the purpose of having a higher frequency clock. I figure that about three pipeline clocks can be gotten away with in the execute stage without impacting performance, because other stages take just as many clock cycles to execute.

Timing for 57MHz operation is being missed by only 300ps now. Next target will be 80MHz operation.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 03, 2020 3:58 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
Yikes, is there really a need for 2000-bit-wide normalisation? The usual floating point seems to get by with 80 - of course, it has its limitations. If you add a really small posit to a really large one, presumably you must get back the large one?


Tue Nov 03, 2020 9:33 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
Yikes is there really a need for 2000 bit wide normalisation? The usual floating point seems to get by with 80 - of course, it has its limitations. If you add a really small posit to a really large one, presumably you must get back the large one?
It is just the quire part of the posit standard. The quire is a honkin' big fixed-point accumulator which is not used for normal posit operations. Other than the quire, posits are small. Normalization and other operations are similar to floats.
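The point of the quire is exact accumulation; a small illustration of why that matters, using Python's `Fraction` as a stand-in for the fixed-point accumulator (an assumption for demonstration only):

```python
from fractions import Fraction

# Float addition drops a tiny addend next to a huge one...
big, tiny = 2.0 ** 60, 2.0 ** -60
assert big + tiny == big                     # the tiny value vanishes

# ...while a quire-like exact fixed-point accumulator keeps it
acc = Fraction(big) + Fraction(tiny)
assert acc - Fraction(big) == Fraction(tiny) # nothing lost
```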

The RTF64 cpu supports 64-bit posits (posits have fixed sizes of 8, 16, 32, or 64 bits) as that is about the largest size one might want. All posit operations in the cpu are performed at 64 bits and then converted to or from lower-precision posits.

I have been working on the floating-point multiplier, using Karatsuba's approach and building up larger multipliers out of smaller ones. I would like to support 128-bit floats, so a 128-bit multiplier was needed. Test benches were written for fp multiply and divide. I cannot seem to get the numbers to match the results generated by simulation exactly in all cases. Sometimes the LSB is 1 unit too low or 1 unit too high. I assume there is some difference in the rounding. I found out that normalization was not generating a sticky bit as it should. That is fixed now, but the numbers still don't match exactly.

Made the instruction cache 4-way associative while waiting for system builds.

_________________
Robert Finch http://www.finitron.ca


Wed Nov 04, 2020 5:01 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Did some more pipelining of the floating-point unit and made an option to disable support for denormals. This did not affect much code. When denormals are not supported and underflow is found in the multiplier or divider, the mantissa is forced to zero. Since underflow is already detected, it was just a single-bit combinational change. The biggest change is in the normalizer, where if denormals are not supported a leading-zero counter can be simplified.

Yo-yoing with the push / pop / link and unlink instructions. They have now been removed from the ISA as they were causing issues with the pipeline forwarding of results. The simplest solution was to just remove the instructions and make the core slightly smaller. They are not critical instructions to have.

The next big change for the core will be to get some form of branch prediction going. Without branch prediction the core loads the pipeline with instructions from the wrong path, and for small tight loops that may cut performance in half.

Decided to try a perceptron-based branch predictor as it should be a little more accurate. Although the predictor takes a couple of clock cycles to calculate a prediction, in this case the clock cycles are available.
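For anyone following along, a minimal software model of a perceptron predictor: one signed weight per global-history bit, predict taken when the dot product is non-negative, and train on a misprediction or a weak output. The history length and training threshold here are illustrative assumptions, not RTF64's parameters:

```python
HIST = 8                     # assumed global history length
THETA = 2 * HIST             # assumed training threshold

class PerceptronPredictor:
    def __init__(self):
        self.w = [0] * (HIST + 1)    # w[0] is the bias weight
        self.hist = [1] * HIST       # +1 = taken, -1 = not taken

    def output(self) -> int:
        return self.w[0] + sum(w * h for w, h in zip(self.w[1:], self.hist))

    def predict(self) -> bool:
        return self.output() >= 0

    def update(self, taken: bool) -> None:
        t = 1 if taken else -1
        y = self.output()
        if (y >= 0) != taken or abs(y) <= THETA:   # mispredicted or weak
            self.w[0] += t
            for i, h in enumerate(self.hist):
                self.w[i + 1] += t * h
        self.hist = self.hist[1:] + [t]            # shift in the outcome

# it quickly learns a history-correlated pattern such as taken/not-taken
p = PerceptronPredictor()
hits = 0
for i in range(200):
    taken = i % 2 == 0
    if i >= 100:
        hits += p.predict() == taken
    p.update(taken)
assert hits >= 90
```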

Made a classic noob mistake. A pipeline bubble needs to be inserted when a memory op’s target register is used as a source by the next instruction. The result cannot be forwarded until after it has been fetched from memory. Values were not propagating properly without this bubble.
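The hazard check itself is small; a sketch of the rule just described (field names are hypothetical):

```python
# Load-use hazard: a bubble is needed when a memory op's target register is
# a source of the immediately following instruction, because the loaded
# value cannot be forwarded until it returns from memory.
def needs_bubble(prev: dict, curr: dict) -> bool:
    return prev["op"] == "load" and prev["rd"] in curr["rs"]

assert needs_bubble({"op": "load", "rd": 5}, {"op": "add", "rs": [5, 6]})
assert not needs_bubble({"op": "add", "rd": 5}, {"op": "add", "rs": [5, 6]})
```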

_________________
Robert Finch http://www.finitron.ca


Fri Nov 06, 2020 7:00 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 255
For short loops (< 8 instructions), could you not extend the decoded pipeline n cycles and
just reload the pipeline from the end with a special branch zero / non-zero instruction?
if(data pipeline n?0) load instruction(pipeline a,b).
Things like C string functions come to mind here? while(*a++=*b++);
Ben.


Fri Nov 06, 2020 6:39 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
For short loops (< 8 instructions), could you not extend the decoded pipeline n cycles and
just reload the pipeline from the end with a special branch zero / non-zero instruction?
That sounds like a good idea. Sounds like loop mode in the 68k. Just feed one pipeline stage back into the pipeline at a different spot so the pipeline itself forms a loop. That way there are no instruction fetches taking place. I will have to try something like that. Since it is a six-stage pipeline it should be able to handle loops with up to six instructions. I think it requires maintaining a lot of pipelined signals though. It may be necessary for some NOPs to be in the loop in order to limit the pipeline feedback positions. For instance, if the last stage is fed back to the first, then there must be six instructions in the loop. There could be issues with branches. Conditional branches are predicted and take place at the decode stage, then are confirmed to be correct three stages later in the memory stage. So where to put the loopback?

Milestone reached: got ‘AA’ to appear on the LEDs of the FPGA board.

Results forwarding on the flags register was missed. This caused loops not to terminate properly.

Timing for 57MHz is now met with about 400ps to spare. The latest pipelining was in the interrupt controller. The outputs needed to be registered. It takes the system about an hour to build so I am going to leave further performance improvement to the future. As more performance is asked for, it seems to take longer for the tools to build the system.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 07, 2020 5:12 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 255
The loop would be a do ... until loop, thus the testing would be at the end of the code segment. A special-purpose new instruction, no branches at all in the loop body. Something a C macro might produce.
I tend to debug my FPGA hardware a lot, so I have a dummy front panel to load and test memory and display the AC, PC, and SP. Once I have that working, I then move on to simple serial I/O.
Debugging is easy if you can find the problem.
Ben.


Sat Nov 07, 2020 6:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
The loop would be a do ... until loop, thus the testing would be at the end of the code segment. A special-purpose new instruction, no branches at all in the loop body. Something a C macro might produce.
I think it is possible to get by without a special branch by using the branch displacement value. If the displacement value is between negative five and zero then the branch will be back to an instruction already in the pipeline.
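That displacement test is cheap to express; a sketch of the rule for the six-stage pipeline, with the window taken from the paragraph above:

```python
# A backward branch landing within the last few pipeline stages can replay
# from the pipe itself instead of refetching.
def fits_loop_mode(displacement: int) -> bool:
    return -5 <= displacement <= 0

assert fits_loop_mode(-3)        # small loop: run it out of the pipeline
assert not fits_loop_mode(-12)   # too far back: normal instruction fetch
```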

Spurred on by discussion at:
viewtopic.php?f=23&t=583

I decided to fix up the compiler a little bit and see what the resulting RTF64 code would look like. Note the use of the special ?? operator, which tells the compiler it is safe to generate code evaluating both sides of the conditional. Sometimes that can be unsafe depending on things like function calls and aliased pointers, so it is left up to the programmer to decide.
Code:
 int testX16( register int a, register int b, register int *c )
{
  *c = a==b ?? 2+a : 3+b;
  return a==b;
}

Compiles to:
Code:
public code _testX16:
;   *c = a==b ?? 2+a : 3+b;
  seq      $cr0,$a0,$a1
  add      $t1,$a0,#2
  add      $t2,$a1,#3
  cmovenz   $t0,$cr0,$t1,$t2
  sto      $t0,[$a2]
;   return a==b;
  aslx     $a0,$x0,#1
TestX16_14:
  rtl   
endpublic

Much the same as the CPU74 code.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 07, 2020 7:07 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
I am liking the loop mode idea. It turns the five-cycle instruction fetch into a single-cycle operation, ultimately causing the minimum instruction cycle time to be four clocks instead of five. A performance improvement of 20%. It took me a couple of hours but I think loop mode is basically working. I added a couple of trailer stages to the pipeline to allow loops up to six instructions. I suspect there will be more debugging of this required in the future. I believe I know a way to trim yet another cycle off the CPI, but it is a bit messy. If the CPI can be trimmed down to three for simple instructions that would be great.
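For the record, the arithmetic behind that figure:

```python
# Minimum instruction time drops from five clocks to four:
assert (5 - 4) / 5 == 0.20      # 20% of the cycles saved
assert 5 / 4 == 1.25            # equivalently, 1.25x throughput
```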

_________________
Robert Finch http://www.finitron.ca


Sat Nov 07, 2020 10:17 am WWW

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rob,

I'm happy to see that the output of my CPU74 backend implementation for the LLVM compiler serves as an inspiration for your compiler. To get that kind of output for an instruction set you need non-flag-altering instructions as well as conditional moves or selects, as you seem to have. I also got the original C code example from your thread, of course. :D


Sat Nov 07, 2020 5:01 pm
