Last visit was: Sun Aug 01, 2021 4:08 am
It is currently Sun Aug 01, 2021 4:08 am



 [ 121 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9  Next
 RTF64 processor 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Latest Fixes:
<software> Register forms of the SET instructions were not encoded correctly by the assembler. This led to many loops not functioning correctly.

Latest Mods:
<hardware> A return address stack predictor was added. Returns which used to be done in the m-stage can now be done two stages sooner in the d-stage. Decreasing the penalty for a return.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 10, 2020 4:00 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Latest Mods:
<hardware>
The direct load and store of some special purpose registers was removed. These registers must be moved to/from general purpose registers now to load and store them.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 11, 2020 4:51 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Latest Mod:
Added a data cache to the core. It is 4kB direct mapped. A memory access begins to external memory at the same time the cache is accessed. Then if the data is in the cache the external memory access is aborted. If the data is not in the cache, then the external access is aborted too, and a cache load sequence takes place. Addresses in the MMIO region are not cached. Write policy is my favorite – write through without allocating.

I had to change the placer strategy to ‘congestion spread logic’ as the placer could not place the design with the default options. This uses more of the device to implement the design.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 12, 2020 4:16 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
The organization of the data cache line is as follows allowing single access reads of unaligned data, using only a single read port.
Attachment:
DataCacheOrg.png


You do not have the required permissions to view the files attached to this post.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 12, 2020 9:14 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
I decided to scrap the overflow area on the cache line and use a more conventional means. It was looking pretty ugly to implement write updates. So, now the cache very quickly reads two consecutive cache lines if an access spans cache lines. Writing the cache is a bit complex. Supporting unaligned cache access made the cache more complex. There is small state machine involved.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 13, 2020 5:47 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
Ah, interesting development - I did wonder. It felt like you were going to be battling a kind of consistency.


Sun Dec 13, 2020 9:45 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
Ah, interesting development - I did wonder. It felt like you were going to be battling a kind of consistency.

I wondered why the approach did not seem to be used elsewhere. I have found that sometimes empirical trial and error can come up with solutions that don't seem theoretically good, but work. So, I like to experiment.

Latest Fixes:
<software> A couple of the opcodes for SET instructions were swapped in the assembler. Hopefully fixing this will get the ram testing routine working among other things.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 16, 2020 4:04 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
Oh, yes indeed, a bit of experimentation is a useful adjunct to a bit of intuition.


Wed Dec 16, 2020 9:50 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Latest Fixes:
<software> The compiler’s double-target removal peephole optimization got confused by the SET and TST instruction which have implied targets. The double-target removal opt. searches for two consecutive instructions that target the same register, and if there are no other dependencies it removes the first instruction. In this case the compiler was removing too many instructions because of false matches with SET and TST instructions.

Latest Mods:
<hardware> Work started on decimal floating-point hardware. Coded up a DFP addsub unit modelled after the regular floating-point addsub. Was not as hard as I thought it might be. The decimal floating-point used is a custom 128-bit format. Using packed BCD there are 24 decimal digits occupying 96 bits, plus a 4 digit exponent occupying 16 bits. Then there is a status nybble containing the sign bits for the exponent and significand, a bit to indicate infinity and a bit to indicate not-a-number. That makes 116 bits used. The upper 12 bits contain the code '0xDF0' which can be used to distinguish the decimal floating-point number from other types of data. Sometimes software uses NaN values to do this. Since there were extra bits available they were put to this use.
The core is partially pipelined. The unpipelined portion being the BCD ripple carry. The adder / subtracter will need some more regs to make it fast.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 17, 2020 4:45 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
I changed the decimal format in use to make use of the extra bits for more precision. The basic format is shown here:
Attachment:
DecimalFloatingPointFormat.png

The exponent and significand are both sign-magnitude numbers.

After coding the adder / subtracter I coded a multiplier, then divider. I did some research into decimal floating-point but ultimately decided to keep things simple.

A decimal floating-point divider has been developed. It uses a simple repeated subtraction method which works out to be not too bad. It takes an average of about 190 clocks for a 128-bit division, about double what a radix 2 binary divider would take. Dividing by zero takes a long time: 660 clocks. Since the division is done out to double precision (2x27 digits) there are 54 digits to process. Each digit can take up to 10 clock cycles, meaning the max time for the divide should be about 540 clocks. As can be seen the average time is a lot lower. Dividing by zero would hang the divider but there is special logic to detect if a divide is taking more than 600 clocks (meaning a divide by zero is taking place). The repeated subtraction keeps subtracting the divisor from the quotient until a borrow is needed. On average five clock cycles should be required since digits are 0 to 9. Consuming five clock cycles average to process four bits is almost one clock per bit. Sample test bench output follows.
Code:
rm ------ A ------  ------- B ------  - DUT Quotient -
0   54731730284198063956228301840697   50842254286367517703931144195455   53889287189677210325624552125257    189   inf
0   50000700000000000000000000000000   50000200000000000000000000000000   50000350000000000000000000000000     67   256.000000
0   50000900000000000000000000000000   50000200000000000000000000000000   50000450000000000000000000000000     68   162.000000
0   50000000000000000000000000000000   50000000000000000000000000000000   c9999000000000000000000000000000    660   328.000000
0   50000100000000000000000000000000   50000300000000000000000000000000   40001333333333333333333333333333    146   282.500000
0   57742855207635419742995309752875   51598442187529856298732887541068   56144193403833820837678338502900    178   261.600000
0   50560251015149897929392476267187   59363352719789263230798603292166   41198711655987383797223201491055    186   249.000000
0   50394271649286190530574052617493   57283721261361267168305167726131   43112376630859184354318587252077    192   240.857143

0   51145089367764017411289639013457   58738899849904594450527688617279   42409993140784492143295244150691    177   190.824191
0   54989588098129004120143033544539   51529856630375744281885391296366   53459686524953651278191205628045    184   190.814607

An alternative multiplier primitive was added which is based on a repeated addition similar to the divider’s repeated subtraction. Once again, it takes up to 540 clock cycles to perform an add this way, but the resource usage is much smaller. The multiplier implemented with this primitive is only about 1000 LUTs instead of 40,000. 40x smaller and 40x slower. (The multiplier using a parallel BCD multiplier was quite large).

Now to add a decimal float type to the compiler. The type would be ‘decimal’ for 128-bit decimal numbers.


You do not have the required permissions to view the files attached to this post.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 19, 2020 3:45 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
I have hit upon densely packed decimal formats. The plan is to use a densely packed format for storage and transmission, then use the previously mentioned BCD format as a base for calculations. The densely packed format used here stores up to 33 digits, so the previous format has been extended with more digits.
There are some very clever manipulations of bits to store three decimal digits into ten binary bits. I decided not to use the DPD in that manner. Instead the DPD planned for use is a simple base 1000 binary number. This has the advantage that it retains order for comparisons. The DPD is easily converted using lookup tables.
Attachment:
DecimalFloatingPointStorageFormat.png


You do not have the required permissions to view the files attached to this post.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 19, 2020 9:10 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Today was BCD square root primitive day. It was quite puzzling, kept me busy for a while but I managed to get it to work based on an algorithm found here:
https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1977-05.pdf1977 , Volume , Issue May-1977 (hp.com) (William. E. Egbert).

The square root is quite large (3,600 LUTs) probably due to all the BCD adders and subtractors and the number of digits (68) processed. Three adders are used just to calculate a 5x quantity. I think it is a good first effort and may be refined later. The square root requires an even number of digits. Meaning most likely the lead digit will be set to zero.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 20, 2020 4:24 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
Speaking of multiplying by small constants using adders... I think your division machinery could probably proceed by subtracting 4x, 2x, and x and make slightly faster progress. If you can make use of non-restoring division that saves a bit of time too.


Sun Dec 20, 2020 10:20 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 73
robfinch wrote:
I was wondering that myself. They do not get used at the same time. It is sometimes handy to be able to be able to have a depth of two for register-based routine returns without having to stack addresses. The first level routine returns using ra0 then the next level routine returns using ra1. Having two registers allows an alternate return path to be specified in a function. Why one register would not be sufficient I do not know.


Have you already found an example code that justify these two RA rationally ?


Sun Dec 20, 2020 8:24 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Yes, there is lots of room for further optimization of the cores. Like anything new to me, I tried to keep it simple. Getting it to work even if it is not very efficient is a great feeling. There is some very good stuff on the web about decimal floating-point already developed by people. It has been around for years and I suspect with larger transistor budgets it will make it onto personal computers, if it’s not already there.

I picked a base 1000 for the storage format because it may be possible to use it directly in some algorithms. Converting to BCD is a small lookup table. Converting from BCD is relatively painless as well. (I have since added a more conventional DPD format converter core).

I believe I have figured out how to do a BCD right-shift now. So, the square root can be improved. Same as a binary shift, except when there is a carry over from the higher nybble add five to the current nybble instead of eight. -> Reduced the size of the square root core by about 10%.
I took BigEd’s suggestion on the divide and now it tests for x5 in addition to x1. Doing this trimmed the average cycle count down to 161 from 190. The divide is a non-restoring divide.

I am toying with the idea of redoing the DFP in base 1000. It might be a little bit more hardware efficient.

I have been studying the decimal floating-point some more after a donation to Wikipedia. The exponent in the IEEE format is represented as a 14-bit binary value. To use this with the decimal floating-point routines as they are requires a 14-bit binary to BCD conversion. To reduce the binary to BCD conversion to six bits I’ve come up with yet another format. It has the same number of significant digits and the almost the same exponent range as the IEEE format. Encoding of NaNs and infinities is different.
Attachment:
DecimalFloatingPointStorageFormat2.png


You do not have the required permissions to view the files attached to this post.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 21, 2020 5:22 am WWW
 [ 121 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9  Next

Who is online

Users browsing this forum: CCBot and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software