View unanswered posts | View active topics It is currently Fri Mar 29, 2024 9:42 am



Reply to topic  [ 71 posts ]  Go to page Previous  1, 2, 3, 4, 5  Next
 G6A-RISC Relay Computer 
Author Message
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I completed the G6A-risc software simulator and started to perform some tests on 32 decimal multiplication (8 digits). The main goal for this computer is to run an usable floating point calculator using BCD floating point numbers. So my tests so far aim to simulate the multiplication of the normalised mantissas of floating point numbers.

I found that the faster algorithm for that is the "Nine-Multiples" algorithm. Well, that's nothing else than paper an pencil multiplication, but performed with relays!

The algorithm first generates all the nine multiples of the multiplicand. Next, the multiplier digits are used in sequence to select one of the multiples. One digit multiplication is performed with only one single addition. Therefore the total number of additions required to perform a 8 digit x 8 digit multiplication is 17 (9 additions to create the nine multiples, and 8 additions to perform the eight digit multiplication)

The algorithm in G6A assembler looks like this:

Code:
   .text
   .file   "main.s"

# ---------------------------------------------
# main
# ---------------------------------------------
    .globl   main
main:

# Load a into register r2:r3

   mov [&a], r2
   mov [&a+1], r3

# Store multiples of a from a*0 to a*1
   
   mov 0, r4
   
   mov r4, [&X0]     
   mov r4, [&X0+1]   // x0
      
   mov r2, [&X1]     
   mov r3, [&X1+1]   // x1
   
# Store multiples of a from a*2 to a*3   
   
   dad r2, r2, r0
   dac r3, r3, r1
   mov r0, [&X2]     
   mov r1, [&X2+1]   // x2

   dad r0, r2, r0
   dac r1, r3, r1
   mov r0, [&X3]   
   mov r1, [&X3+1]   // x3

# Store multiples of a from a*4 to a*9

.LXLoop:   
   dad r0, r2, r0
   dac r1, r3, r1
   mov r0, [r4, &X4]   
   mov r1, [r4, &X4+1]    // x4, x7
   
   dad r0, r2, r0
   dac r1, r3, r1     
   mov r0, [r4, &X5] 
   mov r1, [r4, &X5+1]    // x5, x8
   
   dad r0, r2, r0
   dac r1, r3, r1     
   mov r0, [r4, &X6] 
   mov r1, [r4, &X6+1]    // x6, x9
   
   add 6, r4
   cmp 12, r4  // loop two times
   bf .LXLoop
   
# r0:r1 will become the result
   
   mov 0, r0           // a will become the result
   mov 0, r1           // a will become the result
   
# load b   
   
   mov [&b], r2       
   mov [&b+1], r3
   
# Loop 4 times

   mov 4, r5
   
.LMLoop:
   
   rr4 r0, r1, r0     
   sr4 r1, r1         // Shift result one digit
   
   sl1 r2, r4
   and 0x1f, r4      // Find the index to the x0-x9 multiple
   
   dad [r4, &X0], r0
   dac [r4, &X0+1], r1  // multiply the right-most digit of b
   
   rr4 r2, r3, r2     
   sr4 r3, r3         // Shift b one digit
   
# next digit
   
   rr4 r0, r1, r0     
   sr4 r1, r1         // Shift result one digit
   
   sl1 r2, r4
   and 0x1f, r4       // Find the index to the x0-x9 multiple
   
   dad [r4, &X0], r0
   dac [r4, &X0+1], r1  // multiply the right-most digit of b
   
   rr4 r2, r3, r2     
   sr4 r3, r3         // shift b one digit
   
   sub 1, r5         // decrement counter
   bf .LMLoop        // next digit
   
# store the result

   mov r0, [&result]
   mov r1, [&result+1]
   
   hlt

# ---------------------------------------------
# Global Data
# ---------------------------------------------

   .data
a:   .long 0x05556789
b:   .long 0x12345670
   .comm result,2,2
   .comm X0,4,2
   .comm X1,4,2
   .comm X2,4,2
   .comm X3,4,2
   .comm X4,4,2
   .comm X5,4,2
   .comm X6,4,2
   .comm X7,4,2
   .comm X8,4,2
   .comm X9,4,2


* This method takes constant time regardless of the digit values involved. The the 8 digit x 8 digit (mantisa) multiplication takes 117 instruction cycles.

A non negligible time is spent at creating the nine multiples table. So I decided to improve this a bit by adding two new instruction that will accumulate additions into registers R0 or R1 at the same time that they are stored in memory. I named the new instructions 'r0a', and 'r1a', which stand for 'r0 accumuate' and 'r1 accumulate'. The first one is a decimal add without carry, and the second one is the decimal add with carry. They are identical to the normal 'dad' and 'dac' instructions but always using register R0 and R1 as the first source, and storing the result in R0, R1 along with the normal memory store. I have yet to implement them on logisim, but I think this takes only 2 additional relays each.

The simulator code using these new instructions is the following:

Code:
   .text
   .file   "main.s"

# ---------------------------------------------
# main
# ---------------------------------------------

    .globl   main
main:

# Load a into register r2:r3

   mov [&a], r2
   mov [&a+1], r3

# Store multiples of a from a*0 to a*9
   
   mov 0, r0
   mov 0, r1
   
   mov r0, [&X0]     
   mov r1, [&X0+1]   // x0
      
   r0a r2, [&X1]
   r1a r3, [&X1+1]   // x1
      
   r0a r2, [&X2]
   r1a r3, [&X2+1]   // x2

   r0a r2, [&X3]   
   r1a r3, [&X3+1]   // x3
      
   r0a r2, [&X4]   
   r1a r3, [&X4+1]   // x4

   r0a r2, [&X5]   
   r1a r3, [&X5+1]   // x5

   r0a r2, [&X6]   
   r1a r3, [&X6+1]   // x6

   r0a r2, [&X7]   
   r1a r3, [&X7+1]   // x7

   r0a r2, [&X8]   
   r1a r3, [&X8+1]   // x8

   r0a r2, [&X9]   
   r1a r3, [&X9+1]   // x9
   
# r0:r1 will become the result
   
   mov 0, r0           // a will become the result
   mov 0, r1           // a will become the result
   
# load b   
   
   mov [&b], r2       
   mov [&b+1], r3
   
# Loop 4 times

   mov 4, r5
   
.LMLoop:
   
   rr4 r0, r1, r0     
   sr4 r1, r1         // Shift result one digit
   
   sl1 r2, r4
   and 0x1f, r4      // Find the index to the x0-x9 multiple
   
   dad [r4, &X0], r0
   dac [r4, &X0+1], r1  // multiply the right-most digit of b
   
   rr4 r2, r3, r2     
   sr4 r3, r3         // Shift b one digit
   
# next digit
   
   rr4 r0, r1, r0     
   sr4 r1, r1         // Shift result one digit
   
   sl1 r2, r4
   and 0x1f, r4       // Find the index to the x0-x9 multiple
   
   dad [r4, &X0], r0
   dac [r4, &X0+1], r1  // multiply the right-most digit of b
   
   rr4 r2, r3, r2     
   sr4 r3, r3         // shift b one digit
   
   sub 1, r5         // decrement counter
   bf .LMLoop        // next digit
   
# store the result

   mov r0, [&result]
   mov r1, [&result+1]
   
   hlt

# ---------------------------------------------
# Global Data
# ---------------------------------------------

   .data
a:   .long 0x05556789
b:   .long 0x12345670
   .comm result,2,2
   .comm X0,4,2
   .comm X1,4,2
   .comm X2,4,2
   .comm X3,4,2
   .comm X4,4,2
   .comm X5,4,2
   .comm X6,4,2
   .comm X7,4,2
   .comm X8,4,2
   .comm X9,4,2


Now, the total (mantisa) multiplication time is reduced to 101 instruction cycles (constant execution time)

As, as comparison, the fastest multiplication algorithm that I found in existing modern relay computers is the one implemented by Roelh in his Risc Relay computer. He implemented special instructions that combine conditional adds with conditional jumps in a single instruction to perform multiplication as fast as possible. By looking at the code that he published on his excellent online simulator, I believe the figures for his computer are as follows: Worse case (Multiplicaton by all 7's) = 120 cycles, Best case (multiplication by all Zeros) = 100 cycles. So, my constant figure of 101 cycles looks pretty good in the view of it.


Fri Jan 17, 2020 11:17 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
That's a nice speedup from a small instruction set change - about 16%.


Fri Jan 17, 2020 12:42 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I've been trying to figure out ways to make this thing faster.

- On the one hand, I made some instruction encoding changes to reduce dependency on Zero Page accesses. Particularly, the 'r0a' and 'r1a' instructions described earlier, now use M addressing mode rather than ZP addressing mode. This means that the tables of multiples can now be placed anywhere in data memory, thus freeing space in Zero Page for other common variables.

- I also decided to replace the behaviour of the Indexed Memory addressing mode ( as in "mov [r4, 2], r0" ). The memory operand now performs an 'or' of the register and constant, instead of adding them together, to compute the address. The new behaviour is borrowed from Roelh Risc Relay. This makes the use of this addressing mode rather awkward, but it saves 64 relays that were previously required to compute effective addresses by 16 bit addition, and shortens the critical path. For the decimal multiplication algorithm, this means that the table of multiples must be stored in 32 word aligned memory for the addressing mode to work.

- I split the 'B' bus on the data path, so that now it is possible to load an address to the Memory Address Register, at the same time that a pair of registers are presented on the ALU inputs. The updated 'r0a', r1a' instructions take advantage of this. This did not require any additional relay or control line, but just a refactoring of the previous schema. I have successfully simulated the multiplication algorithm with the new datapath and instructions. The simulation schematic is in the LogisimDocs/Main.pdf file

I made a major update of the G6A-Risc.pdf file with everything that was missing. As always, the docs are available in this directory https://github.com/John-Lluch/G6A-RISC/tree/master/Docs

And now some interesting stuff:

While working on the above, I became aware of a design flaw that may (or may not) cause problems in the real implementation. The problem is easier to see than to explain. I showed before the clock phases generated by the sequencer. Inserted here again for reference:

Attachment:
SequencerV1.png
SequencerV1.png [ 20.86 KiB | Viewed 3085 times ]


As can be seen, Phase0 and Phase1 end at the same time. Phase0 is used to load the instruction register and initiate decoding. Phase1 puts the control lines out for datapath selection, alu execution, and alu-result register update. So Phase0 and Phase1 ending simultaneously means that the inputs of the alu-result register (i.e the alu output) may disappear just before the register input enable is cleared, thus potentially corrupting the register value. On TTL based logic, register updates are less of a problem because they are triggered by clock edges, and when this happens the inputs will be present on the input bus for long enough for that to work.

The problem with relay registers is that register inputs must be maintained stable until after the register input-enable signal is cleared, any glitches created while the Input Enable is on can cause the register to store a wrong value. The logisim simulator works fine because, after all, the model is based on logic gates, and the alu propagation delay prevents this to fail. But I am concerned that the current design may not be reliable enough on real relays.

One solution I can think of is to generate the phase signals in a way that the alu result is explicitly kept valid for a while after the alu-result register is updated. This of course adds an additional overall delay so this is something I am not comfortable with, and would like to learn how others have dealt with it. For now, I "will take the bull by the horns" (does that expression exist?) and will replace the sequencer by this:

Attachment:
SequencerV2.png
SequencerV2.png [ 22.94 KiB | Viewed 3085 times ]


Or even this:

Attachment:
Sequencer.png
Sequencer.png [ 25.18 KiB | Viewed 3085 times ]


The phase signals are essentially the same, but they are now generated by a longer "sequencer", up to 50% longer in the last case, meaning that the entire instruction cycle will take 6 clock cycles (instead of 4) at a much higher clock frequency. Rather than becoming slower, the processor now becomes faster because inter-phase iddle times are shorter (12 ms instead of 30 ms), and alu processing starts earlier, with 42% of the total instruction cycle time allocated for that, instead of only 25%. The previous estimate of 4 instructions per second at 16 Hz, now translates into 7 instructions per second and 42 Hz (and way tighter times for register and memory updates :shock: )


Wed Jan 22, 2020 7:04 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I have now (mantisa) division algorithm successfully tested in both the simulator and logisim. This is the division code:

Code:
   .text
   .file   "main.s"

# ---------------------------------------------
# main
# ---------------------------------------------

    .globl   main
main:

# Load a into register r2:r3 (dividend)

   mov [&a], r2
   mov [&a+1], r3

# Load b into register r4:r5 (divisor)

   mov [&b], r4
   mov [&b+1], r5   
   
# r0:r1 will become the result
   
   mov 0, r0     
   mov 0, r1           // R0:R1 will become the result
   
# We will loop 4 times

   mov 4, r6
   
# but first compute first digit
   
.LDivFirst:
   dsb r2, r4, r2
   dsc r3, r5, r3     // Subtract divisor
   adt 1, r0          // Only add result digit if no borrow (== carry) is produced
   bt .LDivFirst      // Subtract one more if no borrow

# For the first digit we shift divisor right (instead of dividend left)

   rr4 r4, r5, r4    // lower first
   sr4 r5, r5
   
   b .LDivMore      // branch to the restoring half of the loop
   
.LDLoop:

# Shift result one digit left

   rl4 r1, r0, r1   // higher first
   sl4 r0, r0
   
# Compute non restoring digit by repeatedly subtracting divisor

.LDivMinus:
   dsb r2, r4, r2
   dsc r3, r5, r3     // Subtract divisor
   adt 1, r0          // Only add result digit if no borrow (== carry) is produced
   bt .LDivMinus      // Subtract one more if no borrow

# shift dividend left

   rl4 r3, r2, r3      // higher first
   sl4 r2, r2
   
# Shift result left, initialize last digit with 9

.LDivMore:
   rl4 r1, r0, r1     
   sl4 r0, r0         // Shift result one digit left
   add 9, r0          // Initialize result digit with 9 for restoring
   
# compute the restoring digit by repeatedly adding divisor
   
.LDivPlus:
   dad r2, r4, r2
   dac r3, r5, r3     // Add divisor
   sbf 1, r0          // Only subtract if no carry
   bf .LDivPlus       // Add one more if no carry
   
# shift dividend left

   rl4 r3, r2, r3      // higher first
   sl4 r2, r2
   
# Next digit

   sub 1, r6
   bf .LDLoop
   
# store the result

   mov r0, [&result]
   mov r1, [&result+1]
   
   hlt


# ---------------------------------------------
# Global Data
# ---------------------------------------------

   .data
a:   .long 0x18465932
b:   .long 0x78901234
   .comm result,4,2       // reserve space in data memory for result


The above divides 18465932 by 78901234 and gives 02340385 which is correct.

The algoritm is the "non restoring technique". It repeatedly subtracts the divisor from the divident (or remainder) until the remainder becomes negative. Then, it shifts the divisor right, and adds it to the remainder until it becomes positive again. This is repeated 4 times to complete the 8 digits. The number of subtractions/additions are accumulated on the Result, one digit at a time.

In the code above, it's interesting to note the use of "adt" (conditional add if true), "sbf" (conditional subtraction if false) to prevent the Result digit to accumulate for the addition or subtraction that caused the remainder sign change, as this would produce an incorrect result. Without this instruction, more code would be required and the algorithm would take longer to complete.

The procedure takes a variable number of cycles depending on the dividend and divisor inputs. On average, considering 5 iterations per digit, it takes 176 instruction cycles.


Mon Jan 27, 2020 8:43 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I think this project is awesome.

Just to confirm I understand thing: the divider is working one digit at a time as opposed to one bit at a time? And the rr4/rl4, etc are four bit shifts?
There seems to be a fair bit of code for multiply / divide. How big is the system rom? Is it a standard eprom or diode logic?

_________________
Robert Finch http://www.finitron.ca


Tue Jan 28, 2020 4:52 am
Profile WWW
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
robfinch wrote:
I think this project is awesome.

Just to confirm I understand thing: the divider is working one digit at a time as opposed to one bit at a time? And the rr4/rl4, etc are four bit shifts?

Hi Rob, thanks for that.

Yes, you understood it well. The divider works one decimal digit at a time. It's not that different than paper and pencil division that we were all taught in school. The goal of this computer is implementing a decimal floating point calculator, and be able to run a few simple arithmetic algorithms, such as finding the fibonacci series or computing a few digits of pi. This is all what relay computers are able to do and what they did when no better technology was available.

The instruction set is tuned for decimal floating point arithmetic. Decimal digits are represented in BCD taking 4 bits per digit, that's why 4 bit shifts are used quite often (they mean multiply or divide by 10). I have not yet fully decided the entire format of the floating point number, but I know that the mantissa will have 8 digits (represented in 32 bits), and that the exponent will be base 10, possibly limited somehow between -99 and +99, like most scientific calculators. The instruction set allows for arithmetic on larger numbers, with say 12 or 16 digits mantissa or more, but execution time would exponentially increment, so I think that 8 digits is fair enough.

robfinch wrote:
There seems to be a fair bit of code for multiply / divide. How big is the system rom? Is it a standard eprom or diode logic?

I think the code looks longer than what it really is because loops are partially unrolled. For example, the multiply algorithm loops 4 times for a 8 digit multiplication, and same for the division.

The final implementation will be somewhat longer and slightly more expensive because it will have to take the floating point exponent into account. I also need to look at ways to fix precision loss that happens for some combinations of numbers. The division example above, for example, produces 02340385 as the result (mantissa). After the floating point number is normalised by incrementing the exponent, the mantissa will become 23403850, but the exact result should be 23403857, so there's obviously some precision loss in the last digit. At this time I'm unsure whether this is easy (or cheap) to fix.

About memory, well, this is a relay computer with concessions to current technology. One of the concessions is of course the use of modern pcb mount small relays, but I also allow diodes, and standard memory ics, so there's no worries at all about memory capacity. In fact, there's potentially 128 kbytes for program and 128kbytes for data. I will never use that much memory on this computer.


Tue Jan 28, 2020 8:47 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Square root !!

Code:
   .text
   .file   "main.s"

# ---------------------------------------------
# square root
# ---------------------------------------------

    .globl   Squareroot
Squareroot:                // sqrt(a)

# Store return address

   mov r6, [&return]

# Load radicand address in r4

   mov [&dsp], r4     // a
   
# r0:r1 will become the radicand remainder
   
   mov 0, r0     
   mov 0, r1           // R0:R1 will become the radicand remainder

# aux0, aux0+1 will become the accumlated root

   mov r0, [&aux0]
   mov r1, [&aux0+1]   // aux0, aux0+1 will become the accumlated root
   
# We will loop 6 times

   mov 6, r6

# Shift next 2 digits of radicand into remainder

.LSqrtOuter:
   mov [r4, 0], r2
   mov [r4, 1], r3     // Load radicand for this iteration

   rl4 r1, r0, r1
   rl4 r0, r3, r0
   rl4 r3, r2, r3
   sl4 r2, r2          // Shift first digit

   rl4 r1, r0, r1
   rl4 r0, r3, r0
   rl4 r3, r2, r3
   sl4 r2, r2          // Shift second digit

   mov r2, [r4, 0]
   mov r3, [r4, 1]     // Save shifted radicand for next iteration

# Compute accumulated root, Si

   mov [&aux0], r2
   mov [&aux0+1], r3

   rl4 r3, r2, r3
   sl4 r2, r2

   mov r2, [&aux0]
   mov r3, [&aux0+1]  // Save for next iteration


# Compute 2*Si - 1

   dad r2, r2, r2
   dac r3, r3, r3
   dsb 1, r2
   dsc 0, r3

# Current digit

   mov 0, r5      // n

# Find radicator Zi = 2*Si + Yn, subtract repeatedly to determine digit

.LSqrtInner:
   dad 2, r2
   dac 0, r3
   dsb r0, r2, r0
   dsc r1, r3, r1   // Subtract radicator
   adt 1, r5        // Only add digit if no borrow (== carry) is produced
   bt .LSqrtInner   // Subtract one more if no borrow

# Restore remainder for next iteration

   dad r0, r2, r0
   dac r1, r3, r1

# Accumulate root digit

   add r5, [&aux0]

# Go for next one

   sub 1, r6
   bf .LSqrtOuter

# Store result in the place of radicand

   mov [&aux0], r0
   mov [&aux0+1], r1
   mov r0, [r4 ,0]
   mov r1, [r4 ,1]
   
# return
   mov [&return], pc


# ---------------------------------------------
# main
# ---------------------------------------------

   .globl   main
main:

# Load a inputs to data stack

   mov [&dsp], r4   // load dsp to r4
   add 4, r4
   
   mov [&a], r0
   mov r0, [r4,0]
   mov [&a+1], r0
   mov r0, [r4,1]
   mov [&a+2], r0
   mov r0, [r4,2]
   mov [&a+3], r0
   mov r0, [r4,3]    // move a to dstack

   mov r4, [&dsp]

# call squareroot

   jl @Squareroot
   
# load result

   mov [&dsp], r4   // load dsp to r4
   mov [r4,0], r0
   mov r0, [&result]
   mov [r4,1], r0
   mov r0, [&result+1]
   mov [r4,2], r0
   mov r0, [&result+2]
   mov [r4,3], r0
   mov r0, [&result+3]    // move pstack to result
   
   hlt




# ---------------------------------------------
# Initialized Data
# ---------------------------------------------

   .data
   .p2align 1
dsp:
   .short dstack-4       // preincrement stores, postdecrement loads
     
sp:
   .short pstack+512    // predecrement stores, postincrement loads
   
   .p2align 3
a:
   .long 0x38598118
   .long 0x00100000      // +3.8598118e+01

# ---------------------------------------------
# Uninitialized Data
# ---------------------------------------------

   .comm return, 2, 2       // tail call subroutine return address
   .comm aux0, 8, 8         // reserved for auxiliary variable
   .comm aux1, 8, 8         // reserved for auxiliary variable
   .comm result, 8, 8       // reserved for result
   .comm dstack, 128, 8     // data stack, 128 bytes, 8 byte aligned, grows upward
   .comm pstack, 1024, 64   // program stack, 128 bytes, 64 bytes aligned, grows downward
   
   



I completed the square root subroutine (only dealing with mantissas for now and ignoring the exponent)

The program above essentially uses a variation of the paper and pencil algorithm. The root digit at every step is determined by subtracting a radicator from the radicand (or its remainder) until the result becomes negative, in a similar way than the division algorithm. The radicator at every step is computed as Yi*(2Si+Yi), where Si is the part of the root found so far (from the previous step), and Yi is the digit being tested. In practice, the radicator is computed by first doubling Si, and then incrementing and accumulating it by 1,3,5,7,9 on successive steps (same as adding 2 on each step), as this gives the same result. No wonder that I first tried this on a spreadsheet. In this case I used the restoring technique (always subtracting the radicator) instead of the non-restoring technique (alternative subtracting and adding) that I used in division, because I found no benefit in this case due to complications in the algorithm implementation.

The proposed algorithm takes on average 604 instruction cycles for the mantissa alone. I looks quite a lot, but I really don't have any better reference. The supposedly faster algorithms, such as the Babylonian method described in wikipedia, I think that can't avoid multiplication and division can't be avoided, so for this computer they will be slower.

The algorithm implementation also suffers from limited precision when computing square roots of 8 digit numbers, the last one or two digits of the result can be wrong. However it's fully accurate for numbers up to 6 digits with an exact root. The problem is that the current implementation can't be run up to the last radicand digit because of not-enough resolution on the radicator, which may overflow while computing the last digit.


Fri Jan 31, 2020 2:19 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
well done! I always like to see square root done as a basic operation- although in my case I was never taught how to do it manually.


Fri Jan 31, 2020 3:30 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
well done! I always like to see square root done as a basic operation- although in my case I was never taught how to do it manually.

I recall that square roots were taught to me sometime when I was a child, but it was boring and little time for exercises was given. Then I forgot it all completely as I never had to use that again. Well at least until now, :-). So now to get a 'working' algorithm I had to do a lot of research, including resorting to early-days computing books, and a couple of spreadsheets.


Fri Jan 31, 2020 5:18 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I second BigEd’s well done. I too remember learning how to perform a manual square root in school, but have since forgot. Calculators were just becoming popular when I went to school. I think I had a slide-rule for the first year of high-school.
There’s a trick to calculating the exponent.

_________________
Robert Finch http://www.finitron.ca


Sat Feb 01, 2020 4:33 am
Profile WWW
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
robfinch wrote:
I second BigEd’s well done. I too remember learning how to perform a manual square root in school, but have since forgot. Calculators were just becoming popular when I went to school. I think I had a slide-rule for the first year of high-school.
There’s a trick to calculating the exponent.

Hi Rob, thanks for that. Yes I also recall when calculators became to be popular, particularly scientific calculators. I spent a lot of time investigating numerical results of the several functions available. These devices were kind of magic for me at that early age of my life. About square roots, well, the exponent is of course calculated by halving the exponent of the input, I assume this is what you refer to.

Back on topic (more or less):

Point 1 - More on Square Root

Some days ago the youtube algorithm suggested this video to me https://youtu.be/gI8luQnyM9A. It's a "Relay-Based Floating Point Square Root Calculator". It's captivating that the video introduction and the music resemble one of those excellent BBC TV-shows on steam trains, but covering old computers in this case. The video is of course not a BBC production, but it's really fun to watch. Anyway, that 'engine' does just what the title suggests, that is, square roots, but VERY fast for a relay based device. Which made me wonder how was that possible with such an apparently slow clock rate and visibly few steps.

So, I dug on the design of such machine, which linked me to this article https://documents.epfl.ch/users/f/fr/froulet/www/HP/Algorithms.pdf from an old Hewlett-Packard Journal. The square root algorithm described on that article is quite clever and (on paper) it looks faster than anything that I tried before. So I tried to implement it, first on a spreadsheet, and then on my simulator, and it looks promising.

Point 2 - Floating Point Format

I also defined the format of my decimal floating point numbers. It looks like this:
Attachment:
FloatingPointFormat.png
FloatingPointFormat.png [ 34.91 KiB | Viewed 2899 times ]

The entire representation takes 64 bits (or 4 words), with the following fields:

- I am now aiming at 10 digits with full precision after rounding an normalising (d0 to d9), instead of the previously 8 digits mantissa only.
- The mantissa sign is coded in s0, but only the lowest bit is relevant.
- The exponent uses 3 digits of 10-complement signed BCD, e0 to e2, which have room for -500 to +499 exponents. However, only the range from -99 to +99 will be valid for normalised numbers, any result outside this range will be considered calculator error or overflow
- There are two additional digit places before the mantissa (labeled 'x') that will remain unused on normalised numbers. However, they may be used during calculations to keep result accuracy up to at least 10 digits after normalisation.
- The decimal point for normalised floating point numbers is considered to be always between digit d8 and d9.

The format is not particularly compact, but it has the following advantages:

- The correct exponent and mantissa sign after multiplication of two floating point numbers can be obtained by simply adding the upper words of the two floating point numbers. A similar trick can be done for division.
- The 'x' labeled additional digits allow for mantissa operations of up to 12 digits. This makes sure that after normalisation and rounding to 10 decimal places, the result is accurate with a maximum deviation of +/- 1 on the last digit.

Point 3 - Need for Instruction Set Adjustments

I found that I need some changes on the instruction set to make 12 digit mantissa calculations faster. The problem is that representing 12 digits take 3 registers. So, just two 12 digit numbers consume 6 registers, so there's only 1 register left (excluding the PC). The architecture already allows direct arithmetic on memory, which are helpful to reduce register pressure. However shifts are only allowed on registers, which is a problem because that implies a lot of saving/restoring of intermediate computations on memory when shifts must be performed. I will try to fix that in the next days or so.

- This may involve extending the Status Register to hold the one digit (4 bits) that is shifted on rotate operations, so it can be chained with subsequent shifts directly in memory, without intermediate registers. That should make the implementation of Square Root and Division shorter and thus cleaner faster.

- The 'shifter' is currently inserted before the A alu input. However, to make the previous point possible, it needs to be placed before the B alu input instead. This has a number of implications such as different behaviour of several instructions. Particularly, the 'shifter' is also used to generate Zero value in input A for the 'set' and 'sef' instructions. If the shifter is moved to B input, that feature will be lost. However, this may just imply that 'set' and 'sef' will become conditional moves, instead of conditional selects, which I guess are equally useful (if not more). Fortunately, the shifter does not add any delay to the critical path, so it can be placed at either side of the alu inputs.

Let's see what this comes up with.

[edited for typos, several times]


Mon Feb 03, 2020 9:36 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I managed (for once) to complete the tasks on my last message (except for the logisim project changes) earlier than anticipated.

The resulting sqrt code is pasted below:

Code:
   .text
   .file   "main.s"

# ---------------------------------------------
# Square Root
# Based on work from William E. Egbert
# (Published on the Hewlett-Packard Journal, June 1977)
# https://documents.epfl.ch/users/f/fr/froulet/www/HP/Algorithms.pdf
# ---------------------------------------------

    .globl   Squareroot
Squareroot:                // b = sqrt(a)

# Store return address

   mov r6, [&return]

# Load radicand address in r6

   mov [&dsp], r6     // a

# Get radicand (a) into registers

   mov [r6, 0], r0
   mov [r6, 1], r1
   mov [r6, 2], r2

# Check if exponent is an even number

   mov [r6, 3], r6
   and 0x10, r6

# If exponent is even do not shift the radicant

   bt .LSqrtEndShift  // Branch if exponent is even

# Shift the radicand left for odd exponents to keep 11 digit resolution

   sl4 r0, r0
   rl4 r1, r1
   rl4 r2, r2

# Multiply by 5

.LSqrtEndShift:
   dad r0, r0, r3
   dac r1, r1, r4
   dac r2, r2, r5     // Times 2

   dad r3, r3, r3
   dac r4, r4, r4
   dac r5, r5, r5     // Times 4

   dad r3, r0, r3
   dac r4, r1, r4
   dac r5, r2, r5     // Times 5

# Initialize 'One' constant, 'Five' constant, and radicator Rb

   mov 0, r0
   mov 0, r1
   mov 0x0100, r2
   mov r0, [&aux1]
   mov r1, [&aux1+1]
   mov r2, [&aux1+2]   // 0x0100_0000_0000
   mov 0x0050, r2
   mov r0, [&aux0]
   mov r1, [&aux0+1]
   mov r2, [&aux0+2]   // 0x0050_0000_0000

# We will loop 11 times as we want to get 11 decimal digits

   mov 11, r6

# At loop entry we have:
# Radicand accumulator Ra in r3:r4:r5
# Radicator remainder Rb in r0:r1:r2
# One constant in aux1
# Five constant in aux0

# Subtract radicator (Rb) from radicand (Ra)
# (Optimize the first subtraction so we do not need to jump
# or to perform an unnecessary radicator accumulation)

.LSqrtOuter:
.LSqrtInnerOne:
   dsb r3, r0, r3
   dsc r4, r1, r4
   dsc r5, r2, r5       // Subtract Rb from Ra
   bf .LSqrtInnerEnd    // Finish early if done

# Accumulate radicator (Rb) with a tentative digit
# Keep subtracting radicator (Rb) until radicand (Ra) becomes negative

.LSqrtInnerLoop:
   dad [&aux1], r0
   dac [&aux1+1], r1
   dac [&aux1+2], r2    // Add constant 1 to the digit place we are looking for
   dsb r3, r0, r3
   dsc r4, r1, r4
   dsc r5, r2, r5       // Subtract Rb from Ra
   bt .LSqrtInnerLoop

# Digit found, stop now if we already have all of them

.LSqrtInnerEnd:
   sub 1, r6
   bt .LSqrtMantisaDone

# Ok, so we have more digits to look for
# First restore radicand (Ra)

   dad r3, r0, r3
   dac r4, r1, r4
   dac r5, r2, r5    // Restore Ra from the extra subtraction performed above

# We must shift the 'five' constant in Rb one digit right.
# We do it by subtracting the existing one and adding the shifted one

   dsb [&aux0], r0
   dsc [&aux0+1], r1
   dsc [&aux0+2], r2   // Subtract 'five' constant from radicator
   sr4 [&aux0+2]
   rr4 [&aux0+1]
   rr4 [&aux0]         // Shift 'five' constant right
   dad [&aux0], r0
   dac [&aux0+1], r1
   dac [&aux0+2], r2   // Add back 'five' constant to radicator

# Finally, we must update the 'one' constant one digit left, and
# Shift the radicand (Ra) left for the next digit

   sr4 [&aux1+2]
   rr4 [&aux1+1]
   rr4 [&aux1]         // Shift 'one' constant right
   sl4 r3, r3
   rl4 r4, r4
   rl4 r5, r5          // Shift Ra left

# We are ready for the next digit, go for it

   b .LSqrtOuter

# We are done with the mantisa
# Store result in the place of radicand

.LSqrtMantisaDone:
   mov [&dsp], r6     // a
   mov r0, [r6 ,0]
   mov r1, [r6 ,1]
   mov r2, [r6, 2]

# Adjust exponent to be half the original one and clear sign

   mov [r6, 3], r0    // load exponent
   sr4 r0, r0         // shift the exponent value to lower position
   dad r0, r0, r1     // times 2
   dad r1, r1, r1     // times 4
   dad r1, r0, r1     // times 5
   sr4 r1, r1         // shift right once more to divide by 10 and clear sign
   dsb 1, r1          // subtract 1 to compensate for the one digit less on the mantisa result
   sl4 r1, r1         // return back the exponent to its right position
   mov r1, [r6, 3]    // store exponent
   
# return
   mov [&return], pc


# ---------------------------------------------
# main
# ---------------------------------------------

   .globl   main
main:

# Load a inputs to data stack

   mov [&dsp], r4   // load dsp to r4
   add 4, r4
   
   mov [&a], r0
   mov r0, [r4,0]
   mov [&a+1], r0
   mov r0, [r4,1]
   mov [&a+2], r0
   mov r0, [r4,2]
   mov [&a+3], r0
   mov r0, [r4,3]    // move a to dstack

   mov r4, [&dsp]

# call squareroot

   jl @Squareroot
   
# load result

   mov [&dsp], r4   // load dsp to r4
   mov [r4,0], r0
   mov r0, [&result]
   mov [r4,1], r0
   mov r0, [&result+1]
   mov [r4,2], r0
   mov r0, [&result+2]
   mov [r4,3], r0
   mov r0, [&result+3]    // move pstack to result
   
   hlt

# ---------------------------------------------
# Initialized Data
# ---------------------------------------------

   .data
   .p2align 1
dsp:
   .short dstack-4       // preincrement stores, postdecrement loads
     
sp:
   .short pstack+512    // predecrement stores, postincrement loads
   
   .p2align 3
a:
//   .quad 0x0010003859811800    // +3.859811800e+01
   .quad 0x0000002000000000    // +2.000000000e+00

# ---------------------------------------------
# Uninitialized Data
# ---------------------------------------------

   .comm return, 2, 2       // tail call subroutine return address
   .comm cntr, 2, 2         //
   .comm aux0, 8, 8         // reserved for auxiliary variable
   .comm aux1, 8, 8         // reserved for auxiliary variable
   .comm aux2, 8, 8         // reserved for auxiliary variable
   .comm result, 8, 8       // reserved for result
   .comm dstack, 128, 8     // data stack, 128 bytes, 8 byte aligned, grows upward
   .comm pstack, 1024, 64   // program stack, 128 bytes, 64 bytes aligned, grows downward


(I also posted it on the github account as "Program/squareroot5L.s", as well as an update to the "Docs/g6a-risc.pdf" file with the description of the modified shift instructions) https://github.com/John-Lluch/G6A-RISC

The code is now inspired on the old HP Calculators algorithm by William E. Egbert mentioned earlier, made to compute an 11 digit mantissa, so that full 10 digit precision will be obtained after rounding and floating point normalisation is implemented.

According to the code above, the root of +2.0 is +1.4142135623 which -if I am going to trust my smartphone calculator app- it is accurate up to the very last digit of the 11 provided. All numbers that I tried worked.

The code above takes on average 55 instruction cycles per digit -compared with about 100 cycles per digit of the early version that I posted before-. The last digit takes 34 cycles. This totals 584 cycles for the computation of a square root with 11 significant digits. That's pretty good compared with my early attempt which used 600 cycles to compute only 6 digits of it. [Hats off to the old guys at HP]

[edit:typo]


Last edited by joanlluch on Tue Feb 04, 2020 7:19 am, edited 1 time in total.



Mon Feb 03, 2020 7:31 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Nice work! I think there's a lot of good technical content in the old HP journals and of course a fair amount will be relevant to decimal calculators.


Mon Feb 03, 2020 9:05 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
About square roots, well, the exponent is of course calculated by halving the exponent of the input, I assume this is what you refer to.

Yes, that’s basically it. Also, the exponent must be made even, requiring the mantissa to be denormalized by a bit sometimes.

One thing that confuses me is the shift by a digit of the radicand for the odd exponent. I would think it would need to be multiplied by two instead of a digit shift?

_________________
Robert Finch http://www.finitron.ca


Tue Feb 04, 2020 3:46 am
Profile WWW
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
robfinch wrote:
Quote:
About square roots, well, the exponent is of course calculated by halving the exponent of the input, I assume this is what you refer to.

Yes, that’s basically it. Also, the exponent must be made even, requiring the mantissa to be denormalized by a bit sometimes.

One thing that confuses me is the shift by a digit of the radicand for the odd exponent. I would think it would need to be multiplied by two instead of a digit shift?

Hi Rob, I may not be able to provide a proper mathematical explanation for this, but I will try to illustrate it with examples. The reason is that performing decimal square root involve grouping digits in pairs from the decimal point. Consider the square roots of 2, times the powers of 10

Code:
sqrt(2)      == 1.41421...
sqrt(20)     == 4.47213...
sqrt(200)    == 14.1421...
sqrt(2000)   == 44.7213...
sqrt(20000)  == 141.421...
sqrt(200000) == 447.213...
...and so on...

It can be seen that 2, 200, 20000 will have the same mantissa if expressed in normalised decimal floating point. Also, the same is applicable to 20, 2000, 200000. The algorithm assumes a particular digit pairs alignment, so in fact, to get the correct result (mantissa), we need to shift by one digit the inputs that are not properly aligned for the algorithm. The same results expressed in normalised decimal floating point are:

Code:
sqrt(+2.0e+00)   == +1.41421e+00
sqrt(+2.0e+01)   == +4.47213e+00
sqrt(+2.0e+02)   == +1.41421e+01
sqrt(+2.0e+03)   == +4.47213e+01
sqrt(+2.0e+04)   == +1.41421e+02
sqrt(+2.0e+05)   == +4.47213e+02
...and so on...

This reveal another interesting property. If the input exponent is odd, we shift the mantissa one digit left to get the correct square root mantissa from the algorithm. The exponent of the result is just the original one divided by two, ignoring the division remainder, regardless of whether we shifted or not the mantissa.

Does this make sense?


Tue Feb 04, 2020 7:09 am
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 71 posts ]  Go to page Previous  1, 2, 3, 4, 5  Next

Who is online

Users browsing this forum: AhrefsBot, linkfluence and 9 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software