Last visit was: Tue May 17, 2022 1:45 pm It is currently Tue May 17, 2022 1:45 pm

 Page 1 of 1 [ 12 posts ]
Integer arithmetic 'c' library ?
Author Message

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I am looking for a suitable collection of functions for integer multiplication, division and remainder. I suppose I could create them myself, but in this particular case I do not fancy reinventing the wheel. (As I'm already re-inventing a bigger wheel ),

As a starting point, I am looking for a 'c' implementation of the same, and found some code in the gnu open source project. But unfortunately, everything is 32 bit based, and although it's easy to just use the same code for 16 bit values, then I can't find anything that would work on 32 bit values for a 16 bit architecture.

At some point I thought at researching the PDP/11, but everything that I found online corresponds to a late version that already had hardware arithmetic, so no useful software functions.

Any pointers would be appreciated. (I may eventually ask the same question on the 6502 forum, as I suppose at some point there must have been the need for such a library)

Joan Lluch

Sat Jun 08, 2019 5:10 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1690
Interesting question. I would look at multiple-precision arithmetic libraries. But I fear that the best libraries will be optimised for the extreme cases of very large numbers. Or will be far too general in their interface.
http://www.ccs.neu.edu/home/ahchan/wsl/ ... Lib/MPLib/
https://gmplib.org/
http://www.drdobbs.com/parallel/using-t ... /184409477

But if you have for example 32 bit code which works for 64 bit calculations, won't that just work as 16 bit code which works for 32 bit calculations? (Mutatis mutandis, as we used to say.)

Sun Jun 09, 2019 5:21 pm

Joined: Fri Aug 01, 2014 3:00 pm
Posts: 23
There are a bunch of algorithms for both multiplication and division. I compared several of the ones for multiplication and found Quarter Squares to be fastest for BCD multiplication on the 6502 (it's probably fastest for binary multiplication too) (link). Later, I ran a similar comparison on the MSP430 and Quarter Squares was about 10% slower than the Russian Peasant algorithm (link). Which one works best will be highly dependent on the architecture you're designing. I would try it 5 or 10 different ways to figure out which one is best. Integer multiplication is not too hard to do in assembly so it shouldn't take you very long to find the best algorithm.

Mon Jun 10, 2019 3:11 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Druzyek wrote:
There are a bunch of algorithms for both multiplication and division. I compared several of the ones for multiplication and found Quarter Squares to be fastest for BCD multiplication on the 6502 (it's probably fastest for binary multiplication too) (link). Later, I ran a similar comparison on the MSP430 and Quarter Squares was about 10% slower than the Russian Peasant algorithm (link). Which one works best will be highly dependent on the architecture you're designing. I would try it 5 or 10 different ways to figure out which one is best. Integer multiplication is not too hard to do in assembly so it shouldn't take you very long to find the best algorithm.

So far, these are the ones that I found more suitable for word multiplication and division (on a 16 bit CPU)

MULTIPLICATION
Code:
unsigned int __mulhi3 (unsigned int a, unsigned int b)
{
unsigned int r = 0;

while (a)
{
if (a & 1)
r += b;
a >>= 1;
b <<= 1;
}
return r;
}

DIVISION
Code:
typedef      int si_int;
typedef unsigned su_int;
su_int __udivsi3(su_int n, su_int d)
{
const unsigned n_uword_bits = sizeof(su_int) ;
su_int q;
su_int r;
unsigned sr;
/* special cases */
if (d == 0)
return 0; /* ?! */
if (n == 0)
return 0;
sr = __builtin_clz(d) - __builtin_clz(n);
/* 0 <= sr <= n_uword_bits - 1 or sr large */
if (sr > n_uword_bits - 1)  /* d > r */
return 0;
if (sr == n_uword_bits - 1)  /* d == 1 */
return n;
++sr;
/* 1 <= sr <= n_uword_bits - 1 */
/* Not a special case */
q = n << (n_uword_bits - sr);
r = n >> sr;
su_int carry = 0;
for (; sr > 0; --sr)
{
/* r:q = ((r:q)  << 1) | carry */
r = (r << 1) | (q >> (n_uword_bits - 1));
q = (q << 1) | carry;
/* carry = 0;
* if (r.all >= d.all)
* {
*      r.all -= d.all;
*      carry = 1;
* }
*/
const si_int s = (si_int)(d - r - 1) >> (n_uword_bits - 1);
carry = s & 1;
r -= d & s;
}
q = (q << 1) | carry;
return q;
}

I do not know which algoriths correspond to them though.

The multiplication function signature is for unsigned ints, but it seems to be used anyway for signed ints by compilers, so I'm unsure about whether this is correct.

The division function relies on a "clz" assembly instruction that is not available on my architecture, but it can be implemented as a function. This obviously will hurt performance. On the other hand, my compiler is highly efficient and most of the time I do not feel the need to replace its output by hand coded assembly, but of course it can be done to squeeze the last bit of performance.

However, rather than starting studying and implementing myself the several possible algorithms, I would want to find them already implemented in "C" code, which so far does not seem to be among the top results in my google searches.

Joan Lluch

Mon Jun 10, 2019 3:51 pm

Joined: Fri Aug 01, 2014 3:00 pm
Posts: 23
Wikipedia would be a good place to start: Multiplication algorithm. The multiplication example you posted is the Russian Peasant algorithm. I suppose you could code them in C instead of assembly to check performance. I was just thinking that multiplication is so common it would be one of the very few things it might be worth hand coding in assembly. You should be able to find C or pseudo code examples of all the algorithms on the Wikipedia page.

Mon Jun 10, 2019 4:00 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Druzyek,

Well, Wikipedia is always the first place to look isn't it. However I found this particular article too theoretical, so that's why I was searching for already cooked implementations.

Just as a matter of completeness, the code that I posted above for multiplication gets compiled as shown below. I am posting my compiler progress and instruction set in the "Yet another 74xx processor" thread in the "projects" section. Assembly mnemonics are quite straightforward, but it's important to know that the destination operand is always the last one:
Code:
.globl   __mulhi3
__mulhi3:
mov   r0, r2
mov   0, r0
jmp   Lbl_BB2_2
Lbl_BB2_1:
mov   r2, r3
and   r3, 1, r3
neg   r3
and   r3, r1, r3
add   r3, r0, r0
lsl   r1
lsr   r2
Lbl_BB2_2:
cmp   r2, 0
brne   Lbl_BB2_1
ret

The compiler is able to avoid a conditional jump for the lower bit check by playing with 'negation' and 'and'. I think the only way to improve this code by hand is to remove the 'cmp r2,0' instruction after the loop body, as the last 'lsr r2' already sets the Zero flag, and by inserting a cmp r2,0 instruction just before the 'jmp Lbl_BB2_2, so that the Zero flag gets updated on the first iteration. Something like that:

Code:
.globl   __mulhi3
__mulhi3:                       # on entry: r0 is a; r1 is b;
mov   r0, r2                 # r2 = a;
mov   0, r0                  # r0 is r
cmp   r2, 0                  # set the Z flag in preparation for the first loop iteration
jmp   Lbl_BB2_2
Lbl_BB2_1:
mov   r2, r3                 # r3 = a;
and   r3, 1, r3              # r3 = r3 & 1;   [0, 1] depending on bit0 of a
neg   r3                     # r3 = 0 - r3;   [0, allOnes]
and   r3, r1, r3             # r3 = r3 & b;   [0, b]
add   r3, r0, r0             # r = r + r3;    [r, r+b], this is the result after the if statement
lsl   r1                     # b << 1;
lsr   r2                     # a >> 1;
Lbl_BB2_2:
brne   Lbl_BB2_1             # loop again if a != 0;
ret                          # return result in r0

I can't think on any way to do it better, provided that we stick with this algorithm.

Last edited by joanlluch on Mon Jun 10, 2019 7:27 pm, edited 1 time in total.

Mon Jun 10, 2019 5:11 pm

Joined: Fri Aug 01, 2014 3:00 pm
Posts: 23
I agree that Wikipedia can be hard to understand. It seems like it's written for people who already understand it. You could try googling each algorithm you find on the Wikipedia page individually to find a better explanation. This page was really helpful for trying to figure out the Russian Peasant algorithm: link.

I see that the bnre instruction is used for the while loop but what part of the code is performing the if statement?

Last edited by Druzyek on Mon Jun 10, 2019 7:53 pm, edited 1 time in total.

Mon Jun 10, 2019 5:59 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Thanks for the link. I have yet to figure out whether this algorithm works for negative integers too.

The if statement and inner addition is performed by the sequence of instructions from "mov r2, r3" to "add "r3, r0, r0". I edited my post above with added comments, maybe it's clearer now. The square brackets represent the list of possible values at every step. The compiler replaces the 'if' by this sequence of instructions to avoid branching code.

Mon Jun 10, 2019 7:32 pm

Joined: Fri Aug 01, 2014 3:00 pm
Posts: 23
That is really neat! I don't think I would come up with that writing assembly by hand.

One thing to consider if you are deciding between algorithms is that the one you are using now does not require a lookup table, while Quarter Squares does. For that reason you might want to stick with it even if it's slower (although it might not be).

EDIT: I just realized that you might be able to make it slightly faster by doing the shift first and checking the bit shifted into the carry rather than ANDing with 1. It's maybe not clear from the C code that you can do it that way too.

Mon Jun 10, 2019 8:07 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Druzyek wrote:
EDIT: I just realized that you might be able to make it slightly faster by doing the shift first and checking the bit shifted into the carry rather than ANDing with 1. It's maybe not clear from the C code that you can do it that way too.

Yes, I considered that too. The problem with this approach is that I need to insert an additional instruction after the shift to actually 'convert' the carry flag into something useable. My instruction set has conditional "set" and "select" instructions that are meant to set values out of flags, while avoiding jumps. But the same neg, and, add logic is required afterwards or before, to get the desired result. So the conditional 'set' is already one more instruction. Then, after the loop I can no longer remove the 'cmp' instruction, so that's yet another one that must be there. Total balance is 2 more instructions being executed on each iteration.

I suppose that using the 'carry' flag after the shift would be profitable if the inner branch (the 'if) could not be avoided, because the shift would possibly replace a 'cmp' instruction, however in the current case it doesn't seem to be an improvement.

Mon Jun 10, 2019 8:55 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
I would look at multiple-precision arithmetic libraries. But I fear that the best libraries will be optimised for the extreme cases of very large numbers. Or will be far too general in their interface.

To my understanding, multi-precision integer arithmetic routines are not suitable for my purposes, because they assume existing arithmetic capabilities on the processor for the common sizes. What I need is the basic implementations for the common sizes, as my processor ALU will not do more than additions, subtractions and logical operations.
BigEd wrote:
But if you have for example 32 bit code which works for 64 bit calculations, won't that just work as 16 bit code which works for 32 bit calculations? (Mutatis mutandis, as we used to say.)
You are right, I suppose that by replacing everything 32 bit by 16 bit and everything 64 bit by 32 bit the function code can be left essentially the same. I shall try that.

Mon Jun 10, 2019 9:09 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1690
If you consider your 32 bit inputs as being 8-digit values in base 16, instead of 32 bits in base 2, you see another way to tackle the problem. The per-digit action is now a 4x4 multiplication, not a 1x1. Although your CPU may not have a 4x4, that's a suitable size for a small lookup. There are of course other stopping-off points. (And there may well be divide-and-conquer strategies, which are more applicable when the CPU lacks a fast multiply.)

I think, for twos-complement inputs, the solution always seems to be to convert to two positives, multiply, then adjust the sign.

Tue Jun 11, 2019 7:58 am
 Page 1 of 1 [ 12 posts ]

#### Who is online

Users browsing this forum: CCBot, PetalBot and 0 guests

 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot post attachments in this forum

 Jump to:  Select a forum ------------------ General Discussions Newbies Software    General programming    Languages and tools    Kernels and operating systems Hardware    Hardware in general    CPU/MCU choices and designs    Implementation and Construction Programmable logic Simulation and emulation Nostalgia Projects Anycpu.org