View unanswered posts | View active topics It is currently Sat Apr 27, 2024 10:02 am



Reply to topic  [ 4 posts ] 
 Multipliers 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I was just wondering why multiply and divide operations aren't usually made interruptible ? It seems like they would make them interruptible to reduce the interrupt latency.
All it would take is storing the intermediate state when an interrupt occurs then restoring it on the return.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 01, 2016 10:16 am
Profile WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:

When I looked into implementing an interruptable block move instruction for my core, I found that the internal state that I would need to preserve in order to correctly restart the instruction was quite extensive. So to conserve logic resources, I opted to provide a way for the programmer to control the interrupt latency the using the block move instruction would introduce. Thus, there is a single transfer mode where the block move instruction is interruptable after every transfer. The penalty for using this mode is that a conditional branch must be used by the programmer to continue the transfer. The benefit of this mode is that no special internal structures or microcode sequences are needed to provide low latency interrupt handling.

The block move instruction also operates in an uninterruptable manner. In this mode, the length of the transfer will determine the interrupt latency. The programmer can choose to use block size that provides the interrupt latency he/she/it :( is willing to tolerate.

Thus, for my core to support interruptable, high cycle count instructions a significant increase in the complexity of the control microprogram and internal structures would have been required to hold the processor state to allow the instruction to be restarted following the return from the ISR. Depending on the multiplier/divider (floating point or integer), I can see a similar problem arising. One solution I would consider would be to partition the instructions into separate interruptable, low latency instructions that only accomplish a portion of the whole uninterruptable instruction, so that the multiplication/division process is interruptable at clearly defined points, and easily supported by the processor interrupt service process.

_________________
Michael A.


Thu Sep 01, 2016 11:42 am
Profile

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
I have not sharpened my pencil to look into the details of a hardware divide, but multiply can be done without any clocking, so the output is formed just as fast as the propagation delays allow. It may take longer than a single system clock cycle, but not dozens. I'm not sure there's any quick way to do a divide though. A possibility to look into is whether you can invert (so it's always specifically 1 divided by...), and then multiply by that inverse. That's what my large look-up tables for 16-bit scaled integer math are geared for. Each of the 65,536 inputs gives a pre-calculated 32-bit output, so you can use the bytes you want without losing accuracy at either end. Invert, then use other tables to speed up the multiply.

If speed and low interrupt latency is important, is it feasible to make the hardware do some factoring before starting the divide? Or how about using different instructions, or at least different paths, for different precisions, so you're not going for maximum precision when you don't need to?

I think everyone here knows how the block-move instructions work on the 65816. It's interruptible, because the instruction repeats itself for every byte it does, re-fetching itself and the source and destination banks every time, so it takes seven clocks per byte. So it's kind of slow, but because of the way I use interrupts sometimes (for audio sampling), I like the fact that no instruction is more than 9 clocks. 20 or 50 or 100 or 1000 would not be acceptable.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Thu Sep 01, 2016 8:15 pm
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
I have not sharpened my pencil to look into the details of a hardware divide, but multiply can be done without any clocking

It still takes several clocks cycles for a multiply to complete at a high clock frequency compared to other logic. Multi-cycle paths have to be specified with appropriate waiting in the pipeline. The multiply is one of the things that shows up sometimes as being on the critical path for timing. I've broken the multiply instruction up into several partial products sums that take five or six cycles to complete (using multipliers in DSP slices). It gets tricky, I'd prefer to go with a bit simpler bit-by-bit multiplier. Divide can be done the same way a multiply is without clocking, but it is much slower (I'd say 4 to 5 times) because of the conditional logic and additional multiplexing. It's so much slower that it's maybe faster to do a clocked Newton's divide. I've got a divider that processes four bits at a time using cascaded logic completing in a about 10 clock cycles for a really slow 32 bit core. If you don't have to worry about the clock frequency a single cycle divider would be fine.
Quote:
I think everyone here knows how the block-move instructions work on the 65816....

It's one thing that could be improved on the 65816. It could have an interruptible block move without re-fetching the instruction every time. It could be reduced probably to three or four cycles. The hardware's a bit trickier to implement because the interrupt status would have to be tested in an additional location.
Quote:
That's what my large look-up tables for 16-bit scaled integer math are geared for...

Lookup tables are an awesome way to do some of the mathematics very fast. They get used in hardware in place of arithmetic functions like divide. I used a reciprocal lookup table and a multiply instead of a divide for calculating character positions in a text controller.

A software routine like the following could be used to perform a multiply. It has a couple of oddball instructions to speed things up. ADDOD adds conditionally based on whether or not the LSB of a register is set. Avoids branching in the multiply. It only take six clocks per bit to do the multiply in software.
Code:
; computes signed product of r1 and r2 and returns it in
; r3(low) and r4(high)
;
    push  r5,r6,r10
    mov   r3,r0         ; zero out product
    mov   r4,r0
    ldi   r6,#64        ; 64 bits to process
    eor   r10,r1,r2     ; compute sign of result
    abs   r1,r1         ; take absolute value of operand
    abs   r2,r2
j1:
; M1
    shl   r5,r4,#63     ; shift LSB to MSB
    shr   r4,r4,#1      ; shift product to right
    shr   r3,r3,#1      ;
    or    r3,r3,r5      ; set MSB of LSW
    addod r4,r4,r2,r1   ; add r2 to r4 if r1 is odd
    shr   r1,r1,#1      ; shift op over to next bit
; M2
    add   r6,r6,#-1     ; decrement loop count
    bne   r6,r0,j1

    bge   r10,r0,j2
    ; to make negative, take one's complement
    ; then increment.
    eor   r5,r3,#-1     ; take one's complement
    eor   r6,r4,#-1
    add   r3,r5,#1      ; increment LSW
    caci  r10,r3,r5,#1  ; compute carry from addition
    add   r4,r6,r10     ; add carry to MSW
j2:
    pop   r10
    pop   r6
    pop   r5
    rts

_________________
Robert Finch http://www.finitron.ca


Thu Sep 01, 2016 10:50 pm
Profile WWW
Display posts from previous:  Sort by  
Reply to topic   [ 4 posts ] 

Who is online

Users browsing this forum: No registered users and 24 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software