 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Fixed up some display bugs in the assembler. Code and data display is improved.
Instruction formats for v3 look very similar to v2. Bit 7 of the v3 instruction determines the instruction size as 18 or 36 bits. So the opcode field is really only 7 bits. Missing in v3 vs v2 are branch prediction bits. They aren’t needed for a barrel processor.
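A minimal sketch of how a decoder or emulator might read that size bit. It assumes the opcode sits in bits 0 to 6 and that a set bit 7 selects the 36-bit form; the post states neither the polarity nor the field position, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed layout: bits 0-6 = opcode, bit 7 = size select.
   Assumption: bit 7 set means a 36-bit instruction, clear means 18-bit. */
static unsigned ft64v3_insn_bits(uint64_t insn)
{
    return ((insn >> 7) & 1u) ? 36u : 18u;
}

/* Only seven bits remain for the opcode once bit 7 is used for the size. */
static unsigned ft64v3_opcode(uint64_t insn)
{
    return (unsigned)(insn & 0x7Fu);
}
```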

Attachment:
File comment: FT64v3 Instruction Formats (uncompressed)
FT64v3Insn.png

_________________
Robert Finch http://www.finitron.ca


Sun Jul 15, 2018 4:09 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Always interesting to see the architectural space explored. Do you have an idea of what's presently limiting clock speed? I don't have any feeling for how that might work out for a barrel processor.


Sun Jul 15, 2018 8:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I don't have an idea at the moment what's limiting clock speed. I do know that the floating point hardware has a low clock rate. It tries to do a lot in a single cycle. I do see that a lot of the dependency checking logic is not present in a barrel processor making the processor correspondingly smaller than a non-barrel processor. Smaller is better for routing.

The shift reg / mux for instruction alignment may come into play at some time. Some designs have byte aligned instructions, so aligning instructions on bit-pairs is only two more levels of logic. Additional levels of logic in a schematic do not necessarily translate into additional delay in an FPGA. Logic is treated in groups for muxes and it depends on whether or not additional resources are allocated by the synthesizer. Shifting would be by 2,4,8,16,32,64,128,256,512 to get bit pair alignment out of a 512+ bit cache line. To get 16-bit alignments it requires a good portion of the same hardware.
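The stage counts behind that comparison can be checked with a small sketch. It assumes a logarithmic shifter with one mux level per power-of-two shift amount, and a 1024-bit shift window (inferred from the largest shift of 512; the post only says "512+"). The function name is hypothetical.

```c
/* Number of shift-by-power-of-two stages needed to align data to
   `granule_bits` inside a window of `window_bits` (both powers of two). */
static unsigned shifter_stages(unsigned window_bits, unsigned granule_bits)
{
    unsigned stages = 0;
    for (unsigned step = granule_bits; step < window_bits; step <<= 1)
        stages++;   /* one mux level per shift amount: granule, 2*granule, ... */
    return stages;
}
```

Under these assumptions bit-pair alignment needs 9 stages (2 through 512), byte alignment needs 7, and 16-bit alignment needs 6, which matches the "two more levels" and "a good portion of the same hardware" remarks.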

Unfortunately I still haven't been able to get Vivado working to do testing. I don't have a supported OS to run it on. It almost, but doesn't quite work on Windows 10 Home. It works in Windows 10 Pro, but I've yet to save up the money for an upgrade. At the same time, I figure if I have to spend money maybe I'll look into a different OS. I've been waiting months hoping that a patch to the OS or to Vivado will cause it to work. No luck so far.

_________________
Robert Finch http://www.finitron.ca


Mon Jul 16, 2018 1:39 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’ve been working mainly on the software emulator for the FT64v3 core. It’s a port of the FT64 emulator modified for v3. The 36/18 bit instruction size presents some challenges as the emulator was originally written to use a 32 bit int for instructions. This means a switch to 64 bit integers in many places. The instruction decode for v3 is completely different than v1 so a lot of switch/case/if/else statements are changing for instruction execution and decode / disassembly.
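As an illustration of why 32-bit ints stop being enough, here is a sketch in C of fetching an 18- or 36-bit instruction from an arbitrary bit address into a 64-bit integer. This is not the emulator's code; the little-endian bit numbering and the helper name are assumptions.

```c
#include <stdint.h>

/* Read `nbits` (1..36) starting at bit address `bit_addr` from a
   little-endian byte image. At least 6 bytes must be readable from the
   byte that contains `bit_addr`. */
static uint64_t fetch_bits(const uint8_t *mem, uint64_t bit_addr, unsigned nbits)
{
    uint64_t byte = bit_addr >> 3;               /* containing byte */
    unsigned shift = (unsigned)(bit_addr & 7);   /* bit offset inside it */
    uint64_t raw = 0;

    /* 6 bytes = 48 bits, enough for a 36-bit field at any offset 0..7 */
    for (unsigned i = 0; i < 6; i++)
        raw |= (uint64_t)mem[byte + i] << (8 * i);

    return (raw >> shift) & ((1ULL << nbits) - 1);
}
```

A 36-bit instruction starting at bit offset 7 spans 43 bits of the image, so the value has to be assembled in something wider than 32 bits before masking.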
For some reason the openFileDialog() method of Windows Forms sometimes fails to open when I try to select an Intel hex file. It seems to hang inside the Windows method; I never had trouble with this before.

_________________
Robert Finch http://www.finitron.ca


Sat Jul 21, 2018 3:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The ISA is on the verge of instructions containing variable numbers of bits. There are already two sizes, 18 and 36 bits, for integer instructions. Floating point and vector instructions require more bits, adding two more sizes: 40 and 44 bits. Previously, for a superscalar processor, varying the instruction size was a headache because the size was needed during the fetch phase to determine where the next instruction was. However, with a barrel processor the size is not needed during the fetch phase. Given the varying instruction sizes, implementing immediates of varying size is reasonable.

I would like to see at least two more bits for branch displacements. The current 13-bit displacement allows for branching only +/- 1kB. This is probably good for 90% of branches. Four more bits resulting in a 40-bit instruction, would provide just about 100% coverage.
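The reach figures can be reproduced with a short calculation. The sketch assumes a signed displacement counted in bit-pair units (2 bits), which is consistent with the +/- 1kB figure for 13 bits but is not stated outright in the post; the function name is hypothetical.

```c
#include <stdint.h>

/* One-direction reach, in bytes, of a signed displacement that is
   `disp_bits` wide and counted in units of `unit_bits` bits. */
static double branch_reach_bytes(unsigned disp_bits, unsigned unit_bits)
{
    uint64_t units = 1ULL << (disp_bits - 1);   /* magnitude of a signed field */
    return (double)(units * unit_bits) / 8.0;
}
```

With these assumptions 13 bits reach 1024 bytes, two more bits reach 4096 bytes, and four more bits reach 16384 bytes.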

Compressed Instructions: 18 bits
Integer Instructions: 36 bits
Float Instructions: 40 bits (similar to integer format, but with round mode field)
Vector Instructions: 44 bits (similar to float but needs mask register number field)

Why not just go all-out variable, I wonder?

_________________
Robert Finch http://www.finitron.ca


Sun Jul 22, 2018 2:59 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modified how constants are handled. Rather than build up a constant using multiple fixed-length instructions, a three-bit field now determines the length of the constant, which is included as part of the instruction. Because the entire constant is included as part of the instruction, there are correspondingly fewer instructions processed. At the same time, the average instruction length is much larger. These two factors combine to result in a lower compression ratio when compressed instructions are present. Strangely, the program is shorter and faster, but has worse statistics. However, rather than looking at a compression ratio, looking at the ratio of instructions converted to compressed versions reveals a better statistic.

Statistics from recent assembly:
number of bytes: 93040.500000
total number of instructions: 24478
number of compressed instructions: 14992
3.800985 bytes (30 bits) per instruction
61.2 % of instructions were converted to compressed versions.
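The derived figures in this report follow directly from the three raw counts. A small sketch recomputes them; the last function additionally assumes that every compressed instruction is exactly 18 bits (2.25 bytes), which is an inference from the format list rather than something the report prints.

```c
/* Raw figures copied from the assembler report above. */
static const double   total_bytes      = 93040.5;
static const unsigned total_insns      = 24478;
static const unsigned compressed_insns = 14992;

static double bytes_per_insn(void)
{
    return total_bytes / total_insns;
}

static double percent_compressed(void)
{
    return 100.0 * compressed_insns / total_insns;
}

/* Average size of the non-compressed instructions, assuming each
   compressed instruction occupies 18 bits = 2.25 bytes. */
static double avg_uncompressed_bytes(void)
{
    return (total_bytes - compressed_insns * 2.25)
           / (total_insns - compressed_insns);
}
```

The average works out to about 3.80 bytes (roughly 30.4 bits) per instruction with 61.2% compressed. Under the 18-bit assumption the remaining instructions average about 6.25 bytes each, which shows how much the inline constants lengthen them.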

Found an error in the compiler. The peephole optimization of an indexing hint wasn’t checking for indexed address mode of the instruction. This caused a shift preceding some memory operations to be optimized away when it shouldn’t be.
The optimization, which should be applied only to indexed addresses, was:
Code:
   // hint #9
   // Index calc.
   //      shl r1,r3,#3
   //      sw r4,[r11+r1]
   // Becomes:
   //      sw r4,[r11+r3*8]
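A sketch of the corrected check in C. The instruction record, enum names, and the scale limit of 8 are hypothetical and much simpler than whatever CC64 really uses; the point is only that the fold is rejected unless the memory operation is in indexed address mode.

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction record (not CC64's real structure). */
enum op    { OP_SHL, OP_SW, OP_OTHER };
enum amode { AM_REG_DISP, AM_INDEXED };

struct insn {
    enum op    op;
    int        rd, ra, rb;  /* shl: rd = ra << imm;  sw: store rd to [ra + rb*scale] */
    int        imm;
    enum amode mode;
    int        scale;
    bool       dead;        /* marked for removal by the peephole pass */
};

/* Fold "shl rX,rY,#n ; sw rV,[rB+rX]" into "sw rV,[rB+rY*(1<<n)]".
   A real pass must also prove rX is not used afterwards; omitted here. */
static bool fold_index_shift(struct insn *shl, struct insn *mem)
{
    if (shl->op != OP_SHL || mem->op != OP_SW) return false;
    if (mem->mode != AM_INDEXED) return false;     /* the check that was missing */
    if (mem->rb != shl->rd) return false;          /* index must be the shift result */
    if (mem->rd == shl->rd || mem->ra == shl->rd) return false;
    if (shl->imm < 0 || shl->imm > 3) return false; /* assumed scales: 1,2,4,8 */

    mem->rb = shl->ra;
    mem->scale = 1 << shl->imm;
    shl->dead = true;
    return true;
}
```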

_________________
Robert Finch http://www.finitron.ca


Mon Jul 23, 2018 1:41 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’m going in refinement circles now. Leaving FT64v3 and going back to the original FT64 to add onto it.

One of the additional features for vector instructions is SIMD operations. The vector register can be processed as up to four independent lanes. With vector SIMD operations things almost begin to look like an array processor. 32 vectors, times 63 elements, times 4 lanes.

To support additional vector functionality, it is necessary to modify the fetch unit and instruction cache as vector instructions are 40 bits in size. It breaks the uniform 32-bit instruction set of v1.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 25, 2018 4:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Allowing 40-bit instructions means there are two fewer bits for branch displacements; the two LSBs can’t be assumed to be zero anymore. Also, branch instructions didn’t take into account the precision of the operation. An additional three-bit field for precision needs to be included. To add the additional information to the branches and get back the two bits of displacement, branches were also converted to 40-bit instructions. Now, with 20% of instructions requiring 40 bits, I decided to see what I could do with the rest of the instructions if they were all 40 bits. Having a few extra bits in most instructions meant that the register specifier fields could be expanded.

Attachment:
File comment: FT64v4 Instruction Formats
FT64v4 Instruction Formats.png

_________________
Robert Finch http://www.finitron.ca


Last edited by robfinch on Sat Jul 28, 2018 3:10 am, edited 1 time in total.



Thu Jul 26, 2018 5:09 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
It seems to me a tradeoff between the flexibility of branch instructions and the distance they can cover. With lots of bits dedicated to modes, you get a shorter displacement.

But it's not clear that the best answer will involve all branches being 'short' - so long as the most frequently encountered ones are self-contained, there should be relatively little performance penalty in the others needing to branch over a jump. Similarly for density: if the most common branches in the code are short, you get most of the benefit.

Is your quest for far-enough-branches based on the difficulty of having the assembler choose the right form?


Thu Jul 26, 2018 5:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is your quest for far-enough-branches based on the difficulty of having the assembler choose the right form?

That is one factor, although the syntax is fairly straightforward. The instruction spec can take a size specifier to bypass the default word size. A prediction indicator can be added after the label.
Code:
bne.b r1,r2,alabel    ; branch based on a byte comparison

Part of the problem was adding precision to the branches, which takes up three additional bits. That left only a seven-bit displacement field in a 32-bit instruction. With only seven bits, a large percentage of branches would be converted into branch-around-jump sequences, which would reduce the code density. The code density would be reduced by just as much as (or more than) making the instruction 40 bits (25%), and the two-instruction sequence would be slower. The precision field selects an 8-, 16-, 32-, or 64-bit compare during the branch. Without the precision control, values would have to be masked or extended with additional instructions before the branch.
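What the precision field saves can be sketched in C for the not-equal case. The enum names are hypothetical and the actual three-bit encoding is not given in the post; ordered compares would additionally need sign or zero extension, which is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical precision selector; the value is the compare width in bits. */
enum prec { PREC_BYTE = 8, PREC_HALF = 16, PREC_WORD32 = 32, PREC_WORD64 = 64 };

/* Keep only the low `p` bits of a register value. */
static uint64_t low_bits(uint64_t v, enum prec p)
{
    return (p == PREC_WORD64) ? v : (v & ((1ULL << p) - 1));
}

/* Models bne.<size>: taken when the operands differ at the chosen width.
   Without the precision field, the two low_bits() maskings would be
   separate instructions ahead of a full-width branch. */
static bool bne_taken(uint64_t a, uint64_t b, enum prec p)
{
    return low_bits(a, p) != low_bits(b, p);
}
```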


An issue with this ISA is that the precision control is explicit, rather than implicit, to allow things like SIMD operations. Making it explicit takes up bits in the instructions. In other designs where things are implicitly controlled, for example where values are automatically extended to the register width on loads (or on calculated results), full-width comparisons can be used for branches. Some ISAs have separate instructions for SIMD and regular calculations. Given a desire for fixed-size instructions, and instructions being 40 bits, replicating the instruction set with a group of non-SIMD operations would be redundant. So a second solution would be to have variable-width instructions and redundant forms.

_________________
Robert Finch http://www.finitron.ca


Thu Jul 26, 2018 1:41 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
With extra bits available, a number of 2R forms have been converted to 3R forms. Basic logic operations such as ‘and’ now bitwise-AND three operands together rather than two and store the result in the target register. Additional operations like MIN and MAX determine the minimum or maximum of three registers. The benefit of 3R forms is very slight.
I’ve rearranged the ISA formats some more to make the precision field more consistently located.
Having set the ISA formats there’s a ton of documentation that needs to be updated.

_________________
Robert Finch http://www.finitron.ca


Sat Jul 28, 2018 3:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Went nuts and added more bit matrix multiply operations to the ISA. With extra opcode bits available more operations could be supported. So transpose as well as normal operations are now supported.

Musing about exception handling tonight. An issue is how to invoke a local handler from the global exception handling logic.

Suppose a divide by zero exception occurs. The processor will transfer control to a global exception processing routine, which should then go back to the exception handler in the program. What the global routine has to do is examine the current thread to see if it wants to process the exception. If not, a global routine to handle the exception should be invoked. Otherwise control should transfer back to the program’s exception handler. The global routine has to filter the exceptions for the currently running program. For instance, a disk dma or timer exception should not be passed to most applications. But a divide by zero likely would be handled at the application level. This is controlled by the exception cause code for which the processor supports an eight-bit code. Also, user generated exceptions should be supported.
The cause code could be used to index into a bit array to determine whether or not to pass the exception to the local exception handler. I’m inventing a register called golex, standing for global or local exception filter register. Any values of 256 or above are assumed to be user-defined exceptions to be processed locally.
This is more challenging than it seems as there needs to be filtering for each thread of execution in the system. A 256 bit array is only four 64-bit words. But the filtering needs to be extremely fast or ISR time would be impacted. A fast bitfield extract needs to be done based on the cause code. I’m tempted to just add a hardware register in the processor to do this. The four 64-bit words are additional state that would have to be transferred on a thread switch.
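In software, the filter lookup described above amounts to a single bit extract from four 64-bit words. A sketch follows; the struct and function names are hypothetical, and it treats any cause value that does not fit in eight bits as user-defined and therefore local, which is one reading of the rule above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-thread filter state: 256 bits, one per hardware cause code.
   This is the extra state that would be saved on a thread switch. */
struct golex { uint64_t bits[4]; };

/* Returns true when the exception should go to the local handler. */
static bool golex_is_local(const struct golex *g, unsigned cause)
{
    if (cause > 255)
        return true;   /* assumed: user-defined causes are always local */
    return (g->bits[cause >> 6] >> (cause & 63)) & 1u;
}

/* Set or clear the filter bit for one cause code. */
static void golex_set_local(struct golex *g, unsigned cause, bool local)
{
    if (cause > 255)
        return;
    uint64_t mask = 1ULL << (cause & 63);
    if (local)
        g->bits[cause >> 6] |= mask;
    else
        g->bits[cause >> 6] &= ~mask;
}
```

The word select (`cause >> 6`) and bit select (`cause & 63`) are what a hardware viewport register would do in one read.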

Code:
brkrout2:
      ; Read the golex viewport register to determine if the exception
      ; should be handled globally or locally. This viewport = golex indexed by cause code
      csrrd   r1,#GOLEXVP,r0
      ; 0=global, 1=local handling
      beq      r1,r0,.0001      ; branch to global handler
      
      ; now set up to invoke the local handler
      ; load r1,r2 with cause and type
      csrrd   r1,#CAUSE,r0   ; get cause code into r1
      mov      r1:x,r1         ; put into exceptioned register set
      ldi      r2,#45         ; exception type = system exception
      mov      r2:x,r2
      
      ; Return to the exception handler code, not the exception return
      ; point. The exception handler address should be in r60.
      mov      r1,r60:x
      ; Should probably do a quick check for a reasonable return
      ; address here.
      csrrw   r0,#EPC0,r1      ; stuff r60 into the return pc
      sync
      rti                  ; go back to the local code
      
      ; Here global handling of exceptions is done
.0001:
      rti
      

_________________
Robert Finch http://www.finitron.ca


Sun Jul 29, 2018 4:59 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:

Exception handling has always been a pet peeve of mine. There always seems to be a lot of hand waving expended on the subject, but there never seem to be any real solutions. I have always been unsatisfied with exception handling, like divide by 0, in real-time applications. Handling the exceptions in the commonly described manner through exception handlers almost never addresses the need to restart the computation and continue.

The standard approach almost always recommends restarting the application. In the meantime, what is the airframe supposed to do while the controller undergoes a restart? One reason there's a computer in the loop is that the airframe is unstable.

I don't encounter this situation very often, particularly, the dreaded divide by 0 fault. But when I've had to design something that implemented a controller for a large piece of machinery, I've resorted to working out how I would prefer the calculations to continue if that dreaded fault occurred. I resolved the issue by implementing saturation logic and letting the computations continue. I did flag the fault, but I did not let the processor blindly take an exception. The amount of work needed to backtrack the operation was invariably too much.

One thought that I had regarding the divide operation is that you could use some extra bits in the opcode to let the programmer choose the error recovery method. One option would be like the one I've used for my real-time systems, saturation of the result to a maximum/minimum value, and flagging the fault and not generating an exception. (I counted the number of times that the operation resulted in a saturated value, and took action on that.)
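The saturate-and-count approach can be sketched in C for 32-bit signed division. The names are hypothetical, and the choice of result for 0/0 is arbitrary here; a real controller would pick whatever value keeps its loop stable.

```c
#include <stdint.h>

/* Running count of saturated results, to be inspected by the application. */
struct sat_stats { unsigned saturations; };

/* Divide without trapping: on a fault, return a saturated value and
   record the event instead of raising an exception. */
static int32_t div_sat(int32_t num, int32_t den, struct sat_stats *st)
{
    if (den == 0) {
        st->saturations++;
        /* saturate toward the sign of the numerator; 0/0 goes to max */
        return (num < 0) ? INT32_MIN : INT32_MAX;
    }
    if (num == INT32_MIN && den == -1) {   /* the other overflowing case */
        st->saturations++;
        return INT32_MAX;
    }
    return num / den;
}
```

The caller keeps computing with the saturated value and checks `saturations` at a convenient point, for example once per control cycle.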

_________________
Michael A.


Sun Jul 29, 2018 10:04 pm

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
The following is about a Hewlett-Packard calculator so it might not relate directly, but it might sprout some ideas. The HP-41 calculators use flag 25 as the error-ignore flag. You can set it, then if the next relevant instruction produces an error condition, program execution continues but the flag is cleared. You can then test the flag and decide what you want to do. If there was no error condition and the flag is still set, you can clear it if desired so that subsequent operations where stop-on-error behavior might be desired will do so. If flag 25 is clear and there's an error condition, program execution stops and control is returned to the keyboard with an error message. (Then pressing SST, the single-step key, will show what instruction it was on that had the problem.) In the case of a /0, the inputs remain unaffected, so you can take a different course of action with the same inputs for example to return the maximum representable number with the correct sign and keep going. One of my posted programs shows the use of flag 25, at http://wilsonminesco.com/HP-41daytimer.html, although it's about looking up an alarm that may not exist in the list, rather than about /0.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Mon Jul 30, 2018 1:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added the capability of setting the cause code from a register specified in the BRK instruction as well as using an immediate value. This makes implementing the throw() statement easier.

Using the special CC64 type __exception, which represents a system exception, in a throw statement causes a BRK to be generated, rather than the usual branch to the exception handler. The BRK will invoke the system’s break handler, which then in turn may return to the local exception handler via r60. This way it’s possible to mimic hardware exceptions.

A small program to check compiler output for throw operations:
Code:
void testexcept(int a, int b)
{
   if (a)
      throw (__exception)66;
   if (b)
      throw "Hello World";
   printf("Test over");
}


Generates the following assembler:

Code:
   code
   align   16
;====================================================
; Basic Block 0
;====================================================
public code _testexcept:
            sub     $sp,$sp,#32
            sw      $lr,24[$sp]
            sw      $xlr,16[$sp]
            sw      $r0,8[$sp]
            sw      $fp,[$sp]
            ldi     $xlr,#testexcept_10
            mov     $fp,$sp
            sub     $sp,$sp,#0
            sub     $sp,$sp,#16
            sw      $r11,0[$sp]
            sw      $r12,8[$sp]
            lw      $r11,40[$fp]
            lw      $r12,32[$fp]
;    if (a)
            beq     $r12,$r0,testexcept_13
;====================================================
; Basic Block 1
;====================================================
;       throw (__exception)66;
            ldi     $v0,#66
            brk     $v0,#1
testexcept_13:
;    if (b)
            beq     $r11,$r0,testexcept_15
;====================================================
; Basic Block 2
;====================================================
;       throw "Hello World";
            ldi     $v0,#testexcept_0
            ldi     $v1,#20015
            bra     testexcept_10
testexcept_15:
;====================================================
; Basic Block 3
;====================================================
;    printf("Test over");
            sub     $sp,$sp,#8
            ldi     $v2,#testexcept_1
            sw      $v2,0[$sp]
            call    _printf
            add     $sp,$sp,#8
            bra     testexcept_12
testexcept_10:
;====================================================
; Basic Block 4
;====================================================
            lw      $lr,16[$fp]
            sw      $lr,24[$fp]
testexcept_12:
            lw      $r11,0[$sp]
            lw      $r12,8[$sp]
            mov     $sp,$fp
            lw      $fp,[$sp]
            lw      $xlr,16[$sp]
            lw      $lr,24[$sp]
            ret     #32
endpublic



   rodata
   align   16
   align   2
testexcept_1:   ; Test over
   dc   84,101,115,116,32,111,118,101
   dc   114,0
testexcept_0:   ; Hello World
   dc   72,101,108,108,111,32,87,111
   dc   114,108,100,0
;   global   _testexcept
   extern   _printf


Rearranging the instruction set some more to improve bitfield operations, I’ve realized that the bitfield insert operation can’t be done in a single instruction because it requires reading four registers. Otherwise it takes about four instructions to perform this operation:
Code:
LOAD desired field value into a reg
BFCLR clear the bits in the target
SHL field value reg by bit offset
OR field value into target register
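The same four steps written out in C, as a sketch. The function name is hypothetical; its four inputs (target, value, offset, width) correspond to the four register reads that rule out a single-instruction form.

```c
#include <stdint.h>

/* Insert the low `width` bits of `value` into `target` at bit `offset`.
   Requires width >= 1 and offset + width <= 64. */
static uint64_t bitfield_insert(uint64_t target, uint64_t value,
                                unsigned offset, unsigned width)
{
    uint64_t mask  = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    uint64_t field = value & mask;      /* LOAD: desired field value */

    target &= ~(mask << offset);        /* BFCLR: clear the bits in the target */
    field <<= offset;                   /* SHL: move the field into position */
    return target | field;              /* OR: merge into the target */
}
```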

Added an instruction that finds the first one bit within a bitfield.

_________________
Robert Finch http://www.finitron.ca


Mon Jul 30, 2018 4:11 am