robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
Fixed some display bugs in the assembler; code and data display is improved. Instruction formats for v3 look very similar to v2. Bit 7 of a v3 instruction determines the instruction size, 18 or 36 bits, so the opcode field is really only 7 bits. Missing in v3 compared with v2 are the branch prediction bits; they aren't needed for a barrel processor. Attachment: FT64v3Insn.png
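A minimal C sketch of how an emulator might split the size bit from the opcode. The bit positions and polarity (opcode in bits 0-6, bit 7 set meaning a 36-bit instruction) are assumptions for illustration, not taken from the FT64v3 spec:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical v3 decode helpers. Assumed layout: opcode in bits 0-6,
   bit 7 = 1 for a 36-bit instruction, bit 7 = 0 for an 18-bit one. */
static unsigned v3_insn_bits(uint64_t insn)
{
    assert(insn < (UINT64_C(1) << 36));   /* an instruction never exceeds 36 bits */
    return ((insn >> 7) & 1u) ? 36u : 18u;
}

static unsigned v3_opcode(uint64_t insn)
{
    return (unsigned)(insn & 0x7Fu);      /* only 7 bits remain for the opcode */
}
```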
_________________Robert Finch http://www.finitron.ca
Sun Jul 15, 2018 4:09 am
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
Always interesting to see the architectural space explored. Do you have an idea of what's presently limiting clock speed? I don't have any feeling for how that might work out for a barrel processor.
Sun Jul 15, 2018 8:04 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
I don't know at the moment what's limiting the clock speed. I do know that the floating-point hardware has a low clock rate; it tries to do a lot in a single cycle. I also see that much of the dependency-checking logic is not present in a barrel processor, making it correspondingly smaller than a non-barrel processor. Smaller is better for routing.
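Why the dependency checking disappears can be shown with a toy C model of barrel issue. The thread count, pipeline depth and bit-counted PC are assumed values for illustration, not the FT64v3 design:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters for the sketch. */
#define NTHREADS   8
#define PIPE_DEPTH 8

/* If each thread issues at most once every PIPE_DEPTH cycles, its previous
   instruction has left the pipeline before the next one enters, so no
   interlocks or forwarding paths are needed. */
_Static_assert(NTHREADS >= PIPE_DEPTH, "fewer threads than stages needs hazard logic");

typedef struct {
    uint64_t pc[NTHREADS];   /* one program counter per hardware thread */
    unsigned cycle;
} Barrel;

/* Issue one instruction: threads take turns in strict rotation. */
static unsigned barrel_issue(Barrel *b, unsigned insn_bits)
{
    assert(insn_bits == 18 || insn_bits == 36);
    unsigned t = b->cycle % NTHREADS;
    b->pc[t] += insn_bits;   /* PC counted in bits here, an assumption */
    b->cycle++;
    return t;
}
```

Each thread sees the pipeline as if it were unpipelined, which is where the area saving comes from.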
The shift register / mux for instruction alignment may come into play at some point. Some designs have byte-aligned instructions, so aligning instructions on bit pairs is only two more levels of logic. Additional levels of logic in a schematic do not necessarily translate into additional delay in an FPGA: logic is grouped together for muxes, and it depends on whether or not the synthesizer allocates additional resources. Shifting would be by 2, 4, 8, 16, 32, 64, 128, 256 and 512 bits to get bit-pair alignment out of a 512+ bit cache line. 16-bit alignment requires a good portion of the same hardware.
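The staged shifter described above can be modelled in C as one conditional shift per bit of the alignment offset, each stage standing in for one mux level. The 576-bit buffer (a 512-bit line plus spill-over) and the 36-bit extraction are assumptions for the sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 9   /* 576 bits: 512-bit line plus room to straddle it (assumed) */

/* Shift a little-endian multi-word bit vector right by n bits, in place. */
static void shr_bits(uint64_t v[LINE_WORDS], unsigned n)
{
    unsigned w = n / 64, b = n % 64;
    for (unsigned i = 0; i < LINE_WORDS; i++) {
        uint64_t lo = (i + w < LINE_WORDS) ? v[i + w] : 0;
        uint64_t hi = (i + w + 1 < LINE_WORDS) ? v[i + w + 1] : 0;
        v[i] = b ? (lo >> b) | (hi << (64 - b)) : lo;
    }
}

/* Align a 36-bit instruction starting at bit-pair pair_index.
   Stage k conditionally shifts by 2 << k bits (2, 4, ... 256),
   i.e. one 2:1 mux level per stage in hardware. */
static uint64_t align_insn36(const uint64_t line[LINE_WORDS], unsigned pair_index)
{
    uint64_t v[LINE_WORDS];
    assert(pair_index < 256);   /* bit offset 0..510 within a 512-bit line */
    memcpy(v, line, sizeof v);
    for (unsigned stage = 0; stage < 8; stage++)
        if (pair_index & (1u << stage))
            shr_bits(v, 2u << stage);
    return v[0] & ((UINT64_C(1) << 36) - 1);
}
```

A byte aligner would need only the 8..256 stages; the 2- and 4-bit stages are the two extra levels of logic mentioned above.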
Unfortunately I still haven't been able to get Vivado working to do testing. I don't have a supported OS to run it on. It almost works on Windows 10 Home, but not quite. It works on Windows 10 Pro, but I've yet to save up the money for an upgrade. At the same time, I figure if I have to spend money, maybe I'll look into a different OS. I've been waiting months hoping that a patch to the OS or to Vivado will make it work. No luck so far.
Mon Jul 16, 2018 1:39 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
I’ve been working mainly on the software emulator for the FT64v3 core. It’s a port of the FT64 emulator modified for v3. The 36/18-bit instruction size presents some challenges because the emulator was originally written to use a 32-bit int for instructions, which means switching to 64-bit integers in many places. The instruction decode for v3 is completely different from v1, so many of the switch/case/if/else statements for instruction execution and for decode/disassembly are changing. For some reason the openFileDialog() method of Windows Forms sometimes fails to open to allow an Intel hex file to be selected. It seems to hang in the Windows method; I never had trouble with this before.
Sat Jul 21, 2018 3:53 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
The ISA is on the verge of having instructions with variable numbers of bits. There are already two sizes, 18 and 36 bits, for integer instructions. Floating-point and vector instructions require more bits, adding two more sizes: 40 and 44 bits. Previously, for a superscalar processor, varying the instruction size was a headache because the size was needed during the fetch phase to determine where the next instruction was. With a barrel processor, however, the size is not needed during the fetch phase. Given the varying instruction sizes, implementing immediates of varying size is reasonable.
I would like to see at least two more bits for branch displacements. The current 13-bit displacement allows branching only +/- 1 kB. This is probably good enough for 90% of branches. Four more bits, resulting in a 40-bit instruction, would provide nearly 100% coverage.
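The +/- 1 kB figure is consistent with a 13-bit signed displacement counted in bit-pairs (2^12 pairs x 2 bits = 8192 bits = 1024 bytes). That unit is an inference, not something stated in the post. A small C helper under that assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Reach of a signed branch displacement field, ASSUMING the displacement
   counts bit-pairs (inferred from the 13-bit / +/-1 kB figures). */
static uint32_t branch_reach_bytes(unsigned disp_bits)
{
    assert(disp_bits >= 2 && disp_bits <= 32);
    uint64_t pairs = UINT64_C(1) << (disp_bits - 1);   /* magnitude of signed range */
    return (uint32_t)(pairs * 2 / 8);                  /* 2 bits per pair, 8 bits per byte */
}
```

Under the same assumption, two extra bits give +/- 4 kB and four extra bits give +/- 16 kB.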
Compressed instructions: 18 bits
Integer instructions: 36 bits
Float instructions: 40 bits (similar to the integer format, but with a round mode field)
Vector instructions: 44 bits (similar to float, but needs a mask register number field)
Why not just go all-out variable? I wonder.
Sun Jul 22, 2018 2:59 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
Modified how constants are handled. Rather than building up a constant using multiple fixed-length instructions, a three-bit field now determines the length of the constant, which is included as part of the instruction. Because the entire constant is part of the instruction, correspondingly fewer instructions are processed. At the same time, the instruction length is much larger. These two factors combine to give a lower compression ratio when compressed instructions are present. Strangely, the program is shorter and faster but has worse statistics. However, rather than the compression ratio, the ratio of instructions converted to compressed versions is a better statistic.
Statistics from recent assembly:
number of bytes: 93040.500000
total number of instructions: 24478
number of compressed instructions: 14992
3.800985 bytes (about 30 bits) per instruction
61.2% of instructions were converted to compressed versions.
Found an error in the compiler. The peephole optimization for an indexing hint wasn’t checking that the instruction used the indexed address mode. This caused a shift preceding some memory operations to be optimized away when it shouldn’t have been. The optimization, which should be applied only to indexed addresses, was:
Code:
// hint #9
// Index calc.
//      shl r1,r3,#3
//      sw  r4,[r11+r1]
// Becomes:
//      sw  r4,[r11+r3*8]
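The missing guard can be sketched in C. The operand structure and function name are hypothetical, not CC64's real internals; the point is that the shift may only be folded into a scaled index when the following memory operation actually uses indexed addressing with the shift's destination as its index:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operand model for illustrating hint #9. */
typedef enum { AM_REG, AM_DISP, AM_INDEXED } AddrMode;

typedef struct {
    AddrMode mode;
    int base, index, scale;   /* index/scale meaningful only for AM_INDEXED */
} MemOperand;

/* Try to fold "shl rd,rs,#amt" into the following memory operand.
   The address-mode check is the guard that was missing. A real optimizer
   must also confirm rd is dead afterwards (not modelled here). */
static bool fold_index_shift(MemOperand *m, int rd, int rs, int amt)
{
    assert(m);
    if (m->mode != AM_INDEXED || m->index != rd || m->scale != 1)
        return false;
    if (amt < 0 || amt > 3)   /* scales of 1, 2, 4, 8 assumed */
        return false;
    m->index = rs;
    m->scale = 1 << amt;
    return true;
}
```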
Mon Jul 23, 2018 1:41 pm
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
I’m going around in refinement circles now, so I'm leaving FT64v3 and going back to the original FT64 to add onto it.
One of the additional features for vector instructions is SIMD operations. A vector register can be processed as up to four independent lanes. With vector SIMD operations, things almost begin to look like an array processor: 32 vectors times 63 elements times 4 lanes.
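As an illustration of independent lanes, here is a C sketch of a lane-wise add treating one 64-bit vector element as four 16-bit lanes. The 4x16 split is an assumption, and the code uses the standard SIMD-within-a-register trick of keeping carries from crossing lane boundaries:

```c
#include <stdint.h>

/* Add four 16-bit lanes packed in a 64-bit element; each lane wraps
   independently (no carry from one lane into the next). */
static uint64_t add_4x16(uint64_t a, uint64_t b)
{
    const uint64_t H = UINT64_C(0x8000800080008000);   /* top bit of each lane */
    uint64_t sum = (a & ~H) + (b & ~H);   /* 15-bit sums: carries stop at bit 15 */
    return sum ^ ((a ^ b) & H);           /* fold in the top bits without carry-out */
}
```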
To support additional vector functionality, the fetch unit and instruction cache must be modified, because vector instructions are 40 bits in size. This breaks the uniform 32-bit instruction set of v1.
Wed Jul 25, 2018 4:09 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
Allowing 40-bit instructions means there are two fewer bits for branch displacements: with uniform 32-bit instructions the two LSBs of a branch target could be assumed to be zero, but that can't be assumed anymore. Also, branch instructions didn’t take the precision of the operation into account, so an additional three-bit precision field needs to be included. To add this information to the branches and get back the two bits of displacement, branches were also converted to 40-bit instructions. With 20% of instructions now requiring 40 bits, I decided to see what I could do with the rest of the instructions if they were all 40 bits. Having a few extra bits in most instructions meant that the register specifier fields could be expanded. Attachment: FT64v4 Instruction Formats.png
Last edited by robfinch on Sat Jul 28, 2018 3:10 am, edited 1 time in total.
Thu Jul 26, 2018 5:09 am
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1808
It seems to me there's a tradeoff between the flexibility of branch instructions and the distance they can cover: with lots of bits dedicated to modes, you get a shorter displacement.
But it's not clear that the best answer will involve all branches being 'short' - so long as the most frequently encountered ones are self-contained, there should be relatively little performance penalty in the others needing to branch over a jump. Similarly for density: if the most common branches in the code are short, you get most of the benefit.
Is your quest for far-enough-branches based on the difficulty of having the assembler choose the right form?
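The "branch over a jump" fallback is what assemblers usually call branch relaxation. A minimal C sketch of the per-branch decision, with hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Does a displacement fit a signed field of the given width? */
static bool fits_signed(int64_t disp, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    int64_t lim = INT64_C(1) << (bits - 1);
    return disp >= -lim && disp < lim;
}

typedef enum { BR_SHORT, BR_OVER_JUMP } BranchForm;

/* Short, self-contained form when the target is in range; otherwise the
   inverted condition branches over an unconditional jump. */
static BranchForm choose_branch_form(int64_t disp, unsigned disp_bits)
{
    return fits_signed(disp, disp_bits) ? BR_SHORT : BR_OVER_JUMP;
}
```

A real assembler repeats this pass until no branch changes form, since lengthening one branch can push another out of range.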
Thu Jul 26, 2018 5:27 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
Quote:
Is your quest for far-enough-branches based on the difficulty of having the assembler choose the right form?
That is one factor, although the syntax is fairly straightforward. The instruction spec can take a size specifier to override the default word size, and a prediction indicator can be added after the label.
Code:
bne.b   r1,r2,alabel    ; branch based on a byte comparison
Part of the problem was adding precision to the branches, which takes up three additional bits. That left only a seven-bit displacement field in a 32-bit instruction. With only seven bits, a large percentage of branches would be converted into branch-around-jump sequences, reducing code density. Code density would be reduced by just as much as (or more than) making the instruction 40 bits (25%), and the two-instruction sequence would be slower. The precision field selects an 8-, 16-, 32- or 64-bit compare during the branch. Without precision control, values would have to be masked or extended with additional instructions before the branch. An issue with this ISA is that the precision control is explicit rather than implicit, to allow things like SIMD operations, and making it explicit takes up bits in the instructions. In other designs, where things are implicitly controlled (for example, values are automatically extended to the register width on loads or for calculated results), full-width comparisons can be used for branches. Some ISAs have separate instructions for SIMD and regular calculations. Given a desire for fixed-size instructions, and instructions being 40 bits, replicating the instruction set with a group of non-SIMD operations would be redundant. So a second solution would be to have variable-width instructions and redundant forms.
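What the precision field does during a compare can be modelled in C by comparing only the low 8/16/32/64 bits of each register. Sign extension for the ordered compare is an assumption here; the post only says that the field selects the compare width:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v. */
static int64_t sext(uint64_t v, unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32 || bits == 64);
    if (bits == 64)
        return (int64_t)v;
    uint64_t m = UINT64_C(1) << (bits - 1);
    v &= (UINT64_C(1) << bits) - 1;
    return (int64_t)((v ^ m) - m);
}

/* bne.b r1,r2,label: branch taken when the low bytes differ. */
static bool bne_prec(uint64_t r1, uint64_t r2, unsigned bits)
{
    return sext(r1, bits) != sext(r2, bits);
}

/* A signed less-than at the selected precision (signedness assumed). */
static bool blt_prec(uint64_t r1, uint64_t r2, unsigned bits)
{
    return sext(r1, bits) < sext(r2, bits);
}
```

Without the field, each of these would need explicit mask or extend instructions ahead of a full-width compare.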
Thu Jul 26, 2018 1:41 pm
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
With extra bits available, a number of 2R forms have been converted to 3R forms. Basic logic operations such as ‘and’ now bitwise-AND three operands together rather than two and store the result in the target register. Additional operations like MIN and MAX determine the minimum or maximum of three registers. The benefit of the 3R forms is very slight. I’ve rearranged the ISA formats some more to make the location of the precision field more consistent. With the ISA formats set, there’s a ton of documentation that needs to be updated.
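The three-operand semantics, modelled in C (treating MIN/MAX as signed is an assumption). With only 2R forms, each of these needs two instructions:

```c
#include <stdint.h>

/* and rt,ra,rb,rc : rt = ra & rb & rc */
static uint64_t and3(uint64_t a, uint64_t b, uint64_t c) { return a & b & c; }

/* min rt,ra,rb,rc : smallest of three (signed, assumed) */
static int64_t min3(int64_t a, int64_t b, int64_t c)
{
    int64_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* max rt,ra,rb,rc : largest of three (signed, assumed) */
static int64_t max3(int64_t a, int64_t b, int64_t c)
{
    int64_t m = a > b ? a : b;
    return m > c ? m : c;
}
```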
Sat Jul 28, 2018 3:08 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
Went nuts and added more bit matrix multiply operations to the ISA. With extra opcode bits available, more operations could be supported, so transpose as well as normal operations are now supported.
Musing about exception handling tonight. An issue is how to invoke a local handler from the global exception handling logic. Suppose a divide-by-zero exception occurs. The processor will transfer control to a global exception processing routine, which should then go back to the exception handler in the program. The global routine has to examine the current thread to see whether it wants to process the exception. If not, a global routine to handle the exception should be invoked; otherwise control should transfer back to the program’s exception handler. The global routine has to filter the exceptions for the currently running program. For instance, a disk DMA or timer exception should not be passed to most applications, but a divide by zero likely would be handled at the application level. This is controlled by the exception cause code, for which the processor supports an eight-bit code. User-generated exceptions should also be supported. The cause code could be used to index into a bit array to determine whether or not to pass the exception to the local exception handler. Inventing a register called golex, standing for global-or-local exception filter register. Any values of 256 or more (outside the bit array) are assumed to be user-defined exceptions to be processed locally. This is more challenging than it seems, as there needs to be filtering for each thread of execution in the system. A 256-bit array is only four 64-bit words, but the filtering needs to be extremely fast or ISR time would be impacted. A fast bitfield extract needs to be done based on the cause code. I’m tempted to just add a hardware register in the processor to do this. The four 64-bit words are additional state that would have to be transferred on a thread switch.
Code:
brkrout2:
    ; Read the golex viewport register to determine if the exception
    ; should be handled globally or locally. This viewport = golex indexed by cause code
    csrrd   r1,#GOLEXVP,r0  ; 0=global, 1=local handling
    beq     r1,r0,.0001     ; branch to global handler
    ; now setup to invoke the local handler
    ; load r1,r2 with cause and type
    csrrd   r1,#CAUSE,r0    ; get cause code into r1
    mov     r1:x,r1         ; put into exceptioned register set
    ldi     r2,#45          ; exception type = system exception
    mov     r2:x,r2
    ; Return to the exception handler code, not the exception return
    ; point. The exception handler address should be in r60.
    mov     r1,r60:x
    ; Should probably do a quick check for a reasonable return
    ; address here.
    csrrw   r0,#EPC0,r1     ; stuff r60 into the return pc
    sync
    rti                     ; go back to the local code
    ; Here global handling of exceptions is done
.0001:
    rti
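The fast bitfield extract on the cause code amounts to a word select plus a bit select. A C model of the proposed golex filter; the Golex type and helper names are hypothetical, with bit[cause] = 1 meaning "pass to the local handler":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-thread 256-bit filter held as four 64-bit words. */
typedef struct { uint64_t w[4]; } Golex;

/* Causes beyond the array are treated as user-defined, hence local. */
static bool golex_local(const Golex *g, unsigned cause)
{
    if (cause > 255)
        return true;
    return (g->w[cause >> 6] >> (cause & 63)) & 1u;   /* word select, bit select */
}

static void golex_set(Golex *g, unsigned cause, bool local)
{
    assert(cause < 256);
    uint64_t bit = UINT64_C(1) << (cause & 63);
    if (local)
        g->w[cause >> 6] |= bit;
    else
        g->w[cause >> 6] &= ~bit;
}
```

The four words are the per-thread state that would need saving on a thread switch.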
Sun Jul 29, 2018 4:59 am
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
Rob:
Exception handling has always been a pet peeve of mine. There always seems to be a lot of hand-waving on the subject, but there never seem to be any real solutions. I have always been unsatisfied with exception handling, like divide by 0, in real-time applications. Handling the exceptions in the commonly described manner, through exception handlers, almost never addresses the need to restart the computation and continue.
The standard approach almost always recommends restarting the application. In the meantime, what is the airframe supposed to do while the controller restarts? One reason there's a computer in the loop is that the airframe is unstable.
I don't encounter this situation very often, particularly the dreaded divide-by-0 fault. But when I've had to design a controller for a large piece of machinery, I've worked out how I would prefer the calculations to continue if that dreaded fault occurred. I resolved the issue by implementing saturation logic and letting the computations continue. I did flag the fault, but I did not let the processor blindly take an exception. The amount of work needed to backtrack the operation was invariably too much.
One thought that I had regarding the divide operation is that you could use some extra bits in the opcode to let the programmer choose the error recovery method. One option would be like the one I've used for my real-time systems, saturation of the result to a maximum/minimum value, and flagging the fault and not generating an exception. (I counted the number of times that the operation resulted in a saturated value, and took action on that.)
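A C sketch of the saturate-and-count policy described above. The names are hypothetical, and returning 0 for 0/0 is one arbitrary choice among several:

```c
#include <stdint.h>

typedef struct { unsigned long sat_count; } DivStatus;

/* Divide that saturates instead of trapping; each saturation is counted
   so the control loop can react to the fault count later. */
static int32_t div_sat(int32_t n, int32_t d, DivStatus *st)
{
    if (d == 0) {
        st->sat_count++;
        return n > 0 ? INT32_MAX : n < 0 ? INT32_MIN : 0;
    }
    if (n == INT32_MIN && d == -1) {   /* the other overflowing case */
        st->sat_count++;
        return INT32_MAX;
    }
    return n / d;
}
```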
_________________ Michael A.
Sun Jul 29, 2018 10:04 pm
Garth
Joined: Tue Dec 11, 2012 8:03 am Posts: 285 Location: California
The following is about a Hewlett-Packard calculator so it might not relate directly, but it might sprout some ideas. The HP-41 calculators use flag 25 as the error-ignore flag. You can set it, then if the next relevant instruction produces an error condition, program execution continues but the flag is cleared. You can then test the flag and decide what you want to do. If there was no error condition and the flag is still set, you can clear it if desired so that subsequent operations where stop-on-error behavior might be desired will do so. If flag 25 is clear and there's an error condition, program execution stops and control is returned to the keyboard with an error message. (Then pressing SST, the single-step key, will show what instruction it was on that had the problem.) In the case of a /0, the inputs remain unaffected, so you can take a different course of action with the same inputs for example to return the maximum representable number with the correct sign and keep going. One of my posted programs shows the use of flag 25, at http://wilsonminesco.com/HP-41daytimer.html, although it's about looking up an alarm that may not exist in the list, rather than about /0.
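A toy C model of the flag-25 behaviour as described, applied to the HP-41's Y/X divide. It illustrates only the control flow, not HP's firmware, and all names are hypothetical:

```c
#include <stdbool.h>

typedef enum { RUN_OK, RUN_ERROR_IGNORED, RUN_STOP } RunState;

typedef struct {
    bool flag25;   /* error-ignore flag */
    double x, y;   /* stack registers; left untouched on error */
} Calc;

/* Y / X. On divide by zero: with flag 25 set, clear it and continue with
   the inputs unchanged; otherwise stop with an error. */
static RunState calc_div(Calc *c, double *result)
{
    if (c->x == 0.0) {
        if (c->flag25) {
            c->flag25 = false;
            return RUN_ERROR_IGNORED;
        }
        return RUN_STOP;
    }
    *result = c->y / c->x;
    return RUN_OK;
}
```

The program then tests the flag after the operation and decides what to do, which is the same shape as a per-instruction error policy.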
_________________http://WilsonMinesCo.com/ lots of 6502 resources
Mon Jul 30, 2018 1:38 am
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2232 Location: Canada
Added the capability of setting the cause code from a register specified in the BRK instruction as well as from an immediate value. This makes implementing the throw() statement easier. Using the special CC64 type __exception, which represents a system exception, in a throw statement causes a BRK to be generated rather than the usual branch to the exception handler. The BRK invokes the system’s break handler, which in turn may return to the local exception handler via r60. This way it’s possible to mimic hardware exceptions. A small program to check compiler output for throw operations:
Code:
void testexcept(int a, int b)
{
    if (a)
        throw (__exception)66;
    if (b)
        throw "Hello World";
    printf("Test over");
}
Generates the following assembler:
Code:
    code
    align   16
;====================================================
; Basic Block 0
;====================================================
public code _testexcept:
    sub     $sp,$sp,#32
    sw      $lr,24[$sp]
    sw      $xlr,16[$sp]
    sw      $r0,8[$sp]
    sw      $fp,[$sp]
    ldi     $xlr,#testexcept_10
    mov     $fp,$sp
    sub     $sp,$sp,#0
    sub     $sp,$sp,#16
    sw      $r11,0[$sp]
    sw      $r12,8[$sp]
    lw      $r11,40[$fp]
    lw      $r12,32[$fp]
; if (a)
    beq     $r12,$r0,testexcept_13
;====================================================
; Basic Block 1
;====================================================
; throw (__exception)66;
    ldi     $v0,#66
    brk     $v0,#1
testexcept_13:
; if (b)
    beq     $r11,$r0,testexcept_15
;====================================================
; Basic Block 2
;====================================================
; throw "Hello World";
    ldi     $v0,#testexcept_0
    ldi     $v1,#20015
    bra     testexcept_10
testexcept_15:
;====================================================
; Basic Block 3
;====================================================
; printf("Test over");
    sub     $sp,$sp,#8
    ldi     $v2,#testexcept_1
    sw      $v2,0[$sp]
    call    _printf
    add     $sp,$sp,#8
    bra     testexcept_12
testexcept_10:
;====================================================
; Basic Block 4
;====================================================
    lw      $lr,16[$fp]
    sw      $lr,24[$fp]
testexcept_12:
    lw      $r11,0[$sp]
    lw      $r12,8[$sp]
    mov     $sp,$fp
    lw      $fp,[$sp]
    lw      $xlr,16[$sp]
    lw      $lr,24[$sp]
    ret     #32
endpublic

    rodata
    align   16
    align   2
testexcept_1:   ; Test over
    dc      84,101,115,116,32,111,118,101
    dc      114,0
testexcept_0:   ; Hello World
    dc      72,101,108,108,111,32,87,111
    dc      114,108,100,0
;   global  _testexcept
    extern  _printf
Rearranging the instruction set some more to improve bitfield operations, I’ve realized that the bitfield insert operation can’t be done in a single instruction because it requires reading four registers. It otherwise takes about four instructions to perform this operation.
Code:
LOAD    desired field value into a reg
BFCLR   clear the bits in the target
SHL     field value reg by bit offset
OR      field value into target register
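The four-step sequence written as one C function; needing target, value, offset and width matches the four register reads mentioned above. Masking the value to the field width is an extra safety step here, whereas the instruction sequence assumes the value already fits:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t bf_insert(uint64_t target, uint64_t value, unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 64 && offset + width <= 64);
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);
    target &= ~(mask << offset);                 /* BFCLR: clear the field */
    return target | ((value & mask) << offset);  /* SHL + OR the value in */
}
```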
Added a bitfield find first one in field instruction.
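A C model of a find-first-one within a field. The search direction (from the field's LSB) and the -1 result for an empty field are assumptions, since the post doesn't specify them:

```c
#include <assert.h>
#include <stdint.h>

/* Returns the position of the first set bit relative to the field start,
   or -1 if the field [offset, offset+width) is all zeros. */
static int bf_ffo(uint64_t v, unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 64 && offset + width <= 64);
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);
    uint64_t field = (v >> offset) & mask;
    for (int i = 0; i < (int)width; i++)
        if ((field >> i) & 1u)
            return i;
    return -1;
}
```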
Mon Jul 30, 2018 4:11 am