Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
The new thread could just vector to a known address when it's finished, that would work. The branch problem is fixed. Here is a partial dump from simulation of the instruction queue when SMT is turned on.
Code:
Dump of instruction queue
+--------------- program counter value
| +-------- sequence number
.. 0: 0 0 0 0 0 0 0 0 0 a0000001c 0 0 00 0000000000000000 0000000000000000 0000000000000000 1 00 0000000000000000 1 00 fffc0154.v 16 0#
.. 1: 0 0 0 0 0 0 0 0 0 a0044081c 0 1 00 0000000141000000 0000000000000044 0000000000000000 1 00 0000000000000000 1 00 fffc0158.^ 01 0# <- second insn of thread 1
.. 2: 0 0 0 0 0 0 0 0 0 a0044081c 0 1 00 0000000140000000 0000000000000044 0000000000000000 1 00 0000000000010000 1 07 fffc0158.v 17 0#
.. 3: 0 0 0 0 0 0 0 0 0 a6618085c 0 1 00 0000000000000001 0000000000006618 0000000141000000 1 01 0000000000000000 1 00 fffc015c.^ 02 0#
CQ 4: 0 0 0 0 0 0 0 0 0 a0000001c 0 0 00 0000000000000000 0000000000000000 0000000000000000 1 00 0000000000000000 1 00 fffc0144.v 13 0#
.. 5: 0 0 0 0 0 0 0 0 0 a0000001c 0 0 00 0000000000000000 0000000000000000 0000000000000000 1 00 0000000000000000 1 00 fffc0148.v 14 0#
.. 6: 0 0 0 0 0 0 0 0 0 a0000001c 0 0 00 0000000000000000 0000000000000000 0000000000000000 1 00 0000000000000000 1 00 fffc014c.v 15 0# <- last single thread insn.
.. 7: 0 0 0 0 0 0 0 0 0 a0000001c 0 0 00 0000000000000000 0000000000000000 0000000000000000 1 00 0000000000000000 1 00 fffc0150.^ 00 0# <- first insn of thread 1
... a little bit later
.. 0: 0 0 0 0 0 0 0 0 0 a1ff8f81c 031 00 0000000000001ff8 0000000000001ff8 0000000000000000 1 00 0000000000000000 1 00 fffc0164.v 1a 0#
.. 1: 0 0 0 0 0 0 0 0 0 aff40f85c 031 00 00000000ff401ff8 ffffffffffffff40 0000000000000000 1 04 0000000000001ff8 1 00 fffc0168.v 1b 0#
.. 2: 0 0 0 0 0 0 0 0 0 affc027dc 029 00 00000000fffc0170 00000000fffc016c 0000000000000000 1 01 0000000000000000 1 00 fffc016c.v 1c 0#
.. 3: 0 0 0 0 0 0 0 0 0 a00aa101c 0 2 00 00000000000000aa 00000000000000aa 0000000000000000 1 00 0000000000000000 1 00 fffc0108.^ 09 0# <- thread 1 branch back to $AA leds
.. 4: 0 0 0 0 0 0 0 0 0 afff8ffdc 031 00 00000000ff401ff0 fffffffffffffff8 00000000ff401ff8 1 01 00000000ff401ff8 1 01 fffc027c.v 1d 0# <- thread 0 continued ahead
.. 5: 0 0 0 0 0 0 0 0 0 a0600b81c 023 00 0000000000000600 0000000000000600 0000000000000000 1 00 0000000000000000 1 00 fffc010c.^ 0a 0#
CQ 6: 0 0 0 0 0 0 0 0 0 afd01005c 0 0 00 0000000000000000 00000000fffc0108 0000000000000000 1 04 0000000000000000 1 00 fffc0160.v 19 0#
.. 7: 0 0 0 0 0 0 0 0 0 a1ff8f809 031 00 0000000000001ff8 0000000000001ff8 0000000000000000 1 00 0000000000000000 1 00 fffc0164.^ 04 0#
... and a little bit later
.. 0: 1 0 0 0 0 0 0 0 0 m0000efd2 029 00 0000000000000000 0000000000000000 00000000ff401fe8 0 06 00000000fffc0270 1 12 fffc0288.v 14 0# <--- thread 0 continues on
.. 1: 1 0 0 0 0 0 0 0 0 m5402b802 0 0 00 0000000000000000 0000000000005402 0000000000000000 1 00 00000000ffdc0600 0 07 fffc0114.^ 03 0# <-+- thread 1 is looping around
CQ 2: 1 0 1 0 0 1 0 0 0 m0000efd2 029 00 00000000ff401fe8 0000000000000000 00000000ff401fe8 1 02 00000000fffc0270 1 14 fffc0270.v 11 0#   |
.. 3: 1 1 0 0 0 1 0 0 0 a00aa1009 0 2 00 00000000000000aa 00000000000000aa 0000000000000000 1 00 00000000000000aa 1 00 fffc0108.^ 00 0# <-+
.. 4: 1 1 0 0 0 1 0 0 0 a0008ffc4 031 00 00000000ff401ff0 0000000000000008 00000000ff401fe8 1 00 00000000ff401fe8 1 00 fffc0274.v 12 0#   |
.. 5: 1 1 0 0 0 1 0 0 0 a0600b809 023 00 0000000000000600 0000000000000600 0000000000000000 1 00 00000000ffdc0600 1 05 fffc010c.^ 01 0# <-+
.. 6: 1 0 0 0 0 0 0 0 0 b0000efe9 031 00 0000000000000000 0000000000000000 00000000ff401ff0 1 04 00000000fffc0270 0 12 fffc0278.v 13 0#   |
.. 7: 1 0 0 0 0 0 0 0 0 affdcb85a 023 00 0000000000000000 ffffffffffffffdc 0000000000000001 1 03 00000000ffdc0600 0 05 fffc0110.^ 02 0# <-+
Test code that the core is running as it starts SMT:
Code:
FFFC0100 00000008     and r0,r0,#0           ; cannot use LDI which does an or operation
FFFC0104 00B00031     bra .st1
.st2:
FFFC0108 00AA1009     ldi r2,#$AA
FFFC010C 0600B809     sb r2,LEDS             ; write to LEDs
FFFC0110 FFDCB85A
FFFC0114 5402B802
FFFC0118 FF700031     bra .st2
; First thing to do, LED status indicates core at least hit the reset
; vector.
.st1:
FFFC011C 00FF1009     ldi r2,#$FF
FFFC0120 0600B809     sb r2,LEDS             ; write to LEDs
FFFC0124 FFDCB85A
FFFC0128 5402B802
FFFC012C 00000809     ldi r1,#$10000         ; turn on SMT use $10000
FFFC0130 0001085A
FFFC0134 8000004E     csrrs r0,#0,r1         ; read and set control reg #0
FFFC0138 00000004     add r0,r0,#0           ; fetch adjustment ramp
FFFC013C 00000004     add r0,r0,#0           ; uses a bunch of non-nop nop's
FFFC0140 00000004     add r0,r0,#0
FFFC0144 00000004     add r0,r0,#0
FFFC0148 00000004     add r0,r0,#0
FFFC014C 00000004     add r0,r0,#0
FFFC0150 00000004     add r0,r0,#0
FFFC0154 00000004     add r0,r0,#0
FFFC0158 0044080E     csrrd r1,#$044,r0      ; which thread is running ?
FFFC015C 66180862     bfextu r1,r1,#24,#24   ; extract thread# - bit 24 from status
FFFC0160 FD010071     bne r1,r0,.st2         ; thread#1 branches back
FFFC0164 1FF8F809     ldi r31,#$FF401FF8     ; set stack pointer
FFFC0168 FF40F85A
_________________Robert Finch http://www.finitron.ca
|
Thu Feb 01, 2018 5:32 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
There were a few things I didn’t think of when starting to add SMT to the core. One of them was that the RSB (return stack buffer) is almost useless with SMT turned on. The RSB really needs to be present on a per-thread basis to be effective. Time for an update. The RSB, which was integrated into the fetch buffer module, has now been broken out as a separate module. This facilitates future updates and makes it easier to implement multiple RSBs.
A second consideration not thought of when starting SMT was how to handle interrupts. As it is right now both threads of execution will try to run the interrupt handler at the same time when an interrupt occurs. I’m not sure what to do about this one, except to perhaps turn SMT off and on in the interrupt handler. There’s a problem because an interrupt stacks only one PC value. I guess the best solution would be to have only a specific thread handling interrupts. I have the feeling it ain’t going to be easy to implement.
_________________Robert Finch http://www.finitron.ca
|
Fri Feb 02, 2018 6:32 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
I think I got the interrupt issues worked out. If SMT is enabled, the core now services the interrupt using just thread #0. I should write interrupt hammering code to really test it, but it works with a one-time interrupt pulse.
A bit of a kludge, but the core uses 32-bit sequence numbers to track the order of instructions. That number can roll over every so often. This is bad news for the core because when the number rolls over, determining the order of instructions is no longer possible. That means that at predictable points in time the sequence number has to be reset so that the order of instructions can remain known. My thought is to “manually” reset the sequence number in an interrupt routine where the ordering of instructions can be controlled. Assuming a 10GHz clock the 32-bit sequence number would roll over at a rate of 2.3Hz. So a 3.0Hz or faster interrupt could reset it. There usually is a periodic interrupt in a system. Another alternative would be a hardware-based periodic reset. It would have to flush the instruction queue and reset the counter. I’m favoring an interrupt routine right now because it’s less hardware. A smaller sequence number would be better, but then a higher frequency interrupt would be required.
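The rollover arithmetic can be checked with a few lines of C. This is only a sketch: the function name `rollover_hz` is made up for illustration, and it assumes exactly one sequence number is consumed per clock cycle, which the post does not state outright.

```c
#include <stdint.h>

/* Rollover frequency in Hz of a sequence counter that is `bits` wide.
   Assumption (not confirmed by the post): one sequence number is
   consumed on every clock cycle. Valid for bits < 64. */
double rollover_hz(double clock_hz, unsigned bits)
{
    double counts = (double)(UINT64_C(1) << bits);  /* 2^bits distinct values */
    return clock_hz / counts;
}
```

Under that assumption `rollover_hz(10e9, 32)` comes out near 2.33, in line with the 2.3Hz figure, and the same call with a 16-bit counter gives roughly 152.6 kHz, which shows why a smaller sequence number needs a much faster reset interrupt.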
_________________Robert Finch http://www.finitron.ca
|
Sat Feb 03, 2018 8:03 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
I think with the right approach to signed arithmetic, you can reliably mark 2^(n-1) objects with an n bit counter and not have any problem with overflow. (Maybe you can even do better than that. I only have a vague and distant memory from Occam's timers on the transputer.)
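The signed-arithmetic idea resembles what is often called serial number arithmetic. A minimal sketch in C, assuming 32-bit sequence numbers and that any two numbers being compared are less than 2^31 apart; the name `seq_before` is hypothetical and nothing here is taken from the FT64 source.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when sequence number a was issued before b. The subtraction
   wraps modulo 2^32, so the answer stays correct across a rollover as
   long as a and b are less than 2^31 apart. The conversion to int32_t
   relies on the usual two's complement behaviour. */
bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

With this comparison, `seq_before(0xFFFFFFF0u, 5u)` is true even though the counter wrapped between the two values, so the counter would not need to be reset as long as the instructions in flight span less than half the number range.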
|
Sat Feb 03, 2018 8:58 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
A program written to hammer test the i-cache failed, strangely because of a data cache problem. The program is kinda cool. It copies a short test calculation routine to a random address in memory over an address range that spans cache boundaries. Then it calls the test routine. If the test routine returns the correct answer then instructions were likely cached correctly from the source location. The program then goes back and makes another try at a different random address. If it fails it displays a code on the LED output.
When I wrote the data cache I allowed for unaligned data access and shifted the data to the proper place. Well the core also shifted the data to the proper place outside of the data cache, so it was being shifted twice. This would result in valid data sometimes but not other times depending on the data alignment. I also found that the vector set instruction and the compare instruction conflicted because the vector set instruction wasn’t decoded in the correct place. This caused compares not to work properly.
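The double-shift effect is easy to reproduce in a small C model. This is a sketch under assumptions: a 64-bit little-endian data word, byte-sized loads, and function names invented for the example. It is not the cache logic itself.

```c
#include <stdint.h>

/* Correct behaviour: shift the requested byte down to bit 0 once.
   `off` is the byte offset within the word, 0..7. */
uint8_t load_byte_once(uint64_t word, unsigned off)
{
    return (uint8_t)(word >> (8 * off));
}

/* The bug described above: the data cache aligns the data, then the
   core applies the same shift again to the already aligned result. */
uint8_t load_byte_twice(uint64_t word, unsigned off)
{
    uint64_t aligned = word >> (8 * off);     /* shift done in the data cache */
    return (uint8_t)(aligned >> (8 * off));   /* redundant shift in the core  */
}
```

At offset 0 both functions agree, which is why aligned accesses looked fine; at any other offset the second shift picks up a different byte (or zero), matching the "valid sometimes, not other times" symptom.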
Currently it isn’t possible to switch register sets in an interrupt routine, which is likely when one would want to switch them. The issue is the second thread’s register set is tied to the first one, so switching one switches the other. This could be fixed by untying the register sets but it’s more hardware. The second thread would also need its own record of PCs for interrupt processing. If the second thread can’t make API calls because it can’t use the system call (brk) instruction then the value of a second thread is pretty limited. So I’ve undertaken to re-write the code with multiple copies of a number of CSRs, one for each thread.
Next to hammer test interrupts. Resetting the sequence number can be tested at the same time. It should just be a matter of toggling a bit in a control register during the interrupt, and then a short ramp of instructions for which an invalid sequence number doesn't matter.
_________________Robert Finch http://www.finitron.ca
|
Sun Feb 04, 2018 7:50 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1796
|
Quote:
When I wrote the data cache I allowed for unaligned data access and shifted the data to the proper place. Well the core also shifted the data to the proper place outside of the data cache, so it was being shifted twice. This would result in valid data sometimes but not other times depending on the data alignment.
Thus demonstrating that two rights make a wrong - at least, two right shifts make a wrong shift!
|
Sun Feb 04, 2018 8:20 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
8,000 test vectors have been run through the floating-point adder, multiplier, and divider for both single and double precision. The results are fairly close to those of the workstation. The results are sometimes out by one in the least significant bit, and underflow is not handled the same way. On underflow the workstation in many cases simply sets the result to zero, whereas FT64’s FP spits out the bits of the mantissa. I wrote a short C program to generate the test vectors and expected outputs.
The floating point isn't fantastically fast, but it's low latency compared to many units, except for divide, which takes many cycles.
_________________Robert Finch http://www.finitron.ca
|
Mon Feb 05, 2018 8:51 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Been working on the floating-point for a change.
I’ve realized it’s possible to do five or six iterations of a Newton-Raphson divide in the same length of time as a floating-point divide in FT64, and the N-R code is interruptible. The N-R code is just about as fast as the divide. It’s very tempting to save some LUTs and remove the float divide operation. I think I have to go with either a better divider or drop it from the core. So I went with a lower latency divider, radix 16, 4 levels of cascaded logic instead of a single level. I was going to try running a simpler divider at a four times clock rate, but then I got to thinking that means that I’m guessing the clock could handle four times as much logic.
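For reference, a Newton-Raphson divide can be sketched in C as below. This is an illustration of the general technique, not the FT64 routine: the name `nr_divide`, the linear first estimate, and the loop-based scaling are all assumptions, and it only handles a finite, non-zero divisor. Real code would adjust the exponent field directly instead of looping.

```c
/* Approximate n / d by refining an estimate x of the reciprocal of d
   with the iteration x <- x * (2 - d * x).
   Assumes d is finite and non-zero. */
double nr_divide(double n, double d)
{
    double sign = 1.0;
    double scale = 1.0;

    if (d < 0.0) { d = -d; sign = -1.0; }

    /* Scale d into [0.5, 1) by exact powers of two; `scale` records the
       factor so that 1/d_original = scale * (1/d_scaled). */
    while (d >= 1.0) { d *= 0.5; scale *= 0.5; }
    while (d < 0.5)  { d *= 2.0; scale *= 2.0; }

    /* Linear first estimate of 1/d on [0.5, 1). */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;

    /* Each iteration roughly doubles the number of correct bits;
       five iterations is more than enough for double precision here. */
    for (int i = 0; i < 5; i++)
        x = x * (2.0 - d * x);

    return sign * n * x * scale;
}
```

Because every iteration is just multiplies and a subtract, an interrupt can be taken between any two of them, which is the interruptibility advantage mentioned above.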
_________________Robert Finch http://www.finitron.ca
|
Thu Feb 08, 2018 7:11 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Added some documentation of the floating-point cores.
With SMT and other things, the core size is about 128,000 LUTs or about 95% of the size of the target device. It took about 11 hours to place and route the system, which also included a video generator and keyboard interface in the last few LUTs. Implementation failed trying to generate a bitstream.
Other things include limited vector operation chaining, greater support for interrupts, and a square root function.
_________________Robert Finch http://www.finitron.ca
|
Fri Feb 09, 2018 10:21 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Worked on interrupt hammering the core. Found and fixed a couple of bugs. One software bug was neglecting to save off the assembler’s constant building register r23 during an interrupt.
Spent some time studying how exception handling is performed in software. My thought was to try and provide some hardware support but it looks like it isn’t really needed. The compiler will likely be changed to generate better exception handling code in the future. As it is right now it works as if in ‘C’ not C++ and does not destroy objects, but rather simply unlinks the stack. The throw statement in CC64 acts like a return and does a return to the latest defined catch handler. That could end up being a multi-level return. Like the C return statement, the stack is unlinked. Unlike C++, objects created in routines are not automatically destroyed. Like C++, the catch handlers are searched at more outer levels until one is found that can handle the thrown type. The throw statement currently stores the thrown value in r1, which must be of a type of object that can fit into a register (an integer, float or pointer). The object’s type is placed in r2.
I was hoping to have the processor throw an exception object itself automatically for things like divide by zero, which could then be caught with local exception processing code without having to go through a global exception handler (read fast). So the processor would have to load r1 and r2 on an exception. But there isn’t a way to load two registers in a single instruction. So I suppose just r2 could be loaded by the processor with the exception type, and the associated value could be assumed to be null. The value in r1 would basically be random data then.
Here is a link with a reasonable description of exception handling in VC++. https://www.codeproject.com/Articles/2126/How-a-C-compiler-implements-exception-handling
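The throw behaviour described above (return to the latest catch handler, unlink the stack, destroy nothing) is close to what setjmp/longjmp gives in plain C. The sketch below is only an analogy and is not how CC64 generates code: the globals standing in for r1 and r2, the type codes, and all function names are hypothetical.

```c
#include <setjmp.h>
#include <stdint.h>

/* Hypothetical stand-ins for r1 (thrown value) and r2 (its type code). */
static intptr_t thrown_value;
static int      thrown_type;
static jmp_buf *catch_top;                 /* latest defined catch handler */

enum { TYPE_INT = 1, TYPE_DIVZERO = 2 };   /* made-up type codes */

/* "throw": record value and type, then do a multi-level return.
   The stack is simply unlinked; no objects are destroyed. */
void throw_value(intptr_t value, int type)
{
    thrown_value = value;
    thrown_type  = type;
    longjmp(*catch_top, 1);
}

intptr_t checked_div(intptr_t a, intptr_t b)
{
    if (b == 0)
        throw_value(0, TYPE_DIVZERO);      /* value assumed null, only the type matters */
    return a / b;
}

/* Returns the quotient, or -1 if a divide-by-zero was caught here. */
intptr_t try_div(intptr_t a, intptr_t b)
{
    jmp_buf here;
    jmp_buf *outer = catch_top;            /* remember the enclosing handler */
    intptr_t result = 0;

    catch_top = &here;
    if (setjmp(here) == 0) {
        result = checked_div(a, b);        /* the "try" body */
    } else if (thrown_type == TYPE_DIVZERO) {
        result = -1;                       /* the matching "catch" */
    } else {
        catch_top = outer;                 /* not ours: search the outer level */
        throw_value(thrown_value, thrown_type);
    }
    catch_top = outer;
    return result;
}
```

In this model a processor-raised exception would only need to fill in `thrown_type`, which mirrors the idea of loading just r2 and treating the value as null.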
_________________Robert Finch http://www.finitron.ca
|
Mon Feb 12, 2018 12:22 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Changed the size of the cause code in the processor to 8 bits from 9 bits and assigned the most significant bit to indicate a hardware interrupt. The cause code now nicely fits into a byte. The design has sat for a while with about 20 some odd cause codes so there doesn’t need to be a huge number of them. Originally, I had been thinking along the lines of the overcrowded x86 INT vector table.
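Decoding such a cause code takes two masks. A small sketch in C, assuming the low seven bits carry the cause number; the macro and function names are invented for illustration and are not from the FT64 sources.

```c
#include <stdint.h>
#include <stdbool.h>

#define CAUSE_IRQ_FLAG 0x80u   /* most significant bit: hardware interrupt */
#define CAUSE_NUM_MASK 0x7Fu   /* assumed: remaining bits hold the cause number */

/* True when the 8-bit cause code was produced by a hardware interrupt. */
bool cause_is_interrupt(uint8_t cause)
{
    return (cause & CAUSE_IRQ_FLAG) != 0;
}

/* The cause number with the interrupt flag stripped off. */
uint8_t cause_number(uint8_t cause)
{
    return (uint8_t)(cause & CAUSE_NUM_MASK);
}
```

Seven bits leave room for 128 distinct causes, comfortably more than the 20 or so currently defined.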
Ended up working on garbage collection after starting out working on exceptions. I’m going to try having an interrupt-driven garbage collector so that when garbage collection occurs is predictable. IRL garbage collection is periodic, every other Thursday for instance. Garbage collectors are often run on separate background threads so running one from an interrupt is similar. The garbage collector interrupt will be a low priority interrupt so that just about anything else can interrupt it.
As a program runs and objects are allocated with new they are added to a list of objects created in the function. When the function exits all the objects still on the function’s object list are added to the garbage collector’s list. The garbage collector runs in spurts, processing 25 items at a time with the list locked. This is to allow other processing to continue to add to the garbage list as it’s being cleaned up.
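The batch-at-a-time idea can be sketched in C as follows. Everything here is an assumption made for illustration: the list layout, the names, the placeholder lock (a real system would mask interrupts or use a proper lock), and the simplification that every listed object can just be freed.

```c
#include <stdlib.h>

#define GC_BATCH 25                /* items processed per spurt, as in the post */

struct gc_node {
    struct gc_node *next;
    void *object;
};

static struct gc_node *gc_list;    /* the garbage collector's list */
static int gc_locked;              /* placeholder for a real lock  */

void gc_lock(void)   { gc_locked = 1; }
void gc_unlock(void) { gc_locked = 0; }

/* Called at function exit for each object still on the function's list. */
void gc_add(void *object)
{
    struct gc_node *n = malloc(sizeof *n);
    if (n == NULL)
        return;                    /* sketch only: the object is leaked */
    n->object = object;
    gc_lock();
    n->next = gc_list;
    gc_list = n;
    gc_unlock();
}

/* One spurt of the collector, e.g. run from the low priority interrupt.
   Frees at most GC_BATCH items and returns how many were freed. */
int gc_step(void)
{
    int freed = 0;
    gc_lock();
    while (gc_list != NULL && freed < GC_BATCH) {
        struct gc_node *n = gc_list;
        gc_list = n->next;
        free(n->object);
        free(n);
        freed++;
    }
    gc_unlock();
    return freed;
}
```

Capping each spurt bounds how long the list stays locked, so other code only ever waits for 25 items' worth of work before it can add more garbage.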
_________________Robert Finch http://www.finitron.ca
|
Thu Feb 15, 2018 7:47 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Added an increment and branch if not equal instruction (IBNE) which compares two registers and branches if they are not equal. Then it increments the first register. This is useful for counted loops where the count increments. Probably best used in hand-written assembler, in tight loops as for string instructions. There is also a decrement and branch not equal; the only difference between the two instructions is the constant value added to the register: +1 for increment and -1 for decrement.
Code:
naked char *memset(register char *p, register char val, register int size)
{
    asm {
        beq  r20,r0,.xit        // size (r20) is zero: nothing to do
        sub  r2,r20,#1          // r2 = index of the last byte
        ldi  r1,#0              // r1 = current index
    .again:
        sb   r19,[r1+r18]       // store val (r19) at p (r18) + index
        ibne r1,r2,.again       // loop until index equals last, then increment
    .xit:
        mov  r1,r18             // return p
        ret
    }
}
Finally got around to building the system out to a bitstream again. It turns out timing failed in the branchmiss logic. So I added a register, which causes a branchmiss to be delayed by a cycle, but otherwise it seems to work. The benefit is a much faster clock cycle time. The branchmiss logic (44, yes 44! logic levels) limited the clock to about 18 MHz.
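The loop semantics of IBNE can be modelled in portable C. This is a reading of the description above rather than a statement of the hardware: it assumes the compare uses the value before the increment and that the increment happens whether or not the branch is taken. The name `memset_ibne` is made up.

```c
/* C model of the memset loop: store, compare index with the last index,
   branch if not equal, then increment the index.
   Like the assembler version, only size == 0 is special-cased. */
char *memset_ibne(char *p, char val, int size)
{
    if (size == 0)               /* beq r20,r0,.xit */
        return p;

    int last = size - 1;         /* sub r2,r20,#1 */
    int i = 0;                   /* ldi r1,#0 */
    int again;

    do {
        p[i] = val;              /* sb r19,[r1+r18] */
        again = (i != last);     /* ibne: compare first ... */
        i = i + 1;               /* ... then increment (assumed unconditional) */
    } while (again);

    return p;                    /* mov r1,r18 */
}
```

Because the compare sees the index before it is incremented, the loop bound is size-1 rather than size, which is why the assembler subtracts one up front.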
_________________Robert Finch http://www.finitron.ca
|
Fri Feb 16, 2018 12:07 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
FT64 failed timing. Now the timing failure is in the register file access, which uses a 4x clock (a 100 MHz clock is just a bit too fast). So, I’ve rewritten the register file update to use a 2x clock instead of a 4x clock.
Been working on software today. Coding for memory management and application start-up.
_________________Robert Finch http://www.finitron.ca
|
Sat Feb 17, 2018 11:16 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Duh, I had the time-slice interrupt as just about the highest priority interrupt in the system. I think it should be about the lowest priority one. Don’t want time-slicing interrupting disk or network activities.
Tonight’s nonsense was coming up with a way to enable / disable interrupts for the operating system. It’s not quite as simple as a single “sei” or “cli” instruction because of the processor’s pipeline. Because of the pipeline for several instructions after the “sei” to disable interrupts, an interrupt could still occur. So a safety zone is needed. The zone allows an interrupt to occur before reaching the critical code. I wrote the following piece of bloated code to handle setting the interrupt level. It also returns the current level.
Code:
int SetImLevel(register int level)
{
    int x;

    if ((x = GetImLevel()) >= level)
        return (x);
    __asm {
        csrrd r1,#$044,r0    // read machine status register #$044
        bfins r1,r18,0,2     // insert the desired level in the im bits
        csrrw r1,#$044,r1    // and update the status reg
        and   r1,r1,#7       // return only the im bits
        // The following safety ramp is present because the interrupt level
        // won't be set for a few machine cycles after the instruction to
        // set the level is fetched. An interrupt still might occur and
        // be recognized after the CSR is set. It takes a few cycles for
        // the setting to take effect.
        add r0,r0,#0
        add r0,r0,#0
        add r0,r0,#0
        add r0,r0,#0
        add r0,r0,#0
        add r0,r0,#0
        add r0,r0,#0
        add r0,r0,#0
    }
}
_________________Robert Finch http://www.finitron.ca
|
Tue Feb 20, 2018 1:12 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2157 Location: Canada
|
Programmable interval timers were this morning’s topic. A pit component was added into the FT64 mpu. The mpu includes mmu, pit and pic components. The pit contains three counters; two of the counters were assigned to the time slice and garbage collect interrupts. The third counter isn’t used yet. The idea behind the pit is similar to the 6840 or 8254 pits, but with more pins available the bus interface has been made simpler. The pit can operate in one-shot or continuous mode. It can use an external clock source or the bus clock. And it has a gating signal available. The basics one might expect from a pit.
The assembler and compiler were updated. There used to be a number of phase errors building the system due to instructions that weren’t implemented in the assembler or were implemented incorrectly. All the phase errors have been fixed. Support for a separate floating-point register file has made the compiler and hardware more complex. Initially floating-point just used the same register set as integer registers. When a separate register file was used for fp, fp register loads and stores were now required in the core. They had to be shoe-horned into the instruction set.
_________________Robert Finch http://www.finitron.ca
|
Wed Feb 21, 2018 11:24 am |
|