 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The system works a bit better all the time. Some of the beginnings of device drivers are written. Routines for four devices have been roughed out: NULL (easy), PTI (the parallel transfer interface), PRNG (random number generator) and DBG (debugger).
A super simple BASIC interpreter is being sketched out. My latest effort is to get memory management working. Specifically, I’d like to get the malloc() and free() library functions working. The malloc() library function calls sbrk(), and the sbrk() function needs to be supplied. This function has been written. The C standard library from Plauger’s book is being used unaltered.
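For readers unfamiliar with sbrk(), here is a minimal sketch of the kind of routine malloc() expects underneath it. It is not the FT64 implementation: the name demo_sbrk, the static heap array and HEAP_SIZE are all assumptions made so the example can run on a host machine.

```c
#include <stddef.h>

#define HEAP_SIZE 4096          /* hypothetical heap size in bytes */

static unsigned char heap[HEAP_SIZE];   /* region handed out by demo_sbrk() */
static size_t heap_brk = 0;             /* current break, as an offset into heap */

/* Move the break by incr bytes and return the previous break.
   Returns (void *)-1 when the request does not fit, as sbrk() traditionally does. */
void *demo_sbrk(ptrdiff_t incr)
{
    size_t old = heap_brk;

    if (incr < 0) {
        if ((size_t)(-incr) > heap_brk)
            return (void *)-1;          /* cannot shrink below the heap start */
    } else if ((size_t)incr > HEAP_SIZE - heap_brk) {
        return (void *)-1;              /* out of heap space */
    }
    heap_brk = (size_t)((ptrdiff_t)heap_brk + incr);
    return &heap[old];
}
```

A malloc() built on this would call demo_sbrk(n) whenever its free list cannot satisfy a request, and demo_sbrk(0) to query the current break.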

_________________
Robert Finch http://www.finitron.ca


Sun Nov 25, 2018 4:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added 16MB region $00xxxxxx to the area excluded from address mapping via the ipt. This region of memory is reserved for the video frame buffer. The upper 16MB region which contains the ROM and I/O is also excluded from mapping.
Ran into a situation where the cpu hung trying to read from an invalid address. The address was supplied by invalid code after the end of a routine. Because the cpu fetches code in advance it was fetching code from what was a jump table embedded in code. This was fixed by placing a series of nop instructions just prior to the table. I’m wondering if there should be a mechanism in the cpu to handle this type of situation. For instance, a special flag instruction indicating that the following section of memory is not executable code.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 26, 2018 4:10 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modified the compiler to generate inline string constant parameters. The string constant is placed inline directly after the function call.
For code like the following:
Code:
 int main(int arg)
{
   TestInline(I"Hello World!");
}

The compiler generates:
Code:
 ;====================================================
; Basic Block 0
;====================================================
public code _main:
            sub         $sp,$sp,#32
            sw          $fp,[$sp]
            sw          $r0,8[$sp]
            sw          $xlr,16[$sp]
            sw          $lr,24[$sp]
            ldi         $xlr,#TestInline_14
            mov         $fp,$sp
            sub         $sp,$sp,#8
            call        _TestInline
            dc            "Hello World!",0
            mov         $t0,$v0
            bra         TestInline_16
TestInline_14:
;====================================================
; Basic Block 1
;====================================================
            lw          $lr,16[$fp]
            sw          $lr,24[$fp]
TestInline_16:
            mov         $sp,$fp
            lw          $fp,[$sp]
            lw          $xlr,16[$sp]
            lw          $lr,24[$sp]
            ret         #32
endpublic

The ‘I’ character directly before the string indicates an inline argument. Of course, the routine that gets called must be able to process the inline argument, and it only works for character strings. It is possible to pass multiple inline string arguments.
The goal is to conserve memory. Without inline passing, each string argument requires a register to be loaded with the address of the string and then pushed on the stack, which costs about 6 to 10 bytes per string used.
To make things a little clearer the “inline” keyword may be used when declaring a string parameter.

Code:
 naked int TestInline(inline char *str)
{
   __asm {
.0002:
      lc      r1,[lr]
      beq      r1,r0,.0001
      push   lr
      push   r1
      call   _DBGDisplayChar
      lw      lr,8[sp]
      add      sp,sp,#16
      add      lr,lr,#2
      bra      .0002
.0001:
      add      lr,lr,#2
      ret
   }
}
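The callee's job can be modeled in portable C: read characters starting at the return address, then return to the first location past the terminator. This is only a host-side sketch. The 16-bit character width is inferred from the `lc` load and the `add lr,lr,#2` step in the listing, and the names consume_inline_string, emit_fn and capture_char are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: the inline string is stored as 16-bit characters
   immediately after the call, which is where the return address points. */
typedef void (*emit_fn)(uint16_t ch);

/* Walk the inline string starting at the return address, emit each character,
   and return the adjusted return address (just past the terminating zero). */
const uint16_t *consume_inline_string(const uint16_t *ret_addr, emit_fn emit)
{
    while (*ret_addr != 0) {
        emit(*ret_addr);
        ret_addr++;             /* corresponds to "add lr,lr,#2" in the listing */
    }
    return ret_addr + 1;        /* skip the terminator before returning */
}

/* Small capture buffer so the sketch can be exercised on a host machine. */
static char captured[32];
static size_t captured_len = 0;

static void capture_char(uint16_t ch)
{
    if (captured_len + 1 < sizeof captured) {
        captured[captured_len++] = (char)ch;
        captured[captured_len] = '\0';
    }
}
```

With several inline arguments, the same walk would simply be repeated once per string, each call starting where the previous one stopped.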

_________________
Robert Finch http://www.finitron.ca


Tue Nov 27, 2018 6:05 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Started working on a config utility for FT64. It’s really just a GUI front end to the FT64_config.vh file.
Did some experimentation with allowing the register file access to take two clock cycles, which gives the block ram used to implement the register file the best timing. This would introduce some additional latency in processing. Scrapped the idea.

I’m pondering the idea of using a centralized DMA controller for the system, but the requirements are hefty. The idea is to reduce the number of different channels going to the memory controller. There are 5 channels of audio DMA, 1 channel video frame buffer DMA, 1 channel graphics accelerator, 32 channels sprite DMA, 1 channel SDCard DMA, 1 channel Ethernet DMA and possibly more. Instruction and data cache loads could also be considered a form of DMA. At the moment each device has its own DMA controller. This seems to me somewhat inefficient in terms of the number of multiplexers in use. It would be better to use RAM as multiplexers, but that requires a centralized solution.
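As a rough software model of the centralized idea: per-channel state lives in one table indexed by channel number (standing in for a RAM), and a single arbiter picks the next requesting channel. The 41 channels are just the sum of those listed above. The round-robin policy, the struct fields and the function names are assumptions for illustration, not a description of any existing controller.

```c
#include <stdint.h>

/* 5 audio + 1 frame buffer + 1 graphics + 32 sprite + 1 SD card + 1 Ethernet */
#define NUM_CHANNELS 41

/* Per-channel state kept in one table (a RAM in hardware) instead of in
   separate per-device registers. Field names are illustrative only. */
typedef struct {
    uint64_t address;     /* next memory address to transfer */
    uint32_t remaining;   /* bytes left in the current transfer */
} DmaChannel;

static DmaChannel channels[NUM_CHANNELS];

/* Round-robin arbitration: starting after the channel served last, return the
   first channel whose request bit is set, or -1 if none is requesting. */
int dma_next_channel(int last, uint64_t requests)
{
    for (int i = 1; i <= NUM_CHANNELS; i++) {
        int ch = (last + i) % NUM_CHANNELS;
        if ((requests >> ch) & 1u)
            return ch;
    }
    return -1;
}

/* Serve one burst on the chosen channel by advancing its table entry. */
void dma_serve(int ch, uint32_t burst)
{
    if (burst > channels[ch].remaining)
        burst = channels[ch].remaining;
    channels[ch].address += burst;
    channels[ch].remaining -= burst;
}
```

Because the state is selected by an index rather than by wide multiplexers over per-device registers, adding a channel only grows the table.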

The csr_read() task used to read the csr registers didn’t work properly. It returned the value of the first register, ignoring which register was addressed. So, I re-wrote the task as an always block and then it worked. This is an okay solution for reading the csr as it’s only read in a single place.
I've been trying to get calls to the operating system via the BRK instruction to work. It gets to the break routine but then begins looping around.

_________________
Robert Finch http://www.finitron.ca


Wed Nov 28, 2018 4:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Made RTI a synchronizing instruction. A synchronizing instruction ensures that all instructions in the queue are done before issuing more instructions. The sync is needed because RTI switches register sets back to the original set.
The BRK routine is malfunctioning. It appears to ignore the cause code when deciding what to do and doesn’t call the routine which prints a message. Then the RTI instruction fails and causes a loop back to the start of the break routine. This can be seen with the ILA. This kind of thing drives me nuts because it appears to work in simulation. I’d like to get the glitches in the BRK instruction / routine worked out because it’s a stepping stone to getting interrupts working. The logic is similar. In fact the core inserts a BRK instruction into the instruction stream when an interrupt occurs.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 29, 2018 4:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Took a look back at the original Thor processing core and ISA. I’ve been thinking about starting yet another core that’s a little closer to the original Thor. It would have fixed 40 bit instructions and use predicate registers similar to the IA64. There would also be 64 registers. I also want to try a design with a split dispatch / reorder buffer rather than a circular instruction window.
I managed to get further with FT64 by removing a privilege check. I don't know why it caused a problem; AFAIK it shouldn't have, and it worked in simulation. There's still something wrong with the core in that it takes a long time to process some instructions. I saw on the ILA where the commit timeout fired and caused an exception. Then an instruction in the exception routine timed out, causing a loop.
I'm more interested at the moment in trying to get some neural network stuff working so I may have to shelve the core in favor of using a canned solution.

_________________
Robert Finch http://www.finitron.ca


Fri Nov 30, 2018 4:58 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
(In case you do shelve the core:) You've certainly had some interesting adventures along the way - thanks for sharing them!


Fri Nov 30, 2018 8:30 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:
Perhaps some time away from the core will let your background processor sift through the data and help you find a solution to these issues that you've been writing about. I certainly have noted your progress, and have followed up many of your posts with some research into features that you describe but which are new to me. So I echo Ed's sentiments as well.

_________________
Michael A.


Fri Nov 30, 2018 12:05 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Perhaps some time away from the core will let your background processor

I think that's it, I need to take a step back for a bit and work on something else.

Decided to make the processor more complex because I was getting bored with it. I realized the processor wasn’t complicated enough. I found I was studying the old Thor processor core and making up a document for a new core. So, I first added a segmentation model to it, and now I’m adding optional predicated instruction execution. The complex parts of the segmentation (loading descriptor information and privilege level checking) are mainly software driven. Loading a selector register triggers a segment load exception and lets the software handle it. Otherwise the segment registers are all initialized to a flat memory model. I had to modify the load and store instructions to accept a segment register override. This is enabled for the long form (48 bits) of the instructions, where an extra segment id field was shoehorned in. For code addressing, the upper 24 bits of the program counter store the code segment selector. This lets it be saved and restored to memory along with the rest of the program counter. It also means only a single link register is required. The limitation is that the program counter offset is limited to the remaining 40 bits. I don’t see this as a problem. Last time I looked the largest module on my workstation was only about 4MB.
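The program counter layout described here (24-bit selector on top, 40-bit offset below) can be written out as a small C sketch. A 64-bit program counter is assumed, and the helper names pc_pack, pc_selector and pc_offset are hypothetical.

```c
#include <stdint.h>

#define PC_OFFSET_BITS 40
#define PC_OFFSET_MASK ((UINT64_C(1) << PC_OFFSET_BITS) - 1)

/* Combine a 24-bit code segment selector with a 40-bit offset. */
uint64_t pc_pack(uint32_t selector, uint64_t offset)
{
    return ((uint64_t)(selector & 0xFFFFFFu) << PC_OFFSET_BITS)
         | (offset & PC_OFFSET_MASK);
}

/* Upper 24 bits: the code segment selector. */
uint32_t pc_selector(uint64_t pc) { return (uint32_t)(pc >> PC_OFFSET_BITS); }

/* Lower 40 bits: the offset within the code segment. */
uint64_t pc_offset(uint64_t pc)   { return pc & PC_OFFSET_MASK; }
```

A 40-bit offset still covers 1 TiB of code per segment, so a 4MB module is nowhere near the limit.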
Optional predication is a little more tricky to add, but I’m sure it can be done. When it’s turned on, the instruction lengths are increased by one byte and the first byte of the instruction is processed as a predicate byte. This will require some changes to the instruction cache, which has to return an extra byte. It also means that instructions could be byte-aligned, so addresses have to be made byte addresses rather than 16-bit parcel addresses.
In the meantime, I came up with some models for artificial neurons.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 01, 2018 8:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I added segmentation to the core and got most of the predicated logic written. I wrapped it all in #ifdef statements so the core could be built without these features. I made numerous other changes to the core. Fortunately after building and testing it in the FPGA it still works the same way.

Taking one final try at figuring out why the core is hanging. It’s definitely a hardware hang: in the logic analyzer the program counter remains static at a particular address, so it’s not a software loop. I had thought that instructions perhaps weren’t committing to the machine state properly, so I included “unstick” logic. This logic isn’t being triggered, so that can’t be the cause. I tried rearranging the code a little bit, and the core still hangs in approximately the same location. Tonight I realized that the hung core could also be due to a panic condition. So, I moved the logic analyzer probes over to the panic signal. Fortunately, the number of different signals that could cause the core to remain at a fixed pc location is limited. They can probably be ticked off a list one by one.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 02, 2018 6:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Starting work on an ANN project. Here is a diagram of an input node of the ANN. It should be fairly standard as it's implementing the classic formula in hardware. The plan is to use 18-bit floating point, which makes it tempting to implement a 72-bit (4 x 18) control processor. There needs to be a means to load all the RAM for both the weights and inputs, then read back the outputs.
Attachment: Slide1.PNG (ANN input node)
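Since the diagram is not reproduced here, the following sketch shows what "the classic formula" is usually taken to mean: an activation function applied to the weighted sum of the inputs plus a bias. This is an assumption about the node, not a transcription of the diagram. It uses ordinary 32-bit floats rather than the planned 18-bit format, and ReLU is chosen as the activation only to keep the example short.

```c
#include <stddef.h>

/* Activation function; ReLU is an assumption chosen to keep the sketch free of libm. */
static float activate(float x) { return x > 0.0f ? x : 0.0f; }

/* Output of one node: activation of the weighted sum of inputs plus a bias. */
float node_output(const float *weights, const float *inputs, size_t n, float bias)
{
    float sum = bias;
    for (size_t i = 0; i < n; i++)
        sum += weights[i] * inputs[i];   /* one multiply-accumulate per input */
    return activate(sum);
}
```

In hardware the loop becomes a multiply-accumulate unit fed from the weight and input RAMs, which is why both need a load path and the result needs a read-back path.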

_________________
Robert Finch http://www.finitron.ca


Sun Dec 02, 2018 6:17 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I was working on the segmentation model and realized it wasn’t matching with what was needed. Some of the needs were:
1) common address space for code, data, and stack. This is usually achieved with the “flat” memory model of segmentation. There’s little sense in having multiple base registers available when they are typically always set to the same value.
2) bounded areas with different attributes (execute only, read-only, read-write, privilege levels) *within* the same “segment”. This could be achieved with a base and bounds type architecture with multiple bounds registers.
3) no memory access required to look up attribute or bounds information (too slow). This means restricting the number of registers.
So, I have the following at the moment.
Attachment: BBMS.png (Base And Bounds Memory System)
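A rough C model of such a check follows, with the caveat that the diagram's exact rules are not available here. The field names are borrowed from the SwapContext() listing later in this thread. Everything else is an assumption: that the bounds are offsets relative to the thread base, that the code area is execute-only, and that data below the read-only bound cannot be written.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACCESS_EXECUTE, ACCESS_READ, ACCESS_WRITE } AccessKind;

typedef struct {
    uint64_t tb;        /* thread base, added to every offset */
    uint64_t cl, cu;    /* code lower / upper bound */
    uint64_t ro;        /* read-only bound (interpretation below is an assumption) */
    uint64_t dl, du;    /* data lower / upper bound */
    uint64_t sl, su;    /* stack lower / upper bound */
} Bounds;

static bool in_range(uint64_t a, uint64_t lo, uint64_t hi) { return a >= lo && a <= hi; }

/* Check an offset against the bounds registers only; no memory lookup is needed.
   On success the translated address is stored in *out. */
bool bb_check(const Bounds *b, uint64_t offset, AccessKind kind, uint64_t *out)
{
    bool ok = false;

    switch (kind) {
    case ACCESS_EXECUTE:
        ok = in_range(offset, b->cl, b->cu);                 /* code area only */
        break;
    case ACCESS_READ:
        ok = in_range(offset, b->dl, b->du) || in_range(offset, b->sl, b->su);
        break;
    case ACCESS_WRITE:
        /* assumed: data below ro is read-only, so writes must be at or above it */
        ok = (in_range(offset, b->dl, b->du) && offset >= b->ro)
          || in_range(offset, b->sl, b->su);
        break;
    }
    if (ok)
        *out = b->tb + offset;
    return ok;
}
```

All comparisons use values already held in registers, which matches the third requirement: no memory access is needed to decide whether an access is legal.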

_________________
Robert Finch http://www.finitron.ca


Wed Dec 05, 2018 5:00 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
It appears that your model is intended to speed the evaluation of an address against the memory model described in your diagram above. Therefore, it appears that on a context switch, the operation must save these memory model registers and exchange them for those of the new context. What I don't follow is how this approach is significantly different from the approach where the segment registers are loaded on first use in the new context.

What am I missing?

_________________
Michael A.


Wed Dec 05, 2018 2:10 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The model registers don’t necessarily have to be saved and restored to main memory. You’re right, it isn’t significantly different if segment registers were loaded only on first use. A segmented system could also cache multiple descriptor entries with a cache separate from the data cache. The FT833 has the entire descriptor table cached. My understanding of the IA32 architecture is that it caches the segment descriptors only for the working set of selector registers, so the descriptors must be reloaded from memory every time the usage changes. While the selector remains the same it doesn’t reload, but as soon as it changes it does reload from main memory, and this happens on task switches. If an A->B->A switch is done the selectors will load twice from memory (not a cache). There’s only one entry in the descriptor cache for CS, one entry for DS, etc. So, whenever a selector changes it must go out to memory to load the descriptor cache for the selector. It’s probably not a big deal since task (selector) changes are infrequent. Interrupts are often handled without a task switch in IA32 to avoid the overhead.
FT64 caches multiple entries in addition to the working ones, similar to a TLB cache.
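The difference can be made concrete with a toy model that counts descriptor loads from memory for a single-entry cache versus a multi-entry one. It is a sketch only: the cache size, the FIFO replacement and the names DescCache, desc_cache_init and desc_cache_use are all assumptions, and it models neither IA32 nor FT64 in detail.

```c
#include <stdint.h>

#define CACHE_ENTRIES 4     /* hypothetical size of the multi-entry cache */

typedef struct {
    uint32_t tag[CACHE_ENTRIES];   /* selector held in each entry */
    int      valid[CACHE_ENTRIES];
    int      entries;              /* entries in use, must be 1..CACHE_ENTRIES */
    int      next;                 /* next entry to replace (simple FIFO) */
    int      memory_loads;         /* count of descriptor loads from main memory */
} DescCache;

void desc_cache_init(DescCache *c, int entries)
{
    for (int i = 0; i < CACHE_ENTRIES; i++)
        c->valid[i] = 0;
    c->entries = entries;
    c->next = 0;
    c->memory_loads = 0;
}

/* Make the descriptor for selector available, loading it from memory on a miss. */
void desc_cache_use(DescCache *c, uint32_t selector)
{
    for (int i = 0; i < c->entries; i++)
        if (c->valid[i] && c->tag[i] == selector)
            return;                         /* hit: no memory traffic */
    c->tag[c->next] = selector;             /* miss: replace the oldest entry */
    c->valid[c->next] = 1;
    c->next = (c->next + 1) % c->entries;
    c->memory_loads++;
}
```

Running the sequence A, B, A through a cache initialized with one entry costs three memory loads (the switch to B and the switch back to A each miss), while a cache with room for both costs two, because the return to A hits.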
The processor contains 64 sets of registers (there are really 2048 registers but only 32 visible at once). There are also 64 sets of base and bounds registers built into the core. If the register set being switched to is present in the core then it does not have to be loaded from memory; simply setting the register set selection in the status register is enough to switch the registers around. A group of registers may be wired so that they are always present and never swapped to memory. This should allow faster context switches for interrupts (similar to wiring TLB or cache entries). The very same thing could be done with segment registers, and in fact the base and bound system is really just a form of a segmented system with a different name. RISC-V calls this a base and bound system rather than a segmented system so I figured I’d follow their lead. Using the word segment has become taboo. There is a common base register to all the segments, however. I find the concept slightly different than pure segmentation because segments are defined by their base address and an extent or upper limit. Base and bounds as implemented in FT64 have three components to them rather than two: a base address and multiple upper and lower bounds within the address range.
The base and bounds registers are loaded from entries in the thread control block (TCB, roughly equivalent to a task state segment, TSS), so there’s no descriptor table. IA32 segmentation also defines system segment descriptor entries for things like task switches and call gates. FT64 will handle inter-segment calls using exceptions and leaving it up to the OS software.
I’m not sure why the IA32 has only single cache entries for descriptors. It may have to do with cache coherency and eliminating additional overhead required in a multi-processor system, for instance. Or it may be just the transistor budget that was available at the time the system was designed. You’d think they’d go back and improve the descriptor caching but maybe there’s no market demand.
PS. I’m going by all the documentation I’ve seen on IA32 descriptor caches. I suppose it could be that docs are wrong all over the place.
Code:
 // ----------------------------------------------------------------------------
// Swap thread's context.
//
// Parameters:
// octx - old (current context)
// nctx - new context (target context)
// ----------------------------------------------------------------------------

naked void SwapContext(register TCB *octx, register TCB *nctx)
{
   int n;
   int th;

   // First search and see if the target register set is already present in the
   // processor. If it is, then just switch register sets. Otherwise evict a
   // register set, and load the register set with the target thread's registers.   
   for (n = 63; n >= 0; n--) {
      SetRegsetWindow(n);   // *** need to be able to access a different register set
      th = GetTH();            // get thread handle
      if (th==nctx->number)
         goto j1;               // this jump avoids a lot of loads/stores to memory
   }
   n = GetWiredRegsets();
   n += GetRand(0);
   n &= 63;
   SetRegsetWindow(n);
   octx->tb = GetTB();      // thread base
   octx->cl = GetCL();      // code lower bound
   octx->cu = GetCU();      // code upper bound
   octx->ro = GetRO();      // read-only bound
   octx->dl = GetDL();      // data lower bound
   octx->du = GetDU();      // data upper bound
   octx->sl = GetSL();      // stack lower bound
   octx->su = GetSU();      // stack upper bound
   octx->regs[1] = GetRegx1();
   octx->regs[2] = GetRegx2();
   octx->regs[3] = GetRegx3();
   octx->regs[4] = GetRegx4();
   octx->regs[5] = GetRegx5();
   octx->regs[6] = GetRegx6();
   octx->regs[7] = GetRegx7();
   octx->regs[8] = GetRegx8();
   octx->regs[9] = GetRegx9();
   octx->regs[10] = GetRegx10();
   octx->regs[11] = GetRegx11();
   octx->regs[12] = GetRegx12();
   octx->regs[13] = GetRegx13();
   octx->regs[14] = GetRegx14();
   octx->regs[15] = GetRegx15();
   octx->regs[16] = GetRegx16();
   octx->regs[17] = GetRegx17();
   octx->regs[18] = GetRegx18();
   octx->regs[19] = GetRegx19();
   octx->regs[20] = GetRegx20();
   octx->regs[21] = GetRegx21();
   octx->regs[22] = GetRegx22();
   octx->regs[23] = GetRegx23();
   octx->regs[24] = GetRegx24();
   octx->regs[25] = GetRegx25();
   octx->regs[26] = GetRegx26();
   octx->regs[27] = GetRegx27();
   octx->regs[28] = GetRegx28();
   octx->regs[29] = GetRegx29();
   octx->regs[30] = GetRegx30();
   octx->regs[31] = GetRegx31();
   SetTH(nctx->number);
   SetTB(nctx->tb);
   SetCL(nctx->cl);
   SetCU(nctx->cu);
   SetRO(nctx->ro);
   SetDL(nctx->dl);
   SetDU(nctx->du);
   SetSL(nctx->sl);
   SetSU(nctx->su);
   SetRegx1(nctx->regs[1]);
   SetRegx2(nctx->regs[2]);
   SetRegx3(nctx->regs[3]);
   SetRegx4(nctx->regs[4]);
   SetRegx5(nctx->regs[5]);
   SetRegx6(nctx->regs[6]);
   SetRegx7(nctx->regs[7]);
   SetRegx8(nctx->regs[8]);
   SetRegx9(nctx->regs[9]);
   SetRegx10(nctx->regs[10]);
   SetRegx11(nctx->regs[11]);
   SetRegx12(nctx->regs[12]);
   SetRegx13(nctx->regs[13]);
   SetRegx14(nctx->regs[14]);
   SetRegx15(nctx->regs[15]);
   SetRegx16(nctx->regs[16]);
   SetRegx17(nctx->regs[17]);
   SetRegx18(nctx->regs[18]);
   SetRegx19(nctx->regs[19]);
   SetRegx20(nctx->regs[20]);
   SetRegx21(nctx->regs[21]);
   SetRegx22(nctx->regs[22]);
   SetRegx23(nctx->regs[23]);
   SetRegx24(nctx->regs[24]);
   SetRegx25(nctx->regs[25]);
   SetRegx26(nctx->regs[26]);
   SetRegx27(nctx->regs[27]);
   SetRegx28(nctx->regs[28]);
   SetRegx29(nctx->regs[29]);
   SetRegx30(nctx->regs[30]);
   SetRegx31(nctx->regs[31]);
j1:
   octx->epc = SetEpc(nctx->epc);
   SetRegset(n);      // Actually switch the register set.
   // The register set was switched so a simple return using the return address
   // register can't be performed. Fortunately switching the context is the last
   // thing done in the service routine so it's possible to return with an RTI.
   __asm {
      rti
   }
}
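One detail worth a second look in the listing is victim selection. The index is computed as the wired count plus a random value, masked with 63. If GetRand() can return a value large enough for the sum to exceed 63, the mask wraps the index back into the low range. Assuming wired sets occupy the lowest indices (the listing suggests this but does not state it), a wired set could then be evicted. The following is a minimal sketch of one alternative that stays in the non-wired range; pick_victim_regset and its parameters are hypothetical names.

```c
#define NUM_REGSETS 64

/* Pick a register set to evict from the non-wired range [wired, NUM_REGSETS).
   rnd is any random value; returns -1 if every set is wired or wired is invalid. */
int pick_victim_regset(int wired, unsigned rnd)
{
    int candidates = NUM_REGSETS - wired;

    if (wired < 0 || candidates <= 0)
        return -1;
    return wired + (int)(rnd % (unsigned)candidates);
}
```

For example, with 8 wired sets and a random value of 60, the masked form gives (8 + 60) & 63 = 4, which falls inside the wired range, while the modulo form gives 8 + 60 % 56 = 12.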

_________________
Robert Finch http://www.finitron.ca


Thu Dec 06, 2018 4:23 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:
Thanks for the clarification. The additional on-chip caching of your base-bounds registers will likely provide a performance advantage. I like the base-bounds concept. Once the 32-bit flat memory model became the preferred memory model, I always felt that the number of segment registers in IA32 did not really make sense. This base-bounds approach seems to provide a simpler approach to the control of a single memory segment.

_________________
Michael A.


Fri Dec 07, 2018 4:29 am