 DSD7 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
DSD7 = Dark Star * Dragon Seven, the seventh iteration of the Dark Star series.

My most recent processing core project is a smaller 32-bit core. It’s about 7,500 LUTs (or 12,000 LCs).
About 2,500 of those LUTs are used for a 4-way set-associative instruction cache.

The core is pipelined with a three stage overlapped pipeline. Most instructions are single cycle. The JAL instruction call to an absolute address is single cycle. So most subroutine calls can be single cycle. Otherwise if a register is involved it takes three cycles. Not taken branches are single cycle, taken branches take 3 cycles. Memory operations and multiply / divide take more cycles.

The core uses 16, 32, 48, and 64 bit instructions. 64 bit instructions are the normal immediate mode instructions when they have 32 bit constants. All instructions and data are addressed in multiples of 16 bits. There are no byte operations. Byte operations can be supported by the compiler as bit-field operations.

Up to 2GiW (words) or 8GiB of memory are supported.

There is an internal interrupt stack that allows interrupt nesting up to 16 levels. Interrupts should be fast as the first three general purpose registers are stored on the interrupt stack automatically. The same three registers are restored from the stack automatically by an interrupt return. Stacking and unstacking the registers is a single cycle operation including all three registers.

The core supports compressed instructions with a compressed instruction lookup table. Only 32-bit instructions can be compressed.

Currently the first iteration of the core is coded, but the supporting toolset (assembler, compiler) isn't usable yet. So it's been synthesized but not tested.

http://github.com/robfinch/Cores/blob/master/DSD/trunk/rtl/DSD7.v

_________________
Robert Finch http://www.finitron.ca


Tue Nov 08, 2016 10:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
Today I got the bright idea to pre-load part of the boot rom into the i-cache rather than resetting the i-cache to NOPs on reset. It was almost identical logic. The issue was needing to set the instruction cache tags to a valid state.

I also modified the core so that it uses a common bus for data and cache loads.

Last night I tried the first simulation of the core. The simulator hung while starting up and locked up the whole workstation. Task manager wouldn't even start up. I had to power the workstation on and off. Then Windows didn’t restart properly.

_________________
Robert Finch http://www.finitron.ca


Fri Nov 11, 2016 6:33 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1188
Not good that your new core can break your OS even from simulation! Must be a very powerful core...

(You remind me: I've seen embedded devices boot using cache-only, with enough code there to initialise the RAM controller - or perhaps the flash controller - and then load the next stage of bootstrap. In FPGA land, it's very handy that it costs us nothing to have content in the block rams at time zero.)


Fri Nov 11, 2016 7:39 am

Joined: Tue Jan 15, 2013 5:43 am
Posts: 180
robfinch wrote:
I had to power the workstation on and off. Then Windows didn’t restart properly.
Instead, try off first -- then on! ;) :D

_________________
http://LaughtonElectronics.com


Sat Nov 12, 2016 1:51 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 98
Location: Sacramento, CA, United States
Attachment: rimshot.png


Mike B.


Sat Nov 12, 2016 4:10 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
Today’s problem was bitfield operations performed by the compiler. When a bitfield value is dereferenced normally, the value is sign or zero extended to the width of the register so that it may be used in a subsequent operation (such as a compare). However, when performing an assignment, the dereference of the bitfield being assigned to has to return all the bits in the word unchanged in order for a subsequent bitfield insert operation to work. That means the dereferencing operation has to work differently based on whether the dereference is on the left or right side of an assignment.

This problem was solved in a rather ugly fashion by introducing a bit flag to indicate if a bitfield assignment was taking place. Based on the flag’s setting, either a left-hand-side dereference or a right-hand-side dereference takes place. I’m not sure that this flag works in all cases as only a simple case was tested.
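
To make the left side / right side difference concrete, here is a minimal sketch in C. The helper names and signatures are made up for illustration and are not taken from the DSD7 compiler; it assumes a 32-bit word and a field width between 1 and 32.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* Right-hand-side use: pull the field out and sign-extend it to register width. */
int32_t bf_extract_signed(uint32_t word, unsigned offset, unsigned width)
{
    uint32_t mask  = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    uint32_t field = (word >> offset) & mask;
    uint32_t sign  = 1u << (width - 1);
    return (int32_t)((field ^ sign) - sign);
}

/* Left-hand-side use: start from the whole word, untouched, and replace
   only the bits that belong to the field. */
uint32_t bf_insert(uint32_t word, unsigned offset, unsigned width, uint32_t value)
{
    uint32_t mask = ((width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u)) << offset;
    return (word & ~mask) | ((value << offset) & mask);
}
```

The insert only works if it is handed the full, unmodified word, which is why the dereference on the left side cannot extend or shift the value first.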

I added “hint” instructions to the compiler to make it easier to optimize code. These take the form of a NOP operation with different constants specified in a constant field in the NOP instruction to indicate the hint. The hint instructions are coded as NOP operations in case one makes it through the compiler to the assembler. In theory the hint instructions should not show up in the compiler output. So now the assembler treats a hint instruction as just a remark if it finds one.

I took a step backward for a little bit in order to work on DSD6. DSD6 is a 64-bit machine. It's about 25,000 LUTs though.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 12, 2016 11:58 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
I've been busy for a couple of days updating the DSD7 'C' compiler. I sometimes log things as I work.

2016/11/12
The compiler now supports register parameters to subroutines. The parameter must be declared with the “register” keyword in order for it to work. Previously the compiler only handled parameters passed on the stack. Stack storage space still gets allocated for the parameter, but it’s just a subtract from the SP rather than a memory operation.
I also did some work on the peep-hole optimizer to group together the subtracts into a single operation where possible.

2016/11/13
Function prolog code can’t be fully optimized. This happens because full optimization allocates variables to registers which are first saved on the stack. However the prolog code happens before the stack frame is established and before the registers are saved. Custom prolog code typically sets up the stack and would often be hand optimized assembler code. So this typically isn’t much of a problem.

I was mystified for a number of hours by observing the following instruction output by the compiler in several spots.
ADD #1
This isn’t a stack machine. It was missing operands. I finally managed to narrow the problem down to the fact that more than 256 instruction opcodes were added to the compiler. And the compiler was indexing modulo 256 to identify the instruction. One of the hint instructions got converted into an ADD.
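
A small sketch of how that kind of aliasing happens. The opcode numbers below are invented for the example and are not the compiler's real table; the point is only that an index kept in 8 bits is silently reduced modulo 256.

```c
#include <stdint.h>

/* Hypothetical opcode numbering, not the compiler's real table. */
enum { OP_ADD = 1, OP_HINT_EXAMPLE = 257, OP_COUNT = 300 };

/* Buggy lookup: narrowing the opcode to 8 bits reduces it modulo 256,
   so opcode 257 lands on entry 1. */
unsigned lookup_index_buggy(unsigned opcode)
{
    return (uint8_t)opcode;
}

/* Safer lookup: keep the full width and range-check instead. */
int lookup_index_checked(unsigned opcode)
{
    return opcode < OP_COUNT ? (int)opcode : -1;
}
```

With these made-up numbers the hint opcode 257 aliases to 1, which is how a hint can come out of the table as an ADD with no operands.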

The FMTK system software isn’t re-entrant. That means that on a system call there needs to be a way to detect when a call is occurring in a re-entrant fashion. The obvious way to do this is to set a bit flag (semaphore) somewhere indicating the system is active, then refuse to do a system function if the bit flag is set. The problem here is that the bit flag has to be manipulated *before* the task context is saved. This would be handled on some systems by a test-and-set memory bit instruction. Manipulating the bit flag has to be done without changing any registers. This has resulted in a semaphore/flags register being added as a CSR. There was also a need to store an address reservation status somewhere, so that was included in the register as well.
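
For readers who haven't met the pattern, this is roughly what a test-and-set guard looks like in portable C11. It is only a model of the idea: on DSD7 the flag lives in a CSR, and the names `system_busy`, `sys_enter` and `sys_leave` are invented for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One global "system is active" flag (hypothetical name). */
static atomic_flag system_busy = ATOMIC_FLAG_INIT;

/* Try to enter the system: returns true if we got in, false if the
   system was already active (a re-entrant call that should be refused). */
bool sys_enter(void)
{
    /* test_and_set returns the previous state and sets the flag atomically */
    return !atomic_flag_test_and_set(&system_busy);
}

/* Leave the system and allow the next call in. */
void sys_leave(void)
{
    atomic_flag_clear(&system_busy);
}
```

The test and the set happen as one indivisible step, which is the property the CSR flag has to provide before any task context is saved.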

The compiler has now sprouted inline code as an option, identified by the 'inline' keyword. The inline code has to be in the same file as it’s used (or an included file) because otherwise it would be necessary to handle inline code with a linker. Inline code is handled in a really stupid fashion at the moment. The compiler just copies the code for the subroutine directly inline with the current code stream minus the return statement. It still pushes parameters on stack (unless specified as register parameters) and allocates and deallocates the stack frame for the function. However as it’s intended to be used for short functions that take register parameters it works not too badly.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 13, 2016 10:54 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
2016/11/14
I spent some time today “improving” the DSD6 core based on ideas from DSD7. DSD6 is a more complex core supporting multiple operating levels and segmentation. I synchronized the changes for the DSD6 compiler with the compiler for DSD7.

I ran into a case where I decided to implement CSR access via the contents of a register in addition to being able to specify the CSR register number directly in the instruction. Previously the CSR register number could only be specified as a constant. This made it difficult to write a small routine to set the CSR based on a variable register number. I wanted a routine with an interface like:
Code:
int ReadWriteCSR(int CSRnum, int CSRvalue);

So that setting the CSR could be done from a high-level language. Sure, it could be implemented with a giant switch statement, but there are potentially 4096 registers/cases! The PowerPC has something like this called an indirect move. As far as I can tell this isn’t possible on the RISC-V core.
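
As a rough software model of what register-indirect CSR access buys, here is how the routine might look in something like an emulator. The array `csr_file` stands in for the hardware CSR file, and the return-the-old-value behaviour is an assumption (the post does not say what the routine returns).

```c
#define CSR_COUNT 4096          /* "potentially 4096 registers" per the post */

/* Stand-in for the hardware CSR file (hypothetical). */
static int csr_file[CSR_COUNT];

/* Assumed semantics: write the new value, return the old one.
   The CSR number is an ordinary run-time value, so no switch is needed. */
int ReadWriteCSR(int CSRnum, int CSRvalue)
{
    unsigned idx = (unsigned)CSRnum % CSR_COUNT;   /* keep the index in range */
    int old = csr_file[idx];
    csr_file[idx] = CSRvalue;
    return old;
}
```

With only an immediate CSR number in the instruction, each of the 4096 indices would need its own case; with the number in a register it is a single indexed access.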

_________________
Robert Finch http://www.finitron.ca


Tue Nov 15, 2016 1:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
2016/11/15
I coded up a simple MMU for DSD7 which is about halfway in complexity between a simple bank switch and a full-blown paging MMU. The MMU contains 32 mapping tables. Each mapping table can map 64MB of memory out of a 512MB total memory. Memory is mapped in 64kB pages, so each table has 512 entries. If there are more tasks than maps, multiple tasks have to share the same map. It sounds complex but only uses about 40 LUTs and 6 block RAMs. When mapping is enabled, a clock cycle is added to each memory access for the page table lookup.
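
The lookup itself can be sketched in a few lines of C. This is a model under stated assumptions, not the RTL: the table and entry counts come from the description above, the 16-bit page offset is inferred from the 64k page size (in whatever unit the core addresses), and the function names are invented.

```c
#include <stdint.h>

enum { NUM_MAPS = 32, ENTRIES_PER_MAP = 512, PAGE_SHIFT = 16 };

/* One physical page number per virtual page, per map (hypothetical layout). */
static uint16_t map_table[NUM_MAPS][ENTRIES_PER_MAP];

/* Point virtual page `vpage` of map `map` at physical page `ppage`. */
void mmu_set_entry(unsigned map, unsigned vpage, uint16_t ppage)
{
    map_table[map % NUM_MAPS][vpage % ENTRIES_PER_MAP] = ppage;
}

/* Translate: the upper bits select a table entry, the low 16 bits pass through. */
uint32_t mmu_translate(unsigned map, uint32_t vaddr)
{
    uint32_t vpage  = (vaddr >> PAGE_SHIFT) % ENTRIES_PER_MAP;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);
    uint32_t ppage  = map_table[map % NUM_MAPS][vpage];
    return (ppage << PAGE_SHIFT) | offset;
}
```

The table read is the step that costs the extra clock cycle when mapping is enabled.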

I also updated a simple PIC (programmable interrupt controller) for use with DSD7. Together with the cpu core and mmu these are all wrapped up in a module called DSD7_mpu.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 15, 2016 4:40 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
Added the __attribute__ keyword à la gcc and a function attribute called __no_temps which indicates that the function doesn’t use any temporary registers. This means that when the function is called, the temporary registers don’t need to be saved and restored. I tried to find the same attribute in gcc but couldn’t.
Modified how the register keyword works. Now if a parameter is declared as a register parameter storage is *not* allocated on the stack for it (previously storage would be allocated). Unless the parameter is also declared as an “auto”. “auto register” parameters allocate storage on the stack for the parameter. This is usually not necessary.
Ran into a problem with register parameters. Silly me, it’s more complex than I thought. If a function with register parameters calls another function with register parameters, the parameters from the first function need to be saved and restored. That was the first problem I encountered, which was easily fixed. The second problem was that a register parameter in the caller would get overwritten when it was assigning to a register because both functions are trying to use the same registers. Code got generated like this:
Code:
            push    r18
            push    r19
            mov     r18,r19
            mov     r19,r18

This is wrong because the value in r18 disappears when it is overwritten with r19, and the original value of r18 is what the second move needed. This can be handled with temporaries, but the way I’ve coded it at the moment is ugly. The compiler generates all the expressions for the parameters ahead of time. Next it checks whether any of them conflict with the desired parameter registers. If there is a conflict, the value is assigned to a temporary instead. It sounds like it could work, and it does, but only for a handful of parameters. With too many parameters the compiler will run out of temporaries to hold the generated expressions in.
One thought I had was to boost up the number of registers in the core to compensate for the compiler’s limitations. If there were 20 or 30 registers available for temporaries it likely wouldn’t be a problem. A 64 register machine is a thought.

As it is the code now does this (where r3,r4 are temps):
Code:
            push    r18
            push    r19
            mov     r3,r19
            mov     r4,r18
            mov     r18,r3
            mov     r19,r4

Right now the compiler has a limit of 20 or possibly fewer parameters. I've always felt that if there were many more than say a half-dozen parameters there is probably something structurally wrong with the code.
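
The r18/r19 case above is an instance of doing several register moves "at once". A small model in C, with the register file as an array, shows why the naive order fails and why staging through temporaries works. This is a simplification (it stages every value, not just the conflicting ones), and the names and the limit of 8 temporaries are made up for the sketch.

```c
#include <stddef.h>

enum { NUM_REGS = 32, MAX_TEMPS = 8 };   /* hypothetical sizes */

/* Naive sequence: each move may clobber a source a later move still needs. */
void moves_naive(int regs[NUM_REGS], const int dst[], const int src[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        regs[dst[i]] = regs[src[i]];
}

/* Two-phase sequence: read every source into a temporary first,
   then write the destinations. Returns -1 when temporaries run out. */
int moves_via_temps(int regs[NUM_REGS], const int dst[], const int src[], size_t n)
{
    int temps[MAX_TEMPS];
    if (n > MAX_TEMPS)
        return -1;
    for (size_t i = 0; i < n; i++)
        temps[i] = regs[src[i]];
    for (size_t i = 0; i < n; i++)
        regs[dst[i]] = temps[i];
    return 0;
}
```

Swapping two registers with the naive version leaves both holding the same value; the two-phase version swaps them correctly, at the cost of one temporary per staged value.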

_________________
Robert Finch http://www.finitron.ca


Wed Nov 16, 2016 3:29 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
2016/11/16
I broke down and added a pop instruction to the instruction set. The pop instruction requires a second result bus so that both the SP update result and the popped data can be present in the core at the same time. It also needs more forwarding logic. Doing so increased the size of the design by a whopping 1.5%. There are so many other things that make the design the size it is that I figure it’s worth it for the pop instruction. The core has also been switched back to storing return addresses on the stack and using CALL / RET instructions. It actually has a jump-and-link instruction needed to handle a plain old jump so either calling convention could be used. I made the stack pointer and base pointer registers configurable in a config register.

Some stack bounds checking was added for loads and stores along with bus error checking. If an exception occurs the instruction acts like a branch and branches back to itself. However it feeds in an INT instruction into the instruction stream at that point rather than repeating the instruction, in order to handle the exception. This means that exceptions have double branching which takes six clock cycles. But it’s only for exceptional conditions that should never occur.

I got the assembler mostly working. I noticed that the compiler never outputs a compare instruction. This is because compare and branch can be done in one operation as just a branch.

With all the creeping features the core has grown past 8,600 LUTs. This is larger than desired. I was hoping it would fit into about 1/3 or less of an xc7a35 but it’s about 40%. There must be room for other controllers!

_________________
Robert Finch http://www.finitron.ca


Thu Nov 17, 2016 3:54 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1188
Very interesting to hear what kinds of changes in the HDL cause what kind of changes in the fitting - and indeed, what kind of advantages they provide to the software layer above.

CPUs on FPGA do open up a lot of design space
- how you code your HDL
- how you connect things in your SoC
- what you choose to put into your SoC
- how that affects the compiler
- and how it affects the application


Thu Nov 17, 2016 3:56 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
2016/11/17
I managed to figure out what was hanging the simulator and so was able to start simulation runs for DSD7. I quickly hit cancel on the simulation and managed to see where it stopped. In LFSR code there was an unconditional always block assigning a constant value based on a parameter. For some reason or other the simulator didn’t like that. My guess is that sim thought it was some sort of asynchronous logic block. I changed the code to register the constant on the clock edge and sim works now. Hopefully synthesis gets rid of the extra clocking.

Found a number of errors already.

One gotcha with the jump/call instruction was that it assumed in the IF stage that it’s an absolute jump and doesn’t need a register addition to determine the target address. The core did this so that absolute address jumps can be single cycle. However it couldn’t be done that simply. The core needed to look at the register to see if it was r0 as well. Otherwise the core could conceivably try to jump anywhere in memory (based on the constant in the instruction). There might not be valid instructions in that case.

Further along, I discovered that you can’t reliably perform unconditional control transfers in the I-Fetch stage. Because suppose you do something like the following (yes I actually hit this case):
Code:
JMP   MyResetRout
DW   <garbage data that looks like another jump>

If the data following the unconditional transfer of control looks like another jump, that would get executed in the IFetch stage as well, but it might just be a jump to an invalid address. You might think this shouldn’t be a problem but for the following code:
Code:
CALL   MySubroutWithStaticParams
DW   Parm1
DW   Parm2

The parameters might just get interpreted as a jump depending on their values.


I spent some more time working on DSD6.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 19, 2016 6:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
2016/11/18
Previously I said you couldn’t do a reliable jump or call in the IF stage. That was plain wrong. If the call or jump is performed immediately, assuming the target address is known, instructions will be fetched immediately from the target address. I got confused as I had a PC-relative jump being performed. I commented out the code that was causing grief, in order to make the core smaller.

Found what I think might be a bug in the simulator. I had the CIT table interfaced like the following, which didn’t work. It stored data to every other address, as if address lines [11:0] were hooked up.
Code:
DSD7_ciLookupTbl u3
(
    .wclk(clk_i),
    .wr(cs_hl),
    .wadr(adr_o[12:1]),
    .wdata(dat_o),
    .rclk(~clk_i),
    .radr({isid[1:0],insn[15:6]}),
    .rdata(cinsn)
);

When I assigned the address slice to a wire first, it worked.
Code:
wire [11:0] citAdr = adr_o[12:1];
DSD7_ciLookupTbl u3
(
    .wclk(clk_i),
    .wr(cs_hl),
    .wadr(citAdr),
    .wdata(dat_o),
    .rclk(~clk_i),
    .radr({isid[1:0],insn[15:6]}),
    .rdata(cinsn)
);

I suppose it might be a bad memory bit on my machine 1->0. I’ve encountered bad bits before.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 19, 2016 8:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 870
Location: Canada
2016/11/19

A successful simulation of compressed code at work was performed. It involves setting up the MMU, which also seems to work. The MMU setup code is shown below. The compressed instructions are the ones with the lower six bits equal to ‘1F’ hex.
Code:
                       ;----------------------------------------------------------------------------
                           ; Setup MMU
                           ;
                           ; Sets up map #0 so that virtual and physical addresses match.
                           ;----------------------------------------------------------------------------
                           SetupMMU:
FFFFC03D 019F                csrrw   r0,#3,r0      ; access map #0, disable paging
FFFFC03E 0809 8000          ldi      r1,#$FFDC4000   ; mapping table address
FFFFC040 4000 FFDC
FFFFC042 01DF                ldi      r2,#512         ; number of map entries
FFFFC043 021F                ldi      r3,#0
                           .smmu1:
FFFFC044 025F                sh      r3,[r1]
FFFFC045 029F                addi   r3,r3,#1
FFFFC046 02DF                addi   r1,r1,#1
FFFFC047 1084 FFFF          subi   r2,r2,#1
FFFFC049 0092 FFD9          bne      r2,r0,.smmu1
FFFFC04B 040F 000D          csrrs   r0,#3,#$80000000   ; turn on paging
FFFFC04D 0000 8000
FFFFC04F 001A                nop                  ; synchronize register update
FFFFC050 001A                nop
                                 ; The following ret should only work if paging was setup
                                 ; correctly.
FFFFC051 0099                ret


Compressed instructions make the best use of cache resources. They can make the program faster by allowing more instructions to fit into the cache. I compiled a couple of thousand lines of code, and with compressed instructions the code is over 24 bits per instruction. Out of about 1,500 lines of code only 150 different compressed instructions were present. The assembler can’t compress a number of different instructions for a variety of reasons. Also, the most common short instructions, such as pushes and pops, are already present without having to be compressed further.
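
For anyone reading the listing above, the check for a compressed instruction and the lookup index can be modelled in C. The index layout follows the `.radr({isid[1:0],insn[15:6]})` connection shown in the earlier post; the function names are invented, and this is a sketch rather than a description of the decoder.

```c
#include <stdint.h>
#include <stdbool.h>

/* A 16-bit parcel is a compressed instruction when its low six bits are 0x1F. */
bool is_compressed(uint16_t insn)
{
    return (insn & 0x3Fu) == 0x1Fu;
}

/* Lookup table index: two isid bits on top of instruction bits [15:6]. */
unsigned cit_index(unsigned isid, uint16_t insn)
{
    return ((isid & 0x3u) << 10) | (unsigned)(insn >> 6);
}
```

Applied to the listing, 019F, 01DF, 021F and 025F come out as table entries 6, 7, 8 and 9, while 0809 (low six bits 09) is an ordinary uncompressed instruction.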

I moved the reset address of the core to $FFFC0000 from $FFFFFFF4 at the end of memory. The problem was that for compiled code the assembler would group the rodata segments together and place them *after* the code segment. With the code segment org’d to $FFFFFFF4 for reset, the assembler ended up trying to wrap around to a zero address for the rodata and got all confused. The simplest solution was to just move the reset address so there was enough room for the rodata segment.
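
The wrap-around is easy to see with 32-bit arithmetic. A small sketch follows; the function names are made up, and lengths are in the core's address units.

```c
#include <stdint.h>

/* Address units left between `org` and the top of a 32-bit address space. */
uint64_t room_before_wrap(uint32_t org)
{
    return (uint64_t)UINT32_MAX - org + 1u;
}

/* Where a segment placed `code_len` units after `org` would start,
   with the 32-bit wrap-around the assembler ran into. */
uint32_t next_segment_start(uint32_t org, uint32_t code_len)
{
    return org + code_len;   /* unsigned arithmetic wraps modulo 2^32 */
}
```

Starting at $FFFFFFF4 leaves only 12 address units before the next segment lands at zero, while $FFFC0000 leaves $40000.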

One of the next steps is to set up a system on chip in real hardware.
Just about time to start work on a software emulator as well.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 20, 2016 12:41 pm