FISA64

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I'm stuck on how to manage the tags for memory management (the memory management system outlined previously). The tag memory has to be accessed in parallel with an ongoing memory access, because the access rights are checked before the actual access to non-tag memory takes place. This requires two read ports on the tag memory for access checks (one for instructions and one for data). Another read/write port is required for reading and updating the tag memory itself. Part of the problem is the number of ports required on the tag memory, which should be kept to a minimum. To avoid adding more ports, I'd like to reuse the effective address bits that index the tag memory during an access check in order to read and write the contents of the tag memory. This would keep the number of ports to a minimum, but it requires special instructions in the processor.
In order to update or read the tags, should the tag memory be implemented as if it were ordinary memory accessed via load and store instructions? Or should it be accessed like a set of special-purpose registers? There are pros and cons to both approaches.

If the tag memory is part of the normal memory system, then where in memory should it go? There could be a lot of tags in a large memory system; however, part of the idea is to have a dedicated memory for the tags.

I have it looking like this so far:
Code:
reg [31:0] rpc;      // registered PC value
always @(negedge clk)
   rpc <= pc;
reg [15:0] lottags [2047:0]; // dedicated tag memory
always @(posedge clk)
   if (advanceEX && xopcode==`RR && xfunct==`MTSPR && xir[24:19]==`TAGS)
      lottags[a[10:0]] <= a[26:16];
wire [15:0] lottag = lottags[ea[26:16]];    // This is the effective address port I'd like to reuse
wire [15:0] lottagX = lottags[rpc[26:16]];
wire [15:0] lottagR = lottags[rSprn];

reg [9:0] lotgrp [7:0]; // lot groups that a process belongs to (allows memory sharing)

wire isLotOwner = km | (((lotgrp[0]==lottag[15:6]) ||
               (lotgrp[1]==lottag[15:6]) ||
               (lotgrp[2]==lottag[15:6]) ||
               (lotgrp[3]==lottag[15:6]) ||
               (lotgrp[4]==lottag[15:6]) ||
               (lotgrp[5]==lottag[15:6])))
               ;
wire isLotOwnerX = km | (((lotgrp[0]==lottagX[15:6]) ||
               (lotgrp[1]==lottagX[15:6]) ||
               (lotgrp[2]==lottagX[15:6]) ||
               (lotgrp[3]==lottagX[15:6]) ||
               (lotgrp[4]==lottagX[15:6]) ||
               (lotgrp[5]==lottagX[15:6])))
               ;


Then later on for access checks:
Code:
LOAD2:
   begin
      // Check for read attribute on lot
      if ((isLotOwner && (km ? lottag[2] : lottag[5])) || !pe) begin
         wb_read1(ld_size,ea);
         next_state(LOAD3);
      end
//      else begin
//         exc <= `EXC_ACCESS;
//      end
   end
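
For anyone following the bit fields, the read check above can be modelled in software. This is only a rough sketch in Python, not part of the core. The function names are made up, and the field meanings (bits 15:6 as the lot group, bit 5 as the user-mode read bit, bit 2 as the kernel-mode read bit, lots indexed by ea[26:16]) are inferred from the Verilog above rather than from any documentation.

```python
# Software model of the lot-tag read check; all names are illustrative only.

def lot_index(ea):
    """Tag memory index: effective address bits 26:16 (2048 lots)."""
    return (ea >> 16) & 0x7FF

def is_lot_owner(tag, lotgrp, km):
    """Kernel mode always owns; otherwise one of the first six
    lot groups must match tag[15:6], as in the isLotOwner wire."""
    group = (tag >> 6) & 0x3FF
    return km or any(g == group for g in lotgrp[:6])

def read_allowed(tag, lotgrp, km, pe):
    """Mirror of the LOAD2 condition: (isLotOwner && read bit) || !pe."""
    read_bit = (tag >> 2) & 1 if km else (tag >> 5) & 1
    return (not pe) or (is_lot_owner(tag, lotgrp, km) and read_bit == 1)

if __name__ == "__main__":
    groups = [3, 0, 0, 0, 0, 0, 0, 0]
    tag = (3 << 6) | (1 << 5)      # group 3, user-mode read permitted
    print(lot_index(0x00123456))   # lot number for this address
    print(read_allowed(tag, groups, km=False, pe=True))
```

A model like this is handy for generating expected results for a testbench: feed the same tag, group, km and pe values to the simulation and compare.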

_________________
Robert Finch http://www.finitron.ca


Sat Mar 07, 2015 9:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Adding a PEA (push effective address) instruction to the design could make for shorter and faster code at the cost of a small amount of additional processor complexity. I ran into this sample when looking at the compiler output (it uses a lot of LEA's followed by stores):
Code:
            subui   sp,sp,#64
            lea     r3,-72[bp]
            sw      r3,56[sp]
            lea     r3,-64[bp]
            sw      r3,48[sp]
            lea     r3,-56[bp]
            sw      r3,40[sp]
            lea     r3,-48[bp]
            sw      r3,32[sp]
            lea     r3,-40[bp]
            sw      r3,24[sp]
            lea     r3,-32[bp]
            sw      r3,16[sp]
            lea     r3,-24[bp]
            sw      r3,8[sp]
            lw      r3,-8[bp]
            sw      r3,[sp]
            jsr     date_split
            addui   sp,sp,#64


The code with a PEA instruction becomes
Code:
            pea     -72[bp]
            pea     -64[bp]
            pea     -56[bp]
            pea     -48[bp]
            pea     -40[bp]
            pea     -32[bp]
            pea     -24[bp]
            lw      r3,-8[bp]
            subui   sp,sp,#8
            sw      r3,[sp]
            jsr     date_split
            addui   sp,sp,#64

It's actually fairly common to pass pointers to objects to routines, and this requires pushing an address onto the stack. The PEA instruction could also be used for pushing small constants onto the stack, which is also a fairly common operation.
Incorporating PEA into the design doesn't fit the minimalist philosophy of RISC, but it does increase code density and reduce the number of instructions executed.
Using a PEA instruction will require some fancy manipulation of the stack frame.
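
The equivalence of the two sequences can be checked with a toy model. This is a sketch in Python under an assumed PEA semantics (decrement sp by 8, then store bp plus the displacement at [sp]); the function names and the dictionary used as memory are invented for illustration. Instruction counts exclude the jsr and the final addui, which are the same in both versions.

```python
# Toy model of the two argument-passing sequences.
# Assumed PEA semantics: sp -= 8, then mem[sp] = bp + disp.

def args_with_lea(mem, sp, bp):
    """subui sp,sp,#64, seven lea/sw pairs, then one lw/sw pair."""
    count = 1                                  # subui
    sp -= 64
    for slot, disp in zip(range(56, 0, -8), range(-72, -16, 8)):
        mem[sp + slot] = bp + disp             # lea r3,disp[bp] ; sw r3,slot[sp]
        count += 2
    mem[sp] = mem[bp - 8]                      # lw r3,-8[bp] ; sw r3,[sp]
    count += 2
    return sp, count

def args_with_pea(mem, sp, bp):
    """Seven pea instructions, then lw / subui / sw for the last argument."""
    count = 0
    for disp in range(-72, -16, 8):
        sp -= 8                                # pea disp[bp]
        mem[sp] = bp + disp
        count += 1
    sp -= 8                                    # subui sp,sp,#8
    mem[sp] = mem[bp - 8]                      # lw r3,-8[bp] ; sw r3,[sp]
    count += 3
    return sp, count

if __name__ == "__main__":
    bp, sp = 0x1000, 0x0F00
    m1 = {bp - 8: 42}
    m2 = dict(m1)
    print(args_with_lea(m1, sp, bp), args_with_pea(m2, sp, bp), m1 == m2)
```

In this model both versions leave the same stack contents and the same stack pointer, with 17 instructions in the LEA version against 10 in the PEA version.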

_________________
Robert Finch http://www.finitron.ca


Wed Mar 11, 2015 5:19 am

Joined: Fri Jan 10, 2014 9:46 pm
Posts: 37
removed


Last edited by legacy on Sun Apr 12, 2015 6:09 pm, edited 1 time in total.



Fri Mar 13, 2015 1:43 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
i am developing a very easy soft core, i do not find simple to implement pipeline, so i want a RISC core without it, and i have design my soft core in a simple multi cycle way. The first thing we always need is the TAP machine! The hw debugger! So i am developing the TAP, plus it's interface (host side).


Sounds like you've got a great start. I hope to hear more about your project.

I've had great success with things working in the real world if they work in simulation.

Got a first simulation run done on FISA64. It checks that constant prefixing works without the interference of interrupts or cache misses. A simple program was run to add a register to itself:

Code:
                           ;
                           ;
                               code
                               org     $E000
00E000 7C 2B 1A 09        ldi     r1,#$1234567812345678
00E004 7C 34 12 78
00E008 0A 10 F0 AC
00E00C 82 10 02 28        addu    r1,r1,r1
00E010 3F 00 00 00        nop
00E014 3F 00 00 00        nop
00E018 3F 00 00 00        nop


I changed a few things from the original architecture. There are only 32 registers now, and instructions are 32 bits. For the scalar processor the pipeline stayed about the same.

Simulation Output:
******************************
******************************
Finished Reset
******************************
******************************
0000e000: 091a2b7c
0000e004: 7812347c
0000e008: acf0100a
0000e00c: 28021082
0000e010: 0000003f
r 1 = 1234567812345678
0000e014: 0000003f
r 1 = 2468acf02468acf0
0000e018: 0000003f
0000e01c: 00000000
0000e020: xxxxxxxx
0000e024: xxxxxxxx
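
The listing and the simulation output can be cross-checked by hand or with a few lines of script. The sketch below (Python, with made-up helper names) assumes the listing shows bytes in memory order and the simulator prints each fetched word as a little-endian 32-bit value, which is what the two dumps above suggest. It also redoes the 64-bit add.

```python
import struct

MASK64 = (1 << 64) - 1

# Bytes exactly as they appear in the assembler listing, in address order.
LISTING = bytes.fromhex("7C2B1A09" "7C341278" "0A10F0AC" "82100228")

def fetch_words(data):
    """Interpret the byte stream as little-endian 32-bit words."""
    count = len(data) // 4
    return [f"{w:08x}" for w in struct.unpack(f"<{count}I", data)]

def addu64(a, b):
    """Unsigned 64-bit add with wraparound, as for addu r1,r1,r1."""
    return (a + b) & MASK64

if __name__ == "__main__":
    for addr, word in zip(range(0xE000, 0xE010, 4), fetch_words(LISTING)):
        print(f"{addr:08x}: {word}")
    print(f"r 1 = {addu64(0x1234567812345678, 0x1234567812345678):016x}")
```

The words and the final register value printed by this script match the simulation output above.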

_________________
Robert Finch http://www.finitron.ca


Sat Mar 14, 2015 2:10 am

Joined: Fri Jan 10, 2014 9:46 pm
Posts: 37
removed


Last edited by legacy on Sun Apr 12, 2015 6:10 pm, edited 2 times in total.



Sat Mar 14, 2015 12:36 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
the real problem is not to simulate, but to debug, and to achieve this purpose i am designing an hw debugged attached to the soft core, a TAP in short words, something easier than "ejtag" (MIPS patent). Still not completed, btw


One of the things I've wondered about is how to hook up a debugging circuit so that a processor can be tested and debugged adequately. Processors tend to be complex, and testing takes place at different stages of development. I suspect it's not a simple thing to do. Coming up with a list of operations the TAP should support could be a nightmare. For instance, how do you test from a test rig that pipeline overlapping works? Or what about branch prediction? Or arbitrary ALU ops? I would think the debugging circuit would have to be tied tightly to the design. Another issue is how much debugging circuitry to place in the design, and where. In a complex CPU there might be half a dozen internal data buses. Which one is golden? A TAP has been one of the things on my to-do list, but it adds a whole level of complexity that I haven't wanted to deal with.

During development my current means of verification is to write small software test routines and run them in simulation. It is possible to dump almost anything in simulation: registers, data buses, pipeline controls, etc. Almost all of the bugs are worked out before anything is tested in real hardware. But what about a system that's almost finished and working in the real world? Usually by the time the processor is working well enough to run in real hardware, it is possible to code things like register dumps, memory dumps, and perhaps a small monitor program. Then the real (software) debugging nightmare begins.

***********************

From simulation: processor CPI is about 2.4 without branch prediction, and just a little over 1.0 with branch prediction, for a simple test program and assuming single-cycle memory access.

_________________
Robert Finch http://www.finitron.ca


Sat Mar 14, 2015 3:49 pm

Joined: Fri Jan 10, 2014 9:46 pm
Posts: 37
removed


Last edited by legacy on Sun Apr 12, 2015 6:10 pm, edited 3 times in total.



Sat Mar 14, 2015 4:29 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I'm sure one person can do a superscalar machine (or, generally, an ambitious implementation of an architecture) if they know what they are doing and take care - not any person, perhaps, but someone with a bit of experience and self-discipline. Surely though it shouldn't be a first project - much better to build up. But then I know Rob has made several previous microprocessors.


Sat Mar 14, 2015 4:50 pm

Joined: Fri Jan 10, 2014 9:46 pm
Posts: 37
removed


Last edited by legacy on Sun Apr 12, 2015 6:10 pm, edited 2 times in total.



Sat Mar 14, 2015 4:57 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
From a perspective of forum etiquette, it's probably best not to be too strident with negative views about someone else's project. Better to stay silent if you think it can't be done, or shouldn't be tried. A helpful suggestion is always welcome, but if all you have is a negative sentiment, that's not a useful conversation to have here. In this particular case, a positive approach would be to ask politely what the story is for testing a complex implementation.


Sat Mar 14, 2015 6:59 pm

Joined: Fri Jan 10, 2014 9:46 pm
Posts: 37
removed


Last edited by legacy on Sun Apr 12, 2015 6:10 pm, edited 1 time in total.



Sat Mar 14, 2015 7:19 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
i found out 43 hw bugs !

There is some truth to what you say. Doing a good job validating a core requires a lot of man-time. It's partly a matter of the quality achieved. As your friend shows, it's entirely possible to get a core working and perhaps useful, even though it is not perfect. It sounds like he's got his core working well enough for his use as he handed it over to someone else. One has to be careful about achieving perfection. The problem is it's achievable *at any cost*. Some people pursue perfection at ridiculous costs. The intended market for a core can dictate the resources spent on validation.
It's sometimes acceptable to leave bugs in hardware that don't detract from the usefulness of a device. It comes down to how much time and effort one wants to spend to achieve perfection. Less time spent in validating a core may mean it has more bugs in it, and as a result is of lower quality. It comes down to how acceptable the bugs are.

For instance, one thing I haven't tested thoroughly is core clock rate handling. It's present in the core but it's not been a high priority for me to test. It's possible that it doesn't work at all and changing the clock rate might hang the processor. Even if it doesn't work the processor can still be used. I had a bug in the first branch predictor I developed. It didn't actually work. But the core still ran at a lower performance level (lower quality). Later I fixed the bug.

Quote:
also, nothing of personal, but i can't trust you are able to run a superscalar project as it is simply impossible with one man power, the effort requires a lot testing guys and and a lot of resources to achieve the purpose


Quote:
I'm sure one person can do a superscalar machine (or, generally, an ambitious implementation of an architecture) if they know what they are doing and take care


I'm not the first person to attempt developing a superscalar processor. It is not impossible to do, though it requires a little bit of faith. I found a superscalar DLX processor on the web, developed, I believe, as a university project by only a couple of people. Also, I based my superscalar on someone else's work; they basically came up with the processor themselves. However, you are correct in the sense that the processor wasn't perfect. It was just good enough to use as an academic demonstration. I've already gotten one superscalar basically working in simulation, but it's too big for a low-cost FPGA.

Quote:
my softcore has no pipeline (so no branch prediction, no delayed slot, and no hazards) and no cache just because i do not want to use all my free time just for one hobby (i have many hobbies), a friend of mine is working on a pipelined version of mips3, and i can assure you he has been working on such a things for 4 years, pretty simulating every kind of things. One day he asked me to put real code on his soft core i found out 43 hw bugs !



A good place to start is with something simple. However, after a while one gets experienced and the simple things become boring, so more challenging work is needed. With some past experience, one finds they can work much faster at a new project. I've been working on processor projects for over a decade now. I've gotten really fast at some things.


One thing about testing is minimizing the size of code to test. With fewer lines of code present, there's less testing required. FISA64 is currently only about 2,000 lines of code. That's small.

_________________
Robert Finch http://www.finitron.ca


Sun Mar 15, 2015 2:49 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Tonight's quandary is a design decision that leaves the same FISA64 branch instruction branching to one of two different locations depending on whether or not it's predicted taken. FISA64 makes use of immediate prefixes to extend immediate values beyond the 15-bit limit set by the instruction format.

Branch instructions can't make proper use of an immediate prefix because, to keep the hardware simpler, they don't detect an immediate prefix at the IF stage. (There is no requirement for conditional branch displacements of more than 15 bits.) However, also to keep the hardware simple, a branch instruction uses the same immediate value that is calculated for other instructions in the EX stage. This could lead to a branch going to two different locations if an immediate prefix is used with it.

For example, suppose a prefix is used with a branch, BEQ *+$100010 for instance (the $100000 part of the displacement would require a prefix). Then the branch will go to *+$10 if it is predicted taken (ignoring the prefix), but to *+$100010 if it's predicted not taken and then taken later in the EX stage.

If the branch is predicted taken, it'll branch using the 15-bit displacement field from the instruction. If the branch is predicted not taken, but is taken later in the EX stage, it'll branch using the full immediate value, which with prefixes could be up to 64 bits. The solution is that the assembler never outputs branches with prefixes. There is no hardware protection against using an immediate prefix with a branch. It's a feature, not a bug.

In the IF stage, rather than look at the previous instructions for an immediate prefix, the processor simply ignores any prefix that is present and sign extends the branch displacement in the instruction.

IF stage:
Code:
if (iopcode==`Bcc && predict_taken) begin
   pc <= pc + {{47{insn[31]}},insn[31:17],2'b00};   // Ignores potential immediate prefix
   dbranch_taken <= TRUE;
end

However, the EX stage uses a full immediate including any prefix, also to simplify hardware.

EX stage:
Code:
`Bcc:       if (takb & !xbranch_taken)
               update_pc(xpc + {imm,2'b00});   // This uses a "full" immediate value
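
The disagreement between the two stages can be reproduced with a small model. This is a sketch in Python, not the core's logic. It assumes the prefix supplies the bits above the low 15 bits of the word displacement and that both stages add to the branch's own address; the function names, including the assembler-side check, are hypothetical.

```python
# Model of the IF-stage and EX-stage branch target calculations.
# Assumption: full_imm is the word displacement; the instruction holds its low 15 bits.

MASK64 = (1 << 64) - 1

def sext15(field):
    """Sign-extend a 15-bit displacement field."""
    field &= 0x7FFF
    return field - 0x8000 if field & 0x4000 else field

def if_stage_target(pc, full_imm):
    """IF stage: only the 15-bit field in the instruction is seen."""
    return (pc + (sext15(full_imm) << 2)) & MASK64

def ex_stage_target(xpc, full_imm):
    """EX stage: the full immediate, prefix included, is used."""
    return (xpc + (full_imm << 2)) & MASK64

def targets_agree(pc, full_imm):
    return if_stage_target(pc, full_imm) == ex_stage_target(pc, full_imm)

def fits_without_prefix(full_imm):
    """Check an assembler could apply before emitting a branch."""
    return -0x4000 <= full_imm <= 0x3FFF

if __name__ == "__main__":
    pc = 0xE000
    imm = 0x100010 >> 2    # word displacement for BEQ *+$100010
    print(hex(if_stage_target(pc, imm)), hex(ex_stage_target(pc, imm)))
    print(targets_agree(pc, 0x10 >> 2), fits_without_prefix(imm))
```

For the BEQ *+$100010 case the model gives *+$10 from the IF-stage path and *+$100010 from the EX-stage path, while any displacement that fits in 15 bits gives the same target from both.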

_________________
Robert Finch http://www.finitron.ca


Sun Mar 15, 2015 5:19 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
You could take a MIPS-like approach here, and say that having such a prefixed branch is undefined behaviour.


Sun Mar 15, 2015 7:13 am

Joined: Fri Jan 10, 2014 9:46 pm
Posts: 37
removed


Last edited by legacy on Sun Apr 12, 2015 6:10 pm, edited 14 times in total.



Sun Mar 15, 2015 12:39 pm