 ANY-1 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Started breaking out the memory state machine and its components into a separate module. It's about 1/3 to 1/2 of the main source file. The state machine controls access for load and store operations, cache line fills, and TLB and other operations.
This should improve code re-usability.

_________________
Robert Finch http://www.finitron.ca


Fri Jun 11, 2021 3:33 pm
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got a memory controller written. It incorporates instruction, data, and key caches and also processes load and store requests. Loads and stores use request and response fifos. The caches are hidden from the outside. The instruction cache does not use a fifo; instead, the cache line is available directly as an output. Caches are 32kB in size with a 512-bit line size for instructions and 512+120 bits for data. The controller is designed around a 128-bit wide data bus interface and generates signals for accessing data through a WISHBONE bus. A central idea behind the controller is that it is pluggable into more than one project with minimal changes.
Just about ready for debugging.

_________________
Robert Finch http://www.finitron.ca


Sat Jun 12, 2021 2:52 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
That sounds like a handy encapsulation!


Sat Jun 12, 2021 6:34 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
That sounds like a handy encapsulation!
I have written the memory interface for a number of projects now in an ad-hoc fashion. I tried to make reusable cache components which were incorporated in several projects with minimal changes, but I've always had this big mound of code in the middle of cores for memory interfacing. Having an explicit component will help a lot I think.

Put in a check that a modifier is applied only to the next sequential instruction, according to the ip values. Had a case where modifiers were in the branch shadow and incorrectly applied to the instruction at the branch target address. Whew! The core now works well enough to call a simple subroutine to output to LEDs and return. But it is two steps forwards and one step backwards all the time.
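
The modifier check can be sketched in a few lines of Python (not the core's actual HDL). The names Slot, WORD_BYTES and modifier_applies are made up for illustration, and the 4-byte step is an assumption taken from the address column of the listing in this post.

```python
from dataclasses import dataclass

# Assumption: words are 4 bytes apart, as in the listing's address column.
WORD_BYTES = 4


@dataclass
class Slot:
    ip: int            # address the word was fetched from
    is_modifier: bool  # True when the word is an instruction modifier


def modifier_applies(mod: Slot, insn: Slot) -> bool:
    """Honour a modifier only when insn is the next sequential word.

    A modifier fetched in the branch shadow fails this test because the
    instruction at the branch target does not sit at mod.ip + WORD_BYTES.
    """
    return mod.is_modifier and insn.ip == mod.ip + WORD_BYTES


# Sequential case: the modifier is applied.
sequential_ok = modifier_applies(Slot(0x100, True), Slot(0x104, False))
# Branch-shadow case: the next instruction comes from the branch target.
shadow_ok = modifier_applies(Slot(0x100, True), Slot(0x200, False))
```

Here sequential_ok is True and shadow_ok is False, which is the behaviour the check is meant to enforce.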

This is the simple routine called:
Code:
                           _Delay2s:
FFFFFFFFFFFC0630 04 15 A0 00                      ldi     $a1,#10
                           .0001:
FFFFFFFFFFFC0634 59 00 04 63                       srl         $a2,$a1,#16
FFFFFFFFFFFC0638 1C 56 05 14                 
FFFFFFFFFFFC063C 50 06 DC FF                       stb         $a2,LEDS
FFFFFFFFFFFC0640 70 00 60 61                 
FFFFFFFFFFFC0644 04 55 F5 FF                       sub       $a1,$a1,#1
FFFFFFFFFFFC0648 4F 6C 05 FC                       bne        $a1,$x0,.0001
FFFFFFFFFFFC064C 42 40 00 00                       ret

The neat thing is that instruction modifiers are working. The SRL instruction is actually an extract instruction which makes use of four register fields and hence an instruction modifier is required. SRL might have its own dedicated opcode in the future, but it is not used very often.

_________________
Robert Finch http://www.finitron.ca


Sun Jun 13, 2021 2:45 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added the POPQ instruction. POPQ pops a value off a hardware queue identified by Ra or a constant in the instruction. The only hardware queue currently defined is the instruction trace queue – queue #15. Also added the PEEKQ instruction which looks at the queue without advancing it.
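
As a rough software model of the difference between the two instructions (a sketch only, not the hardware): popq removes the head of the queue, peekq returns it and leaves the queue alone. The class name, the push method, and returning zero from an empty queue are all assumptions made for the example; only queue #15 being the trace queue comes from the description above.

```python
from collections import deque

TRACE_QUEUE = 15  # the only hardware queue currently defined


class HardwareQueue:
    """Toy model of one hardware queue."""

    def __init__(self):
        self._q = deque()

    def push(self, value: int) -> None:
        # Stands in for the hardware filling the queue (hypothetical).
        self._q.append(value)

    def popq(self) -> int:
        # Return the head and advance the queue.
        # Assumption: an empty queue reads as zero.
        return self._q.popleft() if self._q else 0

    def peekq(self) -> int:
        # Return the head without advancing the queue.
        return self._q[0] if self._q else 0


queues = {TRACE_QUEUE: HardwareQueue()}
queues[TRACE_QUEUE].push(0xFFFC0630)
queues[TRACE_QUEUE].push(0xFFFC0634)

first_peek = queues[TRACE_QUEUE].peekq()   # head, queue unchanged
first_pop = queues[TRACE_QUEUE].popq()     # same head, queue advances
second_pop = queues[TRACE_QUEUE].popq()    # next entry
```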

_________________
Robert Finch http://www.finitron.ca


Mon Jun 14, 2021 6:45 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Went through all the I/O addresses which had been long established for the test system and re-assigned them to be on 16kB boundaries. This may seem like a waste of address space since, for example, the keyboard controller only requires two addresses. However, with a 64-bit address space possible there is lots of room. Heck, there is even lots of room for I/O with a 32-bit address space. I had previously defined all the I/O to fit within a 1MB space because that limited the size of an address to 20 bits, and at one point there was a core that supported 20-bit displacements in load/store instructions. Why use 16kB? Memory pages controlled by the MMU are 16kB in size. Assigning each I/O device a 16kB range allows it to be protected using the MMU and key access system. All the I/O still fits within a 4MB block of memory, $FF800000 to $FFBFFFFF.
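
The arithmetic behind the layout can be checked with a short Python sketch. The block bounds and the 16kB page size are from the post; the function names and the idea of numbering the slots from zero are only for illustration, not the actual device assignments.

```python
IO_BASE = 0xFF800000   # start of the 4MB I/O block
IO_LAST = 0xFFBFFFFF   # last address of the block
PAGE_SIZE = 16 * 1024  # MMU page size, one page per device


def slot_count() -> int:
    """Number of 16kB device slots in the I/O block."""
    return (IO_LAST - IO_BASE + 1) // PAGE_SIZE


def slot_base(n: int) -> int:
    """Base address of device slot n (slot numbering is hypothetical)."""
    if not 0 <= n < slot_count():
        raise ValueError("slot outside the I/O block")
    return IO_BASE + n * PAGE_SIZE


def slot_of(addr: int) -> int:
    """Slot (equivalently, MMU page within the block) that addr falls in."""
    if not IO_BASE <= addr <= IO_LAST:
        raise ValueError("address outside the I/O block")
    return (addr - IO_BASE) // PAGE_SIZE


print(slot_count())          # number of devices the block can hold
print(hex(slot_base(1)))     # second slot starts one page above the base
```

With these numbers the block holds 256 page-sized slots, so each device gets its own page for protection purposes.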

Finally built a system that could run in an FPGA. It did not work, at least not to the point of clearing the screen.

_________________
Robert Finch http://www.finitron.ca


Tue Jun 15, 2021 4:47 am
Profile WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
That is not a bug, it is a feature ... Microsoft basic (c) 1971 version .75 will live forever, on your screen.
Is not virtual memory block size about 32 or 64 K now, to match i/o device block size? Ben.


Tue Jun 15, 2021 9:17 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is not virtual memory block size about 32 or 64 K now, to match i/o device block size? Ben.

I think most systems are using 4kB pages with multi-MB pages for larger blocks. The only device in the test system requiring more than 16kB is the text screen memory. I suppose there is also the frame buffer memory but that is multiple megabytes. 64k seems like quite a large page size to me, but with 64-bit systems and terabytes of memory maybe it is not so large.

The ack signal is arriving before the data causing bad data to be latched. However, this only happens for one cycle at about the 10us mark. It turns out to be an issue with simulation.
Made x0 a general-purpose register. It is no longer forced to read as zero. There is no need for x0 to be zero as the constant zero can be specified with a register specification.
Started working on version #3 which uses a fixed 40-bit instruction.

_________________
Robert Finch http://www.finitron.ca


Wed Jun 16, 2021 8:05 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Revamped the instruction set for version #3, this time using 34-bit instructions. Fifteen 34-bit instructions fit nicely into a 512-bit cache line with just two bits unused. The core is fooled into thinking it has 32-bit instructions and increments the IP by four until it reaches 56 mod 64, when it increments by eight to get to the next cache line. It was a lot of fun modifying the assembler to output 34-bit instructions. Fortunately, there was code already in place to support 36-bit instructions, and with a few tweaks and head-scratching it could be modified to suit.
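
A small Python sketch of the IP stepping just described, for anyone following along. The increments (four, then eight at 56 mod 64) are from the post; the bit_offset function assumes instructions are packed from bit 0 of the line upward, which is a guess about the layout and not something stated here.

```python
LINE_BITS = 512
INSN_BITS = 34
INSNS_PER_LINE = LINE_BITS // INSN_BITS                 # 15
UNUSED_BITS = LINE_BITS - INSNS_PER_LINE * INSN_BITS    # 2


def next_ip(ip: int) -> int:
    """Advance the pretend 32-bit IP, skipping the 16th slot of each line."""
    return ip + 8 if ip % 64 == 56 else ip + 4


def bit_offset(ip: int) -> int:
    """Bit position of the instruction inside its 512-bit cache line.

    Assumption: slot 0 starts at bit 0 and slots are packed back to back.
    """
    slot = (ip % 64) // 4
    return slot * INSN_BITS


# Walk one cache line's worth of instructions starting at IP 0.
ips = [0]
while next_ip(ips[-1]) < 64:
    ips.append(next_ip(ips[-1]))
```

The walk visits IPs 0, 4, ..., 56 (fifteen of them) and the next step lands on 64, the start of the following line.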
It is not quite working yet, but close.

_________________
Robert Finch http://www.finitron.ca


Thu Jun 17, 2021 7:02 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
After realizing it would be easier to have bit-pair addressable memory to keep things like data pointers and instructions in sync, version #3 of the core has been switched to use 36-bit instructions. 36-bit instructions are about 6% less code dense than 34-bit ones. However, there are more operations that can be encoded into 36 bits without using instruction modifiers. About 10% more branches may be encoded, which represents about 2% of instructions. Larger constant fields in the 36-bit instructions also mean fewer modifiers, say another 2%. So, there is really bound to be very little difference in code density between 36 and 34 bits. The core has gone from 40 to 34 then back to 36 bits for instructions. Updating the documentation may take some time.
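
The density estimate works out roughly as follows. This is only a back-of-envelope check in Python: it assumes the two 2% figures are additive and that each counts instruction words (modifiers) that no longer need to be emitted, which is one reading of the numbers above.

```python
BITS_34 = 34
BITS_36 = 36

# Raw cost of the wider instruction, before any savings.
raw_penalty = BITS_36 / BITS_34 - 1

# Assumed savings in instruction count with 36-bit encodings:
# about 2% from branches that no longer need a modifier, and
# about 2% from larger constant fields.
saved_fraction = 0.02 + 0.02

# Net size of 36-bit code relative to 34-bit code.
net_penalty = BITS_36 * (1 - saved_fraction) / BITS_34 - 1

print(f"raw penalty: {raw_penalty:.1%}")
print(f"net penalty: {net_penalty:.1%}")
```

Under those assumptions the raw penalty is a little under 6% and the net penalty is between 1% and 2%.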
The idea of allowing three register source ports for some instructions is being considered. There are a few useful instructions that need three ports that must currently use modifiers. FMA, FMS, FNMA, FNMS, indexed address stores, MUX, and a couple of others.

_________________
Robert Finch http://www.finitron.ca


Fri Jun 18, 2021 4:33 am
Profile WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
IBM gave us 8 bits/32 bit words with the IBM 360 computer, because they could not build the IBM 7030 (supercomputer) as a commercial product. As the 7030 was a 64 bit computer, with multi-processing units for floating point, integers and other operations like n bit character processing, they needed to strip it down to make it a marketable product in my view, thus the IBM 360 or similar design. RISC is not RISC today but something easy to pipeline, but just as complex as a CISC design.
Reviewing the IBM 7030 may give some ideas of where computers were going in the early 1960's, before the advent of C and C++ and internet TV, and how they planned a super-computer, with possible time sharing and their view of cached memory. Having the fastest computer design is only the best case; what is the best balance of speed I see as more important with different styles of computer data structures.
Ben.


Fri Jun 18, 2021 7:49 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some of the stuff implemented or planned for older mainframes is pretty impressive.

Busy day of ‘make it work’ category changes. Got burned a couple of times on the fact that x0 is a general-purpose register now. Instead of specifying x0 when zero is desired, #0 needs to be specified. The assembler required some changes to do this. The LDI instruction was coded to use x0; it had to be changed to use #0.
The ABI will spec that x0 is a scratch register and no assumptions should be made about the contents. In particular it does not need to be saved and restored across context switches.
The boot rom was modified to support burst memory access. This trims many cycles (>10) off a cache load.

I am rather liking this v3 ISA. Just converting some of the boot rom and the LOC is shorter all over the place due to the use of small immediate constants. It looks like the increased size of extra four bits per instruction probably will not affect overall code size.
Attachment: Code Compare.png (file comment: Code Compare)

_________________
Robert Finch http://www.finitron.ca


Sun Jun 20, 2021 5:44 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Set up the palette trick in the GFX_FrameBuffer component. The trick is to use the palette as regular memory when it is not needed for bitmap display, including being able to execute instructions from the palette. The palette is a 64x1024 ram (the smallest 64-bit ram). Not all the entries are required for display purposes.

Testing in the FPGA reveals things do not work. The core attempts to load the I$. A burst load of five 128-bit accesses can be seen happening, but then it is stuck in an infinite loop. There does not appear to be a successful hit on the I$ after the cache load. Just trying to think of how to debug this; it works in sim.

_________________
Robert Finch http://www.finitron.ca


Mon Jun 21, 2021 3:51 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
After some discussion on comp.arch, the core has been modified to use branch displacements in terms of whole instructions. This gives 9x the branch range over nybble addressing but means instructions must be contiguously laid out in memory. Branches have approximately 17 bits worth of byte displacement.
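
To put numbers on the 9x figure, here is a small Python sketch. The 36-bit instruction width (nine nybbles) is from earlier in the thread; the 15-bit displacement field used in the example and the choice of measuring the displacement from the branch's own address are assumptions for illustration only, since the actual field width is not given here.

```python
import math

INSN_BITS = 36
NYBBLES_PER_INSN = INSN_BITS // 4   # 9
BYTES_PER_INSN = INSN_BITS / 8      # 4.5


def branch_target(ip_nybbles: int, disp_insns: int) -> int:
    """Target nybble address for a displacement counted in instructions.

    Assumption: the displacement is relative to the branch's own address.
    """
    return ip_nybbles + disp_insns * NYBBLES_PER_INSN


def equivalent_byte_bits(field_bits: int) -> float:
    """Bits of byte displacement an instruction-count field is worth."""
    return field_bits + math.log2(BYTES_PER_INSN)


# Same field, counted in instructions instead of nybbles: 9x the reach.
reach_ratio = NYBBLES_PER_INSN

# Hypothetical 15-bit field, to compare against the "approximately 17 bits".
example_bits = equivalent_byte_bits(15)
```

With a hypothetical 15-bit field the result comes out at a little over 17 bits of byte displacement, in line with the figure quoted above.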

Some recent changes are: x0 now reads as always zero. This change was made so that the constant zero could be used at the Ra register port. The Ra register port must now always be a register, it is not allowed to be a constant. This saves one bit in instruction encodings that is better used for other things. The data type field in branch instructions was re-purposed for branch displacement bits. Instead of a data type there are now separate branching instructions for different data types. This uses more opcode space.

_________________
Robert Finch http://www.finitron.ca


Tue Jun 22, 2021 5:00 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added stack-based subroutine call and return instructions. These instructions are more code dense than using link registers. They also help with porting software that does not use link registers.

Moved the execute module back into the mainline code. It was causing too many issues with pipelining. As a separate module there was an extra cycle required to update the re-order buffer. This leads to code that is a bit uglier and less modular.

_________________
Robert Finch http://www.finitron.ca


Thu Jun 24, 2021 8:41 am
Profile WWW