Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Started breaking out the memory state machine and its components into a separate module. It's about 1/3 to 1/2 of the main source file. The state machine controls access for load and store operations, cache line fills, and TLB and other operations. This should improve code re-usability.
_________________Robert Finch http://www.finitron.ca
|
Fri Jun 11, 2021 3:33 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Got a memory controller written. It incorporates instruction, data, and key caches and also processes load and store requests. Loads and stores use request and response fifos. The caches are hidden from the outside. The instruction cache does not use a fifo; instead the cache line is available directly as an output. Caches are 32kB in size with a 512-bit line size for instructions and 512+120 bits for data. The controller is designed around a 128-bit wide data bus interface and generates signals for accessing data through a WISHBONE bus. A central idea behind the controller is that it is pluggable into more than one project with minimal changes. Just about ready for debugging.
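The sizes quoted above imply a certain cache geometry. The following is only a rough sketch of that arithmetic in Python, not the controller's actual code: it assumes the 32kB figure counts line payload only, and that a line is transferred in whole 128-bit bus beats. The function names are made up for illustration.

```python
# Figures taken from the post; everything derived from them is an assumption.
CACHE_BYTES = 32 * 1024     # 32kB per cache
ILINE_BITS = 512            # instruction cache line
DLINE_BITS = 512 + 120      # data cache line
BUS_BITS = 128              # data bus width

def lines_in_cache(cache_bytes, line_payload_bits):
    """Number of lines, assuming the cache size counts payload bits only."""
    return cache_bytes * 8 // line_payload_bits

def beats_per_line(line_bits, bus_bits=BUS_BITS):
    """Bus transfers needed to move one line, rounded up to whole beats."""
    return -(-line_bits // bus_bits)

print(lines_in_cache(CACHE_BYTES, ILINE_BITS))  # instruction cache lines
print(beats_per_line(ILINE_BITS))               # beats for an instruction line
print(beats_per_line(DLINE_BITS))               # beats for a data line
```

Under those assumptions an instruction line needs four bus beats and a data line needs five.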
_________________Robert Finch http://www.finitron.ca
|
Sat Jun 12, 2021 2:52 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1799
|
That sounds like a handy encapsulation!
|
Sat Jun 12, 2021 6:34 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Quote: That sounds like a handy encapsulation!
I have written the memory interface for a number of projects now in an ad-hoc fashion. I tried to make reusable cache components which were incorporated in several projects with minimal changes, but I've always had this big mound of code in the middle of cores for memory interfacing. Having an explicit component will help a lot I think.
Put a check in that a modifier must be applied to the next sequential instruction according to the ip values. Had a case where modifiers were in the branch shadow and incorrectly applied to the instruction at the branch target address.
Whew! The core now works well enough to call a simple subroutine to output to leds and return. But it is two steps forwards and one step backwards all the time. This is the simple routine called:
Code:
_Delay2s:
FFFFFFFFFFFC0630 04 15 A0 00    ldi $a1,#10
.0001:
FFFFFFFFFFFC0634 59 00 04 63    srl $a2,$a1,#16
FFFFFFFFFFFC0638 1C 56 05 14
FFFFFFFFFFFC063C 50 06 DC FF    stb $a2,LEDS
FFFFFFFFFFFC0640 70 00 60 61
FFFFFFFFFFFC0644 04 55 F5 FF    sub $a1,$a1,#1
FFFFFFFFFFFC0648 4F 6C 05 FC    bne $a1,$x0,.0001
FFFFFFFFFFFC064C 42 40 00 00    ret
The neat thing is that instruction modifiers are working. The SRL instruction is actually an extract instruction which makes use of four register fields and hence an instruction modifier is required. SRL might have its own dedicated opcode in the future, but it is not used very often.
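The modifier check described in this post can be sketched in a few lines of Python. This is a guess at the rule, not the core's actual logic: it assumes a 4-byte instruction word (the listing advances the address by 4 per word), and `modifier_applies` and the example addresses are hypothetical.

```python
INSTR_BYTES = 4  # assumption: one instruction word, as in the listing's addresses

def modifier_applies(modifier_ip, instr_ip, instr_bytes=INSTR_BYTES):
    """A modifier is honoured only by the next sequential instruction word."""
    return instr_ip == modifier_ip + instr_bytes

# Hypothetical addresses: a modifier at 0x1000 sits in a branch shadow.
print(modifier_applies(0x1000, 0x1004))  # sequential successor: applies
print(modifier_applies(0x1000, 0x2000))  # branch target: must not apply
```

Comparing ip values this way rejects the branch-target case, because the target is not the word immediately following the modifier.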
_________________Robert Finch http://www.finitron.ca
|
Sun Jun 13, 2021 2:45 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Added the POPQ instruction. POPQ pops a value off a hardware queue identified by Ra or a constant in the instruction. The only hardware queue currently defined is the instruction trace queue – queue #15. Also added the PEEKQ instruction which looks at the queue without advancing it.
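The difference between POPQ and PEEKQ can be modelled as follows. This Python sketch is only a behavioural illustration: the number of queues (16, chosen so that queue #15 exists), the `push` helper, and returning `None` for an empty queue are all assumptions not stated in the post.

```python
from collections import deque

class HardwareQueues:
    """Toy model of numbered hardware queues (count is an assumption)."""

    def __init__(self, count=16):
        self.queues = [deque() for _ in range(count)]

    def push(self, qnum, value):
        # Hypothetical helper standing in for whatever hardware fills the queue.
        self.queues[qnum].append(value)

    def peekq(self, qnum):
        """Look at the head of the queue without advancing it."""
        q = self.queues[qnum]
        return q[0] if q else None  # empty-queue result is an assumption

    def popq(self, qnum):
        """Return the head of the queue and advance it."""
        q = self.queues[qnum]
        return q.popleft() if q else None

TRACE_QUEUE = 15  # the instruction trace queue named in the post
hq = HardwareQueues()
hq.push(TRACE_QUEUE, 0xFC0630)
hq.push(TRACE_QUEUE, 0xFC0634)
print(hex(hq.peekq(TRACE_QUEUE)))  # head is still in place afterwards
print(hex(hq.popq(TRACE_QUEUE)))   # same value, queue now advances
print(hex(hq.popq(TRACE_QUEUE)))   # next entry
```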
_________________Robert Finch http://www.finitron.ca
|
Mon Jun 14, 2021 6:45 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Went through all the I/O addresses which had been long established for the test system and re-assigned them to be on 16kB boundaries. This may seem like a waste of address space as, for example, the keyboard controller only requires two addresses. However, with a 64-bit address space possible there is lots of room. Heck, there is even lots of room for I/O with a 32-bit address space. I had previously defined all the I/O to fit within a 1MB space because that limited the size of an address to 20 bits. And at one point there was a core that supported 20-bit displacements in load/store instructions. Why use 16kB? Memory pages controlled by the MMU are 16kB in size. Assigning each I/O device a 16kB range allows it to be protected using the MMU and key access system. All the I/O still fits within a 4MB block of memory, $FF800000 to $FFBFFFFF.
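As a quick check of the numbers, here is a small Python sketch of the 16kB-per-device layout. The base, end, and page size come from the post; the idea of numbering devices by "slot" and the function names are assumptions for illustration only.

```python
IO_BASE = 0xFF800000   # start of the I/O block (from the post)
IO_END = 0xFFBFFFFF    # last address of the I/O block (from the post)
PAGE_BYTES = 16 * 1024 # MMU page size, one page per device

def slot_count():
    """How many 16kB device ranges fit in the I/O block."""
    return (IO_END - IO_BASE + 1) // PAGE_BYTES

def device_base(slot):
    """Base address of a device slot (slot numbering is hypothetical)."""
    if not 0 <= slot < slot_count():
        raise ValueError("slot outside the I/O block")
    return IO_BASE + slot * PAGE_BYTES

def same_page(addr_a, addr_b):
    """True if both addresses fall in the same 16kB MMU page."""
    return addr_a // PAGE_BYTES == addr_b // PAGE_BYTES

print(slot_count())
print(hex(device_base(1)))
```

Because every device starts on a page boundary, no two devices share a page, which is what lets the MMU protect them individually.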
Finally built a system that could run in an FPGA. It did not work, at least not to the point of clearing the screen.
_________________Robert Finch http://www.finitron.ca
|
Tue Jun 15, 2021 4:47 am |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 649
|
That is not a bug, it is a feature ... Microsoft basic (c) 1971 version .75 will live forever, on your screen. Is not virtual memory block size about 32 or 64 K now, to match i/o device block size? Ben.
|
Tue Jun 15, 2021 9:17 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Quote: Is not virtual memory block size about 32 or 64 K now, to match i/o device block size? Ben.
I think most systems are using 4kB pages with multi-MB pages for larger blocks. The only device in the test system requiring more than 16kB is the text screen memory. I suppose there is also the frame buffer memory but that is multiple megabytes. 64k seems like quite a large page size to me, but with 64-bit systems and terabytes of memory maybe it is not so large.
The ack signal is arriving before the data, causing bad data to be latched. However, this only happens for one cycle at about the 10us mark. It turns out to be an issue with simulation.
Made x0 a general-purpose register. It is no longer forced to read as zero. There is no need for x0 to be zero as the constant zero can be specified in place of a register.
Started working on version #3 which uses a fixed 40-bit instruction.
_________________Robert Finch http://www.finitron.ca
|
Wed Jun 16, 2021 8:05 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Revamped the instruction set for version #3, this time using 34-bit instructions. Fifteen 34-bit instructions fit nicely into a 512-bit cache line with just two bits unused. The core is fooled into thinking it is using 32-bit instructions and increments the IP by four, that is, until it reaches 56 mod 64, when it increments by eight to get to the next cache line. It was a lot of fun modifying the assembler to output 34-bit instructions. Fortunately, there was code already in place to support 36-bit instructions, and with a few tweaks and head-scratching it could be modified to suit. It is not quite working yet, but close.
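The IP stepping rule above is easy to check with a short Python sketch. The `next_ip` rule follows the post directly; `slot_bit_offset` assumes instructions are packed back to back from bit 0 of the line, which the post does not state.

```python
LINE_BYTES = 64   # 512-bit cache line
INSTR_BITS = 34

def next_ip(ip):
    """Step by 4, except from offset 56 in a line, where the step is 8."""
    return ip + 8 if ip % LINE_BYTES == 56 else ip + 4

def slot_bit_offset(ip):
    """Bit position of the instruction in its line (packing order assumed)."""
    return (ip % LINE_BYTES) // 4 * INSTR_BITS

# Walk one cache line starting at offset 0 and list the IP values used.
ips = []
ip = 0
while ip < LINE_BYTES:
    ips.append(ip)
    ip = next_ip(ip)

print(len(ips))                               # instructions per line
print(slot_bit_offset(ips[-1]) + INSTR_BITS)  # bits used out of 512
```

The walk visits fifteen IP values per line, and the last instruction ends at bit 510, leaving the two unused bits mentioned in the post.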
_________________Robert Finch http://www.finitron.ca
|
Thu Jun 17, 2021 7:02 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
After realizing it would be easier to have bit-pair addressable memory to keep things like data pointers and instructions in sync, version #3 of the core has been switched to use 36-bit instructions. 36-bit instructions are about 6% less code dense than 34-bit ones. However, there are more operations that can be encoded into 36 bits without using instruction modifiers. About 10% more branches may be encoded without a modifier, which represents 2% of instructions. Larger constant fields in the 36-bit instructions also mean fewer modifiers, say another 2%. So, there is bound to be very little difference in code density between 36 and 34 bits. The core has gone from 40 to 34 then back to 36 bits for instructions. Updating the documentation may take some time.
The idea of allowing three register source ports for some instructions is being considered. There are a few useful instructions that need three ports and must currently use modifiers: FMA, FMS, FNMA, FNMS, indexed address stores, MUX, and a couple of others.
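The density estimate can be put into numbers with a very rough Python sketch. The model here is an assumption, not something from the post: it treats each avoided modifier as one extra instruction word that the 34-bit encoding would have needed, and uses the post's two 2% figures as the fraction of instructions affected.

```python
def raw_size_ratio(bits_new=36, bits_old=34):
    """Size ratio from instruction width alone."""
    return bits_new / bits_old

def relative_code_size(bits_new=36, bits_old=34, modifier_savings=0.04):
    """Size ratio once the old encoding's extra modifier words are counted.

    modifier_savings is the assumed fraction of instructions that need a
    modifier word at 34 bits but not at 36 bits (2% branches + 2% constants).
    """
    return bits_new / (bits_old * (1 + modifier_savings))

print(round(raw_size_ratio(), 4))      # width alone: roughly 6% larger
print(round(relative_code_size(), 4))  # with modifier savings: under 2% larger
```

With these assumed figures the gap shrinks from about 6% to under 2%, consistent with the conclusion that the difference is small.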
_________________Robert Finch http://www.finitron.ca
|
Fri Jun 18, 2021 4:33 am |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 649
|
IBM gave us 8 bits/32 bit words with the IBM 360 computer, because they could not build the IBM 7030 (supercomputer) as a commercial product. As the 7030 was a 64 bit computer, with multi-processing units for floating point, integers and other operations like n bit character processing, they needed to strip it down to make it a marketable product in my view, thus the IBM 360 or similar design. RISC is not RISC today but something easy to pipeline, but just as complex as a CISC design. Reviewing the IBM 7030 may give some ideas of where computers were going in the early 1960's, before the advent of C and C++ and internet TV, and how they planned a super-computer, with possible time sharing and their view of cached memory. Having the fastest computer design is only the best case; what I see as more important is the best balance of speed with different styles of computer data structures. Ben.
|
Fri Jun 18, 2021 7:49 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Some of the stuff implemented or planned for older mainframes is pretty impressive.
Busy day of ‘makes it work’ category of changes. Got burned a couple of times on the fact that x0 is a general-purpose register now. Instead of specifying x0 when zero is desired, #0 needs to be specified. The assembler required some changes to do this. The LDI instruction was coded to use x0; it had to be changed to #0. The ABI will spec that x0 is a scratch register and no assumptions should be made about the contents. In particular it does not need to be saved and restored across context switches.
The boot rom was modified to support burst memory access. This trims many cycles (>10) off a cache load.
I am rather liking this v3 ISA. Just converting some of the boot rom, and the LOC is shorter all over the place due to the use of small immediate constants. It looks like the extra four bits per instruction probably will not affect overall code size. Attachment: Code Compare.png
_________________Robert Finch http://www.finitron.ca
|
Sun Jun 20, 2021 5:44 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Set up the palette trick in the GFX_FrameBuffer component. The trick is to use the palette as regular memory when not needed for bitmap display, including being able to execute instructions from the palette. The palette is a 64x1024 ram (the smallest 64-bit ram). Not all the entries are required for display purposes.
Testing in the FPGA reveals things do not work. The core attempts to load the I$. A burst load of five 128-bit accesses can be seen happening. But it is stuck in an infinite loop. There does not appear to be a successful hit on the I$ after the cache load. Just trying to think of how to debug this; it works in sim.
_________________Robert Finch http://www.finitron.ca
|
Mon Jun 21, 2021 3:51 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
After some discussion on comp.arch, the core has been modified to use branch displacements in terms of whole instructions. This gives 9x the branch range over nybble addressing but means instructions must be contiguously laid out in memory. Branches have approximately 17 bits worth of byte displacement.
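The 9x figure follows from a 36-bit instruction being nine nybbles long. The Python sketch below illustrates the arithmetic only. The post does not give the displacement field width, so the 15-bit value used in the example is purely hypothetical, as are the function names.

```python
import math

INSTR_BITS = 36
NYBBLES_PER_INSTR = INSTR_BITS // 4   # 9 nybbles per instruction

def branch_target_nybble(ip_nybble, disp_instrs):
    """Target nybble address when the displacement counts whole instructions."""
    return ip_nybble + disp_instrs * NYBBLES_PER_INSTR

def byte_equivalent_bits(field_bits):
    """Displacement field width expressed as bits of byte displacement."""
    return field_bits + math.log2(INSTR_BITS / 8)

print(branch_target_nybble(0, 1))           # one instruction forward
print(round(byte_equivalent_bits(15), 2))   # hypothetical 15-bit field
```

Each instruction-sized step covers 4.5 bytes, so any field gains a little over two bits of byte reach compared with its raw width.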
Some recent changes are: x0 now always reads as zero. This change was made so that the constant zero could be used at the Ra register port. The Ra register port must now always be a register; it is not allowed to be a constant. This saves one bit in instruction encodings that is better used for other things. The data type field in branch instructions was re-purposed for branch displacement bits. Instead of a data type field there are now separate branch instructions for different data types. This uses more opcode space.
_________________Robert Finch http://www.finitron.ca
|
Tue Jun 22, 2021 5:00 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2187 Location: Canada
|
Added stack-based subroutine call and return instructions. These instructions are more code dense than using link registers. They also help with porting software that does not use link registers.
Moved the execute module back into the mainline code. It was causing too many issues with pipelining. As a separate module there was an extra cycle required to update the re-order buffer. Moving it back leads to code that is a bit uglier and less modular.
_________________Robert Finch http://www.finitron.ca
|
Thu Jun 24, 2021 8:41 am |
|