 [ 54 posts ]  Go to page Previous  1, 2, 3, 4  Next
 Astorisc : A pipelined Risc-V from scratch ? 

Joined: Mon Oct 07, 2019 2:41 am
Posts: 619
alrj wrote:
oldben wrote:
DMA and IRQ's make the software more complex

Yes, but they also make the hardware much simpler, and my background is definitely more software than hardware ;)

I know it will always be a tradeoff. Hardware is more expensive but faster, software is slower but cheaper and potentially more flexible.
I'm glad I've got the experience of my breadboard 8088 computer, because I know I can use a CompactFlash and a video framebuffer (character based, though) without needing IRQs or DMA, and even with a slow clock and an 8-bit bus, the system is really usable.

In my mind, having IRQs for things like keyboard, mouse and serial connection means that I don't have to build dedicated buffers and logic in hardware, I can just react to the event and push the data into main memory. A 16 bytes buffer may not be a big deal inside an FPGA, but it takes space when physically built on a PCB.

I am using standard SD cards for storage and 1200-baud serial I/O on my FPGA setup. A 127-byte FIFO in the serial hardware gives me about a 1-second buffer (roughly 120 characters per second at 1200 baud, assuming 10 bits per character), ample time to write a disk sector while reading from the serial port. I emulate 2 removable drives (RK05's) @ 1.8 MB for a mid-1970's time frame. The 4 82S100's are emulated with 22V10's, and the 512x8 ROMs are replaced with fast EPROMs. 2901A's are used rather than the 2901's from 1976.

While the DE1 is outdated, I can pick up used ones for a song. A 2901 bit-slice version of the hardware may just be a paper design for a while. With this design I only need two PCBs, one for the front panel and one for I/O, plus a case.
The design goal here is an IBM AT-like machine in a late-1970's time frame, with a very simple 32-bit CPU, dual (1.4 MB) floppies, a serial printer and modem, and 256Kb of RAM. As for the price in 1976 with 96Kb of 4K DRAM, don't even ask.
Quote:

Anyways, since the IO is memory-mapped, I can always rework these parts at a later time. The CPU itself will be multiple boards all connected to a backplane, but I think I'll also have slots for peripherals directly connected to the bus, more or less like ISA/VESA. Then, nothing prevents me from using whatever IO interface I want to drive each of them.

oldben wrote:
I use the 74LS6502 :)

But, but, ... I'm already building a CPU! :D

Recursive Hardware: :)

Right now I am playing with language development for my computer, more of an ALGOL-based system than C.
I don't think I can get a modern C compiler to self-compile in 64K of memory.
Developing software and hardware together works for me, but it is rather slow because the FPGA build tends to break with new hardware or internal ROM changes. Any software I use is written in a C subset (no structures or ++/--) to make porting the code easy.


Fri Jun 24, 2022 5:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2104
Location: Canada
IRQ hardware could be designed in the schematics and then left out of the build, so that pin headers could be present to add the IRQs to real hardware at a later date.

Ben is right, many early systems get by without supporting IRQs directly.

Perhaps a memory-mapped I/O port that records events occurring in the system could be polled, built around something like a 74LS259 addressable latch.

For many of my systems all significant processing for IRQs takes place in a time slice(s) triggered by the timer ISR. The IRQs typically only store a value in a buffer and/or set a flag and return. If hardware stored data in buffers automatically and only set a flag, then hardware may be optimized, only a timer ISR would be needed.
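A minimal sketch of that split, written in Python purely as a behavioural model. The class name, the UART source, and the buffer layout are assumptions made for illustration and are not taken from any of the systems mentioned: the device ISR only buffers a byte and sets a flag, and all real processing happens in the timer slice.

```python
from collections import deque


class DeferredIrqModel:
    """Toy model: device ISRs only record data; the timer slice does the work."""

    def __init__(self):
        self.rx_buffer = deque()   # bytes stored by the (hypothetical) UART ISR
        self.rx_flag = False       # "something arrived" flag
        self.processed = []        # what the timer slice has consumed so far

    def uart_isr(self, byte):
        # Keep the ISR as short as possible: store the value, set a flag, return.
        self.rx_buffer.append(byte)
        self.rx_flag = True

    def timer_slice(self):
        # Runs from the timer ISR: drain the buffer only if the flag is set.
        if self.rx_flag:
            self.rx_flag = False
            while self.rx_buffer:
                self.processed.append(self.rx_buffer.popleft())


model = DeferredIrqModel()
model.uart_isr(0x41)
model.uart_isr(0x42)
model.timer_slice()
print(model.processed)
```

If the hardware filled `rx_buffer` and set `rx_flag` on its own, `uart_isr` would disappear and only the timer ISR would remain, which is the optimization suggested above.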

_________________
Robert Finch http://www.finitron.ca


Mon Jun 27, 2022 3:45 pm WWW

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
Please bear with me, I know the "old" PC architecture quite well, but I'm quite lacking in everything else.
In the PC architecture, the timer is nothing more than an external interrupt source. So in my mind, whatever I do, if I want a timer (and I really do want one), I need a way to react to an IRQ.

I haven't really thought about how I would handle external IRQs, so what follows really is nothing more than a draft.

My intention was to have a single INTerrupt line to the CPU, and if it's active, to transfer control to the ISR. The ISR code could then do whatever is needed. It could be reading the output of a priority encoder located in memory-mapped IO used by external devices, or perhaps stay closer to the risc-v spec and have a 'mcause' status register that contains a well defined interrupt number.

Using a bit-field located in mem-IO as you suggest is more or less what I want to do. I just don't know the details yet. But as you say, it doesn't matter, I can leave the implementation for later.

Also, that INT line would not be activated only by the timer, but rather by a big logical OR of all the possible sources: timer has ticked, a key has been pressed, mouse has moved, UART has received a character, screen is blanking (why not?), ...
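As a rough behavioural sketch of this draft (not the actual Astorisc design), the pending sources can be modelled as a bit-field, the INT line as the OR of all enabled bits, and the priority encoder as "lowest-numbered set bit wins". The bit assignments, the enable mask, and the priority order are all assumptions chosen for illustration.

```python
# Hypothetical bit positions in a memory-mapped "pending" register.
IRQ_TIMER, IRQ_KEYBOARD, IRQ_MOUSE, IRQ_UART, IRQ_VBLANK = range(5)


def int_line(pending, enable_mask=0xFF):
    """Single INT line: logical OR of every enabled pending source."""
    return (pending & enable_mask) != 0


def highest_priority(pending):
    """Priority encoder: index of the lowest set bit, or None if idle."""
    if pending == 0:
        return None
    lowest_set = pending & -pending       # isolate the lowest set bit
    return lowest_set.bit_length() - 1    # convert one-hot value to index


pending = (1 << IRQ_UART) | (1 << IRQ_KEYBOARD)
print(int_line(pending), highest_priority(pending))
```

The ISR would read the encoder output (or an 'mcause'-style register holding the same number) and dispatch on it.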

I hope it makes sense. But if you think I'm going straight into the wall with that, don't be afraid to tell me this is stupid! :D

_________________
https://www.alrj.org/pages/Astorisc.html


Mon Jun 27, 2022 5:33 pm WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1789
That sounds fine to me - it feels like you need at least one mechanism for an external async signal to grab the attention of the CPU, and if you have one mechanism that's enough.

To handle the interrupt, you need some way to be sure you'll be able to continue after (or restart) the instruction which got interrupted, which means preserving enough state, and perhaps abandoning some actions, so that you can pick up where you left off.

I suspect that this is a little more tricky in the case of a pipelined CPU.

But it might be that it doesn't really matter how efficient it is, at least initially, so long as it works. (You might write a lot of state to the stack, as the 68k does, and reload it when the ISR is done. Or maybe you could abandon the current instruction and then restart it. Except, in a pipelined machine there's more than one instruction in flight...)


Mon Jun 27, 2022 6:18 pm

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
BigEd wrote:
To handle the interrupt, you need some way to be sure you'll be able to continue after (or restart) the instruction which got interrupted, which means preserving enough state, and perhaps abandoning some actions, so that you can pick up where you left off.

I suspect that this is a little more tricky in the case of a pipelined CPU.

It is more tricky indeed, but thankfully the literature on the subject is abundant. Risc-V is really close to MIPS, and both architectures are used in a lot of CS courses that can be found online. The key principle seems to be called "precise interrupts". In a nutshell, when an instruction generates an exception in an early stage of the pipeline, it still goes further down the pipeline until it reaches the last point before anything could be committed (i.e. the MEM stage), where it is handled. An IRQ doesn't invalidate the current instruction, and is thus treated more or less like a jump/branch: the next two instructions in the pipeline are killed. In the case of an exception, the current instruction must also be invalidated, so three instructions are killed.
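The kill rule can be sketched as a small Python function. This is only a model under the assumption stated above, namely that exactly two younger instructions sit behind the instruction at the handling point; the function name and the list representation are made up for the example.

```python
def squash(in_flight, trap_index, is_exception):
    """Split in-flight instructions (oldest first) into (completed, killed).

    trap_index is the instruction sitting at the handling point.
    An IRQ lets that instruction complete; an exception kills it too.
    """
    first_killed = trap_index if is_exception else trap_index + 1
    return in_flight[:first_killed], in_flight[first_killed:]


pipeline = ["add", "lw", "sub", "xor"]   # "lw" is at the handling point

print(squash(pipeline, 1, is_exception=False))  # IRQ: two instructions killed
print(squash(pipeline, 1, is_exception=True))   # exception: three killed
```

In both cases everything older than the handling point completes normally, which is what makes the interrupt "precise".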

In the Risc-V architecture, interrupt/exception handling uses a few dedicated control and status registers and leaves to the ISR (trap handler, in their parlance) the task of setting up its stack pointer and pushing things onto the stack. This simplifies the hardware an awful lot, because it removes all bus operations from the hardware logic: every value can simply be clocked into its corresponding status register in parallel, in a single cycle. The CSR 'mepc' will receive the "return" address (typically PC or PC+4). 'mtval' gets the helper value for the exception, if any (load/store address in case of a misaligned access or page fault, or the instruction word if the instruction is not valid). The ISR can then use 'mscratch' to temporarily save one of the general purpose registers and set up its local stack.
The nice thing is, everything still works even if the interrupted program has no stack set up!
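A hedged sketch of what trap entry does to the CSRs, as a Python model rather than hardware. It assumes RV32, machine mode only, and 'mtvec' in direct mode; the `take_trap` function and the dictionary representation are illustrative, and fields such as 'mstatus.MPP' are left out.

```python
MSTATUS_MIE = 1 << 3        # machine interrupt enable
MSTATUS_MPIE = 1 << 7       # previous value of MIE
MCAUSE_INTERRUPT = 1 << 31  # top bit of mcause on RV32 marks an interrupt


def take_trap(csr, resume_pc, cause, is_interrupt, tval=0):
    """Return (new_csr, next_pc); all CSR updates happen 'in parallel'."""
    new = dict(csr)
    new["mepc"] = resume_pc
    new["mcause"] = cause | (MCAUSE_INTERRUPT if is_interrupt else 0)
    new["mtval"] = tval
    mstatus = csr["mstatus"]
    # Save MIE into MPIE, then disable interrupts while in the handler.
    mstatus &= ~MSTATUS_MPIE
    if mstatus & MSTATUS_MIE:
        mstatus |= MSTATUS_MPIE
    mstatus &= ~MSTATUS_MIE
    new["mstatus"] = mstatus
    next_pc = csr["mtvec"] & ~0x3   # direct mode assumed: jump to BASE
    return new, next_pc


csr = {"mstatus": MSTATUS_MIE, "mtvec": 0x100, "mepc": 0, "mcause": 0, "mtval": 0}
new_csr, pc = take_trap(csr, resume_pc=0x2004, cause=7, is_interrupt=True)
print(hex(pc), hex(new_csr["mcause"]))
```

No memory access appears anywhere in `take_trap`, which mirrors the point above: the hardware never touches the stack, so the interrupted program does not need one.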


BigEd wrote:
But it might be that it doesn't really matter how efficient it is, at least initially, so long as it works.

That's right, I don't care too much if the interrupt handling is not super optimized, my experience with the original IBM PC architecture tells me this will be just fine :)
This is the beauty I've found in the Risc-V ISA: it is designed to be implementable with extremely simple hardware, then better hardware can make it more efficient. That, and the ISA being free, are what hooked me ;)

_________________
https://www.alrj.org/pages/Astorisc.html


Wed Jun 29, 2022 9:41 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1789
Nice info and insights - thanks!


Wed Jun 29, 2022 4:36 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 619
alrj wrote:
BigEd wrote:
But it might be that it doesn't really matter how efficient it is, at least initially, so long as it works.

That's right, I don't care too much if the interrupt handling is not super optimized, my experience with the original IBM PC architecture tells me this will be just fine :)
This is the beauty I've found in the Risc-V ISA: it is designed to be implementable with extremely simple hardware, then better hardware can make it more efficient. That, and the ISA being free, are what hooked me ;)

All my CISC designs are free too. :) I'm just tweaking the hardware and the software logic for code generation at the moment.

Is the 8080 restart instruction and IRQ service Intel I.P.? On instruction fetch, the IRQ jams a CALL or RST N instruction onto the bus in place of the fetched opcode. DI/EI needs a bit of care, but that is all. Having no NMI or SWIs kept it simple.
Mind you, they only expected one or two devices to use IRQs, like a timer or a keyboard press.
Ben.
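
For reference, a small Python sketch of the RST N encoding that the interrupting device jams onto the bus. The opcode pattern 11NNN111 and the target address 8*N are the standard 8080 definitions; the helper names are made up for the example.

```python
def rst_opcode(n):
    """Opcode byte for RST n (n = 0..7): binary 11NNN111."""
    if not 0 <= n <= 7:
        raise ValueError("RST number must be 0..7")
    return 0xC7 | (n << 3)


def rst_target(opcode):
    """Address called by an RST opcode: 8 * n, taken from bits 5..3."""
    if (opcode & 0xC7) != 0xC7:
        raise ValueError("not an RST opcode")
    return opcode & 0x38


for n in range(8):
    op = rst_opcode(n)
    print(f"RST {n}: opcode {op:#04x}, vector {rst_target(op):#06x}")
```

Because the vector is encoded in the opcode itself, the CPU needs no vector table or extra bus cycle: it simply executes the jammed byte as a one-byte CALL.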


Wed Jun 29, 2022 5:02 pm

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
Glad to see the forums are back!

Haven't posted in a long time. Work has been slow (too many hobbies and projects) but not completely halted :)

I was working on the external memory bus, and thought it was the perfect opportunity to move away from the temporary Harvard architecture and map the ROM at the end of the address space.
But then I realized that having the instruction's address coming from somewhere other than the PC register, as I had done for jumps/branches, wasn't such a great idea at all! Because now that my ROM isn't at address 0 anymore, I need a real Reset vector to be clocked into PC, otherwise I feel like I'm gonna run into a lot of timing issues!

So, in addition to the Memory access part, I got to rework the Instruction Fetch stage, and it has impacts on the pipeline in further stages as well, mostly in how jumps and branches are handled.

In the end, I hope it will make things a bit cleaner, not only for jump/branch and reset but also when I start working on the interrupt handler.

_________________
https://www.alrj.org/pages/Astorisc.html


Thu Sep 15, 2022 9:32 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1789
Thanks for the update!


Thu Sep 15, 2022 9:35 am

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
Not exactly part of the CPU itself, but I got the memory mapper and the "byte shuffler" done.
This byte shuffler is the part that re-arranges the bytes from/to the external 32-bit data bus when a halfword (16-bit) or byte (8-bit) access is made at an address that is not 32-bit aligned. It also performs the sign-extension if needed.
All memory accesses must be aligned on their width. No exception is raised if that is not the case; an undefined value is simply stored or loaded.
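The behaviour of such a byte shuffler can be sketched in Python as follows. This is a model under assumptions, not the actual circuit: little-endian byte lanes (as in Risc-V), a byte-enable mask on stores, and a `ValueError` on misaligned accesses, whereas the hardware described here simply produces an undefined value.

```python
def load_shuffle(bus_word, addr, size, signed):
    """Extract a byte (size=1), halfword (2) or word (4) from a 32-bit bus word."""
    offset = addr & 3
    if offset % size != 0:
        raise ValueError("misaligned access (undefined in the real hardware)")
    bits = 8 * size
    raw = (bus_word >> (8 * offset)) & ((1 << bits) - 1)
    if signed and raw & (1 << (bits - 1)):
        raw -= 1 << bits                 # sign-extend
    return raw & 0xFFFFFFFF              # result as a 32-bit register value


def store_shuffle(value, addr, size):
    """Return (bus_word, byte_enables) placing the value on the right lanes."""
    offset = addr & 3
    if offset % size != 0:
        raise ValueError("misaligned access (undefined in the real hardware)")
    bits = 8 * size
    bus_word = (value & ((1 << bits) - 1)) << (8 * offset)
    byte_enables = ((1 << size) - 1) << offset
    return bus_word, byte_enables


print(hex(load_shuffle(0x80FF1234, addr=2, size=1, signed=True)))
print(store_shuffle(0xAB, addr=1, size=1))
```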

With that in place, I could rework the Fetch Stage so that it includes a reset state that loads the PC register with the reset vector.
Of course, this called for a reset circuit that holds the reset line active for a predetermined number of clock cycles before releasing it to clear the pipeline.

The Fetch unit is also ready to load the interrupt register (the 'mtvec' CSR), but I don't have interrupts yet :lol:

Digital has a very nice high-level component: a framebuffer display! Map it on your address bus, write to it, and it will open a window displaying the content of the framebuffer.
Being so easy to use, I couldn't resist the urge to play a bit with it and quickly wrote a short "xor-pattern" generator program.
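For readers unfamiliar with the xor pattern, here is a minimal Python sketch of the idea; the original program ran on the simulated CPU and is not shown in the thread. The framebuffer size and the 8-bit pixel format are assumptions.

```python
def xor_pattern(width, height):
    """Framebuffer rows where each pixel value is x XOR y, clipped to 8 bits."""
    return [[(x ^ y) & 0xFF for x in range(width)] for y in range(height)]


for row in xor_pattern(8, 4):
    print(" ".join(f"{pixel:2d}" for pixel in row))
```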

Apologies for the huge screen capture in attachment, but I'm super happy with the result so far!



_________________
https://www.alrj.org/pages/Astorisc.html


Tue Sep 20, 2022 7:27 am WWW

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
Just spent way more time than I would like to admit on a bug where spurious garbage instructions were inserted when the Instruction Fetch stage was supposed to be stalled and send a bubble to the Decode stage.
Turns out it was entirely my fault, I was doing something that I knew was very wrong.

I was gating the clock signal with the "enable" signal in my pipeline registers! :evil:

There, I've said it. I have absolutely no idea why I did it that way.

With that being fixed, I can continue the work on the missing parts and features.

_________________
https://www.alrj.org/pages/Astorisc.html


Tue Sep 27, 2022 3:11 pm WWW

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
Some more small progress has been made on the design.

I added a "READY" input that slower peripherals can pull low to insert wait states and make the processor wait. This input is sampled at the falling edge of the clock, so in the middle of the cycle. This gives the device about 15ns to react, hopefully this will be enough. Crossing my fingers...

I immediately put that feature to good use and redrew the memory interface to the EEPROM/Flash circuit to work with the kind of actual part that I'll be using, most likely something like a 128k x 8-bit 55ns flash ROM. I don't want to have to flash four chips every time I make a change to the Flash ROM ;-)
For now, I'm using extremely conservative timings with 4 cycles for each of the 4 bytes to read.
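The cost of that scheme can be sketched in Python. Several things here are assumptions: the 30 ns clock period is only inferred from the "about 15ns" half cycle mentioned above, the little-endian byte order follows Risc-V convention, and the timing estimate ignores address setup and buffer delays.

```python
def min_cycles(access_ns, period_ns):
    """Smallest whole number of clock cycles covering the access time."""
    return -(-access_ns // period_ns)    # ceiling division on integers


def read_word_from_byte_flash(flash, addr, cycles_per_byte=4):
    """Assemble a 32-bit little-endian word from four byte-wide reads."""
    word = 0
    cycles = 0
    for i in range(4):
        word |= flash[addr + i] << (8 * i)
        cycles += cycles_per_byte        # READY held low during each byte read
    return word, cycles


flash = bytes([0x13, 0x05, 0x00, 0x00])
word, cycles = read_word_from_byte_flash(flash, 0)
print(hex(word), cycles, min_cycles(55, 30))
```

With these assumed numbers, a 55 ns part would need at least 2 cycles per byte, so 4 cycles per byte (16 per instruction word) leaves a comfortable margin.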

I don't think I will use shadow memory. If I need some code from Flash ROM to run faster, I can always copy it to RAM (main memory will be 8ns static RAM, accessed in one cycle).

_________________
https://www.alrj.org/pages/Astorisc.html


Thu Oct 13, 2022 11:54 am WWW

Joined: Thu Feb 25, 2021 8:27 am
Posts: 32
Location: Belgium
I know it's been a long time, but the project is not dead, only slowed down :)

I've been tracking some bugs with the memory interface that I had introduced with the multi-cycle memory access.
To keep things simple, all the stages of the pipeline are stalled when a multi-cycle access is performed, even if this access is caused by the Fetch unit at the beginning of the pipeline. I quickly realized that doing otherwise was not going to work unless I'd throw an awful lot of hardware at it!

I also did some tidying up in the schematic pipeline control logic, but that's purely cosmetic.

I have now finally started to work on the CSRs and the interrupts. I think I finally understand more or less how all of this will integrate with the pipeline logic. It may also end up being slightly bigger than I expected, but apparently that has been the case for every part of this project :D

Now that the system was apparently running correctly, I played a bit more with it and the result is the Mandelbrot set as rendered in the Digital framebuffer component.
Attachment:
Astorisc-mandelbrot.png


There's no hardware multiplier (yet?), so the code is using a software multiplication routine. Resolution is 160x100 pixels, with a maximum of 14 iterations. The code runs in less than 12 million clock cycles, and the simulation takes less than 5 minutes on my laptop.
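
The actual routine is not shown in the thread, so the following Python sketch only illustrates the general technique: a shift-and-add multiply and a fixed-point escape-time loop. The 12 fractional bits and all function names are assumptions. For scale, 12 million cycles over 160x100 = 16,000 pixels works out to at most 750 cycles per pixel on average.

```python
FRAC_BITS = 12   # assumed fixed-point format: value = integer / 4096


def mul_shift_add(a, b):
    """Signed multiply using only shifts and adds, one multiplier bit per step."""
    negative = (a < 0) != (b < 0)
    a, b = abs(a), abs(b)
    product = 0
    while b:
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
    return -product if negative else product


def fx_mul(a, b):
    """Fixed-point multiply: full product, then drop the extra fraction bits."""
    return mul_shift_add(a, b) >> FRAC_BITS


def mandel_iterations(cx, cy, max_iter=14):
    """Escape-time count for the point (cx, cy), both in fixed point."""
    zx = zy = 0
    for i in range(max_iter):
        zx2, zy2 = fx_mul(zx, zx), fx_mul(zy, zy)
        if zx2 + zy2 > (4 << FRAC_BITS):      # |z|^2 > 4: escaped
            return i
        zx, zy = zx2 - zy2 + cx, (fx_mul(zx, zy) << 1) + cy
    return max_iter


print(mandel_iterations(0, 0), mandel_iterations(2 << FRAC_BITS, 2 << FRAC_BITS))
```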



_________________
https://www.alrj.org/pages/Astorisc.html


Fri Mar 31, 2023 8:41 am WWW

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 62
Nice!

-Gordon


Fri Mar 31, 2023 9:11 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 41
Multiplication can be fairly cheap on a RISC with a 'multiply step' instruction, and reasonably fast if you can use radix-4 Booth's algorithm. However, it does require an extra 32-bit shifter on one of the inputs to the adder in addition to the 2-bit shift and add/subtract hardware.
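
A behavioural Python sketch of radix-4 Booth recoding, for illustration only; it models the arithmetic and not the multiply-step instruction or the datapath described above. Each step looks at two multiplier bits plus the top bit of the previous pair and adds 0, ±a or ±2a, so a 32-bit multiply takes 16 steps. The function name and the 32-bit default are assumptions.

```python
def booth_radix4(a, b, bits=32):
    """Signed a * b, with b taken as a two's-complement value of `bits` bits."""
    b_bits = b & ((1 << bits) - 1)
    product = 0
    prev = 0                                   # implicit bit b[-1] = 0
    for step in range(bits // 2):
        low = (b_bits >> (2 * step)) & 1
        high = (b_bits >> (2 * step + 1)) & 1
        digit = prev + low - 2 * high          # Booth digit in {-2,-1,0,1,2}
        # Select 0, +/-a or +/-2a with a shift and a negate, no multiplier.
        addend = {0: 0, 1: a, 2: a << 1, -1: -a, -2: -(a << 1)}[digit]
        product += addend << (2 * step)
        prev = high
    return product


print(booth_radix4(7, -3), booth_radix4(-12345, 6789))
```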


Wed Apr 05, 2023 1:30 am