 [ 237 posts ]  Go to page Previous  1 ... 9, 10, 11, 12, 13, 14, 15, 16  Next
 rj16 - a homebrew 16-bit cpu 
Author Message

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
So, first video in the GPU series is up. In this episode I get the VGA timing working and get a test pattern displayed.

[029] VGA Blinkenlights! (Part 1) https://youtu.be/nVaOJ6CwIic

I have probably at least 2 more videos just to implement text mode.

If there's demand, perhaps I can do tile maps and sprites. Tile maps are basically text mode but with customizable characters and more colours. Sprites are arbitrary images displayed at an arbitrary location. But the main issue with implementing tile maps and sprites right now is the CPU is not really mature enough to be able to use them yet.

I have also recorded part of a video on implementing a proper ALU, but I put that on pause to do the GPU miniseries. I will put those videos out after.


Fri Apr 30, 2021 5:16 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Next part is up. I maybe should wait longer between videos but meh, I am having fun :D

In this episode, text mode is implemented with a character generator.

[030] VGA Text Mode! (Part 2) https://youtu.be/PFazC5LR2eI

Next episode I show it working in the FPGA and get some dynamic text being displayed. I got an el cheapo $15 HDMI to USB capture device that works really well for capturing the HDMI output from the FPGA.


Sat May 01, 2021 4:58 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
Hey, if you're enjoying it, that's great motivation!


Sat May 01, 2021 5:01 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 256
It is good to see the FPGA works well. At this end of the world, with the other brand of hardware (ALTERA), I have been getting a lot of almost-working builds with the routing.
Only 1 out of 4 builds works with the hardware. Add a halt for testing and it works; take out the halt and it crashes. I have a 32-bit design, and wonder if any other people have build problems with a design larger than 16 bits?


Sun May 02, 2021 2:30 am

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 71
The videos are coming thick and fast - I haven't watched them all yet, but I've enjoyed the ones I have watched - thanks for sharing.

oldben wrote:
It is good to see the FPGA works well. At this end of the world, with the other brand of hardware (ALTERA), I have been getting a lot of almost-working builds with the routing.
Only 1 out of 4 builds works with the hardware. Add a halt for testing and it works; take out the halt and it crashes. I have a 32-bit design, and wonder if any other people have build problems with a design larger than 16 bits?


It sounds like your design's failing to meet timing - and 32-bit operations certainly take longer than 16-bit. Do you have timing constraints set up? (If the design's self-contained on the FPGA and not, for example, making use of the SDRAM, then setting up timing constraints should be pretty simple. If you'd like some help with that process maybe start a new thread - or feel free to PM me.)

The place-and-route process uses Monte Carlo methods which means the tiniest change to the code can result in primitives being placed completely differently in the FPGA, and thus the paths between them varying quite a lot in length. Chances are something in your design is just on the edge of working timing-wise, and making a change to the codebase re-rolls the dice as to whether or not it works.

If the FPGA's getting full these problems tend to get worse, simply because it's less likely that primitives will be placed in such a way that they can be routed efficiently.


Sun May 02, 2021 7:17 am

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
robinsonb5 wrote:
The videos are coming thick and fast - I haven't watched them all yet, but I've enjoyed the ones I have watched - thanks for sharing.


The very latest two are mostly independent of the previous ones if you want to just watch them. I would be curious about your (or anyone else's) thoughts on my design. It's my first foray into trying to make something run fast on an FPGA. 640x480 only requires a 25.175 MHz clock, but it would be nice to do 720x400 to be more widescreen, but then we're looking at 35.5 MHz and I am not sure what optimizations I need to make to do that kind of clock in the ice40 up5k. Currently it runs at 34 MHz, so it's close. I suspect I need registers around any adders at the very least, and maybe convert the timing to a state machine instead of comparing numbers.

And then if I do get it to run at 35.5 MHz, then I may need to think about having two clock domains. The cpu is only able to do around 26 MHz and I'd have to make it multi-cycle to try to run it faster, and that will be a while. Crossing clock domains sounds a bit intimidating, but all I need to do is pass debugging information one way across clock domains from the slower one to the faster one. A quick google shows just three registers required. Seems not too bad?

Edit: The critical timing path runs through this:

Attachment:
critpath_sub32.png


This subtracts 32 from the ASCII character so I don't need to store the unprintable first 32 ASCII characters in ROM. I wonder if there's a way to do that without having to do a subtract? Or maybe there's a more efficient way to subtract 32?




Last edited by rj45 on Sun May 02, 2021 1:34 pm, edited 1 time in total.



Sun May 02, 2021 12:54 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
I haven't watched your video, but: a preloaded count-down which hits zero will probably be simpler and might be faster than a count-up-and-compare.
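
As a rough behavioural sketch of that suggestion (in Python rather than HDL, with hypothetical function names), the two counter styles below produce the same end-of-period pulse. The assumption is that in hardware the count-down version only needs a zero detect rather than a comparison against the period value, which is where any saving would come from.

```python
def count_up_compare(period, cycles):
    """Count 0..period-1 and pulse when the counter equals period-1."""
    pulses = []
    count = 0
    for _ in range(cycles):
        hit = count == period - 1        # comparison against a constant
        pulses.append(int(hit))
        count = 0 if hit else count + 1  # wrap after the pulse
    return pulses


def count_down_reload(period, cycles):
    """Preload period-1, count down, and pulse when the counter hits zero."""
    pulses = []
    count = period - 1
    for _ in range(cycles):
        hit = count == 0                          # zero detect only
        pulses.append(int(hit))
        count = period - 1 if hit else count - 1  # reload after the pulse
    return pulses


# 800 is used here as an example line length (total clocks per line).
up = count_up_compare(800, 2000)
down = count_down_reload(800, 2000)
```

Both lists carry a pulse on cycles 799 and 1599, so the rest of the timing logic would not need to change if one style were swapped for the other.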


Sun May 02, 2021 12:57 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1488
Location: Canada
Larger designs are a PITA. I have a similar issue with a design where if a dummy pipeline stage is added it works, remove the dummy stage and ‘poof’ it no longer works in the FPGA. But it works in simulation without the dummy stage.

Try and ensure that there are enough FF’s in use in the design. Breaking up the long signal lines with FF’s seems to help. Of course, that means more pipeline stages for some signals.

Super-pipelining the design might help. FPGA’s like to see the use of an FF after a LUT.

To get high speeds, use FF’s and pipeline wherever possible. I have got up to 1366x768 (85 MHz) in a slow Artix-7 Xilinx FPGA. However, normally my SoC runs 800x600 VGA mode (40 MHz) because it works far better with other timing (cpu and vga to HDMI converter). It sounds like the ICE part may be a bit slower. 800x600 at 56 Hz is only 36 MHz, which is close to what the ICE is running at. It may work even if it is officially too fast for the part.
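
To make the relationship between mode and pixel clock concrete, here is a small sketch. The line and frame totals (visible plus blanking) are the commonly published figures for these modes, quoted here as an assumption to be checked against the timing tables you actually use; the function names are hypothetical.

```python
# (total clocks per line, total lines per frame, nominal pixel clock in MHz)
# Totals include the blanking intervals; assumed standard values.
MODES = {
    "640x480@60": (800, 525, 25.175),
    "800x600@56": (1024, 625, 36.0),
    "800x600@60": (1056, 628, 40.0),
}


def refresh_hz(h_total, v_total, pixel_clock_mhz):
    """Frame rate implied by a pixel clock and the total frame size."""
    return pixel_clock_mhz * 1e6 / (h_total * v_total)


def clock_period_ns(pixel_clock_mhz):
    """Time budget per pixel, i.e. the longest path the logic may have."""
    return 1000.0 / pixel_clock_mhz


for name, (h_total, v_total, mhz) in MODES.items():
    print(name, round(refresh_hz(h_total, v_total, mhz), 2), "Hz,",
          round(clock_period_ns(mhz), 2), "ns per pixel")
```

The per-pixel period is the number to compare against the critical path reported by the timing tools.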

I have found it handy to generate vertical blank interrupt, end-of-frame (eof), and end-of-line (eol) signals from the VGA timing block, each a pulse a single clock wide. I also have a border signal generated in case the displayed area is less than the blanking area. Some of the more obtuse display modes in use generate partially displayed text rows or columns unless a border area is defined. The border area is inside the blanking area.

_________________
Robert Finch http://www.finitron.ca


Sun May 02, 2021 1:33 pm WWW

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
See my edit above for the critical timing path. What I am trying to do is optimize the amount of character ROM required so I don't use all my block RAM just to store pixels.

I just need the uppercase letters, and numbers.

It looks like the upper two bits of the 7 bit ASCII code will be either 01 (for digits) or 10 (for A-Z). I could map those two to 00 and 01, and anything else will just map to zero.

Edit: Oh! I could map 11 to 01 and I get lowercase ASCII mapping to the uppercase letters for free! Sweeeeet :D


Sun May 02, 2021 1:57 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
Yes, you shouldn't need an actual subtract. Having said that, I'd hope that synthesis would spot a subtraction of a power-of-two constant and make simple fast logic for it.


Sun May 02, 2021 2:00 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 71
rj45 wrote:
I would be curious about your (or anyone else's) thoughts on my design. It's my first foray into trying to make something run fast on an FPGA. 640x480 only requires a 25.175 MHz clock, but it would be nice to do 720x400 to be more widescreen, but then we're looking at 35.5 MHz and I am not sure what optimizations I need to make to do that kind of clock in the ice40 up5k. Currently it runs at 34 MHz, so it's close. I suspect I need registers around any adders at the very least, and maybe convert the timing to a state machine instead of comparing numbers.


To be honest I tend to let the tools do the hard work for me - I set the clock to the speed I want, do a build, and examine the timing report to see what failed. I've not used the Lattice toolchain at all, but TimeQuest within Quartus will let you see each element within a path, and see how much delay each one adds to the path. It's not always immediately obvious which part of your design it's referring to, though, since the optimisations can change things around quite a bit.

Quote:
And then if I do get it to run at 35.5 MHz, then I may need to think about having two clock domains.


The simplest way to do multiple clocks is to make sure that they're generated from the same PLL, and are integer multiples of each other - that way the edges are always aligned and you won't have metastability issues.

Quote:
A quick google shows just three registers required. Seems not too bad?


Yeah it's not too bad. To state the obvious, if you're transferring, say, a 32-bit value you probably don't want to register the entire 32-bit value three times - just make sure it doesn't change until after it's been transferred, synchronise a req signal with a couple of flip-flops, then register the value in the target domain when the req signal emerges from the synchroniser.
(You'll often see this done using a dual-port-RAM-based FIFO queue, which removes the requirement for the input signal to remain stable throughout the entire transaction.)
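
A behavioural sketch of that req-based transfer, in Python with hypothetical class names, might look like the following. It only models the ordering of events; Python cannot model metastability, so the two flip-flops here simply show where the settling time comes from. It assumes the sender holds the data bus stable while req is high.

```python
class TwoFlopSynchroniser:
    """Two back-to-back flip-flops clocked in the destination domain."""

    def __init__(self):
        self.ff1 = 0
        self.ff2 = 0

    def clock(self, async_in):
        # Both flops update on the same edge: ff2 takes the old ff1 value.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2


class Receiver:
    """Captures the wide value once the synchronised req rises."""

    def __init__(self):
        self.sync = TwoFlopSynchroniser()
        self.prev_req = 0
        self.captured = None

    def clock(self, req, data_bus):
        req_synced = self.sync.clock(req)
        if req_synced and not self.prev_req:
            # The data bus has been stable for at least two destination
            # clocks by now, so it is registered once, unsynchronised.
            self.captured = data_bus
        self.prev_req = req_synced


rx = Receiver()
data_bus, req = 0x12345678, 1   # sender presents data, then raises req

rx.clock(req, data_bus)
after_one_edge = rx.captured    # still None: req is only in the first flop

rx.clock(req, data_bus)
after_two_edges = rx.captured   # captured once req leaves the synchroniser
```

Only the single req bit passes through the synchroniser; the 32-bit value is registered once in the target domain.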

Quote:
This subtracts 32 from the ASCII character so I don't need to store the unprintable first 32 ASCII characters in ROM. I wonder if there's a way to do that without having to do a subtract? Or maybe there's a more efficient way to subtract 32?


So the mapping you want is:
0x00 -> ------
0x20 -> 0x00
0x40 -> 0x20
0x60 -> 0x20
0x80 -> ------
0xa0 -> ------
0xe0 -> ------
If you don't mind the character set being aliased into the unused input ranges, you can do this just with wiring - ignore the incoming bit 5 completely, and map bit 6 to bit 5.


Sun May 02, 2021 3:13 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yeah that's exactly what I ended up doing just now:

Attachment:
criticalpath_optimized.png


Much better than a subtract.... I basically just take the 7 bit ASCII and just take bits 0-4 and 6 to make a 6 bit number. The digits repeat twice and the letters also repeat twice. Zero ends up being the space, which is perfect.
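
For reference, the wiring described above can be written out as a small Python sketch (the function name is hypothetical). Within the printable range 0x20 to 0x5F it gives the same index as subtracting 32, and lowercase letters fold onto uppercase.

```python
def font_index(ascii_code):
    """Build a 6-bit font ROM index from a 7-bit ASCII code by wiring only."""
    low5 = ascii_code & 0x1F       # bits 0-4 pass straight through
    bit6 = (ascii_code >> 6) & 1   # bit 6 becomes bit 5 of the index
    return (bit6 << 5) | low5      # incoming bit 5 is ignored entirely


for ch in " 0Aa":
    print(repr(ch), hex(font_index(ord(ch))))
```

Codes below 0x20 alias onto the 0x20 to 0x3F block, which is why the digits also appear twice.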

It now runs at 42.5 MHz.

The next two critical paths are from the font ROM to the R/G/B outputs, and the timing circuit. The timing circuit could be faster by converting it to a state machine and removing the compares (if possible). This is a bit more tricky though:

Attachment:
critpath_rgb.png


The crit path seems to be from CD (font ROM data) to R/G/B. I guess I could insert another set of pipeline registers? Not sure if there's a better way to pick a bit though, maybe there is?




Sun May 02, 2021 3:19 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 73
The real problem is color. If it matters, starting from SVGA you start needing a decent DAC.

I have used a couple of them, but for the current VDU project I only want 1 bit green/black screen, with fixed intensity but with hardware blinking and highlighting.

I love having a loadable font ROM. With VGA 640x480, a good choice is chars of 8x16 pixels. I have the font set loadable into BRAM, so I can reload it on the fly for a different font set. The screen is 80 columns by 30 rows, more than enough for my needs.


Sun May 02, 2021 3:30 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yeah, the HDMI adaptor I am using has 4 bits per channel, so 12 bit colour. But the fonts are just monochrome, and you can set the desired colour. I don't think I released the video where I show that part yet.

As for the critical path, I looked at the timing. The multiplexers aren't the problem: it looks like (if I am reading this right) it takes 5.3 ns for the font ROM to produce a value, then just 1.3 ns to run through the multiplexers. The real problem though looks like routing -- 17 ns, and I am guessing that's because it's going to an output pin.
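
Adding those three figures up as a quick sanity check (a sketch that ignores setup time and clock skew, which is an assumption; the names are hypothetical):

```python
# Delays along the reported path, in nanoseconds.
path_ns = {"font ROM": 5.3, "multiplexers": 1.3, "routing": 17.0}

total_ns = sum(path_ns.values())


def fmax_mhz(path_delay_ns):
    """Highest clock frequency whose period still covers the path delay."""
    return 1000.0 / path_delay_ns


def period_ns(clock_mhz):
    """Clock period available at a given frequency."""
    return 1000.0 / clock_mhz


print(f"path delay {total_ns:.1f} ns, fmax about {fmax_mhz(total_ns):.1f} MHz")
print(f"budget at 35.5 MHz: {period_ns(35.5):.1f} ns")
```

The total comes to 23.6 ns, which is consistent with the roughly 42.5 MHz reported earlier, and routing accounts for most of it.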

Maybe registering the outputs will help?


Sun May 02, 2021 3:51 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
So I registered the outputs and inserted registers before the BRAM address. So the last remaining critical path is the counter in the VGA timing circuit. I reduced that from 16 bits to 12. And now it runs at just a hair under 50 MHz. I think that's the best possible unless I get fancy with the counter. Interestingly, I did not have to convert the timing circuit to a state machine.


Sun May 02, 2021 5:13 pm