 16-bit Programmable 2D graphics processor plus simple CPU 

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
The CPU is a bit of an afterthought, but it will be a 16-bit RISC machine. The graphics processor is the main purpose of the project, and should have enough capability to emulate the different techniques of doing 2D graphics, from NES-style tilemaps and sprites up to a blitter with scaling.

Essentially this will be a programmable state machine driving a couple of address adders that can shift one to two* 8-bit pixels per cycle (*depending on how expensive alignment is). The pixel pipe will look like this: source memory -> bit shifter/alignment -> colour lookup table (1, 2, 4 or 8-bit to 8-bit) -> back buffer memory / front buffer memory -> colour lookup table (8-bit to 16-bit) -> VGA DAC.


I don't want to spend any money on this, so I'm constrained on parts like 5V SRAMs, but I have heaps of DRAM.


Fri Sep 15, 2023 1:11 am

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
Sounds interesting - will it have an interface to (say) 8-bit CPUs? SPI, Parallel? Native Bus?

Over on 6502.org there are a few threads of video/vga/hdmi output from the old systems...

Cheers,

-Gordon


Fri Sep 15, 2023 6:36 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
Good question. Interfacing to the CPU is the hardest part. Ideally I'd like to double pump the data memory (where the tilemaps, display list, etc. are stored) so that no signaling or waiting is needed on either side. However, the clock rate will likely be in the high teens of MHz. Double that and I don't think timing can be met with TTL. The problem is managing data contention when switching between reads and writes.

The CPU needs to interface to the data SRAM, the instruction SRAM for the state machine, and a status register. Since double pumping won't work, I think a native bus interface via some lockable tristate buffers or registers is the way to go. Access will then depend on the state machine.


Sat Sep 16, 2023 2:22 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
Well, I've run into my first headache. Also, I'm switching to processing one pixel per cycle to save chips.

To support textures of different bit depths (1, 2, 4 or 8-bit), the bit shifter needs to borrow the three least significant address bits. The problem is that the address adder also has a fractional component for scaling. So either the carry has to be propagated to the appropriate bit position for the selected depth, or the whole address has to go through a painful shift.


Sat Sep 16, 2023 7:33 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
The fix for the above is ORing in some '1's so that the carry can propagate to the correct position without messing up the carry lookahead (CLA).
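A rough Python sketch of how the OR trick could work, not the actual circuit. It assumes an 8-bit fraction in the low bits with the 3-bit pixel select directly above it, and that the select bits a given depth doesn't use are forced to 1 on the accumulator side while the step has zeros there. The names `fill_mask`, `step_address` and `decode` are made up for illustration.

```python
FRAC_BITS = 8                # assumed: fraction in bits 0..7
SEL_BITS = 3                 # assumed: pixel select in bits 8..10
ADDR_MASK = (1 << 20) - 1    # assumed 20-bit adder

def fill_mask(bits_per_pixel):
    """Select bits that are unused at this depth and get forced to 1."""
    unused = {1: 0, 2: 1, 4: 2, 8: 3}[bits_per_pixel]
    return ((1 << unused) - 1) << FRAC_BITS

def step_address(acc, step, bits_per_pixel):
    """One adder step: OR ones into the unused select bits, then add.

    A carry arriving at a forced-1 bit (with a 0 in the step) always
    carries out, so it lands on the first bit that is actually in use.
    """
    mask = fill_mask(bits_per_pixel)
    if step & mask:
        raise ValueError("step must be zero in the forced bits")
    return ((acc | mask) + step) & ADDR_MASK

def decode(acc, bits_per_pixel):
    """Return (byte X, bit shift within the byte), ignoring forced bits."""
    sel = (acc >> FRAC_BITS) & ((1 << SEL_BITS) - 1)
    sel &= ~(fill_mask(bits_per_pixel) >> FRAC_BITS)
    return acc >> (FRAC_BITS + SEL_BITS), sel
```

With an 8-bit texture and a step of 0x80 (half a pixel), two steps from zero give byte X = 1: the carry out of the fraction skips all three select bits. With a 4-bit texture and a step of 0x400 (one pixel), the first step selects the nibble at shift 4 and the second moves on to the next byte.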


Sun Sep 17, 2023 12:44 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
A little explanation of the different components (very much WIP)

Starting with the frame buffer:
-The address bits are split into a Y and an X component, so that pixels are stored exactly as you would see them on screen. Top left is Y0,X0. Bottom right is Ymax,Xmax.
-Writes are in row-major order, one line of pixels at a time. Support for flipping to column-major might be added if I feel like tackling DOOM.

The destination adder:
-Only needs enough bits to cover the X component of the address. Y is held in a register.
-The Y destination register gets updated by the state machine for each new line.
-All the hard work for scaling happens on the source side.
-Since only a straight line of pixels can be moved, it only needs to advance the address by one for each write.
-12 bits wide to support 640 horizontal resolution.
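The Y/X split and the count-by-one destination adder can be sketched like this in Python. This is only an illustration: the frame buffer is modelled as a dict, `X_BITS` takes the 12-bit adder width quoted above, and `dest_address` and `write_line` are hypothetical names.

```python
X_BITS = 12  # destination adder width from the list above

def dest_address(y_reg, x_count):
    """Concatenate the Y register with the X counter to form the address."""
    return (y_reg << X_BITS) | (x_count & ((1 << X_BITS) - 1))

def write_line(framebuffer, y_reg, x_start, pixels):
    """Write one straight run of pixels, advancing X by one per write."""
    x = x_start
    for pixel in pixels:
        framebuffer[dest_address(y_reg, x)] = pixel
        x += 1  # the adder only ever steps by one
```

Moving to the next line is just a new value in the Y register, so no multiply by the line width is ever needed.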

The source adder:
-Currently 20 bits wide: 9-bit X component, 3-bit shift select, and 8-bit fraction. Plus a Y register, and a fixed X register that holds any overflow bits.
-The 3-bit shift select is used to select a pixel within a byte when using a texture bit depth of less than 8 bits.
-The fraction component is just used for scaling.
-The step register stores the addend and will be 20ish bits.
-Might need a shifter for loading the registers because of the different texture bit depths.
-As well as generating addresses, this adder can also be used by the state machine to do basic maths.
-Only addition and subtraction are supported.
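To show what the fraction and step register buy you, here is a simplified Python sketch of scaling one line. It assumes the texel index sits directly above an 8-bit fraction and ignores the shift select field; `scaled_source_indices` is a made-up name, and deriving the step by integer division is my assumption, not something stated above.

```python
FRAC_BITS = 8  # 8-bit fraction, as in the source adder

def scaled_source_indices(src_width, dst_width):
    """Source texel index read for each destination pixel of one line."""
    step = (src_width << FRAC_BITS) // dst_width  # fixed-point addend
    acc = 0
    indices = []
    for _ in range(dst_width):
        indices.append(acc >> FRAC_BITS)  # integer part picks the texel
        acc += step                       # one add per destination pixel
    return indices
```

Stretching 4 texels across 8 pixels gives a step of 0x80 and reads each texel twice; shrinking 8 texels into 4 pixels gives a step of 0x200 and reads every second texel.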

The pixel pipeline:
-Bytes from texture memory are fed into a 2-stage pipeline.
-1st stage is a 0-7 bit right shifter driven by the 3-bit shift select. This supplies the appropriate bits to the colour look up table, and a zero detect circuit.
-An 8-bit register selects the palette. To keep things simple there are only 256 palettes no matter the selected bit depth.
-The zero detect circuit is used for transparent pixels by suppressing writes to the framebuffer for the zeroth colour. Sacrificing one colour in the palette is a lot cheaper than a mask.
-2nd stage is just the write to the frame buffer.
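A behavioural sketch of the two stages in Python, under some assumptions of mine: the CLUT is indexed by the palette number in the high byte and the pixel in the low byte, and memories are modelled as dicts. `pixel_stage` and its parameters are hypothetical names.

```python
def pixel_stage(byte, shift_select, bits_per_pixel, palette,
                clut, framebuffer, dest):
    """Push one pixel through the pipe; return True if it was written."""
    mask = (1 << bits_per_pixel) - 1
    pixel = (byte >> shift_select) & mask   # stage 1: right shifter
    if pixel == 0:                          # zero detect: transparent
        return False                        # write is suppressed
    colour = clut[(palette << 8) | pixel]   # assumed CLUT addressing
    framebuffer[dest] = colour              # stage 2: frame buffer write
    return True
```

For a 4-bit texture byte of 0xA0, a shift select of 4 yields pixel 0xA, which is looked up and written, while a shift select of 0 yields pixel 0 and leaves the frame buffer untouched.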

The state machine: (under heavy construction so bear with me)
-Its job is to use the display data from the CPU to preset the adders, then move bytes from one memory to another, rinse and repeat.
-It is implemented as a fast SRAM addressed with a program counter.
-I think it will need a scratchpad of some description. If it's not too costly I'd prefer registers only.
-All loops are assumed to be unrolled so there is no loop state to keep track of. Relative jumps can be used to move a variable number of pixels by jumping into the middle of an unrolled loop.
-No support for recursion. Subroutine calls use a jump and a location in the scratchpad for the return address. Might use a link register as a faster alternative.
-I don't think branches are needed.
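Jumping into the middle of an unrolled loop works a bit like Duff's device. A toy Python model, assuming a run of identical "move" instructions followed by a "ret"; the instruction names and both functions are invented for the example.

```python
def entry_offset(unrolled_len, n_pixels):
    """Offset into the unrolled run that leaves exactly n_pixels moves."""
    if not 0 <= n_pixels <= unrolled_len:
        raise ValueError("n_pixels must fit in the unrolled loop")
    return unrolled_len - n_pixels

def run_from(program, pc):
    """Execute straight-line code from pc until 'ret'; return what ran."""
    executed = []
    while program[pc] != "ret":
        executed.append(program[pc])
        pc += 1  # no loop counter, only the program counter advances
    return executed
```

With a 16-long unrolled loop, moving 5 pixels means jumping to offset 11, and no counter or conditional branch is involved.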


Tue Sep 19, 2023 3:21 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
I think I'm going to switch direction and integrate most of the graphics into the cpu.

This will be in the form of a single cycle instruction that selects a pixel from a 16-bit general purpose register, passes it through a colour lookup table, then does a byte write to memory if the pixel is not zero.

The frame buffers will be stored in data memory and I'm switching to an FPGA to drive the screen, mostly because I've had aliasing issues using VGA to drive LCD screens at low resolutions in the past. Hopefully running at the LCD's native resolution should fix that. If it doesn't, DVI is an easy upgrade. Also, having access to a couple of PLLs makes timing the computer much simpler.

For now the FPGA will just shadow CPU writes to its own frame buffers. The only trick I want to implement is support for a tear-free single buffer by making a copy of the backbuffer instead of doing the usual page flip. For this class of computer, redrawing the entire screen each frame is very expensive. It's much more efficient to use a single buffer and only draw the changes between each frame.


Tue Oct 03, 2023 8:55 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
While the CPU is a classic 16-bit RISC, I want to add some support to increase the memory space to 32 bits; viewtopic.php?f=3&t=989 discusses this a bit. The idea is to use a VM to emulate a 32-bit machine. There may be some hardware and instruction support to help make this more efficient. I think it should be possible to match a 68k clock for clock.

Current WIP CPU features:
-16-bit RISC 2-stage barrel processor with 32-bit instructions. Getting a RISC ISA to fit in 16-bits is stressful and not recommended.
-Harvard architecture. Instruction memory is 256k x 32-bits and write only. Data is 1024k x 16-bits. Both are fast 10ns SRAM.
-Pipeline:
Code:
Stage 1, phase 1:  writeback / instruction fetch*
         phase 2: operand fetch / instruction decode

Stage 2, phase 1: ALU or effective address / zero and negative detect for branches / pixel shift and LUT / shifter
         phase 2: Data memory / zero and negative detect for comparisons / PC increment or branch displacement or load / instruction fetch* / shifter

*Instruction fetch and branching are in a state of flux. If decode can't be done in time, instruction fetch can be slotted into stage two.
-Clock cycle worst case should be under 100ns
-For shifts I'd like to do a 32-bit shift in 2 cycles. My first thought was a funnel shifter, but that may be too big/slow.

-The register file is made of single-port SRAM wired up for 3R/1W in a write-through mode. The extra read is for 32-bit addressing.
-31 registers plus zero. Writes to R0 are forced to zero, rather than forcing zero on the read side, as the hardware can be reused for set-on-condition instructions.
-Registers are 32 bits wide (16-bit physical) but split into two 16-bit halves. For 16-bit ops I think this will be treated as 64 registers.
-Although 32-bit writes are possible for storing the PC, the upper 16-bit word isn't shadowed to the other banks, so it will require moving the register to itself, which will copy it to all banks.
-There's plenty of space for multiple register files so there will be an instruction to switch register file.

-Initially branches were going to be similar to Alpha and branch on comparison with zero. However, moving the instruction fetch to the next stage means there is enough time for compare-and-branch instead.
-Compare and set instructions will be available
-Unconditional jumps will be rolled into Jump and Link, and Jump and Link Register instructions. JAL will probably be PC relative. The link register can be any register.

-External DRAM. Access time will probably be two cycles unless some kind of forwarding is implemented.
-It would be nice for the VM to have support for virtual memory, but I haven't come up with a performant way of dealing with this.


Last edited by DockLazy on Tue Oct 10, 2023 2:58 am, edited 2 times in total.



Mon Oct 09, 2023 2:12 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 581
Don't over-design; that is what happened to my last project.
Get the base machine working in 16 bits, but make the MAR, say, 20 bits.
Then when you go to 32 bits, you can port the I/O and memory cards over.


Mon Oct 09, 2023 7:51 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
Funnily enough, this design doesn't have a MAR. The address lines are connected straight to the ALU via a buffer.

The I/O interface needs to be designed early as it can't be easily changed later. For example: interrupts or polling, Harvard or memory-mapped I/O, etc.


Tue Oct 10, 2023 2:56 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 39
Registers are going to be switched from 16-bit to 24-bit. The extra 8 bits are meant for addressing. The data bus and memory remain at 16 bits.

This makes instruction decoding a bit easier, but adds some other complications, like the shifter requiring more chips. Zero detect also requires more chips, but only just remains at two logic delays.

Since only 23 bits are needed for addresses, I was thinking of using the 24th bit as a carry flag and as a guard bit for signed comparisons. The way this would work is that the 24th bit going into the ALU will be hardwired to zero, or sign extended for signed arithmetic. All 24 bits of the result will then be stored in the register file. That extra bit can then be used by a following ADC or SBC instruction.
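A small Python model of the unsigned case, to show why the 24th bit ends up holding the carry. It assumes ADC names the register holding the previous result as its carry source, which is my reading rather than something stated above; the signed/guard-bit case is left out, and `add_u`, `adc_u` and `add46` are invented names.

```python
LOW23 = (1 << 23) - 1

def add_u(a, b):
    """Unsigned ADD: bit 23 of each input is forced to zero going in,
    so bit 23 of the 24-bit result is the carry out of the 23-bit add."""
    return (a & LOW23) + (b & LOW23)

def adc_u(a, b, prev):
    """ADC: as ADD, plus the carry stored in bit 23 of an earlier result."""
    return (a & LOW23) + (b & LOW23) + ((prev >> 23) & 1)

def add46(x, y):
    """Two-limb add of 46-bit values built from ADD followed by ADC."""
    lo = add_u(x & LOW23, y & LOW23)
    hi = adc_u(x >> 23, y >> 23, lo)   # carry travels in the register
    return ((hi & LOW23) << 23) | (lo & LOW23)
```

The result never needs more than 24 bits, since two 23-bit values plus a carry sum to at most 2^24 - 1, so there is no separate flag to save across a fault or interrupt.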

The reason I do it this way is that detecting overflow is a bit of a pain, and having extra state in the pipeline is a special pain when it comes to faults or interrupts, especially in this case where control is hardwired.

How the colour lookup table (CLUT) is used has been modified, as the RAM I was using for it is now part of the register file instead of being its own separate thing. The CLUT will now be stored in one of three 1MB SRAM banks. The three banks are actually wired up to be used as general purpose RAM; it's just that two banks have a special function. One is used to store the display buffers. There's no extra hardware for this; the FPGA shadows this bank. The other bank, used by the CLUT, has an address mux that creates an address from the palette plus pixel data instead of using the effective address. This allows reads from the CLUT and writes to the display bank at the same time.

As far as graphics performance goes, it will take 320 cycles (1.25 cycles per 4-bit pixel) to draw a 16x16 4-bit sprite. That is 64 16-bit load instructions plus 256 pixel shift and store instructions. That's a maximum of 4 million pixels per second per thread at 10MHz. That doesn't include the cycles required for setup, which is heavily dependent on what is being drawn.

If pixels are aligned it is possible to move two 4-bit pixels per cycle. Realistically this will only be useful for fills and some tilemaps or bitmaps. Scrolling is handled by changing the display buffer pointer, so most tilemaps should be aligned. Drawing an 8x8 4-bit tile will take 16 load instructions and 32 double pixel shift and store instructions, for a total of 48 cycles or 0.75 cycles per pixel. That's 6.7Mpixels per second per thread at 10MHz.
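The arithmetic behind those figures, as a quick Python check. One thing is inferred rather than stated: the per-thread rates only work out if each thread gets 5MHz, i.e. the 10MHz barrel processor alternates between two threads. Function and parameter names are made up.

```python
def draw_cycles(width, height, pixels_per_load=4, pixels_per_store=1):
    """Cycles to draw a 4-bit image: 16-bit loads plus shift-and-stores."""
    pixels = width * height
    loads = pixels // pixels_per_load     # four 4-bit pixels per 16-bit load
    stores = pixels // pixels_per_store   # one or two pixels per store
    return loads + stores

def mpixels_per_second(cycles, pixels, thread_mhz=5.0):
    """Throughput for one thread; 5MHz assumes two threads share 10MHz."""
    return pixels / cycles * thread_mhz
```

`draw_cycles(16, 16)` gives 320 and `draw_cycles(8, 8, pixels_per_store=2)` gives 48, which at 5MHz per thread come to 4.0 and about 6.67 Mpixels per second.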

Compare this to my original hardware design. It was set up to move lines, so it performs best when moving large, wide blocks. Each row change takes one cycle. For a 16x16 4-bit sprite it's 16 cycles of row changes plus 256 cycles of pixel moves, for 272 cycles total or 1.0625 cycles per pixel. For an 8x8 tile it is 8 cycles of row changes plus 64 cycles of pixel moves, for a total of 72 cycles or 1.125 cycles per pixel. It was designed to reduce the workload on the CPU as much as possible, so it took a brute force approach and drew every frame from scratch. Because of that, all pixels are assumed to be unaligned, so only one can be moved per cycle, and it only needed an 8-bit data path.

Worst case clock for this design was 15MHz. So it could move between 13.33Mpixels (8x8) and 14.12Mpixels (16x16) per second max. It had about the same performance as a SNES or Genesis but was much, much more flexible.
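The same sums for the original line-moving hardware, as a short Python check: one cycle per row change plus one cycle per pixel, at the 15MHz worst case clock. The function names are invented for the example.

```python
def line_mover_cycles(width, height):
    """One cycle per row change plus one cycle per pixel moved."""
    return height + width * height

def line_mover_mpixels(width, height, clock_mhz=15.0):
    """Peak throughput in Mpixels per second, setup not included."""
    return clock_mhz * width * height / line_mover_cycles(width, height)
```

This gives 72 cycles for an 8x8 tile and 272 for a 16x16 sprite, or roughly 13.3 and 14.1 Mpixels per second.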

Out of interest here is how bad it would be with no hardware support, using the same packed 4-bit pixels. 16x16: 64 * (1 cycle load + 4 * ( 4-bit shift 1 cycle + mask 1 cycle + add CLUT pointer 1 cycle + load CLUT 1 cycle + store pixel 1 cycle)) for a horrifying total of 1344 cycles or 5.25 cycles per pixel. That's only 0.952Mpixels per second per thread.
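Spelled out in Python, the software-only path for one 16-bit word looks something like this. It counts one cycle per operation, as the sum above does, and assumes the lowest nibble is the first pixel; `draw_word_software` and its arguments are hypothetical.

```python
def draw_word_software(word, clut, clut_base, framebuffer, dest):
    """Unpack four 4-bit pixels from one word; return the cycles used."""
    cycles = 1                                # load the 16-bit word
    for i in range(4):
        shifted = word >> (4 * i)             # 4-bit shift
        cycles += 1
        pixel = shifted & 0xF                 # mask
        cycles += 1
        address = clut_base + pixel           # add CLUT pointer
        cycles += 1
        colour = clut[address]                # load from CLUT
        cycles += 1
        framebuffer[dest + i] = colour        # store pixel
        cycles += 1
    return cycles
```

Each word costs 21 cycles, and a 16x16 sprite is 64 words, which is where the 1344 comes from.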

Adding pixel instructions to the CPU was well worth it. The cost is a mux for the CLUT RAM and some extra chips in the instruction decoder. Performance wise a separate GPU would've been better. Not only would it shift more pixels but it would free up an entire thread on the CPU. But alas, electronics are too expensive at the moment, especially, as it turns out, for something I don't really need.


Thu Oct 19, 2023 3:40 am