I have created another demo, and this time I also made some changes to the logic design.
This one is a sprite animation demo, and it turned out that drawing four 32x32 pixel sprites in software was too slow, so I added a minimal amount of hardware acceleration to the video controller. It can now shift a word of pixel data, and it can generate a mask so that a sprite can have a transparent color. These operations were the parts of the sprite routines that took the most cycles. Now the sprite routines are five times faster.
The additions to the logic design are minimal. I did not want to change the interface to the video RAM, so the new hardware functions are not connected to it. They are just some new registers that provide the shifting and masking operations, and afterwards you have to write the data to video RAM yourself.
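To illustrate what the two operations compute, here is a rough software model in C. This is only a sketch under assumptions that are not taken from the actual design: a 32-bit word holding eight 4-bit pixels with the leftmost pixel in the most significant nibble, and the hypothetical function names sprite_mask, shift_pixels and blend_word. The real registers may work differently.

```c
#include <stdint.h>

/* Assumed layout (not confirmed by the design): one 32-bit word holds
   eight 4-bit pixels, leftmost pixel in the most significant nibble. */
#define PIXELS_PER_WORD 8
#define BITS_PER_PIXEL  4

/* Build a mask that has all four bits set for every pixel that is
   NOT the transparent color, and zero bits for transparent pixels. */
uint32_t sprite_mask(uint32_t pixels, uint32_t transparent)
{
    uint32_t mask = 0;
    for (int i = 0; i < PIXELS_PER_WORD; i++) {
        unsigned shift = (unsigned)i * BITS_PER_PIXEL;
        uint32_t color = (pixels >> shift) & 0xFu;
        if (color != transparent)
            mask |= (uint32_t)0xFu << shift;
    }
    return mask;
}

/* Shift a word to the right by n pixels (0..7). The pixels pushed out
   on the right are stored in *spill, left-aligned, so they can be
   merged into the next word of the scanline. */
uint32_t shift_pixels(uint32_t pixels, unsigned n, uint32_t *spill)
{
    unsigned bits = n * BITS_PER_PIXEL;
    if (bits == 0) {            /* avoid an undefined 32-bit shift */
        *spill = 0;
        return pixels;
    }
    *spill = pixels << (32 - bits);
    return pixels >> bits;
}

/* Combine background and sprite: sprite pixels win where the mask is
   set, the background shows through where it is clear. */
uint32_t blend_word(uint32_t background, uint32_t sprite, uint32_t mask)
{
    return (background & ~mask) | (sprite & mask);
}
```

In this model, drawing one sprite word means shifting it to the target pixel position, generating the mask, reading the background word from video RAM, and writing back the result of blend_word. Only the shift and mask steps would correspond to the new registers; the read and write of video RAM stay in software.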
Here is a video showing the sprite demo before and after the new hardware features:
https://youtu.be/67gn44C7D5A
Now I am thinking of doing something similar to provide integer multiply/divide functionality which the Tridora-CPU itself does not have. Of course, I could do it right and create new instructions for that, but I fear that might break the logic design which seems to already have some random timing problems (which most of the time go away if I change synthesis optimization settings). I should also probably do some research on optimizing critical paths.