


 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Moved the remaining icache-load code from the mainline code into the icache controller module. An extra clock cycle was introduced on an L2 cache miss.
Since scrapping the loop in the data cache, things hadn't been working. It turns out another cycle of delay was required before registering the output of the data cache. This had previously been hidden by the loop back to re-test for a hit, which took a cycle.
For some reason the size of the bootrom was dimensioned a couple of words beyond 192k, which is inefficient to implement with block RAMs. The size was reduced to an even 192k. The length of the code in the ROM and the ROM checksum are stored in the last two half-words of the ROM.
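A minimal sketch of how a boot-time check of those two half-words might look is shown below. The simple-sum checksum, the half-word units for the length, and the function name are assumptions for illustration, not the actual FT64 boot code.
Code:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: the ROM is viewed as an array of half-words, with the
   code length in the second-to-last half-word and the checksum in the last.
   The checksum algorithm (a plain 16-bit sum) is an assumption. */
static bool rom_checksum_ok(const uint16_t *rom, size_t rom_hwords)
{
    size_t   length   = rom[rom_hwords - 2];   /* code length, assumed in half-words */
    uint16_t expected = rom[rom_hwords - 1];   /* stored checksum                    */
    uint16_t sum = 0;

    for (size_t i = 0; i < length && i < rom_hwords - 2; i++)
        sum += rom[i];

    return sum == expected;
}
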

_________________
Robert Finch http://www.finitron.ca


Mon May 06, 2019 2:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Sprites are back! At least in test mode. I had changed a '<' to a '<=' for the z-order compare. Since the default screen-display z-order was zero, the sprites could no longer appear on top.
Since there seem to be no issues with displaying 32 sprites, I figure I'll try for 64. It would be nice to have enough sprites to render a game like Space Invaders or Galaga using sprites without having to resort to tricks.

The assembler wasn't recognizing the correct registers for the temporary registers $t0 to $t7. These should be registers $r3 to $r10, but the assembler was encoding them as $r5 to $r12. The only reason things worked is that hardly any functions use more than three or four temporary registers. I finally decided to use $t7 to hold the stack offset of the link register, and that conflicted with $r12, which is a register variable.

Traced one error to zeros being loaded into the instruction register. There seems to be an issue with the instruction cache. The zeros are in the middle of the cache line; it's as if the cache missed loading a couple of words, or there is bad cache memory stuck at zero. I put in a fault handler for this case that just invalidates the instruction cache and then returns. This should cause a reload of the cache, and since the cache is 4-way associative the line will likely not be reloaded into the same way. If there's a problem in the icache it might then be bypassed by reloading and continuing execution.
The beginnings of the instruction bus error handler: just invalidate and return. Hopefully it's just a spurious issue.
Code:
                           .ibeFlt:
FFFFFFFFFFFCD1B4 05 C0 22 01                          csrrd   $r22,#$048,$r0      ; get EPC
FFFFFFFFFFFCD1B8 1E 76 00 00                          cache   #2,[$r22]               ; invalidate line
FFFFFFFFFFFCD1BC 02 00 48 04                          sync
FFFFFFFFFFFCD1C0 28 14 D5 FC                          jmp      _return_from_interrupt


I think I got the invalidate icache line command to work. It’s tricky because it has to look like a miss and stall the ifetch while the cache line is being invalidated. The address bus used to update the icache is shared.

_________________
Robert Finch http://www.finitron.ca


Tue May 07, 2019 5:00 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The TLB component was moved outside the ALU. It didn't really belong there, and it could not be used with two ALUs. It still remains programmable via the ALU issue path in the core. The reset signal was carefully removed from the TLB; I've been removing resets wherever possible, since the fewer signals there are that need to be routed, the better.

Inventing more compiler goodies; today, the 'auto new' construct. If the 'new' keyword is preceded by the 'auto' keyword, then the memory allocated is 'automatically' freed at the function return point. The compiler generates code to maintain a linked list of allocated objects, which saves the programmer from having to create the list and the code to walk it in the event of errors. So,
Code:
 int *pd = auto new int[256];
would allocate an array of 256 integers which is automatically freed when the function returns. The generated code keeps track of allocations using a singly linked list whose head is stored in the function's return block. Items are stored so that they may be freed in the reverse order of allocation. The new/delete keywords implement the standard mechanism for allocating and freeing memory, where it is all controlled by the program. To get memory that is freed by the system, the 'gcnew' keyword is used.
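Conceptually, the bookkeeping the compiler emits might look like the C sketch below. The node layout, the hidden __auto_head slot and the helper names are hypothetical, not the actual CC64 output.
Code:
#include <stdlib.h>

/* Hypothetical header prepended to each 'auto new' allocation. */
struct auto_node {
    struct auto_node *next;
};

static void *auto_new(struct auto_node **head, size_t size)
{
    struct auto_node *n = malloc(sizeof(*n) + size);
    if (n == NULL)
        return NULL;
    n->next = *head;            /* newest allocation goes to the front...  */
    *head = n;                  /* ...so the list frees in reverse order   */
    return n + 1;               /* user data follows the header            */
}

static void auto_free_all(struct auto_node *head)
{
    while (head != NULL) {
        struct auto_node *next = head->next;
        free(head);
        head = next;
    }
}

int demo(void)
{
    struct auto_node *__auto_head = NULL;                 /* hidden list head in the return block */
    int *pd = auto_new(&__auto_head, 256 * sizeof(int));  /* int *pd = auto new int[256];         */
    int ok = (pd != NULL);
    if (ok)
        pd[0] = 42;
    auto_free_all(__auto_head);                           /* emitted at the function return point */
    return ok;
}
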

_________________
Robert Finch http://www.finitron.ca


Wed May 08, 2019 2:42 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
robfinch wrote:
Sprites are back! ....
Since there seems to be no issues with displaying 32 sprites, I figure I’ll try for 64 sprites. It would be nice to have enough sprites to render a game like space invaders or galaga using sprites without having to resort to tricks.

Your mention of "sprites" as well as Space Invaders and Galaga is interesting to me, because I would also aim to be able to run a basic version of Space Invaders on my 74xx CPU at some time. I refer to one of the original versions that just had 'aliens' moving left and right without hardware sprites, so for now I have no plans for sprites. However, your comment above makes me wonder whether hardware sprites would be difficult to implement. I only know that once you have them in hardware, you can run very fast games with comparatively much lower speed demands on the software that actually runs in the main processor. Sprites are an interesting addition for an 80's-style computer. I wonder if you can offer some pointers or links on how to implement them in hardware, or on how you did it (even if it's in an FPGA)?

Thanks
Joan


Wed May 08, 2019 6:50 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
robfinch wrote:
If the ‘new‘ keyword is preceded by the ‘auto’ keyword then the memory allocated is ‘automatically’ freed at the function return point.

Would it work just as well if the compiler allocated onto the stack at this point?


Wed May 08, 2019 8:21 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
robfinch wrote:
Inventing more compiler goodies; today the ‘auto new’ construct. If the ‘new‘ keyword is preceded by the ‘auto’ keyword then the memory allocated is ‘automatically’ freed at the function return point.

Some earlier higher-level languages such as Java implemented what is known as a "garbage collector", but I think that's not really what you mean to do here. I am aware of at least one way for a language to deal with memory management that is more efficient than garbage collection, and it's equally forgiving for the programmer.

I can describe the case of the latest specification of the Objective-C language, and the newer Apple Swift language. They implement what is known as "Automatic Reference Counting". Manual reference counting is nothing new and has been used by programmers for decades, but its 'automated' form is, surprisingly, a very late addition to compilers. If I recall correctly, automatic reference counting (ARC) was introduced for the first time in Objective-C in 2011. Broadly, it works like this:

- The compiler only allocates (or creates) objects on the 'heap', never on the stack. The stack is only used to store intermediate scalar values and object references, or to pass them among functions. This means that all objects are passed by reference, never by value; the latter is explicitly forbidden by the language.

- All objects implicitly inherit a reference count word, along with a runtime type information field (also a word). The reference count (RC) is physically the second word in the object's memory, and this is always the case.

-New objects can only be created through two very well defined mechanisms and not by other means.

* The first mechanism is the 'alloc' keyword. This creates the object in memory and sets its RC to 1. The object is uninitialised at this point and can be initialised any time later by calling its 'init' method. This is useful, for example, to create uninitialised objects that might be passed to a function for initialisation, or to create uninitialised graphs of data.

* The second mechanism is the 'new' keyword. This implicitly allocates and initialises the object in a single step. There are more explicit ways to allocate and initialise objects which involve more programmer responsibility, but I will leave them out of my description for clarity.

- Objects can be assigned or copied.

* Assigning an object involves copying its reference but not its contents. This is conceptually like assigning a C pointer to a variable, but there's a crucial difference: the assigned object gets its RC incremented by 1. Incrementing the RC is referred to as retaining. The assignee variable receives the pointer of the assigned object, so it loses the reference to the object it previously pointed to; that object therefore gets its RC decremented by 1 just before the pointer assignment actually takes place. Decrementing the RC is referred to as releasing an object. In the language, this is just a regular C-style assignment operator:
Code:
dstObj = srcObj;
It implicitly involves retaining srcObj and releasing the object previously referenced by dstObj.

* Copying an object involves creating a new object that is a copy of the original. The new object is physically a different one, so it gets its own RC; when you copy an object, the original keeps its RC intact. The copied object is then assigned to a destination variable and the assignment semantics described above are applied to the assignee. In the language that's like writing the following (the syntax is slightly different because you can implement hooks on the 'copy' behaviour, but you get the idea):
Code:
dstObj = copy srcObj;


- Every time an object is released and its RC reaches 0 (zero), the object is automatically deallocated (deleted from memory). In assignments this happens when the object previously held by the assignee variable had an RC of 1 before the assignment took place. It also happens when variables holding objects with an RC of 1 go out of scope: for example, if you create an object at the beginning of a function and do not transfer it outside the scope of the function, the object will be automatically deleted sometime before the end of the function, possibly at the point where the variable gets its last use.

- Passing objects to functions, returning objects from functions, putting objects into collections (such as arrays) and so on are all synonyms of assignment. The compiler treats them exactly like assignments and applies the retain/release mechanism as appropriate.

That's basically it. Just by having the compiler adhere to the above rules and the language discourage non-standard ways of dealing with objects (I say 'discourage' because there are always workarounds for cases where you really want more control), memory management becomes totally simplified and transparent to programmers.
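As a rough illustration of what the compiler inserts around an assignment, here is a minimal C sketch of manual reference counting; the object layout and the retain/release helpers are hypothetical, ARC simply generates the equivalent calls automatically.
Code:
#include <stdlib.h>

/* Hypothetical object header: a type-info word followed by the RC word. */
typedef struct Obj {
    void *type_info;
    long  rc;
    /* payload follows */
} Obj;

static Obj *retain(Obj *o)
{
    if (o) o->rc++;
    return o;
}

static void release(Obj *o)
{
    if (o && --o->rc == 0)
        free(o);                /* RC reached zero: deallocate */
}

/* What 'dstObj = srcObj;' conceptually expands to. The new object is
   retained before the old one is released so self-assignment is safe. */
static void assign(Obj **dst, Obj *src)
{
    retain(src);
    release(*dst);
    *dst = src;
}
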

The compiler furthermore optimises retain/release code by removing retain/release pairs and folding retains/releases together wherever it is safe to do so, resulting in even more efficient code. For example, if you use a temporary variable to hold an object, there is a very good chance that no retain/release code will be emitted for that object at all, so the generated code is identical to a simple C pointer assignment, which may compile to just one or two machine instructions. The compiler will even remove 'copy' code and replace it with simple retain/release code if the source object was declared immutable and it determines that the destination object will end its life unmodified.

It's an overwhelmingly simple, fast and efficient way to deal with memory, and yet it was not implemented in a mainstream compiler until 2011, which is quite surprising to me.

Joan


Wed May 08, 2019 11:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Would it work just as well if the compiler allocated onto the stack at this point?
It would work just as well, or maybe better. The thought had crossed my mind, since the 'auto' keyword is associated with stack variables in C. However, I was worried about allocating large amounts of memory on the stack. How the memory is allocated is actually hidden from the programmer. It would be simpler just to allocate on the stack: just subtract from the SP and return the pointer.
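For the stack variant, 'auto new int[256]' would reduce to something like an alloca-style allocation; a sketch, not the actual CC64 lowering:
Code:
#include <alloca.h>   /* or __builtin_alloca with GCC/Clang */

void demo(void)
{
    /* Equivalent of 'int *pd = auto new int[256];' when lowered to the stack:
       bump the stack pointer by the array size and hand back the pointer.
       The storage disappears automatically when the function returns. */
    int *pd = alloca(256 * sizeof(int));
    pd[0] = 42;
}
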

@joanlluch As you say, reference counting is a well-known and good means of managing objects.
I got the garbage-collected-language bug a few months ago. I think it's also a good way of handling memory management, and I'd like to build it right into the OS if possible. I wonder if there are any languages that age objects in addition to reference counting; TLB entries are aged automatically by the FT64 TLB unit.

The input to the L2 icache was switched from a 64-bit bus to a 304-bit bus, the same as the L1 icache. This allows loading the entire line at once; the line is loaded from a buffer that is itself filled via a 64-bit bus. Well, that was an exercise. It took over an hour to place-and-route, then didn't meet timing and didn't run in the FPGA. I guess having multiple busses hundreds of bits wide is a bit much, so I switched it back to 64-bit. One nasty complication of the L2 icache is that it indicates a hit too soon while it is being updated. This occurs because the line number to use in the cache has to be determined from the tag part before the line can be updated. That means the tag gets updated before the line does, and consequently it matches the address and returns a hit. So subsequent processing can't be driven by the L2 hit signal; another signal must be used. Fortunately, there is one available: the signal used to advance the random way selection, L2_nxt.

Ran into this compiler bug / issue. The following code, excerpted from strncpy.c:
Code:
       *s++ = *s2++;

compiles to:
Code:
             add         $r11,$r11,#2
            add         $r13,$r13,#2
            lc          $v0,[$r13]
            sc          $v0,[$r11]

Note that the increments are being done before the assignment. Most compilers would compile this the other way around and perform the increments after the assignment.
The issue is that the postfix '++' operator binds more tightly than the assignment operator, so it gets evaluated first; it even gets evaluated before the '*'. That makes it a bit of a trick to get things placed into the expected order: the compiler has to "remember" the postfix operator while it's evaluating expressions, then apply it at the end of expression evaluation. Fixed up the compiler to do this.
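In C terms the difference between the two orderings looks like this; a sketch of the semantics, not actual compiler output (the half-word element type matches the lc/sc instructions above):
Code:
#include <stdio.h>

/* Buggy ordering: the pointers advance before the copy, so src[1] lands in
   dst[1] and dst[0] is never written. */
static void copy_wrong(short *dst, short *src)
{
    dst++; src++;
    *dst = *src;
}

/* Correct ordering for '*dst++ = *src++;': copy through the original
   pointers, then advance both. */
static void copy_right(short *dst, short *src)
{
    *dst = *src;
    dst++; src++;
}

int main(void)
{
    short src[2] = { 11, 22 }, a[2] = { 0, 0 }, b[2] = { 0, 0 };
    copy_wrong(a, src);   /* a == { 0, 22 } */
    copy_right(b, src);   /* b == { 11, 0 } */
    printf("%d %d / %d %d\n", a[0], a[1], b[0], b[1]);
    return 0;
}
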

Ran into this issue with the keyboard not working. I had defined references to the lookup tables like the following:
Code:
 extern byte keybdExtendedCodes[];
extern byte keybdControlCodes[];
extern byte shiftedScanCodes[];
extern byte unshiftedScanCodes[];

This worked fine before I 'fixed' array referencing.
Now the compiler says 'aha, a zero-size array, I don't need an index into it' and only spits out code to reference the first byte. I had to change the code to look like the following:
Code:
 extern byte keybdExtendedCodes[64];
extern byte keybdControlCodes[64];
extern byte shiftedScanCodes[128];
extern byte unshiftedScanCodes[128];

I don’t know whether to leave it as a compiler feature or not.

_________________
Robert Finch http://www.finitron.ca


Thu May 09, 2019 2:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some pointers related to sprites.

One sprite trick I picked up by studying others' code is that it's only necessary to test for the top-left corner of the sprite during display. Sprite bitmap shifting starts when the left edge is encountered on-screen, but there's no need to detect the right edge: simply keep shifting, and once the buffer is empty nothing will be displayed anyway. In a similar manner, only the top edge of the sprite needs to be detected; keep a count of the number of display bytes, and when the count is exhausted the sprite display is finished. In the first sprite controller I made I had detection circuits for all four sides of the sprite, and it consumed more logic resources.
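A software model of that left-edge-only approach might look like the C sketch below. A 32-pixel, 2-bit-per-pixel sprite row is assumed to match the 64-bit scan buffer mentioned next; the names are made up.
Code:
#include <stdint.h>

/* Per-scanline sprite shifter model: only the left edge is detected. Once
   shifting starts it simply runs until the 64-bit scan buffer is empty. */
typedef struct {
    int      hpos;      /* horizontal start position of the sprite   */
    uint64_t shift;     /* scan buffer: 32 pixels x 2 bits per pixel */
    int      active;    /* set once the left edge has been reached   */
} SpriteRow;

/* Called once per pixel clock; returns a 2-bit colour index, 0 = transparent. */
static unsigned sprite_pixel(SpriteRow *s, int screen_x)
{
    unsigned px = 0;

    if (screen_x == s->hpos)            /* left edge: start shifting          */
        s->active = 1;

    if (s->active && s->shift != 0) {   /* empty buffer => nothing to display */
        px = (unsigned)(s->shift >> 62);    /* leftmost pixel first           */
        s->shift <<= 2;
    }
    return px;
}
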
Sprites can be implemented quite nicely using dual-ported block RAMs and treating them like mini raster displays. But block RAM resources are probably limited, so it might be more desirable to use main memory for the sprite images. Each sprite really only requires a scan buffer for a single scanline of the sprite, and the buffer can be loaded with image data during the horizontal retrace of the display device. I would recommend making the sprite a fixed width; it'll make the circuitry simpler.
The size of the sprite should be chosen to have an appreciable area on the display. It's pretty pointless to use 8x8 sprites on a 2048x1600 display; the sprite would be too small.
One trick used in the sprite generator is that it requires only a single bus access to read all the pixels for a scan-line. This is done by choosing a sprite width of 32 pixels and using a 64-bit bus.
I would also limit the number of colors for sprites. Many sprites do not need full 24-bit color capability, and it's always possible to stack sprites on-screen to get more apparent colors. It's probably better to have more sprites than more colors; it's wasteful to provide hardware capacity for full-color sprites. The caret, which is usually implemented with a sprite, is only two colors, and that's a common case.
Group all the registers pertaining to each sprite together; it'll make the software a little easier to manage, since sprites can then be treated as an array of objects. Use byte, 16-bit, 32-bit and 64-bit values in the sprite registers. For instance, if the X coordinate fits into 10 bits, use a 16-bit value. A structure variable can then be defined and the sprite info referenced using standard means. Updating a whole byte is much faster than updating a bit-field on most architectures; otherwise a lot of macros and subroutines are needed to access the sprite data. This isn't 1980, where the address space is extremely limited: wasting a few bits can substantially improve performance.
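For example, a sprite register block could be exposed to software as an array of structures along these lines; the field names, widths and base address are hypothetical, not the actual FT64 sprite controller:
Code:
#include <stdint.h>

/* Hypothetical per-sprite register block, padded so the blocks can be
   addressed as a simple array. */
typedef struct {
    uint16_t x;             /* horizontal position (only 10 bits used)  */
    uint16_t y;             /* vertical position                        */
    uint16_t height;        /* number of lines to display               */
    uint8_t  zorder;        /* display priority                         */
    uint8_t  enable;        /* 1 = sprite visible                       */
    uint32_t color[4];      /* small palette: 2 bits/pixel -> 4 colors  */
    uint64_t image_addr;    /* sprite image address in main memory      */
} SpriteRegs;

#define SPRITE_BASE  ((volatile SpriteRegs *)0xFFD80000)  /* made-up address */

static void move_sprite(int n, int x, int y)
{
    SPRITE_BASE[n].x = (uint16_t)x;   /* whole-register writes, no bit-fields */
    SPRITE_BASE[n].y = (uint16_t)y;
}
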

_________________
Robert Finch http://www.finitron.ca


Thu May 09, 2019 3:29 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rob,
Thank you for your recommendations on sprites. I hoped for some more down-to-the-metal pointers. Although I have shown a relatively advanced understanding of software and low-level programming, I don't currently grasp hardware concepts that well. Wordings such as "sprite bitmap shifting", "dual ported block-rams", "mini raster displays", "scan buffer", "single bus access", etc., currently make little sense to me until I'm able to figure out the basics. That's why I would appreciate some links or starting points on the very basics of it. I suspect I may first need to understand how VGA works and ways to implement the required signals in hardware; maybe then I will be able to move on to adding sprites. Thank you very much for any help, and I appreciate your patience. Just recall that what I am asking for is the very basics.
Joan


Thu May 09, 2019 6:34 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
In that case I would start with something really simple, for instance getting a counter to work; counters are the basis of many circuits. I'd suggest getting a solderless breadboard, some hookup wire, a counter chip, resistors and LEDs to experiment with: a switch hooked up to the clock input of the counter, and some LEDs hooked up via resistors to the outputs. Experimenting with sync generators might be a next step; just getting a timed pulse to come out correctly requires gating the output of a counter. A logic probe can be a handy tool to have.
There is a project on 6502.org that has a sprite circuit made up of 74xx logic.
It might also be worthwhile to look at block diagrams for the TMS9918 or 6567 chips. The TMS99x8 is about the easiest chip to use to get sprites going in a system.
I do most of my work on FPGAs because otherwise it would consume many breadboards of circuits; SSI takes a lot of chips to build a large system.

_________________
Robert Finch http://www.finitron.ca


Thu May 09, 2019 8:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
It took me a while to figure out how to handle the following expression:
Code:
   sign = *sc == '-' || *sc == '+' ? *sc++ : '+';

The compiler was incorrectly assigning a register variable to hold the value *sc. The issue is that sc gets updated by the post-increment operation, which means that *sc isn't a constant value; it depends on sc. The fix was in the optimizer, which wasn't voiding nodes properly. Anyway, fixing this cost about 20% in code size: all kinds of additional loads and stores are required because things aren't in registers now. But the plus side is that maybe the generated code will actually work.
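A small C illustration of why *sc cannot be held in a register across the post-increment; the surrounding atoi-style context is made up:
Code:
#include <stdio.h>

int main(void)
{
    const char *sc = "-42";

    /* The expression from above: pick up the sign and step past it. */
    char sign = (*sc == '-' || *sc == '+') ? *sc++ : '+';

    /* If the old value of *sc ('-') had been kept in a register, a later use
       would be stale; *sc must be re-loaded because sc has moved on. */
    char first_digit = *sc;    /* '4', not '-' */

    printf("%c %c\n", sign, first_digit);
    return 0;
}
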

_________________
Robert Finch http://www.finitron.ca


Fri May 10, 2019 8:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Well, I took another stab at it and figured out how to get back the nice small code size while still having working code. It isn't pretty, but it seems to work.
The constant-register optimization was changed to evaluate how many times the constant would be used in place of a register. If the constant value in a register is used a large number of times (more than three), it remains in the register, because for many constants this results in a shorter instruction, saving code space.

The system is working much better now with the software bug fixes. It comes up to the monitor program, but none of the monitor commands are working properly yet. I traced the memory dump command failure to a bad cache load. For some reason the PC goes to $7FFE0000 instead of holding its current value. The only thing I can think of is that this is a bad exception handler address, so I put in code to initialize all four vectors. I had only bothered with the level-zero handler, since that's the level the core is supposed to be running at.

_________________
Robert Finch http://www.finitron.ca


Sat May 11, 2019 2:54 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Found out that one of the register variables was never used. This only affected larger routines. There was a ‘<’ comparison that should have been ‘<=’ to allow the last register variable to be put to use.

Added precision specifications for types. It works like the following:
Code:
int main(int x)
{
   float:16     flt16;      // half precision
   float:32     flt32;      // single
   float:40     flt40;      // basic uses this
   float:64     flt64;      // double
   float:80     flt80;      // extended double
   float:96     flt96;      // triple
   float:128   flt128;      // quad
   float flt;               // assumes double precision
   int:8       aint8;
   int:16      aint16;
   int:32      aint32;
   int:64      aint64;
   int:128     aint128;
   int         aint;        // assumes 64-bit
   char:8      achar8;
   char:16     achar16;
   char        achar;       // assumes 16-bit
   return (x+flt);
}

The base type is followed by a colon and an integer expression which must evaluate to a compile-time constant.
The plan is to drop all the extra types (double, __int8, __int16, __int32, __int64, wchar_t, short, long) in favor of using a precision spec. Bit-fields are still supported. The extra types can easily be defined as macros in a compatibility header file.

The compiler was modified to support the ptrdif instruction, which returns the difference between two pointers.
Load effective address (LEA) was given its own opcode rather than using an add instruction. The difference is that the LEA instruction sets the upper bits of the target register to indicate that a pointer is present.
The keyword 'nullptr' was added. 'nullptr' can be tested against using one of the equality operators (== or !=), and the test returns true if the pointer is a null pointer. This is not the same as testing for zero, because a null pointer could be either of two values: zero or $FFF0100000000000.
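In use it would look something like the sketch below, written in the FT64 C dialect described above; the function is made up, and the comment quotes the two null-pointer encodings from the post.
Code:
void release_buffer(char *p)
{
    /* True for either encoding of a null pointer: zero or $FFF0100000000000.
       A plain 'if (p)' would only catch the all-zero form. */
    if (p != nullptr) {
        /* ... free or otherwise use the buffer ... */
    }
}
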

_________________
Robert Finch http://www.finitron.ca


Sun May 12, 2019 2:17 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
robfinch wrote:
Found out that one of the register variables was never used. This only affected larger routines. There was a ‘<’ comparison that should have been ‘<=’ to allow the last register variable to be put to use.

Quite a difficult class of problems to find, where the compiler produces correct but suboptimal code.


Sun May 12, 2019 5:43 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Quite a difficult class of problems to find, where the compiler produces correct but suboptimal code.
Found it by reading through the generated code. I have a setup now where a bunch of code (the C standard library plus BIOS routines) is generated into a single large file (66,000 lines). I then do file compares against previous versions of the generated file to see if the compiler changes worked out. I'm using a visual compare tool that highlights differences.

Looking at the compiler, I finally realized there were two different representations of the same thing: the specification of addressing modes. There was an enumeration for this, am_*, and a set of defined constants, F_*. I merged the two representations into the am_* enumeration. This makes it possible to use the addressing-mode class associated with an instruction operand as a mask when generating instructions.
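Presumably the unified representation gives each addressing mode its own bit, so a single enum value works both as a mode tag and, OR'd together, as a legal-modes mask. A sketch with made-up mode names, not the actual CC64 ones:
Code:
/* One bit per addressing mode: a value describes one operand's mode, and a
   bitwise OR of them describes which modes an instruction accepts. */
enum e_am {
    am_reg    = 1 << 0,   /* register          */
    am_imm    = 1 << 1,   /* immediate         */
    am_ind    = 1 << 2,   /* register indirect */
    am_indx   = 1 << 3,   /* indexed           */
    am_direct = 1 << 4    /* absolute address  */
};

/* Does this operand's mode appear in the instruction's allowed-mode mask? */
static int mode_ok(enum e_am operand_mode, unsigned allowed_mask)
{
    return (operand_mode & allowed_mask) != 0;
}
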

Finally threw the expression-parsing routines into their own class and made most of them private member functions. At the same time I got rid of a routine called binaryops(), which used function pointers in a fancy fashion to implement several of the binary operators; it has been a pita to trace through during debugging in the past. The binary ops were simply broken out into separate functions. It's more code, but easier to work with.

A class called OperandFactory was created for producing operand objects, as a place to hold a bunch of Make*() functions. Anytime there are a bunch of functions called Make*(), it's a good bet they could be produced by a factory class.

_________________
Robert Finch http://www.finitron.ca


Mon May 13, 2019 3:05 am