 Thor Core / FT64 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
This is an interesting point to have reached. Whereas, for a custom silicon chip, every square millimeter counts, and every decision needs to take into account area cost, on an FPGA everything is more coarsely quantised. You can fill up the FPGA with no incremental cost, then possibly you justify moving up to the next one, and the process repeats. Until eventually you've filled up the largest FPGA you can justify. It's at that point that you can look back at decisions which added logic, and try to decide whether or not they earn their keep.


Wed Sep 05, 2018 7:33 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Just possibly you'll be interested in this write up of a new low-latency garbage collector for Java, because it talks about 'pointer colouring' - using the top bits of a 64 bit pointer to assist the GC by distinguishing different kinds (or ages?) of pointer.
https://www.opsian.com/blog/javas-new-z ... -exciting/


Wed Sep 05, 2018 6:40 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I coded a data cache and modified the alu to generate the 9/8 addressing required for tagged memory. After spending several hours, I came to realize tagged memory would also impact instruction access. That’s because loads and stores can be used to access instruction memory. There’s no reason to use tags in instruction memory and it’s difficult to have them present in the memory system. For instance, if tags are present then every ninth word of memory must be “skipped over” by the assembler and filled in with zeros. The skipped word could even fall in the middle of an instruction. That’s nuts to do. It would also cost one extra word for every eight, a 12.5% overhead.
So, tagged memory for instruction access should be disabled. Whether tagged addressing applies must be determined before the 9/8 address is generated. One way to do this is with a separate set of load / store instructions for instruction memory. It would double the number of load and store operations and use a lot of opcode space, of which there isn’t any left. Another way to deal with a tagged memory system is to have a bit for each memory page that is tested by the address generator before a tag address calc. is applied. This is complicated to do, since addressing is calculated in the alu. The alu would have to look up tagged/untagged status for the address calculated to decide whether to output the 9/8 address or a normal address.
I’ve been thinking of ripping the address generation code out of the alu and putting it in a separate address generation functional unit.
My current solution is to have an enable bit for each memory page. This bit is looked up and managed through the alu.

Axed some of the multiply / divide / modulus immediate instructions. They just aren’t used often enough to justify using up a full opcode at the root level. Kept unsigned divide immediate because it’s used for pointer calculations. And kept unsigned multiply immediate because it’s used for array index calculations.

I will read the article, it sounds interesting. There are potentially a couple of bits available with the NaN value used as a pointer indicator. It might be able to handle a three or four bit value.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 06, 2018 3:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Maybe I should clarify 9/8 addressing. 9/8 addressing takes the virtual address, multiplies it by nine, then divides by eight to get the address fed to the mmu. This leaves holes after every eight words of memory. The holes are used to store tags for the preceding eight words of memory. Each tag is one byte in size and contains a pointer type indicator bit, a parity bit, a debug bit, and four bits for applications; the last bit is reserved. Eight words of memory is the cache line size, so all nine words (eight plus a word containing tags) are loaded during a cache fill. Reference to the tag memory is local to the cache line: a tag and its associated word are both on the same cache line. There are two special instructions for loading and storing the tags, which can reference the tag word that is otherwise inaccessible.
I have all of this “tentative” as it seems like a somewhat dubious approach.
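A minimal sketch of the 9/8 mapping described above, assuming the multiply and divide are applied to word indices (so that the hole falls after every eight words) and that tag byte i of the ninth word belongs to data word i of the same line. The function names are made up for illustration and are not taken from the core's source.

```python
WORDS_PER_LINE = 8  # data words per cache line, as stated in the post


def nine_eighths(word_index: int) -> int:
    """Map a load/store word index to the word index fed to the mmu."""
    return (word_index * 9) // 8


def tag_location(word_index: int):
    """Return (tag word index, tag byte within that word) for a data word.

    Assumption: the tag word is the ninth word of the line, and its byte i
    holds the tag for data word i of that line.
    """
    line = word_index // WORDS_PER_LINE
    tag_word = line * (WORDS_PER_LINE + 1) + WORDS_PER_LINE
    tag_byte = word_index % WORDS_PER_LINE
    return tag_word, tag_byte
```

With this mapping, data words 0 to 7 stay at 0 to 7, word 8 lands at 9, and index 8 is never produced by `nine_eighths`, which is where the first tag word would sit.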

_________________
Robert Finch http://www.finitron.ca


Thu Sep 06, 2018 3:51 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
In the context of a system with an MMU, there'd be page tables, possibly several levels, and page table entries would contain type bits, and to make the scheme practical, there'd be an on-chip TLB to hold some number of recently used mappings. In that case, the page tables needn't be contiguous in memory or ordered in any particular way...

... so in this context, could you consider consolidating tags for memory (perhaps not tags for every byte, but for every word, or every paragraph) into pages of tags, and hold a cache of that data?

I suppose it's about density, and granularity, and fragmentation. And locality.

Having said all that, using the cache line as the level to contain data and tags is quite neat.

I was reading about one of Microsoft's security tactics, whereby there are a couple of bits of data for each paragraph of code, which annotate whether there are callable addresses (function pointers) in that paragraph. I think the meanings for the four states were 'nothing here valid', 'everything here valid', 'first word only is valid', 'everything except first word is valid'. This is relatively dense, and is all packed into one big memory area (which will be sparsely populated in the usual way, page by page.) The safety mechanism just needs to know the start address of the big memory area.
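To make the density argument concrete, here is a rough sketch of a two-bits-per-paragraph bitmap with the four states as recalled above. The 16-byte paragraph size, the numeric encoding of the states, and all function names are assumptions for illustration; "first word" is approximated here as "the paragraph's first address".

```python
PARAGRAPH = 16  # bytes of code covered by one 2-bit entry (assumed size)

# Hypothetical encoding of the four states recalled in the post.
NONE_VALID, ALL_VALID, FIRST_ONLY, ALL_BUT_FIRST = range(4)


def set_state(bitmap: bytearray, addr: int, state: int) -> None:
    """Record the 2-bit state for the paragraph containing addr."""
    index = addr // PARAGRAPH
    shift = (index % 4) * 2  # four 2-bit entries per byte
    cleared = bitmap[index // 4] & ~(0b11 << shift) & 0xFF
    bitmap[index // 4] = cleared | (state << shift)


def get_state(bitmap: bytearray, addr: int) -> int:
    """Fetch the 2-bit state for the paragraph containing addr."""
    index = addr // PARAGRAPH
    return (bitmap[index // 4] >> ((index % 4) * 2)) & 0b11


def is_valid_target(bitmap: bytearray, addr: int) -> bool:
    """Decide whether addr may be used as an indirect call target."""
    state = get_state(bitmap, addr)
    at_start = addr % PARAGRAPH == 0
    if state == ALL_VALID:
        return True
    if state == FIRST_ONLY:
        return at_start
    if state == ALL_BUT_FIRST:
        return not at_start
    return False
```

One byte of bitmap covers 64 bytes of code in this sketch, and the checker only needs the base of the bitmap plus a shift and mask.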


Thu Sep 06, 2018 4:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Well, there are problems with a 9/8 addressing scheme. One is that an oddball number of words needs to be loaded into the cache on a miss: nine words rather than eight. This may not work well with the rest of the memory system.

Quote:
... so in this context, could you consider consolidating tags for memory (perhaps not tags for every byte, but for every word, or every paragraph) into pages of tags, and hold a cache of that data?
I thought about doing that and that’s the way I had it originally, but I figured it would be slow and larger resource-wise. Doing it that way there are two caches to maintain and it might double up on the number of cache loads required. If the data is stored in only a single cache, loading the data word might dump the tag from the cache, resulting in additional cache misses.
By having the tags and the words on the same cache line one doesn’t have to worry about one dumping the other from the cache. It also might be possible to load both the tag and data word using a single load instruction (LWAT) – load word and tag. (Another case where I’d like dual result busses).

Quote:
In the context of a system with an MMU, there'd be page tables, possibly several levels, and page table entries would contain type bits, and to make the scheme practical, there'd be an on-chip TLB to hold some number of recently used mappings. In that case, the page tables needn't be contiguous in memory or ordered in any particular way...

I would like the bit indicating use of 9/8 addressing to be in the same table entry as the other information about the memory page. That’s the logical place for it to go. But….
One thing I don’t like about the scheme is that there are two sets of virtual addresses. The virtual address generated by normal load and store instructions (LS address), and the virtual address after it is multiplied by 9/8. (The 9/8 address is still a virtual address). To determine if a 9/8 address is needed the load / store (LS address) needs to be looked at first. Right now, there is only a single set of virtual addresses (the 9/8 address) being fed to the mmu. The core outputs the 9/8 address to the mmu which is external. Things could be managed by having dual result busses out of the address generator providing both the 9/8 and LS addresses. Both sets of addresses could then be forwarded to the mmu. The problem is it’s doubling up on the number of address busses in the system. The quick and dirty solution which isn’t very appealing is to just put the bits in the alu. It means that there’s extra software required to manage the bits in concert with the bits in the mmu’s page entry. But I suppose it could be hidden in the call to set the page bits. The value passed to the set routine could have the extra bit included making it look like it was just present in the page table.
I’m planning on running the system with a really simple mmu. It acts a bit like a giant TLB with thousands of entries in it. Instead of caching just a subset of entries, all the entries are in the TLB. The page tables are directly in the mmu, not in external memory. It works like the old 6829 mmu. This does put limits on the amount of memory in a map. With 8kB pages a single map is limited to 8MB. Fortunately, 512kB pages are also possible, which allows for 512MB, which is all the memory the board has on it. (1024 pages per map, 64 maps).
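The map-size figures follow directly from 1024 pages per map. A small sketch of that arithmetic, plus a toy lookup through a fully resident table, is below. The table layout (a list of 64 maps, each a list of 1024 physical page numbers) and the function names are assumptions for illustration only.

```python
KB = 1024
MB = 1024 * KB

PAGES_PER_MAP = 1024  # from the post
NUM_MAPS = 64         # from the post


def map_reach(page_size: int) -> int:
    """Bytes addressable through one map at the given page size."""
    return PAGES_PER_MAP * page_size


def translate(table, map_no: int, vaddr: int, page_size: int) -> int:
    """Translate vaddr using a table that holds every entry on chip.

    table[map_no][page] is assumed to hold a physical page number.
    """
    page, offset = divmod(vaddr, page_size)
    if page >= PAGES_PER_MAP:
        raise ValueError("virtual address is outside the map")
    return table[map_no][page] * page_size + offset
```

Because every entry is resident, there is no miss path to model; the whole translation is one indexed read.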
I do have another mmu with TLB caching that might be possible to plug in, and the mmu walks paging tables similar to an x86 mmu. But it seems overly complex for a smaller system and would require more software to manage.
The test system isn’t well balanced. A smaller more frugal 32-bit cpu would make more sense.

Quote:
The safety mechanism just needs to know the start address of the big memory area.
Hmm. You’ve got me rethinking my approach. I wonder how they use that information. I wasn’t thinking about security. I’m assuming that security measures could be applied to the big memory area to protect it. I wonder if the tags require a higher level of security than the associated data. Or if it’s acceptable to use the security attributes for the data page for the tags as well.

I’ve started taking a serious look at the LLVM compiler. I ran into more problems with my own compiler. And it never would be as full featured as something like LLVM.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 07, 2018 3:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The 9/8 addressing requires a lookup table controlling when to apply the addressing. This lookup table is fairly large, meaning it has to be implemented with block-ram. Implementing the table adds two more clock cycles on top of the two cycles required to generate an address, meaning address generation would take four clocks. This is too slow for my liking, so I’ve decided to shelve the idea. It could still be done by pipelining the address calc. and having more address generators available.

Found a way to free up opcodes at the root level. The constant field of load and store instructions had effectively unused bits when the load or store was larger than a byte, because addresses must be aligned on a boundary corresponding to the size. So char, half, and word loads and stores use progressively more of the low order bits of the constant to indicate the load/store size. For a char there’s one bit available, meaning it can distinguish between a char and a half. With a half there is an additional bit available, meaning it can distinguish between a half and a word. The char, half, and word opcodes can therefore all share the same root-level opcode.
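One way such a size-in-the-low-bits scheme could work is sketched below. The exact bit patterns are an assumption (the post does not give them), as are the sizes of 2, 4, and 8 bytes for char, half, and word, and the function names.

```python
def encode(offset: int, size: int) -> int:
    """Fold the access size into the unused low bits of an aligned offset.

    Hypothetical patterns: ...0 = char (2 bytes), ...01 = half (4 bytes),
    ...11 = word (8 bytes).
    """
    if offset % size != 0:
        raise ValueError("offset must be aligned to the access size")
    if size == 2:
        return offset
    if size == 4:
        return offset | 0b01
    if size == 8:
        return offset | 0b11
    raise ValueError("byte accesses have no spare bits and keep their own opcode")


def decode(constant: int):
    """Recover (offset, size) from a constant produced by encode()."""
    if constant & 0b01 == 0:
        return constant, 2
    if constant & 0b10 == 0:
        return constant & ~0b11, 4
    return constant & ~0b111, 8
```

In this sketch a word access still has one spare low bit (bit 2), which is masked off on decode and could carry something else.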

Given a half-dozen more opcodes, I decided to expand the use of set instructions and get rid of an instruction format. I also eliminated the compare instructions from the instruction set as they are redundant.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 10, 2018 4:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’m about to undertake major changes to the core. Changes will include moving instruction decoding out to an explicit functional unit. Rather than decode the instruction when it’s loaded into the queue, the raw instruction is queued and marked for decoding. Then the instruction decoder acts like any other unit generating results. The result will be a decoded instruction which is then placed back in the instruction queue for further processing. By making the decoders a functional unit they can potentially be pipelined or disabled for lower power. Right now, decoding is placed between the fetch buffers and the instruction queue. This is expected to be single cycle to maintain the core’s performance. Removing this layer should help with the fmax. There are only about 30 decoded signals which would fit easily onto the 64-bit result bus.
Having rewritten parts of the core and moved the instruction decoding, the core is now significantly smaller. (About 80,000 LUTs instead of 102,000) which is great, more features possible.
Having the decoders as a functional unit also allows the instruction to be potentially sourced from a register instead of the instruction stream.
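A toy model of the idea, to show the flow of a raw instruction being queued, picked up by a decoder unit, and written back as a packed result. Everything specific here is hypothetical: the opcode values, the two decode signals, and the names are invented for illustration and do not reflect the actual FT64 encoding.

```python
from dataclasses import dataclass
from typing import Optional

MASK64 = (1 << 64) - 1  # decoded signals must fit the 64-bit result bus


@dataclass
class QueueEntry:
    raw: int                        # undecoded instruction bits
    decoded: Optional[int] = None   # packed decode signals, filled in later


def decode(raw: int) -> int:
    """Hypothetical decoder: pack a few made-up control signals into one word."""
    opcode = raw & 0x3F
    is_load = int(opcode == 0x20)   # invented opcode value
    is_store = int(opcode == 0x24)  # invented opcode value
    return (opcode | (is_load << 6) | (is_store << 7)) & MASK64


def decoder_unit_step(queue) -> bool:
    """Decode the oldest entry still marked for decoding.

    Returns True if an entry was handled, False if nothing was waiting.
    """
    for entry in queue:
        if entry.decoded is None:
            entry.decoded = decode(entry.raw)
            return True
    return False
```

Since `decode` only needs the raw bits, the same unit could be handed an instruction that came from a register rather than from the fetch stream.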

_________________
Robert Finch http://www.finitron.ca


Tue Sep 11, 2018 2:49 am

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
Out of curiosity, how long does it take to synthesize the core for the CPLD now? (I suppose part of what brings up the question is that I just finished a PCB design that took the computer several hours to generate the gerber files.)

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Tue Sep 11, 2018 5:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Out of curiosity, how long does it take to synthesize the core for the CPLD now? (I suppose part of what brings up the question is that I just finished a PCB design that took the computer several hours to generate the gerber files.)

It's taking about 15 minutes, I'm running it on an Acer Aspire GX-785. (i7-7700) (multiple cores 3.6 GHz). I just purchased it as refurbished earlier this year. (It was on sale too).

It sounds like it must be a complex PCB design. Does that time include auto-routing? I ran an autoroute on a small PCB and it took a few minutes. It starts to take much longer if it can't route. It took hardly any time at all to generate the gerber files for a routed PCB.
To me, if it's not a complicated board it sounds like there might be something wrong with the PC. Thermal shutdown? On my laptop the memory went bad, and it started taking really, really long times to boot up and do simple things. Replacing the memory fixed it.

What software are you using to generate the files ?

_________________
Robert Finch http://www.finitron.ca


Tue Sep 11, 2018 11:10 am

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
The PCB has close to 200 parts, with parts on both sides of a 4-layer board with a ground plane. I'd like to post a screenshot, but I probably shouldn't, keeping in mind the non-disclosure agreement I signed. I still do my PCBs in a DOS-based CAD though, and the DOS emulator is really slow compared to running actual DOS on a modern computer. In addition, I have several versions of the board in the same PCB file, i.e., on the same work area, and the time seems to be proportional to the square of the number of pads and trace nodes (including text for silkscreens, which is only for reference on the monitor, since the board is too dense to print anything on it). Even for versions that are outside the area marked for gerberizing, it seems to have to check everything anyway and discard the results that fall outside that area. I suppose the time to synthesize your HDL design is also more or less proportional to the square of the complexity. Is that a reasonable approximation?

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Tue Sep 11, 2018 7:25 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
I suppose the time to synthesize your HDL design is also more or less proportional to the square of the complexity too. Is that a reasonable approximation?
I don't know, but I'd guess that's to be expected. It's been a while since I read up on routing algorithms. There's routing happening for the CPLD, and it's probably remotely similar to PCB routing. I've noted sometimes, though, that making the design just a small bit more complex suddenly causes it to take much longer. I've gotten a message from the toolset that the design is too congested.

I spent part of the day testing the v5 core in simulation. It runs for about 1,000 instructions and dies when an address calc fails. The address calc fails because of bad register contents due to a shift operation happening on ALU#1 where shift isn’t supported. For some reason in sim the shift instruction is assigned to be executed on ALU#1. This is starting to look like a bit error during sim. I note also there's another calc that has the wrong result and I don't see how it could be in just that one case. Why would it work 999 times out of 1,000 is what I'm wondering.
Code:
   // Map the 2-bit field instr[7:6] to a 3-bit instruction length
   case(instr[7:6])
   2'b00:   bus[`IB_LN] <= 3'd4;
   2'b01:   bus[`IB_LN] <= 3'd6;
   default: bus[`IB_LN] <= 3'd2;   // 2'b10 and 2'b11
   endcase

The above case statement is returning a length of 2 when it shouldn't be. It's not that complex; all it is is a conversion from 2 to 3 bits. It's late, so it could be I'm looking at the wrong fields where it's placed.
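For checking simulator dumps by hand, the case statement above can be mirrored in a few lines of Python. This is only a model of the snippet as posted; the function name is made up.

```python
def instr_length(instr: int) -> int:
    """Mirror of the posted case statement: instr[7:6] selects the length."""
    field = (instr >> 6) & 0b11
    if field == 0b00:
        return 4
    if field == 0b01:
        return 6
    return 2  # default branch: 2'b10 and 2'b11
```

Feeding it the raw instruction bits from the dump shows quickly whether a reported length of 2 is consistent with bits 7:6 of that instruction.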

_________________
Robert Finch http://www.finitron.ca


Wed Sep 12, 2018 3:35 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The instruction length thing was working fine. I didn’t notice the instruction was an expanded compressed instruction. The length shows the compressed length, but my other screen dump dumps the decompressed instruction.
There appears to be a problem with functions in simulation. Either that or I’m misunderstanding something. I dumped a var fed into a function and then I dumped the same var from inside the function, and I got two different results. The var inside the function looked like random data.
On a hunch that sim might be getting confused, I tried disabling one of the instruction decoders and then it worked. The one function, expand(), that was causing me problems I quickly rewrote as a module, and then the cpu worked.
There’s some room now at the root level of opcode for more instructions (10). I was thinking of giving indexed memory operations their own root level opcode to simplify decoding. One possible use for opcodes is unaligned loads and stores, left/right loads and stores like what’s on the MIPS might be useful. That would use up four opcodes.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 13, 2018 7:31 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I played around today with the core in simulation. The simulator just wasn’t co-operating. There seems to be an issue with the order signals are processed and sometimes this causes bits in the instruction queue to “stick”. The core runs for about 20 instructions now in sim. Yesterday I had it running for hundreds of instructions, but I’ve modified the rtl code since. It appears to be the sim of the core that’s broken, otherwise I think the core basically works.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 14, 2018 4:25 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Finally got sim to run a significant number of instructions, >1000 (till I got bored reading the output). Instruction decoder outputs were “sticky”; I switched them to single pulses and voila, the core works. I’m setting up the decoders so that they can be switched off to reduce power consumption. (It won’t be possible to switch off all the decoders at the same time :) I’ll also do this for the fpu and some of the memory queues through the magic of gated clock buffers. I need to add a power management register to the design. Sometimes we care more about power consumption than performance.
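A sketch of how a write to such a power management register might enforce the "not all decoders off" rule. The register layout (three decoder enable bits followed by one fpu enable bit), the decoder count, and the fallback of keeping decoder 0 running are all assumptions made up for this example.

```python
NUM_DECODERS = 3                        # assumed decoder count
DECODER_MASK = (1 << NUM_DECODERS) - 1  # bits 0..2: decoder enables
FPU_BIT = 1 << NUM_DECODERS             # bit 3: fpu enable
VALID_MASK = DECODER_MASK | FPU_BIT


def write_power_reg(requested: int) -> int:
    """Return the value actually stored for a requested register write.

    Undefined bits are dropped, and if the request would switch off every
    decoder, decoder 0 is kept enabled so the core can still make progress.
    """
    value = requested & VALID_MASK
    if value & DECODER_MASK == 0:
        value |= 1
    return value
```

In hardware the stored bits would drive the enables of the gated clock buffers; here they are just returned.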

_________________
Robert Finch http://www.finitron.ca


Sat Sep 15, 2018 3:11 am