View unanswered posts | View active topics It is currently Thu Mar 28, 2024 9:47 am



Reply to topic  [ 79 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next
 TOYF (Toy Forth) processor 
Author Message

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
Hugh Aguilar wrote:
I have a new version of the TOYF that is about twice the efficiency of the previous version attached to the previous post.
The NEXT code is one clock cycle faster. The EXIT code is faster. Quotation calls are now the same speed as function word calls (this required the introduction of a new 5-bit register). I have 32/16 division now, and multiplication is faster. Linked lists are faster and cleaner as EX is used as the current node pointer. Copying blocks of data is faster. I now have logic instructions (and, ior, xor) that work with the CF and AX.
There are many improvements throughout. Everything is faster! :D

Despite doubling the efficiency, I have reduced the complexity. I now have two group-A and two group-B instructions undefined that could be used for application-specific purposes. Also, group-M no longer needs an ALU, which should reduce the FPGA resource-usage somewhat. Group-A needs a 32-bit ALU and group-B needs a 16-bit ALU.

Well, I attached the new document for anybody who is interested.

So far, there has been a lot of confusion about what the TOYF is. I have been told that the group-A, group-B and group-M instructions within an opcode have to be executed sequentially. No! The idea is that they execute in parallel.

It makes no sense that the three groups execute sequentially. What order would the three groups execute in?

I think the VLIW idea is viable. Testra came out with their MiniForth in 1995 in which each opcode contained 5 groups, all of which executed in parallel. This is 20th century technology --- it should be possible to do this today --- mostly the difficulty is in writing the assembler that generates out-of-order machine-code, but I wrote MFX for the MiniForth in 1994, so I know how to do this (I described the algorithm clearly in the document to prove that I know what I'm talking about).


Attachments:
File comment: About twice as efficient as what I posted previously, and less complicated. I also have support for 1-bit data, which is common in I/O ports.
toyf.txt [67.32 KiB]
Downloaded 355 times
Thu Apr 05, 2018 6:41 pm
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
Here is a bug fix in the group-M instructions related to NXT.


Attachments:
File comment: bug fix
toyf.txt [67.28 KiB]
Downloaded 341 times
Tue May 29, 2018 12:40 am
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
I have an upgrade on the TOYF.

I now have interrupts, but somewhat crude. They only occur at places where there is no state to save and restore, so they are fast going in and out. They have a variable amount of interrupt latency though, which may be a lot, so they are only useful for non-critical timed IRQs --- I primarily provided them to support a UART --- this can be done in the background without cluttering up the main-program.

I boosted the speed of some common primitives, such as the logic operators (of flags on the data-stack). Of course, micro-controllers will mostly use the 1-bit variables for logic, which aren't affected by my upgrade. Still though, flags on the data-stack are common in all programs, so the upgrade should help.
I boosted the speed of FETCH too. It would seem like such a simple primitive as FETCH couldn't be improved, but I did find a way to shave off one clock-cycle.
For the most part, boosting the speed is all about finding ways to do things in parallel that had previously been done in sequence.

I fixed some bugs. Some of my code was using instructions that I no longer have --- I rewrote these to only use instructions that I do actually have.
I still haven't written my assembler and simulator. I might find more bugs then.


Attachments:
File comment: Upgrade to include interrupts.
Also, boost the speed of FETCH and logic operators.

toyf.txt [74.86 KiB]
Downloaded 313 times
Thu Oct 25, 2018 7:12 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Is this still at the stage of a paper design? I'd like to know more about how you know you've shaved a cycle off FETCH - you must have some idea, somewhere, of how many cycles each instruction takes. Perhaps you even have some kind of microarchitecture, so you know what blocks there are, how they communicate, and where the clock boundaries are?


Thu Oct 25, 2018 9:31 am
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
BigEd wrote:
Is this still at the stage of a paper design? I'd like to know more about how you know you've shaved a cycle off FETCH - you must have some idea, somewhere, of how many cycles each instruction takes. Perhaps you even have some kind of microarchitecture, so you know what blocks there are, how they communicate, and where the clock boundaries are?

Yes, still a paper design.

I do have "some idea, somewhere, of how many cycles each instruction takes" --- each instruction takes one clock-cycle. :P
FETCH is not an instruction --- it is a primitive --- all the primitives are written in assembly-language, so you have to count opcodes to determine how many clock cycles each primitive takes.
The TOYF is a VLIW, so we can have up to 3 instructions packed into each opcode, and each opcode takes one clock cycle. The rules for how instructions are packed into opcodes are fairly simple (described in the document), so it easy to hand-assemble source-code for a primitive to figure out how many opcodes it will need.

I don't know enough about HDL to know what a "block" is. I do know however that there is no "communication" because there are no multi-cycle instructions --- everything happens in one swoop, so there is no need for communication from one clock cycle to the next.
What I have in the document is the "micro-acrchitecture" --- this is essentially micro-code --- what would normally be an instruction in a "Forth engine" (RTX2000, B16, BUGS18, etc.) is a primitive in the TOYF and is written in assembly-language.

1 edit: added the last paragraph


Last edited by Hugh Aguilar on Thu Oct 25, 2018 5:50 pm, edited 1 time in total.



Thu Oct 25, 2018 3:31 pm
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
BigEd wrote:
I'd like to know more about how you know you've shaved a cycle off FETCH

Well, my upgrade was that when NXP (used in the NEXT code) executes a primitive (after determining that the xt isn't a colon word), it now does some prep work for the primitive.
Specifically, it sets LX= DX that is useful to FETCH, and it sets MA=BX that is useful to the logic operators. Setting these registers isn't useful to the majority of primitives, but it doesn't cause any harm because those registers are assumed to be free.

This was how FETCH was written previously:
Code:
FETCH:              ; adr -- val
    mov BX,MA
    ldr LX
    mov LX,BX
    nxt

I got rid of the MOV BX,MA instruction however (I needed room in group-M for the new XOR 1,IF instruction that is used for toggling the IRQ mask, so another instruction had to be discarded to make room). So, the above code would no longer be valid. This would be needed:
Code:
FETCH:              ; adr -- val
    mov BX,AX
    mov AX,MA
    ldr LX
    mov LX,BX
    nxt

This is pretty inefficient! None of the instructions pack together because all of them are dependent upon the previous instruction! In a VLIW, failure to pack more than one instruction into an opcode is the :evil: that must be avoided, especially in commonly used primitives.

At this time I came up with the idea of having NXP prep FETCH. It can move BX into MA overcoming the lack of a MOV BX,MA instruction. Now I was very :D because I had beat the :evil:.
This is how FETCH is written now:
Code:
FETCH:              ; adr -- val    ; expects MA=BX
    ldr LX
    mov LX,BX
    nxt

FETCH depends on MA getting set by NXP to the needed value.

We still have none of the instructions packing together (except for MOV LX,BX that packs with NXT), but there are only 2 instructions, so speed has been increased a lot.

It might seem as if upgrading NXP just to support FETCH is a waste, but FETCH is a very commonly used Forth primitive so we want it to be fast. I was puzzled as to which group-M instruction I could discard to make room for the new XOR 1,IF instruction. All of them seemed to be necessary. Then it hit me that I could get rid of MOV BX,MA and instead do that inside of NXP --- there is no case (at this time) in which MOV BX,MA is used anywhere except at the beginning of a primitive, so having NXP do it is not only faster but is all that is really needed. If I ever need MOV BX,MA inside of a primitive after MA has already been modified, I can use AX as an intermediate value because I do have MOV BX,AX and MOV AX,MA available, although this takes 2 opcodes (because they don't pack). If AX is in use for something else, then I guess I'm in trouble --- I'll worry about that problem when it arises --- it may never arise.

Designing the TOYF is like solving a puzzle. In many puzzles, every time that you make a move, you block yourself from being able to make other moves. Like here, I wanted to add the XOR 1,IF instruction, so I had to discard another instruction --- it would seem as if discarding an instruction would necessarily be a setback (all of the instructions must be useful or I wouldn't have put them in originally) --- instead though, I was able to discard the MOV BX,MA instruction and also shave one clock cycle off of FETCH that was relying on MOV BX,MA.


Thu Oct 25, 2018 4:42 pm
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
Hugh Aguilar wrote:
At this time I came up with the idea of having NXP prep FETCH. It can move BX into MA overcoming the lack of a MOV BX,MA instruction. Now I was very :D because I had beat the :evil:.
This is how FETCH is written now:
Code:
FETCH:              ; adr -- val    ; expects MA=BX
    ldr LX
    mov LX,BX
    nxt

FETCH depends on MA getting set by NXP to the needed value.

We still have none of the instructions packing together (except for MOV LX,BX that packs with NXT), but there are only 2 instructions, so speed has been increased a lot.

It might seem as if DX_STORE could be improved by taking advantage the MA=BX that we now have. Did I miss a trick???
Actually, it comes out to 4 clock cycles either way, which is why I didn't rewrite it. Just now though, I made a slight upgrade in the document to explain that I didn't actually miss a trick. I won't bother uploading the new document because the explanation is trivial and I don't want to make people download slightly different documents every day.

This is the explanation:
Code:
DX_STORE:           ; adr --        ; expects DX=SOS
    mov DX,AX       ; AX= val
    mov BX,DX       ; DX= adr
    drop
    mov DX,MA
    nxt AX

DX_STORE assembles as:
                group-A     group-B     group-M
1st opcode:     MOV DX,AX   MOV BX,DX   PHS         ; the PHS comes from the DROP macro
2nd opcode:                             STR BX      ; the STR BX comes from the DROP macro
3rd opcode:                             MOV DX,MA
4th opcode:                             NXT AX

This is an alternate version of DX_STORE that takes advantage of MA=BX that we now have:

DX_STORE:           ; adr --        ; expects DX=SOS, MA=BX (MA is set to BX by NXP)
    str DX
    drop
    nxt

This assembles as:
                group-A     group-B     group-M
1st opcode:                             str DX
2nd opcode:                             PHS         ; the PHS comes from the DROP macro
3rd opcode:                             STR BX      ; the STR BX comes from the DROP macro
4th opcode:                             NXT

It seems as if DX_STORE could benefit from MA=BX that we now have, but it actually comes out to 4 clock-cycles either way.

1 edit: fixed a typo in the text


Thu Oct 25, 2018 6:12 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Thanks for the extra explanations - I get the picture much better now!


Fri Oct 26, 2018 7:46 am
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
BigEd wrote:
Thanks for the extra explanations - I get the picture much better now!

Good! Let us just hope that it is a picture of something real, not just a fantastic dream.

In 1994 at Testra I was told that routing was a problem. This problem is intractable --- you can't do a brute-force search on it (the computer would require more silicon than is available in the galaxy, and it would still take longer to run than our galaxy's expected lifetime).
The HDL they had written used heuristics to do the routing, but there was no guarantee that this was optimum, or even close to optimum. It was better than what LDL (Lattice Design Language) had though.
The MiniForth was designed largely "by guess and by golly" --- designs would be tried out, and if they wouldn't route then they would be modified, and this would continue until the router did make the design fit in the PLD. In many cases, which external pins were used for which purpose would make a big difference in whether the design would route or not. In some cases, the design wouldn't route, but when it was made more complicated with more features, it would route. That is pretty lucky! You would expect that making the design simpler would make it more likely to route, but this wasn't necessarily true.

Routing was the major problem in the 20th century, and I was told that this problem was unlikely to be solved in the 21st century either. It would never be solved!
I think that what happened with FPGAs is that the routing problem didn't get solved. FPGAs have a lot more resources than PLDs did, but connectivity is still a big problem. The solution people have is to do things in sequence rather than in parallel, because parallel execution requires a lot of connectivity, and they don't have a way to route that connectivity. So, they are doing things in sequence with multi-cycle instructions. They are using all those resources to provide more registers. A RISC-V enthusiast recently informed me proudly that the RISC-V has thirty two 32-bit registers, which is ever so much better than my handful of 16-bit registers. Of course, he is not considering that the RISC-V is doing things in sequence that the TOYF would be doing in parallel.
Two Verilog programmers have looked at my design and both have told me the same thing --- parallel processing, and the attendant high connectivity, is a problem on an FPGA --- a design featuring sequential execution is much more realistic on an FPGA.

It is possible that my TOYF design is no good. I am expecting parallel execution, which implies high connectivity, and hence the design is going to fail to route on any FPGA.
Honestly, nobody has ever said anything positive about my design. Everybody thinks that it is no good. Maybe they know something that I don't --- they know that my design isn't going to route --- if it doesn't route, then it is nothing more than a fantastic dream.


Sun Oct 28, 2018 3:53 am
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
Hugh Aguilar wrote:
A RISC-V enthusiast recently informed me proudly that the RISC-V has thirty two 32-bit registers, which is ever so much better than my handful of 16-bit registers. Of course, he is not considering that the RISC-V is doing things in sequence that the TOYF would be doing in parallel.

Another guy recently informed me that the Silicon Laboratory's C8051 is pretty awesome. Processors have fans, just like musicians!
Quote:
The C8051 is a ‘hard wired’ implementation of the 8051 microcontroller CPU, as opposed to the original micro-coded version.
The instruction set is mapped to a basic two-stage pipeline to increase throughput while maintaining an 8-bit program memory width.
The result is a high-performance 8051 microcontroller architecture that executes most instructions within 1 or 2 clock cycles and delivers 20 to 25 times the performance of the original 8051 core.

In this case the trend was away from sequential execution (the 8051 took 12 or 24 clock cycles for every instruction) and toward parallel execution (the C8051 takes 1 or 2 clock cycles for every instruction).
The C8051 is an ASIC though, not an FPGA --- maybe routing is easier in an ASIC than an FPGA --- maybe my TOYF would only work as an ASIC, not an FPGA.

My understanding is that an ASIC has a huge upfront cost, so this is only done when you already have a customer who is committed to buying in large volume. Obviously not going to happen with the TOYF!


Sun Oct 28, 2018 4:54 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Not so long ago, part of the expertise in making a custom chip was floorplanning the device: what goes where, how things connect, where busses go. Several reasons for this: not many layers of metal, and needing tidy organisation so the wires are short enough to make the speed high enough. That is, to meet timing and cost goals, you needed to architect your design with full knowledge that the implementation is on a 2D surface with limited number of layers of interconnect.

However, chips got bigger, transistors smaller, wires thinner, and more layers of metal were possible and economic. Tools got better and computers got faster and these days you often see very little structure in a CPU layout.

I believe the fastest and most efficient FPGA designs will still be written with knowledge of the implementation, and may also be floorplanned. In such a case, you need to know something about the connectivity. Everything-connects-to-everything is always going to be difficult to map to a 2D surface.

It's clear from Rob's thread nearby that a large and complex design may have trouble routing, and also clear that reviewing the design and exploring possibilities can make for step function improvements in routability.

I don't think you're going to find out about how implementable your design is until you code some HDL and try it. Or, possibly, draw a block diagram with some realism about the largest blocks and the widest busses, and keep redrawing it until the connectivity looks sensible.


Sun Oct 28, 2018 7:24 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Two Verilog programmers have looked at my design and both have told me the same thing --- parallel processing, and the attendant high connectivity, is a problem on an FPGA

I don't think it's necessarily the parallel processing, but the rather the logic complexity that causes problems. Routing multiple busses to multiple targets in a controlled fashion. FPGA's are used for signal processing that takes place in parallel. There can be a lot of parallel processing happening if interconnections are limited. Newer FPGA's have DSP blocks for a reason. I think a really simple parallel design might work well, even in an FPGA. It's the number of nested if/else/case statements causing me issues. Parallel processing may be complex depending on things like dependency checks. I've considered trying to get a VLIW parallel design working, by having the dependencies worked out in advance. Shades of Itanium. Of course I haven't studied the routing networks in the FPGA very much so I may be out to lunch. I have read that there are enough routing resources in an FPGA to handle most designs, so I would try things before saying it won't work.

_________________
Robert Finch http://www.finitron.ca


Mon Oct 29, 2018 6:52 am
Profile WWW

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
robfinch wrote:
I don't think it's necessarily the parallel processing, but the rather the logic complexity that causes problems. Routing multiple busses to multiple targets in a controlled fashion.

What do you mean by "logic complexity?" What do you mean by "targets?"

Each register is part of a group, meaning that only instructions in that group can write to the register.
Instructions in any group can read from any register though.
Some of the registers have only a few instructions accessing them (CX EX LF HF) because they have dedicated uses. Other registers have more instructions accessing them. AX is heavily used.

It is not very complex --- I tried to make it as simple as possible --- my original expectation was that the TOYF would be easier to implement than the 65ISR because the TOYF has only single-cycle instructions and the 65ISR has multi-cycle instructions, but apparently this doesn't relate to ease.
Lots of processors comparable to the 65ISR have been implemented in FPGAs, including your own 65c02, but nothing like the TOYF has been tried before except for the MiniForth in 1994.

To get an advantage with the TOYF, it will be necessary to write some primitives in assembly-language.
Unfortunately, the assembly-language is too difficult for most people --- I may be the only person who ever does that.
Most people are going to want to use a statically defined language such as ANS-Forth --- if you are going to impose such restrictions on your software though, then there is no point in being innovative with the hardware either --- actually, people want GCC not Forth anyway, which makes no sense on the TOYF at all.


Tue Oct 30, 2018 5:43 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Maybe you're thinking about your machine at the behavioural level, and the questions about implementation difficulty are a little lower down, at the implementation level.

For example you mention that certain instructions write registers - that's a question of the control logic, of how to enable the write of a register. That might indeed prove to be very complex logic which takes many nanoseconds and therefore affects your clock speed. However, there's also the question of the data flow. From which sources might this register be written? What bus structure or multiplexor structure will be needed? Likewise, when a register is read, to which destinations might the data need to go? How does it get there? And another question is how many ports a register file has, if you have a register file: in a given cycle, might more than one register be read? Might a read happen in the same cycle as a write? To the same register or to a different one?

If you draw out all your registers, and your ALU, and your machinery for computing addresses, and your port to RAM, what connectivity is needed to implement your instructions? Does everything connect to everything? That's a lot of wires! Or a lot of multiplexors.


Tue Oct 30, 2018 6:16 pm
Profile

Joined: Sun Jul 23, 2017 1:06 am
Posts: 93
BigEd wrote:
Maybe you're thinking about your machine at the behavioural level, and the questions about implementation difficulty are a little lower down, at the implementation level.

For example you mention that certain instructions write registers - that's a question of the control logic, of how to enable the write of a register. That might indeed prove to be very complex logic which takes many nanoseconds and therefore affects your clock speed. However, there's also the question of the data flow. From which sources might this register be written? What bus structure or multiplexor structure will be needed? Likewise, when a register is read, to which destinations might the data need to go? How does it get there? And another question is how many ports a register file has, if you have a register file: in a given cycle, might more than one register be read? Might a read happen in the same cycle as a write? To the same register or to a different one?

If you draw out all your registers, and your ALU, and your machinery for computing addresses, and your port to RAM, what connectivity is needed to implement your instructions? Does everything connect to everything? That's a lot of wires! Or a lot of multiplexors.

Yes, it is possible to read from a register and write to that register in the same clock cycle. I would expect that most of the time when instructions parallelize, this is what is happening.

This is an example I hand-assembled in the document:
Code:
DX_MINMAX:          ; a b -- max            ; sets DX= min
    mov BX,AX
    mov S0,MA
    ldr LX
    mov LX,DX       ; DX= a
    mov AX,LX       ; LX= b
    mov 0,CF        ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502
    sbc AX,BX       ; CF= a > b
    mov 1,AX        ; AX= offset to branch to next primitive
    mov LX,BX       ; BX= b
    add IP,AX
    not CF          ; CF= a <= b
    bcs             ; if DX is minimum and BX is maximum, then we are done
SWAP_DX:            ; a -- b        ; needs DX=b, sets DX= a                    ; a label prevents code afterward from packing with code before
    mov BX,AX
    mov DX,BX
    mov AX,DX       ; we no longer have an XCH BX,DX instruction
    lit
    ldr LX
    execute_LX

                group-A     group-B     group-M
1st opcode:     MOV BX,AX   MOV 0,CF    MOV S0,MA       ; MOV 0,CF moved all the way back here because it has no dependencies
2nd opcode:                             LDR LX          ; LDR LX often doesn't pack well because everything after it depends on LX
3rd opcode:     MOV 1,AX    MOV LX,DX   MOV AX,LX       ; it is legal to use a register and store into that register concurrently
4th opcode:     ADD IP,AX   SBC AX,BX                   ; BX gets set, but all we needed was CF --- the BX result was unneeded
5th opcode:                 MOV LX,BX
6th opcode:                 NOT CF     
7th opcode:     MOV DX,AX               BCS             ; BCS will terminate the primitive if a<=b
8th opcode:     MOV BX,AX   MOV DX,BX   LIT
9th opcode:     MOV 1,AX    MOV AX,DX   LDR LX          ; the MOV 1,AX is from EXECUTE_LX
10th opcode:                            NXP             ; the NXP is from EXECUTE_LX
11th opcode:                            NXF             ; the NXF is from EXECUTE_LX

Opcodes #3 is a typical example.

I wouldn't say that "everything connects to everything" though. Some of the registers are not used very much.

EX is not used very much:
Code:
group-A:
    mov EX,AX       ; move EX to AX
    sum EB,CA       ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX        ; used in multiplication
group-B:
    mov AX,EX       ; move AX to EX
    bit EX          ; shift EX left 1 bit, move ~CF into low-bit of EX      ; used in division with EX = quotient
    adj DX,EB       ; shift DX right, move 0 into high-bit | shift EX:BX left, move 0 into low-bit, move high-bit into CF

I added the EX register for the purpose of supporting multiplication and division.
Multiplication is necessary for the PID algorithm used in motion-control, which is the point of the TOYF.

Note that group-M does not need an ALU --- I did that to simplify the processor --- I have made a lot of effort to simplify the processor as much as possible, while still making it able to do useful work.
The TOYF is about twice as big and complicated as the MiniForth that fit on the Lattice isp1048 PLD in 1994 --- that was a quarter of a century ago --- but now the TOYF is too big and complicated for the modern FPGA chips?
I've got a few 16-bit registers --- the RISC-V has thirty-two 32-bit registers --- but mine is too complicated to implement?


Wed Oct 31, 2018 2:36 am
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 79 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next

Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software