


Reply to topic  [ 102 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7  Next
 DSD7 

Joined: Tue Dec 31, 2013 2:01 am
Posts: 100
Location: Sacramento, CA, United States
I really like 40-bit chunks ... eccentric, yet versatile! My (still) incomplete 65m32 was very close to being 36 or 40 bits wide, but I went with boring old 32 bits ... still not sure why, since I had a previous design called the m-824 that worked seamlessly with 8-bit bytes and/or 24-bit words. The 8-bit part was boring but allowed very compact code, while the 24-bit part was kinda "out-there", but allowed more power than your classic 8-bitter.

Keep up the good work. Maybe soon you'll invent your "Goldilocks" core and stick with it long enough to put the finishing touches on it. Or is it one of those "the journey is more interesting than the destination" things? I ask with a touch of envy, both for your perseverance and for your impressive skill set.

Mike B.

Code:
000000:           ;   1 ;234567890123456789012345678901234567890123456789012345
000000:           ;   2 ;     primmz by barrym for the m-824 microprocessor
000000:           ;   3 ;               by barrym 2010-10-30
000000:           ;   4 ; The m-824 is an experimental microprocessor design
000000:           ;   5 ;    based on a 24-bit architecture.  It was inspired
000000:           ;   6 ;    by Steve Wozniak's Sweet16 interpreter from the
000000:           ;   7 ;    original Apple II, but adds several ALU functions
000000:           ;   8 ;    (and, or, neg, lsl, ror, etc.) and the ability to
000000:           ;   9 ;    operate on byte and word data with equal ease.
000000:           ;  10 ;    The instruction set encourages relocatable coding
000000:           ;  11 ;    via (nearly) penalty-free relative addressing.
000000:           ;  12 ;
000000:           ;  13 ; primmz stands for PRint IMMediate and simply outputs
000000:           ;  14 ;    the Zero-terminated string immediately following
000000:           ;  15 ;    the subroutine call before continuing execution
000000:           ;  16 ;    at the instruction immediately after the zero.
000000:           ;  17 ; All registers used except [p] are restored to their
000000:           ;  18 ;    original values before returning to the caller.
000000:           ;  19 ;    Is there any other microprocessor out there that
000000:           ;  20 ;    can do the same in 14 bytes or less?  I doubt it.
000000:           ;  21 ; This routine can be changed to a software interrupt
000000:           ;  22 ;    service routine with the addition of a few bytes;
000000:           ;  23 ;    it can then be conditionally executed, and even
000000:           ;  24 ;    [p] is saved in that case!
000000:           ;  25 ; >cout (built-in SWISR) emits [a] as ascii to stdout.
000000:           ;  26 ;
000000:2b         ;  27 primmz  exbs         save [b], get str addr
000001:27         ;  28         exb7         use [7] as str pointer
000002:cd         ;  29         pshb         save original [7]
000003:08         ;  30         (sk)
000004:ae10       ;  31 primmz2 >cout         while (*[7]++ != 0)
000006:b6         ;  32         la@7            shoot it to stdout
000007:01defa     ;  33         bne<   primmz2
00000a:bd         ;  34         pulb         restore original [7]
00000b:27         ;  35         exb7            and [b], modified
00000c:2b         ;  36         exbs            [7] becomes the new
00000d:2e         ;  37         rts            return address
00000e:           ;  38         .en


Thu Dec 15, 2016 6:24 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Quote:
I really like 40-bit chunks ... eccentric

Yes, it is a bit eccentric. With 64 registers and an eight-bit opcode, 20 bits of a 32-bit instruction are used up already, leaving fewer than 10 bits for branch targets. With only 32 bits, branches and other instructions were having to be split into two instructions, consuming 64 bits anyway. Forty bits makes a number of instructions easier to encode. I looked seriously at squeezing things into 32 bits; if there were only 32 registers, 32 bits would probably do. Also, with an 80-bit core, immediate constants may need more bits. The core also supports some 24-bit instructions, and a handful of 8-bit instructions as well to help with code density. I'm planning a compressed instruction set at some point, so there are still some opcodes available. It now has 80/40/32/16/8-bit loads and stores.
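The bit budget can be sketched with a little arithmetic. The field widths below follow the figures in this post (an 8-bit opcode, 6-bit register fields for 64 registers); the two-register instruction format is my assumption for illustration.

```python
# Rough bit-budget sketch for the encoding trade-off discussed above.
OPCODE_BITS = 8
REG_BITS = 6  # 64 registers need 6 bits per register field

def remaining_bits(word_size, reg_fields):
    """Bits left over for an immediate/displacement field."""
    return word_size - (OPCODE_BITS + reg_fields * REG_BITS)

# An opcode plus two register fields uses 20 of 32 bits:
print(remaining_bits(32, 2))  # 12 bits left in a 32-bit instruction
print(remaining_bits(40, 2))  # 20 bits left in a 40-bit instruction
```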
I have to admit DSD9 isn't optimized for memory efficiency. It's an ugly machine. Memory is cheap these days.
Quote:
"the journey is more interesting than the destination" things?

I do have a destination in mind. I'd like to get a two-way superscalar going. I'd be working on Thor except that I decided to drop predicated instructions, and it would require way too many mods to Thor.

That's pretty compact code for the m-824. I've been following along with the 65m32; no recent posts?

2016/12/13
I had to write a 128-bit integer math class for the assembler in order to use 80-bit integers. It required modifying the assembler's expression parser and symbol table. You would think there'd be a class already available on the web; I found tons of queries about code for 128-bit math operations and very little available code. I still don't know if divide works (it hasn't been used yet). Perhaps the pundits have deemed it too easy a task.
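For what it's worth, the core of such a class is mostly masking. Here is a minimal sketch in Python (not the assembler's actual class) of fixed-width, wraparound 128-bit arithmetic:

```python
# Minimal sketch of 128-bit fixed-width integer arithmetic: Python's
# unbounded ints are simply masked back to 128 bits after each operation.
MASK128 = (1 << 128) - 1

class Int128:
    def __init__(self, value=0):
        self.value = value & MASK128

    def __add__(self, other):
        return Int128(self.value + other.value)   # wraps modulo 2**128

    def __mul__(self, other):
        return Int128(self.value * other.value)   # keeps the low 128 bits

    def __floordiv__(self, other):
        return Int128(self.value // other.value)  # unsigned divide

# An 80-bit constant fits comfortably in 128-bit assembler arithmetic:
a = Int128((1 << 80) + 5)
b = Int128(1 << 70)
print(hex((a + b).value))
```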

_________________
Robert Finch http://www.finitron.ca


Thu Dec 15, 2016 9:31 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 100
Location: Sacramento, CA, United States
robfinch wrote:
... I've been following along the 65m32, no recent posts?

I have a dozen excuses, but none of them stand up to serious scrutiny. My best attempt at a legitimate excuse is that I need a rare combination of the proper inspiration and enough uninterrupted time to make significant progress, and let's just say that progress has been slow recently. I'm also a bit of a perfectionist, which can be a hindrance when trying to get a prototype completed.

Mike B.


Sat Dec 17, 2016 2:47 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
2016/12/15
Opened the elaborated design for the first time after fixing numerous omissions and syntax errors.
I forgot to put the instruction address through the mmu and had to rewrite parts of the core.

Accessing memory is going to take a lot of clock cycles, but hopefully the cycle time will be fast. It takes two clock cycles for an address to go through the mmu, then three more to load data from the data cache. The data cache output is double-registered, and there's a shift register to align data, which is also registered. It takes a minimum of about six clocks to access data through the cache, more for uncached data.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 17, 2016 4:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
2016/12/16
Simulated the DSD9 core for the first time. Lots of errors to fix.
Synthesis couldn’t infer block RAM for the data cache: it synthesized to about 200,000 LUTs and FFs rather than four block RAMs, and took over an hour to run.
So I created the memories using the IP core generator in Vivado. That worked, but it makes the design vendor-specific.
The design now synthesizes in about five minutes, but there are still some false problems to do with multi-driven nets.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 18, 2016 3:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
2016/12/17
Can’t synthesize the design because of a multi-driven-nets error. I’ve run into this before, where there aren’t really multi-driven nets and it has something to do with the use of tasks. I posted a message on the Xilinx community support forum, then found the error was in my code: this time there really was a multi-driven net. I had to find it by expanding out all the tasks manually. The same task was being used in two different always blocks, which isn't necessarily a problem, but in this case it was.

DSD9 is running in an FPGA. It hangs after the second LED status display, but it runs in simulation at least until the screen-clear routine.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 19, 2016 2:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
DSD9 has reached the clear-screen point. It successfully clears the screen, but it hangs before the screen randomizer test. There are a number of subroutine calls before clearing the screen to do things like set up the mmu and interrupt tables. DSD9 can read the status of the buttons now, so it's possible to have some control over the code executed. Working LEDs, buttons, and switches make bootstrapping so much easier.
The core is about 30,000 LUTs.

2016/12/18
Added a memory-operate key (asid) to the cache tags so that the cache can be accessed with virtual addresses. This shaves two clock cycles off every memory access by allowing address translation to occur in parallel with cache access. Updated the documentation; the docs were seriously wrong, having originated from the FISA64 project.
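The idea can be sketched as a cache keyed on (asid, virtual tag) pairs. The Python below only illustrates the tagging scheme, not DSD9's actual cache structure, and the names and line size are made up.

```python
# Sketch: tagging cache lines with an address-space id (ASID) lets the
# cache be indexed by virtual address, so address translation can run in
# parallel with the lookup instead of serializing ahead of it.
class VirtualCache:
    LINE_BITS = 5  # 32-byte lines (illustrative)

    def __init__(self):
        self.lines = {}  # (asid, virtual line tag) -> data

    def store(self, asid, vaddr, data):
        self.lines[(asid, vaddr >> self.LINE_BITS)] = data

    def load(self, asid, vaddr):
        # Hit or miss is decided without translating vaddr; the ASID keeps
        # identical virtual addresses from different tasks apart.
        return self.lines.get((asid, vaddr >> self.LINE_BITS))

cache = VirtualCache()
cache.store(asid=1, vaddr=0x1000, data="task A")
cache.store(asid=2, vaddr=0x1000, data="task B")
print(cache.load(1, 0x1000))  # task A
print(cache.load(2, 0x1000))  # task B
```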

_________________
Robert Finch http://www.finitron.ca


Tue Dec 20, 2016 12:00 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
2016/12/19

Changing the decoding to use functions rather than wires. The benefit of functions is that they can be reused more easily. This isn’t really much of an issue until multiple decoders are needed, but the decoding functions can easily be ported to another machine with a similar instruction set.
Tetra-wide loads didn’t update the register file when the load was unaligned and crossed a 128-bit boundary.
Added memory indirect jumps and calls to the core.
Added volatile load instructions to the core.
The assembler was outputting the wrong opcodes for shift operations.
The assembler didn’t set the correct target register for register to register operations.
It's amazing the core worked as well as it did given the previous two problems.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 21, 2016 4:51 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1255
robfinch wrote:
It's amazing the core worked as well as it did given the previous two problems.

It's like the flipside of how hard it is to get high coverage from a testsuite - any given test has low coverage, and so even with numerous remaining bugs a program may run OK.

(You might recall that some 30% of the transistors of the 6502 can be removed, and a C64 emulation will still initialise Basic and get to the READY prompt.)


Wed Dec 21, 2016 7:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Quote:
It's like the flipside of how hard it is to get high coverage from a testsuite - any given test has low coverage, and so even with numerous remaining bugs a program may run OK.

I tend to get the primary instructions (call, ret, add, load, store) working first and leave the bells and whistles for later. I have to admit my test coverage isn't very good yet; ideally, every possible control-flow path through the core's code should be tested as a start.

2016/12/20
Branch displacement shortened to 16 bits (from 18), which leaves room for static branch-prediction bits in all the branch instructions. The reduced range also limits a branch's ability to land randomly in the middle of code. There’s no real reason to have a displacement much larger than 12 bits.
Along the way, MMIX-like conditional-set instructions were added.
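The range trade-off is easy to quantify. The sketch below uses the displacement widths from this entry, assuming a signed two's-complement field:

```python
# Reach of a signed branch displacement, in address units.
def branch_range(disp_bits):
    return -(1 << (disp_bits - 1)), (1 << (disp_bits - 1)) - 1

print(branch_range(18))  # old 18-bit field: (-131072, 131071)
print(branch_range(16))  # new 16-bit field: (-32768, 32767),
                         # freeing two bits for static prediction hints
```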

_________________
Robert Finch http://www.finitron.ca


Thu Dec 22, 2016 5:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Added call-target exceptions to the core, as suggested in another post. The call instruction doesn't actually read the target address, so there's no load-instruction slowdown. Instead it sets a flag indicating that a call instruction executed, and the next instruction to execute checks this flag. If the call flag is set and the next instruction isn't marked as a target, an exception is raised.
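A behavioural sketch of the mechanism, with made-up opcode names: CALL sets a flag, and whatever executes next must be marked as a valid target or an exception is raised.

```python
# Sketch of the call-target check: no target address is read up front;
# a flag set by CALL is tested on the very next instruction.
def execute(program):
    expect_target = False
    for op in program:
        if expect_target and op != "TARGET":
            raise RuntimeError("call target exception: callee not marked")
        expect_target = (op == "CALL")

execute(["ADD", "CALL", "TARGET", "RET"])  # runs cleanly
try:
    execute(["CALL", "ADD"])               # lands on an unmarked callee
except RuntimeError as e:
    print(e)
```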

2016/12/22
Mulling over data cacheability bits in an mmu. Even when data addresses aren’t being mapped through the mmu, the core still needs to know whether the data is cacheable, which means there has to be some means outside of page-table entries to indicate caching. In machine mode the core runs with unmapped addresses, and it would be slow if it couldn’t use cached data. Cacheability is tied to physical addresses: the I/O address range should not be cached, no matter how it’s mapped. I was about to add a bit to the mmu page entries to indicate cacheability, but it’s simpler for the core just to use a single compare to detect the I/O range. Instructions are always cached.
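As a sketch, the single-compare test looks like this; the I/O window addresses are invented for illustration and are not DSD9's actual memory map.

```python
# Cacheability decided by one physical-address range compare, rather
# than by a bit in every mmu page entry. Window bounds are made up.
IO_BASE  = 0xFFD0_0000
IO_LIMIT = 0xFFFF_FFFF

def is_cacheable(paddr):
    """Anything outside the fixed I/O window may be cached; I/O never is."""
    return not (IO_BASE <= paddr <= IO_LIMIT)

print(is_cacheable(0x0000_1000))  # True: ordinary memory
print(is_cacheable(0xFFD0_0040))  # False: memory-mapped I/O register
```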

_________________
Robert Finch http://www.finitron.ca


Fri Dec 23, 2016 6:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
2016/12/24
Increased the fmax of the core's ifetch by adding more pipelining to the instruction fetch stage. Previously the instruction fetch used the negative edge of the clock to read the cache, which cut the possible frequency in half, especially when combined with instruction alignment and PC-increment detection. Now it’s all positive-edge triggered. There are two drawbacks: all instructions are now five bytes in size, so there is some loss of code density, and branches are predicted at a later stage, so they now take 1, 3, or 5 clock cycles depending on whether the branch is taken and whether it was predicted correctly. The additional branch cycles probably lowered performance about 10%, but quite a bit is gained in fmax for ifetch. This change moved instruction fetch off the critical timing path.
2016/12/25
Christmas break.
2016/12/26
Previous changes introduced a pipeline bug: the core insists on skipping over a RET instruction without executing it.
Got past that pipeline bug plus several more. The core runs for about 75,000 instructions before crashing due to a bad address popped during a RET instruction. This could be a software bug (compiler or assembler), but more likely it's still a pipeline bug.
Spent some more time optimizing the core and documenting, and got the timing to over 50MHz (57MHz).
I'm following the paradigm of "make it work first, then optimize."

_________________
Robert Finch http://www.finitron.ca


Wed Dec 28, 2016 5:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
2016/12/27
Modified the icache and split it into L1 and L2 caches to simplify the IFETCH-stage pipelining. The L1 cache is accessible in a single cycle, the L2 cache in two clock cycles, and memory in 6+ clock cycles. L1 is built from LUT RAM and is small (2kB); L2 is built from block RAM and is larger (16kB).

Added two stages to decode in order to support large constants directly in the instruction stream, instead of using a wide instruction window. Now instruction fetches are always the same size, and instructions can no longer span cache lines. The instruction-alignment multiplexer is simpler now and uses fewer core resources.

The latest craziness is to omit the interrupt hardware from the IFETCH stage, to reduce the amount of multiplexing taking place, and instead provide an interrupt-check instruction. This instruction would be placed in code every ten or twenty instructions to poll for interrupts. It increases code bloat (5% at one check per 20 instructions), and interrupt latency would be worse, but probably good enough for many purposes. Running at 50MHz with an interrupt-check instruction every 20 instructions, interrupts would be checked at about a 1MHz rate. Of course this relies on the programmer to make sure interrupts are checked, but it could be done automatically by the language compiler or interpreter.
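The numbers in this entry can be reproduced with a couple of one-liners. The cycles-per-instruction figure is my assumption, chosen so the result matches the roughly 1MHz rate quoted above.

```python
# Overhead and polling rate of an explicit interrupt-check instruction.
def icheck_bloat(n):
    """Code growth from one extra instruction per n instructions."""
    return 1 / n

def icheck_rate_hz(fclk_hz, n, cpi=2.5):
    """Approximate check rate: one check every n instructions executed."""
    return fclk_hz / (n * cpi)

print(f"{icheck_bloat(20):.0%} code bloat")         # 5% code bloat
print(f"{icheck_rate_hz(50e6, 20) / 1e6:.1f} MHz")  # 1.0 MHz
```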

_________________
Robert Finch http://www.finitron.ca


Thu Dec 29, 2016 11:44 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1255
Making interrupt checking explicit is quite radical, and a new one for me! Is the fmax advantage better than the (tunable) 5% penalty?


Thu Dec 29, 2016 11:51 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 177
Location: Huntsville, AL
It's an idea similar to that employed in the VIPER processor. Without stacks or interrupts, those operations have to be performed in software. No doubt there will be a performance penalty, but I can see certain benefits to keeping the hardware simple and letting well-defined, tested software components perform these functions. I expect that as the processor's fmax increases due to reduced complexity, some functions performed in HW can move to SW without a noticeable loss of performance.

Rob's observation that an HLL compiler can automatically insert the required interrupt-check instructions also means it is easier to create atomic operations when needed, without manually inserting interrupt-disable instructions. The generally accepted mechanism of disabling interrupts around critical regions causes many SW errors. With the language tools providing this service, critical regions can be declared and no interrupt-handling code would be generated: entire routines of an executive could be devoid of interrupt-handling instructions.

_________________
Michael A.


Fri Dec 30, 2016 3:21 am