 One Page Computing - roll your own challenge 
Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Revaldino.

Thanks for taking the time to write these macros. It's quite interesting and fully reflects the design constraint of the OPC designs. I can't wait to see whether the application of the 'condition flag' along with the others that you propose makes this better without adding too much hardware.

At the risk of being a bit off-topic, and motivated by Ed's comment above on RISC-V, I did a bit of research and found a RISC-V implementation of 'multiply' in assembler.

Code:
   .globl __mulsi3
   .type  __mulsi3, @function
__mulsi3:
   mv     a2, a0
   mv     a0, zero
.L1:
   andi   a3, a1, 1
   beqz   a3, .L2
   add    a0, a0, a2
.L2:
   srli   a1, a1, 1
   slli   a2, a2, 1
   bnez   a1, .L1
   ret

It's quite interesting how compact it is.

By looking at what it does, I'm pretty sure that the original C code (assuming this was compiler generated) was this:

Code:
unsigned int __mulsi3 (unsigned int a, unsigned int b)
{
  unsigned int r = 0;
  while (a)
  {
    if (a & 1)
      r += b;
    a >>= 1;
    b <<= 1;
  }
  return r;
}


Just as a matter of curiosity, I went one step further and compiled that C code with LLVM for the 32-bit RISC-V (RV32) architecture. The result was this:
Code:
   .globl   __mulsi3
   .type   __mulsi3,@function
__mulsi3:
   mv   a2, zero
   beqz   a0, .LBB0_2
.LBB0_1:
   andi   a3, a0, 1
   neg   a3, a3
   and   a3, a3, a1
   add   a2, a3, a2
   slli   a1, a1, 1
   srli   a0, a0, 1
   bnez   a0, .LBB0_1
.LBB0_2:
   mv   a0, a2
   ret

So not quite the same as the original, but pretty close. The differences are that there's an early exit when 'a' is zero, and that the internal branch is replaced by a series of branch-less operations that should do the same thing. I suppose that one or the other might be better depending on the branch prediction efficiency of the target hardware. I have found that the LLVM compiler tends to treat branches as something close to hell, so this output is consistent with that approach.
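For reference, the branch-less loop body can be written back in C. This is only a sketch of what the `andi`/`neg`/`and`/`add` sequence does, not the compiler's actual intermediate form; the names `mul_branchy` and `mul_branchless` are made up for illustration.

```c
#include <stdint.h>

/* Shift-and-add multiply with a conditional add, as in the C source above. */
uint32_t mul_branchy(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (a) {
        if (a & 1)
            r += b;
        a >>= 1;
        b <<= 1;
    }
    return r;
}

/* Same loop with the 'if' replaced by a mask, mirroring andi/neg/and/add. */
uint32_t mul_branchless(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (a) {
        uint32_t mask = 0u - (a & 1u);  /* all zeros or all ones (andi, neg) */
        r += b & mask;                  /* adds either b or 0 (and, add) */
        a >>= 1;
        b <<= 1;
    }
    return r;
}
```

Both functions return the low 32 bits of the product, so they should agree for every pair of inputs; only the way the conditional add is expressed differs.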

RISC-V is definitely an architecture to look at when trying to design our own CPUs...

Joan


Sat Oct 05, 2019 4:36 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
How about this idea: not a one-page design, but a one-page description.


Last edited by BigEd on Thu Oct 10, 2019 8:26 pm, edited 1 time in total.

link to new thread



Thu Oct 10, 2019 7:56 pm

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
I think it was the OPC Challenge that first inspired me to join anycpu.org and get involved with simpler computers back in 2017.

Putting artificial constraints on a project certainly focuses the mind to keep things simple.

This week I have written a CPU simulator for Suite-16, my 16-bit CPU, which runs on any Arduino-compatible board in fewer than 60 lines of C++ code.

Perhaps the TTL implementation of the Suite-16 CPU will need fewer than 66 TTL ICs.

I have started a GitHub repository for the project.

Here's the simulator running a Hello World! program on an Arduino:

https://github.com/monsonite/Suite-16/b ... orld_1.ino

and the project on Hackaday.io is here: https://hackaday.io/project/168025-suite-16/details


Looking back at the original challenge: 66 lines of 132-column fan-fold paper is 66 × 132 = 8712 characters, so more than 8K characters of source code, which should be enough for most die-hards.


Sun Oct 20, 2019 1:32 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Nearby, B.Bibby has found a four-instruction machine with a one page Verilog model:
viewtopic.php?p=5336#p5336


Fri Jan 24, 2020 8:45 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’ve got a start on a new one-page-challenge CPU, called FT20200324 (the date) for lack of a better name. It’s a 32-bit RISC machine with 32 registers and instruction predicates. There are 16 predicate registers and 1 link register. Basic instruction formats are as follows:
Code:
// {RR}:  pppp 00010 ttttt aaaaa bbbbb oooooooo
// ADD:   pppp 00010 ttttt aaaaa bbbbb 00000100
// SUB:   pppp 00010 ttttt aaaaa bbbbb 00000101
// AND:   pppp 00010 ttttt aaaaa bbbbb 00001000
// OR:    pppp 00010 ttttt aaaaa bbbbb 00001001
// XOR:   pppp 00010 ttttt aaaaa bbbbb 00001010
// MUL:   pppp 00010 ttttt aaaaa bbbbb 00001011
// SHL:   pppp 00010 ttttt aaaaa bbbbb 00010000
// SHR:   pppp 00010 ttttt aaaaa bbbbb 00010001
// ASR:   pppp 00010 ttttt aaaaa bbbbb 00010010
// RET:   pppp 00010 00000 ----- ----- 10000000
// NOP:   pppp 00010 00000 ----- ----- 11101010
// Cxx:   pppp 00010 -PPPP aaaaa bbbbb 1111oooo
// ADDi:  pppp 00100 ttttt aaaaa nnnnnnnnnnnnn
// ANDi:  pppp 01000 ttttt aaaaa nnnnnnnnnnnnn
// ORi:   pppp 01001 ttttt aaaaa nnnnnnnnnnnnn
// XORi:  pppp 01010 ttttt aaaaa nnnnnnnnnnnnn
// LD:    pppp 10000 ttttt aaaaa nnnnnnnnnnnnn
// ST:    pppp 10001 sssss aaaaa nnnnnnnnnnnnn
// ADDIS: pppp 1001n ttttt nnnnnnnnnnnnnnnnnn
// JMP:   pppp 10111 l aaaaaaaaaaaaaaaaaaaaaa
// Cxxi:  pppp 11ooo oPPPP aaaaa nnnnnnnnnnnnn
// A ton of compares including a generate carry into predicate register

According to the toolset it should run at 120 MHz+ (without multiply), with most instructions being single cycle. The multiply operation slows things down to about 75 MHz.
Core size is about 2600 LCs.
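To make the field layout concrete, here is a small C sketch that packs and unpacks the register-register and immediate formats listed above. It assumes the listing is written most significant bit first (so `pppp` sits in bits 31..28); that bit ordering, the struct, and the function names `ft_decode` and `ft_encode_rr` are illustrative assumptions, not taken from the actual core. The 13-bit immediate is returned raw, since the listing does not say whether it is sign-extended.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions assume the listing is written MSB first (an assumption). */
struct ft_fields {
    unsigned pred;    /* pppp,     bits 31..28 */
    unsigned opcode;  /* 5 bits,   bits 27..23 */
    unsigned rt;      /* ttttt,    bits 22..18 */
    unsigned ra;      /* aaaaa,    bits 17..13 */
    unsigned rb;      /* bbbbb,    bits 12..8  (RR format only) */
    unsigned func;    /* oooooooo, bits 7..0   (RR format only) */
    unsigned imm13;   /* n...n,    bits 12..0  (immediate formats), raw */
};

/* Split a 32-bit instruction word into all candidate fields. */
struct ft_fields ft_decode(uint32_t insn)
{
    struct ft_fields f;
    f.pred   = (insn >> 28) & 0xF;
    f.opcode = (insn >> 23) & 0x1F;
    f.rt     = (insn >> 18) & 0x1F;
    f.ra     = (insn >> 13) & 0x1F;
    f.rb     = (insn >> 8)  & 0x1F;
    f.func   = insn & 0xFF;
    f.imm13  = insn & 0x1FFF;
    return f;
}

/* Build an {RR} instruction; the major opcode 00010 is from the listing. */
uint32_t ft_encode_rr(unsigned pred, unsigned rt, unsigned ra,
                      unsigned rb, unsigned func)
{
    assert(pred < 16 && rt < 32 && ra < 32 && rb < 32 && func < 256);
    return ((uint32_t)pred << 28) | ((uint32_t)0x02 << 23) |
           ((uint32_t)rt << 18) | ((uint32_t)ra << 13) |
           ((uint32_t)rb << 8) | (uint32_t)func;
}
```

For example, `ft_encode_rr(3, 1, 2, 3, 0x04)` would build an ADD of r2 and r3 into r1 under predicate 3, and `ft_decode` recovers the same fields from the resulting word.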

_________________
Robert Finch http://www.finitron.ca


Wed Mar 25, 2020 4:36 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Impressive! Do please start a thread about it and give the full run-down!


Wed Mar 25, 2020 6:29 am

Joined: Tue Dec 18, 2018 11:25 am
Posts: 43
Location: Hampshire, UK.
Another candidate for the OPC challenge?
An FPGA-based, Harvard-style 8-bit CPU in 66 lines of SpinalHDL.

https://justanotherelectronicsblog.com/?p=543


Thu May 14, 2020 6:46 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Great find! And I notice revaldinho gets a namecheck too.


Thu May 14, 2020 7:21 pm

Joined: Mon May 28, 2018 8:01 am
Posts: 7
I haven't been here for a couple of years, but here's something I designed last year (but haven't tested yet, even in simulation, so don't assume it's bug free):

Attachment: IMG_20200623_084209.JPG (schematic, 312.95 KiB)


This is the simplest processor design I could think of -- simple enough that I can hand draw it on a single sheet of A4 with a substantial amount of space left over.

It was designed for a specific purpose -- handling the input and output requirements of a front panel for an 8-bit system that uses a hexadecimal keypad and 7-segment displays for entering programs into memory -- and is customized to the needs of that purpose, but there is nothing that ties it to that purpose. Features are:

* Program counter is implemented with a simple counter. There are no jump instructions: every instruction is repeated in an infinite loop, but any instruction can be made conditional on one of 3 condition flags.
* Harvard architecture, with the programs being stored in a microcode-like structure (i.e. they directly code for control signals). Instructions are 14 bits wide.
* Three 8-bit general purpose registers (R0-R2), plus one special-purpose 8-bit register (T) that stores flags and temporary data. Additional GP registers can be added at a cost of 2 instruction bits per register (so a 4th register, making the instruction width 16 bits, would be a sensible addition; I left it out only because I have no application for it).
* 4-bit ALU operates on the low order bits of the source register and stores its output in the high order bits of the destination register. ALU supports 4 modes: pass through, increment, add carry bit, add bit 0 of T.
* Two operating modes for different data flows. PUSH copies the low order bits of T into the low order bits of the destination while copying the unused high order bits of the source to the low order bits of T (if src=dest this is a 12-bit-wide rotate left by 4 bits, with 4 bits passing through the ALU; you can also see it as a 4-bit rotate with carry). SWAP simply stores the unused high bits of the source in the low bits of the dest (if src=dest this is just a regular 4-bit rotate). If neither mode is selected, the low order bits of the output are cleared.
* LOAD operation stores an 8-bit external input in the T register
* SIGNAL operation strobes an output line intended to show that output port data (calculated via the same path as described in operating modes above) is currently valid. Signals are brought out to identify which register the output data will be stored in, allowing external circuits to mirror register contents.
* All instructions may be made conditional on any of the three unused flag bits in the T register. A simple extension would be to allow carry flag or other T register bits as a condition, but I didn't need this.
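The data path in the list above can be modelled in a few lines of C. This is a rough sketch based only on the description, not on the schematic: the type and function names are invented, the carry bit is kept in a separate field for simplicity (the post does not say exactly where it is stored), and conditional execution, LOAD and SIGNAL are left out.

```c
#include <assert.h>
#include <stdint.h>

enum alu_mode  { ALU_PASS, ALU_INC, ALU_ADD_CARRY, ALU_ADD_T0 };
enum flow_mode { FLOW_NONE, FLOW_PUSH, FLOW_SWAP };

struct fp_cpu {
    uint8_t r[3];   /* R0-R2 */
    uint8_t t;      /* T: low nibble used as temporary data */
    uint8_t carry;  /* assumption: modelled as a separate bit */
};

/* 4-bit ALU: operates on the low nibble of the source register. */
static uint8_t fp_alu4(struct fp_cpu *c, uint8_t low, enum alu_mode m)
{
    unsigned sum = low & 0x0F;
    switch (m) {
    case ALU_PASS:                           break;
    case ALU_INC:       sum += 1;            break;
    case ALU_ADD_CARRY: sum += c->carry;     break;
    case ALU_ADD_T0:    sum += c->t & 1;     break;
    }
    c->carry = (sum >> 4) & 1;  /* assumption: carry out of bit 3 is latched */
    return sum & 0x0F;
}

/* One instruction: ALU result goes to the high nibble of the destination,
 * the low nibble depends on the PUSH/SWAP mode. */
void fp_step(struct fp_cpu *c, int src, int dst,
             enum alu_mode am, enum flow_mode fm)
{
    assert(src >= 0 && src < 3 && dst >= 0 && dst < 3);
    uint8_t s      = c->r[src];
    uint8_t src_hi = s >> 4;
    uint8_t t_lo   = c->t & 0x0F;
    uint8_t hi     = fp_alu4(c, s & 0x0F, am);  /* uses T before it changes */
    uint8_t lo     = 0;                         /* neither mode: cleared */

    if (fm == FLOW_PUSH) {
        lo   = t_lo;                            /* T low -> dest low */
        c->t = (uint8_t)((c->t & 0xF0) | src_hi); /* src high -> T low */
    } else if (fm == FLOW_SWAP) {
        lo = src_hi;                            /* src high -> dest low */
    }
    c->r[dst] = (uint8_t)((hi << 4) | lo);
}
```

With src=dest and the ALU in pass-through mode, SWAP turns 0xA5 into 0x5A (a plain nibble rotate), while PUSH with T's low nibble at 0x3 turns 0xA5 into 0x53 and leaves 0xA in T, which matches the 12-bit rotate described above.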


Tue Jun 23, 2020 7:44 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Nice one periata! I see you posted a few months ago. Have you made any progress since then?

I've just seen another one-page drawn CPU, in this case a RISC-V on one page, hand drawn, then entered into and simulated in Logisim. See this video:
Hand Drawing a RISC V CPU and Playing Bad Apple on It

and this repo:
RISC-V Single Cycle CPU

(via)


Wed Oct 07, 2020 7:42 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
...
in this case a RISC-V on one page, hand drawn then entered into and simulated on logisim. See this video:
Hand Drawing a RISC V CPU and Playing Bad Apple on It
...

That's quite inspiring and entertaining, thanks for sharing!


Mon Oct 12, 2020 10:23 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Here's MiteCPU, a CPU in only 34 lines of Verilog, by Jeff Bush:

Quote:
This is my attempt to see how small I could make a useful processor. It is Harvard architecture: instructions and data are stored in separate memories. Data memory is 8 bits wide and has 256 locations. Instruction memory is 11 bits wide. Each instruction has 11 bits: a 3 bit opcode and 8 bit operand.
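As a quick illustration of that 11-bit format, a word can be split into its two fields like this. The sketch assumes the opcode occupies the top three bits and the operand the low eight; the quote does not state the ordering, and `mite_split` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

struct mite_insn {
    unsigned opcode;   /* 3 bits */
    unsigned operand;  /* 8 bits */
};

/* Split an 11-bit instruction word (assumed layout: opcode in bits 10..8). */
struct mite_insn mite_split(uint16_t word)
{
    assert(word < (1u << 11));  /* instruction memory is 11 bits wide */
    struct mite_insn i;
    i.opcode  = (word >> 8) & 0x7;
    i.operand = word & 0xFF;
    return i;
}
```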


Lawrie Griffiths has ported it to a page of nMigen (a Python HDL).


Wed Oct 28, 2020 8:22 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
XPROZ is a simple 16-bit, 3-address computer. The text is in German.
http://www.unibwm.de/ikomi/pub/index.htm
Judging by the motherboard it is a small computer, no PIE here.


Sun Jan 03, 2021 8:36 pm

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
This week I have been looking at the C simulations of some experimental and commercial CPUs.

1. Blue https://brainwagon.org/2011/07/07/a-bas ... hitecture/
2. PDP-8 https://github.com/KedalionDaimon/DEC-PDP-8-on-Arduino
3. 8080 https://github.com/companje/Altair8800/ ... er/i8080.c
4. Z80 https://github.com/MohammedRashad/ArduZ80
5. 6502 https://forum.arduino.cc/index.php?topic=193216.0
6. J1 https://github.com/samawati/j1eforth/blob/master/j1.c
7. Gigatron https://gigatron.io/media/Gigatron-manual.pdf p 57-58
8. Suite-16 https://github.com/monsonite/Suite-16/b ... orld_1.ino
9. SIMPL https://github.com/monsonite/SIMPL

These represent a range of CPUs with differing architectures, word lengths, numbers of instructions and complexity.

In order to standardise the simulations, I have chosen ones that will all run on an ordinary Arduino based on the ATmega328, with limited RAM.

I have avoided using the Arduino language routines such as setup(), loop() and Serial.xxx(), as these tend to be somewhat verbose.

Instead I have provided my own getchar(), putchar() and UART initialisation routines that run on the ATmega328 UART. These add an overhead of just 210-250 bytes to the code size.

What I will be looking for in this study:

1. Codesize (bytes) and lines of code for the simulation
2. Code density for some basic routines
3. Implementation
4. Complexity - which instructions consume a lot of code-bytes
5. Usefulness of the instruction set - what is essential and what is seldom used
6. Performance

The simulations for Blue, PDP-8, J1, Gigatron, Suite-16 and SIMPL are fairly concise. They are mainly based on a switch-case structure for the instruction decoder.
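As an illustration of that switch-case structure, here is a minimal sketch for a made-up 8-bit accumulator machine. It is not taken from any of the simulators listed above: the instruction encoding (high nibble opcode, low nibble operand), the opcode set and all names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy accumulator machine. */
enum { OP_HALT, OP_LOADI, OP_ADD, OP_STORE, OP_JNZ };

struct toy {
    uint8_t acc;      /* accumulator */
    uint8_t pc;       /* program counter, 4 bits used */
    uint8_t mem[16];  /* data memory */
    int     halted;
};

/* Fetch, decode and execute one instruction from a 16-word program. */
void toy_step(struct toy *m, const uint8_t prog[16])
{
    assert(m && prog);
    uint8_t insn = prog[m->pc & 0x0F];
    uint8_t arg  = insn & 0x0F;
    m->pc = (uint8_t)((m->pc + 1) & 0x0F);

    switch (insn >> 4) {                 /* the instruction decoder */
    case OP_LOADI: m->acc = arg;                        break;
    case OP_ADD:   m->acc = (uint8_t)(m->acc + m->mem[arg]); break;
    case OP_STORE: m->mem[arg] = m->acc;                break;
    case OP_JNZ:   if (m->acc) m->pc = arg;             break;
    case OP_HALT:
    default:       m->halted = 1;                       break;
    }
}

/* Run until HALT or until max_steps instructions have executed. */
int toy_run(struct toy *m, const uint8_t prog[16], int max_steps)
{
    int n = 0;
    while (!m->halted && n < max_steps) {
        toy_step(m, prog);
        n++;
    }
    return n;
}
```

Each extra instruction costs roughly one more `case`, which is why simulators for small instruction sets stay short while those for larger sets such as the 8080 or Z80 grow quickly.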

The commercial microprocessors (8080, Z80 and 6502) have greater complexity, mainly because of their larger instruction sets and the additional decoding required.

Hopefully this exploration will keep me busy for a few days.


Tue Jan 26, 2021 8:24 pm