 Bit Serial CPUs 

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Lockdown 2.0 here in the UK has inspired me to start looking at bit serial cpu architectures this week.

Almost exclusively, the immediate post-WW2 machines were bit serial - as this was the only way to keep the hardware cost manageable.

Bit serial allowed early machines to have long word sizes, such as the EDSAC with its 35-bit long words and 71-bit accumulator. The architecture was designed around the memory technology of the day - mercury acoustic delay lines - which were essentially recirculating acoustic shift registers.

Bit serial architectures probably reached "end of life" with the 1966 introduction of the PDP-8/S, which was about 20 to 30 times slower than the original PDP-8, but the hardware was reduced from about 1500 to 1000 transistors - and this was sufficient to allow DEC to sell the machine for under $10,000, when the original machine was selling for $18,500.

The advent of the PDP-8/I and PDP-8/L in 1968 with much lower cost TTL design, made the parallel architecture much more affordable - and the PDP-8/S was dropped in 1970 - with between 1000 and 1500 machines sold.

More recently there have been some new forays into bit serial architecture. Olof Kindgren has created a bit serial implementation of the RISC-V RV32I instruction set, called SERV, and proceeded to cram as many as 1000 SERV cores into an FPGA.

https://github.com/olofk/serv

He has also made some videos describing its architecture, operation and use cases:

https://diode.zone/videos/watch/0230a51 ... cc09411013

The main takeaway is that bit serial is a trade off between instruction execution time and hardware complexity.

At the heart of any cpu is the ALU.

Traditionally (50 years ago), the ALU might be designed as a bit-slice design using, for example, the popular 74181 4-bit ALU TTL device. To create a 16-bit machine, you would use 4 of these devices in parallel and run 16-bit busses between memory, registers and ALU. This led to large pcbs dominated by pcb tracks, and wirewrapped backplanes consisting of a rat's nest of wires.

With a bit serial architecture, almost all of this complexity and cost is eliminated. You can use the same simple ALU design, regardless of whether you are building an 8, 12, 16, 32 or 64 bit machine. You just need to provide longer shift registers to supply the operands, and a train of clock pulses to clock each bit in turn through the ALU.

Most cpus have a need for some common ALU instructions: AND, OR, XOR, INVert, ADD, SUB, SHL, SHR.

In a bit serial architecture, the AND, OR and XOR can be provided by single 2-input gates, using a multiplexer to choose which result to use. ADD and SUB can be done using a single full adder with a flip-flop to hold and re-insert any intermediate carry that is generated as each bit pair is clocked through.
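
As a rough illustration of the paragraph above, here is a minimal Python sketch of a one-bit ALU stepped once per clock, least significant bit first. The function name `serial_alu`, the operation names and the LSB-first ordering are assumptions made for the sketch. SUB is shown as A + (NOT B) + 1 with the carry flip-flop preset to 1, which is one common approach and not necessarily the one used in this design.

```python
def serial_alu(op, a, b, width=8):
    """Process two operands one bit per clock, LSB first.

    Returns (result, final_carry). All names here are illustrative.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # Assumed SUB scheme: A + (~B) + 1, so preset the carry flip-flop to 1.
    carry = 1 if op == "SUB" else 0
    result = 0
    for i in range(width):              # one loop pass = one clock pulse
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        if op == "SUB":
            bbit ^= 1                   # invert each B bit as it arrives
        if op == "AND":
            out = abit & bbit
        elif op == "OR":
            out = abit | bbit
        elif op == "XOR":
            out = abit ^ bbit
        elif op in ("ADD", "SUB"):
            out = abit ^ bbit ^ carry   # full adder sum
            carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry flip-flop
        else:
            raise ValueError(f"unknown op {op!r}")
        result |= out << i              # shift the result bit into the accumulator
    return result, carry


if __name__ == "__main__":
    print(serial_alu("ADD", 200, 100))
    print(serial_alu("SUB", 5, 3))
```

The `if/elif` chain plays the part of the multiplexer, and the only state carried from one clock to the next is the single `carry` bit, which is why the hardware stays the same size for any word length.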

The shift left and shift right operations can be performed within the accumulator shift register itself - if you use a universal shift register such as the 74xx299.

The ALU of a bit serial architecture thus reduces to a few TTL gates on a breadboard, or if you are implementing it on an FPGA, a handful of Logic Elements.

Testing such an ALU should be fairly straightforward. An Arduino or similar can be programmed to issue the instructions, set any signals for ALU and shift register control, and generate the clocks and timing signals. The Arduino can be used to emulate the rest of the cpu and memory system, whilst the ALU is being developed in isolation. The MSP430 family has some FRAM based devices that would make a good choice to provide a non-volatile memory subsystem for debugging and development purposes.

ALU Design.

If we start with the basic half-adder consisting of a 2-input XOR and a 2-input AND, we can see that the Sum is the XOR of A and B, and the Cout is the AND of A and B

[Attachment: halfadder.jpg]


Extending this to a full adder, so we can allow for a Carry_in to be included in the addition we get the following:

[Attachment: fulladder.jpg]


We now see that we still have access to an AND and an XOR function when Cin=0, but when we set Cin=1 we have access to OR and XNOR functions too.
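
The observation above can be checked exhaustively. This is a small Python sketch (the function name `full_adder` is just for illustration) that walks the truth table and confirms which two-input function appears on each output for each value of Cin.

```python
from itertools import product


def full_adder(a, b, cin):
    """Textbook full adder: returns (sum, carry_out) for single-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


for a, b in product((0, 1), repeat=2):
    s0, c0 = full_adder(a, b, 0)
    s1, c1 = full_adder(a, b, 1)
    assert s0 == a ^ b           # Cin=0: Sum behaves as XOR
    assert c0 == a & b           # Cin=0: Cout behaves as AND
    assert s1 == 1 - (a ^ b)     # Cin=1: Sum behaves as XNOR
    assert c1 == a | b           # Cin=1: Cout behaves as OR
    print(a, b, "Cin=0:", (s0, c0), "Cin=1:", (s1, c1))
```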

And now that we have a full-adder, addition and subtraction functions become fairly straightforward.

In my implementation of the full-adder below, I have used a quad XOR and a quad NAND. This eliminates the need for the extra OR gate, as this can be implemented as a NAND with inverted inputs. It also leaves spare gates for additional functions.

[Attachment: fulladder_2.jpg]
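
For readers without the schematic, here is one plausible XOR/NAND arrangement as a Python sketch. It is an assumption about the wiring, not a transcription of the attachment: the carry uses the identity OR(x, y) = NAND(NOT x, NOT y), so the two AND terms are simply left inverted and fed to a third NAND.

```python
def nand(x, y):
    """2-input NAND on single bits."""
    return 1 - (x & y)


def xor(x, y):
    """2-input XOR on single bits."""
    return x ^ y


def full_adder_xor_nand(a, b, cin):
    """Full adder from 2 XOR gates and 3 NAND gates (assumed arrangement)."""
    p = xor(a, b)                 # XOR gate 1: propagate term
    s = xor(p, cin)               # XOR gate 2: sum
    n1 = nand(a, b)               # NAND gate 1: inverted generate term
    n2 = nand(p, cin)             # NAND gate 2: inverted propagated carry
    cout = nand(n1, n2)           # NAND gate 3: OR of the two un-inverted terms
    return s, cout


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, full_adder_xor_nand(a, b, cin))
```

In this arrangement two of the four XOR gates and one of the four NAND gates remain unused, which matches the remark about spare gates.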


In the next post I will add logic for subtraction and negation.


Last edited by monsonite on Fri Nov 20, 2020 11:48 pm, edited 1 time in total.



Fri Nov 20, 2020 4:59 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
A two or three bit serial design may be the best compromise for gate usage.

You can't scale things like 30 years ago; signal delay is becoming slower and slower relative to gate switching speeds.

Serial CPUs mapped well to drum memory at the time. Modern DRAM reminds me of drum memory again: not random access any more. No matter how fast the cpu is, it still takes, say, 6 clocks to read the first byte of memory: 1 clock sync, 2 RAS and CAS, 2 data out delay, 1 clock sync again. I favor doing more in 1 clock cycle, rather than a faster clock.
Ben.


Fri Nov 20, 2020 9:14 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Nice post, Ken, thanks! I look forward to the next installment.


Fri Nov 20, 2020 9:45 pm

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Ed,

It occurred to me that a lot of peripherals that we use now rely on SPI, so my rationale for this project is to create a very small cpu which fits comfortably with SPI memory and devices.

There is a plethora of SPI memory devices available, SRAM, FRAM, Flash, all of which have a very similar access protocol and command set.

Historically, almost every cpu design has been dictated by the requirements of its memory technology, so this project is to create a cpu that is designed to work with 8 pin SPI and uSD memory.

At the heart of the cpu will be a minimal instruction set, bit serial cpu, designed specifically to meet the access requirements and command set of SPI devices.

SPI memory typically has block mode and continuous streaming modes. You only have to set up the address once and it keeps sending bytes until you stop clocking it.
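
To make the streaming behaviour concrete, here is a toy Python model of such a device. It assumes the common convention of a one-byte READ command followed by a 24-bit address (4 header bytes in total); real parts differ in address width and command set, so check the datasheet. The class name `SpiMemoryModel` and its `clocks` counter are invented for the sketch.

```python
HEADER_BYTES = 4  # assumed: 1 command byte + 3 address bytes


class SpiMemoryModel:
    """Toy model of a serial memory with auto-incrementing sequential read."""

    def __init__(self, contents):
        self.mem = bytes(contents)
        self.clocks = 0               # total SPI clock pulses spent so far

    def read(self, address, count):
        """One chip-select transaction: header, then `count` streamed bytes."""
        self.clocks += 8 * HEADER_BYTES       # command + address, 8 clocks per byte
        out = []
        for _ in range(count):
            out.append(self.mem[address % len(self.mem)])
            address += 1                      # device increments the address itself
            self.clocks += 8                  # 8 more clocks per data byte
        return bytes(out)


if __name__ == "__main__":
    streamed = SpiMemoryModel(range(256))
    streamed.read(0x10, 4)                    # 4 consecutive bytes, one header
    scattered = SpiMemoryModel(range(256))
    for addr in (1, 200, 7, 90):              # 4 scattered bytes, 4 headers
        scattered.read(addr, 1)
    print(streamed.clocks, scattered.clocks)
```

Under these assumptions, four consecutive bytes cost 64 clocks while four scattered single-byte reads cost 160, because every change of address pays for the header again.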

This is fine when you have a lot of consecutive bytes to transfer from A to B (such as video frame buffering) but it does not play nicely with modern language implementation, where you require random access, and frequent calling of subroutines, located randomly throughout the memory map.

So my approach is to have an SPI flash or FRAM be analogous to a disk drive, and have more conventional SRAM for program and data storage.

Upon power-up, the cpu boots, reads the first 64K (or more) bytes from the FRAM, and loads them into a 128K x 8 parallel RAM. FRAM can be accessed at 8MHz, with read and write speeds of about 2Mbits/s - so a complete 128K byte transfer will take only about 500ms at boot-up time.
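
The boot-time estimate is simple arithmetic, sketched here in Python using the roughly 2Mbits/s figure quoted above. The helper name `transfer_time_s` is made up for the example, and protocol overhead (command and address bytes) is ignored.

```python
def transfer_time_s(num_bytes, bits_per_second):
    """Time to move num_bytes over a serial link, ignoring protocol overhead."""
    return num_bytes * 8 / bits_per_second


t_128k = transfer_time_s(128 * 1024, 2_000_000)   # full 128K x 8 RAM
t_64k = transfer_time_s(64 * 1024, 2_000_000)     # first 64K bytes only

print(f"128K bytes: {t_128k * 1000:.0f} ms")
print(f" 64K bytes: {t_64k * 1000:.0f} ms")
```

128K bytes is 1,048,576 bits, so at 2Mbits/s the full load comes out at a little over half a second, and a 64K image at about half that.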

The MSP430FR series of microcontrollers are FRAM based, and for the purpose of initial experimentation, a low cost MSP430 devboard can provide the whole memory and debug support system for the experimental bit-serial cpu.

The latest TI "TTL" shift registers are limited to around 200MHz clock frequency - but even if we have to divide this by 8 or 16 for the bit serial cpu - we are still looking at a 10 to 20 MHz instruction execution frequency.

The planned approach for this project is as follows:

1. Use the Digital simulator to create a model with which to explore bit serial architectures.
2. Create a software simulator in C running on a Teensy 4.x to explore instruction set options
3. Build cpu on a breadboard or prototype pcb using "off the shelf" TTL parts
4. Use an external microcontroller to provide the memory and support system during the hardware debug phase
5. Transcode the architecture to a low cost FPGA using Verilog
6. Rinse and Repeat until the bit serial bug has been exorcised


Fri Nov 20, 2020 11:42 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
SD cards seem to have a gotcha. I can not figure out a way to put 2 devices on the same interface. Also often forgotten is a card-inserted switch. Things get messy with more than one device.
The Technical Articles Folder web site (if you have not seen it) has several interesting ideas on ALU and computer design. *Warning* transistors or 6502s may appear.
http://6502.org/users/dieter/index.htm
I was just there a few days ago, looking for BCD addition/subtraction, the one thing my cpu design is missing.
Ben.


Sat Nov 21, 2020 12:41 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Love the idea of a bit-serial cpu. I have toyed with trying to create a bit-serial cpu myself. I can not wait to see the result. There are few enough bits that the ALU could be implemented with lookup tables using a small high-speed ram chip.
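
A rough Python sketch of the lookup-table idea: a 32-entry table stands in for the small RAM, addressed by a 2-bit opcode plus the A, B and carry-in bits, with each entry holding the result bit and carry-out. The address layout, the opcode encoding and the names `build_lut` and `lut_alu` are all assumptions chosen for illustration.

```python
OPS = {"AND": 0, "OR": 1, "XOR": 2, "ADD": 3}   # assumed 2-bit opcode encoding


def build_lut():
    """Fill a 32 x 2-bit table: address = op(2) a b cin, data = cout out."""
    lut = [0] * 32
    for op in range(4):
        for a in (0, 1):
            for b in (0, 1):
                for cin in (0, 1):
                    if op == 0:
                        out, cout = a & b, 0
                    elif op == 1:
                        out, cout = a | b, 0
                    elif op == 2:
                        out, cout = a ^ b, 0
                    else:
                        total = a + b + cin
                        out, cout = total & 1, total >> 1
                    addr = (op << 3) | (a << 2) | (b << 1) | cin
                    lut[addr] = (cout << 1) | out
    return lut


LUT = build_lut()


def lut_alu(op_name, a, b, width=16):
    """Bit-serial ALU whose only logic is a table read per clock."""
    op = OPS[op_name]
    carry, result = 0, 0
    for i in range(width):                       # LSB first, one bit per clock
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        data = LUT[(op << 3) | (abit << 2) | (bbit << 1) | carry]
        result |= (data & 1) << i
        carry = data >> 1                        # fed back as next carry-in
    return result, carry


if __name__ == "__main__":
    print(lut_alu("ADD", 0x1234, 0x0FCD))
```

Changing the instruction set then only means reprogramming the table, at the cost of one memory access time per bit.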

_________________
Robert Finch http://www.finitron.ca


Sat Nov 21, 2020 3:41 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
After the next installment, look up the Mystery TTL parts 74LS384 and 74LS385.
While a serial ALU is simple, I do not know of a simple design for the control section.
Ben. The 74LS299 is a nice chip too.


Last edited by oldben on Sat Nov 21, 2020 10:06 am, edited 1 time in total.



Sat Nov 21, 2020 9:44 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Quite interesting, Ken, that if your new processor is parallel on the outside but serial on the inside, it shares something with Ferranti's F100-L. But also with several others, as discussed previously.

Of course, always, prior art is no reason not to embark on a journey.


Sat Nov 21, 2020 10:04 am

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Ben - Thanks for the heads-up on the 74LS384 and '385. I was not aware that these devices even existed, but I guess, as is the case with a lot of older TTL, it was created to fulfill a definite requirement - at a particular time in history. Sadly few of these "interesting" devices are around anymore - and certainly not recommended for new designs.

Rob - I first encountered the bit serial concept probably 4 years ago, when I studied the EDSAC architecture. I then came across it again in early electronic desktop calculators, some HP calculators and of course the PDP-8/S.

It was a clear engineering solution to the problem of "how do we make this cheaper?" or "how do we implement this at all?"

Some 50 years after the "Golden Age" of bit serial, we now have the advantage of low cost logic that can be clocked at 100 - 200 MHz. So having to take 16 clock cycles to add two numbers is not too much of a hit - you still have a machine that can run at 5 or 10 MIPS but with a fraction of the hardware.

However, to avoid complacency, it would be wise to think about primitive pipelining - fetching the next instruction whilst the current one is executing.

The logic may either be the latest TI "Little Logic" - a successor to TTL in a wide range of tiny SMT packages - intended for glue logic in mobile phones and other space-sensitive applications.

Or it can be in the form of low cost FPGAs - several of which now come with open source toolchains.

Many of the SPI memories offer quad SPI. This might be an incentive to think about nybble serial architectures and a 4-bit instruction word.
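
As a sketch of what nybble serial would mean for the adder, the Python below processes 4 bits per clock with a single carry bit held between steps. The function name `nibble_serial_add` and the low-nybble-first ordering are assumptions; this says nothing about how quad SPI actually orders its data.

```python
def nibble_serial_add(a, b, width=16):
    """Add two words 4 bits per clock, low nybble first.

    Returns (result, final_carry, clocks_used).
    """
    carry, result = 0, 0
    for shift in range(0, width, 4):         # one pass = one clock
        na = (a >> shift) & 0xF
        nb = (b >> shift) & 0xF
        total = na + nb + carry               # 4-bit adder with carry in
        result |= (total & 0xF) << shift
        carry = total >> 4                    # carry flip-flop for the next nybble
    return result, carry, width // 4


if __name__ == "__main__":
    print(nibble_serial_add(0x1234, 0x0FCD))
```

A 16-bit add then takes 4 clocks instead of 16, in exchange for a 4-bit adder and 4-bit wide data paths.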


Sat Nov 21, 2020 2:59 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Just had a bit of a senior moment. Was revisiting some old tabs, found tabs about the bit-serial SERV RISC-V cpu, thought I should post it up on anycpu, and of course it turns out I got those links from the head post on this thread! Anyhow, another video on that machine here.


Sun Nov 22, 2020 3:24 pm

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Some further research on bit serial technology shows that it was commonplace in the Japanese desktop calculators around 1968-1970.

Some of the first MSI ICs produced in Japan in the late 1960s were shift registers and serial adder/subtractor ICs intended for the calculator market.

The link below from a vintage calculator collector/enthusiast highlights a couple of such serial adder/subtractor devices - plus a link to the calculators they were built into.

http://madrona.ca/e/eec/ics/serialadders.html


Sun Nov 22, 2020 6:38 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
To my understanding, this project http://mynor.org takes the concept to the limit. It uses a single 'nor' gate to perform all ALU operations. So a 16 bit addition will take not just 16 cycles but a lot more, because even single bit operations are recycled through the one gate. Interestingly, the author has even made the 'nor' gate more prominent by implementing it with 2 transistors in the middle of the main pcb.
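
To give a feel for why the cycle count grows, here is a Python sketch that builds a one-bit full adder purely from calls to a single NOR function and counts how often that gate is used. It is a naive composition written for illustration and is not MyNOR's actual microcode or gate sequence; the names `NorGate` and `full_adder_nor` are invented.

```python
class NorGate:
    """The one and only gate; counts how many times it is exercised."""

    def __init__(self):
        self.uses = 0

    def __call__(self, x, y):
        self.uses += 1
        return 1 - (x | y)


def full_adder_nor(nor, a, b, cin):
    """Full adder where every logic function is derived from NOR alone."""
    def not_(x):
        return nor(x, x)

    def or_(x, y):
        return not_(nor(x, y))

    def and_(x, y):
        return nor(not_(x), not_(y))

    def xor_(x, y):
        return and_(or_(x, y), not_(and_(x, y)))

    p = xor_(a, b)
    s = xor_(p, cin)
    cout = or_(and_(a, b), and_(p, cin))
    return s, cout


if __name__ == "__main__":
    gate = NorGate()
    print(full_adder_nor(gate, 1, 1, 1), "gate uses for one bit:", gate.uses)
```

Even for a single bit the gate is exercised many times, so a 16 bit addition multiplies that cost by the word length, before counting any control overhead.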


Tue Nov 24, 2020 10:42 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
But you now need tons of microcode and look up tables.
I suspect the classic non-ALU computer is still the IBM 1620 CADET*.
Ben.
*CADET: Can't Add, Doesn't Even Try.


Tue Nov 24, 2020 11:29 pm

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
My bit serial exploration continues this weekend by way of H. Neemann's "Digital" simulator.

The first challenge was to create a source of correctly timed serial data. Almost every new processor starts with its memory sub-system - so I thought that would be a good place to start.

The circuit I derived I will refer to as the "Serial Sequencer".

The sequencer consists of a ROM, some counters to address it sequentially (with the option of a forced jump to a random address) and a parallel to serial shift register to serialise the data. There is also an additional counter, used to count the serial bits, and some combinatorial logic to provide the various load and latch signals.

Whilst this might sound fairly trivial - it has taken the best part of a weekend to derive the necessary clocks and control signals to get it to run reliably.

As a source of synchronous serial data, the sequencer ROM can be programmed within the simulator environment with a hex file, and if we are dealing with byte oriented data, the sequencer will output a serial "byte" every 8 cycles of the master clock. If we were dealing with a 16-bit word design, it would require 16 cycles of the master clock to output a word.
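
The behaviour of the sequencer can be sketched in a few lines of Python. This is a behavioural model only, assuming LSB-first output and leaving out the forced jump; the name `serial_sequencer` and the example ROM contents are invented for the sketch, and none of the real clock and control signal timing is modelled.

```python
from itertools import islice


def serial_sequencer(rom, word_bits=8, start=0):
    """Yield one serial bit per master clock from a ROM, LSB first (assumed)."""
    address = start                               # address counter
    while True:
        shift_reg = rom[address % len(rom)]       # parallel load of the next word
        for _ in range(word_bits):                # bit counter: word_bits clocks
            yield shift_reg & 1                   # serial output
            shift_reg >>= 1                       # shift towards the output
        address += 1                              # sequential addressing


if __name__ == "__main__":
    rom = [0x12, 0x34]                            # hypothetical ROM contents
    print(list(islice(serial_sequencer(rom), 16)))
```

Passing `word_bits=16` gives the 16-clock word case with no other change.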

In addition to the serial sequencer, I have devised a debug circuit, which is essentially a de-serialiser, fitted with 7-segment hex displays, in order that I can see the serial data in a hexadecimal format.

I have found that the best TTL device for de-serialising is the 74xx595, because not only does it contain a serial-in parallel-out shift register, but also an octal latch, which allows the contents of the register to be sampled and held, at the correct instant that it contains the full 8-bit serial packet.
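
Here is a small Python model of that shift-then-latch behaviour, which also shows what a mis-timed latch does to the displayed value. It assumes LSB-first data and models only the two registers, not the real device's pins or timing; the class name `Deserialiser595` and the `latch_phase` parameter are made up for the sketch.

```python
class Deserialiser595:
    """Behavioural model: 8-bit shift register plus an 8-bit storage latch."""

    def __init__(self):
        self.shift = 0    # contents of the shift register
        self.latch = 0    # contents of the output (storage) register

    def clock_bit(self, bit):
        # LSB-first stream (assumed): new bits enter at the top and move down.
        self.shift = (self.shift >> 1) | ((bit & 1) << 7)

    def latch_now(self):
        self.latch = self.shift   # sample and hold for the hex display


def to_bits_lsb_first(values):
    """Serialise bytes into a flat LSB-first bit list."""
    return [(v >> i) & 1 for v in values for i in range(8)]


def deserialise(bits, latch_phase=0):
    """Latch whenever (clock count % 8) == latch_phase; phase 0 is correct."""
    dev, seen = Deserialiser595(), []
    for n, bit in enumerate(bits, start=1):
        dev.clock_bit(bit)
        if n % 8 == latch_phase:
            dev.latch_now()
            seen.append(dev.latch)
    return seen


if __name__ == "__main__":
    stream = to_bits_lsb_first([0x12, 0x34])
    print([hex(v) for v in deserialise(stream, latch_phase=0)])
    print([hex(v) for v in deserialise(stream, latch_phase=4)])
```

With the latch aligned to the byte boundary the display shows the bytes that were sent; four clocks out, each latched value is stitched together from parts of adjacent bytes.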

Incorrect timing of the serial latch signal makes for some very confusing results.

The 74xx595 is available in a number of TTL family variants, including the 74LVC8T595, which not only can be clocked at 200MHz, but includes voltage level translators so that it can work within a variety of mixed voltage logic environments.

Having built up some reliable test and debug circuits, I can now progress with the rest of the bit serial design, including the ALU and the register file.

The ALU can be tested by arranging two synchronised serial sequencers to supply the test data to the A and B inputs of the ALU, with the hexadecimal display on the output to confirm that the ALU is operating correctly.

Whilst the bit-serial architecture is generally considered to be economical in terms of hardware, it is the interfaces between serial and parallel domains where the complexity reappears in the form of the overhead of serialisers and de-serialisers.

Whilst a serial SPI RAM might be an 8 pin device costing a couple of dollars, to recreate the same "serial memory" from conventional parallel memory and TTL takes between 6 and 10 packages - depending on the address space and the length of the memory word. Unfortunately 8-bit counters in TTL are no longer readily available and so must be constructed of 4-bit parts.

The next installment of "Bit Serial" will be when I get the ALU up and running. There will be a Github repository (very soon) for the progress to date and the TTL designs simulated in "Digital".

https://github.com/monsonite/Bit_Serial

There should be more than enough here to keep me going up until 2021 !


Sun Dec 06, 2020 2:00 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I note that an oscilloscope was a common way to view registers of a serial CPU (in the old days, when they would necessarily be recirculating, whether in mercury or nickel or other form) but your deserialiser and LED display sounds good too.


Sun Dec 06, 2020 3:53 pm