


 [ 12 posts ] 
 transcript: "Designing Propeller II - 16(?) 32-bit Cores" 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Chip Gracey of Parallax took part in an online chat, largely about the upcoming (always upcoming?) Propeller II, and a cleaned-up summary transcript has just appeared:
https://hackaday.io/post/58433

Quote:
Chip: Well...I've been into electronics since I was about 10. Self taught. Started Parallax right after high school with a friend from junior high. Parallax slowly grew to look like a company.

What else? Ah, working on a new chip called the Propeller II. It has sixteen 32-bit cores and has been much designed by the community on the Parallax forums.

I've been busy in Verilog for most of the last 17 years.

Everything has been a lot of fun. It was fun to get the Prop1 done, as it was a full-custom chip. I designed my own RAMs, ROMs, PLL's, logic, I/O pads, etc. That was a big project. It took 8 years. The Prop2 has been going on for 11 now, and I think it's done.


Quote:
Chip: The Prop2 will have 512KB of hub RAM and each processor has 4KB of dual-port RAM. Compare that to the Prop1's 32KB/2KB.

I just added a feature to the Prop2 that, in six clocks, can fetch a byte from hub memory, look it up in the local LUT RAM, and jump to code within the cog (processor) memory with a SKIP pattern, where it jumps over unwanted instructions. It means efficient bytecode execution for custom bytecode engines.


Quote:
Each cog has:
    Access to all I/O pins, plus four fast DAC channels
    512 longs of dual-port register RAM for code and fast variables
    512 longs of dual-port lookup RAM for code, streamer lookup, and variables
    Ability to execute code directly from register RAM, lookup RAM, and hub RAM
    ~350 unique instructions for math, logic, timing, and control operations
    2-clock execution for all math and logic instructions, including 16 x 16 multiply
    6-clock custom bytecode executor for interpreted languages
    Sustained 32-bits-per-clock hub RAM access for contiguous or looped reads/writes
    Ability to stream hub RAM and/or lookup RAM to DACs and pins, also pins to hub RAM
    Live colorspace conversion using a 3 x 3 matrix
    Pixel blending instructions for 8:8:8:8 data
    16 unique event trackers that can be polled and waited upon
    3 prioritized interrupts that trigger on selected events
    Hidden debug interrupt for single-stepping, breakpoint, and polling
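To make the bytecode feature Chip describes above a little more concrete, here is a toy Python sketch of the idea: fetch a byte, look it up in a table, jump to a shared routine, and skip the instructions that byte doesn't want. This is only a model under stated assumptions. The `(address, skip_mask)` entry layout, the handler routines, and every name in it are made up for illustration and are not the P2's actual encoding or instruction set.

```python
# Toy model of LUT-driven bytecode dispatch with a skip pattern.
# Entry layout, handlers and names are illustrative assumptions only.

def push_const(value):
    """Build an 'instruction' that pushes a constant onto the stack."""
    return lambda stack: stack.append(value)

def add(stack):
    b = stack.pop(); a = stack.pop(); stack.append(a + b)

def sub(stack):
    b = stack.pop(); a = stack.pop(); stack.append(a - b)

def neg(stack):
    stack.append(-stack.pop())

# "Cog memory": two shared routines, each ended by a return marker.
COG_CODE = [add, sub, neg, "ret",                 # arithmetic routine at address 0
            push_const(10), push_const(3), "ret"]  # constant routine at address 4

# Bytecode values (hypothetical).
PUSH10, PUSH3, ADD, SUB, RSUB = range(5)

# "LUT RAM": one (address, skip_mask) entry per bytecode.
# Bit i of skip_mask set means: skip the i-th instruction after the jump.
LUT = [
    (4, 0b010),   # PUSH10: skip push_const(3)
    (4, 0b001),   # PUSH3:  skip push_const(10)
    (0, 0b0110),  # ADD:    skip sub and neg
    (0, 0b0101),  # SUB:    skip add and neg
    (0, 0b0001),  # RSUB:   skip add, then run sub followed by neg
]

def run_bytecode(program, lut=LUT, cog_code=COG_CODE):
    """Execute a list of bytecodes and return the final stack."""
    stack = []
    for bytecode in program:                 # fetch a byte
        address, skip_mask = lut[bytecode]   # look it up in the table
        pc, slot = address, 0                # jump into "cog memory"
        while True:
            op = cog_code[pc]
            pc += 1
            skipped = (skip_mask >> slot) & 1
            slot += 1
            if skipped:
                continue                     # jump over unwanted instruction
            if op == "ret":
                break
            op(stack)
    return stack
```

For example, `run_bytecode([PUSH10, PUSH3, RSUB])` reuses the same arithmetic routine as `SUB` but lets the `neg` instruction run as well. The point of the skip pattern is that several bytecodes can share one routine instead of each needing its own copy.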


Mon Jul 03, 2017 5:58 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2153
Location: Canada
Looks interesting. I wonder if it can interface to external DDR ram.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 05, 2017 10:39 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
This might be a good place for us to collect information about the P2, if we can find it - instruction set, programmer's model, microarchitecture. It has been a moving target, and maybe still is moving. But they've had working models running on FPGA (which is to say, something P2-like but possibly not final) and in September they got silicon - again, of something. Maybe 1500 parts.

It seems that maybe 16 cogs didn't fly, and it's back to 8.

To start us off, here's a thread on the parallax forums:

It might be that somewhere in there is the latest information on what it is and how it works, but there are a great many threads.

Edit: a couple of other threads here on anycpu:


Fri Nov 09, 2018 1:59 pm

Joined: Tue Jan 15, 2013 10:11 am
Posts: 114
Location: Norway/Japan
The P2 was originally planned with 16 cogs and a certain amount of RAM, but it turned out that there wasn't enough die space available with the 180nm process. Either cogs or some RAM had to go, and after some discussion it ended up with 8 cogs.
2 cogs is not a typo: Chip has been thinking about various possibilities for lower-cost variants, and 2 cogs was one of them.

The interesting thing about the silicon samples is that they're way faster than simulations at OnSemi indicated. The chip is designed to handle 180 MHz over the full temperature range, and hopefully over 200MHz in a narrower range. But samples have been running at 340 MHz with no issues, not even active cooling.
Which means that it can support HDMI (which would need 250MHz IIRC), so Chip designed in HDMI support for the next spinout.
Two hardware problems were found: one made initial reset tricky due to a pin which would need a pull-up (IIRC); the other was a sign extension issue in Verilog which worked differently in silicon than in the FPGA. That only affected a small subset of the functionality - NTSC output, I think it was.
The samples were good enough to be used for initial evaluation boards (a small batch of a simpler board has already been distributed to early testers, many of them in Australia and New Zealand).
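
As a rough check on the "HDMI needs about 250MHz" figure above, here is a small Python sketch of the arithmetic. It assumes the basic 640x480 at 60 Hz mode (standard pixel clock 25.175 MHz), the 10-bit TMDS symbol used per 8-bit value on each HDMI/DVI data channel, and that the chip would shift out one serial bit per system clock. That last point is my assumption, not something confirmed in this thread, and the function names are made up.

```python
PIXEL_CLOCK_HZ = 25_175_000   # standard 640x480 @ 60 Hz pixel clock
TMDS_BITS_PER_PIXEL = 10      # TMDS sends each 8-bit value as a 10-bit symbol

def required_bit_clock_hz(pixel_clock_hz=PIXEL_CLOCK_HZ):
    """Serial bit rate per TMDS data channel for a given pixel clock."""
    return pixel_clock_hz * TMDS_BITS_PER_PIXEL

def can_drive(system_clock_hz, pixel_clock_hz=PIXEL_CLOCK_HZ):
    """True if the system clock can keep up, assuming one bit per clock."""
    return system_clock_hz >= required_bit_clock_hz(pixel_clock_hz)

if __name__ == "__main__":
    print(required_bit_clock_hz() / 1e6, "Mbit/s per channel")
    for mhz in (180, 200, 340):
        print(mhz, "MHz:", can_drive(mhz * 1_000_000))
```

Under those assumptions the 180 MHz design target falls short, while the 340 MHz seen on the samples leaves comfortable headroom.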


Last edited by Tor on Fri Nov 09, 2018 5:03 pm, edited 1 time in total.



Fri Nov 09, 2018 2:40 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Oh, great, you're plugged into progress on those forums? Do let us know of interesting developments!

If the P2 is now only 8 cores, or even cut down further, are there specific architectural advances which still make it attractive compared to P1?


Fri Nov 09, 2018 4:18 pm

Joined: Tue Jan 15, 2013 10:11 am
Posts: 114
Location: Norway/Japan
Just off the top of my head (there's much more):
- The P2 is much faster (P1: 80MHz, P2: 180MHz, design frequencies. The P1 easily does 100MHz, the P2 up to 340MHz so far)
- The P1 has 32 I/O pins, the P2 has 64 I/O pins
- The P2 I/O pins are "smart pins", which means that the pins have some intelligence and can be programmed in various self-contained ways
- The P2 has more HUB memory (P1: 32KB, P2: 512KB. I think. It was planned with 1MB, but maybe there wasn't enough die.)
- The P2 also has LUT RAM, which can be used for various things, e.g. streaming video. The P1 doesn't have that.
- The P1 has the Spin interpreter in ROM, the P2 instead has an SD bootloader (not sure if there's also flash and/or serial bootloader fallbacks), and Forth (w/monitor) included in ROM. Spin has to be loaded, as it's not in ROM. (Of course both processors have other language options)
- P2 includes CORDIC for math
- P2 includes a PRNG (pseudo-random number generator)
- Packaging is different. The P1 exists in various variants, including a 40-pin DIP. That's not possible with the P2, it needs many more pins.
And there are more differences too. Functionality-wise the P2 is a superset of the P1.

Then of course the whole Propeller concept should be explained.. too much for today, but if there ever was a self-contained do-it-all MCU then that's the Propeller. You can literally connect a kbd, a monitor, input and output just about directly to the chip, add power, and start programming.

The smart pins of the P2 are something I haven't seen elsewhere. I just found a list of what they can do:
Quote:
Each pin has the following functions:

8-bit, 120-ohm (3ns) and 1k-ohm DACs with 16-bit oversampling, noise, and high/low digital modes
Delta-sigma ADC with 5 ranges, 2 sources, and VIO/GIO calibration
Logic, Schmitt, pin-to-pin-comparator, and 8-bit-level-comparator input modes
2/3/5/8-bit-unanimous input filtering with selectable sample rate
Incorporation of inputs from relative pins, -3 to +3
Negative or positive local feedback, with or without clocking
Separate drive modes for high and low output: logic/1.5k/15k/150k/1mA/100uA/10uA/float
Programmable 32-bit clock output, transition output, NCO/duty output
Triangle/sawtooth/SMPS PWM output, 16-bit frame with 16-bit prescaler
Quadrature decoding with 32-bit counter, both position and velocity modes
16 different 32-bit measurements involving one or two signals
USB full-speed and low-speed (via odd/even pin pairs)
Synchronous serial transmit and receive, 1 to 32 bits
Asynchronous serial transmit and receive, 1 to 32 bits, up to clock/3


Edit: I forgot - the P2 also has 'HUB exec', it can execute code directly from HUB (common) memory, while the P1 can execute only from COG memory (which the P2 can too, of course). To execute larger programs in the P1 it was necessary to use various (ingenious) schemes to copy chunks of code from hub to cog memory and then execute. The P2 avoids that, thus it's possible to run larger programs much faster.


Last edited by Tor on Fri Nov 09, 2018 11:09 pm, edited 1 time in total.



Fri Nov 09, 2018 10:58 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
That's very impressive - thanks for the lists!

I've only just twigged that LUT memory is meant for look-up tables. (In the FPGA world, it would be memory that's constructed from look-up tables.)


Fri Nov 09, 2018 11:05 pm

Joined: Tue Jan 15, 2013 10:11 am
Posts: 114
Location: Norway/Japan
LUT does mean look-up table, but in typical Propeller fashion everything can do much more.. it's even possible to execute code out of the LUT.


Fri Nov 09, 2018 11:08 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
So the LUT acts as a third memory space? Is it now shared in some way, also allowing synchronised inter-cog communications?


Sat Nov 10, 2018 7:05 am

Joined: Tue Jan 15, 2013 10:11 am
Posts: 114
Location: Norway/Japan
Yes, it's a third memory space - and I *think* it's dual ported and available between adjacent cogs, but I'm not absolutely certain - there was a lot of discussion some time ago because then 'all cogs are the same' wouldn't be true anymore. I'm not sure if the docs I have are up to date. This should be more clear soon I think, as people start using the new silicon.


Tue Nov 13, 2018 11:21 am

Joined: Tue Jan 15, 2013 10:11 am
Posts: 114
Location: Norway/Japan
A follow-up..
The LUTs are indeed shareable between two cogs (and dual-ported).
Code can execute from local COG memory (2KB), LUT (2KB), and HUB (512KB)
The HUB RAM is divided into 8 blocks, and each block is available to one COG at each cycle, so all eight COGs can access RAM at the same time, just not the same RAM. Next clock, the 'eggbeater' HUB turns and the COGs get access to the next block. As a COG can read a 32-bit word in one cycle, by arranging the words just so, the COGs can read 8 words in 8 clocks - all of the COGs at the same time. Of course, you don't have to carefully map this.. it looks 'sequential' to each COG. But if the next word you need isn't in the current block then you'll stall for up to eight clocks. There's also a streamer feature to set up fast access for blocks of data.
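
The rotation described above can be sketched as a small Python model. It assumes consecutive 32-bit words are interleaved across the 8 blocks (word index modulo 8) and that cog `c` sees block `(c + clock) mod 8`. The exact mapping and rotation order are my assumptions for illustration, and all function names are made up.

```python
NUM_COGS = 8   # also the number of hub RAM blocks, per the description above

def block_of(long_index):
    """Block holding a given 32-bit word (assumed: words interleave across blocks)."""
    return long_index % NUM_COGS

def block_visible(cog, clock):
    """Block a cog can reach on a given clock (assumed rotation order)."""
    return (cog + clock) % NUM_COGS

def wait_clocks(cog, clock, long_index):
    """Clocks a cog waits, starting at `clock`, until the word's block comes around."""
    return (block_of(long_index) - block_visible(cog, clock)) % NUM_COGS

def read_burst(cog, start_clock, first_long, count):
    """Return the clock on which each of `count` consecutive words is read."""
    clock = start_clock
    clocks = []
    for long_index in range(first_long, first_long + count):
        clock += wait_clocks(cog, clock, long_index)
        clocks.append(clock)
        clock += 1          # the read itself takes one clock
    return clocks

if __name__ == "__main__":
    # On any clock the eight cogs see eight different blocks.
    print([block_visible(cog, 0) for cog in range(NUM_COGS)])
    # Cog 3 waits for block 0 once, then reads one word per clock.
    print(read_burst(cog=3, start_clock=0, first_long=0, count=8))
```

In this model a sequential burst pays the wait only once, whereas a cog that has just missed its block waits seven clocks, so two reads from the same block land eight clocks apart.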

The engineering evaluation samples of the silicon turned up a couple of bugs, nothing very serious, so there's a re-spin coming up - maybe there'll be new chips in May. There are around a hundred evaluation boards out among testers and developers, using the sample chips.

Update: The latest estimate is that new samples will be available in June.


Last edited by Tor on Fri Mar 15, 2019 1:25 pm, edited 1 time in total.



Mon Mar 11, 2019 1:37 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Thanks for the update!


Mon Mar 11, 2019 1:51 pm