Hi Ed,
It would be nice to have multiplexers at various points, but the AUC logic family doesn't include any. The next closest option is the parts that offer hi-Z output, which include the SRAMs, 16-bit (quad 4-bit) buffers, and 16-bit (dual 8-bit) flip-flops. The firmware loaders are still being designed, but we know what they have to do:
Firmware injection points are shown on page 19 of the Dauug|36 preprint and page 1 of the Dauug|18 overview. (These are simplified diagrams, so where stuff is not drawn, the firmware load path isn't shown either.) In many instances the locations where firmware is brought in are not where it needs to be, but it can then follow existing flows within the CPU to get where it needs to go. So the firmware loader in effect replaces the control decoder ('36) or code SRAM ('18) when the system boots.
As for firmware storage, Dauug|18 needs at most 256Ki * 18 * 5 (size * width * chips) bits. "Inspectable" to me implies punched tape, but on "standard" (pretend with me) 8-bit tape with holes every 0.1 inch and no error checking, that's 4.7 miles of tape. Dauug|36 would use much more. Humanly achievable, but not practical for use and not very fast to boot.
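The tape estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant and function names are made up for illustration, and it assumes one 8-bit row of holes per 0.1 inch with no framing or error checking, as stated above.

```python
# Back-of-envelope check of the punched-tape length for Dauug|18 firmware.
# All names here are illustrative; the figures come from the paragraph above.
WORDS_PER_SRAM = 256 * 1024   # 256Ki addresses per SRAM
BITS_PER_WORD = 18            # SRAM data width
SRAM_COUNT = 5                # number of SRAM chips to load
TAPE_BITS_PER_ROW = 8         # "standard" 8-bit tape
ROW_PITCH_INCHES = 0.1        # one row of holes every 0.1 inch
INCHES_PER_MILE = 63360


def firmware_bits():
    """Upper bound on firmware size in bits (size * width * chips)."""
    return WORDS_PER_SRAM * BITS_PER_WORD * SRAM_COUNT


def tape_miles():
    """Tape length in miles if every bit is punched with no overhead."""
    rows = firmware_bits() / TAPE_BITS_PER_ROW
    return rows * ROW_PITCH_INCHES / INCHES_PER_MILE


if __name__ == "__main__":
    print(firmware_bits())            # 23592960 bits
    print(round(tape_miles(), 2))     # a bit under 4.7 miles
```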
So as the next best option, I'll use a serial NOR flash IC. This will be the only IC on the board with non-volatile memory, so we'll know the bounds of where state can "hide" when the power is off. I chose a flash chip that supports booting in that it doesn't need a read command to start a data transfer when the power comes up. A power-ok input + clock will start outputting bits serially. We won't provide an electrical path to the flash from the CPU, only from the flash to the CPU, so it won't be possible for remote attackers to save modified firmware.
Once bits are streaming from the flash, the logic's not terribly hard. First up, we need data in parallel, so we fashion a shift register from 16-bit flip-flops. (The AUC family doesn't offer shift registers.) Then between some minimal combinational logic (soldered) and some precomputed control signals embedded within the firmware bitstream, we load the SRAMs, set the program counter, disconnect from the firmware loader, and start the CPU. This is the Dauug|18 boot process.
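As a rough software model of the serial-to-parallel step, the sketch below shifts bits into a register and latches a word every 18 bits. The function name, the 18-bit width, and the most-significant-bit-first ordering are assumptions for illustration only; the real loader's framing and embedded control signals are still being designed and are not modeled here.

```python
# Illustrative model of a shift register built from flip-flops:
# serial bits go in one at a time, parallel words come out.
def deserialize(bits, width=18):
    """Collect a serial bit stream into width-bit words (MSB first, assumed)."""
    mask = (1 << width) - 1
    words = []
    shift_reg = 0
    count = 0
    for bit in bits:
        # Each clock shifts the register left and brings in one new bit.
        shift_reg = ((shift_reg << 1) | (bit & 1)) & mask
        count += 1
        if count == width:
            # A full word is present on the parallel outputs; latch it.
            words.append(shift_reg)
            count = 0
    return words


if __name__ == "__main__":
    stream = [0] * 17 + [1] + [1] + [0] * 17
    print(deserialize(stream))  # [1, 131072]
```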
Booting Dauug|36 is more logically and electrically complex, so what we do with that is build a Dauug|18 onto the same board, boot the '18, then have the '18 read the '36 firmware from the same NOR flash (just continue the existing bitstream), and use that to initialize the '36. Then that same Dauug|18 branches to code that lets it implement I/O (bit banging, all that) for the '36.
To the extent we need address counting hardware for the firmware loader, this will use Galois linear feedback shift registers (LFSRs) to keep the component count small. So the order of the firmware words won't be linear in the NOR flash; they will follow the LFSR's ordering instead.
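To make the LFSR ordering concrete, here is a small sketch of a Galois LFSR used as an address counter, plus the reordering a flash image would need so that words arrive in the order the counter visits them. The 4-bit width and the tap mask 0b1100 are chosen only to keep the example small; the real widths, polynomials, and seed are not specified above. Note that an LFSR never reaches the all-zeros state, so address 0 would need separate handling in this model.

```python
# Illustrative Galois LFSR address counter (right-shifting form).
def galois_lfsr_sequence(seed, taps, width):
    """Return the states visited, starting at seed, until seed recurs."""
    mask = (1 << width) - 1
    start = seed & mask
    if start == 0:
        raise ValueError("an LFSR seeded with zero never leaves zero")
    state = start
    sequence = []
    while True:
        sequence.append(state)
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps  # feedback is XORed in only when a 1 shifts out
        if state == start:
            return sequence


def flash_image(firmware, order):
    """Lay out firmware words in the order the LFSR will address them."""
    return [firmware[address] for address in order]


if __name__ == "__main__":
    order = galois_lfsr_sequence(seed=1, taps=0b1100, width=4)
    print(order)  # [1, 12, 6, 3, 13, 10, 5, 14, 7, 15, 11, 9, 8, 4, 2]
    firmware = [100 + address for address in range(16)]
    print(flash_image(firmware, order)[:3])  # [101, 112, 106]
```

The hardware appeal is that each step is one shift plus a few XOR gates, with no carry chain as in a binary counter.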
As for the 6-bit ALU slices, the '18 and '36 work a little differently, so this description is approximate. Each slice is an SRAM that works like an EPROM as you suggested, except it's 10 times as fast (SRAM is faster than EPROM). Each slice can compute any function of 18 input bits, and we get up to 18 output bits. That gives us, for the input, a 6-bit left operand, a 6-bit right operand, and 64 functions the slice can implement (there are 6 bits available to choose which function).
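The lookup-table idea can be sketched in Python as a 2^18-entry table indexed by function, left operand, and right operand. The address layout (function in the top 6 bits, then left, then right) and the function numbers are assumptions made up for this example, not the actual Dauug encoding.

```python
# Illustrative model of one ALU slice as a 256Ki-entry lookup table.
# Function codes below are hypothetical.
FN_ADD, FN_AND, FN_XOR = 0, 1, 2

FUNCTIONS = {
    FN_ADD: lambda left, right: left + right,  # 7-bit result keeps the carry
    FN_AND: lambda left, right: left & right,
    FN_XOR: lambda left, right: left ^ right,
}


def slice_address(function, left, right):
    """Pack 6 + 6 + 6 input bits into an 18-bit SRAM address (assumed layout)."""
    return ((function & 0x3F) << 12) | ((left & 0x3F) << 6) | (right & 0x3F)


def build_slice_table(functions):
    """Precompute the SRAM contents; unused function codes stay zero."""
    table = [0] * (1 << 18)
    for code, fn in functions.items():
        for left in range(64):
            for right in range(64):
                # Each entry may use up to 18 output bits.
                table[slice_address(code, left, right)] = fn(left, right) & 0x3FFFF
    return table


def slice_lookup(table, function, left, right):
    """One 'ALU operation' is a single memory read."""
    return table[slice_address(function, left, right)]


if __name__ == "__main__":
    table = build_slice_table(FUNCTIONS)
    print(slice_lookup(table, FN_ADD, 63, 1))            # 64
    print(slice_lookup(table, FN_XOR, 0b101010, 0b111111))  # 21
```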
Functions you'll find in these slices include 6-bit add, subtract, AND, NAND, OR, NOR, XOR, XNOR, NOT, left OR not right, right OR not left, left AND not right, right AND not left, exactly left, exactly right, NOT left, NOT right, FALSE (ignores operands), TRUE (ignores operands). Those are the easy ones.
Harder functions to understand: 6-bit multiply produces a 12-bit result. So there's a function to produce the low 6 bits of the product, and a different function for the high 6 bits. Popcounts involve some serious gymnastics; the operation to count bits within a 6-bit slice is easy, but further operations are needed to combine the slices for a final total. Others include carry adjustments for addition and subtraction, magnitude compare, minimum and maximum, S-box operations for hashing, pseudorandom number generation, and possibly block ciphers, bit permutations, various unary functions, special instructions to accelerate full-word multiplication and division, and shifts and rotates.
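Two of these are easy to show in miniature. The sketch below splits a 6-bit multiply into separate "low half" and "high half" functions, and counts bits within one 6-bit subword. The function names are invented for illustration, and the step that combines per-slice popcounts into a final total is not modeled.

```python
# Illustrative 6-bit slice functions; names are hypothetical.
def mul_low(left, right):
    """Low 6 bits of the 12-bit product of two 6-bit operands."""
    return ((left & 0x3F) * (right & 0x3F)) & 0x3F


def mul_high(left, right):
    """High 6 bits of the 12-bit product of two 6-bit operands."""
    return (((left & 0x3F) * (right & 0x3F)) >> 6) & 0x3F


def popcount6(value):
    """Number of 1 bits within a single 6-bit subword."""
    return bin(value & 0x3F).count("1")


if __name__ == "__main__":
    # 63 * 63 = 3969 = 62 * 64 + 1
    print(mul_high(63, 63), mul_low(63, 63))  # 62 1
    print(popcount6(0b101101))                # 4
```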
Why we transpose (you wrote "permute" which is accurate but less specific) 18- and 36-bit words: because otherwise, we're stuck. Suppose I want to rotate a word left 1 bit, so ABCDEF GHIJKL MNOPQR needs to become BCDEFG HIJKLM NOPQRA. The ALU is bit sliced, so there's no way to move G from the middle to the left subword, M from the right to the middle subword, or A from the left to the right subword.
Here's how the rotate works in real life: ABCDEF GHIJKL MNOPQR is rotated locally 1 bit left in each subword, becoming BCDEFA HIJKLG NOPQRM. That's legal, no bit crosses a subword boundary. Then we do our transposition: the left, middle, and right two bits of each subword are relocated to the left, middle, and right subwords, respectively. How? Just copper circuit board traces that go where we want the bits. That gives us BCHINO DEJKPQ FALGRM.
What does that buy us? We were trying to solve the problem that we couldn't get the G, M, or A to cross into their correct subwords. But in the transposed form, all three letters are in the right subword, currently FALGRM. Now one of the bit slice operations is to rearrange the right subword only: FALGRM is replaced with FGLMRA, the left and middle slices leave their subwords alone, and the 18-bit word is now BCHINO DEJKPQ FGLMRA.
We now transpose a second time; again, that's just done using copper. The transposition is self-inverse, but we made a small rotation in the right subword while transposed. Our BCHINO DEJKPQ FGLMRA now becomes BCDEFG HIJKLM NOPQRA, which is our original ABCDEF GHIJKL MNOPQR rotated left 1 bit as we desired. That's how rotates work, and that's one of several examples of why we need a way to transpose 18-bit words.
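The four steps of the rotate can be replayed on the lettered example with a short Python sketch. Words are modeled as 18-character strings so each bit's journey stays visible; the function names are made up for illustration and say nothing about how the real firmware encodes these operations.

```python
# Replay of the 1-bit left rotate using only per-subword operations
# plus a fixed transposition (the copper traces).
def split(word):
    """Break an 18-character word into left, middle, right subwords."""
    return [word[0:6], word[6:12], word[12:18]]


def rotate_each_subword_left(word):
    """Step 1: rotate every 6-bit subword left by one, independently."""
    return "".join(sub[1:] + sub[0] for sub in split(word))


def transpose(word):
    """Steps 2 and 4: pair j of every subword moves to subword j."""
    subs = split(word)
    return "".join("".join(sub[2 * j:2 * j + 2] for sub in subs) for j in range(3))


def fix_right_subword(word):
    """Step 3: within the right subword only, rotate positions 1, 3, 5."""
    left, middle, r = split(word)
    return left + middle + r[0] + r[3] + r[2] + r[5] + r[4] + r[1]


def rotate_left_1(word):
    """Full rotate: local rotate, transpose, fix right subword, transpose."""
    return transpose(fix_right_subword(transpose(rotate_each_subword_left(word))))


if __name__ == "__main__":
    word = "ABCDEFGHIJKLMNOPQR"
    print(transpose(rotate_each_subword_left(word)))  # BCHINODEJKPQFALGRM
    print(rotate_left_1(word))                        # BCDEFGHIJKLMNOPQRA
```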
Dauug|18 handles these transpositions awkwardly in the sense that it takes several instructions to complete tasks such as shifts and additions (addition also needs this transposition). Dauug|36 has three Dauug|18-like ALU layers with transpositions going into and out of the second layer, so one instruction can do the whole rotation, the whole addition, and so on. The penalty is there are a lot more components, and each instruction takes 4 clock cycles on the '36 instead of 1 clock on the '18. (The 4th clock cycle for the '36 is for register fetch and store, which the '18 doesn't have because it has no registers.)
Yes, assembly language support will precede any higher-level languages.
Marc