 M16C5x - PIC16C5x-compatible FPGA soft-core processor 

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
M16C5x is an implementation of a soft-core processor in an FPGA using a PIC16C5x-compatible processor core. Initial release of the M16C5x is based on a 12-bit instruction word RISC-like core which implements the instruction set of the PIC16C5x processors.

The processor core is implemented as a 2 clock cycle core in order to ease the use of the on-chip block RAMs for user program storage. The M16C5x core supports a total of 4kW (6 kB) of program storage unlike the Microchip PIC16F59 which supports a maximum of 2kW (3 kB). Instruction execution is compatible with the 12-bit instruction width PIC16 family of processors.

Accessing the extended program memory implemented in the M16C5x soft-core processor will require programmer intervention. Initial testing shows that the M16C5x processor executes instructions in a manner similar to that of the Microchip components. The MPLAB tools, assembler and simulator, have been used to generate, assemble, and simulate test programs for this purpose. However, the ROM initialization file and any custom I/O configurations cannot be generated or tested using MPLAB. For final testing, the RTL simulator will need to be the tool of choice.

Also, the core implements the I/O port tristate control registers and the I/O port registers as external to the core. These registers are accessed using dedicated read and write enable signals, and separate input and output data buses. This approach to I/O allows the implementer greater flexibility and speed in connecting custom peripherals to the P16C5x processor core used in the M16C5x.

The flexibility of this approach is demonstrated in the connection of the SPI Master interface to the TRIS C and Port C output and input registers. The TRIS C register, a write-only register, functions as a write-only control register for the SPI Master I/F. It controls the SPI clock generator, the shift direction, and whether SPI input data is to be buffered or discarded. Writes to Port C fill the command/output data FIFO, and reads from Port C remove data from the input data FIFO. The FIFO flags are mapped to Port A input. This provides a very compact and efficient buffered SPI interface.

In the near future, the UART will be added using TRIS B and Port A input and output registers. In its present state, with the SPI interface, program memory, and temporary registers for TRIS A, TRIS B, Port A input, and Port B input and output, the M16C5x requires only 63% of an XC3S50A-4VQG100I FPGA. The remaining 37% should be sufficient to implement the UART and possibly a serial boot loader.

A repository for this project has been established on GitHub.

_________________
Michael A.


Last edited by MichaelM on Thu Jul 14, 2016 6:35 am, edited 2 times in total.



Thu Jul 04, 2013 2:02 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Interesting - for someone quite unfamiliar with PIC, can you please say a little about how you expose that larger memory to the CPU? Is it perhaps the case that the instruction set readily accesses all of the memory, but the toolchain doesn't expect it?

Cheers
Ed


Thu Jul 04, 2013 9:35 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
The PIC processor families are Harvard architecture devices: program and data spaces are separately addressed. They are divided into three general categories of parts according to their instruction word width: 12, 14, or 16 bits.

The M16C5x soft-core microcomputer and its corresponding P16C5x processor core are representative of the 12-bit instruction width family. In this family, the instruction has an operand field that is used for arithmetic literals (8 bits), "register file" addresses (5 bits), subroutine address (8 bits), or a portion of the destination of an absolute jump (9 bits).

Since the processor arithmetic core is an 8-bit ALU, holding an 8-bit literal value in the instruction word is readily understood. The addressing of data registers (memory-mapped processor peripherals, processor status register, and I/O port control, input, and output registers, and general registers or RAM) is another matter altogether. Furthermore, the address field in the instruction is not able to directly address all of program memory.

In the case of data memory, the instruction word contains a direct address field that can address 32 locations. The first 8 locations are assigned to various memory-mapped processor registers. The second 8 data memory locations are general purpose registers, i.e. RAM, on most members of the family (including the M16C5x). Some members of the family use data memory locations 8 and 9 for additional I/O ports, so only 6 RAM locations, A-F, are available in these processors. Data memory addresses 16-31 are used for RAM in all members of this processor family. Using bank switching controlled by the upper bits of the FSR, File Select Register (Offset 4), up to 128 bytes of RAM can be addressed.

Data memory offset 0, INDF, is used for indirect addressing of data memory, with the pointer in the FSR providing the address. Regardless of the FSR value, the first 16 data memory addresses always map to the processor registers and the 6 or 8 RAM locations. Thus, if bit 4 of the data memory address is 1, then the bank-switched RAM is addressed; otherwise the processor registers and fixed RAM bank are addressed. Since bit 4 is fixed, only 7 bits of the FSR may be used to address bank-switched RAM, which limits the amount of bank-switched RAM to 128 bytes in 8 banks of 16 bytes.
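The data memory mapping described above can be sketched in a few lines of Python. This is only an illustration of the addressing rule, not the RTL of the P16C5x core. The function name, the ("fixed"/"banked", offset) return value, and the choice of FSR bits 7:5 as the bank select are assumptions made for the example.

```python
def resolve_data_address(f, fsr):
    """Map a 5-bit instruction address field to a data memory location.

    f   : 5-bit direct address field from the instruction word
    fsr : 8-bit File Select Register value
    Returns ("fixed", n) for the registers / fixed RAM at addresses 0-15,
    or ("banked", n) for an offset 0-127 into the bank-switched RAM.
    """
    f &= 0x1F
    if f == 0:
        # INDF: the pointer held in the FSR supplies the address instead.
        f = fsr & 0x1F
    if (f & 0x10) == 0:
        # Bit 4 clear: processor registers and fixed RAM, whatever the FSR holds.
        return ("fixed", f)
    # Bit 4 set: assumed bank select in FSR bits 7:5, 16 bytes per bank.
    bank = (fsr >> 5) & 0x07
    return ("banked", bank * 16 + (f & 0x0F))


print(resolve_data_address(0x03, 0xE0))  # status register, direct access
print(resolve_data_address(0x1F, 0xE0))  # last byte of the last bank
print(resolve_data_address(0x00, 0x35))  # indirect access through INDF
```

With 3 bank bits and 4 offset bits, the banked region covers 8 x 16 = 128 bytes, which matches the limit given above.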

The limited number of address bits in the instruction means that the programmer must be keenly aware of the program memory bank in which a subroutine or jump destination lies. In the processor status register, the 12-bit instruction members of the PIC16C5x family (PIC16C54, PIC16C57, and PIC16F59) have two bits that represent the upper two bits of the program counter. (The most significant bit is reserved, and the M16C5x employs it and a 12-bit (instead of an 11-bit) return address stack to expand the memory space to 4kW (6 kB). As currently being used, the M16C5x most closely resembles a PIC16C57 with expanded program memory.) Thus, the 9 direct bits in the instruction allow direct jumps within a 512 word page of program memory, and the status register bits select the page. Furthermore, since there are only 8 bits reserved in the instruction for the subroutine address, the processor inserts a 0 for the 9th bit and appends the bank bits from the status register. This imposes a limitation that subroutine entry points must be located in the first 256 locations of a 512 word program memory bank.
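The way jump and call destinations are formed can be illustrated with a short Python sketch. It assumes the page bits sit in bits 7:5 of the status register, as in Microchip's PIC16C5x STATUS layout (this thread does not state the bit positions), and that all three bits are used, as in the M16C5x extension to 4kW. The function names are made up for the example.

```python
PAGE_SHIFT = 5   # assumed position of the page bits in the status register
PAGE_MASK = 0x7  # three page bits: two standard plus the normally reserved one


def goto_target(status, k9):
    """Absolute jump: page bits from the status register plus 9 instruction bits."""
    page = (status >> PAGE_SHIFT) & PAGE_MASK
    return (page << 9) | (k9 & 0x1FF)


def call_target(status, k8):
    """Subroutine call: page bits, a forced 0 for bit 8, plus 8 instruction bits."""
    page = (status >> PAGE_SHIFT) & PAGE_MASK
    return (page << 9) | (k8 & 0xFF)


# Page 1 selected: a jump can reach 0x200-0x3FF, a call only 0x200-0x2FF.
print(hex(goto_target(0x20, 0x1FF)))
print(hex(call_target(0x20, 0xFF)))
```

The call case shows why a subroutine entry point has to sit in the lower half of a page: bit 8 of the destination is always 0.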

Interestingly, the lower 8 bits of the program counter are exposed as a programmer accessible register, offset 2. This allows computed gotos to be used.

Commercial tools from Microchip and others have extensive support for this oddball memory architecture: warnings are provided when jumps and calls cross page boundaries. In addition, CCS and others have built C compilers to support the memory architecture of these processors.

To scale the processor, Microchip had to increase the number of bits in the instruction word. In doing so, larger page sizes were allocated for the processor registers window, fixed RAM, and the bank switched RAM. I am not familiar with the particulars of these enhancements, but with additional bits available in the instruction for addressing data and program memory

The M16C5x is intended for small embedded control applications of the type in which the PIC family processors excel. I expect these applications to be programmed using PIC assembler, and it's for that reason that I tested out its compatibility with the free MPLAB tools. The simulator in that toolset can be used for testing and debugging most code segments. Code segments dealing with the application-specific features of the M16C5x, i.e. SPI and UART interfaces, are not supported by either the free Microchip tools or the low-cost CCS tools. In fact, the assumptions made by the CCS tools regarding the I/O port hardware are very specific to the PIC processors, which means that many of their I/O libraries would need to be re-written to support the M16C5x. Since it is possible to do that, the CCS PCW C compiler is a viable toolset for the M16C5x.

_________________
Michael A.


Fri Jul 05, 2013 1:05 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Thanks for the explanation - oddball barely covers it!
Cheers
Ed


Fri Jul 05, 2013 11:01 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
An update has been uploaded to GitHub for the M16C5x. A UART, which supports the SSP serial interface supported by the NXP LPC213x/LPC214x ARM processors, has been included and integrated into the previous release. With the previously integrated SPI Master interface, the internal UART and external SPI interface provide a good foundation for the M16C5x soft-core microcomputer.

During integration testing, a problem was detected in the BTFSC/BTFSS instructions. Previous testbenches did not do an exhaustive test, and the BTFSS instruction used in the test program operated correctly. The issue has been corrected in the current release. Further, the SSP UART was also found to have an issue in the interrupt generation and clearing logic. Previous use of the UART with the ARM LPC213x/LPC214x processors did not poll the UART Status Register to determine if the transmit FIFO has room available or is empty, or if the receive FIFO has data available, or if a receive timeout has occurred. Instead, the ARM polls an IRQ signal (or IRQ can be attached to an ARM interrupt input) and when that signal is asserted, the software can push data to or pull data from the transmit and receive FIFOs, respectively. Status bits in these transfer operations, in addition to the Tx/Rx data, can be used along with a state machine in the software to operate the UART without ever reading or polling the status register.

Because the M16C5x (and its P16C5x PIC16C5x-compatible core) does not support interrupts, the test program polled the UART status register to determine when another character could be written to the UART Transmit Data Register. This rapid and repetitive operation uncovered a race condition, i.e. synchronization issue, in the interrupt setting and clearing logic. The solution required that the clearing pulse in the UART clock domain be further qualified with the corresponding interrupt status bits in the UART status register in the SPI clock domain. This prevents an interrupt being generated in the UART clock domain and cleared before it is read/sampled in the SPI clock domain. All told, it was only necessary to add four qualifiers to the existing interrupt flag clearing pulses, and to add a register to sample and hold the status register bits in the SPI clock domain.

All source has now been released under the LGPL license. The Code/MPLAB subdirectory contains the complete MPLAB project files, source, and output files.

The M16C5x_Tst3.txt file is trimmed and doubled in length to generate the M16C5x_Tst3.coe memory initialization file. The test program simply keeps the UART transmitter filled with 0x55, "U", using polling. The program is able to do this for a UART transmitting at 3.6864 Mbps. (The processor core is executing at 29.4912 MHz, and the SPI is shifting data to the UART at the same rate. Each UART frame is 16 bits to support the transfer of status bits pertaining to each received character in the same SSP frame.) The default baud rate of 9600 has also been tested.

The only remaining item on tap is a boot loader for loading images stored in the SPI configuration EEPROM after the FPGA configuration image. The plan is for the Xilinx tools to add the MPLAB Intel HEX output after the bit image, and that will complete the basic tool set for this project. The soft-core will boot into the boot loader, download the data file, program it into the "program ROM" and transfer control to the address defined in the interrupt vector location of the MPLAB program file. This should allow the Xilinx IMPACT tool to be used to build the SEEPROM image and to program the "program ROM" image into an SPI configuration EEPROM attached to the FPGA.

_________________
Michael A.


Fri Jul 12, 2013 10:34 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
An update has been uploaded to GitHub for the M16C5x. The new release improves the parameterization of the M16C5x soft-core. It also changed the DCM to generate a 73.7280 MHz output, and the soft-core operates just fine :) at this frequency in the XC3S200A-4VQG100I FPGA on my M65C02 development board.

Another component of the update was the addition of a Block RAM Memory Map (BMM) file to the project. Xilinx's Data2MEM tool was used in conjunction with this file to directly patch the bitstream file instead of resorting to re-synthesizing the FPGA with a new memory initialization file.

A description of this process is being developed. It will be posted on a GitHub Wiki soon. It is sufficient to say that it is certainly possible to use Data2MEM for directly modifying the bitstream. Unfortunately, MPLAB does not output MEM files; it outputs various Intel Hex and/or Motorola S-record formatted programming files. Thus, a conversion tool is still needed to make the Data2MEM process more efficient. With a few simple command-line tools and a batch file or two, it should be fairly easy to take advantage of the Data2MEM tool and greatly speed the process of loading programs into the internal block RAMs of the FPGAs.

Further, I can also see a way to use different address spaces defined in the BMM file to load both program and data memories. I also see a way for programs for multiple processors to be patched using this technique if multiple soft-cores are implemented in a single FPGA.

Unfortunately, as a final note, it seems that the order of the data in the memory initialization file used by Verilog during synthesis is the reverse of the order in the MEM file used by Data2MEM. Without a purpose-built tool to handle this operation, I had to use the column mode of my Ultra Edit text editor to manually perform the necessary data reversal; sure do like Ultra Edit's column mode. :)

_________________
Michael A.


Sun Jul 14, 2013 10:14 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
It would be good to have a working recipe to use Data2MEM - even if some steps can only be described verbally.

In principle that kind of data munging can be portably implemented in, say, python, but there's always a bit of a stumbling block trying to get bulletproof descriptions of how to install a tool on each platform (even though it's pretty straightforward - it's the description that's difficult, not the action.)

Cheers
Ed


Wed Jul 17, 2013 4:23 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Agreed. I have started the write up, but haven't had time to pursue its completion during the week. Will look to finish it up on Friday and the weekend.

_________________
Michael A.


Wed Jul 17, 2013 7:38 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I have created a forum topic in the Programmable Logic forum on forum.6502.org that provides the results of my efforts to use Data2MEM to patch block RAMs with updates. The process allows FPGA designers to utilize the Data2MEM utility to modify the contents of the block RAM without performing a re-synthesis, MAP, PAR, and BitGen to incorporate the changes into internal block RAMs. With appropriate tools to convert soft-core processor assembler/compiler outputs into Data2MEM-compatible MEM or ELF files, the time savings provided by using Data2MEM can be considerable.

Created a small C command line utility, released on GitHub, that streamlines the process of converting the standard MPLAB Intel Hex programming file into Data2MEM-compatible MEM files. Updated the GitHub repository with the BMM (Block RAM Memory Map) file, the TCL script capturing the tool options, and the utility program.

Usage for the Utility program from a DOS box command prompt is:

type IntelHexFile.hex | IH2MEM > MEM-file.mem

Both IH2MEM.c and IH2MEM.exe are provided on GitHub. IH2MEM is a simple command line filter program using stdin and stdout.
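For readers without a C toolchain, the same kind of conversion can be sketched in Python. This is not a port of IH2MEM.c and makes no claim about its exact output. It assumes the hex file stores each 12-bit instruction word as two bytes, low byte first, at twice the word address, and that the MEM file consists of an "@" address tag followed by hexadecimal words. It does not handle any data reordering the BMM file may require. All function names are hypothetical.

```python
import sys


def parse_intel_hex(lines):
    """Return {byte_address: value} for the data records of an Intel HEX file."""
    memory = {}
    upper = 0  # upper 16 address bits, set by type-04 records
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError("not an Intel HEX record: %r" % line)
        raw = bytes.fromhex(line[1:])
        if sum(raw) & 0xFF != 0:
            raise ValueError("checksum error: %r" % line)
        count, addr, rtype = raw[0], (raw[1] << 8) | raw[2], raw[3]
        data = raw[4:4 + count]
        if rtype == 0x00:            # data record
            for i, value in enumerate(data):
                memory[(upper << 16) + addr + i] = value
        elif rtype == 0x01:          # end-of-file record
            break
        elif rtype == 0x04:          # extended linear address record
            upper = (data[0] << 8) | data[1]
    return memory


def to_mem_text(memory, words_per_line=8):
    """Pair bytes (low byte first) into 12-bit words and format them as MEM text."""
    words = {}
    for byte_addr, value in memory.items():
        word_addr, is_high = divmod(byte_addr, 2)
        words[word_addr] = words.get(word_addr, 0) | (value << (8 * is_high))
    out, row, previous = [], [], None
    for addr in sorted(words):
        if previous is None or addr != previous + 1:
            if row:
                out.append(" ".join(row))
                row = []
            out.append("@%04X" % addr)   # start of a new contiguous run
        row.append("%03X" % (words[addr] & 0xFFF))
        if len(row) == words_per_line:
            out.append(" ".join(row))
            row = []
        previous = addr
    if row:
        out.append(" ".join(row))
    return "\n".join(out) + "\n"


def main():
    """Filter usage: read Intel HEX on stdin, write MEM text on stdout."""
    sys.stdout.write(to_mem_text(parse_intel_hex(sys.stdin)))
```

Calling main() from a script gives the same stdin/stdout filter arrangement as the command line shown above.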

Edit: Refined the link to point to the post rather than the thread.

_________________
Michael A.


Last edited by MichaelM on Sat Jul 20, 2013 3:36 pm, edited 1 time in total.



Fri Jul 19, 2013 8:29 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Great - thanks for the writeup and the tools!


Sat Jul 20, 2013 7:07 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
It also appears that the -4 FPGA mounted on the board can be pushed to 88.4736 MHz even though PAR was constrained to 80 MHz. :)

Not something I would do normally, but decided to try it since almost everyone else appears to do so. :D Will try 103.2192 MHz next.

_________________
Michael A.


Sat Jul 20, 2013 10:01 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Well, it appears to work when pushed. The core as currently provided on GitHub has been operated at 58.9824 MHz (4x), 73.7280 MHz (5x), 88.4736 MHz (6x), 95.8464 MHz (6.5x), 103.2192 MHz (7x), and 110.5920 MHz (7.5x). I suppose this shows that Xilinx is somewhat conservative with its path delay estimates, and that the period constraint is satisfied at whatever extreme presents the worst case. At room temperature, the part appears to operate correctly at about 38% above the 80 MHz constraint.
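The DCM output frequencies can be tabulated with a short Python sketch. The 14.7456 MHz reference is an assumption inferred from 58.9824 MHz being the 4x setting; the thread does not state the reference frequency. The function name is made up for the example.

```python
from decimal import Decimal

REFERENCE_MHZ = Decimal("14.7456")   # assumed reference, inferred from 58.9824 / 4
CONSTRAINT_MHZ = Decimal("80")       # PAR period constraint quoted in the thread


def dcm_output_mhz(multiplier):
    """Output frequency in MHz for a given DCM multiplier."""
    return REFERENCE_MHZ * Decimal(str(multiplier))


for m in (4, 5, 6, 6.5, 7, 7.5):
    f = dcm_output_mhz(m)
    margin = (f / CONSTRAINT_MHZ - 1) * 100
    print("%4sx  %9.4f MHz  %+6.1f%% relative to 80 MHz" % (m, f, margin))
```

Decimal is used rather than float so that the products are exact to four decimal places.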

Sure is nice to use Data2MEM from within Bitgen to update the M16C5x's program memory in the Block RAMs.

_________________
Michael A.


Sat Jul 20, 2013 10:58 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Just wondering, did you make an effort to sensitise and exercise the critical path as reported by timing analysis?
(It may well be that you have a decent margin because temperature and voltage are not worst-case)
Cheers
Ed


Sun Jul 21, 2013 8:45 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
The critical paths, i.e. the paths having the longest reported delays, in each clock domain (Clk - CPU and UART, and SPI_Clk - internal/external SPI Serial Clk) are being exercised by the programmed application. On the CPU side, the longest reported path length is 12.427 ns and is for a path that goes through the instruction decoder and determines whether the register file address is direct or indirect. If there were an issue in this path, the data returned from the register file, which is in continuous use for counters and temporary values, would be incorrect. The longest path in the SPI clock domain is 15.5 ns, which is the time between the rising edges; the actual limit is 7.75 ns, or the time between rising and falling edges of the SPI clock.

The CPU is operating with internal block RAMs, so instruction fetches require two clock cycles to complete. Previously there were some strictly combinatorial logic paths in the ALU when the core was configured for single cycle operation. With the need for 2 cycle operation with the internal block RAMs, I broke these paths at the output of the ALU by inserting a register. Thus, all critical paths in the core are registered. There may be a logic delay imbalance in some paths, but I set the synthesis options to enable the use of forward and backward register balancing to deal with this issue.

I clearly have not set the period constraints in the UCF, or added specific constraints, to account for the 2 cycle nature of the implementation. I don't exactly know how to configure the timing constraints in the UCF to specify that the combinatorial path delays for paths controlled by the clock enable can be double the period constraint of the clock. The tools appear to be constraining delays on all paths to the period of the clock for that path. This means that in most cases, the signals have twice as long to complete their transitions before they are clocked into a register.

Thus, I hypothesized that I could cut the period of the clock to half of the constraint used for Synthesize/MAP/PAR. For the Clk domain, which covers the majority of the logic, that meant I should set the DCM for 147.4560 MHz operation. For the SPI Clk domain, that meant I should set the DCM to 129.0322 MHz operation. With the reference attached to the FPGA, I can generate 147.4560 MHz (10x) to test the hypothesis with the Clk domain, and I can only generate 117.9648 MHz (8x) or 125.3376 MHz (8.5x) to test the hypothesis for the SPI clock domain.
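The period and frequency figures behind this hypothesis are simple reciprocals, which a few lines of Python can check. The variable and function names are made up for the example; the input values are the ones quoted in this post.

```python
def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


clk_constraint_ns = 12.5     # Clk domain period constraint (80 MHz)
spi_half_period_ns = 7.75    # SPI clock rising-to-falling edge limit

# Two clock cycles per operation: the usable clock period is half the constraint.
clk_limit_mhz = period_ns_to_mhz(clk_constraint_ns / 2)
spi_limit_mhz = period_ns_to_mhz(spi_half_period_ns)

print("Clk constraint     : %.4f MHz" % period_ns_to_mhz(clk_constraint_ns))
print("Clk hypothesised   : %.4f MHz" % clk_limit_mhz)
print("SPI Clk hypothesised: %.4f MHz" % spi_limit_mhz)
```

The 10x setting of 147.4560 MHz falls under the hypothesised Clk limit, and both the 8x and 8.5x settings fall under the hypothesised SPI Clk limit.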

With the DCM set to generate 147.4560 MHz (10x), the CPU/SPI components operated, but the UART Tx output remained in the MARK state indicating that the UART (at the end of the SPI channel) was not operating or receiving data from the SSP Slave interface. Since this clock domain is constrained for 80 MHz, or 12.5 ns, 147.4560 MHz is just less than the 160 MHz limit I theorized might be the limit for this clock domain. Since the CPU and the internal SPI Master are generating poll cycles to the SSP UART, it appears that my hypothesis regarding this clock domain holds.

With the DCM set to generate 117.9648 MHz (8x), the UART Tx output generated the expected 3.6864 MHz data stream. Thus, at this frequency, components on both clock domains are functioning as expected. With the DCM set to generate 125.3376 MHz (8.5x), the UART Tx output generated the expected 460.4 kHz data stream. Thus, at both of the frequencies (whose constraint is set by the rising to falling edge limit of the SPI clock), the CPU (Clk), SPI Master (Clk), SSP Slave (SPI Clk), and SSP UART (SPI Clk and Clk) components are all operating as expected.

I think that my hypothesis for this design is supported by these tests, but I will have to look into how to set up the constraints in the UCF so that experiments are not required to set the operating frequency limits. These results do not apply to the M65C02 soft-core. That design, although it uses four cycles to accommodate external memory accesses, makes address computations in a single system clock cycle. That structure is not going to be as amenable to pushing the operating frequency as the M16C5x implementation.

There's always something to learn where FPGAs are concerned. ;)

Edit: Did not think of it when testing at 147.4560 MHz, but the SPI clock is derived from the 147.4560 MHz clock. Thus, went back and regenerated a bitstream for operation at this frequency, but changed the SPI clock divider setting in the test program. The result is that the UART Tx is operating as expected at a 1.8432 MHz bit rate, but the SPI transfer cycles are about the same as those at 73.7280 MHz because it is set for a divide by 4 rather than 2 (default). The SPI clock frequency at 125.3376 MHz was set for divide by 2, or 62.6688 MHz. At this frequency, the SPI clock has a period of 15.956 ns, which is greater than the minimum delay, 15.5 ns, that the timing analysis reported as the constraint for this clock. This means that 147.4560 MHz (10x) appears to be in the operating range of the M16C5x core.

_________________
Michael A.


Sun Jul 21, 2013 2:12 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
That's really very impressive!
I did once have to investigate a long path in a block on a shipped chip - this was in the very early days of static timing analysis - which, sinfully, crossed between blocks without being retimed and had turned out to be longer than the target clock cycle. As it turned out, the path was false: it could not be sensitised to the combinatorial worst case. Or, anyhow, I convinced myself this was so.
Cheers
Ed


Sun Jul 21, 2013 4:30 pm