 RTF64 processor 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Do you at any point need to divide by 5?
I do not think so. The cluster is allocated as a unit and indexed into at a stride of 20 bytes. But there is no code (yet) that tries to manipulate the allocated pages as an array. Because it is virtual memory, there is no guarantee that one group of pages will have any particular alignment relative to another group. There might be allocations of varying numbers of pages between the allocations for the page table.

Finished off the posit divider last night. The RTL code was not too bad, mostly taken from the PACoGen project. But I increased the size of the table to 2k entries, as that then uses a whole block RAM. The divider uses a Newton-Raphson (NR) approach that starts off with an eleven-bit approximation. It should be fairly fast, taking seven or eight clock cycles to complete a divide.

Added posit arithmetic to the compiler. It has been a major PITA to implement. It requires a 256-bit integer arithmetic class, and I have gotten bogged down in all the bit manipulation that crosses word boundaries. For the compiler, I increased the size of the table used for the divide to 64k entries. That gives a 17-bit approximation to start off with, so for 32-bit posits only a single NR iteration is required. I have not yet got the posit arithmetic working. More bit fiddling to do.

_________________
Robert Finch http://www.finitron.ca


Sat Oct 17, 2020 5:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the posit add/sub/multiply working in the compiler. It can almost parse numbers now; only the divider needs to be finished.

_________________
Robert Finch http://www.finitron.ca


Sun Oct 18, 2020 6:46 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Does that turn out to be a dedicated divider, or an iterative divider using multiplication?


Sun Oct 18, 2020 7:49 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Does that turn out to be a dedicated divider, or an iterative divider using multiplication?
It is iterative, Newton-Raphson. It uses a 17-bit approximation to start with for the software version (11 for hardware), so it converges very quickly. It takes only about 3 iterations for 64-bit accuracy. That's small enough, I figure, for a hardware version. It works out to about 30 DSP blocks, and there are about 700 DSP blocks in the FPGA, so the resource usage is probably okay.
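
A rough sketch of the seeded Newton-Raphson reciprocal described above, written in plain C with doubles rather than posits. This is not the PACoGen or compiler code: `seed_recip` and `nr_recip` are made-up names, and the seed is produced by truncating the true reciprocal to a given number of bits as a stand-in for the lookup table. Each pass squares the relative error, so the count of good bits roughly doubles (from an 11-bit seed: about 22, 44, 88).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the reciprocal table: approximate 1/d for
   d in [1,2) by truncating the true reciprocal to `bits` fractional bits. */
double seed_recip(double d, int bits)
{
    double scale = (double)(1ULL << bits);                 /* 2^bits */
    return (double)(uint64_t)((1.0 / d) * scale) / scale;  /* truncate */
}

/* Newton-Raphson refinement of the reciprocal: x <- x * (2 - d*x).
   If x = (1/d)(1 - e), the next x is (1/d)(1 - e*e). */
double nr_recip(double d, int seed_bits, int iterations)
{
    assert(d >= 1.0 && d < 2.0);
    assert(seed_bits > 0 && seed_bits < 53);
    double x = seed_recip(d, seed_bits);
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);
    return x;
}
```

With a 17-bit seed, a single pass of `nr_recip(1.5, 17, 1)` already lands within 1e-9 of 1/1.5, which is in line with one iteration being enough for 32-bit posits. The quotient is then the dividend times this reciprocal.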

Finally got the posit divider code working for the compiler. It took about two days of bit fiddling. I had to run it side by side with the Verilog version running in simulation and compare all the signals. There were not actually very many issues with the divider, but being off by just one in one spot caused things not to work. One slight difference from the Verilog version is the size of the reciprocal approximation table. The compiler uses a much larger table, because, well, why not?
It is now possible to write some C code that when compiled dumps posit values.

0x46487ed5110b460c

I am thinking about making a little posit number conversion program, but it would be a lot of work to make a web-based version. I was looking for such a thing while working on the classes to support posits.

I have the posits sharing the floating-point register set, so they are treated a bit like just another special floating-point format in the compiler. It saves having to come up with a pile more load and store instructions. It can just use the float loads / stores. Eventually, support for at least 16/32/64-bit posits and 32/64-bit floats would be provided.

_________________
Robert Finch http://www.finitron.ca


Wed Oct 21, 2020 8:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More compiler updates.

Arrays were added as a basic type. Previously they had been processed as a special struct type. Aggregate lists are now checked to see if the type is consistent across the list. If it is then the type is set to array.
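
A minimal sketch of that consistency check in plain C, not the compiler's actual code. The `TypeTag` values and `classify_aggregate` are hypothetical, and falling back to a struct type for mixed lists is an assumption based on how arrays were handled before.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags; the real compiler's type system is richer. */
typedef enum { TY_INT, TY_FLOAT, TY_POSIT, TY_STRUCT, TY_ARRAY } TypeTag;

/* Classify an aggregate list: if every element has the same type the
   list becomes an array, otherwise (assumed) it stays a struct. */
TypeTag classify_aggregate(const TypeTag *elems, size_t count)
{
    assert(elems != NULL && count > 0);
    for (size_t i = 1; i < count; i++) {
        if (elems[i] != elems[0])
            return TY_STRUCT;   /* mixed element types */
    }
    return TY_ARRAY;            /* all elements agree */
}
```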

Numeric literals in arrays were being output twice when literals were dumped. This was a storage space issue.

Mulling over the idea of switching the register file to a unified float/posit/integer register file. The different classes of registers are making the compiler really complex. Essentially, parts of the compiler are triplicated to support each register class. Having a unified register file would also help reduce the number of instructions for loads and stores. The drawback is that more registers would need to be available, likely meaning six bits for register specs. That means changing the bit specs for instructions again.

One issue with the compiler is that the graph-coloring register allocator does not distinguish between different classes of registers. Currently graph coloring is disabled. It needs to be updated to support multiple register classes. At the moment, it uses 1024 virtual registers. That could be increased to 1024 virtual registers for each register class, 1024 being essentially the same as an infinite number of regs.
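
One way the per-class numbering could look, as a sketch in plain C under assumptions: the `RegClass` names, the flat id layout, and all function names are hypothetical, not what the compiler does today. The point is that two virtual registers only compete for the same physical registers, and so only need an interference edge, when they belong to the same class.

```c
#include <assert.h>
#include <stdbool.h>

enum { VREGS_PER_CLASS = 1024 };   /* figure taken from the post */

/* Hypothetical register classes. */
typedef enum { RC_INT, RC_FLOAT, RC_POSIT, RC_COUNT } RegClass;

/* Map (class, index) to one flat virtual register id. */
unsigned vreg_id(RegClass rc, unsigned index)
{
    assert(rc < RC_COUNT && index < VREGS_PER_CLASS);
    return (unsigned)rc * VREGS_PER_CLASS + index;
}

/* Recover the class from a flat id. */
RegClass vreg_class(unsigned id)
{
    assert(id < RC_COUNT * VREGS_PER_CLASS);
    return (RegClass)(id / VREGS_PER_CLASS);
}

/* Only registers of the same class can need an interference edge. */
bool vregs_can_interfere(unsigned a, unsigned b)
{
    return vreg_class(a) == vreg_class(b);
}
```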

_________________
Robert Finch http://www.finitron.ca


Thu Oct 22, 2020 3:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Whoa! I just found out that the compiler did not reset the float precision indicator at the start of parsing a float or posit value. This means that subsequent constants would inherit the last set precision, probably not what is desired. It has been that way in the compiler for years now.
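
A sketch of the kind of fix involved, in plain C. The `Lexer` struct, the `Precision` enum, and the `h` / `s` suffix letters are all made up for illustration; the compiler's real suffixes and state are not shown in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical precision indicator kept in the lexer state. */
typedef enum { PREC_DEFAULT, PREC_HALF, PREC_SINGLE } Precision;
typedef struct { Precision precision; } Lexer;

/* Parse a constant such as "2.25" or "1.5h". The first assignment is
   the reset that was missing: without it, a suffix seen on an earlier
   constant would still be in effect for this one. */
double parse_float(Lexer *lx, const char *text)
{
    char *end;
    assert(lx != NULL && text != NULL);
    lx->precision = PREC_DEFAULT;        /* reset before each constant */
    double value = strtod(text, &end);
    if (*end == 'h')
        lx->precision = PREC_HALF;       /* made-up suffix */
    else if (*end == 's')
        lx->precision = PREC_SINGLE;     /* made-up suffix */
    return value;
}
```

Parsing `"1.5h"` and then `"2.25"` leaves the second constant at the default precision instead of inheriting the first one's.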

_________________
Robert Finch http://www.finitron.ca


Thu Oct 22, 2020 4:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I took a break from working on the compiler to work on the RTL code for the CPU. I would like to be able to use at least an overlapped pipelined version of the core, if not a superscalar one.

I turned the non-overlapped pipeline version of the core into an overlapped pipeline version by adding some pipelining registers and splitting up the state case statement. Each stage of the overlapped pipeline takes multiple clock cycles to complete. The stage taking the most clock cycles is memory. The pipeline does not advance until all stages have completed their work. The shortest path, assuming all stages use a minimum number of clock cycles, is four clock cycles to execute an instruction. Quite a bit better than the 14 clocks of the non-overlapped pipeline version.
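
To put numbers on the difference, here is a small model in plain C. It is only a sketch: the function names are hypothetical and the per-stage cycle counts in the usage note are invented to illustrate the idea, not taken from the core.

```c
#include <assert.h>
#include <stddef.h>

/* Non-overlapped: one instruction walks through every stage in turn,
   so its cost is the sum of the stage cycle counts. */
unsigned cycles_non_overlapped(const unsigned *stage, size_t n)
{
    unsigned total = 0;
    assert(stage != NULL);
    for (size_t i = 0; i < n; i++)
        total += stage[i];
    return total;
}

/* Overlapped: all stages work at once and the pipeline advances only
   when the slowest one has finished, so the cost per advance is the
   maximum of the stage cycle counts. */
unsigned cycles_per_advance(const unsigned *stage, size_t n)
{
    unsigned worst = 0;
    assert(stage != NULL);
    for (size_t i = 0; i < n; i++)
        if (stage[i] > worst)
            worst = stage[i];
    return worst;
}
```

For invented stage counts of {3, 2, 2, 4, 3}, the first function gives 14 and the second gives 4. A slow memory access raises the maximum and stalls every stage with it.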

If I did not know better, I would say synthesis hung. It has been running over 3 hours now. I was trying to get a size estimate for the core and I figured synthesis would be well under ½ hour.

_________________
Robert Finch http://www.finitron.ca


Fri Oct 23, 2020 3:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More RTL code updates, this time for the system-on-chip. The rgb2dvi and dvi2rgb components were modified to be used as a bus transfer mechanism. Modifications included a wider data path of 36 bits instead of 24. Data is encoded as 14/12B instead of 10/8B, as this is the most data the serial transceivers can handle. A seven-times clock is used in place of a five-times clock. There was some issue with the clock generation, but eventually a 300 MHz serial clock was chosen. This results in a transfer rate of 43x4 or about 170 MB/s. The fake sync and blanking signals still need to be refined.
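
The transfer rate arithmetic, spelled out as a small C sketch. It reads "43x4" as roughly 43 million words per second (the 300 MHz serial clock divided by seven) times four bytes moved per word; that reading, and the function names, are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Word (parallel) clock in Hz: the serial clock divided by the clock
   multiple, e.g. 300 MHz / 7. */
uint64_t word_clock_hz(uint64_t serial_hz, unsigned multiple)
{
    assert(multiple != 0);
    return serial_hz / multiple;
}

/* Payload rate in bytes per second, given the bytes moved per word. */
uint64_t bytes_per_second(uint64_t serial_hz, unsigned multiple,
                          unsigned bytes_per_word)
{
    return word_clock_hz(serial_hz, multiple) * bytes_per_word;
}
```

`word_clock_hz(300000000, 7)` is 42,857,142 Hz, and `bytes_per_second(300000000, 7, 4)` is 171,428,568, or about 170 MB/s.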

The core is too large to fit with much else in the target device, so the system-on-chip is being split onto two or more chips. The core is about 50,000 LUTs. It should fit nicely in the 63,000 LUT device.

_________________
Robert Finch http://www.finitron.ca


Sat Oct 24, 2020 3:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have done lots of experimentation in simulation with the HDMI output connected to an HDMI input internal to the FPGA. It is looking like it would be a real challenge to use the modified HDMI format to transfer data. Things get transmitted beautifully, but receive is another story. The serial stream does not get synchronized very quickly in processor terms. I tried shortening all the synchronization timing, but it still takes a long time. So, I am back to looking at other means of connecting things with a high-speed serial bus. I learned a bit about synchronous serial communications. I had hoped to be able to use “canned” components, but I may have to create some of my own.

At least I got the core simulating far enough that it gets to doing a store to the I/O area.

_________________
Robert Finch http://www.finitron.ca


Sun Oct 25, 2020 4:39 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Is that a one-off synchronisation after initial HDMI connection, or something which needs to happen repeatedly?


Sun Oct 25, 2020 5:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is that a one-off synchronisation after initial HDMI connection, or something which needs to happen repeatedly?
I do not think it would need to be repeated. At least not that often. Once the timing of the bus is known it should remain the same. But it might be a good idea to resynchronize every once in a while. It is similar to DDR RAM setup. I have modified parts of the HDMI components for use with a bus. Sync and blanking tokens were eliminated. There is just a single sync in use now. It might take millisecond(s) to configure the bus timing, but that's not a big deal if it is done only once at startup. There cannot be millisecond-type pauses for bus access while the system is running, though. I could not get the synchronization to work with the existing code (phase aligner), so I wrote a new one that should work much faster. It works differently than the HDMI standard now. I have set up the RTL code for a bus bridge to output sync continuously until it gets a locked status back.

Today I came up with a signal phase aligner for the xbus (external peripheral bus). This is one of the times I have written something in VHDL instead of Verilog, because it had to interface with another VHDL module. I am not as fluent in VHDL, and there is an issue with the names I chose for process blocks, so I had to leave a message on a forum asking why the name is invalid. Anyway, I started working on the compiler again while shelving the RTL code momentarily.
The compiler now supports default function / method arguments, which must be constant values, so it is possible to leave out arguments when a method is invoked.
I wrote this little test program demonstrating the use of default args:
Code:
posit sub1(int a = 10, int b, posit c = 210.17p)
{
  return (a + b);
}

int main(int a, int b)
{
  return (sub1(,2,));
}

_________________
Robert Finch http://www.finitron.ca


Mon Oct 26, 2020 6:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Pic showing the tail end of the alignment process as simulator waveforms.
Attachment: xbusAligned.png (xbus alignment)

_________________
Robert Finch http://www.finitron.ca


Mon Oct 26, 2020 9:45 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The xbus receiver and transmitter were made more complex because of the need to support multiple slave devices for each master and multiple masters for each slave. There could be multiple boards tied to an xbus channel. The slave receiver must detect when it is the chosen one; otherwise the slave transmitter output drivers should be disabled. The receivers must record the bus timing characteristics for the addressed slave or master so that when the source changes there is no need to go through the entire synchronization process again. The recorded values of the bus delay can be used, allowing a much faster resynchronization. The delay in the input serial converter can be adjusted and recorded. There does not appear to be a means to record and reset the bitslip. The bitslip is triggered to adjust serial input by whole bits, as opposed to the timing delay. Assuming the bus timing varies by only small amounts, the bitslip should not need to be adjusted, meaning resynchronization can be very fast (e.g. < 100 ns).
A control port was added to the xbus master bridge so that specific slaves can be selected by software. Synchronization of individual slaves can be done by writing to the control port.

It is about 1,100 LUTs for a transmit / receive pair. The plan is to support three pairs, so about 3,300 LUTs. I now recall that there are not enough TMDS signals on the FPGA module to support more than two pairs. I hope I got the pin usage correct.

It is back to the drawing board for the main cpu board. It was back in March/April when I was working on it and I am not sure now that I have the correct work directory. I had made a PCB layout which my dad turned into a poster for my birthday. I cannot find the PCB layout.

Shown below is a current schematic for the board. The plan is to plug it into a PC AT style bus. Onboard are a serial port, audio beeper, random bitstream generator, and IDE drive port. The IDE drive port will be connected to a SATA drive by an IDE to SATA converter. There are still about a dozen FPGA signals that could be used. I may set them up for an RTC module.
Attachment: ISA_FPGA_cpu_v4.pdf (Schematic for RTF64)

_________________
Robert Finch http://www.finitron.ca


Tue Oct 27, 2020 3:25 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More simulation and debugging of the xbus. I am wondering now if the lane used to transfer the clock signal can be used as a data lane instead. It seems somewhat wasteful bandwidth-wise to transmit the clock signal when the frequency is already known. Also, it is a local bus; it is not going through meters of cable. There is a sync signal sent periodically and I am thinking this could be used to recover the clock. Or a lower speed clock could be sent by itself on the bus for reference. A PLL could then be used to regenerate the serial clock. The FPGA PLLs need about 20 MHz or more. What are the prospects of sending a 32.768 MHz clock out to card edge connectors?

_________________
Robert Finch http://www.finitron.ca


Wed Oct 28, 2020 3:53 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
What do you use for schematic drawing? It is in readable B&W.
Ben.


Wed Oct 28, 2020 6:49 am