


The ARM3 cache

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
Some interesting notes in this 1992 thesis by Rahul Mehra. Particularly of interest: the use of application traces to explore different design choices, and some choices being made pragmatically on the basis of complexity - too ambitious, and it takes too much effort to get something working.

Attachment: ARM3-cache.png


Quote:
2.3.2 Existing Synchronous ARM Caches
A synchronous cache has been developed for the synchronous processor described in section 2.3.1. It is a 4kbyte, 4-set, 64-way associative virtual cache with demand fetch, random replacement and write-through policies; its structure is shown in figure 2.10. It is currently used in the ARM3, ARM600 and ARM610, although in the 600 series the cache is not necessarily virtually addressed since the MMU is also on chip.

A virtual memory address arrives and the bottom two bits are ignored (they are only used during byte writes). The next two bits choose the word in the quad-word line. Bits 4 and 5 select which CAM array is active this cycle, and the rest of the address is presented to the active CAM. Three extra bits are also factored into the tag: one valid bit per line, which can all be cleared in parallel on cache flush operations, and two access control bits. Use of different memory translation tables for user and supervisor code can be handled using these two access bits and control registers.
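The address split described above can be sketched in a few lines of Python; the 26-bit address width of the early ARMs is an assumption for illustration, and only the bit positions quoted in the text are used:

```python
def split_address(addr):
    """Decompose a virtual address the way the text describes:
    byte-in-word, word-in-line, CAM set select, and tag."""
    byte = addr & 0x3            # bits 1:0 - ignored except on byte writes
    word = (addr >> 2) & 0x3     # bits 3:2 - word within the quad-word line
    cam_set = (addr >> 4) & 0x3  # bits 5:4 - which of the 4 CAM arrays is active
    tag = addr >> 6              # remaining bits go to the active CAM
    return byte, word, cam_set, tag
```

For example, address 0x40 falls in word 0 of a line in set 0 with tag 1.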

The tag comparison stage supplies a hit/miss signal and, if a hit does occur, a six-bit address of the matching line. These are combined with the set selection and word number bits to give a ten-bit SRAM address. The SRAM is organised as a 1K by 32-bit word array, and the ten-bit address locates a word within this array. Once selected, the data is read out via sense amplifiers, or new data is driven into the array on write cycles.
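Forming the ten-bit SRAM address can be sketched as below; note the 2-bit set + 6-bit line + 2-bit word packing order is an assumption, as the text does not give the actual field layout:

```python
def sram_address(cam_set, line, word):
    """Combine the 2-bit set select, 6-bit matching-line address and
    2-bit word number into a 10-bit word address for the 1K SRAM.
    Field ordering (set | line | word) is assumed for illustration."""
    assert 0 <= cam_set < 4 and 0 <= line < 64 and 0 <= word < 4
    return (cam_set << 8) | (line << 2) | word
```

4 sets × 64 lines × 4 words covers all 1024 SRAM words exactly.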

This cache structure was used in the ARM3 and later in the 600 and 610 variants. The design was primarily motivated by the observation that typical ARM samples were capable of being cycled at twice their regular clock rate but were limited to 10MHz by the slow DRAM parts that made up main memory. Shrinking the design rules for new batches of the processor to at least 1.5 micron promised even faster cycle times, but regular main memory is unable to cycle at these sorts of speeds. Thus instead of just producing smaller processors it was decided to develop a code-compatible, cached version of the ARM, the ARM3, for use in applications that demand more processing power [FURBER89].

By utilising the reduction in area needed to implement the processor core, the remainder can be used for an on-chip cache memory, giving a single-chip solution. Another goal was to maintain the memory interface, allowing the new chips to be substituted into existing systems via small daughter boards, thus keeping system development costs to a minimum. Since address translation is done off chip by the MEMC, only virtual addresses are available to the cache. The cache is therefore virtually addressed.

The ARM is strongly von Neumann in nature, transferring only one instruction or data word every cycle. This − coupled with the fact that space for the cache on the silicon die was limited − made the use of a mixed instruction and data cache desirable. A mixed cache is better at adapting to and balancing the amount of data and instructions stored in the cache for the currently executing workload, whereas fixed partitioning can be wasteful of resources for the small cache sizes considered.

Organisation of the cache was based on the results of simulating different structures. Real-time memory reference traces were collected from a full ARM-based system using a hardware add-on. This allowed the execution of a typical workload (complete with user interaction) whilst recording all references to memory. Such traces are more realistic than ones obtained by architecture simulation, since simulated traces do not tend to model user or operating system activity. These traces were then used as input for simulations of different cache structures. It was found that caches with a high degree of associativity gave better performance than direct-mapped alternatives that would fit in the limited chip area.

A completely associative cache requiring the overhead of one 24-bit tag entry per cached data word was very inefficient, so a four-word cache line was employed (one 22-bit tag per 128 bits of data). This resulted in a 256-way associative structure, but this was found to dissipate too much power when accessing the CAM array. Experiments showed that splitting the CAM array into 4 sets of 64 entries had little effect on performance but lowered the power consumption by requiring only one bank of CAM to be active for each request.
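The tag-overhead arithmetic behind the four-word line can be checked directly. Assuming the 26-bit address space of the early ARMs (so 24-bit word addresses and 22-bit line addresses, as the quoted figures imply):

```python
CACHE_BYTES = 4 * 1024
WORD_BITS = 32
words = CACHE_BYTES * 8 // WORD_BITS   # 1024 cached data words

# Fully associative, one tag per word: a 24-bit word address each
per_word_tag_bits = words * 24         # 24576 bits of CAM

# Four-word lines, one tag per line: a 22-bit line address each
lines = words // 4                     # 256 lines -> 256-way associative
per_line_tag_bits = lines * 22         # 5632 bits of CAM, ~4.4x smaller
```

The line-based tag store needs less than a quarter of the CAM bits, at the cost of fetching four words per miss.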

A write-through strategy was used to avoid the need for a complex control circuit for cache flush or line copy-back operations. The organisation of the cache as described allows the hit/miss decision to be made early, since it is based entirely on the contents of the tag store with a single valid bit. A write-back cache would possibly use a 'dirty' bit per word, indicating data to be written out. Such bits are more logically stored with the actual data in the SRAM, making the hit/miss decision slower. Given a write-through cache, a no-allocate-on-write policy was used, since allocating on writes was found to be ineffectual.
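A toy Python model of the write-through, no-allocate-on-write behaviour described above; it deliberately ignores lines, sets and replacement, which a faithful model would need, and the class name is invented:

```python
class WriteThroughCache:
    """Minimal sketch: every write goes straight to memory; a write
    miss does not allocate; a read miss demand-fetches the word."""

    def __init__(self):
        self.cached = {}   # addr -> word currently held in the cache
        self.memory = {}   # backing store, addr -> word

    def write(self, addr, value):
        self.memory[addr] = value     # write-through: memory always updated
        if addr in self.cached:
            self.cached[addr] = value # update cache only on a hit:
                                      # no allocation on a write miss

    def read(self, addr):
        if addr not in self.cached:   # read miss: demand fetch
            self.cached[addr] = self.memory.get(addr, 0)
        return self.cached[addr]
```

Because the tag store alone decides hit/miss, no per-word dirty state ever needs consulting.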

Demand fetch and random replacement strategies were chosen for ease of implementation. Demand fetch is the most straightforward of fetch policies. Although not quite as good as the least-recently-used algorithm, random replacement is much easier to implement in hardware. It also degrades more gracefully when pathological loops break the LRU algorithm. Write buffers to avoid stalling the processor during write-through operations were not added because this would preclude the use of the existing translation exception mechanisms implemented by the MEMC.

The additional benefit of the cached processor over its uncached predecessor is that it uses much less main memory bandwidth (10% of the original amount). This makes it less prone to performance degradation in situations where memory bandwidth is limited, for example in a system where the CPU and video subsystem contend for the same memory bus. Low bandwidth requirements also make the cached chip attractive for multiprocessor systems.


The development of the ARM600 saw the introduction of a write buffer in addition to the on-chip cache and MMU. The MMU has unusual features to aid in implementing an object-oriented environment and will not be described further (see [ARM600]). The presence of the MMU on chip allows address translation to be done upstream of the cache and write buffer, and so preserves the ability to have exact memory aborts.

The write buffer takes the form of eight data slots and two address slots. Up to eight writes occupy the data slots, and the two address words control where in memory the data is sequentially written. This means that in the worst case only two independent write operations can be buffered, but the buffer has been engineered in recognition of a feature of the ARM whereby multiple data values can be stored (and loaded) in a single instruction. During a "store-multiple", writes occur to sequential memory locations; in these situations the buffer need only hold the address of the first location written to. When a sequential write arrives, it is associated with the current group of sequential writes by setting an extra bit in the data slot to indicate which address tag to use. When a nonsequential (unconnected) write arrives, the other address tag must be used (if free). If no free address slot and/or empty data slot is available, the processor is stalled until space becomes available.
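A rough Python sketch of this slot arrangement. Draining the buffer to memory is omitted, the stall is modelled by `push` returning `False`, and the class and method names are invented for illustration:

```python
from collections import deque

class WriteBuffer:
    """Sketch of the ARM600 write buffer: 8 data slots, 2 address
    slots. Sequential writes share an address slot; a nonsequential
    write needs a free one or the processor must stall."""
    DATA_SLOTS, ADDR_SLOTS = 8, 2

    def __init__(self):
        self.data = deque()        # (addr_slot_index, value) pairs
        self.addrs = [None, None]  # base address of each write group
        self.current = None        # addr slot of the in-progress group
        self.next_addr = None      # address a sequential write would have

    def push(self, addr, value):
        """Accept one word; returns False when the processor must stall."""
        if len(self.data) == self.DATA_SLOTS:
            return False                     # no empty data slot: stall
        if self.current is not None and addr == self.next_addr:
            slot = self.current              # sequential: reuse the address slot
        else:
            free = [i for i in range(self.ADDR_SLOTS) if self.addrs[i] is None]
            if not free:
                return False                 # no free address slot: stall
            slot = free[0]                   # start a new write group
            self.addrs[slot] = addr
            self.current = slot
        self.data.append((slot, value))      # the slot index plays the role
        self.next_addr = addr + 4            # of the per-slot tag-select bit
        return True
```

An eight-word store-multiple thus fits in one address slot, leaving the other free for an unrelated write.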

The buffer maintains control of the port to main memory: data is written out on a continuous basis, and all writes pass through it whilst reads are satisfied from within the cache. Thus when the need for a read from main memory arises (e.g. a line fetch or an uncacheable read), it has to wait while the write buffer empties. This is done to preserve strict ordering of reads and writes. The use of this write buffer produces up to a 10% improvement over a cached system without a buffer for certain classes of program.


Mon May 15, 2017 10:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 901
Location: Canada
A good read. I skimmed through it quickly.

One thing I noticed in the table on page 59 was the number of writes being roughly equal to (or greater than) the number of reads. I've read elsewhere that the number of writes is usually substantially less than the number of reads. I'm a little curious as to what's going on here. Is it just the selection of test programs?

_________________
Robert Finch http://www.finitron.ca


Mon May 15, 2017 5:53 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
Hmm, you're right, and I agree - I'd understood reads were 3x writes. I think I'd assumed that was data, rather than instructions, but of course it makes a big difference as to whether instructions are included. Possibly a machine with a large register file shows a different mix compared to an accumulator machine. And a machine which allows large constants as immediates might also skew things a little.

For reference:
Attachment: cache-benchmarking.png


Mon May 15, 2017 6:06 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 901
Location: Canada
I think this is the first time I've seen a cache implemented with CAMs for the cache tags. It allows for higher associativity; he does mention the use of comparators in lower-associativity caches. Apparently the CAMs not in the selected set can be powered off. Given that program execution is fairly local, I wonder whether choosing the set-select bits from higher-order address bits (e.g. 12:11 rather than 5:4) might improve power consumption further.

Now I have a "new" cache to put together.

_________________
Robert Finch http://www.finitron.ca


Mon May 15, 2017 11:40 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
The T9000 also had 4 blocks of cache, with a fully-associative (CAM-based) tag. It also used random replacement - but my random replacement effort was replaced by someone else's, and I had my doubts as to whether they'd done the right thing. I suspected they'd used successive values from an LFSR.


Tue May 16, 2017 4:28 am

Joined: Tue Jan 15, 2013 5:43 am
Posts: 180
I'm curious to hear more, Ed, if the explanation's not too much to undertake. What would be the relative merits of your scheme compared to an LFSR?

_________________
http://LaughtonElectronics.com


Tue May 16, 2017 2:56 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
My scheme was to spin the LFSR n times, to get an n-bit line address. I've a feeling there were 256 lines in each bank. It was a bit more subtle than that, because the caches were configurable to lock down one, two, or three quarters of the lines, so the RNG also needs to be able to produce numbers in the range 0-191, for example. My engine did that by spinning faster than needed and keeping a few in-range results while skipping the out-of-range ones. If it had to produce a value when it didn't have one, it bodged a value, giving in that rare case a non-uniform distribution.
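A sketch of that spin-and-skip scheme in Python, using an 8-bit maximal-length LFSR spun eight times per line number. The tap choice (8, 6, 5, 4), the interface, and the omission of the rare "bodged value" fallback are my own assumptions:

```python
def lfsr_step(state):
    """One step of an 8-bit maximal-length Fibonacci LFSR
    (taps 8, 6, 5, 4 - a known maximal polynomial)."""
    bit = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) | bit) & 0xFF

def random_line(state, limit=192):
    """Spin the LFSR eight times to gather a fresh 8-bit line number,
    skipping results outside [0, limit) - e.g. limit=192 when one
    quarter of the 256 lines per bank is locked down.  Returns the
    new state, which doubles as the selected line number."""
    while True:
        for _ in range(8):
            state = lfsr_step(state)
        if state < limit:
            return state
```

Spinning eight times decorrelates successive line numbers, whereas shifting in only one new bit per request makes each value a near-copy of the last.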

I don't know exactly what the replacement engine did, but I later saw a presentation which gave a strong indication it was only spinning one new random bit for each line address needed. I'm pretty sure that's not ideal, but no idea how much difference it might make in practice.

Inmos, like Acorn, was a bit academic in its approach to things - random replacement is good for the theorists.


Tue May 16, 2017 3:04 pm