This post is a set of notes on what I learned from a YouTube video about ARM processor memory organization.
Memory organisation? How the different types of memories, such as caches and RAM, are used to build up the complete memory system of an embedded system.
This is the memory-centric view of an embedded system.








The memory system in an embedded appliance starts with the register file, which is specific to the microcontroller used in the embedded system.
There may be one or more cache memories that feed data and instructions to the pipeline, since most of the microcontrollers we have seen are pipelined processors. Scratchpad memory (SPM) is a kind of high-speed on-chip memory.
Note: SDRAM (Synchronous DRAM) gives you fast memory access.
Complete Memory MAP of the processor

Different processors have different amounts of on-chip RAM.
A cache may hold instructions as well as data; that is a unified cache, which is the case for a Von Neumann architecture (a Harvard architecture uses separate instruction and data caches).
The interesting thing is that the data cache is mapped onto the address space of the DRAM, and that is the typical feature of a cache memory.
SPM access is typically synchronous with the CPU, taking one clock cycle, so accessing this memory needs no wait states. DRAM requires wait states (10-20 CPU clock cycles); that is why DRAM contents are expected to be mapped onto the data cache, depending on your cache management policies.
Caches and CPU

Whatever address the CPU generates is processed by the cache controller to find out whether the data is currently in the cache or not.
If it is in the cache, the data is passed on to the CPU. If it needs to be fetched from main memory, the controller does that, and at the same time a copy of the fetched data is retained in the cache.
If the data is in the cache, it is called a cache hit; otherwise it is a cache miss.
On a cache miss, the processor cannot access the data or instruction in a single clock cycle.
Direct mapped cache

In general: there is a fixed mapping from memory locations to cache locations.
So how is it actually done by the cache controller? For that purpose, the address generated by the processor is logically divided into tag, index and offset fields. The cache block stores the data itself.
address: | tag | index | offset |
cache block: | valid | tag | data |
index: selects which cache block to check (there may be many such cache blocks). The tag part of the address is compared with the tag stored in that cache block; if the two tags match, there is a cache hit, and the data is returned from the block.
Two regularly used memory locations that map to the same cache location lead to conflict misses.
The figure below shows the logic implemented by the cache controller.

Set associative cache
To remove the problem of conflict misses frequently encountered in a direct-mapped cache, a set-associative cache:
- consists of a number of sets
- is characterized by the number of ways (blocks per set) it provides
- implements each set as a small direct-mapped cache
- maps a memory location onto any of the blocks in its set, so a copy can be found in different cache blocks
- is slower than a direct-mapped cache, since several tags must be compared per access

In a typical ARM implementation, a two-way set-associative cache is similar to a direct-mapped cache, except that each index selects a set of two blocks instead of one (that is why it is called two-way).
Another way to increase cache efficiency is to use a CAM (Content Addressable Memory).
A CAM produces an address if a given data value exists in the memory.
Consider: I take the tag, which is actually stored in the cache, and use that tag to generate an address into the cache.
So the tag part of the address is fed to a tag CAM (a hardware module).
The tag (plus index) is fed to the tag CAM, and the tag CAM generates the address of the matching location in the cache. Effectively we are doing many tag comparisons in parallel, but instead of wiring up many separate comparators we have intelligently used the memory address to generate the address of the corresponding cache line. CAMs are used in some ARM cores (ARM920T, ARM940T).
For a Von Neumann architecture there will be only a single cache for instructions and data. For a Harvard architecture there will be two caches (called a split cache organisation): an instruction cache and a data cache.
A write buffer is a small, fast FIFO that temporarily holds data that the processor would write to main memory, and it is used alongside the cache. But why is it used? A write-back to main memory carries a significant overhead; to hide that overhead, ARM provides an on-chip write buffer.
The processor writes data to the write buffer; from the write buffer the data is stored back to main memory.
The write buffer improves cache performance because, during block eviction, the cache controller writes a dirty block (a block whose dirty bit is set) to the write buffer instead of directly into main memory. Eviction occurs when a new block of data has to be loaded into the cache; without a write buffer, a block eviction would take roughly twice as long (write the old block out, then read the new block in).
Putting everything together, the following figure shows the complete picture.

The CPU can also have a direct path to the write buffer if required. The write buffer is not strictly a FIFO in every core; the ARM10 uses coalescing: nearby writes are grouped together and transferred as a block.
All these issues lead us to the different kinds of cache policies.
Write policy can be:
- write-through: the cache controller writes to the cache and to main memory at the same time. A variant of this is writing to the cache as well as to the write buffer.
- write-back: we wait until block eviction takes place; only at eviction is the data written back, either to memory or to the write buffer.
Block replacement policy: which block is to be evicted?
Block eviction takes place when all N candidate blocks are filled up.
- round robin (RR), or cyclic replacement: predictable performance, because we know in advance which block will be replaced.
- random selection.
Allocation policy: when is a cache block actually allocated?
- read-allocate
- read-or-write-allocate
These policies are hardware policies, implemented in the cache hardware.
ARM cache core policies

| core | write policy | replacement policy | allocation policy |
|---------|---------------------------|---------------------|-----------|
| ARM720T | write-through | random | read-miss |
| ARM740T | write-through | random | read-miss |
| ARM920T | write-through, write-back | random, round robin | read-miss |
| ARM946E | write-through, write-back | random, round robin | read-miss |
These options have to be chosen when the cache is implemented. Now, how are they implemented, how are they made available, and how are they actually used? Today the cache is on-chip, so the designer decides the policies, then configures and implements them in the cache; external programmers do not really get an option to play with them.
Cache Control in ARM
CP15 (the most important co-processor), also called the system control co-processor, manages the standard memory facilities, including the cache.
CP15 has registers, so we program this co-processor by writing to those registers using the co-processor instructions.
Using these registers, different features and control functions of the cache can be specified: the size of the cache and its degree of associativity, enabling and disabling of cache operations, and policy choices such as write, replacement and allocation policies.
There are also other kinds of control functions, like flushing the cache, which can be initiated by an instruction to this co-processor.
This co-processor is actually the block we have already seen: the cache controller. So the cache controller manages all cache accesses in the case of ARM.
The SPM occupies a part of the address space itself. The advantage is that you get a guaranteed access time: there is no question of a cache hit or miss, because we are not doing any kind of mapping.
ARM provides an idea called a lockable cache, so that a part of the code can be locked into the cache.
Another important issue in ARM is whether a memory area is cacheable or not. In principle the entire memory is mapped to the cache, but I/O management in ARM is memory-mapped I/O, so the I/O ports are located in memory (they also fall within the address space of ARM). The data at such a port can be changed by the external device, so if I make those memory locations cacheable, I may end up reading inconsistent (stale) data.
The cache controller therefore has a provision for indicating whether an area can be cached or not.
Cache lockdown
For lockdown purposes, the cache in ARM is divided into lockdown blocks. One block from each cache set can be marked as a locked block. Data from main memory can be loaded into this block, and it will then not be replaced, e.g. the inner loop of a DSP routine.
Multi level cache
Advanced embedded systems support multiple levels of cache, to minimize the cache miss rate caused by capacity limitations. The L1 cache normally has single-cycle access, while the L2 cache has a latency of more than one CPU cycle but less than that of main system memory.