Computer Organization & Architecture
Pentium 4 Cache Organization
The Pentium 4 has three levels of cache.
The level 1 cache is a split cache, with separate instruction and data sides. The data cache is 8KB in size and four-way set associative, meaning that each set is made up of four cache lines. The line size is 64 bytes, and lines are replaced using a “least recently used” (LRU) algorithm.
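The geometry quoted above fixes how many sets the cache has and how an address could be divided into tag, set index and byte offset. The sketch below works this out in Python. The function name split_address and the plain bit-slicing scheme are assumptions made for illustration; the text does not say exactly how the hardware indexes the cache.

```python
# Geometry taken from the text: 8KB, four-way set associative, 64-byte lines.
CACHE_BYTES = 8 * 1024
WAYS = 4
LINE_BYTES = 64

NUM_LINES = CACHE_BYTES // LINE_BYTES        # 128 lines in total
NUM_SETS = NUM_LINES // WAYS                 # 32 sets of four lines each
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 6 bits select a byte within a line
INDEX_BITS = NUM_SETS.bit_length() - 1       # 5 bits select a set


def split_address(addr):
    """Split an address into (tag, set index, byte offset).

    Hypothetical helper: assumes the low bits are the offset, the next
    bits are the set index, and everything above is the tag.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Example: address 0x12345 maps to tag 36, set 13, byte 5 of the line.
example = split_address(0x12345)
```

Two addresses compete for the same four lines only when their set index bits match, which is why the number of sets matters as much as the total size.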
The Pentium 4 makes use of “Hyper-Pipelined Technology” for performance. The pipeline has 20 stages, meaning that each instruction passes through 20 steps and many instructions can be in progress at once, each in a different stage. This is twice the depth of the Pentium III pipeline, which has only 10 stages.
With the longer pipeline, each stage does less work per clock cycle, and more cycles are spent filling the pipeline, both at start-up and after every flush. Therefore, the Pentium 4 pipeline has to run at a higher clock frequency in order to do the same amount of work as the shorter Pentium III pipeline.
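The cost of filling a deeper pipeline can be estimated with a simple cycle count. The sketch below is a simplified model, not a description of the real processors: it assumes one instruction enters the pipeline per cycle and that every flush costs a complete refill. The function name pipeline_cycles and the example figures (1000 instructions, 10 flushes) are chosen only for illustration.

```python
def pipeline_cycles(stages, instructions, flushes=0):
    """Cycles needed by an idealised pipeline.

    One instruction completes per cycle once the pipeline is full.
    Filling the pipeline costs (stages - 1) extra cycles, and that
    cost is paid again after every flush (assumed full refill).
    """
    fill = stages - 1
    return instructions + fill * (1 + flushes)


# Same workload on a 20-stage and a 10-stage pipeline.
deep = pipeline_cycles(20, 1000, flushes=10)      # 1209 cycles
shallow = pipeline_cycles(10, 1000, flushes=10)   # 1099 cycles

# Clock-rate ratio the deeper pipeline needs just to break even on time.
break_even_ratio = deep / shallow
```

Under these assumptions the 20-stage pipeline needs about 10% more cycles for the same work, so it only comes out ahead if its clock is faster by more than that.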
Intel uses an enhanced out-of-order speculative execution engine in the Pentium 4. Advanced prediction algorithms supply more instructions to execute, and deeper out-of-order resources allow up to 126 instructions to be in flight at once, three times as many as in the Pentium III. The engine also reduces the number of branch mispredictions by about 33% compared with the Pentium III. It does this by using a 4K-entry branch target buffer to store details of past branches, together with an advanced branch prediction algorithm.
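To show what a branch target buffer does in general terms, the sketch below pairs each stored branch with a two-bit saturating counter. This is a common textbook scheme and is used here as a stand-in: the text does not describe the Pentium 4's actual prediction algorithm. The class name, the 4096-entry size and the direct-mapped layout are all assumptions.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB with two-bit saturating counters (illustrative only)."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.table = {}  # slot -> [branch address, target address, counter 0..3]

    def predict(self, pc):
        """Return the predicted target, or None to predict 'not taken'."""
        entry = self.table.get(pc % self.entries)
        if entry is not None and entry[0] == pc and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Record the real outcome of the branch at address pc."""
        slot = pc % self.entries
        entry = self.table.get(slot)
        if entry is None or entry[0] != pc:
            # Allocate only for taken branches, starting at 'weakly taken'.
            if taken:
                self.table[slot] = [pc, target, 2]
            return
        if taken:
            entry[1] = target
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)


btb = BranchTargetBuffer()
btb.update(0x400, taken=True, target=0x480)
predicted = btb.predict(0x400)  # 0x480 once the branch has been seen taken
```

The counter means a single unusual outcome, such as the final iteration of a loop, does not immediately discard a well-established prediction.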
The level 1 data cache is kept small to reduce latency: a hit takes 2 cycles for an integer load and 6 cycles for a floating-point load. The level 1 data cache has a write-back policy, but it can be dynamically configured to use write-through instead.
Instead of a classic level 1 instruction cache, the Pentium 4 uses a trace cache, which takes advantage of the advanced branch prediction algorithms. After the instructions have been decoded into RISC-style instructions called micro-ops, they are stored in the trace cache. Each trace line holds six micro-ops, and the trace cache can store up to 12K micro-ops. Since the instructions have already been decoded, the hardware knows about any branches and fetches the instructions that follow the predicted branch. Problems can occur with conditional branches: if the wrong path is predicted, many instructions that are not needed will have been pre-fetched, decoded and stored in the cache. If the correct path is not already in the trace cache, the processor also has to wait for the correct instructions to be fetched from the level 2 cache. This may take up to 7 cycles, or more if the instructions are not found in the level 2 cache.
The advantage of the trace cache is that, if the predictions work well, it can provide three micro-ops per cycle to the execution scheduler. Because the trace cache stores only micro-ops along the predicted path of execution, it also makes more efficient use of its limited space.
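The trace cache figures quoted above (12K micro-ops, six per trace line, three delivered per cycle) can be combined with a little arithmetic. The sketch below assumes that “12K” means 12 × 1024 and that delivery is perfectly steady with no mispredictions. The function name delivery_cycles is hypothetical.

```python
TRACE_CACHE_UOPS = 12 * 1024   # assumption: "12K" read as 12 * 1024
UOPS_PER_LINE = 6
UOPS_PER_CYCLE = 3

# Number of trace lines the cache can hold.
trace_lines = TRACE_CACHE_UOPS // UOPS_PER_LINE        # 2048

# Cycles to hand one full trace line to the scheduler.
cycles_per_line = UOPS_PER_LINE // UOPS_PER_CYCLE      # 2


def delivery_cycles(uops):
    """Best-case cycles to deliver the given number of micro-ops (rounded up)."""
    return -(-uops // UOPS_PER_CYCLE)
```

For example, delivery_cycles(7) is 3, because the seventh micro-op spills into a third cycle.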
The level 2 cache is a unified cache, 256KB in size. It is eight-way set associative, so each set is made up of eight cache lines, and the line size is 128 bytes. The larger capacity and higher associativity reduce the chance of a miss, which makes the cache more effective in exchange for its lower speed. The larger line size can increase the latency of line refills, so to compensate the Pentium 4 employs a 400MHz system bus (a 100MHz clock transferring data four times per cycle) that delivers a data rate of 3.2GB/s. The system bus has a 64-byte access length, so 2 main memory accesses are required to fill a level 2 cache line.
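The bus and cache figures above can be checked with simple arithmetic. In the sketch below, the 8-byte (64-bit) bus width is an assumption inferred from 3.2GB/s divided by 400 million transfers per second, and the transfer time ignores memory latency entirely, so it is a lower bound rather than a real refill time.

```python
BUS_CLOCK_HZ = 100_000_000     # 100MHz base clock, from the text
TRANSFERS_PER_CLOCK = 4        # gives the 400MHz effective rate
BUS_WIDTH_BYTES = 8            # assumption: 64-bit data bus

L2_BYTES = 256 * 1024
L2_WAYS = 8
L2_LINE_BYTES = 128
BUS_ACCESS_BYTES = 64          # access length, from the text

# Peak data rate in bytes per second: 3,200,000,000 (3.2GB/s).
bandwidth = BUS_CLOCK_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES

# Number of sets in the level 2 cache.
l2_sets = L2_BYTES // (L2_WAYS * L2_LINE_BYTES)            # 256

# Bus accesses needed to fill one level 2 line.
accesses_per_line = L2_LINE_BYTES // BUS_ACCESS_BYTES      # 2

# Pure transfer time for one line at peak rate, in nanoseconds (about 40).
line_transfer_ns = L2_LINE_BYTES / bandwidth * 1e9
```

At the base clock alone, without the four transfers per cycle, the same line would take four times as long to move, which is the latency the faster bus is meant to offset.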
The level 2 cache employs a hardware pre-fetcher that fills 2 cache lines in advance to take advantage of locality of reference. The pre-fetcher monitors the history of cache misses so that it can avoid unnecessary pre-fetches.
The level 3 cache is eight-way set associative and has a line size of 128 bytes.
All caches make use of a “least recently used” replacement algorithm.
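The behaviour of a “least recently used” policy within a single set can be sketched in a few lines. The class below models exact LRU for one set. It is an illustration only: the class name LRUSet is hypothetical, and real hardware often approximates LRU with fewer state bits rather than tracking a full ordering.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with exact least-recently-used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting the LRU line if full)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # evict the least recently used tag
        self.lines[tag] = None
        return False


# A four-way set, as in the level 1 data cache.
s = LRUSet(4)
results = [s.access(t) for t in ["A", "B", "C", "D", "A", "E", "B"]]
# A, B, C, D miss and fill the set; A hits; E evicts B; B then misses again.
```

In this trace the re-use of A protects it from eviction, so B, the oldest untouched line, is the one replaced when E arrives.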