COMP 206: Computer Architecture and Implementation Montek Singh We W d., Nov. v 12, 2003 To T pics: 1. Cache Performance (concl.) 2. Cache Coherence 1
Review: Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or3. Reduce the time to hit in the cache. 2
1. Fast Hit Times via Smal , Simple Caches Simple caches can be faster cache hit time increasingly a bottleneck to CPU performance set associativity requires complex tag matching slower direct-mapped are simpler faster shorter CPU cycle times – tag check can be overlapped with transmission of data Smal er caches can be faster can fit on the same chip as CPU avoid penalty of going off-chip for L2 caches: compromise keep tags on chip, and data off chip – fast tag check, yet greater cache capacity L1 data cache reduced from 16KB in Pentium III to 8KB in Pentium IV 3
2. Fast Hits by Avoiding Addr. Translation For Virtual Memory: can send virtual address to cache? Cal ed Virtual y Addressed Cache or just Virtual Cache, vs. Physical Cache Benefits: avoid translation from virtual to real address; saves time Problems: Every time process is switched logical y must flush the cache; otherwise get false hits – Cost is time to flush + “compulsory” misses from empty cache Dealing with aliases (sometimes cal ed synonyms); Two different virtual addresses map to same physical address I/O must interact with cache, so need mapping to virtual address Some Solutions partial y address these issues HW guarantee: each cache frame holds unique physical address SW guarantee: lower n bits must have same address; as long as covers index field & direct mapped, they must be unique; cal ed page coloring Solution to cache flush Add process identifier tag that identifies process as wel as address within process: can’t get a hit if wrong process 4
Virtually Addressed Caches CPU CPU VA VA VA TLB Cache Tags PA VA Cache TLB PA PA MEM MEM Conventional Virtually Addressed Cache Organization Translate only on miss 5
3. Pipeline Write Hits Write Hits take slightly longer than Read Hits: cannot paral elize tag matching with data transfer must match tags before data is written! Summary of Key Idea: pipeline the writes check tag first; if match, let CPU resume let the actual write take its time 6
Cache Optimization Summary Technique MR MP HT Complexity Larger Block Size + – 0 Higher Associativity + – 1 Victim Caches + 2 Pseudo-Associative Caches + 2 HW Prefetching of Instr/Data + 2 Compiler Control ed Prefetching + 3 Compiler Reduce Misses + 0 Priority to Read Misses + 1 Subblock Placement + + 1 Early Restart & Critical Word 1st + 2 Non-Blocking Caches + 3 Second Level Caches + 2 Smal & Simple Caches – + 0 Avoiding Address Translation + 2 7
Impact of Caches 1960-1985: Speed = ƒ(no. operations) 1000 CPU 1997 Pipelined 100 Execution & Fast Clock Rate Out-of-Order 10 completion DRAM Superscalar 1 Instruction Issue 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 1999: Speed = ƒ(non-cached memory accesses) Has impact on: Compilers, Architects, Algorithms, Data Structures? 8
Cache Coherence Section 6.3 & Appendix I 9
Cache Coherence Common problem with multiple copies of mutable information (in both hardware and software) “If a datum is copied and the copy is to match the original at al times, then al changes to the original must cause the copy to be immediately updated or invalidated.” (Richard L. Sites, co-architect of DEC Alpha) Copy becomes stale 1 2 3 4 A A A C Copies diverge;hard to recover from – A B B 1 2 3 4 1 2 3 4 A A A B A A A – - A B B – A B B Write update Write invalidate 10
Example of Cache Coherence I/O in uniprocessor with primary unified cache MM copy and cache copy of memory block not always coherent WT cache MM copy stale while write update to MM in transit WB cache MM copy stale while cache copy Dirty Inconsistency of no concern if no one reads/writes MM copy If I/O directed to main memory, need to maintain coherence INPUT OUTPUT WT WB WB WT Block in Invalidate cache block, Evict block, then Evict block, then (MM not stale) cache then I/O writes MM I/O writes MM I/O reads MM I/O reads MM Block not in I/O writes MM I/O writes MM I/O reads MM I/O reads MM cache 11
Example of Cache Coherence (contd) Uniprocessor with a split primary cache I-cache contains instruction D-cache contains data Often contents are disjoint If self-modifying code is al owed, then same cache block may appear in both caches, and consistency must be enforced MS-DOS al ows self-modifying code Strong motivation for unified caches in Intel i386 and i486 Pentium has split primary cache, and supports SMC by enforcing coherence between I and D caches Coordinating primary and secondary caches in uniprocessor Shared memory multiprocessors 12
Two “Snoopy” Protocols We wil discuss two protocols A simple three-state protocol Section 6.3 & Appendix I of HP3 The MESI protocol IEEE standard Used by many machines, including Pentium and PowerPC 601 Snooping: monitor memory bus activity by individual caches taking some actions based on this activity introduces a fourth category of miss to the 3C model: coherence misses First, we need some notation to discuss the protocols 13
Notation: Write-Through Cache (State) Read hit Write hit Read miss Write miss (CPU inputs) Invalid N/A N/A Valid Valid (Next state) N/A N/A Bus read miss Bus write miss (Action) Valid Valid Valid Valid Valid (Next state) • Bus write CPU reads Bus read miss Bus write miss (Action) cache • CPU writes cache 14
Notation: Write-Back Cache (State) Read hit Write hit Read miss Write miss (CPU inputs) Invalid N/A N/A Clean Dirty (Next state) N/A N/A Bus read miss Bus write miss (Action) Clean Clean Dirty Clean Dirty (Next state) CPU reads CPU writes Bus read miss Bus write miss (Action) cache cache Dirty Dirty Dirty Clean Dirty (Next state) CPU reads CPU writes • Evict block Evict block (Action) cache cache • Bus read miss Bus write miss 15
Three-State Write-Invalidate Protocol Minor modification of WB cache Assumptions Single bus and MM Two or more CPUs, each with WB cache Every cache block in one of three states: Invalid, Clean, Dirty (cal ed Invalid, Shared, Exclusive in Figure 6.10 of HP3) MM copies of blocks have no state At any moment, a single cache owns bus (is bus master) Bus master does not obey bus command Al misses (reads or writes) serviced by MM if al cache copies are Clean the only Dirty cache copy (which is no longer Dirty), and MM copy is written instead of being read 16
Understanding the Protocol MM Only two global states•Most up-to-date copy is MM copy, and all cache copies are Clean C1 C2 •Most up-to-date copy is a single unique cache copy in state Dirty A A •Bus owner wner Clean •Another Clean copy copy exists •Can an read without out notififying ying A A B — other caches •Bus owner Dirty A •Bus owner Clean •No other cache copies •No other cache copies •Can read or write without •Can read without notifying notifying other caches A — other caches 17
State Diagram of Cache Block (Part 1) (State) Read Write Hit Read Miss Write Miss (CPU Hit inputs) Invalid N/A N/A Clean Dirty (Next state) N/A N/A Bus read miss Bus write miss (Action) Clean Clean Dirty Clean Dirty (Next state) CPU reads • Bus write miss Bus read miss Bus write miss (Action) cache • CPU writes cache Dirty Dirty Dirty Clean Dirty (Next state) CPU reads CPU writes cache • Evict block • Evict block (Action) cache • Bus read miss • Bus write miss 18
State Diagram of Cache Block (Part 2) (State) Bus Read Miss Bus Write Miss (Bus inputs) Invalid Invalid Invalid (Next state) None None (Action) Clean Clean Invalid (Next state) None None (Action) Dirty Clean Invalid (Next state) Put block on bus, and Put block on bus, and (Action) prevent MM from doing prevent MM from doing the same the same 19
Comparison with Single WB Cache Similarities Read hit invisible on bus Al misses visible on bus Differences In single WB cache, al misses are serviced by MM; in three- state protocol, misses are serviced either by MM or by unique cache block holding only Dirty copy In single WB cache, write hit is invisible on bus; in three-state protocol, write hit of Clean block: invalidates al other Clean blocks by a Bus Write Miss (necessary action) 20
Correctness of Three-State Protocol Problem: State transition of FSM is supposed to be atomic, but they are not in this protocol, because of the bus Example: CPU read miss in Dirty state CPU access to cache detects a miss Request bus Acquire bus, and change state of cache block Evict dirty block to MM Put Bus Read Miss on bus Receive requested block from MM or another cache Release bus, and read from cache block just received Bus arbitration may cause gap between steps 2 and 3 Whole sequence of operations no longer atomic App. I.1 argues that protocol wil work correctly if steps 3-7 are atomic, i.e., bus is not a split-transaction bus 21
Adding More Bits to Protocols Add third bit, cal ed Shared, to Valid and Dirty bits Get five states (M, O, E, S, I) Developed in context of Futurebus+, with intention of explaining al snoopy protocols, al of which use 3, 4, or 5 states Valid Shared Dirty State Abbreviation 0 x x Invalid I 1 0 0 Exclusive E 1 1 0 Shared S 1 0 1 Modified M 1 1 1 Owned O 22
MESI Protocol Four-state, write-invalidate Improved version of three-state protocol Clean state split into Exclusive and Shared states Dirty state equivalent to Modified state Several slightly different versions of MESI protocol Wil describe version implemented by Futurebus+ PowerPC 601 MESI protocol does not support cache-to-cache transfer of blocks 23
State Diag. of MESI Cache Block (Part 1) (State) Read Hit Write Hit Read Write (CPU Miss Miss inputs) Invalid N/A N/A Shared Exclusive (Next state) N/A N/A Bus read Bus write (Action) miss miss Exclusive Exclusive Modified Shared Exclusive (Next state) CPU reads CPU writes Bus read Bus write (Action) cache cache miss miss Shared Shared Modified Shared Exclusive (Next state) CPU reads • Invalidate Bus read Bus write (Action) cache • CPU writes miss miss cache Modified Modified Modified Shared Exclusive (Next state) CPU reads CPU writes • Evict • Evict block (Action) cache cache block • Bus write • Bus read miss miss 24
State Diag. of MESI Cache Block (Part 2) (State) Invalidate Bus read Bus write (Bus inputs) miss miss Invalid Invalid Invalid Invalid (Next state) None None None (Action) Exclusive Invalid Shared Invalid (Next state) None None None (Action) Shared Invalid Shared Invalid (Next state) None None None (Action) Modified Invalid Shared Invalid (Next state) None Supply block and Supply block and (Action) update MM update MM 25
Comparison with Three-State Protocol Similarities Read hit invisible on bus Al misses handled the same way Differences Big improvement in handling write hits Write hit in Exclusive state invisible on bus Write hit in Shared state involves no block transfer, only a control signal A A A A — A A B — •Exclusive state •Shared state •Modified state •Can be read or written •Can be read only •Can be read and written 26
Comments on Write-Invalidate Protocols Performance Processor can lose cache block through invalidation by another processor Average memory access time goes up, since writes to shared blocks take more time (other copies have to be invalidated) Implementation Bus and CPU want to simultaneously access same cache Either same block or different blocks, but conflict nonetheless Three possible solutions Use a single tag array, and accept structural hazards Use two separate tag arrays for bus and CPU, which must now be kept coherent at al times Use a multiported tag array (both Intel Pentium and PowerPC 601 use this solution) 27