Preface....5
Acknowledgments....7
Contents....8
About the Editor....12
Section Editors....14
Contributors....16
Part I Single Core Processors....21
1 Microarchitecture....22
Contents....22
Introduction....23
Single-Cycle Processor Design....24
Processor Data Path....25
Processor Control Unit....30
Pipelining....31
Pipeline Principle and Performance Metrics....31
Pipelined Processors....35
Pipeline Hazards....37
Data Hazards....38
Control Hazards....44
Structural Hazards....54
Multiple-Issue Processor....54
Conclusions....62
References....64
2 The Architecture....65
Contents....65
Introduction....66
Terms and Notations....68
Laws and Models in Microprocessor/System-on-Chip (SoC) Architectures....68
ISA Selection and Considerations....72
CISC: Complex Instruction Set Computer....72
The Baseline: Looking at the ISA of the 8088 and the 8086 Processors....72
IA32 Architectures....75
Extending the Architecture to 64 Bits (X86-64 ISA)....75
IA-64 Registers....76
Adjusting the Architecture to Support New Technologies....77
Summary....79
RISC: Reduced Instruction Set Computer....79
MIPS....80
SPARC ISA....81
ARM: Advanced RISC Machines....82
ARM7-32 Bits....84
AA64 Architecture....84
Summary of ARM ISA....87
The RISC-V Approach for ISA....87
RISC-V: Basic ISA (RISCV 2021)....89
RISC-V: Extensions (RISCV 2021)....89
Extension M: Integer Multiplication and Division....90
Extension A: Atomic Instructions....90
Extension F: Single-Precision Floating Point....90
Summary of ISA Selection....91
Vector and SIMD Extensions....91
SIMD Architectures....92
MMX....93
Streaming SIMD Extensions (SSE)....93
Advanced Vector Extensions (AVX)....94
Support for Machine Learning....95
Discussion on the Use of SIMD Operations (in Intel's Cores)....96
Support for Vectors....96
Cross-Layers Optimizations....97
Background....98
Delayed Branch in MIPS....98
The User-Defined Microcode Programming....99
VLIW Architectures....99
HW/SW Codesign: The CUDA Approach....100
ISA Agnostic Systems....102
The Use of Intermediate Representations....102
Binary Translation....103
Summary....104
References....104
3 Architectures for Self-Powered Edge Intelligence....107
Contents....107
Evolution of Edge Intelligence and a Pathway to Self-Powered Intelligent Computations....108
Architectures for Energy Harvesting in IoT Edges....110
A Self-Powered Image Sensor System with Autonomous Mode Management (AMM)....111
Factors Affecting Self-Power Performance....112
Effects of a Processing Pipeline....112
Effects of Unit Pixel Size....112
Effects of SRAM Leakage Energy....113
Effects of Power Converter Efficiency....113
ROI-Aware Image Processing Architecture....113
Moving Object Detection Architecture....114
Low-Power Moving Object Detection....114
Noise-Robust Moving Object Detection....115
ROI-Based Coding Architecture....115
Temporal ROI-Based Coding....115
Spatial ROI-Based Coding....116
Resource-Aware Control of Target Data Rate....116
Conventional Target Data Rate Control....116
Energy- and Content-Aware Target Data Rate Control....117
Resource-Aware Control of Encoding Data Rate....118
Challenges in Data Rate Control....118
Low-Power Data Rate Control....119
Architectural Support for Handling Sparsity in IoT Devices....119
Approaches in Matrix Multiplication....120
Inner Product-Based Approach....120
Outer Product-Based Approach....121
Compressed Sparse Formats....121
Recent Hardware Architecture for Handling Sparsity....122
Hardware Architecture for Inner Product Approach....123
Hardware Architecture for Outer Product Approach....126
Architectures for Power-Gating-Based Active Leakage Control....129
Overview of Power-Gating....129
Challenges and Trade-Offs in Power-Gating....130
Power-Gating Efficiency Learner....133
Self-Adaptive Power-Gating Architecture....134
Test Chip and Measurement Results....135
Conclusion and Future Roadmap....138
References....139
4 Real-Time Scheduling for Computing Architectures....144
Contents....144
Real-Time Operating System (RTOS)....145
Introduction to Key OS Features....145
Introduction to Real-Time Systems....147
Real-Time CPU Scheduling....149
Scheduling on Single-Core CPUs....149
Scheduling on Multi-core CPUs....151
Real-Time Scheduling for CPU-GPU Systems....154
GPU Background....154
GPU Hardware Architecture....155
Threading Model....156
Scheduling Tasks on a Single GPU....157
Intra-SM Resource Allocation....158
Inter-SM Resource Allocation....159
Memory Transfer Between Device and Host....160
Multi-GPU and CPU-GPU Scheduling....160
Multiple GPUs Controlled by One Host....161
Heterogeneous Systems as DAGs....161
Splitting Tasks Between CPUs and GPUs....162
Application Domains....163
Graphics Processing....163
Cloud Systems....163
Tools and Frameworks....164
NVIDIA and CUDA....164
AMD and ROCm....164
OpenCL....164
Alternative Architectures....165
Processing in Memory....165
FPGAs as Accelerators....165
Real-Time Edge Computing Systems....165
Introduction to Edge Computing....165
The Edge Architecture....167
Real-Time Edge Computing....168
Resource Allocation in Real-Time Edge....169
Contention Model....169
Tiered Architecture....171
Model Parameters....172
Introduction to Real-Time Networks....173
Real-Time Wired Networks....173
Real-Time Wireless Networks....175
Real-Time Flow....178
Routing and Scheduling in Real-Time Wireless Sensor Networks....180
RAP Routing Protocol....181
SPEED Routing Protocol....181
Summary....181
References....182
5 Secure Processor Architectures....188
Contents....188
Introduction....189
Modern CPU Microarchitecture....190
Micro-architectural Attacks....193
Transient Micro-architectural Attacks....196
Meltdown and Spectre-Like Attacks....198
Micro-architectural Data Sampling Attacks....201
Countermeasures....206
Prevention-Based Countermeasures....206
Detection-Based Countermeasures....211
Conclusions....211
References....212
6 Bus and Memory Architectures....217
Contents....217
Introduction....218
SoC Overview....218
Processor Overview....219
CPU Types....220
Balanced Processor Architectures....221
CPU Memory Parallelism....222
MSHRs....222
Memory-Level and Memory Hierarchy Parallelism (MLP and MHP)....223
Parallelism to DRAM....224
Accelerators....224
On-Chip Connectivity....225
Interconnect Interfaces....225
Interconnect Topologies....226
Off-Chip Connectivity....227
Summary and Conclusion....227
References....227
Part II Application-Specific Processors....229
7 Architectures for Multimedia Processing: A Cross-Layer Perspective....230
Contents....230
Introduction and Overview of Video Codecs....231
High Efficiency Video Coding....233
Overview of the Standard....233
Analysis of Computational Complexity, Memory Requirements, and Processor Temperature....235
Hardware and Software Architectures for Video Coding....239
Complexity Reduction....242
Low-Power Memory Architectures....242
Workload Balancing for Multiple Video Tiles....244
Dynamic Thermal Management for HEVC....244
Future Directions....247
Conclusions....249
References....249
8 Post-Quantum Cryptographic Accelerators....252
Contents....252
Introduction....253
Post-Quantum Cryptography (PQC)....255
NIST Post-Quantum Cryptography Standardisation Project....255
Initial Submissions....256
NIST's PQC Round 1....256
NIST's PQC Round 2....256
NIST's PQC Round 3....256
Classes of Post-Quantum Cryptography....256
Code-Based....257
Multivariate-Based....257
Hash-Based....257
Isogeny-Based....257
Lattice-Based....258
Lattice-Based Cryptography Primitives....259
Lattices....259
Computational Problems on Lattices....259
Average-Case Problems on Standard Lattices....260
Classes of Lattices....261
Ring-LWE Based PKE Scheme....262
Computationally Intensive Components of LWE (and Variants)....263
Discrete Gaussian Sampling....264
Polynomial Multiplication....265
Schoolbook Algorithm....265
Number Theoretic Transform (NTT)....266
Barrett's Reduction....268
Coprocessors for the Lattice-Based Cryptography....269
General Optimisation Strategies....269
Performance Benchmarks....270
Coprocessors Design Paradigms for Lattice-Based Cryptography....271
Optimization Strategies for Implementation of Underlying Components....276
Discrete Gaussian Sampling....276
Polynomial Multiplication....278
Physical Protection of Lattice-Based Cryptography....281
Timing Attacks....282
Power Analysis Attacks....282
Fault Attacks....283
Challenges in the Post-Quantum Cryptography Adaptation....284
Conclusions....285
References....286
9 Fault Tolerant Architectures....291
Contents....291
Introduction....292
Faults, Errors, and Failures....295
Fault Model....295
Fault Mechanisms....296
External Faults....296
Aging/Stress-Induced Faults....297
Fault Masking....298
Reliability....300
Types of Reliability....300
Reliability Estimation....301
Fault Tolerance....303
Fault Tolerance Activities....304
Redundancy....305
Fault-Tolerant Computation....308
Single-Core Computing....308
Multicore Computing....310
Reconfigurable Computing....311
Fault-Tolerant Memory/Storage....312
Cache/On-chip SRAM....313
Main Memory/DRAM....313
Storage....313
Fault-Tolerant On-Chip Communication....314
Cross-Layer Reliability....315
Domain-Specific Fault Tolerance....317
Signal Processing....317
Wireless Communication....318
Fault Tolerance in Emerging Technologies....318
Emerging Memory Technologies....318
Reliability Issues in NVMs....320
Read Disturb Issue in OxRRAM....320
Thermal Issues due to PCM's High Voltage Operations....322
Fault Tolerance in AI/ML....323
Built-In Error Tolerance of Machine Learning Models....323
Fault Tolerance via Self-Repair....324
Conclusion....328
Glossary....328
References....330
10 Architectures for Machine Learning....335
Contents....335
Introduction....336
Architectures for Neuromorphic Computing....338
Biological Computing Models and Learning Methods....338
Microarchitecture for Neuromorphic Computing....344
Circuit-Level Design Considerations....349
Prominent Neuromorphic Chips....357
SpiNNaker....357
Neurogrid....358
BrainScaleS....359
LaCSNN....359
TrueNorth....360
Loihi....361
ODIN....362
Tianjic....362
Architectures for Artificial Neural Networks....363
Design Metrics for ANN Architectures....364
Design Abstractions and Trade-Offs....369
Selective ANN Architectures and Circuits....371
Architectures for Classic Machine Learning....384
Conclusions....385
References....386
11 Computer Arithmetic....394
Contents....394
Introduction....395
Definitions....398
Radix....398
Positional Notation....399
Absolute Error....399
Relative Error....399
Numerical Precision....400
Units in the Last Place....400
Machine Epsilon....400
Floating-Point Operations Per Second....400
Integer Arithmetic....400
Gray Code....401
Unary Code....402
Fixed-Point Arithmetic....402
Floating-Point Arithmetic....403
IEEE 754....403
Subnormal Numbers....404
Exceptions....404
Not a Number-NaN and Infinity....405
Quiet NaN....405
Signaling NaN....405
Rounding Modes....405
Floating-Point Approximate Circuits....406
Posit Arithmetic....407
Other Formats....408
BF16....408
TensorFloat-32....409
Hardware Implementations....409
Adders....409
Ripple-Carry Adder....409
Carry-Lookahead Adder....410
Multipliers....410
Dividers....411
Square Root....411
Conclusion....411
References....412
12 Architectures for Scientific Computing....414
Contents....414
Introduction....415
Definitions....416
Scientific Computing....416
Multicore Architectures....417
Manycore Architectures....418
Field-Programmable Gate Arrays....418
Coarse-Grained Reconfigurable Architectures....418
Custom Architectures....420
Multicore Architectures....420
General Purpose Graphics Processing Units....421
Field-Programmable Gate Arrays....423
Coarse-Grained Reconfigurable Architectures....423
Conclusion....425
References....425
Part III Multicore and Reconfigurable Architectures....428
13 Field-Programmable Gate Array Architecture....429
Contents....429
Introduction....430
Methodology and Tools for FPGA Architecture Evaluation....432
Key FPGA Applications....434
Programmable Logic Blocks....435
Programmable Routing....442
Programmable IO....447
Programmable Clock Distribution Networks....449
On-chip Memory....451
DSP Blocks....459
Processor Subsystems....465
System-Level Interconnect: Network-on-Chip....467
Interposers....469
Configuration and Security....471
Conclusion....472
References....472
14 Coarse-Grained Reconfigurable Array (CGRA)....476
Contents....476
Introduction....477
Historical Context....479
Architecture: A Landscape of Modern CGRA....482
Compilation for CGRAs....486
Modulo Scheduling and Modulo Routing Resource Graph (MRRG)....486
CGRA Mapping Approaches....488
Heuristic Approaches....489
Mathematical Optimization Techniques....494
Graph-Theory-Inspired Techniques....495
Other Compilation-Related Issues....503
Challenges Related to Data Access....503
Nested Loop Mapping....505
Application-Level Mapping....507
Handling Loops with Control Flow....510
Scalable CGRA Mapping....510
Conclusions....511
References....511
15 Dynamic and Partial Reconfiguration of FPGAs....517
Contents....517
Introduction....518
FPGA Configuration....520
Designing Partially Reconfigurable Systems....522
Managing Partial Reconfiguration....527
Applications of Dynamic Partial Reconfiguration....529
Computing Infrastructure and Virtualization....530
Design Compilation....531
Adaptive Systems....532
Machine Learning....534
Reliability and Harsh Environments....534
Research Directions....535
Conclusions....536
References....536
16 GPU Architecture....541
Contents....541
Introduction....542
Graphics Pipeline....543
GPU for General-Purpose Computing....546
Execution Model....546
Programming Interface....548
Hardware Architecture....550
Shader Pipeline....550
Register File....552
Warp Scheduler....553
SIMT Stack....554
Memories....555
Global Memory....555
Constant Memory and Texture Memory....556
Shared Memory....556
L1 and L2 Caches....557
Optimization Use Case: Access-Aware Variable Mapping to Memory....557
Recent Research on GPU Architecture....560
Performance....560
Hiding Memory Access Latency with Advanced Warp Schedulers....560
Throttling Memory Access Latency....561
Energy Efficiency....562
Revisiting Compute Cores and Pipeline....563
Revisiting Register File....564
Reliability....565
Run-Time Error Detection and Correction....566
Fault Analysis....567
Conclusion....567
References....567
17 Power Management of Multicore Systems....570
Contents....570
Introduction....571
Power Dissipation in Multicore Systems....573
Causes and Effects of Power Dissipation....573
Power Dissipation in Multicore Systems....576
Common Power Reduction Methods....577
Hardware....577
Firmware....578
Dynamic Voltage and Frequency Scaling (DVFS)....578
Dynamic Power Management (DPM)....579
Virtualization....580
Software....580
Task Migration....580
Task Scheduling....580
Data Forwarding....580
Power Management: Embedded Systems....581
Energy Minimization....581
Thermal Management....583
Reliability Improvement....587
Power Management: Desktop and Servers....589
ACPI Standard....590
Power Schemes: Governors....591
Power Management: High-Performance Computing (HPC) Data Centers....592
Fast Heuristics....593
Heuristics Using Design-Time Profiling....593
Machine Learning....594
Network Technologies....594
Recent Advances in Multicore Power Management....594
2.5D/3D Systems....594
Cross-Layer Approach....595
Emerging Technologies....595
AI-/ML-Based Power Management....595
Conclusion....596
Glossary....596
References....597
18 General-Purpose Multicore Architectures....603
Contents....603
Introduction....604
Motivating the Need for Concurrent Processing....606
Classifying Parallel Computing Hardware....606
Multiprocessing....607
Thread-Level Parallelism Within an Application....609
What to Do With All These Transistors?....612
Multicore CPU Hardware Design....614
Optimizing CPU Cores for Parallelism....614
Sharing Caches and Main Memory....617
Coordinating Memory Requests Across Cores....622
Scaling to Many Cores....622
Managing Memory....623
Shared-Memory Model....625
Main Memory Policies....626
Mitigating Interference....629
Cache Coherence....630
Memory Consistency Models....634
Optimizing Operating Systems for Multicore CPUs....636
Evaluating Multicore CPUs....639
The Evolution of Multicore CPUs....643
Systems-on-Chip....643
Heterogeneous CPU Cores....645
Chiplet-Based Multicore Design....646
Conclusion....648
References....648
Part IV Emerging Computing Architectures....652
19 Compute-in-Memory Architecture....653
Contents....653
Introduction....654
DNN Basics and Corresponding CIM Principle....656
Architecture and Algorithm Techniques for CIM....658
Hierarchical Architecture of CIM....658
Network Mapping Strategies....659
Mapping Methods for Inference....660
Mapping Method for Training....663
Number Representation in CIM Architecture....663
Pipeline Design in CIM Architecture....666
Intra-Layer Pipeline....667
Inter-Layer Pipeline....667
Quantization Techniques in CIM Architectures....670
Hardware Implementations for CIM Architecture....673
Device Technologies....673
SRAM....673
Two-Terminal eNVM....675
Three-Terminal eNVM....676
Overcoming the Non-idealities from eNVM....677
Circuit Techniques for CIM....678
Memory Modification....678
Input Encoding....679
Output Sensing....681
Frameworks for Evaluating CIM Designs....687
Conclusion....688
References....689
20 Design Automation Techniques for Microfluidic Biochips....693
Contents....693
Introduction....694
Flow-Based Microfluidic Biochips....696
Design Tasks for FBMBs....697
Architecture Design of the Flow Layer....697
Architecture Design of the Control Layer....699
Design Automation for FBMBs....699
Synthesis Methods for the Flow Layer....699
Synthesis Methods for the Control Layer....702
Synthesis Methods for the Codesign of the Control and Flow Layers....705
Digital Microfluidic Biochips....707
Technology Platforms and Applications....707
Synthesis Methods....710
Scheduling and Module Placement....711
Droplet Routing....712
MEDA Biochips....713
Hardware Implementation....714
MEDA Evolution....717
Synthesis Methods....718
Scheduling and Placement for MEDA Biochips....718
Droplet Routing and Extension for MEDA....721
Conclusion....724
References....725
21 Architectures for Quantum Information Processing....729
Contents....729
Introduction....730
Background....731
Quantum Bits (Qubits)....732
Quantum Gates....733
Quantum Error....733
Gate Error....734
Relaxation and Dephasing....734
Measurement Error....734
Crosstalk Error....735
Quantum Hardware....735
Qubit Technologies....736
Superconducting Qubits....736
Trapped-Ion Qubits....737
Spin Qubits....738
Quantum Algorithms....738
Algorithms Designed for Fault-Tolerant Quantum Computers....739
Shor's Algorithm....739
Grover's Algorithm....740
Algorithms for NISQ Computers....740
Variational Quantum Eigensolver or VQE....740
Quantum Approximate Optimization Algorithm or QAOA....741
Quantum Software....741
Quantum Program, Quantum Instruction Sets, and Software Development Kits....741
Quantum Programming Languages....742
Quantum Annealing....744
Compilation, Mapping, and Optimization....745
Superconducting Quantum Computers....746
Coupling Constraints and Need for SWAP Operation....746
Compilation and Optimization....747
Trapped-Ion Quantum Computers....747
Shuttle Operation....747
Compilation and Optimization....748
Considerations for Noisy Systems....749
Technology Agnostic Work....749
Noise-Aware Qubit Mapping....749
Measurement Error Mitigation....750
Superconducting-Specific Work....750
Crosstalk Mitigation....750
Leveraging Extended Native Gates....751
Application-Specific Compilation....751
Conclusion....752
References....752
22 Design and Tool Solutions for Monolithic Three-Dimensional Integrated Circuits....756
Contents....756
Introduction....757
Monolithic 3D IC Design Flow....758
Motivation and Background....758
Benefit Trends of Monolithic 3D ICs Across Technology Nodes....759
Analysis on Benefits of Monolithic 3D ICs....759
Technology Nodes and Design Libraries....759
Implementation Methodology....760
Power Saving Trend of Monolithic 3D ICs....761
Analysis of Trends....763
M3D Power Saving at Low Frequency....763
M3D Power Saving at High Frequency....765
A Design-Aware Partitioning Approach to Monolithic 3D IC with 2D Commercial Tools....766
Implementation Methodology....767
Design-Aware Partitioning Stage....767
MIV Planning Stage....769
Cascade-2D Stage....770
Impact of New Monolithic 3D IC Design Flow....773
Power and Performance Benefit....773
Comparison to Shrunk-2D Design Flow....774
Power Supply Integrity of Monolithic Three-Dimensional Integrated Circuits....778
Motivation and Background....778
System-Level Power Delivery Network Analysis for Monolithic 3D ICs....779
System-Level Power Delivery Network Modeling....780
Analysis on Power Supply Integrity of Monolithic 3D ICs....781
Monolithic 3D IC Power Delivery Network Design Flow....781
Technology Nodes and Design Libraries....781
Analysis Methods....782
Static Rail Analysis....783
Dynamic Rail Analysis....786
Frequency- and Time-Domain Analysis....787
Monolithic 3D ICs for Deep Neural Network Hardware....790
Motivation and Background....790
Impact of Monolithic 3D ICs on On-Chip Deep Neural Networks Targeting Speech Recognition....791
Deep Neural Network for Speech Recognition....791
DNN Topology....791
Deep Neural Network Training and Classification....792
Coarse-Grain Sparsification....793
Deep Neural Network Architecture Description....794
Impact of Monolithic 3D ICs on Energy-Efficiency of Deep Neural Network Hardware....796
Area, Wire-Length, and Capacitance Comparisons....796
Power Comparisons....798
Impact of Monolithic 3D ICs on Performance of Deep Neural Network Hardware....800
Architectural Impact Discussions....802
CGS-16 and CGS-64 Architecture Comparisons....802
Impact of Workloads....805
Conclusion....806
References....807
Part V Processor Design and Programming Flows....810
23 Architecture Description Languages....811
Contents....811
Introduction....812
A Brief History of ADLs....816
The Classical Era: 1990–2000....817
The First Industrial Era: 2000–2010....817
The Second Industrial Era: 2010–2020....818
Types and Characteristics of ADLs....818
Types of ADLs....818
Characteristics of ADLs....819
Key ADLs....820
MIMOLA....820
EXPRESSION....821
nML....821
LISA....822
PEAS....823
TENSILICA TIE....823
ARC APEX....824
Codasip CodAL....825
Andes ACE....826
RISC-V Chisel....826
ADL-Driven Methodologies....827
Generation of Software Tools....827
Automatic Synthesis of Custom Instructions for an Application....828
Instruction-Set Simulator Generation....829
Generation of Hardware Implementation....831
Top-Down Verification....832
Validation of an ADL Specification....833
Specification-Driven, Simulation-Based Verification....834
Applications of ADL-Based Design....835
Conclusions....837
References....838
24 Accelerator Design with High-Level Synthesis....844
Contents....844
Introduction....845
Background: Technology and Models....847
Target Technology....847
Accelerator Models....849
Accelerator Template....850
Introduction to High-Level Synthesis....851
A Traditional High-Level Synthesis Framework....851
A Bit of History on Commercial Products and Academic Projects....853
From Input Specification to Intermediate Representation....854
Input Specification and Intermediate Representation....854
Analysis and Optimization of the Intermediate Representation....856
Creation of the Microarchitecture....858
Scheduling and Performance Optimization....858
Binding and Resource Optimization....860
Definition of the Memory Architecture....862
Creation of the FSM Controller....866
RTL Generation and System Integration....866
Code Generation, Evaluation, and Verification....866
System-Level Integration and Optimization....867
Open and Modern Challenges....868
Creation of Domain-Specific Architectures....868
Programmability and System-Level Optimization....870
Hardware Security and Data Protection....871
Conclusion....871
References....872
25 Processor Simulation and Characterization....877
Contents....877
Introduction....878
Application and Algorithm Analysis....881
Data Types and Operations....881
Algorithms....882
Example: Affine Transform of 2D Image....883
New or Existing Processor?....884
Existing Processor....884
Extending Configurable Processor....885
New Processor with New ISA....885
Hybrid Mode: New ISA with Custom Extensions....886
Standard Benchmarks....886
Issues with Estimating Processor Performance....886
Whetstone....890
Linpack....890
Dhrystone....890
CoreMark....891
Embench....892
SPEC CPU....893
EEMBC....894
Berkeley Design Technology....894
Summary....894
Using Application Code for Benchmarking....895
Estimation Analysis....895
Examples of Estimation Flow....897
Hardware Aspects....898
Software Aspects....900
Custom Instructions....903
For Further Consideration....903
Processor Simulation....904
Functional Simulation....904
Definition....904
Trace-Driven Cache Simulators and Branch-Prediction Simulators....904
Instruction Mix Analysis....905
Instruction Level Parallelism (ILP)....905
Memory Access Patterns....905
Register-File Usage Analysis....906
Open-Source Simulators....906
Cycle-Level Simulation....907
Definition....907
Performance Analysis....907
Metrics and System Partitioning....907
Optimization....908
Configurability....908
Open-Source Simulators....908
Hardware Emulation....909
Definition....909
Emulation Modes....909
Using Processor Simulators in System Modelling....909
Summary Table Comparing Various CPU Modelling Abstractions....911
Examples....911
Conclusion....913
References....913
26 Methodologies for Design Space Exploration....916
Contents....916
Introduction....917
DSE: The Basic Concepts....918
Two Basic Ingredients of DSE....920
Y-Chart-Based DSE....921
Evaluation of a Single Design Point....923
Simulative Fitness Evaluation....923
Analytical Fitness Evaluation....927
Searching the Design Space....928
GA-Based DSE....929
Optimizing GA-Based DSE....932
Multi-application Workload Models....933
Scenario-Based DSE....934
Application Exploration....938
NAS by Means of Evolutionary Piecemeal Training (EPT)....938
Evolutionary Operators....939
NAS Results....940
Conclusion and Outlook....941
References....943
27 Virtual Prototyping of Processor-Based Platforms....947
Contents....947
Introduction to Virtual Prototypes....949
SoC Design and Verification Overview....949
Historic Background of Virtual Prototyping....951
Virtual Prototyping in the Verification Continuum....952
Use-Cases for Virtual Prototypes....954
Architecture Analysis....956
Macro-architecture Specification....956
HW/SW Performance Optimization and Validation....959
Software Use-Cases....960
Early Software Development....962
Software Regression Testing....962
Hybrid Use-Cases for Software-Driven Functional Verification....963
RTL Co-simulation....964
Hybrid Emulation....964
Hybrid FPGA Prototyping....965
System-Level Power Analysis....965
Summary....966
Building Transaction Level Virtual Prototypes....967
The SystemC Transaction Level Modeling Standard....967
Loosely Timed Modeling Style....969
Extended Loosely Timed Modeling Style....970
Approximately Timed Modeling Style....970
Extended AT....971
TLM-2.0 Summary....972
Building TLM Components for Virtual Prototypes....973
Levels of Abstraction....973
Processor Models....974
TLM Integration of Processor Models....975
TLM Models of Peripheral Components....976
SSD Controller SoC Case Study....977
SSD Controller SoC Introduction....978
Loosely Timed Virtual Prototype of the SSD SoC....979
Accurate Virtual Prototype of SSD SoC....980
SSD Case Study Summary....982
Conclusion and Outlook....982
References....984
28 FPGA-Specific Compilers....988
Contents....988
Introduction....989
Existing HLS Compilers and Programming Models....991
C-Based HLS Tools....992
Dataflow Compilers....993
Domain-Specific Languages (DSLs)....994
Emerging Accelerator Design Languages....995
Key Compiler and Synthesis Optimizations....996
Pipelining Techniques....997
Operator-Level Optimizations....997
Statically Scheduled Pipelining....999
Dynamically Scheduled Pipelining....1001
Parallelization Techniques....1003
Homogeneous Data-Level Parallelism....1004
Heterogeneous Task-Level Parallelism....1005
Memory Customization Techniques....1006
Exploiting Data Reuse....1006
Decoupled Access-Execute....1007
Data Vectorization....1009
Memory Banking....1009
Data Type Customization Techniques....1010
Automatic Bitwidth Optimization....1010
Custom Precision Floating-Point Data Types....1011
Float to Fixed-Point Conversion....1011
Case Study: Binarized Convolutional Neural Networks....1012
Algorithm Overview....1012
Pipelining and Unrolling....1013
Line Buffers and Window Buffers....1015
Data Vectorization....1015
Building the BNN Accelerator Using HeteroCL....1016
Evaluation....1016
Concluding Remarks....1018
References....1019
29 Approximate Computing Architectures....1025
Contents....1025
Approximate Computing....1026
Approximate Arithmetic Components....1028
Design Methodologies for Approximate Components....1028
Manual Approximation Methods....1029
Automated Approximation Methods....1030
Error Metrics and Evaluation Analysis for Approximate Components....1032
Arithmetic Error Metrics....1033
General Error Metrics....1034
Quality Evaluation....1034
Design Methods for Building Approximate Hardware Accelerators: Case Studies for Error-Tolerant Applications....1035
Image and Video Processing Applications....1036
AutoAx Methodology....1036
Results....1040
Deep Neural Networks (DNNs)....1044
ALWANN Methodology....1045
Evaluation and Experiments....1047
Cross-Layer Approximations for Error-Tolerant Applications....1050
Methodology for Combining Hardware- and Software-Level Approximations....1050
Cross-Layer Methodology for Optimizing DNNs....1052
Case Studies for Improving the Energy and Performance Efficiency of DNN Inference....1053
Structured Pruning....1053
Quantization....1055
Hardware-Level Approximations: Impact of Self-Healing and Nonself-Healing Designs on DNN Accuracy....1056
Conclusions....1061
References....1062
30 Parallel Programming Models....1066
Contents....1066
Introduction....1067
Hardware Models....1067
Constructs in Parallel Programming Models....1068
Taxonomy....1069
The OpenMP Programming Model....1071
The Worksharing Model....1073
The Tasking Model....1074
SIMD Support in OpenMP....1076
Vectorization, Intrinsics, and Semi-automatic Vectorization....1077
SIMD Loops....1079
Function Vectorization....1081
The Accelerator Model....1082
The OmpSs-2 Programming Model....1084
Advanced Dependency System....1084
Global Domain of Dependencies....1085
Advanced Dependency Types....1087
Exploiting Structured Parallelism on Many-Core Processors....1088
Optimal Task Granularity....1088
Work-Sharing Task Syntax....1089
Semantics of Work-Sharing Tasks....1089
OmpSs-2 NUMA Support....1090
NUMA-Aware Allocation API....1090
Nanos6 Data-Tracking System....1091
Nanos6 NUMA-Aware Scheduling System....1092
The XiTAO Programming Model and Runtime....1093
Explicit DAG Programming in XiTAO....1093
Software Topologies and Locality-Aware Programming....1095
The Software Topology Mapping....1095
Locality-Aware Moldable Mapping....1096
The XiTAO Data-Parallel Interface....1097
The Asynchronous Data-Parallel Mode....1098
The Synchronous Data-Parallel Mode....1098
The XiTAO Runtime....1099
XiTAO Internals....1099
Configuring the Runtime....1100
Conclusion....1100
References....1101
31 Dataflow Models of Computation for Programming Heterogeneous Multicores....1103
Contents....1103
Introduction....1104
About Models of Computation....1106
Dataflow Models of Computation....1108
Static Dataflow Models....1108
Homogeneous Synchronous Dataflow (HSDF)....1110
Synchronous Dataflow (SDF)....1112
Further Static Extensions....1115
Dynamic Dataflow MoC....1116
Kahn Process Network....1116
Dataflow Process Networks....1117
Relation to Other Dataflow MoCs and Extensions....1118
Reconfigurable Dataflow....1118
πSDF....1118
Other Reconfigurable Dataflow MoC....1120
Optimization of Dataflow Programs....1121
Modeling Heterogeneous Platforms....1122
System-Level Description....1122
Modeling Performance and Energy Consumption....1124
Static Mapping....1125
Hybrid Mapping....1128
Examples: Models and Tools....1130
Dataflow in Commercial and Mainstream Tools....1130
MPSoC Application Programming Studio (MAPS)....1130
Preesm and Spider....1133
Preesm....1134
Spider....1135
Conclusion and Outlook....1136
References....1136
32 Retargetable Compilation....1143
Contents....1143
Introduction and Historical Perspective....1144
Compiler Construction....1145
Compiler Frameworks....1145
Retargetable Compilers....1147
Outline of This Chapter....1149
Anatomy of a Compiler....1149
Intermediate Representations....1149
Compilation Phases and Dependencies....1150
Front End....1151
Middle End....1152
Back End....1155
Linker....1156
Architectural Scope of ASIPs....1157
Parallelism....1158
Specialization....1160
Example....1162
Retargetable Compilers for ASIPs....1163
Processor Intermediate Representations....1166
Retargetable Compiler Optimizations....1167
Front End and Middle End....1169
Code Selection....1172
Register Allocation....1173
Register Assignment....1175
Instruction Scheduling....1176
Conclusions....1181
References....1182
Part VI Test and Verification....1185
33 Verification and Its Role in Design of Modern Computers....1186
Contents....1187
Introduction....1187
Formal Verification, Simulation, and Emulation....1188
Outline of the Section....1189
Section Organization....1190
Bit-Level Model Checking Algorithms....1190
C-to-RTL Equivalence Checking....1190
Symbolic Simulation....1192
Mechanical Theorem Proving....1192
Versatile Binary-Level Concolic Testing....1193
Information Flow Analysis....1193
Verification of Quantum Circuit Design Flows....1194
Discussion....1194
Conclusion....1196
References....1196
34 Bit-Level Model Checking....1198
Contents....1198
Introduction....1199
Preliminaries....1200
Explicit Example: A Simple Counter....1200
Linear Time Temporal Logic....1202
Representing Systems Symbolically....1203
Algorithms for Safety Properties....1207
The Induction Principle....1207
Overview of Model Checking Algorithms....1209
Symbolic Model Checking (with BDDs)....1212
Bounded Model Checking....1213
k-Induction....1214
Interpolation and Model Checking....1216
Interpolation Sequence-Based Model Checking (Isb)....1217
Interpolation-Based Model Checking (Itp)....1218
Property Directed Reachability....1220
Combining Interpolation and Pdr....1223
Summary....1224
Algorithms for Liveness Properties....1224
Introduction....1224
Overview of Model Checking Algorithms....1225
Symbolic Model Checking with BDDs....1225
Liveness-to-Safety Conversion (L2S)....1226
Bounded Liveness Checking....1227
Counter-Based Translation....1227
kLiveness....1227
FAIR....1229
Summary....1230
Design Simplification Techniques....1230
Reductions....1231
Combinational Redundancy Removal....1231
Retiming....1231
Sequential Redundancy Removal....1231
Input Reparameterization....1232
Phase Abstraction....1232
Over-approximations....1232
Proof-Based Abstraction....1233
Counterexample-Guided Abstraction....1233
Other Approaches....1234
Summary....1234
Conclusion....1234
References....1234
35 High-Level Formal Equivalence....1238
Contents....1238
Types of Equivalence to Check....1239
Combinational Equivalence....1240
Sequential Equivalence....1241
Transaction-Based Equivalence....1242
Verification Methodology....1243
Using Design Exercise for Datapath Designs....1252
Advanced Datapath Verification....1254
Managing Inconclusive Proofs....1254
Accuracy Challenges....1256
Accuracy Optimized Component Verification....1259
Proving Faithful Rounding....1260
Proving Monotonicity....1261
Proving Commutativity....1262
References....1263
36 Verification of Arithmetic and Datapath Circuits with Symbolic Simulation....1264
Contents....1264
Introduction....1265
Symbolic Simulation....1265
Symbolic Simulation as Formal Verification....1266
Symbolic Simulation Among Formal Verification Methods....1267
Chapter Outline....1269
Simulation....1270
Booleans and Undefined Values....1270
Circuit Simulation and Undefined Values....1271
Mathematical Model of Circuit Simulation....1274
Circuit Properties....1276
Mathematical Model of Circuit Properties....1277
Symbolic Simulation....1278
Symbolic Computation....1278
Simulation with Symbolic Values....1279
Mathematical Model of Symbolic Simulation....1282
Practical Considerations....1283
Simulation Scope Control....1284
Property Triggers....1284
Scope Reduction by Triggers....1288
Reachable-State Invariants....1290
Complexity Management....1292
Simulation Complexity....1292
Complexity Analysis....1294
Weakening....1295
Verification Flow....1297
Arithmetic Circuits....1299
Direct Verification....1299
Floating-Point Operations....1301
Floating-Point Addition....1302
Integer Multiplication....1303
Floating-Point Multiplication and Fused Multiply-Add....1305
Floating-Point Division and Square Root....1306
Industrial Verification....1309
Related Work....1310
References....1312
37 Microprocessor Assurance and the Role of Theorem Proving....1316
Contents....1316
Introduction....1317
ACL2 Preliminaries....1319
Logic Basics....1320
Extension Principles....1322
The Theorem Prover....1323
Some Execution Features: Guards, MBE, and Stobjs....1324
Intended Domains and Guards....1325
Must Be Equal....1326
Single-Threaded Objects....1327
ISA Analysis....1327
ISA Formalization....1328
Mechanical Analysis for ISA....1329
Binary Code Analysis with ISA Models....1330
Some Formalized ISAs....1331
Analysis of Microarchitecture Properties....1333
Pipelining, Out-of-Order, and Speculative Executions....1333
Pipelining....1333
Interrupts, Out-of-Order and Speculative Execution, Self-Modifying Code, and the Works....1335
Reasoning About Memory Hierarchy....1337
Verification of Execution Units....1338
Deep Dive: Formalization and Analysis of (Simplified) x86....1340
Approach....1340
Design Considerations....1342
Scope....1344
Application: Verifying x86 Instruction Implementations....1345
Ucode Model....1347
Verification of the exec Block....1348
A Candidate Instruction....1348
Verification of the Decode Block....1350
Verification of the Xlate/Ucode Blocks....1351
Discussion....1352
Theorem Proving Beyond Microarchitecture....1353
Conclusion....1353
References....1354
38 Versatile Binary-Level Concolic Testing....1359
Contents....1359
Introduction....1360
Challenges of Classic Symbolic and Concolic Testing....1361
Overview of Versatile Binary-Level Concolic Testing....1361
Background....1362
Symbolic Execution....1362
Concolic Testing....1363
Related Works....1364
The Infrastructure of Versatile Binary-Level Concolic Testing....1365
Design and Architecture....1366
Real-World Examples....1367
Concolic Testing on COTS Linux Kernel Modules....1369
Design and Architecture....1370
Real-World Examples....1372
Concolic Testing for Hardware/Software Co-validation of Systems-on-Chips....1374
Design and Architecture....1374
Real-World Examples....1377
Conclusions....1379
References....1379
39 Information Flow Verification....1383
Contents....1383
Introduction....1384
Information Flow....1385
Information Flow Model....1385
Specifying Information Flow Properties....1389
Information Flow Analysis....1390
Trace Properties and Hyperproperties....1392
Verifying Hyperproperties....1393
Static Analysis....1394
Dynamic Analysis....1396
Verification Tools....1397
Simulation-Based Verification....1397
Formal Verification Methods....1398
Case Studies....1398
Cache Timing Side Channels....1399
Memory Access Control....1402
Conclusion....1404
References....1404
40 Verification of Quantum Circuits....1407
Contents....1407
Introduction....1408
Background....1410
Quantum Computing....1410
Quantum Circuit Compilation....1412
Verification....1414
Classical Circuits....1414
Quantum Circuits....1415
Formal Verification....1417
Decision Diagrams....1418
General Approach....1419
Alternating Approach....1420
Designing a Strategy for Verifying Compilation Flow Results....1421
Simulative Verification....1424
Verification Schemes Based on Simulation....1425
Stimuli Generation Schemes....1426
Resulting Quantum Circuit Equivalence Checking Flow....1430
Conclusions....1432
References....1432
Index....1435
This handbook presents the key topics in computer architecture, from the basic to the most advanced, including software and hardware design methodologies. It provides readers with comprehensive, up-to-date reference information covering single-core processors, multicore processors, application-specific processors, reconfigurable architectures, emerging computing architectures, processor design and programming flows, and test and verification. It serves as a full and quick technical reference, offering a high-level review of computer architecture technology, detailed technical descriptions, and the latest practical applications.
The content is spread over multiple sections, and within each section, individual chapters offer a detailed look at a topic of interest. The chapters are presented in increasing order of conceptual difficulty. They are also cross-linked so that a reader can peruse a chapter with only the necessary prerequisites from selected prior chapters.
In the first section, on single-core processors, three chapters provide background on computer organization, microarchitecture, and communication networks. These are complemented by chapters on operating systems, edge computing, and secure computing architectures, which give the reader a sufficient foundation to move on to the more advanced notions in any of the following sections.
The section on application-specific processors provides valuable insights into the growing demand from application developers for customized architectures, also referred to as co-processors or accelerators. From a wide range of application segments, multimedia processing, scientific computing, machine learning, and cryptographic workloads are chosen for coverage here. Since these applications depend heavily on digital arithmetic, a short overview of its concepts is presented as well. Multimedia, machine learning, and several other domain-specific architectures are known to be affected, for better or worse, by the device-level faults that appear in advanced technology nodes. This is discussed in the section on fault-tolerant architectures.
Various application-specific and general-purpose processors come together in the rich tapestry of modern Systems-on-Chip (SoCs). This also broadens the notion of architecture significantly by offering reconfigurability as a property. Multicore SoCs and reconfigurable architectures are studied in a dedicated section, covering general-purpose multicore architectures, Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Furthermore, readers are invited to delve into Coarse-Grained Reconfigurable Architectures (CGRAs), dynamic and partial reconfigurability, and power management challenges for multicore systems.
Advances in technology offer various capabilities to modern architects. These are studied in the section on Emerging Computing Architectures, which covers compute-in-memory architectures, architectures for microfluidic biochips, quantum computing, and architectures that benefit from 3D ICs. The complexity of modern computer architectures can only be managed with the help of powerful design automation flows. This is discussed in the section on Processor Design and Programming Flows. The introductory chapters on parallel programming models and dataflow models help the reader become familiar with the abstract notions needed to grasp the design automation concepts. This foundation leads on to methodologies for design space exploration, followed by specific tool flows, as elaborated in the chapters on architecture description languages, high-level synthesis, processor simulation, and virtual prototyping. For customizable, application-specific, and reconfigurable architectures, compilation flows play a critical role in extracting maximum efficiency from the computing fabric. These are discussed in two chapters on FPGA-specific compilers and retargetable compilers. Balancing technology constraints all the way up to the application layer is a complex design automation challenge, which is discussed in the chapter on approximate computing architectures.
The last section of this volume brings forth the classic and modern techniques for testing and verification of computer architectures.