SlideShare ist ein Scribd-Unternehmen logo
1 von 67
@alblue©2020 Alex Blewitt
Understanding CPU
Microarchitecture
🏎For Maximum Performance
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
microarchitecture-modern-cpu/
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
@alblue©2020 Alex Blewitt
Overview
• What happens inside a CPU?

• Where do CPU intensive programs get delayed?

• What tools are there to help measure performance bottlenecks?

• How can we make programs run faster?
@alblue©2020 Alex Blewitt
distributed architecture
system architecture
algorithm
hardware
cpu
memory
inst
Performance Pyramid
This talk
Other QCon

talks
https://www.infoq.com/qconlondon2020/
@alblue©2020 Alex Blewitt
DMI x4
**
Platform Topologies
8S Configuration
SKLSKL
LBG
LBG
LBG
DMI
LBG
SKLSKL
SKLSKL
SKLSKL
3x16
PCIe*
4S Configurations
SKLSKL
SKLSKL
2S Configurations
SKLSKL
(4S-2UPI & 4S-3UPI shown)
(2S-2UPI & 2S-3UPI shown)
Intel®
UPI
LBG 3x16
PCIe* 1x100G
Intel® OP Fabric
3x16
PCIe* 1x100G
Intel® OP Fabric
LBGLBG
LBG
DMI
3x16
PCIe*
This slide under embargo until 9:15 AM PDT July 11, 2017
Intel®Xeon®ScalableProcessorsupports
configurationsrangingfrom2S-2UPIto8S
Non Uniform Memory Architecture (NUMA)
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
@alblue©2020 Alex Blewitt
DMI x4
**
Platform Topologies
8S Configuration
SKLSKL
LBG
LBG
LBG
DMI
LBG
SKLSKL
SKLSKL
SKLSKL
3x16
PCIe*
4S Configurations
SKLSKL
SKLSKL
2S Configurations
SKLSKL
(4S-2UPI & 4S-3UPI shown)
(2S-2UPI & 2S-3UPI shown)
Intel®
UPI
LBG 3x16
PCIe* 1x100G
Intel® OP Fabric
3x16
PCIe* 1x100G
Intel® OP Fabric
LBGLBG
LBG
DMI
3x16
PCIe*
This slide under embargo until 9:15 AM PDT July 11, 2017
Intel®Xeon®ScalableProcessorsupports
configurationsrangingfrom2S-2UPIto8S
Non Uniform Memory Architecture (NUMA)
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
@alblue©2020 Alex Blewitt
DMI x4
**
Platform Topologies
8S Configuration
SKLSKL
LBG
LBG
LBG
DMI
LBG
SKLSKL
SKLSKL
SKLSKL
3x16
PCIe*
4S Configurations
SKLSKL
SKLSKL
2S Configurations
SKLSKL
(4S-2UPI & 4S-3UPI shown)
(2S-2UPI & 2S-3UPI shown)
Intel®
UPI
LBG 3x16
PCIe* 1x100G
Intel® OP Fabric
3x16
PCIe* 1x100G
Intel® OP Fabric
LBGLBG
LBG
DMI
3x16
PCIe*
This slide under embargo until 9:15 AM PDT July 11, 2017
Intel®Xeon®ScalableProcessorsupports
configurationsrangingfrom2S-2UPIto8S
Non Uniform Memory Architecture (NUMA)
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
@alblue©2020 Alex Blewitt
DMI x4
**
Platform Topologies
8S Configuration
SKLSKL
LBG
LBG
LBG
DMI
LBG
SKLSKL
SKLSKL
SKLSKL
3x16
PCIe*
4S Configurations
SKLSKL
SKLSKL
2S Configurations
SKLSKL
(4S-2UPI & 4S-3UPI shown)
(2S-2UPI & 2S-3UPI shown)
Intel®
UPI
LBG 3x16
PCIe* 1x100G
Intel® OP Fabric
3x16
PCIe* 1x100G
Intel® OP Fabric
LBGLBG
LBG
DMI
3x16
PCIe*
This slide under embargo until 9:15 AM PDT July 11, 2017
Intel®Xeon®ScalableProcessorsupports
configurationsrangingfrom2S-2UPIto8S
Non Uniform Memory Architecture (NUMA)
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
@alblue©2020 Alex Blewitt
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
12
@alblue©2020 Alex Blewitt
18
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
@alblue©2020 Alex Blewitt
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
@alblue©2020 Alex Blewitt
10
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
Cascade/Skylake 10-core die
@alblue©2020 Alex Blewitt
18
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
Cascade/Skylake 18-core die
@alblue©2020 Alex Blewitt
Sub NUMA cluster 1Sub NUMA cluster 0
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
Cascade/Skylake 28-core die
@alblue©2020 Alex Blewitt
https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
Cascade 56 core ‘die’
@alblue©2020 Alex Blewitt
Cascade 56 core package
Package
Die
Die
@alblue©2020 Alex Blewitt
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
Memory and Cache ($)
Register file
180 Integer
168 Floating Point
L1 Data (L1D$)
32 KiB 8-way
L1 Instruction (L1I$)
32 KiB 8-way
L2$
1 MiB 16-way
Inclusive
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
Information for Cascade/Skylake systems
RAM
RAM
RAM
RAM
1🕐
4🕐 4🕐
40🕐
12🕐
50🕐
150🕐
300🕐
Clock
Cycles
@alblue©2020 Alex Blewitt
lstopo --no-io
Machine (16GB)
Package P#0
L4 (128MB)
L3 (6144KB)
L2 (256KB)
L1d (32KB)
L1i (32KB)
Core P#0
PU P#0
PU P#4
L2 (256KB)
L1d (32KB)
L1i (32KB)
Core P#1
PU P#1
PU P#5
L2 (256KB)
L1d (32KB)
L1i (32KB)
Core P#2
PU P#2
PU P#6
L2 (256KB)
L1d (32KB)
L1i (32KB)
Core P#3
PU P#3
PU P#7
Shared memory
between CPU and
GPU
HyperThreads
Single socket
system
Cache levels
Four core processor
@alblue©2020 Alex Blewitt
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
Memory and Cache ($)
Register file
180 Integer
168 Floating Point
L1 Data (L1D$)
32 KiB 8-way
L1 Instruction (L1I$)
32 KiB 8-way
L2$
1 MiB 16-way
Inclusive
L3$ (LLC)
1.375 MiB 11-way
Non-inclusive
Data TLB
4K: 128 8-way
2M/4M: 8/T assoc
Instruction TLB
4: 64 4-way
2M/4M: 32 4-way
1G: 4 4-way
STLB
4K/2M: 1536 12-way
1G: 16 4-way
RAM
RAM
RAM
RAM
Virtual Physical PCID
00008000(1234) 5e38450c(1234) 10
00008000(1234) 48656c6f(1234) 20
fffffffffffb(8080) 2345ffffffb(8080) 0
1🕐
4🕐 4🕐
40🕐
12🕐
50🕐
150🕐
300🕐
Clock
Cycles
Information for Cascade/Skylake systems
grep /proc/cpuinfo for pcid ↑
@alblue©2020 Alex Blewitt
Memory Pages
8000
ffaa
ffbb
f000
0000CR3
0000
ffff
7fff CR3
0000
ffff
7fff
8000
f000
Two layer page table structure shown
x86_64 has 4 level paging (48 bits, 256TiB virtual, 64TiB real)
Ice Lake processors support 5 level paging (57 bits, 128Pb virtual, 4PiB real)
0x000080001234 0x000080001234
Pages can be
4k, 2M or 1G
@alblue©2020 Alex Blewitt
Huge Pages
0000
Pages can be
4K, 2M or 1G
grep /proc/cpuinfo
pse: 2M support

pdpe1g: 1G support
# Better use of TLB
$ More complex to set up
# Fewer memory cache misses
$ May waste memory
$ Hugetblfs needs to be configured
@alblue©2020 Alex Blewitt
$ Hugetblfs $
• Requires kernel configuration to reserve memory ahead of time

• Boot parameter hugepages=N puts aside memory for huge page use

• Boot parameter hugepagesz={2M,1G} specifies huge page size

• Requires a hugetblfs mount to be provided

• Requires root (or suitably permissioned app) to use hugepages
@alblue©2020 Alex Blewitt
# Transparent Huge Pages $
• Does not require boot time configuration or special permissions

• khugepaged assembles contiguous physical memory for large pages

• Default page size is still 4k, but processes can madvise() use of large pages

• Allows specific apps to opt-in on demand

• Benefits of smaller TLB with less wasted memory

# echo madvise > /sys/kernel/mm/transparent_hugepages/enabled
# echo defer > /sys/kernel/mm/transparent_hugepage/defrag
Defer instead of blocking large page request
Enable opt-in
through use
of madvise
💡
@alblue©2020 Alex Blewitt
Cache lines, loads and stores
• Unit of granularity of a cache entry is 64 bytes (512 bits)

• Even if you only read/write 1 byte you're writing 64 bytes

• Cache lines can generally be in different states:

➡ M – exclusively owned by that core, and modified (dirty)

➡ E – exclusively owned by that core, but not modified

➡ S – shared read-only with other cores

➡ I – invalid, cache line not used
@alblue©2020 Alex Blewitt
Memory prefetching (CPU)
CPU issues
automatic prefetch
for streamed data
Also notices
striding by certain
amounts as well
Can also use
__builtin_prefetch

to explicitly suggest
prefetching memory
elsewhere but needs to be
a measured improvement
💡
@alblue©2020 Alex Blewitt
False sharing
• Two cores trying to write to bytes in the same cache-line will thrash

• First thread will try to acquire exclusive ownership of cache line

• Second thread (on different core) will try to do the same

• Updates may be lost if writes are not atomic *

• Performance will suffer when cache line repeatedly moved

• Avoid by padding to at least cacheline size * 2 (128 bytes) for writes
Thread 1
data[0] = 'A'
Thread 2
data[7] = 'C'
* UPDATE: This confused more than it helped. Synchronisation and false sharing are orthogonal.
@alblue©2020 Alex Blewitt
Memory performance strategies
• Ensure data fits in L1/L2/L3 cache where possible

• Stream or stride through data in a single pass if possible

• Consider pivoting data (array-of-structs or structs-of-arrays)

• Add padding for multi-threaded contended writes

• Prefer thread-local or cpu-local accumulators with final merge step

• Compress data where practical (compressed pointers)
💡
@alblue©2020 Alex Blewitt
Pinning memory/threads
• Pinning memory or threads to a particular core can improve performance

• Reduces intra-core memory ownership traffic

• Less likely to have cache invalidations

• isolcpu allows reservation of CPUs for non-kernel use with cpusets

• taskset allows binding of a process to specific cores

• numactl allows cores/memory to be clamped for a process

• libnuma has additional affinity settings for programmatic use
@alblue©2020 Alex Blewitt
Frontend
Core
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
Backend
x86_64
µop
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
55 48 89 e5 fe 04 25 d2
04 00 00 41 6c 42 6c 75
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
55|48 89 e5|fe 04 25 d2
04 00 00|41 6c 42 6c 75
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
push %rbp
mov %rsp, %rbp
?
incb 0x4d2
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way incb 0x4d2incb 0x4d2incb 0x4d2
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
TMP = [0x4d2]
INC TMP
[0x4d2] = TMP
@alblue©2020 Alex Blewitt
Core
x86_64
Pre-decode Instructions µop decoders
µop cache loop decode
branch
prediction
Backend
µop
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
TMP = [0x4d2] INC TMP [0x4d2] = TMP
@alblue©2020 Alex Blewitt
Branch Prediction ⤵🧐
• Correct 95% of the time

• Queues up instructions assuming the branch has been taken

• Learns patterns in code based on existing behaviour

• Iterating through predictable (sorted) data may be more efficient

• Throws away inaccurate work if incorrect

• May cause observable side channel behaviour e.g. cache invalidation 👻
cmp eax,42; jne
@alblue©2020 Alex Blewitt
Branch Target Predictor 🎯
• Predicts where the target is going if taken

• Hard coded addresses/offsets always predictable

• Jump to location of register may be more difficult

• Often seen when jumping through object oriented code

• Inlining is the master optimisation because it avoids unknowable branches
jmp [eax]
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
TMP = [0x4d2] INC TMP [0x4d2] = TMP
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
R99 = [0x4d2] INC R99 [0x4d2] = R99
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
R99 = [0x4d2] INC R99
[0x4d2] = R99
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
R99 = 2A INC R99
[0x4d2] = R99
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
R99 = 2A
INC R99
[0x4d2] = R99
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
INC R99
[0x4d2] = R99
R99 = 2B
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
INC R99R99 = 2B[0x4d2] = 2B
@alblue©2020 Alex Blewitt
Core
Allocate
Rename
Retire
load buffer
store buffer
register files
2
3
4
7
8
9
0
1
5
6
Scheduler
Integer Unit Floating Unit
ALU LEA Shift Branch ALU FMA Shift Divide
ALU LEA Multiply Divide ALU FMA Shift Shuffle
ALU LEA Multiply ALU FMA Shuffle
ALU LEA Shift Branch
Execution units added in Ice Lake
Port 0 and 1 can be fused for a 512 bit operation
Port 5 is a 512 bit wide operation
All others handle 256 bits
Port 8 and 9
added in Ice Lake
Address
generation
reorder
buffer
Frontend
L1 Data
32 KiB 8-way
L1 Instruction
32 KiB 8-way
INC R99R99 = 2B[0x4d2] = 2B
@alblue©2020 Alex Blewitt
perf
• Linux perf (compiled from linux/tools/perf, or from linux-tools/linux-perf)

• Running in Docker requires compilation from source

• Commands available

• record – record execution performance for process/pid

• report – generate a report from prior recording

• annotate – annotate a report from a prior recording

• stat – record performance counters for process/pid
https://perf.wiki.kernel.org
https://github.com/alblue/scripts/blob/master/perf-Dockerfile
@alblue©2020 Alex Blewitt
perf record
• Perf record will sample the process(es) and generate stack traces

• Events may be skewed from their location

• Improve accuracy with :p, :pp or :ppp suffix to event

• Can capture branches, last branch records or use processor tracing

• perf record -b program
• perf record --call-graph lbr -j any_call,any_ret program
• perf record -e intel_pt//u program
https://lwn.net/Articles/680985/
https://lwn.net/Articles/680996/
@alblue©2020 Alex Blewitt
perf stat
$ perf stat base64 <(echo hello)
d29ybGQK
Performance counter stats for 'base64 /dev/fd/63':
0.341382 task-clock (msec) # 0.649 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
65 page-faults # 0.190 M/sec
1,218,176 cycles # 3.568 GHz
811,468 stalled-cycles-frontend # 66.61% frontend cycles idle
855,999 instructions # 0.70 insn per cycle
# 0.95 stalled cycles per insn
169,032 branches # 495.140 M/sec
8,883 branch-misses # 5.26% of all branches
0.000526160 seconds time elapsed
https://perf.wiki.kernel.org
IPC
> 4 *
< 1 +
@alblue©2020 Alex Blewitt
Performance counters
• Intel cores have a few dedicated and programmable counters

• Instruction cycles, branches, branch misses …

• Counters can be multiplexed (read X for 1µs, read Y for 1µs)

• Programmable counters can be set to specific measurements

• iTLB-load-misses, LLC-load-misses, uops_dispatched_port.port_5 ...

• Undocumented performance counters can be specified with events

• cpu/event=0x3c,umask=0x0,any=1/
@alblue©2020 Alex Blewitt
19
Locating Issues
Have Precise events for sampling
Precise events added in Skylake
Top-down Microarchitecture Analysis
https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture
Ahmed Yasin
@alblue©2020 Alex Blewitt
Top-down Analysis Method
USING PERFORMANCE MONITORING EVENTS
Additionally, the metric uses the UOPS_ISSUED.ANY, which is common in recent Intel microarchitec-
tures, as the denominator. The UOPS_ISSUED.ANY event counts the total number of Uops that the RAT
issues to RS.
The VectorMixRate metric gives the percentage of injected blend uops out of all uops issued. Usually a
VectorMixRate over 5% is worth investigating.
VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY
Note the actual penalty may vary as it stems from the additional data-dependency on the destination
register the injected blend operations add.
B.2 PERFORMANCE MONITORING AND MICROARCHITECTURE
This section provides information of performance monitoring hardware and terminology related to the
Silvermont, Airmont and Goldmont microarchitectures. The features described here may be specific to
individual microarchitecture, as indicated in Table B-1.
Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture
USING PERFORMANCE MONITORING EVENTS
The single entry point of division at a pipeline’s issue-stage (allocation-stage) makes the four categories
additive to the total possible slots. The classification at slots granularity (sub-cycle) makes the break-
down very accurate and robust for superscalar cores, which is a necessity at the top-level.
Figure B-2. TMAM’s Top Level Drill Down Flowchart
https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual
Ahmed Yasin
@alblue©2020 Alex Blewitt
perf stat --topdown
$ perf stat -a --topdown sleep 1
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
Performance counter stats for 'system wide':
retiring bad speculation frontend bound backend
bound
S0-C0 2 15.3% 2.8% 32.1% 49.9%
S0-C1 2 23.3% 4.0% 27.3% 45.4%
S0-C2 2 15.2% 2.9% 29.8% 52.1%
S0-C3 2 16.7% 0.0% 31.8% 51.5%
S0-C4 2 35.7% 10.7% 26.2% 27.4%
S0-C5 2 14.9% 2.5% 34.1% 48.5%
1.000889285 seconds time elapsed
@alblue©2020 Alex Blewitt
Toplev PMU tools
• Andi Kleen has written toplev.py which allows top-down analysis

• Initial download caches processor information from download.01.org

• Uses perf to record stats, but with custom event filters

• If workload is repeatable, can use --no-multiplex to repeat results

• Run with -l1, see if issues are present, run with -l2 ...
https://github.com/andikleen/pmu-tools/wiki/toplev-manual
@alblue©2020 Alex Blewitt
toplev.py --single-thread
$ dd if=/dev/urandom of=/tmp/rand bs=4096 count=4096
$ ./toplev.py --single-thread --no-multiplex -l1 -- base64 /tmp/rand > /dev/null
# 3.6-full on Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
BE Backend_Bound % Slots 24.07 <==


$ ./toplev.py --single-thread --no-multiplex -l2 -- base64 /tmp/rand > /dev/null
BE Backend_Bound % Slots 23.82
BE/Core Backend_Bound.Core_Bound % Slots 16.08 <==


$ ./toplev.py --single-thread --no-multiplex -l3 -- base64 /tmp/rand > /dev/null
BE Backend_Bound % Slots 23.96
BE/Core Backend_Bound.Core_Bound % Slots 16.35
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 24.51 <==
@alblue©2020 Alex Blewitt
Cache line
Instruction layout
Before
After
Error
Is Error?
Before
After
Error
Is Error?
__builtin_expect(error,0)__builtin_expect(error,1)
@alblue©2020 Alex Blewitt
Cache line
Cache line
Loop stream detector
Good
Loop
Bad
Loop
32 bit
aliognment
Align with 

-mllvm -align-all-nofallthru-blocks=5

-mllvm -align-all-functions=5
@alblue©2020 Alex Blewitt
Facebook BOLT
https://arxiv.org/abs/1807.06735
Figure 9: Heat maps for instruction memory accesses of the HHVM binary, without and with BOLT. Heat is a log scale.
Executed
instructions are
distributed across
icache space
After sorting basic
blocks guided by
profiling data, the
icache space is
defragmented
https://github.com/facebookincubator/BOLT
@alblue©2020 Alex Blewitt
SIMD JSON parser
https://github.com/lemire/simdjson
https://arxiv.org/abs/1902.08318
@alblue©2020 Alex Blewitt
Summary: Memory
• Use cacheline-aligned or cacheline-aware data structures

• Compress data in memory and decompress on the fly

• Avoid random memory access when possible

• Configure huge pages and use madvise & defer

• Partition memory with libnuma for data locality
@alblue©2020 Alex Blewitt
Summary: CPU
• Each CPU is its own networked mesh cluster

• Branch speculation and memory/TLB misses are costly

• Use branch free and lock free algorithms when possible

• Analyse perf counters with top down architectural analysis

• Use (auto)vectorisation and use XMM/YMM/ZMM when sensible
@alblue©2020 Alex Blewitt
References
https://alblue.bandlem.com/

https://arxiv.org/abs/1807.06735 → https://github.com/facebookincubator/BOLT

https://arxiv.org/abs/1902.08318 → https://github.com/lemire/simdjson/

https://github.com/andikleen/pmu-tools/wiki/toplev-manual

https://lwn.net/Articles/680985/ && https://lwn.net/Articles/680996/ 

https://perf.wiki.kernel.org 

https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-
A_1000.pdf

https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual 

https://www.researchgate.net/publication/269302126_A_Top-
Down_method_for_performance_analysis_and_counters_architecture
@alblue©2020 Alex Blewitt
Links
https://easyperf.net/notes/

https://epickrram.blogspot.com/

https://groups.google.com/forum/#!forum/mechanical-sympathy/

https://lemire.me/en/

https://psy-lob-saw.blogspot.com/ 

https://richardstartin.github.io/

https://travisdowns.github.io/ 

https://www.agner.org/optimize/

https://www.real-logic.co.uk/
@alblue©2020 Alex Blewitt
Thank you
🗒 https://alblue.bandlem.com

🦉 https://twitter.com/alblue

🐙 https://github.com/alblue

📺 https://vimeo.com/alblue

📇 https://speakerdeck.com/alblue
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
microarchitecture-modern-cpu/

Weitere ähnliche Inhalte

Mehr von C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

Mehr von C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Kürzlich hochgeladen

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Kürzlich hochgeladen (20)

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Understanding CPU Microarchitecture to Increase Performance

  • 1. @alblue©2020 Alex Blewitt Understanding CPU Microarchitecture 🏎For Maximum Performance
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ microarchitecture-modern-cpu/ • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. @alblue©2020 Alex Blewitt Overview • What happens inside a CPU? • Where do CPU intensive programs get delayed? • What tools are there to help measure performance bottlenecks? • How can we make programs run faster?
  • 5. @alblue©2020 Alex Blewitt distributed architecture system architecture algorithm hardware cpu memory inst Performance Pyramid This talk Other QCon talks https://www.infoq.com/qconlondon2020/
  • 6. @alblue©2020 Alex Blewitt DMI x4 ** Platform Topologies 8S Configuration SKLSKL LBG LBG LBG DMI LBG SKLSKL SKLSKL SKLSKL 3x16 PCIe* 4S Configurations SKLSKL SKLSKL 2S Configurations SKLSKL (4S-2UPI & 4S-3UPI shown) (2S-2UPI & 2S-3UPI shown) Intel® UPI LBG 3x16 PCIe* 1x100G Intel® OP Fabric 3x16 PCIe* 1x100G Intel® OP Fabric LBGLBG LBG DMI 3x16 PCIe* This slide under embargo until 9:15 AM PDT July 11, 2017 Intel®Xeon®ScalableProcessorsupports configurationsrangingfrom2S-2UPIto8S Non Uniform Memory Architecture (NUMA) https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  • 7. @alblue©2020 Alex Blewitt DMI x4 ** Platform Topologies 8S Configuration SKLSKL LBG LBG LBG DMI LBG SKLSKL SKLSKL SKLSKL 3x16 PCIe* 4S Configurations SKLSKL SKLSKL 2S Configurations SKLSKL (4S-2UPI & 4S-3UPI shown) (2S-2UPI & 2S-3UPI shown) Intel® UPI LBG 3x16 PCIe* 1x100G Intel® OP Fabric 3x16 PCIe* 1x100G Intel® OP Fabric LBGLBG LBG DMI 3x16 PCIe* This slide under embargo until 9:15 AM PDT July 11, 2017 Intel®Xeon®ScalableProcessorsupports configurationsrangingfrom2S-2UPIto8S Non Uniform Memory Architecture (NUMA) https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  • 8. @alblue©2020 Alex Blewitt DMI x4 ** Platform Topologies 8S Configuration SKLSKL LBG LBG LBG DMI LBG SKLSKL SKLSKL SKLSKL 3x16 PCIe* 4S Configurations SKLSKL SKLSKL 2S Configurations SKLSKL (4S-2UPI & 4S-3UPI shown) (2S-2UPI & 2S-3UPI shown) Intel® UPI LBG 3x16 PCIe* 1x100G Intel® OP Fabric 3x16 PCIe* 1x100G Intel® OP Fabric LBGLBG LBG DMI 3x16 PCIe* This slide under embargo until 9:15 AM PDT July 11, 2017 Intel®Xeon®ScalableProcessorsupports configurationsrangingfrom2S-2UPIto8S Non Uniform Memory Architecture (NUMA) https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  • 9. @alblue©2020 Alex Blewitt DMI x4 ** Platform Topologies 8S Configuration SKLSKL LBG LBG LBG DMI LBG SKLSKL SKLSKL SKLSKL 3x16 PCIe* 4S Configurations SKLSKL SKLSKL 2S Configurations SKLSKL (4S-2UPI & 4S-3UPI shown) (2S-2UPI & 2S-3UPI shown) Intel® UPI LBG 3x16 PCIe* 1x100G Intel® OP Fabric 3x16 PCIe* 1x100G Intel® OP Fabric LBGLBG LBG DMI 3x16 PCIe* This slide under embargo until 9:15 AM PDT July 11, 2017 Intel®Xeon®ScalableProcessorsupports configurationsrangingfrom2S-2UPIto8S Non Uniform Memory Architecture (NUMA) https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  • 15. @alblue©2020 Alex Blewitt Sub NUMA cluster 1Sub NUMA cluster 0 https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf Cascade/Skylake 28-core die
  • 17. @alblue©2020 Alex Blewitt Cascade 56 core package Package Die Die
  • 18. @alblue©2020 Alex Blewitt L3$ (LLC) 1.375 MiB 11-way Non-inclusive L3$ (LLC) 1.375 MiB 11-way Non-inclusive L3$ (LLC) 1.375 MiB 11-way Non-inclusive Memory and Cache ($) Register file 180 Integer 168 Floating Point L1 Data (L1D$) 32 KiB 8-way L1 Instruction (L1I$) 32 KiB 8-way L2$ 1 MiB 16-way Inclusive L3$ (LLC) 1.375 MiB 11-way Non-inclusive Information for Cascade/Skylake systems RAM RAM RAM RAM 1🕐 4🕐 4🕐 40🕐 12🕐 50🕐 150🕐 300🕐 Clock Cycles
  • 19. @alblue©2020 Alex Blewitt lstopo --no-io Machine (16GB) Package P#0 L4 (128MB) L3 (6144KB) L2 (256KB) L1d (32KB) L1i (32KB) Core P#0 PU P#0 PU P#4 L2 (256KB) L1d (32KB) L1i (32KB) Core P#1 PU P#1 PU P#5 L2 (256KB) L1d (32KB) L1i (32KB) Core P#2 PU P#2 PU P#6 L2 (256KB) L1d (32KB) L1i (32KB) Core P#3 PU P#3 PU P#7 Shared memory between CPU and GPU HyperThreads Single socket system Cache levels Four core processor
  • 20. @alblue©2020 Alex Blewitt L3$ (LLC) 1.375 MiB 11-way Non-inclusive L3$ (LLC) 1.375 MiB 11-way Non-inclusive L3$ (LLC) 1.375 MiB 11-way Non-inclusive Memory and Cache ($) Register file 180 Integer 168 Floating Point L1 Data (L1D$) 32 KiB 8-way L1 Instruction (L1I$) 32 KiB 8-way L2$ 1 MiB 16-way Inclusive L3$ (LLC) 1.375 MiB 11-way Non-inclusive Data TLB 4K: 128 8-way 2M/4M: 8/T assoc Instruction TLB 4: 64 4-way 2M/4M: 32 4-way 1G: 4 4-way STLB 4K/2M: 1536 12-way 1G: 16 4-way RAM RAM RAM RAM Virtual Physical PCID 00008000(1234) 5e38450c(1234) 10 00008000(1234) 48656c6f(1234) 20 fffffffffffb(8080) 2345ffffffb(8080) 0 1🕐 4🕐 4🕐 40🕐 12🕐 50🕐 150🕐 300🕐 Clock Cycles Information for Cascade/Skylake systems grep /proc/cpuinfo for pcid ↑
  • 21. @alblue©2020 Alex Blewitt Memory Pages 8000 ffaa ffbb f000 0000CR3 0000 ffff 7fff CR3 0000 ffff 7fff 8000 f000 Two layer page table structure shown x86_64 has 4 level paging (48 bits, 256TiB virtual, 64TiB real) Ice Lake processors support 5 level paging (57 bits, 128Pb virtual, 4PiB real) 0x000080001234 0x000080001234 Pages can be 4k, 2M or 1G
  • 22. @alblue©2020 Alex Blewitt Huge Pages 0000 Pages can be 4K, 2M or 1G grep /proc/cpuinfo pse: 2M support pdpe1g: 1G support # Better use of TLB $ More complex to set up # Fewer memory cache misses $ May waste memory $ Hugetblfs needs to be configured
  • 23. @alblue©2020 Alex Blewitt $ Hugetblfs $ • Requires kernel configuration to reserve memory ahead of time • Boot parameter hugepages=N puts aside memory for huge page use • Boot parameter hugepagesz={2M,1G} specifies huge page size • Requires a hugetblfs mount to be provided • Requires root (or suitably permissioned app) to use hugepages
  • 24. @alblue©2020 Alex Blewitt # Transparent Huge Pages $ • Does not require boot time configuration or special permissions • khugepaged assembles contiguous physical memory for large pages • Default page size is still 4k, but processes can madvise() use of large pages • Allows specific apps to opt-in on demand • Benefits of smaller TLB with less wasted memory # echo madvise > /sys/kernel/mm/transparent_hugepages/enabled # echo defer > /sys/kernel/mm/transparent_hugepage/defrag Defer instead of blocking large page request Enable opt-in through use of madvise 💡
  • 25. @alblue©2020 Alex Blewitt Cache lines, loads and stores • Unit of granularity of a cache entry is 64 bytes (512 bits) • Even if you only read/write 1 byte you're writing 64 bytes • Cache lines can generally be in different states: ➡ M – exclusively owned by that core, and modified (dirty) ➡ E – exclusively owned by that core, but not modified ➡ S – shared read-only with other cores ➡ I – invalid, cache line not used
  • 26. @alblue©2020 Alex Blewitt Memory prefetching (CPU) CPU issues automatic prefetch for streamed data Also notices striding by certain amounts as well Can also use __builtin_prefetch to explicitly suggest prefetching memory elsewhere but needs to be a measured improvement 💡
  • 27. @alblue©2020 Alex Blewitt False sharing • Two cores trying to write to bytes in the same cache-line will thrash • First thread will try to acquire exclusive ownership of cache line • Second thread (on different core) will try to do the same • Updates may be lost if writes are not atomic * • Performance will suffer when cache line repeatedly moved • Avoid by padding to at least cacheline size * 2 (128 bytes) for writes Thread 1 data[0] = 'A' Thread 2 data[7] = 'C' * UPDATE: This confused more than it helped. Synchronisation and false sharing are orthogonal.
  • 28. @alblue©2020 Alex Blewitt Memory performance strategies • Ensure data fits in L1/L2/L3 cache where possible • Stream or stride through data in a single pass if possible • Consider pivoting data (array-of-structs or structs-of-arrays) • Add padding for multi-threaded contended writes • Prefer thread-local or cpu-local accumulators with final merge step • Compress data where practical (compressed pointers) 💡
  • 29. @alblue©2020 Alex Blewitt Pinning memory/threads • Pinning memory or threads to a particular core can improve performance • Reduces intra-core memory ownership traffic • Less likely to have cache invalidations • isolcpu allows reservation of CPUs for non-kernel use with cpusets • taskset allows binding of a process to specific cores • numactl allows cores/memory to be clamped for a process • libnuma has additional affinity settings for programmatic use
  • 30. @alblue©2020 Alex Blewitt Frontend Core L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way Backend x86_64 µop
  • 31. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way
  • 32. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way 55 48 89 e5 fe 04 25 d2 04 00 00 41 6c 42 6c 75
  • 33. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way 55|48 89 e5|fe 04 25 d2 04 00 00|41 6c 42 6c 75
  • 34. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way push %rbp mov %rsp, %rbp ? incb 0x4d2
  • 35. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way incb 0x4d2incb 0x4d2incb 0x4d2
  • 36. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way TMP = [0x4d2] INC TMP [0x4d2] = TMP
  • 37. @alblue©2020 Alex Blewitt Core x86_64 Pre-decode Instructions µop decoders µop cache loop decode branch prediction Backend µop L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way TMP = [0x4d2] INC TMP [0x4d2] = TMP
  • 38. @alblue©2020 Alex Blewitt Branch Prediction ⤵🧐 • Correct 95% of the time • Queues up instructions assuming the branch has been taken • Learns patterns in code based on existing behaviour • Iterating through predictable (sorted) data may be more efficient • Throws away inaccurate work if incorrect • May cause observable side channel behaviour e.g. cache invalidation 👻 cmp eax,42; jne
  • 39. @alblue©2020 Alex Blewitt Branch Target Predictor 🎯 • Predicts where the target is going if taken • Hard coded addresses/offsets always predictable • Jump to location of register may be more difficult • Often seen when jumping through object oriented code • Inlining is the master optimisation because it avoids unknowable branches jmp [eax]
  • 40. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way
  • 41. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way TMP = [0x4d2] INC TMP [0x4d2] = TMP
  • 42. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way R99 = [0x4d2] INC R99 [0x4d2] = R99
  • 43. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way R99 = [0x4d2] INC R99 [0x4d2] = R99
  • 44. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way R99 = 2A INC R99 [0x4d2] = R99
  • 45. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way R99 = 2A INC R99 [0x4d2] = R99
  • 46. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way INC R99 [0x4d2] = R99 R99 = 2B
  • 47. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way INC R99R99 = 2B[0x4d2] = 2B
  • 48. @alblue©2020 Alex Blewitt Core Allocate Rename Retire load buffer store buffer register files 2 3 4 7 8 9 0 1 5 6 Scheduler Integer Unit Floating Unit ALU LEA Shift Branch ALU FMA Shift Divide ALU LEA Multiply Divide ALU FMA Shift Shuffle ALU LEA Multiply ALU FMA Shuffle ALU LEA Shift Branch Execution units added in Ice Lake Port 0 and 1 can be fused for a 512 bit operation Port 5 is a 512 bit wide operation All others handle 256 bits Port 8 and 9 added in Ice Lake Address generation reorder buffer Frontend L1 Data 32 KiB 8-way L1 Instruction 32 KiB 8-way INC R99R99 = 2B[0x4d2] = 2B
  • 49. @alblue©2020 Alex Blewitt perf • Linux perf (compiled from linux/tools/perf, or from linux-tools/linux-perf) • Running in Docker requires compilation from source • Commands available • record – record execution performance for process/pid • report – generate a report from prior recording • annotate – annotate a report from a prior recording • stat – record performance counters for process/pid https://perf.wiki.kernel.org https://github.com/alblue/scripts/blob/master/perf-Dockerfile
  • 50. @alblue©2020 Alex Blewitt perf record • Perf record will sample the process(es) and generate stack traces • Events may be skewed from their location • Improve accuracy with :p, :pp or :ppp suffix to event • Can capture branches, last branch records or use processor tracing • perf record -b program • perf record --call-graph lbr -j any_call,any_ret program • perf record -e intel_pt//u program https://lwn.net/Articles/680985/ https://lwn.net/Articles/680996/
  • 51. @alblue©2020 Alex Blewitt perf stat $ perf stat base64 <(echo hello) d29ybGQK Performance counter stats for 'base64 /dev/fd/63': 0.341382 task-clock (msec) # 0.649 CPUs utilized 0 context-switches # 0.000 K/sec 0 cpu-migrations # 0.000 K/sec 65 page-faults # 0.190 M/sec 1,218,176 cycles # 3.568 GHz 811,468 stalled-cycles-frontend # 66.61% frontend cycles idle 855,999 instructions # 0.70 insn per cycle # 0.95 stalled cycles per insn 169,032 branches # 495.140 M/sec 8,883 branch-misses # 5.26% of all branches 0.000526160 seconds time elapsed https://perf.wiki.kernel.org IPC > 4 * < 1 +
  • 52. @alblue©2020 Alex Blewitt Performance counters • Intel cores have a few dedicated and programmable counters • Instruction cycles, branches, branch misses … • Counters can be multiplexed (read X for 1µs, read Y for 1µs) • Programmable counters can be set to specific measurements • iTLB-load-misses, LLC-load-misses, uops_dispatched_port.port_5 ... • Undocumented performance counters can be specified with events • cpu/event=0x3c,umask=0x0,any=1/
  • 53. @alblue©2020 Alex Blewitt 19 Locating Issues Have Precise events for sampling Precise events added in Skylake Top-down Microarchitecture Analysis https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture Ahmed Yasin
  • 54. @alblue©2020 Alex Blewitt Top-down Analysis Method USING PERFORMANCE MONITORING EVENTS Additionally, the metric uses the UOPS_ISSUED.ANY, which is common in recent Intel microarchitec- tures, as the denominator. The UOPS_ISSUED.ANY event counts the total number of Uops that the RAT issues to RS. The VectorMixRate metric gives the percentage of injected blend uops out of all uops issued. Usually a VectorMixRate over 5% is worth investigating. VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY Note the actual penalty may vary as it stems from the additional data-dependency on the destination register the injected blend operations add. B.2 PERFORMANCE MONITORING AND MICROARCHITECTURE This section provides information of performance monitoring hardware and terminology related to the Silvermont, Airmont and Goldmont microarchitectures. The features described here may be specific to individual microarchitecture, as indicated in Table B-1. Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture USING PERFORMANCE MONITORING EVENTS The single entry point of division at a pipeline’s issue-stage (allocation-stage) makes the four categories additive to the total possible slots. The classification at slots granularity (sub-cycle) makes the break- down very accurate and robust for superscalar cores, which is a necessity at the top-level. Figure B-2. TMAM’s Top Level Drill Down Flowchart https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual Ahmed Yasin
  • 55. @alblue©2020 Alex Blewitt perf stat --topdown $ perf stat -a --topdown sleep 1 nmi_watchdog enabled with topdown. May give wrong results. Disable with echo 0 > /proc/sys/kernel/nmi_watchdog Performance counter stats for 'system wide': retiring bad speculation frontend bound backend bound S0-C0 2 15.3% 2.8% 32.1% 49.9% S0-C1 2 23.3% 4.0% 27.3% 45.4% S0-C2 2 15.2% 2.9% 29.8% 52.1% S0-C3 2 16.7% 0.0% 31.8% 51.5% S0-C4 2 35.7% 10.7% 26.2% 27.4% S0-C5 2 14.9% 2.5% 34.1% 48.5% 1.000889285 seconds time elapsed
  • 56. @alblue©2020 Alex Blewitt Toplev PMU tools • Andi Kleen has written toplev.py which allows top-down analysis • Initial download caches processor information from download.01.org • Uses perf to record stats, but with custom event filters • If workload is repeatable, can use --no-multiplex to repeat results • Run with -l1, see if issues are present, run with -l2 ... https://github.com/andikleen/pmu-tools/wiki/toplev-manual
  • 57. @alblue©2020 Alex Blewitt toplev.py --single-thread $ dd if=/dev/urandom of=/tmp/rand bs=4096 count=4096 $ ./toplev.py --single-thread --no-multiplex -l1 -- base64 /tmp/rand > /dev/null # 3.6-full on Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz BE Backend_Bound % Slots 24.07 <== 
 $ ./toplev.py --single-thread --no-multiplex -l2 -- base64 /tmp/rand > /dev/null BE Backend_Bound % Slots 23.82 BE/Core Backend_Bound.Core_Bound % Slots 16.08 <== 
 $ ./toplev.py --single-thread --no-multiplex -l3 -- base64 /tmp/rand > /dev/null BE Backend_Bound % Slots 23.96 BE/Core Backend_Bound.Core_Bound % Slots 16.35 BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 24.51 <==
  • 58. @alblue©2020 Alex Blewitt Cache line Instruction layout Before After Error Is Error? Before After Error Is Error? __builtin_expect(error,0)__builtin_expect(error,1)
  • 59. @alblue©2020 Alex Blewitt Cache line Cache line Loop stream detector Good Loop Bad Loop 32 bit aliognment Align with 
 -mllvm -align-all-nofallthru-blocks=5
 -mllvm -align-all-functions=5
  • 60. @alblue©2020 Alex Blewitt Facebook BOLT https://arxiv.org/abs/1807.06735 Figure 9: Heat maps for instruction memory accesses of the HHVM binary, without and with BOLT. Heat is a log scale. Executed instructions are distributed across icache space After sorting basic blocks guided by profiling data, the icache space is defragmented https://github.com/facebookincubator/BOLT
  • 61. @alblue©2020 Alex Blewitt SIMD JSON parser https://github.com/lemire/simdjson https://arxiv.org/abs/1902.08318
  • 62. @alblue©2020 Alex Blewitt Summary: Memory • Use cacheline-aligned or cacheline-aware data structures • Compress data in memory and decompress on the fly • Avoid random memory access when possible • Configure huge pages and use madvise & defer • Partition memory with libnuma for data locality
  • 63. @alblue©2020 Alex Blewitt Summary: CPU • Each CPU is its own networked mesh cluster • Branch speculation and memory/TLB misses are costly • Use branch free and lock free algorithms when possible • Analyse perf counters with top down architectural analysis • Use (auto)vectorisation and use XMM/YMM/ZMM when sensible
  • 64. @alblue©2020 Alex Blewitt References https://alblue.bandlem.com/ https://arxiv.org/abs/1807.06735 → https://github.com/facebookincubator/BOLT https://arxiv.org/abs/1902.08318 → https://github.com/lemire/simdjson/ https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://lwn.net/Articles/680985/ && https://lwn.net/Articles/680996/ https://perf.wiki.kernel.org https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track- A_1000.pdf https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual https://www.researchgate.net/publication/269302126_A_Top- Down_method_for_performance_analysis_and_counters_architecture
  • 66. @alblue©2020 Alex Blewitt Thank you 🗒 https://alblue.bandlem.com 🦉 https://twitter.com/alblue 🐙 https://github.com/alblue 📺 https://vimeo.com/alblue 📇 https://speakerdeck.com/alblue
  • 67. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ microarchitecture-modern-cpu/