perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On Intel, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random instruction
sampling with the thresholding feature. On AMD, the tool uses the IBS op PMU
(due to hardware limitations, perf c2c is not supported on Zen3 CPUs).

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention - the highest number of HITM
accesses.

The basic workflow with this tool follows the standard record/report phases:
the record command is used to record the event data and the report command to
display it.
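
For illustration, a minimal record/report session might look like the
following (the workload name './my_workload' is hypothetical; substitute any
command of interest):

  $ perf c2c record ./my_workload
  $ perf c2c report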

RECORD OPTIONS
--------------
-e::
--event=::
Select the PMU event. Use 'perf c2c record -e list'
to list available events.

-v::
--verbose::
Be more verbose (show counter open errors, etc).

-l::
--ldlat::
Configure mem-loads latency. Supported on Intel and Arm64 processors
only. Ignored on other archs.
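
As a sketch (the workload name is hypothetical), the latency threshold can be
raised to focus on slower loads:

  $ perf c2c record --ldlat 50 ./my_workload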

-k::
--all-kernel::
Configure all used events to run in kernel space.

-u::
--all-user::
Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
vmlinux pathname

-v::
--verbose::
Be more verbose (show counter open errors, etc).

-i::
--input::
Specify the input file to process.

-N::
--node-info::
Show extra node info in report (see NODE INFO section).

-c::
--coalesce::
Specify sorting fields for single cacheline display.
The following fields are available: tid,pid,iaddr,dso
(see COALESCE).

-g::
--call-graph::
Set up callchain parameters.
Please refer to the perf-report man page for details.

--stdio::
Force the stdio output (see STDIO OUTPUT).

--stats::
Display only statistic tables and force stdio mode.

--full-symbols::
Display the full length of symbols.

--no-source::
Do not display the Source:Line column.

--show-all::
Show all captured HITM lines, with no regard to the HITM % 0.0005 limit.

-f::
--force::
Don't do ownership validation.

-d::
--display::
Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
and sort on. Total HITMs (tot) is the default, except on Arm64, which uses
peer mode as the default.
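
For example, to sort on local HITMs only, or to use the peer snooping view,
the display mode can be selected explicitly (illustrative invocations):

  $ perf c2c report -d lcl --stdio
  $ perf c2c report -d peer --stdio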

--stitch-lbr::
Show callgraph with stitched LBRs, which may produce a more complete
callgraph. The perf.data file must have been obtained using
perf c2c record --call-graph lbr.
Disabled by default. In common cases with call stack overflows,
it can recreate better call stacks than the default lbr call stack
output. But this approach is not foolproof. There can be cases
where it creates incorrect call stacks from incorrect matches.
The known limitations include exception handling such as
setjmp/longjmp, whose calls/returns will not match.
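
A sketch of a matching record/report pair (the workload name is hypothetical;
LBR call graphs require hardware support):

  $ perf c2c record --call-graph lbr ./my_workload
  $ perf c2c report --stitch-lbr --stdio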

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default:
(check the perf record man page for details)

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

Users can pass any 'perf record' option behind the '--' mark, like (to enable
callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
----------
The perf c2c report command displays the shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote loads from peer cache or DRAM

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A - store accesses where the memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses; includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses; includes remote hits and remote HITMs;
    on Arm Neoverse cores, RmtHit is used to account for remote accesses,
    including remote DRAM or any upward cache level in the remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for given offset within cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1 and N/A (not available) memory
    level for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the process responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for given accesses - Remote/Local peer load and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores} (Display with HITM types)
      Node{cpus %peers %stores} (Display with peer type)
  - node IDs with a list of affected CPUs in the following format:
      Node{cpu list}

Users can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.
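
For example, the extra per-node statistics can be requested directly from the
command line (illustrative invocation):

  $ perf c2c report -N --stdio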

COALESCE
--------
Users can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
output fields set for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address; the following fields are displayed:
            Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.
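
For instance, to break down each cacheline's offsets per thread and code
address (an illustrative invocation):

  $ perf c2c report -c tid,iaddr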

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:

  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline
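
To get only the statistic tables, or the full report on standard output, the
following invocations can be used (both options are described in REPORT
OPTIONS):

  $ perf c2c report --stats
  $ perf c2c report --stdio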

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:

  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]