  1. perf-arm-spe(1)
  2. ================
  3. NAME
  4. ----
  5. perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools
  6. SYNOPSIS
  7. --------
  8. [verse]
  9. 'perf record' -e arm_spe//
  10. DESCRIPTION
  11. -----------
  12. The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
  13. events down to individual instructions. Rather than being interrupt-driven, it picks an
  14. instruction to sample and then captures data for it during execution. Data includes execution time
  15. in cycles. For loads and stores it also includes data address, cache miss events, and data origin.
  16. The sampling has 5 stages:
  17. 1. Choose an operation
  18. 2. Collect data about the operation
  19. 3. Optionally discard the record based on a filter
  20. 4. Write the record to memory
  21. 5. Interrupt when the buffer is full
  22. Choose an operation
  23. ~~~~~~~~~~~~~~~~~~~
  24. The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice of all
  25. architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
  26. architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
  27. sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
  28. perturbation is also added to the sampling interval by default.
  29. Collect data about the operation
  30. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  31. Program counter, PMU events, timings and data addresses related to the operation are recorded.
  32. Sampling ensures that only one sampled operation is in flight at a time.
  33. Optionally discard the record based on a filter
  34. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  35. Based on programmable criteria, choose whether to keep the record or discard it. If the record is
  36. discarded then the flow stops here for this sample.
  37. Write the record to memory
  38. ~~~~~~~~~~~~~~~~~~~~~~~~~~
  39. The record is appended to a memory buffer.
  40. Interrupt when the buffer is full
  41. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  42. When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
  43. Perf saves the raw data in the perf.data file.
  44. Opening the file
  45. ----------------
  46. Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
  47. recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
  48. the data, Perf generates "synthetic samples" as if these were generated at the time of the
  49. recording. These samples are the same as if normal sampling was done by Perf without using SPE,
  50. although they may have more attributes associated with them. For example a normal sample may have
  51. just the instruction pointer, but an SPE sample can have data addresses and latency attributes.
  52. Why Sampling?
  53. -------------
  54. - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
  55. hardware. Only one sampled operation is in flight at a time.
  56. - Allows precise attribution data, including the full PC of the instruction and the data virtual and
  57. physical addresses.
  58. - Allows correlation between an instruction and events, such as TLB and cache misses. (Data source
  59. indicates which particular cache was hit, but the meaning is implementation defined because
  60. different implementations can have different cache configurations.)
  61. However, SPE does not provide any call-graph information, and relies on statistical methods.
  62. Collisions
  63. ----------
  64. When an operation is sampled while a previous sampled operation has not finished, a collision
  65. occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
  66. should be set to avoid collisions.
  67. The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
  68. count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
  69. dropped samples that would have made it through the filter. It can still serve as a rough
  70. guide.
  71. The effect of microarchitectural sampling
  72. -----------------------------------------
  73. If an implementation samples micro-operations instead of instructions, the results of sampling must
  74. be weighted accordingly.
  75. For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
  76. becomes twice as likely to appear in the sample population.
  77. The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
  78. estimated from the 'sample_pop' and 'inst_retired' PMU events.
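
The weighting described above can be sketched with a little arithmetic. This is only an illustration: the helper names `spe_pop_ratio` and `spe_weighted_samples` are hypothetical, the counts are made up, and dividing a sample count by the number of micro-operations is an assumed first-order correction, not something SPE or Perf does for you.

```shell
# Hypothetical helper: ratio of operations in the sample population to
# retired instructions. A value well above 1 suggests that micro-op
# conversion or speculation is inflating the population.
spe_pop_ratio() {
    # $1: sample_pop count, $2: inst_retired count
    awk -v pop="$1" -v ret="$2" \
        'BEGIN { if (ret == 0) exit 1; printf "%.2f\n", pop / ret }'
}

# Hypothetical helper: scale down the sample count of an instruction that
# is always split into a known number of micro-operations.
spe_weighted_samples() {
    # $1: raw sample count, $2: micro-operations per instruction
    awk -v n="$1" -v uops="$2" \
        'BEGIN { if (uops == 0) exit 1; printf "%.1f\n", n / uops }'
}

spe_pop_ratio 3000000 2000000     # made-up counter values
spe_weighted_samples 120 2        # instruction A from the example above
```

With these made-up inputs the first call prints 1.50 and the second prints 60.0.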
  79. Kernel Requirements
  80. -------------------
  81. The ARM_SPE_PMU config option must be enabled, built either as a module or statically.
  82. Depending on CPU model, the kernel may need to be booted with page table isolation disabled
  83. (kpti=off). If KPTI needs to be disabled but is not, SPE setup will fail with a console message "profiling buffer
  84. inaccessible. Try passing 'kpti=off' on the kernel command line".
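
One way to check the first requirement is to look for the option in the kernel config file. The sketch below assumes the config is available as a plain-text file (often `/boot/config-$(uname -r)`, but the location varies by distribution); the helper name `spe_config_state` is hypothetical.

```shell
# Hypothetical helper: report how CONFIG_ARM_SPE_PMU is set in a given
# kernel config file. Prints "built-in", "module" or "not-enabled".
spe_config_state() {
    # $1: path to a plain-text kernel config file
    case $(grep -E '^CONFIG_ARM_SPE_PMU=' "$1" 2>/dev/null) in
        *=y) echo built-in ;;
        *=m) echo module ;;
        *)   echo not-enabled ;;   # unset, or file missing/unreadable
    esac
}

# Assumed location; adjust for your distribution.
spe_config_state "/boot/config-$(uname -r)"
```

A result of "module" means the driver still has to be loaded before the arm_spe event becomes available.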
  85. Capturing SPE with perf command-line tools
  86. ------------------------------------------
  87. You can record a session with SPE samples:
  88. perf record -e arm_spe// -- ./mybench
  89. The sample period is set from the -c option, and because the minimum interval is used by default
  90. it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
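
As a rough illustration of raising the interval with -c, the sketch below only prints the command line rather than running it, so it can be inspected first. The helper name `spe_record_cmd` is hypothetical and the interval of 10000 is an arbitrary example value, not a recommendation.

```shell
# Hypothetical helper: print (do not run) a perf record command that
# samples with SPE at the given interval.
spe_record_cmd() {
    # $1: sampling interval passed to -c, remaining args: the workload
    interval=$1
    shift
    printf 'perf record -e arm_spe// -c %s -- %s\n' "$interval" "$*"
}

spe_record_cmd 10000 ./mybench
```

This prints `perf record -e arm_spe// -c 10000 -- ./mybench`, which can then be run on a machine with SPE support.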
  91. Config parameters
  92. ~~~~~~~~~~~~~~~~~
  93. These are placed between the // in the event and comma separated. For example '-e
  94. arm_spe/load_filter=1,min_latency=10/'
  95. branch_filter=1 - collect branches only (PMSFCR.B)
  96. event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  97. jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND)
  98. load_filter=1 - collect loads only (PMSFCR.LD)
  99. min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR)
  100. pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  101. pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  102. store_filter=1 - collect stores only (PMSFCR.ST)
  103. ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS)
  104. +++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
  105. than only the execution latency.
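
To show how the parameters combine, here is a minimal sketch that joins key=value pairs with commas and places them between the slashes. The helper name `spe_event` is hypothetical; it does no validation of the parameter names or values.

```shell
# Hypothetical helper: build an arm_spe event string from key=value
# arguments. The function body runs in a subshell so that changing IFS
# does not leak into the caller.
spe_event() (
    IFS=,
    # "$*" joins the arguments using the first character of IFS (a comma)
    printf 'arm_spe/%s/\n' "$*"
)

spe_event load_filter=1 min_latency=10
spe_event
```

The first call prints `arm_spe/load_filter=1,min_latency=10/` and the second, with no parameters, prints `arm_spe//`. The result can be passed to 'perf record' with -e.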
  106. Only some events can be filtered on; these include:
  107. bit 1 - instruction retired (i.e. omit speculative instructions)
  108. bit 3 - L1D refill
  109. bit 5 - TLB refill
  110. bit 7 - mispredict
  111. bit 11 - misaligned access
  112. So to sample just retired instructions:
  113. perf record -e arm_spe/event_filter=2/ -- ./mybench
  114. or just mispredicted branches:
  115. perf record -e arm_spe/event_filter=0x80/ -- ./mybench
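
The mask values in these examples follow directly from the bit numbers listed above. As a small sketch, the hypothetical helper `spe_event_mask` below turns a list of bit numbers into the hexadecimal value for event_filter.

```shell
# Hypothetical helper: convert event bit numbers into an event_filter mask.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))   # set the bit for this event
    done
    printf '0x%x\n' "$mask"
}

spe_event_mask 1      # instruction retired
spe_event_mask 7      # mispredict
spe_event_mask 3 5    # L1D refill and TLB refill bits together
```

These calls print 0x2, 0x80 and 0x28. The first two match the event_filter values used in the commands above (2 and 0x80).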
  116. Viewing the data
  117. ~~~~~~~~~~~~~~~~~
  118. By default perf report and perf script will assign samples to separate groups depending on the
  119. attributes/events of the SPE record. Because instructions can have multiple events associated with
  120. them, the samples in these groups are not necessarily unique. For example perf report shows these
  121. groups:
  122. Available samples
  123. 0 arm_spe//
  124. 0 dummy:u
  125. 21 l1d-miss
  126. 897 l1d-access
  127. 5 llc-miss
  128. 7 llc-access
  129. 2 tlb-miss
  130. 1K tlb-access
  131. 36 branch-miss
  132. 0 remote-access
  133. 900 memory
  134. The arm_spe// and dummy:u events are implementation details and are expected to be empty.
  135. To get a full list of unique samples that are not sorted into groups, set the itrace option to
  136. generate 'instruction' samples. The period option is also taken into account, so set it to 1
  137. instruction unless you want to further downsample the already sampled SPE data:
  138. perf report --itrace=i1i
  139. Memory access details are also stored on the samples and this can be viewed with:
  140. perf report --mem-mode
  141. Common errors
  142. ~~~~~~~~~~~~~
  143. - "Cannot find PMU `arm_spe'. Missing kernel support?"
  144. The module is not built or loaded, KPTI is not disabled (see above), or perf is running in a VM.
  145. - "Arm SPE CONTEXT packets not found in the traces."
  146. Root privilege is required to collect context packets, but these only increase the accuracy of
  147. assigning PIDs to kernel samples. For userspace sampling this message can be ignored.
  148. - Excessively large perf.data file size
  149. Increase the sampling interval (see above).
  150. SEE ALSO
  151. --------
  152. linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
  153. linkperf:perf-inject[1]