===========================================================
Clock sources, Clock events, sched_clock() and delay timers
===========================================================

This document tries to briefly explain some basic kernel timekeeping
abstractions. It partly pertains to the drivers usually found in
drivers/clocksource in the kernel tree, but the code may be spread out
across the kernel.

If you grep through the kernel source you will find a number of
architecture-specific implementations of clock sources and clock events,
several likewise architecture-specific overrides of the sched_clock()
function, and some delay timers.

To provide timekeeping for your platform, the clock source provides
the basic timeline, whereas clock events fire interrupts at certain points
on this timeline, providing facilities such as high-resolution timers.
sched_clock() is used for scheduling and timestamping, and delay timers
provide an accurate delay source using hardware counters.

Clock sources
-------------

The purpose of the clock source is to provide a timeline for the system that
tells you where you are in time. For example, issuing the command 'date' on
a Linux system will eventually read the clock source to determine exactly
what time it is.

Typically the clock source is a monotonic, atomic counter which provides
n bits that count from 0 to (2^n)-1 and then wrap around to 0 and start over.
It will ideally NEVER stop ticking as long as the system is running. It
may stop during system suspend.

The clock source shall have as high a resolution as possible, and the
frequency shall be as stable and correct as possible compared to a real-world
wall clock. It should not move unpredictably back and forth in time or miss a
few cycles here and there.

It must be immune to the kind of effects that occur in hardware where, for
example, the counter register is read in two phases on the bus: the lowest
16 bits first and the higher 16 bits in a second bus cycle. The counter bits
may be updated in between the two reads, leading to the risk of very strange
values from the counter.

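One common way to guard against such a torn read is to read the high half,
then the low half, then the high half again, and retry if the high half
changed. The following is a minimal userspace sketch of that idea, not code
from the kernel tree: the accessor type ``read16_fn``, the function
``read_counter32()`` and the simulated registers are all hypothetical names
made up for illustration.

```c
#include <stdint.h>

/* Hypothetical accessor type; on real hardware these would be MMIO reads. */
typedef uint16_t (*read16_fn)(void);

/*
 * Read a 32-bit counter exposed as two 16-bit halves.
 * Read high, low, high again; retry if the high half moved in between.
 */
static uint32_t read_counter32(read16_fn read_hi, read16_fn read_lo)
{
    uint16_t hi, lo, hi2;

    do {
        hi = read_hi();
        lo = read_lo();
        hi2 = read_hi();
    } while (hi != hi2);

    return ((uint32_t)hi << 16) | lo;
}

/* Simulated counter for experimenting: it ticks once after each low read. */
static uint32_t sim_counter;

static uint16_t sim_read_hi(void)
{
    return (uint16_t)(sim_counter >> 16);
}

static uint16_t sim_read_lo(void)
{
    uint16_t v = (uint16_t)(sim_counter & 0xFFFF);

    sim_counter++;
    return v;
}
```

With the simulated counter starting at 0x0001FFFF, the first pass sees the
high half change from 0x0001 to 0x0002 and is discarded, so the function
returns a consistent value rather than a mix of old and new halves.
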
When the wall-clock accuracy of the clock source isn't satisfactory, there
are various quirks and layers in the timekeeping code for e.g. synchronizing
the user-visible time to RTC clocks in the system or against networked time
servers using NTP, but all they do basically is update an offset against
the clock source, which provides the fundamental timeline for the system.
These measures do not affect the clock source per se, they only adapt the
system to its shortcomings.

The clock source struct shall provide means to translate the provided counter
into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
Since this operation may be invoked very often, doing this in a strict
mathematical sense is not desirable: instead the number is taken as close as
possible to a nanosecond value using only the arithmetic operations
multiply and shift, so in clocksource_cyc2ns() you find::

  ns ~= (clocksource * mult) >> shift

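As a rough illustration of the formula above, here is a standalone sketch of
the multiply-and-shift conversion. The function name ``cyc2ns()`` and the
example values are chosen for this sketch; it only mirrors the arithmetic
and is not a copy of the kernel helper.

```c
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds without a division:
 *   ns ~= (cycles * mult) >> shift
 * mult is the nanoseconds-per-cycle ratio scaled up by 2^shift.
 */
static inline uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}
```

For a 1 MHz counter each cycle is 1000 ns, so with shift = 10 a suitable
mult is 1000 << 10 = 1024000, and cyc2ns(5, 1024000, 10) yields 5000. Note
that the intermediate product can overflow 64 bits for large cycle counts,
which is one reason the shift and mult values have to be picked with care.
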
You will find a number of helper functions in the clock source code intended
to aid in providing these mult and shift values, such as
clocksource_khz2mult() and clocksource_hz2mult(), which help determine the
mult factor from a fixed shift, and clocksource_register_hz() and
clocksource_register_khz(), which help assign both shift and mult
factors using the frequency of the clock source as the only input.

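To give a feel for what such a helper has to compute, the sketch below
derives a mult factor from a frequency in Hz and a fixed shift, rounding to
the nearest integer. The name ``hz_to_mult()`` is made up for this example,
and the code is a userspace approximation of the idea rather than the
kernel's own implementation.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * mult = (NSEC_PER_SEC << shift) / hz, rounded to nearest.
 * The caller must pick a shift small enough that the result fits in
 * 32 bits and the left shift does not overflow 64 bits.
 */
static uint32_t hz_to_mult(uint32_t hz, uint32_t shift)
{
    uint64_t tmp = NSEC_PER_SEC << shift;

    tmp += hz / 2;  /* round to nearest instead of truncating */
    return (uint32_t)(tmp / hz);
}
```

For instance, a 24 MHz counter with shift = 24 gives mult = 699050667, and a
1 MHz counter with shift = 10 gives mult = 1024000.
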
For really simple clock sources accessed from a single I/O memory location
there is nowadays even clocksource_mmio_init() which will take a memory
location, bit width, a parameter telling whether the counter in the
register counts up or down, and the timer clock rate, and then conjure up all
necessary parameters.

Since a 32-bit counter at, say, 100 MHz will wrap around to zero after some 43
seconds, the code handling the clock source will have to compensate for this.
That is the reason why the clock source struct also contains a 'mask'
member telling how many bits of the source are valid. This way the timekeeping
code knows when the counter will wrap around and can insert the necessary
compensation code on both sides of the wrap point so that the system timeline
remains monotonic.

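The usual trick behind the mask is that an unsigned subtraction followed by a
bitwise AND gives the right cycle delta even across a wrap, provided the
counter is sampled at least once per wrap period. A minimal sketch, with the
macro ``COUNTER_MASK()`` and the function ``cycles_delta()`` being
illustrative names invented here:

```c
#include <stdint.h>

/* Mask with the lowest 'bits' bits set, e.g. 0xFFFFFFFF for 32 bits. */
#define COUNTER_MASK(bits) \
    ((bits) < 64 ? ((1ULL << (bits)) - 1) : ~0ULL)

/*
 * Cycles elapsed between two raw counter reads. The subtraction wraps
 * modulo 2^64; masking reduces it modulo the counter width.
 */
static inline uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}
```

For a 32-bit counter read as 0xFFFFFFF0 and then, after wrapping, as
0x00000010, the delta comes out as 0x20 cycles rather than a huge bogus
value.
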
Clock events
------------

Clock events are the conceptual reverse of clock sources: they take a
desired time specification value and calculate the values to poke into
hardware timer registers.

Clock events are orthogonal to clock sources. The same hardware
and register range may be used for the clock event, but it is essentially
a different thing. The hardware driving clock events has to be able to
fire interrupts, so as to trigger events on the system timeline. On an SMP
system, it is ideal (and customary) to have one such event-driving timer per
CPU core, so that each core can trigger events independently of any other
core.

You will notice that the clock event device code is based on the same basic
idea about translating counters to nanoseconds using mult and shift
arithmetic, and you find the same family of helper functions again for
assigning these values. The clock event driver does not need a 'mask'
attribute however: the system will not try to plan events beyond the time
horizon of the clock event.

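For clock events the conversion runs in the opposite direction: a delta in
nanoseconds is turned into a number of timer cycles to program. The sketch
below assumes a hypothetical ``struct evt_dev`` with a minimum and maximum
programmable delta; the struct, its field names and ``ns_to_timer_cycles()``
are illustration only and not the kernel's data structures.

```c
#include <stdint.h>

/* Hypothetical event device description, for illustration only. */
struct evt_dev {
    uint32_t mult;          /* cycles per nanosecond, scaled by 2^shift */
    uint32_t shift;
    uint64_t min_delta_ns;  /* shortest interval the timer can program */
    uint64_t max_delta_ns;  /* time horizon of the timer */
};

/* Clamp the request to what the timer supports, then convert to cycles. */
static uint64_t ns_to_timer_cycles(const struct evt_dev *dev, uint64_t delta_ns)
{
    if (delta_ns < dev->min_delta_ns)
        delta_ns = dev->min_delta_ns;
    if (delta_ns > dev->max_delta_ns)
        delta_ns = dev->max_delta_ns;

    return (delta_ns * dev->mult) >> dev->shift;
}
```

With a 500 MHz timer (0.5 cycles per nanosecond, so mult = 128 with
shift = 8), a request of 1000 ns becomes 500 cycles. Requests beyond
max_delta_ns are clamped, which is why no mask is needed on this side.
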
sched_clock()
-------------

In addition to the clock sources and clock events there is a special weak
function in the kernel called sched_clock(). This function shall return the
number of nanoseconds since the system was started. An architecture may or
may not provide an implementation of sched_clock() on its own. If a local
implementation is not provided, the system jiffy counter will be used as
sched_clock().

As the name suggests, sched_clock() is used for scheduling the system,
determining the absolute timeslice for a certain process in the CFS scheduler
for example. It is also used for printk timestamps when you have chosen to
include time information in printk for things like bootcharts.

Compared to clock sources, sched_clock() has to be very fast: it is called
much more often, especially by the scheduler. If you have to trade off
accuracy against speed, you may sacrifice accuracy for speed in
sched_clock(). It does, however, require some of the same basic
characteristics as the clock source, i.e. it should be monotonic.

The sched_clock() function may wrap only on unsigned long long boundaries,
i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
after circa 585 years. (For most practical systems this means "never".)

If an architecture does not provide its own implementation of this function,
it will fall back to using jiffies, limiting its resolution to one jiffy,
i.e. 1/HZ seconds for the HZ value configured for the architecture. This will
affect scheduling accuracy and will likely show up in system benchmarks.

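The effect of the jiffies fallback on resolution can be seen from the
arithmetic alone. The following sketch assumes the fallback simply scales a
jiffy count by the length of one jiffy in nanoseconds; the function name
``jiffies_to_sched_ns()`` and its parameters are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Nanoseconds since boot derived from a jiffy count. The result can only
 * change in steps of NSEC_PER_SEC / hz nanoseconds.
 */
static uint64_t jiffies_to_sched_ns(uint64_t jiffies_since_boot, unsigned int hz)
{
    return jiffies_since_boot * (NSEC_PER_SEC / hz);
}
```

With HZ = 100 the value advances in steps of 10 ms, and with HZ = 1000 in
steps of 1 ms, which is very coarse compared to a hardware counter running
at several MHz.
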
The clock driving sched_clock() may stop or reset to zero during system
suspend/sleep. This does not matter for its role of scheduling events on the
system. However it may result in interesting timestamps in printk().

The sched_clock() function should be callable in any context, be IRQ- and
NMI-safe, and return a sane value in any context.

Some architectures may have a limited set of time sources and lack a nice
counter to derive a 64-bit nanosecond value, so for example on the ARM
architecture, special helper functions have been created to provide a
sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
same counter that is also used as clock source is used for this purpose.

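One way such a helper can extend a narrow counter into a 64-bit nanosecond
value is to remember an epoch (a cycle value and the nanosecond time at that
point) and add the masked delta since the epoch on every read. The sketch
below shows that idea under the assumption that the epoch is refreshed at
least once per wrap period; ``struct sched_clock_state`` and both functions
are hypothetical names for this example.

```c
#include <stdint.h>

/* Hypothetical state for extending a narrow counter to 64-bit ns. */
struct sched_clock_state {
    uint64_t epoch_ns;   /* nanoseconds at the last epoch update */
    uint64_t epoch_cyc;  /* raw counter value at the last epoch update */
    uint64_t mask;       /* valid bits of the counter */
    uint32_t mult;       /* ns per cycle, scaled by 2^shift */
    uint32_t shift;
};

/* Nanoseconds corresponding to the raw counter value 'cyc'. */
static uint64_t sched_clock_ns(const struct sched_clock_state *s, uint64_t cyc)
{
    uint64_t delta = (cyc - s->epoch_cyc) & s->mask;

    return s->epoch_ns + ((delta * s->mult) >> s->shift);
}

/* Move the epoch forward; must run at least once per counter wrap. */
static void sched_clock_update(struct sched_clock_state *s, uint64_t cyc)
{
    s->epoch_ns = sched_clock_ns(s, cyc);
    s->epoch_cyc = cyc & s->mask;
}
```

For a 16-bit counter at 1 MHz (mult = 1024000, shift = 10), moving the epoch
to 0xFFF0 and then reading 0x0010 after the wrap gives 65552000 ns, the same
as if the counter had simply kept counting past 16 bits.
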
On SMP systems, it is crucial for performance that sched_clock() can be called
independently on each CPU without any synchronization performance hits.
Some hardware (such as the x86 TSC) will cause the sched_clock() function to
drift between the CPUs on the system. The kernel can work around this by
enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
that makes sched_clock() different from the ordinary clock source.

Delay timers (some architectures only)
--------------------------------------

On systems with variable CPU frequency, the various kernel delay() functions
will sometimes behave strangely. Basically these delays usually use a hard
loop to delay a certain number of jiffy fractions using a "lpj" (loops per
jiffy) value, calibrated on boot.

Let's hope that your system is running on maximum frequency when this value
is calibrated: as a consequence, when the frequency is geared down to half the
full frequency, any delay() will be twice as long. Usually this does not
hurt, as you're commonly requesting that amount of delay *or more*. But
basically the semantics are quite unpredictable on such systems.

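The scaling described above is simple proportionality: a loop count
calibrated at one CPU frequency takes longer or shorter when the CPU runs at
another. A small sketch of that arithmetic, using a made-up helper
``loop_delay_actual_us()``:

```c
#include <stdint.h>

/*
 * Actual duration of a loop-based delay that was calibrated at calib_hz
 * but is executed while the CPU runs at current_hz.
 */
static uint64_t loop_delay_actual_us(uint64_t requested_us,
                                     uint64_t calib_hz, uint64_t current_hz)
{
    return requested_us * calib_hz / current_hz;
}
```

A 100 us delay calibrated at 1 GHz lasts 200 us at 500 MHz, which is merely
wasteful. The reverse case, calibrating at 500 MHz and running at 1 GHz,
gives only 50 us, which is shorter than requested and is the situation the
paragraph above hopes you avoid.
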
Enter timer-based delays. Using these, a timer read may be used instead of
a hard-coded loop for providing the desired delay.

This is done by declaring a struct delay_timer and assigning the appropriate
function pointers and rate settings for this delay timer.

This is available on some architectures like OpenRISC or ARM.

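To illustrate the principle of a timer-based delay, here is a userspace model
that busy-waits until a free-running counter has advanced by the number of
cycles matching the requested time. The struct ``delay_timer_model`` only
imitates the "function pointer plus rate" idea mentioned above; it, the
function ``timer_udelay()`` and the simulated counter are assumptions of
this sketch and not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for a delay timer: a read hook plus its rate. */
struct delay_timer_model {
    unsigned long (*read_current_timer)(void);
    unsigned long freq;  /* counter frequency in Hz */
};

/* Spin until the counter has advanced by usecs worth of cycles. */
static void timer_udelay(const struct delay_timer_model *t, unsigned long usecs)
{
    unsigned long start = t->read_current_timer();
    unsigned long cycles =
        (unsigned long)(((uint64_t)usecs * t->freq) / 1000000ULL);

    /* Unsigned subtraction keeps working if the counter wraps. */
    while ((t->read_current_timer() - start) < cycles)
        ;
}

/* Simulated free-running counter: advances by one on every read. */
static unsigned long sim_ticks;

static unsigned long sim_read_timer(void)
{
    return sim_ticks++;
}
```

Because the delay is measured against the counter and not against a loop
count, its length does not depend on the current CPU frequency, as long as
the counter itself runs at a fixed rate.
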