.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. Yet, the `mem_map` array is
not usable until the call to :c:func:`memblock_free_all` that hands all
the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
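
The FLATMEM conversion can be modelled in user space with a few lines of
C. This is a minimal sketch, not the kernel implementation: the one-field
`struct page`, the `ARCH_PFN_OFFSET` value (RAM starting at 2 GiB with
4 KiB pages) and the 1024-entry `mem_map` are all assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real structure is much larger. */
struct page {
	unsigned long flags;
};

/* Hypothetical layout: RAM starts at 2 GiB, pages are 4 KiB, 1024 frames. */
#define ARCH_PFN_OFFSET	0x80000UL
#define NR_PAGES	1024UL

/* One entry per page frame, indexed by (PFN - ARCH_PFN_OFFSET). */
static struct page mem_map[NR_PAGES];

static struct page *pfn_to_page(unsigned long pfn)
{
	/* Only PFNs covered by mem_map can be converted. */
	assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < NR_PAGES);
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

static unsigned long page_to_pfn(const struct page *page)
{
	/* The array index plus the offset of the first frame gives the PFN. */
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

In this model, `pfn_to_page(ARCH_PFN_OFFSET)` returns `&mem_map[0]`, and
the two helpers are inverses of each other for every PFN the array
covers.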

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map`, which is, logically, a pointer to an
array of struct pages. However, it is stored with some additional
encoded information that aids the management of sections. The section
size and maximal number of sections are specified using the
`SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS` constants defined by each
architecture that supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the
actual width of a physical address that an architecture supports,
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
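
As a worked example of the formula, the sketch below plugs in one
plausible set of values. The numbers (4 KiB pages, 128 MiB sections, a
46-bit physical address space) are assumptions chosen for illustration,
and the `SECTIONS_SHIFT`, `PFN_SECTION_SHIFT` and `PAGES_PER_SECTION`
helper macros are introduced here only to make the arithmetic explicit.

```c
#include <assert.h>

/* Illustrative values, assumed for this example only. */
#define PAGE_SHIFT		12	/* 4 KiB pages */
#define SECTION_SIZE_BITS	27	/* 128 MiB per section */
#define MAX_PHYSMEM_BITS	46	/* 64 TiB of physical address space */

static_assert(MAX_PHYSMEM_BITS > SECTION_SIZE_BITS,
	      "a section must be smaller than the physical address space");

/* Number of address bits that select a section. */
#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)

/* Number of low PFN bits that select a page inside a section. */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
```

With these values there are 2^19 = 524288 possible sections, each
covering 2^15 = 32768 page frames.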

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
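
The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` layout
can be sketched as follows. This is an illustration under stated
assumptions: the two-word `struct mem_section`, the 4 KiB `PAGE_SIZE`,
the `NR_MEM_SECTIONS` value and the helper names `section_nr_to_row()`
and `section_nr_to_col()` are made up for the example, and only the
general scheme (rows of PAGE_SIZE worth of sections) comes from the text
above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct mem_section; the real layout is config dependent. */
struct mem_section {
	unsigned long section_mem_map;
	unsigned long padding;
};

/* Assumed values for the example. */
#define PAGE_SIZE		4096UL
#define NR_MEM_SECTIONS		(1UL << 19)

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT	(PAGE_SIZE / sizeof(struct mem_section))

/* Enough rows to fit every possible section, rounding up. */
#define NR_SECTION_ROOTS \
	((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Row of the two-dimensional array that holds section number nr. */
static unsigned long section_nr_to_row(unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS);
	return nr / SECTIONS_PER_ROOT;
}

/* Position of section number nr inside its row. */
static unsigned long section_nr_to_col(unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS);
	return nr % SECTIONS_PER_ROOT;
}
```

With `CONFIG_SPARSEMEM_EXTREME` disabled, the same arithmetic applies
with one section per row, so the row index equals the section number and
the column is always 0.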

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index into the array of pages.
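
The classic sparse lookup can be modelled in user space as below. This is
a simplified sketch, not the kernel code: sections are shrunk to 16 pages
(`PFN_SECTION_SHIFT` of 4), the whole `flags` word is used to hold the
section number, `section_mem_map` is a plain pointer without any extra
encoding, and the page inside a section is selected with the low PFN bits.
The `populate_section()` helper and the `demo_pages` array are
hypothetical and exist only to set up the example.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;	/* holds only the section number here */
};

struct mem_section {
	struct page *section_mem_map;	/* NULL while the section is absent */
};

/* Tiny, assumed geometry: 8 sections of 16 pages each. */
#define PFN_SECTION_SHIFT	4
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS		8UL

static struct mem_section sections[NR_MEM_SECTIONS];
static struct page demo_pages[PAGES_PER_SECTION];

/* The high bits of a PFN select the section. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* Attach a page array to a section and record the section in each page. */
static void populate_section(unsigned long nr, struct page *pages)
{
	unsigned long i;

	assert(nr < NR_MEM_SECTIONS);
	sections[nr].section_mem_map = pages;
	for (i = 0; i < PAGES_PER_SECTION; i++)
		pages[i].flags = nr;
}

static struct page *pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	assert(nr < NR_MEM_SECTIONS);
	if (!sections[nr].section_mem_map)
		return NULL;	/* no memory map for this section */
	/* The low bits of the PFN index the section's page array. */
	return sections[nr].section_mem_map + (pfn & (PAGES_PER_SECTION - 1));
}

static unsigned long page_to_pfn(const struct page *page)
{
	/* Recover the section from the page, then add the index within it. */
	unsigned long nr = page->flags;

	return (nr << PFN_SECTION_SHIFT) +
	       (unsigned long)(page - sections[nr].section_mem_map);
}
```

After `populate_section(3, demo_pages)`, PFN 53 (section 3, page 5)
resolves to `&demo_pages[5]`, while a PFN in an unpopulated section
yields `NULL`.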

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index into that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
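
In the sparse vmemmap case both conversions reduce to pointer arithmetic
on `vmemmap`. The sketch below models that in user space, assuming a
small, fully populated array in place of the kernel's virtually mapped
range; `vmemmap_area` and `MAX_PFN` are hypothetical names used only for
the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed: 256 page frames, all of them backed by a struct page. */
#define MAX_PFN	256UL

static struct page vmemmap_area[MAX_PFN];

/* Base of the virtually contiguous memory map; entry 0 describes PFN 0. */
static struct page *const vmemmap = vmemmap_area;

static struct page *pfn_to_page(unsigned long pfn)
{
	assert(pfn < MAX_PFN);
	return vmemmap + pfn;
}

static unsigned long page_to_pfn(const struct page *page)
{
	/* The distance from the base, in struct page units, is the PFN. */
	return (unsigned long)(page - vmemmap);
}
```

Unlike this model, the kernel populates the virtual range only where
physical memory exists, so the array is contiguous in virtual address
space without being fully backed.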

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, its memory ranges are
never exposed through the sysfs memory hotplug API on memory block
boundaries. The implementation relies on this lack of user-API
constraint to allow sub-section sized memory ranges to be specified to
:c:func:`arch_add_memory`, the top-half of memory hotplug. Sub-section
support allows for 2MB as the cross-arch common alignment granularity
for :c:func:`devm_memremap_pages`.
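
The 2MB granularity amounts to a simple alignment rule on the physical
range. The sketch below shows such a check in isolation. It assumes a
`SUBSECTION_SHIFT` of 21 (2MB) and uses a hypothetical helper,
`range_is_subsection_aligned()`, that is not part of any kernel API; it
only illustrates the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sub-section size: 2^21 bytes = 2MB. */
#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(UINT64_C(1) << SUBSECTION_SHIFT)
#define SUBSECTION_MASK		(SUBSECTION_SIZE - 1)

/*
 * Hypothetical helper: true when a physical range starts on a 2MB
 * boundary and covers a non-zero, whole number of 2MB sub-sections.
 */
static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & SUBSECTION_MASK) == 0 &&
	       (size & SUBSECTION_MASK) == 0;
}
```

For example, a 6MB range starting at 4GB passes the check, while the
same range shifted by 4KB does not.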

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/PCIe topology to coordinate direct-DMA operations between
  themselves, i.e. bypass host memory.