UefiPayloadPkg/UefiPayloadPkg.dsc: Disable 1G pages #159

Closed
wants to merge 1 commit into from

Conversation

miczyg1
Contributor

@miczyg1 miczyg1 commented Aug 5, 2024

The ESXi bootloader does not seem to cope well with 1G pages and page tables. On VP6670 it caused a #PF on framebuffer access. However, the issue was not reproduced on MSI MS-7D25 DDR4 or on QEMU. The ESXi bootloader allocates memory for a copy of the page tables. On its first pass, the page table parser returns the number of tables it found, which determines how many 4K chunks to allocate for the copy. It then goes through all of the page tables again and copies those it considers valid (PA == VA, present bit set). However, the parser calculates the number of page tables incorrectly, so the buffer for the page table copy is underallocated. As a result, the page tables containing the framebuffer (and all subsequent page tables) were not copied by the bootloader, causing a #PF when the bootloader switched to the page table copy it had just made. Apparently it does a slightly better job with 2M pages, so disable 1G pages as a workaround to get ESXi booting.
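
To make the failure mode described above easier to follow, here is a minimal sketch of a count-then-copy walker. It is a toy model, not ESXi code: the table layout, the function names and the way the miscount is simulated (`buggy=True` simply reports two tables too few) are all assumptions for illustration. The real cause of the miscount is not known from this description.

```python
# Toy model of a two-pass (count, then copy) page-table walker.
# ASSUMPTION: every name here is hypothetical; this is not ESXi code.

def build_example():
    """One top-level table -> one PDPT -> four page directories."""
    pds = [{"id": i, "children": []} for i in range(2, 6)]
    pdpt = {"id": 1, "children": pds}
    return {"id": 0, "children": [pdpt]}

def walk(table):
    """Yield every table in depth-first order."""
    yield table
    for child in table["children"]:
        yield from walk(child)

def count_tables(root, buggy=False):
    """First pass: how many 4K chunks the copy needs."""
    total = sum(1 for _ in walk(root))
    # ASSUMPTION: the miscount is simulated as "two tables too few".
    return total - 2 if buggy else total

def copy_tables(root, capacity):
    """Second pass: copy tables until the buffer is full."""
    copied = []
    for table in walk(root):
        if len(copied) == capacity:
            break  # buffer underallocated: later tables are dropped
        copied.append(table["id"])
    return copied

if __name__ == "__main__":
    root = build_example()
    print(copy_tables(root, count_tables(root)))
    print(copy_tables(root, count_tables(root, buggy=True)))
```

With the simulated miscount, the tables visited last are the ones missing from the copy, which matches the symptom of a #PF only after switching to the copied tables.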

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
@macpijan
Contributor

macpijan commented Aug 6, 2024

@miczyg1 This is configured globally for all platforms, despite being a problem on only one of them?
Do we see any potential problems caused by this workaround?

@krystian-hebel Would you like to take a look?

@krystian-hebel
Contributor

I don't like applying this workaround globally. It uses more memory and takes longer to boot, both when the page tables are created and later during page walks. I've tested a VP6650 build with both this change and the PR target, to make sure that other changes on dasharo didn't impact the results.

Before:

  Reserved  :         39,549 Pages (161,992,704 Bytes)
  LoaderCode:            278 Pages (1,138,688 Bytes)
  LoaderData:              0 Pages (0 Bytes)
  BS_Code   :          1,795 Pages (7,352,320 Bytes)
  BS_Data   :         11,750 Pages (48,128,000 Bytes)
  RT_Code   :            260 Pages (1,064,960 Bytes)
  RT_Data   :            962 Pages (3,940,352 Bytes)
  ACPI_Recl :             20 Pages (81,920 Bytes)
  ACPI_NVS  :             44 Pages (180,224 Bytes)
  MMIO      :             64 Pages (262,144 Bytes)
  MMIO_Port :              0 Pages (0 Bytes)
  PalCode   :              0 Pages (0 Bytes)
  Available :     16,722,558 Pages (68,495,597,568 Bytes)
  Persistent:              0 Pages (0 Bytes)

After:

  Reserved  :         39,549 Pages (161,992,704 Bytes)
  LoaderCode:            278 Pages (1,138,688 Bytes)
  LoaderData:              0 Pages (0 Bytes)
  BS_Code   :          1,795 Pages (7,352,320 Bytes)
  BS_Data   :         12,266 Pages (50,241,536 Bytes)
  RT_Code   :            260 Pages (1,064,960 Bytes)
  RT_Data   :            962 Pages (3,940,352 Bytes)
  ACPI_Recl :             20 Pages (81,920 Bytes)
  ACPI_NVS  :             44 Pages (180,224 Bytes)
  MMIO      :             64 Pages (262,144 Bytes)
  MMIO_Port :              0 Pages (0 Bytes)
  PalCode   :              0 Pages (0 Bytes)
  Available :     16,722,042 Pages (68,493,484,032 Bytes)
  Persistent:              0 Pages (0 Bytes)
              -------------- 
Total Memory:         65,381 MB (68,557,484,032 Bytes)

This is 516 more pages of BS_Data (and 516 fewer Available). Rounding that down to 2^9 pages: 2^9 page tables, each mapping 2^9 pages of 2^21 bytes, cover 2^39 bytes, i.e. 39 address bits. That would be the ideal number assuming that there actually were 512 new pages, all pages before were 1G and all pages after were 2M big. In practice, there would have been smaller pages, e.g. to allow for an NX stack, so the number of additional pages should be smaller. The fact that it isn't suggests that something is broken underneath.
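
The arithmetic above can be checked with a short script. This is only a sketch: the page counts are taken from the two memory maps, while the constant and function names are made up for illustration, and it assumes 4 KiB UEFI pages and 512-entry tables filled with 2M mappings.

```python
# Sanity check of the page-count arithmetic from the memory maps above.
# ASSUMPTION: names are illustrative; 4 KiB pages, 512 entries per table.

PAGE_SIZE = 4096                 # bytes per UEFI page / page table
ENTRIES_PER_TABLE = 512          # 2^9 entries in each paging structure
LARGE_PAGE = 2 * 1024 * 1024     # 2M mapping, 2^21 bytes

bs_data_before = 11_750          # BS_Data pages, "Before" map
bs_data_after = 12_266           # BS_Data pages, "After" map

extra_pages = bs_data_after - bs_data_before
extra_bytes = extra_pages * PAGE_SIZE

def mapped_bytes(page_directories):
    """Bytes covered by N page directories full of 2M entries."""
    return page_directories * ENTRIES_PER_TABLE * LARGE_PAGE

def address_bits(n_bytes):
    """Number of address bits needed to span n_bytes (a power of two)."""
    return n_bytes.bit_length() - 1

if __name__ == "__main__":
    print(extra_pages, extra_bytes)
    print(address_bits(mapped_bytes(512)))
```

The byte difference also agrees with the BS_Data byte counts in the two maps (50,241,536 minus 48,128,000).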

Have you tested QEMU with the same code, or on a rebased branch? Some bugs related to page tables (git log --grep=[^_e]bug[^z] --grep=[Pp]age --all-match) have been fixed since 2020, which is AFAICT the point at which dasharo deviated from upstream. Perhaps it would be worth checking whether a VP66xx build from the rebased branch has the same problem.

@miczyg1
Contributor Author

miczyg1 commented Aug 6, 2024

> @miczyg1 This is configured globally for all platforms, despite being a problem on only one of them? Do we see any potential problems caused by this workaround?

It is configured globally because all platforms may have a problem booting ESXi. Why should we fix only one platform if potentially all of them can be affected (the symptoms may be different, though)?

@miczyg1
Contributor Author

miczyg1 commented Aug 6, 2024

> Have you tested QEMU with the same code, or on a rebased branch? Some bugs related to page tables (git log --grep=[^_e]bug[^z] --grep=[Pp]age --all-match) have been fixed since 2020, which is AFAICT the point at which dasharo deviated from upstream. Perhaps it would be worth checking whether a VP66xx build from the rebased branch has the same problem.

I tested QEMU and VP66xx from the rebased branch. QEMU worked, but the result was the same for VP66xx: it did not work. Like I said, the fact that it works elsewhere could be just a fluke, as with the MSI board.

I also don't like the solution, because it is just a workaround for buggy code elsewhere.

@miczyg1
Contributor Author

miczyg1 commented Aug 9, 2024

Disabled it per board only for now: Dasharo/coreboot@722b4e9

@miczyg1 miczyg1 closed this Aug 9, 2024