Fix ze_peak explicit scaling benchmark #88

Open
wants to merge 1 commit into
base: master

lyu
Contributor

@lyu lyu commented Oct 16, 2024

The explicit scaling code for ze_peak violates the Level Zero (L0) spec and produces no execution overlap between sub-devices. This PR corrects both issues.

@lyu
Contributor Author

lyu commented Oct 16, 2024

Original execution flow for each subdevice ID

  1. Assume current subdevice ID is N
  2. Reset cmdlist N
  3. Append 1 memcpy to cmdlist N
  4. Close cmdlist N
  5. For each warmup iteration: [submit cmdlist N to subdevice N, and if this is the last subdevice then synchronize the cmdqueue of all subdevices]
  6. For each benchmark iteration: do the same as step 5
  7. Measure and return the time taken to do step 6

Suppose we have two subdevices, 0 and 1. For subdevice 0 there is no synchronization at all. Since the cmdqueue is asynchronous, we only measure the submission time, which is very small. For subdevice 1, the warmup in step 5 already calls cmdqueue sync on subdevice 0 before we actually run the benchmark on subdevice 1, so there is no overlap at all.

At the end we sum all the time measurements and calculate the BW. Although we had no overlap, we also didn't measure execution on subdevice 0, so we get about half of the actual time and thus double the BW. The reported number therefore looks like correct 2x scaling, which is why this bug was not discovered before.
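The arithmetic above can be checked with a toy timing model. This is a minimal sketch, not Level Zero code: the `Model` class, the 1 ms copy time, the 1 µs submission cost, and the 10 warmup iterations are all hypothetical values chosen for illustration. It models each cmdqueue as a point in time at which its queued work drains.

```python
# Toy timing model of the ORIGINAL flow (hypothetical names and numbers).

class Model:
    def __init__(self, num_subdevices, t_submit, t_exec):
        self.now = 0.0                            # host clock, in seconds
        self.busy_until = [0.0] * num_subdevices  # when each cmdqueue drains
        self.t_submit = t_submit                  # host cost of one submission
        self.t_exec = t_exec                      # device time for one memcpy

    def submit(self, n):
        # Asynchronous submit: the host pays only t_submit, the copy is queued.
        self.now += self.t_submit
        self.busy_until[n] = max(self.busy_until[n], self.now) + self.t_exec

    def sync_all(self):
        # The host blocks until every cmdqueue has drained.
        self.now = max(self.now, *self.busy_until)


def original_flow(num_subdevices=2, warmup=10, iters=500,
                  t_submit=1e-6, t_exec=1e-3):
    m = Model(num_subdevices, t_submit, t_exec)
    measured = []
    for n in range(num_subdevices):
        last = n == num_subdevices - 1
        for _ in range(warmup):          # step 5: warmup
            m.submit(n)
            if last:
                m.sync_all()
        start = m.now
        for _ in range(iters):           # step 6: benchmark
            m.submit(n)
            if last:
                m.sync_all()
        measured.append(m.now - start)   # step 7: time spent in step 6
    return measured, m.now


measured, wall = original_flow()
copies = 2 * 500
# Reported throughput relative to one subdevice doing one copy per t_exec.
reported_speedup = (copies / sum(measured)) / (1 / 1e-3)
print("measured per subdevice:", measured)
print("host wall time:", wall)
print("reported speedup:", reported_speedup)
```

Under these assumptions the measurement for subdevice 0 contains only submission time, the host wall time is about twice the summed measurement, and the reported speedup still comes out close to 2.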

Additionally, for subdevice 0 we do submit-submit-...(500 times)...-submit-sync using the same cmdlist & cmdqueue pair. This violates the L0 spec's description of zeCommandQueueExecuteCommandLists:

The application must ensure the device is not currently referencing the command list since the implementation is allowed to modify the contents of the command list for submission.

So we saw command buffer GPU page faults when running ze_peak on PVC.
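The rule can be made explicit with a small toy model. This is only an illustration under stated assumptions: `CommandList`, `CommandQueue`, and `violates_rule` are hypothetical Python stand-ins, not Level Zero APIs, and a real driver is not required to detect the violation (which is why it showed up as a page fault instead of an error).

```python
# Toy model of the "not currently referenced by the device" rule.
# All classes here are hypothetical stand-ins, not Level Zero objects.

class CommandList:
    def __init__(self):
        self.in_flight = False  # True while the device may still reference it


class CommandQueue:
    def __init__(self):
        self.pending = []

    def execute(self, cmdlist):
        if cmdlist.in_flight:
            raise RuntimeError(
                "cmdlist resubmitted while the device may still reference it")
        cmdlist.in_flight = True
        self.pending.append(cmdlist)

    def synchronize(self):
        # After a sync, the device no longer references any pending cmdlist.
        for cmdlist in self.pending:
            cmdlist.in_flight = False
        self.pending.clear()


def violates_rule(pattern):
    """Replay a sequence of 'submit' / 'sync' operations on one cmdlist."""
    queue, cmdlist = CommandQueue(), CommandList()
    try:
        for op in pattern:
            if op == "submit":
                queue.execute(cmdlist)
            else:
                queue.synchronize()
    except RuntimeError:
        return True
    return False


# Original pattern on subdevice 0: repeated submits, one sync at the end.
print(violates_rule(["submit", "submit", "sync"]))
# One submit per sync keeps the cmdlist unreferenced at each resubmission.
print(violates_rule(["submit", "sync", "submit", "sync"]))
```

In this toy, the original submit-submit-sync pattern trips the check, while a single submission followed by a sync does not.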

Corrected execution flow for each subdevice ID

  1. Assume current subdevice ID is N
  2. If N == 0 then reset all cmdlists
  3. Append 1 memcpy and 1 barrier to cmdlist N
  4. Close cmdlist N
  5. Run warmup iterations just on subdevice N
  6. Reset cmdlist N
  7. Append 500 [memcpy + barrier] to cmdlist N
  8. Close cmdlist N
  9. Submit cmdlist N to subdevice N, and if this is the last subdevice then synchronize the cmdqueue of all subdevices
  10. Measure and return the time taken to do step 9

In short, we now submit one cmdlist to each subdevice asynchronously and synchronize all subdevices only after every cmdlist has been submitted. Some warmup and cmdlist operation overhead is mixed into the measurement, and the barriers add their own overhead, but the measured BW is still very close to 2x the performance of a single subdevice.
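The corrected flow can be sketched with a toy timing model as well. This is a minimal sketch, not Level Zero code: the 1 ms copy time, 1 µs submission cost, and 10 warmup iterations are hypothetical, barrier and cmdlist build overhead are ignored, and warmup is assumed to be one submit followed by a sync on subdevice N only.

```python
# Toy timing model of the CORRECTED flow (hypothetical names and numbers).

def corrected_flow(num_subdevices=2, warmup=10, iters=500,
                   t_submit=1e-6, t_exec=1e-3):
    now = 0.0                               # host clock, in seconds
    busy_until = [0.0] * num_subdevices     # when each cmdqueue drains
    measured = []
    for n in range(num_subdevices):
        # Step 5: warmup on subdevice N only (assumed: submit, then sync N).
        for _ in range(warmup):
            now += t_submit
            busy_until[n] = max(busy_until[n], now) + t_exec
            now = max(now, busy_until[n])
        # Step 9: one submission of a cmdlist holding `iters` copies.
        start = now
        now += t_submit
        busy_until[n] = max(busy_until[n], now) + iters * t_exec
        if n == num_subdevices - 1:
            now = max(now, *busy_until)     # sync every cmdqueue
        measured.append(now - start)        # step 10: time spent in step 9
    return measured, busy_until


iters, t_exec = 500, 1e-3
measured, busy_until = corrected_flow(iters=iters, t_exec=t_exec)
# Reported throughput relative to one subdevice doing one copy per t_exec.
reported_speedup = (2 * iters / sum(measured)) / (1 / t_exec)
# Fraction of the benchmark window in which both subdevices are executing.
exec_starts = [b - iters * t_exec for b in busy_until]
overlap_fraction = (min(busy_until) - max(exec_starts)) / (iters * t_exec)
print("measured per subdevice:", measured)
print("reported speedup:", reported_speedup)
print("overlap fraction:", overlap_fraction)
```

Under these assumptions the two subdevices execute concurrently for most of the benchmark window, skewed only by the second subdevice's warmup, and the reported speedup is close to 2 for the right reason.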

The explicit scaling code for ze_peak violates L0 spec and has no
overlap between sub-devices. This PR corrects these issues.

Signed-off-by: Wenbin Lu <wenbin.lu@intel.com>