Fix ze_peak explicit scaling benchmark #88

lyu · 2024-10-16T15:16:08Z

The explicit scaling code for ze_peak violates the L0 spec and achieves no execution overlap between subdevices. This PR corrects both issues.

lyu · 2024-10-16T15:59:30Z

Original execution flow for each subdevice ID

Assume the current subdevice ID is N.

1. Reset cmdlist N
2. Append 1 memcpy to cmdlist N
3. Close cmdlist N
4. For each warmup iteration: submit cmdlist N to subdevice N, and if this is the last subdevice, synchronize the cmdqueues of all subdevices
5. For each benchmark iteration: do the same as step 4
6. Measure and return the time taken by step 5

Suppose we have two subdevices, 0 and 1. For subdevice 0 there is no synchronization at all; because the cmdqueue is asynchronous, we measure only the submission time, which is very small. For subdevice 1, the warmup iterations (step 4) synchronize the cmdqueue of subdevice 0 before the benchmark actually runs on subdevice 1, so there is no overlap at all.

At the end we sum all the time measurements and calculate the BW. Although there was no overlap, the execution time on subdevice 0 was never measured either, so we get half of the actual time and therefore double the BW. This is why the bug went undiscovered.
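To make the cancellation concrete, here is a toy simulation of the original flow on a simulated clock. It is only a sketch: `SimQueue`, `original_flow`, and all timing parameters (`copy_time`, `submit_cost`, and so on) are made up for illustration and are not part of ze_peak or the Level Zero API. It also assumes the BW is computed as total bytes divided by the sum of the per-subdevice measurements.

```python
class SimQueue:
    """Toy asynchronous command queue on a simulated clock (not the L0 API)."""

    def __init__(self):
        self.busy_until = 0.0  # simulated time at which all queued work is done

    def submit(self, now, exec_time):
        # Queued work starts once the queue is idle and the host has submitted it.
        self.busy_until = max(self.busy_until, now) + exec_time

    def synchronize(self, now):
        # The host blocks until the queue drains; returns the new host time.
        return max(now, self.busy_until)


def original_flow(num_subdevices=2, warmup=10, iters=500,
                  copy_time=1.0, submit_cost=0.001, bytes_per_copy=1.0):
    queues = [SimQueue() for _ in range(num_subdevices)]
    now = 0.0
    measured = []     # per-subdevice time of the benchmark loop (step 5)
    bench_start = []  # host time at which each benchmark loop begins

    def run_iterations(n, count, now):
        # Steps 4 and 5: submit cmdlist n; only the last subdevice synchronizes.
        for _ in range(count):
            now += submit_cost
            queues[n].submit(now, copy_time)
            if n == num_subdevices - 1:
                for q in queues:
                    now = q.synchronize(now)
        return now

    for n in range(num_subdevices):
        now = run_iterations(n, warmup, now)
        start = now
        bench_start.append(start)
        now = run_iterations(n, iters, now)
        measured.append(now - start)

    # Assumption: BW = total bytes / sum of the per-subdevice measurements.
    total_bytes = num_subdevices * iters * bytes_per_copy
    return {
        "measured": measured,
        "reported_bw": total_bytes / sum(measured),
        "single_bw": bytes_per_copy / copy_time,
        "sub0_done": queues[0].busy_until,
        "last_bench_start": bench_start[-1],
    }


result = original_flow()
print(result)
```

With the default parameters, subdevice 0 finishes all of its copies before the benchmark loop of subdevice 1 starts (`sub0_done <= last_bench_start`), yet the reported BW comes out at roughly twice the single-subdevice BW, because the measurement for subdevice 0 contains only submission cost.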

Additionally, for subdevice 0 we do submit-submit-...(500 times)...-submit-sync using the same cmdlist and cmdqueue pair. This violates the L0 spec's description of zeCommandQueueExecuteCommandLists, which states:

The application must ensure the device is not currently referencing the command list since the implementation is allowed to modify the contents of the command list for submission.

As a result, we saw command buffer GPU page faults when running ze_peak on PVC.
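The rule can be illustrated with a small toy model. This is not the Level Zero API: `ToyCommandList` and `ToyQueue` are hypothetical classes, and a real driver is not required to detect the misuse at all (on PVC it showed up as page faults rather than an error). The sketch only shows which submission patterns keep a command list from being resubmitted while the device may still reference it.

```python
class ToyCommandList:
    def __init__(self, num_copies):
        self.num_copies = num_copies
        self.in_flight = False  # True while the (toy) device may still read it


class ToyQueue:
    def __init__(self):
        self.pending = []

    def execute(self, cmdlist):
        # The toy model rejects what the spec forbids; real drivers may not.
        if cmdlist.in_flight:
            raise RuntimeError("command list is still referenced by the device")
        cmdlist.in_flight = True
        self.pending.append(cmdlist)

    def synchronize(self):
        # After synchronization nothing submitted earlier is referenced anymore.
        for cmdlist in self.pending:
            cmdlist.in_flight = False
        self.pending.clear()


def resubmit_without_sync(iterations):
    """Old pattern: submit the same cmdlist repeatedly, sync only at the end."""
    queue, cmdlist = ToyQueue(), ToyCommandList(num_copies=1)
    try:
        for _ in range(iterations):
            queue.execute(cmdlist)
        queue.synchronize()
        return "ok"
    except RuntimeError as err:
        return f"rejected: {err}"


def submit_once(num_copies):
    """New pattern: put all copies into one cmdlist and submit it a single time."""
    queue, cmdlist = ToyQueue(), ToyCommandList(num_copies)
    queue.execute(cmdlist)
    queue.synchronize()
    return "ok"


print(resubmit_without_sync(500))
print(submit_once(500))
```

In this model the second submission of the same list is already rejected, while a single submission of one large list is accepted.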

Corrected execution flow for each subdevice ID

Assume the current subdevice ID is N.

1. If N == 0, reset all cmdlists
2. Append 1 memcpy and 1 barrier to cmdlist N
3. Close cmdlist N
4. Run warmup iterations on subdevice N only
5. Reset cmdlist N
6. Append 500 [memcpy + barrier] to cmdlist N
7. Close cmdlist N
8. Submit cmdlist N to subdevice N, and if this is the last subdevice, synchronize the cmdqueues of all subdevices
9. Measure and return the time taken by step 8

In short, we now submit one cmdlist to each subdevice asynchronously and synchronize all subdevices only after every cmdlist has been submitted. Some warmup and cmdlist-operation overhead is mixed into the measurement, and the barriers add overhead of their own, but the measured BW is still very close to 2x the performance of a single subdevice.
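The corrected flow can be modeled on a simulated clock as well. Again this is only a sketch under stated assumptions: `SimQueue`, `corrected_flow`, and the timing parameters are hypothetical, each warmup iteration is assumed to be a submit followed by a wait on queue N only, barrier overhead is ignored, and the BW is assumed to be total bytes divided by the sum of the per-subdevice measurements.

```python
class SimQueue:
    """Toy asynchronous command queue on a simulated clock (not the L0 API)."""

    def __init__(self):
        self.busy_until = 0.0  # simulated time at which all queued work is done

    def submit(self, now, exec_time):
        # Queued work starts once the queue is idle and the host has submitted it.
        self.busy_until = max(self.busy_until, now) + exec_time

    def synchronize(self, now):
        # The host blocks until the queue drains; returns the new host time.
        return max(now, self.busy_until)


def corrected_flow(num_subdevices=2, warmup=10, copies=500,
                   copy_time=1.0, submit_cost=0.001, bytes_per_copy=1.0):
    queues = [SimQueue() for _ in range(num_subdevices)]
    now = 0.0
    measured = []  # per-subdevice time of the single submission (step 8)
    windows = []   # (start, end) of each subdevice's benchmark execution

    for n in range(num_subdevices):
        # Warmup on subdevice n only (assumed: submit, then wait on queue n).
        for _ in range(warmup):
            now += submit_cost
            queues[n].submit(now, copy_time)
            now = queues[n].synchronize(now)

        # One cmdlist holding all copies, submitted exactly once.
        start = now
        now += submit_cost
        exec_start = max(queues[n].busy_until, now)
        queues[n].submit(now, copies * copy_time)
        windows.append((exec_start, queues[n].busy_until))

        # Only the last subdevice waits, and it waits for every queue.
        if n == num_subdevices - 1:
            for q in queues:
                now = q.synchronize(now)
        measured.append(now - start)

    # Assumption: BW = total bytes / sum of the per-subdevice measurements.
    total_bytes = num_subdevices * copies * bytes_per_copy
    overlap = min(end for _, end in windows) - max(begin for begin, _ in windows)
    return {
        "measured": measured,
        "reported_bw": total_bytes / sum(measured),
        "single_bw": bytes_per_copy / copy_time,
        "overlap": overlap,
    }


result = corrected_flow()
print(result)
```

With the default parameters the two execution windows overlap for most of their duration; the warmup of the second subdevice shifts its start slightly, which is the overhead mentioned above. The reported BW lands just under 2x the single-subdevice BW.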

The explicit scaling code for ze_peak violates L0 spec and has no overlap between sub-devices. This PR corrects these issues. Signed-off-by: Wenbin Lu <wenbin.lu@intel.com>

lyu force-pushed the ze_peak_fix branch from d719a18 to dbc9719 Compare October 16, 2024 22:02
