fix(nvidia-smi/parse): do not parse remapped rows N/A (#128)
Fix the nvidia-smi parsing error that occurs when "Remapped Rows" is reported as N/A.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
gyuho authored Oct 19, 2024
1 parent a5b2f70 commit 5e217ec
Showing 3 changed files with 274 additions and 0 deletions.
10 changes: 10 additions & 0 deletions components/accelerator/nvidia/query/nvidia_smi_query.go
@@ -232,6 +232,16 @@ func ParseSMIQueryOutput(b []byte) (*SMIOutput, error) {
			// Processes : null
			currentLine = bytes.Replace(currentLine, []byte("None"), []byte("null"), 1)

		case bytes.Contains(currentLine, []byte("Remapped Rows")) && bytes.HasSuffix(bytes.TrimSpace(currentLine), []byte("N/A")):
			// e.g.,
			//
			// Remapped Rows : N/A
			//
			// should be
			//
			// Remapped Rows : null
			currentLine = bytes.Replace(currentLine, []byte("N/A"), []byte("null"), 1)

		case bytes.HasPrefix(lastKey, []byte("HW Slowdown")) ||
			bytes.HasPrefix(lastKey, []byte("HW Thermal Slowdown")) ||
			bytes.HasPrefix(lastKey, []byte("Process ID")) ||
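The new case can be exercised in isolation. The sketch below assumes a line-by-line pass over raw `nvidia-smi -q` output, similar in spirit to the loop that handles `currentLine` in `ParseSMIQueryOutput`; `normalizeRemappedRows` and `preprocess` are hypothetical helpers written for illustration, not functions from this repository. It shows that only a "Remapped Rows" line ending in N/A is rewritten: other N/A values (such as "Pending Page Blacklist") and a bare "Remapped Rows" section header pass through unchanged.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// normalizeRemappedRows is a hypothetical helper that mirrors the case added
// in this commit: "Remapped Rows : N/A" becomes "Remapped Rows : null".
// Any other line is returned unchanged.
func normalizeRemappedRows(line []byte) []byte {
	if bytes.Contains(line, []byte("Remapped Rows")) &&
		bytes.HasSuffix(bytes.TrimSpace(line), []byte("N/A")) {
		// Replace only the first "N/A"; the key itself never contains it,
		// so the first occurrence is the value.
		return bytes.Replace(line, []byte("N/A"), []byte("null"), 1)
	}
	return line
}

// preprocess (also hypothetical) applies normalizeRemappedRows to every line
// of raw nvidia-smi -q output and re-joins the lines with '\n'.
func preprocess(raw []byte) []byte {
	var out bytes.Buffer
	sc := bufio.NewScanner(bytes.NewReader(raw))
	for sc.Scan() {
		// sc.Bytes() is only valid until the next Scan, so write it out now.
		out.Write(normalizeRemappedRows(sc.Bytes()))
		out.WriteByte('\n')
	}
	return out.Bytes()
}

func main() {
	// Two lines taken from the 560.35.03 test fixture in this commit.
	excerpt := []byte("    Pending Page Blacklist : N/A\n    Remapped Rows : N/A\n")
	fmt.Print(string(preprocess(excerpt)))
}
```

Running it prints the excerpt with only the second line changed, to `Remapped Rows : null`.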
14 changes: 14 additions & 0 deletions components/accelerator/nvidia/query/nvidia_smi_query_test.go
@@ -28,6 +28,20 @@ func TestParseWithRemappedRows(t *testing.T) {
	}
}

func TestParseWithRemappedRowsNone(t *testing.T) {
	data, err := os.ReadFile("testdata/nvidia-smi-query.560.35.03.out.0.valid")
	if err != nil {
		t.Fatalf("failed to read file: %v", err)
	}
	parsed, err := ParseSMIQueryOutput(data)
	if err != nil {
		// Stop here: parsed may be nil, so indexing GPUs below would panic.
		t.Fatalf("Parse returned an error: %v", err)
	}
	if parsed.GPUs[0].RemappedRows != nil {
		t.Errorf("RemappedRows should be nil: %+v", parsed.GPUs[0].RemappedRows)
	}
}

func TestParseWithHWSlowdownActive(t *testing.T) {
	data, err := os.ReadFile("testdata/nvidia-smi-query.535.161.08.out.0.valid")
	if err != nil {
250 changes: 250 additions & 0 deletions components/accelerator/nvidia/query/testdata/nvidia-smi-query.560.35.03.out.0.valid
@@ -0,0 +1,250 @@
==============NVSMI LOG==============

Timestamp : Fri Oct 18 10:13:00 2024
Driver Version : 560.35.03
CUDA Version : 12.6

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA TITAN V
Product Brand : Titan
Product Architecture : Volta
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0334517001520
GPU UUID : GPU-f4c353ff-b56c-c1de-bcb9-7dad861b589a
Minor Number : 0
VBIOS Version : 88.00.36.00.01
MultiGPU Board : No
Board ID : 0x100
Board Part Number : 900-1G500-2500-000
GPU Part Number : 1D81-400-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
GSP Firmware Version : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x0
Device Id : 0x1D8110DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x121810DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : N/A
Fan Speed : 30 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 12288 MiB
Reserved : 239 MiB
Used : 290 MiB
Free : 11760 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 44 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 100 C
GPU Slowdown Temp : 97 C
GPU Max Operating Temp : 91 C
GPU Target Temperature : 84 C
Memory Current Temp : 41 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 29.02 W
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
GPU Memory Power Readings
Power Draw : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 850 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1200 MHz
Memory : 850 MHz
Default Applications Clocks
Graphics : 1200 MHz
Memory : 850 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1912 MHz
SM : 1912 MHz
Memory : 850 MHz
Video : 1717 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 100607
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 142 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 100838
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 33 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 101815
Type : G
Name : /opt/test
Used GPU Memory : 111 MiB
Capabilities
EGM : disabled
