
[WIP] Fix random crash when running TensorFlow resnet50 #142

Open
wants to merge 1 commit into
base: master

Conversation

yuhc
Member

@yuhc yuhc commented May 30, 2021

Motivation and Context

The issue was reported in #124, though the error may also appear in other APIs.
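For context, the linked issue reports the segfault inside the `__cudaPopCallConfiguration` stub in `tf_opt.c`. The sketch below is only a rough illustration of how that nvcc launch-configuration hook is commonly interposed and where a missing guard could turn into a segfault; it is not the code touched by this PR, and apart from the two `__cuda*CallConfiguration` signatures (which come from the CUDA runtime headers), every name in it is hypothetical.

```c
/* Hypothetical interposer for the nvcc launch-configuration helpers.
 * Illustration only; NOT the code changed by this PR. */
#include <cuda_runtime_api.h>
#include <vector_types.h>

#define CONFIG_STACK_DEPTH 16

struct call_config {
    dim3   grid_dim;
    dim3   block_dim;
    size_t shared_mem;
    void  *stream;
};

/* One stack per host thread: concurrent kernel launches must not see
 * each other's pending configurations. */
static __thread struct call_config config_stack[CONFIG_STACK_DEPTH];
static __thread int config_top = 0;

unsigned __cudaPushCallConfiguration(dim3 grid_dim, dim3 block_dim,
                                     size_t shared_mem, void *stream)
{
    if (config_top >= CONFIG_STACK_DEPTH)
        return 1;                   /* non-zero tells the stub the push failed */
    struct call_config *c = &config_stack[config_top++];
    c->grid_dim   = grid_dim;
    c->block_dim  = block_dim;
    c->shared_mem = shared_mem;
    c->stream     = stream;
    return 0;
}

cudaError_t __cudaPopCallConfiguration(dim3 *grid_dim, dim3 *block_dim,
                                       size_t *shared_mem, void *stream)
{
    /* Guard against popping an empty stack: without this check the pop
     * would read a stale slot and scribble through whatever pointers the
     * caller handed in, which is one plausible way to segfault here. */
    if (config_top <= 0)
        return cudaErrorMissingConfiguration;
    struct call_config *c = &config_stack[--config_top];
    *grid_dim   = c->grid_dim;
    *block_dim  = c->block_dim;
    *shared_mem = c->shared_mem;
    *(void **)stream = c->stream;   /* the stream argument is really a cudaStream_t* */
    return cudaSuccess;
}
```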

How has this been tested?

Benchmark: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

Run worker: ./install/bin/legacy_manager --worker_path ./install/tf_opt/bin/worker
Run benchmark: LD_LIBRARY_PATH=/project/hyu-build/install/tf_opt/lib/ python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

Errors

2021-05-29 17:59:21.207799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7250 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-05-29 17:59:21.644515: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
INFO:tensorflow:Running local_init_op.
I0529 17:59:21.832978 140152980096832 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0529 17:59:21.881336 140152980096832 session_manager.py:502] Done running local_init_op.
Running warm up
2021-05-29 17:59:23.026055: W tensorflow/stream_executor/cuda/cuda_blas.cc:236] To call cublasCreate
2021-05-29 17:59:23.026128: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2021-05-29 17:59:23.078337: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step    Img/sec total_loss
1       images/sec: 53.5 +/- 0.0 (jitter = 0.0) 8.169
10      images/sec: 56.1 +/- 0.9 (jitter = 1.2) 7.593
20      images/sec: 59.5 +/- 2.6 (jitter = 2.9) 7.696
30      images/sec: 60.6 +/- 2.2 (jitter = 4.8) 7.753
40      images/sec: 62.5 +/- 2.1 (jitter = 8.5) 8.007
2021-05-29 17:59:49.240899: E tensorflow/stream_executor/cuda/cuda_driver.cc:1145] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS; host dst: 0xa84f580; GPU src: 0x7f1f59481b00; size: 1=0x1
2021-05-29 17:59:49.241064: F tensorflow/core/common_runtime/gpu/gpu_util.cc:293] GPU->CPU Memcpy failed
2021-05-29 17:59:49.241191: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:495] Non-OK-status: CudaLaunchKernel(BlockReduceKernel<IN_T, T*, num_threads, Op>, num_blocks, num_threads, 0, cu_stream, in, (T*)temp_storage.flat<int8_t>().data(), in_size, op, init) status: Internal: an illegal memory access was encountered
Fatal Python error: Aborted

Thread 0x00007f77e897c740 (most recent call first):
  File "/project/venv-tf1.14/lib/python3.6/site-packages/tens
Aborted (core dumped)
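A note for anyone reading the log above: the CUDA_ERROR_ILLEGAL_ADDRESS at cuda_driver.cc:1145 does not necessarily mean the memcpy itself passed a bad pointer. Kernel launches are asynchronous, so an illegal access inside an earlier kernel is typically reported by the next call that touches the context, and the error then sticks to the context. The standalone CUDA snippet below (unrelated to this PR's code; all names are made up) reproduces that reporting pattern:

```cuda
// Minimal repro of the "error surfaces at a later call" pattern.
// Nothing here comes from this PR; it is only a generic CUDA example.
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately dereferences a null device pointer.
__global__ void bad_deref(int *p) { *p = 42; }

int main() {
    int *d_buf = nullptr;
    int h_val = 0;
    cudaMalloc(&d_buf, sizeof(int));

    bad_deref<<<1, 1>>>(nullptr);                     // enqueue the faulty kernel
    printf("after launch: %s\n",
           cudaGetErrorString(cudaGetLastError()));   // usually still "no error"

    // The illegal access is observed here (or at the synchronize), much like
    // the CUDA_ERROR_ILLEGAL_ADDRESS reported at cuda_driver.cc:1145 above.
    cudaMemcpyAsync(&h_val, d_buf, sizeof(int), cudaMemcpyDeviceToHost, 0);
    cudaError_t err = cudaDeviceSynchronize();
    printf("after sync: %s\n", cudaGetErrorString(err));
    return 0;
}
```

In this project's setting, the same pattern means the root cause can be an earlier forwarded kernel launch rather than the memcpy API that finally reports the error, which is consistent with the observation above that the error may also appear in other APIs.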

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Document update (this change is mainly a documentation update)

Checklist:

  • My code passes format and lint checks.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • I have tested my code with a reasonable workload.
  • My code may break some other features.

@yuhc yuhc added the bug Something isn't working label May 30, 2021
@yuhc yuhc changed the title Fix random crash when running TensorFlow resnet50 [WIP] Fix random crash when running TensorFlow resnet50 May 30, 2021
Successfully merging this pull request may close these issues.

A segmentation fault may occur in "__cudaPopCallConfiguration" in tf_opt.c