
Gradient overflow in finetuning #5

Open
ghaddarAbs opened this issue Jun 30, 2021 · 2 comments

Comments

@ghaddarAbs

Hi,

Thank you very much for the great work, and for sharing the fine-tuning data last week.
I ran into an issue when I tried to fine-tune and evaluate the model on Flickr30k using:

# I only ran the second command (GPU: 1, lr: 2e-5)
./bash/train_flickr.sh

Training starts off normally, but the loss suddenly begins to increase at epoch 6:


Epoch: 6: Step: 555/1511, loss=0.527620, loss_nce=0.527620, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 559/1511, loss=0.727350, loss_nce=0.727350, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 563/1511, loss=0.570808, loss_nce=0.570808, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 567/1511, loss=0.393095, loss_nce=0.393095, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 571/1511, loss=0.674848, loss_nce=0.674848, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 575/1511, loss=0.499143, loss_nce=0.499143, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 579/1511, loss=0.594417, loss_nce=0.594417, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 583/1511, loss=0.637567, loss_nce=0.637567, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 587/1511, loss=0.848309, loss_nce=0.848309, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 591/1511, loss=0.859852, loss_nce=0.859852, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 595/1511, loss=0.551946, loss_nce=0.551946, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 599/1511, loss=0.569656, loss_nce=0.569656, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 603/1511, loss=0.811136, loss_nce=0.811136, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 607/1511, loss=0.926843, loss_nce=0.926843, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 611/1511, loss=0.878590, loss_nce=0.878590, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 615/1511, loss=0.930382, loss_nce=0.930382, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 619/1511, loss=1.138345, loss_nce=1.138345, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 623/1511, loss=1.101084, loss_nce=1.101084, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 627/1511, loss=0.899013, loss_nce=0.899013, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 631/1511, loss=1.180095, loss_nce=1.180095, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 635/1511, loss=1.371186, loss_nce=1.371186, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 639/1511, loss=1.614157, loss_nce=1.614157, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 643/1511, loss=1.712646, loss_nce=1.712646, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 647/1511, loss=2.504568, loss_nce=2.504568, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 651/1511, loss=2.761936, loss_nce=2.761936, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 655/1511, loss=4.210203, loss_nce=4.210203, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 659/1511, loss=6.195764, loss_nce=6.195764, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 663/1511, loss=8.189028, loss_nce=8.189028, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 667/1511, loss=12.597887, loss_nce=12.597887, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 671/1511, loss=11.704583, loss_nce=11.704583, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 675/1511, loss=13.765331, loss_nce=13.765331, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 679/1511, loss=18.207155, loss_nce=18.207155, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 683/1511, loss=16.359169, loss_nce=16.359169, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 687/1511, loss=20.523600, loss_nce=20.523600, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 691/1511, loss=27.668240, loss_nce=27.668240, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 695/1511, loss=30.855385, loss_nce=30.855385, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 699/1511, loss=35.086441, loss_nce=35.086441, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 703/1511, loss=30.574892, loss_nce=30.574892, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 707/1511, loss=52.953876, loss_nce=52.953876, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 711/1511, loss=40.207417, loss_nce=40.207417, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 715/1511, loss=53.108303, loss_nce=53.108303, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 719/1511, loss=47.695160, loss_nce=47.695160, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 723/1511, loss=45.211182, loss_nce=45.211182, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 727/1511, loss=49.979271, loss_nce=49.979271, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 731/1511, loss=45.502415, loss_nce=45.502415, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 735/1511, loss=42.128304, loss_nce=42.128304, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 739/1511, loss=57.433262, loss_nce=57.433262, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 743/1511, loss=70.618607, loss_nce=70.618607, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 747/1511, loss=52.835541, loss_nce=52.835541, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 751/1511, loss=57.775532, loss_nce=57.775532, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 755/1511, loss=75.909271, loss_nce=75.909271, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 759/1511, loss=47.627548, loss_nce=47.627548, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 763/1511, loss=55.984451, loss_nce=55.984451, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 767/1511, loss=39.634636, loss_nce=39.634636, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 771/1511, loss=43.213181, loss_nce=43.213181, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 775/1511, loss=37.875175, loss_nce=37.875175, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 779/1511, loss=45.833000, loss_nce=45.833000, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 783/1511, loss=42.249699, loss_nce=42.249699, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 787/1511, loss=49.242207, loss_nce=49.242207, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 791/1511, loss=59.082058, loss_nce=59.082058, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 795/1511, loss=44.366467, loss_nce=44.366467, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 799/1511, loss=61.286034, loss_nce=61.286034, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 803/1511, loss=65.236374, loss_nce=65.236374, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 807/1511, loss=55.568848, loss_nce=55.568848, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 811/1511, loss=81.588463, loss_nce=81.588463, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 815/1511, loss=138.267487, loss_nce=138.267487, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 819/1511, loss=205.398163, loss_nce=205.398163, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 823/1511, loss=106.781647, loss_nce=106.781647, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 827/1511, loss=114.370003, loss_nce=114.370003, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 831/1511, loss=85.564255, loss_nce=85.564255, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 835/1511, loss=58.856918, loss_nce=58.856918, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 839/1511, loss=48.463295, loss_nce=48.463295, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 843/1511, loss=49.180916, loss_nce=49.180916, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 847/1511, loss=42.912064, loss_nce=42.912064, loss_kd=0.0, lr=0.000013
Epoch: 6: Step: 851/1511, loss=33.153042, loss_nce=33.153042, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 855/1511, loss=49.714306, loss_nce=49.714306, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 859/1511, loss=30.225197, loss_nce=30.225197, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 863/1511, loss=40.542446, loss_nce=40.542446, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 867/1511, loss=42.657013, loss_nce=42.657013, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 871/1511, loss=29.824253, loss_nce=29.824253, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 875/1511, loss=38.451778, loss_nce=38.451778, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 879/1511, loss=30.017517, loss_nce=30.017517, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 883/1511, loss=30.451855, loss_nce=30.451855, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 887/1511, loss=24.856079, loss_nce=24.856079, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 891/1511, loss=26.671665, loss_nce=26.671665, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 895/1511, loss=24.949318, loss_nce=24.949318, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 899/1511, loss=24.966484, loss_nce=24.966484, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 903/1511, loss=31.370058, loss_nce=31.370058, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 907/1511, loss=54.106686, loss_nce=54.106686, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 911/1511, loss=27.364002, loss_nce=27.364002, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 915/1511, loss=31.717720, loss_nce=31.717720, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 919/1511, loss=32.850029, loss_nce=32.850029, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 923/1511, loss=36.481514, loss_nce=36.481514, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 927/1511, loss=36.080856, loss_nce=36.080856, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 931/1511, loss=43.164818, loss_nce=43.164818, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 935/1511, loss=82.020950, loss_nce=82.020950, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 939/1511, loss=36.782185, loss_nce=36.782185, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 943/1511, loss=32.322525, loss_nce=32.322525, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 947/1511, loss=37.928696, loss_nce=37.928696, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 951/1511, loss=37.906788, loss_nce=37.906788, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 955/1511, loss=40.255390, loss_nce=40.255390, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 959/1511, loss=36.430790, loss_nce=36.430790, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 963/1511, loss=34.600498, loss_nce=34.600498, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 967/1511, loss=39.713654, loss_nce=39.713654, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 971/1511, loss=46.052864, loss_nce=46.052864, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 975/1511, loss=37.347187, loss_nce=37.347187, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 979/1511, loss=41.355392, loss_nce=41.355392, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 983/1511, loss=45.157066, loss_nce=45.157066, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 987/1511, loss=32.828815, loss_nce=32.828815, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 991/1511, loss=55.191578, loss_nce=55.191578, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 995/1511, loss=49.200516, loss_nce=49.200516, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 999/1511, loss=34.357136, loss_nce=34.357136, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1003/1511, loss=37.069489, loss_nce=37.069489, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1007/1511, loss=45.910133, loss_nce=45.910133, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1011/1511, loss=41.456188, loss_nce=41.456188, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1015/1511, loss=60.424339, loss_nce=60.424339, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1019/1511, loss=35.902451, loss_nce=35.902451, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1023/1511, loss=43.260071, loss_nce=43.260071, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1027/1511, loss=39.661362, loss_nce=39.661362, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1031/1511, loss=64.590012, loss_nce=64.590012, loss_kd=0.0, lr=0.000012
Epoch: 6: Step: 1035/1511, loss=34.630993, loss_nce=34.630993, loss_kd=0.0, lr=0.000012

It continues like this until the end of training, and then the code crashes at evaluation:

Epoch: 14: Step: 1459/1511, loss=1448.427734, loss_nce=1448.427734, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1463/1511, loss=1645.300171, loss_nce=1645.300171, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1467/1511, loss=1398.610107, loss_nce=1398.610107, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1471/1511, loss=1394.673096, loss_nce=1394.673096, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1475/1511, loss=2031.539795, loss_nce=2031.539795, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1479/1511, loss=1238.061768, loss_nce=1238.061768, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1483/1511, loss=1475.774780, loss_nce=1475.774780, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1487/1511, loss=1240.767578, loss_nce=1240.767578, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1491/1511, loss=1186.123657, loss_nce=1186.123657, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1495/1511, loss=1728.326904, loss_nce=1728.326904, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1499/1511, loss=1731.635498, loss_nce=1731.635498, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1503/1511, loss=1679.102173, loss_nce=1679.102173, loss_kd=0.0, lr=0.000000
Epoch: 14: Step: 1507/1511, loss=1465.885498, loss_nce=1465.885498, loss_kd=0.0, lr=0.000000
Total data indexed 1014
Total data indexed 5070
Saved checkpoint at /path/to/flickr-bert-two_stream/2e-5_96_0_none_0.0_768_both_run1/biencoder.best.pt
Saved checkpoint at /path/to/flickr-bert-two_stream/2e-5_96_0_none_0.0_768_both_run1/biencoder.last.pt
test dataset len = 5000, dataloader len = 63
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.725290298461914e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.313225746154785e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.9103830456733704e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.8189894035458565e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.7763568394002505e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.440892098500626e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1102230246251565e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3877787807814457e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.469446951953614e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.673617379884035e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Traceback (most recent call last):
  File "train_itm.py", line 369, in <module>
    args.txt_retrieval, img2txt)
AttributeError: 'Namespace' object has no attribute 'txt_retrieval'
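The crash seems to come from the evaluation call at the end of train_itm.py reading an args.txt_retrieval attribute that the training argument parser apparently never registers. Below is a minimal, hypothetical sketch of a defensive workaround; the flag name follows the traceback, but the default, help text, and usage are assumptions on my side, not the repo's actual interface:

import argparse

# Hypothetical sketch only: the flag name matches the attribute the traceback
# complains about, but everything else here is assumed, not taken from the repo.
parser = argparse.ArgumentParser()
parser.add_argument('--txt_retrieval', action='store_true',
                    help='also report text-retrieval recall after training')
args = parser.parse_args([])                      # empty argv, illustration only
print(getattr(args, 'txt_retrieval', False))      # defensive read; prints False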

I then tried to evaluate the best model biencoder.best.pt using the following command:

python eval_itm.py ./config/flickr30k_eval_config.json /path/to/flickr-bert-two_stream/2e-5_96_0_none_0.0_768_both_run1/biencoder.best.pt

and got the following results:

Total data indexed 1000
Total data indexed 5000
time cost = 10.698805809020996s
average loss = nan, accuracy = 0.0126
indexed  1000 data
image retrieval recall = {1: 0.001, 5: 0.005, 10: 0.01}
txt retrieval recall = {1: 0.001, 5: 0.005, 10: 0.01}
@intersun
Owner

From the loss curve it looks like training ran fine for the first 6 epochs and then the loss suddenly blew up, which seems very similar to your previous issue. Can you try to reproduce the error by training on a smaller dataset (say the Flickr dev set, or a subset of the training set if you prefer) and resolve it using the suggestions from the other thread?
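For what it's worth, a common mitigation for this kind of FP16 divergence is to clip gradients on the FP32 master parameters before each optimizer step. A minimal sketch, assuming the apex O2 setup shown in your log; the model, optimizer, and max_norm value below are placeholders, not the repo's actual training loop:

import torch
from apex import amp

# Placeholder model/optimizer standing in for whatever train_itm.py builds.
model = torch.nn.Linear(768, 768).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2',
                                  loss_scale='dynamic')

def training_step(loss):
    optimizer.zero_grad()
    # Scale the loss so FP16 gradients do not underflow, then clip the FP32
    # master gradients before stepping so one bad batch cannot blow up the weights.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
    optimizer.step()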

As for the evaluation issue, I will investigate more this weekend. To me it looks like the checkpoint is NOT loading successfully (can you double-check this part?), so the model just picks images at random as retrieval results.
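One quick way to check that is to load the checkpoint on CPU and see what load_state_dict reports. A minimal sketch, assuming the file is a PyTorch state dict; the model construction is commented out because the exact class and constructor come from the repo:

import torch

ckpt_path = ('/path/to/flickr-bert-two_stream/'
             '2e-5_96_0_none_0.0_768_both_run1/biencoder.best.pt')

# Load on CPU so the inspection works regardless of GPU / amp state.
state = torch.load(ckpt_path, map_location='cpu')

# The file may be a raw state dict or a wrapper dict; print the top-level keys
# to see which (wrapper key names vary by repo).
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:10])

# model = BiEncoder(...)                              # repo-specific class
# result = model.load_state_dict(state, strict=False)
# print('missing keys:', result.missing_keys)         # both lists should be
# print('unexpected keys:', result.unexpected_keys)   # empty on a clean load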

@Zjamie813

Hello,
I am also trying to run the code to reproduce the fine-tuning results on Flickr30k, but I cannot find the shared data link for Flickr30k fine-tuning. Could you share it? Thank you.
