MultiVerS model training

MultiVerS training as described in the paper takes place in two stages.

  1. Pretraining on out-of-domain data (FEVER) + weakly-labeled in-domain data (PubMedQA + Evidence Inference).
  2. Finetuning on a target dataset: one of SciFact, HealthVer, or CovidFact.

Data and code are now available to do both stages of training. In practice, Stage 1 training is quite expensive (FEVER is big), so unless you have a lot of GPUs to burn, it's probably not worth doing it yourself. Instead, download the fever_sci checkpoint using get_checkpoint.py; this model is the output of Stage 1 training and is a good starting point for developing models for a new target dataset.

Data

The data were formatted slightly differently for training, compared to the final format I used for prediction. To download the training-formatted version, run bash script/get_data_train.sh. The script will download the training data into data_train. The data are organized into two subdirectories:

  • pretrain contains data for the three Stage 1 datasets.
    • For pretraining, I trained on 4 negative samples (i.e. claim / document pairs with label NEI) per positive sample.
  • target contains data for each of the target datasets.
    • For covidfact and healthver, I didn't do any negative sampling (it's not necessary since these datasets don't require document retrieval).
    • For SciFact, there are two versions of the training data: scifact_20 trains on 20 negative samples per positive, and scifact_10 trains on 10 negatives. The MultiVerS model reported in the paper is trained on scifact_20; this is the model you will get by running get_checkpoint scifact. In a subsequent paper, I discovered that training on 20 negatives led to overfitting; thus, I have also made available the version with 10 negative samples.
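To make the "N negatives per positive" idea concrete, here is a minimal sketch of negative sampling. It is not the preprocessing code used to build the released data (that code has not been released); the function name, the tuple layout, and the uniform random choice of negative documents are all assumptions made for illustration.

```python
import random


def sample_negatives(positives, all_doc_ids, n_negatives, seed=0):
    """Pair every positive example with n_negatives NEI documents.

    positives: list of (claim_id, doc_id, label) tuples (hypothetical layout).
    all_doc_ids: every document id in the corpus.
    """
    rng = random.Random(seed)

    # Documents holding evidence for a claim must never be drawn as its negatives.
    evidence_docs = {}
    for claim_id, doc_id, _ in positives:
        evidence_docs.setdefault(claim_id, set()).add(doc_id)

    examples = []
    for claim_id, doc_id, label in positives:
        examples.append({"claim_id": claim_id, "doc_id": doc_id, "label": label})
        candidates = [d for d in all_doc_ids if d not in evidence_docs[claim_id]]
        # random.sample raises ValueError if there are fewer candidates than requested.
        for neg_id in rng.sample(candidates, n_negatives):
            examples.append({"claim_id": claim_id, "doc_id": neg_id, "label": "NEI"})
    return examples


# Toy usage: 10 negatives per positive, as in scifact_10.
toy_positives = [("c1", "d1", "SUPPORTS"), ("c2", "d3", "REFUTES")]
toy_corpus = [f"d{i}" for i in range(1, 41)]
toy_examples = sample_negatives(toy_positives, toy_corpus, n_negatives=10)
```

With this scheme the training set grows to `1 + n_negatives` examples per positive, which is why the choice between 10 and 20 negatives changes the label balance the model sees.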

I haven't released the data preprocessing code to convert these datasets from their original form into the form available for download. The data preprocessing was kind of a hassle, so I probably won't bother unless there's a demand. If you need these scripts, please open an issue and I'll try to get them pulled together.

Training

  • To pretrain (Stage 1 training), use pretrain.py. Example usage: python script/pretrain.py --datasets fever,pubmedqa,evidence_inference
  • To finetune (Stage 2 training), use train_target.py. Example usage: python script/train_target.py --dataset scifact_20

For both scripts, the trained model will go in a directory named checkpoints_user. Note that train_target.py takes a single target dataset as input, while pretrain.py takes multiple datasets, separated by commas.

Batching and GPU usage

The models should be trained with an effective batch size of 8. The training scripts take care of the arithmetic to make this happen. The arguments requested from the user are:

  • --gpus: The number of GPUs to train on. If an integer is given (e.g. --gpus=2), this specifies the number of GPUs to use. If a comma-separated list of ints is given (e.g. --gpus=2,3), this specifies the specific devices to use for training. The total number of GPUs must be a power of 2; this is required to get the effective batch size to work out.
  • --gradient_checkpointing: Using this flag turns on gradient checkpointing, which reduces memory usage at the expense of slower compute. If you're hitting out-of-memory errors, you probably need gradient checkpointing. I've run model training on two types of GPUs:
    • NVIDIA GeForce RTX 2080 Ti with 11 GB memory. I needed to turn on gradient checkpointing for these.
    • NVIDIA Quadro RTX 6000 with 24 GB memory. For this, I ran with gradient checkpointing off.
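The batch-size arithmetic described above can be sketched roughly as follows. This is an illustration and not the code in the training scripts: the helper names are hypothetical, and the per-GPU batch size of 1 is an assumption. The idea is that gradient accumulation makes up whatever the GPUs do not cover in a single step.

```python
EFFECTIVE_BATCH_SIZE = 8  # target stated in the text above


def parse_gpus(gpus_arg):
    """Return the number of devices implied by a --gpus value.

    "2"   -> 2 devices (a single integer is a device count)
    "2,3" -> 2 devices (a comma-separated list names specific devices)
    """
    parts = [p.strip() for p in str(gpus_arg).split(",") if p.strip()]
    if not parts:
        raise ValueError("empty --gpus value")
    if len(parts) > 1:
        return len(parts)
    return int(parts[0])


def accumulation_steps(n_gpus, per_gpu_batch_size=1):
    """Gradient accumulation steps needed to reach the effective batch size.

    per_gpu_batch_size=1 is an assumption for this sketch.
    """
    per_step = n_gpus * per_gpu_batch_size
    if per_step < 1 or EFFECTIVE_BATCH_SIZE % per_step != 0:
        raise ValueError(f"{n_gpus} GPU(s) cannot reach an effective batch size of 8")
    return EFFECTIVE_BATCH_SIZE // per_step


for arg in ["1", "2", "0,1,2,3"]:
    n = parse_gpus(arg)
    print(f"--gpus={arg}: {n} device(s), accumulate over {accumulation_steps(n)} step(s)")
```

Under these assumptions, only device counts that divide 8 evenly work, which is consistent with the power-of-2 requirement on `--gpus`.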

Caveat: Unfortunately, multi-GPU training via DDP doesn't play nicely with gradient checkpointing; see for instance this issue. So, if you've got gradient checkpointing turned on, you're stuck training with a single GPU. I'd very much welcome a pull request fixing this.
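If you wrap the training scripts in your own launcher, a small pre-flight check can catch this combination before a job starts. The function below is a hypothetical helper, not part of the repository; it only encodes the two constraints stated in this section.

```python
def check_training_setup(n_gpus, gradient_checkpointing):
    """Raise ValueError for GPU setups this section says will not work."""
    # Constraint 1: the number of GPUs must be a power of 2.
    if n_gpus < 1 or (n_gpus & (n_gpus - 1)) != 0:
        raise ValueError(f"number of GPUs must be a power of 2, got {n_gpus}")
    # Constraint 2: gradient checkpointing and multi-GPU (DDP) training do not mix.
    if gradient_checkpointing and n_gpus > 1:
        raise ValueError("gradient checkpointing requires training on a single GPU")


# An 11 GB card with gradient checkpointing: single GPU only.
check_training_setup(n_gpus=1, gradient_checkpointing=True)
# Larger cards without gradient checkpointing can use several GPUs.
check_training_setup(n_gpus=4, gradient_checkpointing=False)
```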

Status

I've invoked the training scripts on all relevant datasets, and the scripts have run without breaking. I have not run all training scripts to completion or checked the performance of all the resulting models. I have checked with covidfact; the model I trained was within 1 F1 of the numbers reported in the paper. If you train a model and are seeing wildly different results, please raise an issue.