Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models


Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou

Qwen, Alibaba Inc.

✨ Overview

This repository contains the core implementation of AutoIF, proposed in Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models.

AutoIF is the first scalable and reliable method for automatically generating instruction-following data and verifying its quality using code execution feedback.


🚀 Data Synthesis of AutoIF

We divide AutoIF's data synthesis process into steps and provide 10-20 samples per step to make reproduction easier. Please remember to replace them with your own input.

🔧 Dependencies

General Setup Environment:

  • Python 3.9
  • PyTorch (currently tested on version 2.1.2+cu121)
  • Transformers (version 4.41.2; earlier versions are unlikely to work)

cd ./AutoIF/
pip install -r requirements.txt

Instruction Augmentation and Verification

First, we hand-write 36 seed instructions.

Step1: Self-instruct Seed Instructions

Concatenate the instruction with the RFT prompt.


Please perform RFT K times with a supervision model (e.g., GPT-4, Qwen2-72B) and save the results in the format used in seed_instruction.txt.
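The step above can be sketched as follows. This is a minimal illustration, not the repository's script: `RFT_TEMPLATE` is a hypothetical placeholder for the actual RFT prompt, and `generate` is an assumed callable that wraps whichever supervision model you use and returns its text output.

```python
from typing import Callable, List

# Hypothetical template: the repository's actual RFT prompt may differ.
RFT_TEMPLATE = (
    "Here are some seed instructions:\n"
    "{seeds}\n"
    "Write new instructions in the same style, one per line."
)


def build_rft_prompt(seed_instructions: List[str]) -> str:
    """Concatenate the seed instructions with the (placeholder) RFT prompt."""
    seeds = "\n".join(s.strip() for s in seed_instructions if s.strip())
    return RFT_TEMPLATE.format(seeds=seeds)


def augment_instructions(
    seed_instructions: List[str],
    generate: Callable[[str], str],
    k: int,
) -> List[str]:
    """Sample the model k times and collect new, de-duplicated instructions."""
    prompt = build_rft_prompt(seed_instructions)
    seen = {s.strip() for s in seed_instructions}
    augmented: List[str] = []
    for _ in range(k):
        for line in generate(prompt).splitlines():
            line = line.strip()
            # Skip blank lines, seeds, and instructions already collected.
            if line and line not in seen:
                seen.add(line)
                augmented.append(line)
    return augmented
```

De-duplicating against the seeds keeps repeated samples from inflating the augmented set.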

Step2: Verification Funcs and Cases Generation

Use the seed and augmented instructions to generate verification functions and test cases.


Please generate K verification functions and test cases for each sample, and save them in eval_func_rft.jsonl.

Step3: Quality Cross-validation

Cross-validate the pass rates of verification functions and cases to ensure high-quality instructions.
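A minimal sketch of this cross-validation, under several assumptions that are not taken from the repository: each verification function is a source string defining `evaluate(response)`, each test case is a dict with hypothetical keys `input` and `output`, and 0.5 is used as the pass-rate threshold. Model-generated code should be executed in a sandbox; plain `exec` is used here only to keep the example short.

```python
from typing import Callable, Dict, List, Optional, Tuple


def load_eval_func(source: str) -> Optional[Callable]:
    """Compile one generated function; return None if it does not compile."""
    namespace: Dict = {}
    try:
        exec(source, namespace)  # assumption: run inside a sandbox in practice
    except Exception:
        return None
    func = namespace.get("evaluate")
    return func if callable(func) else None


def run_case(func: Callable, case: Dict) -> bool:
    """True if the function's verdict matches the case's expected label."""
    try:
        return bool(func(case["input"])) == bool(case["output"])
    except Exception:
        return False


def cross_validate(
    func_sources: List[str], cases: List[Dict], min_rate: float = 0.5
) -> Tuple[List[str], List[Dict]]:
    """Keep cases most functions agree with, then functions that pass most kept cases."""
    funcs = []
    for src in func_sources:
        func = load_eval_func(src)
        if func is not None:
            funcs.append((src, func))
    if not funcs or not cases:
        return [], []
    kept_cases = [
        c for c in cases
        if sum(run_case(f, c) for _, f in funcs) / len(funcs) >= min_rate
    ]
    if not kept_cases:
        return [], []
    kept_funcs = [
        src for src, f in funcs
        if sum(run_case(f, c) for c in kept_cases) / len(kept_cases) >= min_rate
    ]
    return kept_funcs, kept_cases
```

An instruction whose function list or case list comes back empty has nothing left to verify it with and can be dropped.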


Step4 & 5: Back Translation

Please back-translate the verification functions into instructions, and then use mDeBERTa for consistency filtering.
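The filtering logic can be sketched independently of the model. In the example below, `nli_label` is an assumed callable (for instance, a wrapper around an mDeBERTa NLI classifier) that returns a label string for a premise/hypothesis pair. The keys `instruction` and `back_instruction` are hypothetical, and dropping only pairs labeled "contradiction" is an assumption about the filtering rule.

```python
from typing import Callable, Dict, List


def filter_by_consistency(
    samples: List[Dict], nli_label: Callable[[str, str], str]
) -> List[Dict]:
    """Keep samples whose back-translated instruction does not contradict the original."""
    kept = []
    for sample in samples:
        label = nli_label(sample["instruction"], sample["back_instruction"])
        # Assumption: only an explicit contradiction disqualifies a sample.
        if label.strip().lower() != "contradiction":
            kept.append(sample)
    return kept
```

Passing the classifier in as a callable makes it possible to test the filter without loading a model.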


Query Augmentation and Verification

Step1: Query Reforming and Augmentation

We randomly concatenate each query with K queries from ShareGPT and reformat them using our response RFT template.
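A minimal sketch of the concatenation, assuming the item being paired is an instruction verified in the previous stage (the text above calls it a query). The output keys and the newline separator are illustrative choices, and the response RFT template itself is not reproduced here.

```python
import random
from typing import Dict, List


def build_queries(
    instructions: List[str], sharegpt_queries: List[str], k: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Pair each instruction with k randomly drawn ShareGPT queries."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    combined = []
    for instruction in instructions:
        # random.sample raises ValueError if k exceeds the pool size.
        for query in rng.sample(sharegpt_queries, k):
            combined.append(
                {"instruction": instruction, "query": f"{query}\n{instruction}"}
            )
    return combined
```

Keeping the original instruction beside the combined query makes it easy to look up the matching verification functions later.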


Please use the supervision model to generate K responses for each query.

Step2: Instruction-following Verification

Cross-validate the pass rate of verification functions and augmented responses to obtain high-quality queries.


In this step, we also concatenate each sample with a consistency scoring prompt. Please score them using the supervision model.

Step3: Query Quality Verification

Finally, we keep the samples with a score > 8 and save them in LLaMA-Factory's SFT data format.
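A small sketch of this last step, reading the sentence above as retaining samples that score above 8. The input keys `query`, `response`, and `score` are hypothetical; the output uses the alpaca-style `instruction` / `input` / `output` fields that LLaMA-Factory accepts for SFT data.

```python
import json
from typing import Dict, List


def to_sft_records(samples: List[Dict], threshold: float = 8) -> List[Dict]:
    """Keep samples scoring above the threshold and map them to alpaca-style records."""
    return [
        {"instruction": s["query"], "input": "", "output": s["response"]}
        for s in samples
        if s["score"] > threshold
    ]


def save_records(records: List[Dict], path: str) -> None:
    """Write the records as one JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

The saved file still has to be registered in LLaMA-Factory's dataset_info.json before it can be passed to --dataset.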


⚡ DPO Data Construction


✨Tips: In our paper, DPO has two settings, which differ as follows:

  • Offline DPO: the responses come from your SFT data generated by the supervision model.
  • Online DPO: the responses are generated by your base model during each training iteration.

Please process your SFT data using the eval functions generated in the previous step, and format the results as dpo_query_eval_score_results.jsonl.

Step1: Verification Funcs Scoring

We verify the pass rate of each response using its corresponding verification funcs.
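As a hedged sketch, the pass rate (Acc) of one response can be computed as the fraction of its verification functions that return True. The example assumes each function is a source string defining `evaluate(response)`, and it counts a function that fails to compile or raises as a failure, which is one possible convention rather than necessarily the repository's. Run generated code in a sandbox in practice.

```python
from typing import Dict, List


def pass_rate(response: str, func_sources: List[str]) -> float:
    """Fraction of verification functions that accept the response."""
    results = []
    for source in func_sources:
        namespace: Dict = {}
        try:
            exec(source, namespace)  # assumption: sandbox this in real use
            results.append(bool(namespace["evaluate"](response)))
        except Exception:
            # Broken or crashing functions count as a failed check here.
            results.append(False)
    return sum(results) / len(results) if results else 0.0
```

For example, a response accepted by two of four functions gets Acc = 0.5.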


Step2: Data selection

We construct DPO pairs from positive samples (Acc>=0.5) and negative samples (Acc=0).


After construction, you need to convert the pairs into LLaMA-Factory's DPO data format.
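The pair construction and formatting can be sketched as below. Several details are assumptions: the input layout (`query` plus a `responses` list carrying `response` and `acc`) is hypothetical, every positive is paired with every negative (the repository may sample differently), and the output follows the alpaca-style preference layout in which `output` is a two-element list with the chosen response first. Check that layout against the data README of the LLaMA-Factory version you use.

```python
from itertools import product
from typing import Dict, List


def build_dpo_pairs(scored_items: List[Dict], pos_threshold: float = 0.5) -> List[Dict]:
    """Pair responses with Acc >= pos_threshold against responses with Acc == 0."""
    pairs = []
    for item in scored_items:
        chosen = [r["response"] for r in item["responses"] if r["acc"] >= pos_threshold]
        rejected = [r["response"] for r in item["responses"] if r["acc"] == 0]
        # Queries lacking either a positive or a negative yield no pairs.
        for good, bad in product(chosen, rejected):
            pairs.append(
                {"instruction": item["query"], "input": "", "output": [good, bad]}
            )
    return pairs
```

Responses with 0 < Acc < 0.5 are used on neither side, which leaves a margin between chosen and rejected samples.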

🎯 Training

We use LLaMA-Factory v0.6.3. Thanks for their excellent work.

✨Tips: the difference between our two setups:

  • Strong-to-Weak Distillation: we use a powerful model (e.g., GPT-4, Qwen2-72B, Llama3-70B) as the supervision model and a weaker model (e.g., Qwen2-7B, Llama3-8B) as the base model.
  • Self-Alignment: we use the same model (e.g., Qwen2-72B, Llama3-70B) as supervision and base model.

(1) SFT Training:

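# Assumption: src/train_bash.py is the LLaMA-Factory v0.6.3 training entry point; adjust the path to your checkout.
deepspeed --num_gpus=8 src/train_bash.py \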
        --deepspeed $deepspeed_zero3_config_path \
        --stage sft \
        --do_train \
        --use_fast_tokenizer \
        --flash_attn \
        --adam_beta1 0.9 \
        --adam_beta2 0.95 \
        --model_name_or_path $MODEL_PATH \
        --dataset $dataset \
        --template $Template \
        --finetuning_type full \
        --output_dir $OUTPUT_PATH \
        --overwrite_cache \
        --overwrite_output_dir \
        --warmup_steps 20 \
        --weight_decay 0.1 \
        --per_device_train_batch_size 4 \
        --gradient_accumulation_steps 4 \
        --ddp_timeout 9000 \
        --learning_rate 7e-6 \
        --lr_scheduler_type "linear" \
        --logging_steps 1 \
        --cutoff_len 8192 \
        --save_steps 200 \
        --num_train_epochs 3.0 \
        --plot_loss

(2) DPO Training:

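# Assumption: src/train_bash.py is the LLaMA-Factory v0.6.3 training entry point; adjust the path to your checkout.
deepspeed --num_gpus 8 src/train_bash.py \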
        --deepspeed $deepspeed_zero3_config_path \
        --stage dpo \
        --do_train \
        --model_name_or_path $MODEL_PATH \
        --dataset $dataset \
        --dataset_dir $DATA_PATH \
        --template $Template \
        --finetuning_type full \
        --output_dir $OUTPUT_PATH \
        --overwrite_cache \
        --overwrite_output_dir \
        --cutoff_len 4096 \
        --preprocessing_num_workers 1 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 2 \
        --lr_scheduler_type cosine \
        --logging_steps 10 \
        --warmup_ratio 0.1 \
        --save_steps 1000 \
        --learning_rate 5e-6 \
        --num_train_epochs 2.0 \
        --max_samples 200000 \
        --ddp_timeout 180000000 \
        --plot_loss

For implementation details that differ between training the 7B and 70B models, please refer to our paper.
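When adapting the two commands above to a different GPU count, it can help to keep the global batch size fixed. The short calculation below derives it from the flags shown, assuming all 8 GPUs act as data-parallel workers; the helper name is illustrative.

```python
def global_batch_size(num_gpus: int, per_device: int, grad_accum: int) -> int:
    """Sequences consumed per optimizer step under pure data parallelism."""
    return num_gpus * per_device * grad_accum


# SFT: --num_gpus=8, --per_device_train_batch_size 4, --gradient_accumulation_steps 4
sft_batch = global_batch_size(8, 4, 4)
# DPO: --num_gpus 8, --per_device_train_batch_size 1, --gradient_accumulation_steps 2
dpo_batch = global_batch_size(8, 1, 2)
print(sft_batch, dpo_batch)  # 128 16
```

With 4 GPUs, for instance, doubling --gradient_accumulation_steps keeps both values unchanged.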

