OpenMatch v2

OpenMatch v2 is an all-in-one toolkit for information retrieval (IR), currently under active development. It supports training and evaluating dense retrievers and re-rankers, with deep integration of Hugging Face Transformers and Datasets.

Features

• Human-friendly interface for training and testing dense retrievers and re-rankers
• Support for various pre-trained language models (BERT, RoBERTa, T5, ...)
• Native support for common IR and QA datasets (MS MARCO, NQ, KILT, BEIR, ...)
• Deep integration with Hugging Face Transformers and Datasets
• Efficient training and inference via stream-style data loading

Installation

To install OpenMatch v2, follow these steps:

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

Note: -e installs the package in editable mode, so changes you make to the code in your directory take effect without reinstalling.

Dependency Management

We do not include all the requirements in the package. You may need to manually install some dependencies based on your environment:

• torch and tensorboard for model training and visualization. Install with:

pip install torch tensorboard

• faiss for dense retrieval. Choose between faiss-cpu or faiss-gpu depending on your system. Make sure that the correct version of faiss-gpu is installed for your CUDA environment. Install faiss with:

conda install faiss-cpu -c pytorch
# or
conda install faiss-gpu -c pytorch

Note: If you encounter GPU search errors (especially with CUDA >= 11.0), you may need to install faiss-gpu manually via conda instead of pip.
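Because these dependencies are installed separately, it can help to confirm they are importable before training. The following is a minimal sketch, not part of OpenMatch: the helper name missing_modules is hypothetical, and it only checks that each module can be located, not that the installed version matches your CUDA environment.

```python
import importlib.util

# Optional dependencies named in this section; "faiss" is the import name
# shared by the faiss-cpu and faiss-gpu packages.
OPTIONAL_MODULES = ["torch", "tensorboard", "faiss"]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(OPTIONAL_MODULES)
    if missing:
        print("Missing optional dependencies:", ", ".join(missing))
    else:
        print("All optional dependencies are importable.")
```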

Quick Start Guide

This section demonstrates how to set up and run a simple retrieval task using OpenMatch v2.

Step 1: Data Preparation

First, download a supported dataset for training and evaluation. The commands below fetch and extract MS MARCO:

wget --no-check-certificate https://rocketqa.bj.bcebos.com/corpus/marco.tar.gz
tar -zxf marco.tar.gz
rm marco.tar.gz
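The commands in the following steps refer to several environment variables ($COLLECTION_DIR, $PROCESSED_DIR, $CHECKPOINT_DIR, $LOG_DIR, $EMBEDDING_DIR, $RESULT_DIR) that this guide does not define. The sketch below shows one possible way to create a directory for each and print matching export lines. The helper name prepare_workspace and the subdirectory names are assumptions for illustration; OpenMatch does not prescribe this layout.

```python
from pathlib import Path

# Variable names are taken from the commands in this guide; the directory
# layout under `root` is an assumption, not something OpenMatch requires.
WORKSPACE_VARS = {
    "COLLECTION_DIR": "collections",
    "PROCESSED_DIR": "processed",
    "CHECKPOINT_DIR": "checkpoints",
    "LOG_DIR": "logs",
    "EMBEDDING_DIR": "embeddings",
    "RESULT_DIR": "results",
}


def prepare_workspace(root):
    """Create one directory per variable under `root` and return the mapping."""
    root = Path(root).expanduser().resolve()
    paths = {}
    for var, sub in WORKSPACE_VARS.items():
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)
        paths[var] = str(path)
    return paths


if __name__ == "__main__":
    for var, path in prepare_workspace("./openmatch_workspace").items():
        # Print shell export lines that can be pasted into a terminal.
        print(f'export {var}="{path}"')
```

Since later steps read from $COLLECTION_DIR/marco/, the archive would presumably be extracted inside $COLLECTION_DIR.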

Step 2: Train the Model

python -m openmatch.driver.train_dr \
    --output_dir $CHECKPOINT_DIR/msmarco/t5 \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --save_steps 20000 \
    --eval_steps 20000 \
    --train_path $PROCESSED_DIR/msmarco/t5/train.new.jsonl \
    --eval_path $PROCESSED_DIR/msmarco/t5/val.jsonl \
    --fp16 \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3 \
    --learning_rate 5e-6 \
    --logging_dir $LOG_DIR/msmarco/t5 \
    --evaluation_strategy steps
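To get a feel for how --per_device_train_batch_size, --num_train_epochs, and --save_steps interact, the following sketch estimates the number of optimizer steps and the steps at which checkpoints would be written. It assumes the usual Hugging Face Trainer accounting with no gradient accumulation; the training set size (400,000 examples) and the single GPU are hypothetical values chosen for illustration, and training_schedule is not an OpenMatch function.

```python
import math


def training_schedule(num_examples, per_device_batch_size, num_devices,
                      num_epochs, save_steps):
    """Return (total optimizer steps, steps at which checkpoints are saved)."""
    global_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / global_batch)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return total_steps, checkpoints


# 400_000 examples and 1 GPU are hypothetical; the other values come
# from the training command above.
total, ckpts = training_schedule(
    num_examples=400_000,
    per_device_batch_size=8,
    num_devices=1,
    num_epochs=3,
    save_steps=20_000,
)
print(total)       # 150000
print(len(ckpts))  # 7
```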

Step 3: Inference and Retrieval

python -m openmatch.driver.build_index \
    --output_dir $EMBEDDING_DIR/msmarco/t5 \
    --model_name_or_path $CHECKPOINT_DIR/msmarco/t5 \
    --per_device_eval_batch_size 256 \
    --corpus_path $COLLECTION_DIR/marco/corpus.tsv \
    --q_max_len 32 \
    --p_max_len 128 \
    --fp16

python -m openmatch.driver.retrieve \
    --output_dir $RESULT_DIR/msmarco/t5 \
    --model_name_or_path $CHECKPOINT_DIR/msmarco/t5 \
    --query_path $COLLECTION_DIR/marco/dev.query.txt \
    --trec_save_path $RESULT_DIR/msmarco/t5/dev.trec \
    --fp16
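The retrieval results are written to the path given by --trec_save_path. Assuming that file follows the standard six-column TREC run format (query ID, the literal Q0, document ID, rank, score, run tag), a minimal sketch for loading it in Python could look like this. The function name parse_trec_run and the sample lines are illustrative only.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse TREC run lines into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, rank, score, _tag = parts
        run[qid].append((doc_id, int(rank), float(score)))
    for hits in run.values():
        hits.sort(key=lambda hit: hit[1])  # order each query's hits by rank
    return dict(run)


# Made-up sample lines in the assumed format.
sample = [
    "q1 Q0 d7 1 12.5 demo",
    "q1 Q0 d3 2 11.0 demo",
    "q2 Q0 d9 1 9.75 demo",
]
run = parse_trec_run(sample)
print(run["q1"][0])  # ('d7', 1, 12.5)
```

To read a real file, pass an open file handle, for example parse_trec_run(open(path)).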

Step 4: Evaluation

# -m selects the evaluation metric; mrr.10 is MRR@10
python scripts/evaluate.py \
    -m mrr.10 \
    $COLLECTION_DIR/marco/qrels.dev.tsv \
    $RESULT_DIR/msmarco/t5/dev.trec
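For reference, MRR@10 is the mean, over queries, of the reciprocal rank of the first relevant document within the top 10 results (0 if none appears). The sketch below computes it on small in-memory dictionaries. It is an illustration of the metric under the assumption that the average is taken over all judged queries, not a copy of what scripts/evaluate.py does internally; the name mrr_at_k and the toy data are hypothetical.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    run:   {query_id: [doc_id, ...]} ranked best first
    qrels: {query_id: set of relevant doc_ids}
    Queries with no relevant document in the top k contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, doc_id in enumerate(run.get(qid, [])[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Toy data: q1 hits at rank 2, q2 at rank 1, q3 has no relevant hit.
run = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"], "q3": ["d5"]}
qrels = {"q1": {"d7"}, "q2": {"d9"}, "q3": {"d8"}}
print(mrr_at_k(run, qrels))  # (1/2 + 1 + 0) / 3 = 0.5
```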

Note: This Quick Start Guide provides a streamlined process for setting up and training a dense retrieval model with OpenMatch v2. For more detailed instructions or advanced configurations, refer to the documentation.

Documentation

We are actively working on the docs.

Project Organizers

• Zhiyuan Liu, Tsinghua University
• Zhenghao Liu, Northeastern University
• Chenyan Xiong, Microsoft Research AI
• Maosong Sun, Tsinghua University

Acknowledgments

Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.

Contributing

We welcome contributions! To contribute:

Fork the repository.
Create a new branch for your changes.
Open a pull request, ensuring that your code passes all tests and follows the project’s style guidelines.

Contact

For any inquiries, please contact yushi17@foxmail.com.
