
PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

PEACH is a new sequence-to-sequence multilingual transformer model trained with semi-supervised pseudo-parallel document generation (SPDG), our proposed pre-training objective for multilingual models. This repository is the official implementation of the paper.

Abstract

Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents' quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.
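
To make the SPDG recipe concrete, the sketch below walks through its two steps: a dictionary-based word-by-word translation of a monolingual document, followed by a denoising model that repairs word order and wording to produce the pseudo-target. This is an illustrative sketch only; the function names, the word_dict lookup table, and the denoise callable are assumptions made for the example, not the repository's actual API.

```python
# Illustrative sketch of SPDG-style pseudo-parallel data generation.
# Everything here (word_dict, the denoise callable, function names) is
# hypothetical; see the peach/translation and peach/denoising directories
# for the actual implementation.
from typing import Callable, Dict, List, Tuple


def word_by_word_translate(sentence: str, word_dict: Dict[str, str]) -> str:
    """Translate a sentence token by token with a bilingual dictionary.
    Unknown words are kept unchanged as a simple fallback."""
    return " ".join(word_dict.get(tok.lower(), tok) for tok in sentence.split())


def make_pseudo_parallel(
    monolingual_docs: List[str],
    word_dict: Dict[str, str],
    denoise: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each source document with a denoised word-by-word pseudo-translation."""
    pairs = []
    for doc in monolingual_docs:
        rough = word_by_word_translate(doc, word_dict)  # noisy literal draft
        pseudo = denoise(rough)                         # reorder/add/remove/substitute words
        pairs.append((doc, pseudo))                     # (source, pseudo-target) training pair
    return pairs


if __name__ == "__main__":
    # Toy usage with an identity function standing in for the pre-trained denoiser.
    en_de = {"the": "die", "cat": "Katze", "sleeps": "schläft"}
    print(make_pseudo_parallel(["the cat sleeps"], en_de, denoise=lambda s: s))
    # [('the cat sleeps', 'die Katze schläft')]
```

The resulting (source, pseudo-target) pairs are what PEACH is pre-trained on in place of human-generated parallel data.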

Repository structure

The repository is organized as follows:

|
|__ models
|   |
|   |__ peach
|   |   |
|   |   |__ bin
|   |   |__ data
|   |   |__ datasets
|   |   |__ eval
|   |   |__ layers
|   |   |__ models
|   |   |__ ops
|   |   |__ params
|   |
|   |__ requirements.txt
|   |__ setup.py
|
|__ requirements.txt
|__ T5
|__ mBART
|__ peach
    |
    |__ denoising
    |__ translation

The models directory contains the TensorFlow code for building the models and their parameters. The README files in that directory explain exactly how the code works and how it can be used.

The pretrain directory contains the implementation of our pre-training objective, as well as mT5's and mBART's objectives.

For our objective, we have two pre-training methods:

  • word-by-word translation, which can be found in the translation directory
  • denoising, which can be found in the denoising directory
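
As a rough illustration of the denoising step, the snippet below applies the four corruption operations named in the abstract (reorder, add, remove, and substitute words) to a sentence; the denoising model is then trained to map the corrupted text back to the original. The operation order, probabilities, and filler vocabulary here are invented for the example and do not reflect the repository's actual settings.

```python
# Toy noising function for denoising pre-training: the denoiser learns to map
# noise(sentence) back to sentence. Probabilities, the filler vocabulary, and
# the local-shuffle window are illustrative assumptions, not the repo's values.
import random

FILLERS = ["the", "a", "of", "and"]  # hypothetical filler words to insert


def noise(sentence: str, p: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    tokens = sentence.split()

    # substitute: replace some tokens with a random filler word
    tokens = [rng.choice(FILLERS) if rng.random() < p else t for t in tokens]
    # remove: drop some tokens
    tokens = [t for t in tokens if rng.random() >= p]
    # add: insert filler words at random positions
    for _ in range(int(len(tokens) * p)):
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(FILLERS))
    # reorder: shuffle a small local window of tokens
    if tokens:
        i = rng.randrange(len(tokens))
        window = tokens[i:i + 3]
        rng.shuffle(window)
        tokens[i:i + 3] = window

    return " ".join(tokens)


# A training pair is (noise(doc), doc): the model learns to restore the original.
print(noise("multilingual pre-training improves machine translation quality"))
```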

To learn how to change the models' parameters and hyperparameters, read the README files in the models directory.

The peach_training_finetuning.ipynb notebook shows how to generate the pre-training data for the different models, how to train them, and how to fine-tune them. You can use the checkpoints linked below instead of training the models from scratch.

Links to models

Denoising models

Here are the links to the denoising models.

| language | model | vocab |
| --- | --- | --- |
| German (de) | download | download |
| English (en) | download | download |
| French (fr) | download | download |
| Macedonian (mk) | download | download |

Masked Language Modeling objective models

Pre-trained and fine-tuned models for MLM objective:

Pre-trained model links:

| language | model | vocab |
| --- | --- | --- |
| en, fr, and de | download | download |
| en, fr, and de (XNLI) | download | download |
| en and mk | download | download |

Fine-tuned models:

| language pair | model | vocab |
| --- | --- | --- |
| de-en | download | download |
| de-fr | download | download |
| en-de | download | download |
| en-fr | download | download |
| fr-de | download | download |
| fr-en | download | download |
| en-mk | download | download |
| mk-en | download | download |

Masked Language Modeling with Reordering objective models

Pre-trained and fine-tuned models for MLM with Reordering objective:

Pre-trained model links:

| language | model | vocab |
| --- | --- | --- |
| en, fr, and de | download | download |
| en and mk | download | download |

The XNLI model for MLM with Reordering is available here:

| model | vocab |
| --- | --- |
| download | download |

Fine-tuned models:

| language pair | model | vocab |
| --- | --- | --- |
| de-en | download | download |
| de-fr | download | download |
| en-de | download | download |
| en-fr | download | download |
| fr-de | download | download |
| fr-en | download | download |
| en-mk | download | download |
| mk-en | download | download |

SPDG objective models

| checkpoint | model | vocab |
| --- | --- | --- |
| checkpoint-100000 | download | download |
| checkpoint-200000 | download | download |
| checkpoint-300000 | download | download |
| checkpoint-400000 | download | download |
| checkpoint-500000 | download | download |

The XNLI model for SPDG is available here:

| model | vocab |
| --- | --- |
| download | download |

Pre-trained language-pair models:

| language pair | model | vocab |
| --- | --- | --- |
| en and de | download | download |
| en and fr | download | download |
| en and mk | download | download |
| fr and de | download | download |

Fine-tuned language-pair models:

| language pair | model | vocab |
| --- | --- | --- |
| en-de | download | download |
| de-en | download | download |
| en-fr | download | download |
| fr-en | download | download |
| de-fr | download | download |
| fr-de | download | download |
| en-mk | download | download |
| mk-en | download | download |

Transformer models:

| language pair | model | vocab |
| --- | --- | --- |
| de-en | download | download |
| de-fr | download | download |
| en-de | download | download |
| en-fr | download | download |
| fr-de | download | download |
| fr-en | download | download |

Translation models:

| checkpoint | language pair | model | vocab |
| --- | --- | --- | --- |
| checkpoint-100000 | de-en | download | download |
| checkpoint-100000 | de-fr | download | download |
| checkpoint-100000 | en-de | download | download |
| checkpoint-100000 | en-fr | download | download |
| checkpoint-100000 | fr-de | download | download |
| checkpoint-100000 | fr-en | download | download |
| checkpoint-200000 | de-en | download | download |
| checkpoint-200000 | de-fr | download | download |
| checkpoint-200000 | en-de | download | download |
| checkpoint-200000 | en-fr | download | download |
| checkpoint-200000 | fr-de | download | download |
| checkpoint-200000 | fr-en | download | download |
| checkpoint-300000 | de-en | download | download |
| checkpoint-300000 | de-fr | download | download |
| checkpoint-300000 | en-de | download | download |
| checkpoint-300000 | en-fr | download | download |
| checkpoint-300000 | fr-de | download | download |
| checkpoint-300000 | fr-en | download | download |
| checkpoint-400000 | de-en | download | download |
| checkpoint-400000 | de-fr | download | download |
| checkpoint-400000 | en-de | download | download |
| checkpoint-400000 | en-fr | download | download |
| checkpoint-400000 | fr-de | download | download |
| checkpoint-400000 | fr-en | download | download |
| checkpoint-500000 | de-en | download | download |
| checkpoint-500000 | de-fr | download | download |
| checkpoint-500000 | en-de | download | download |
| checkpoint-500000 | en-fr | download | download |
| checkpoint-500000 | fr-de | download | download |
| checkpoint-500000 | fr-en | download | download |

Citation

@inproceedings{salemi-etal-2023-peach,
    title = "{PEACH}: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation",
    author = "Salemi, Alireza  and
      Abaskohi, Amirhossein  and
      Tavakoli, Sara  and
      Shakery, Azadeh  and
      Yaghoobzadeh, Yadollah",
    booktitle = "Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.loresmt-1.3",
    pages = "32--46",
    abstract = "Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents{'} quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH{'}s ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.",
}
