Training bilingual models without parallel corpora involves using methods that allow a model to learn from monolingual data in multiple languages. This approach is crucial because parallel corpora, which consist of aligned sentence pairs in different languages, are not always available or easy to create, especially for low-resource languages.
Several techniques have emerged to tackle this challenge:
1. Unsupervised Machine Translation (UMT)
Unsupervised machine translation refers to methods that train translation models without relying on parallel corpora. The model learns translation by leveraging monolingual corpora in each language.
- Back-Translation: A popular technique in unsupervised machine translation is back-translation, where a model translates text from one language into another and then back into the original language. This creates synthetic parallel data that can be used for training (a minimal code sketch follows this list). The process works as follows:
  - Translate text from the source language to the target language using an initial model.
  - Translate the generated target-language text back into the source language.
  - Use these synthetic sentence pairs to train the model.
- Cycle Consistency: A cycle-consistency constraint requires that translating a sentence from language A to language B and then back to language A reproduces the original sentence. Enforcing this helps refine translation quality without parallel data.
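A minimal sketch of one back-translation round, assuming two hypothetical model objects `src2tgt` and `tgt2src` that expose `translate` and `train_step` methods (placeholders for whatever sequence-to-sequence toolkit is used):

```python
def back_translation_round(src2tgt, tgt2src, tgt_monolingual, batch_size=32):
    """One back-translation round: build synthetic (source, target) pairs from
    target-side monolingual text, then update the source-to-target model."""
    for i in range(0, len(tgt_monolingual), batch_size):
        tgt_batch = tgt_monolingual[i:i + batch_size]
        # Translate the real target-language sentences back into the source
        # language to obtain synthetic source sentences.
        synthetic_src = tgt2src.translate(tgt_batch)
        # Train on the synthetic pairs: the target side is genuine monolingual
        # text, so the model learns to produce fluent target-language output.
        src2tgt.train_step(synthetic_src, tgt_batch)

# In practice the two directions are alternated (iterative back-translation),
# so each round's improved models generate better synthetic data for the next.
```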
2. Pretrained Multilingual Models (like mBERT, XLM-R)
Multilingual pretrained models can be used for training bilingual models in a scenario where parallel corpora are unavailable. These models are trained on large amounts of monolingual data from multiple languages and are capable of learning shared representations across languages.
- mBERT (Multilingual BERT) and XLM-R (Cross-lingual RoBERTa) are examples of models that can be fine-tuned for specific tasks, such as translation, with minimal additional training. They use transfer learning to leverage common linguistic structures across languages, which can be beneficial even when no parallel corpora are available.
- Fine-tuning a pretrained model on monolingual corpora from different languages enables it to learn cross-lingual representations. The model does not need explicit parallel sentences, because it relies on semantic similarity learned from large-scale monolingual data, as the sketch below illustrates.
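As a concrete illustration, here is a short sketch, assuming the Hugging Face `transformers` and `torch` packages, that extracts sentence representations from the publicly available `xlm-roberta-base` checkpoint. Even though the model never saw parallel data, semantically similar sentences in different languages tend to land close together in this space:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

sentences = ["The cat sits on the mat.", "Le chat est assis sur le tapis."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings (ignoring padding) to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```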
3. Cross-lingual Transfer Learning
In this method, knowledge from one language can be transferred to another by training a shared model. Here are a couple of approaches to achieve this:
- Zero-shot Translation: The model is trained on one language (source) and evaluated on another language (target) without directly seeing aligned training data. The zero-shot capability often works well with models like mBERT, where the multilingual embedding space can map sentences from different languages into a shared vector space.
- Multilingual Embeddings: These are vector representations shared across multiple languages. By projecting words or sentences from both languages into a common vector space, the model can align meaning across languages. This can be done by training embeddings with methods like FastText or word2vec on monolingual corpora from different languages and then aligning them in a shared space (see the alignment sketch below).
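A rough sketch of the alignment step, computing an orthogonal (Procrustes) mapping with NumPy from a small seed dictionary of word pairs; the vectors below are random toy data standing in for real FastText or word2vec embeddings:

```python
import numpy as np

def procrustes_alignment(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W - Y|| for matched
    rows of X (source-language vectors) and Y (target-language vectors)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy data: 1,000 seed word pairs with 300-dim embeddings. The "target" space
# is a rotated copy of the source space, so the mapping is exactly recoverable.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                       # source-language vectors
W_true = np.linalg.qr(rng.normal(size=(300, 300)))[0]  # random orthogonal matrix
Y = X @ W_true                                         # target-language vectors

W = procrustes_alignment(X, Y)
print(np.allclose(X @ W, Y))  # True: source vectors now live in the target space
```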
4. Multilingual Neural Machine Translation (NMT)
Multilingual NMT is a technique where one model is trained to handle translations between multiple languages, even without parallel corpora for all the language pairs. The model is trained on monolingual corpora in each language, and during training, it learns to translate between multiple languages by using a shared representation.
- Multilingual Encoders: In an encoder-decoder architecture, the encoder is shared across all languages and so is the decoder; the model learns to translate between languages by exploiting shared representations of syntax and semantics.
- Language Tokens: To differentiate between languages during training, language tokens are added to the input to tell the model which language to translate from and into (see the sketch below).
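A trivial illustration of the language-token idea, using the common `<2xx>` target-language convention; the exact token format varies by toolkit, so treat these names as placeholders:

```python
def add_language_token(sentence: str, target_lang: str) -> str:
    """Prepend a target-language token so a shared encoder-decoder knows
    which language it should produce."""
    return f"<2{target_lang}> {sentence}"

print(add_language_token("The weather is nice today.", "fr"))
# -> "<2fr> The weather is nice today."
print(add_language_token("Das Wetter ist heute schön.", "en"))
# -> "<2en> Das Wetter ist heute schön."
```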
5. Adversarial Training
Adversarial training pits the translation model against a discriminator in an unsupervised fashion. The approach borrows from adversarial networks (GANs) to improve translation quality by pushing the model to generate realistic translations.
- In adversarial training, the translation model is encouraged to fool a discriminator network into believing that a translated sentence built from non-parallel data is real. The discriminator is trained to distinguish real sentences from generated translations, while the generator (the translation model) tries to produce output that "tricks" the discriminator, as sketched below.
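A minimal PyTorch sketch of the two opposing objectives; the sentence representations here are random placeholders for whatever encoding the translation system actually produces:

```python
import torch
import torch.nn as nn

hidden_dim = 512
discriminator = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_repr, fake_repr):
    """Discriminator objective: label real sentences 1 and translations 0."""
    real_loss = bce(discriminator(real_repr), torch.ones(real_repr.size(0), 1))
    fake_loss = bce(discriminator(fake_repr.detach()), torch.zeros(fake_repr.size(0), 1))
    return real_loss + fake_loss

def generator_loss(fake_repr):
    """Translation-model objective: push its outputs toward the 'real' label
    so the discriminator can no longer tell them apart."""
    return bce(discriminator(fake_repr), torch.ones(fake_repr.size(0), 1))

# Toy usage with random vectors standing in for sentence representations.
real = torch.randn(16, hidden_dim)
fake = torch.randn(16, hidden_dim, requires_grad=True)
print(discriminator_loss(real, fake).item(), generator_loss(fake).item())
```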
6. Multilingual Word Representations and Cross-lingual Transfer
This approach focuses on learning word-level representations that are shared across languages. Tools like VecMap map word vectors between languages by aligning independently trained monolingual embeddings. Even without parallel corpora, these methods can build strong cross-lingual embeddings that support translation tasks.
- Translation Pivoting: This method uses a pivot language. For instance, if there is no direct parallel data between language A and language B, both can be connected through a third language C (the pivot) for which parallel corpora are available. Pivoting can significantly reduce the need for direct bilingual parallel corpora (see the sketch below).
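A sketch of pivot translation with two placeholder translation callables, one for each language pair for which parallel data (and hence a model) exists:

```python
def translate_via_pivot(sentences, a_to_c, c_to_b):
    """Translate from language A to language B through pivot language C.
    `a_to_c` and `c_to_b` are placeholder callables: list[str] -> list[str]."""
    pivot_text = a_to_c(sentences)   # A -> C (parallel data exists for A-C)
    return c_to_b(pivot_text)        # C -> B (parallel data exists for C-B)

# The same composition can also be used to generate synthetic A-B pairs
# for training a direct A -> B model later on.
```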
7. Self-Training
Self-training involves training a model and then using its own translations as pseudo-parallel corpora for another language pair. The steps (sketched in code after this list) usually include:
- Train the model on a language pair for which data is available.
- Use this model to translate a large corpus in one language.
- Use these translations as pseudo-parallel data to further train the model, which can improve translation quality.
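A sketch of one self-training round, assuming a hypothetical model whose `translate` method returns (hypothesis, confidence) pairs and whose `train_step` accepts source and target batches; filtering by confidence is what usually keeps the pseudo-parallel data useful:

```python
def self_training_round(model, monolingual_corpus, min_score=0.8, batch_size=64):
    """Generate pseudo-parallel pairs with the current model, keep only the
    confident ones, and fine-tune the model on them."""
    pseudo_pairs = []
    for i in range(0, len(monolingual_corpus), batch_size):
        batch = monolingual_corpus[i:i + batch_size]
        for src, (hyp, score) in zip(batch, model.translate(batch)):
            if score >= min_score:              # keep only confident translations
                pseudo_pairs.append((src, hyp))
    for src, tgt in pseudo_pairs:               # fine-tune on the model's own output
        model.train_step([src], [tgt])
    return pseudo_pairs
```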
Conclusion
Training bilingual models without parallel corpora is a challenging yet feasible task with the right techniques. While approaches like unsupervised machine translation, multilingual pretraining, and cross-lingual transfer learning have made significant progress, challenges remain, particularly when dealing with low-resource languages. Nevertheless, the continuous advancements in AI research are making bilingual training more accessible and efficient, even in the absence of parallel corpora.