
An overview of domain adaptation in neural machine translation

Neural machine translation (NMT) has achieved state-of-the-art performance on most language pairs. Arguably, the success of NMT is mainly attributed to large-scale, high-quality bilingual corpora. Although new corpora are becoming increasingly available, only those that belong to the same or similar domains are helpful for improving translation performance. NMT models are very sensitive to the domains they are trained on, because each domain has its own style, sentence structure and terminology. However, in-domain corpora are usually relatively scarce while out-of-domain corpora are abundant. Naturally, improving the translation performance of an NMT model with low-resource in-domain data and large-scale out-of-domain data is both challenging and promising. In this essay, we give an overview of domain adaptation in NMT.

Domain adaptation, which means adapting a general model to a specific domain, has attracted wide interest in recent years. Many approaches have been proposed: transfer learning, batch normalization, data selection, fine-tuning and so on. This essay focuses only on approaches that adapt a neural machine translation model to a specific domain. We collected, read and analyzed papers published in recent years, and give an overview here. We consider two settings, \emph{single-domain adaptation} and \emph{multi-domain adaptation}. \emph{Single-domain adaptation} means that there is only a single domain the NMT model needs to be adapted to; that is, the training corpus contains only in-domain data and out-of-domain data (also called general data). \emph{Multi-domain adaptation} means that the NMT model shall be adapted to multiple domains; that is, the training corpus has been (or can be) classified into multiple domains and the NMT model should perform well on each of them.

Single-domain adaptation mainly focuses on the following problem: given low-resource in-domain data and large-scale general data, how can we train an NMT model that adapts well to this specific domain? More strictly, we may also expect that the NMT model does not suffer performance degradation on general test sets. The approaches can be classified into the following categories:

A naive expectation is that a model trained on a mix of the in-domain and general data will perform well both on in-domain test sets and on general test sets. Hence, a simple method is to train the NMT model directly on the mix of the in-domain and general data, which we refer to as mix training. This method is very simple and effective when the in-domain data has a size comparable to the general data. However, when the amount of in-domain data is relatively too small, the predictions of the NMT model will be dominated by the general data and the translation performance on in-domain test sets will be extremely poor.
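A minimal sketch of mix training, assuming the two corpora live in hypothetical parallel text files; the NMT model is then trained on the shuffled pool:

import random

def load_parallel(src_path, tgt_path):
    # Read a parallel corpus as a list of (source, target) sentence pairs.
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return list(zip(fs.read().splitlines(), ft.read().splitlines()))

in_domain = load_parallel("in_domain.src", "in_domain.tgt")   # small in-domain corpus (hypothetical paths)
general   = load_parallel("general.src", "general.tgt")       # large general corpus (hypothetical paths)

mixed = in_domain + general
random.shuffle(mixed)   # one shuffled training pool containing both domains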

To alleviate the domination of the large-scale general data, some researchers propose a two-stage training method: first, train the NMT model on the large-scale general data until convergence; then, continue training the NMT model on the in-domain data. We call this method fine-tuning. A glaring shortcoming of this method is that the model can easily overfit the in-domain data; specifically, the translation performance of the NMT model on general test sets degrades severely. Some researchers have conducted experiments comparing mix training and fine-tuning on in-domain test sets. They show that mix training achieves better translation performance when the amount of in-domain data is not too small, and fine-tuning is better otherwise.
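A sketch of the two-stage fine-tuning recipe, assuming a PyTorch-style NMT model whose forward pass returns the training loss, and hypothetical data loaders yielding (source, target) batches:

import torch

def run_epochs(model, loader, optimizer, epochs):
    model.train()
    for _ in range(epochs):
        for src, tgt in loader:
            optimizer.zero_grad()
            loss = model(src, tgt)     # assumed to return a scalar training loss
            loss.backward()
            optimizer.step()

def fine_tune(model, general_loader, in_domain_loader):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    run_epochs(model, general_loader, optimizer, epochs=10)    # stage 1: general data until convergence
    run_epochs(model, in_domain_loader, optimizer, epochs=3)   # stage 2: continue on in-domain data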

From the two methods above, we know that mix training has better generalization ability while fine-tuning fits the in-domain data better. Is there a method that combines the advantages of mix training and fine-tuning? To this end, researchers propose to supply additional features to the NMT model. With these additional features, the NMT model can distinguish the in-domain data from the out-of-domain data automatically, so its translation predictions are not dominated by the out-of-domain data. Hence, we can feed the in-domain and out-of-domain data into the NMT model at the same time. Specifically, there are mainly two kinds of additional features: the artificial token and the domain-specific embedding.

The artificial token, for example "@IN-DOMAIN@" or "@OUT-DOMAIN@", is appended to the end of each input source sentence to indicate whether the input is an in-domain or an out-of-domain sentence pair. The artificial tokens are chosen carefully to avoid overlap with words present in the source vocabulary, as in the sketch below.
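A minimal sketch of the artificial-token feature; the tag strings follow the examples above, and the German sentence is only illustrative:

IN_TOKEN, OUT_TOKEN = "@IN-DOMAIN@", "@OUT-DOMAIN@"   # reserved tags that do not clash with real vocabulary words

def tag_source(sentence, in_domain):
    # Append the domain tag to the end of the source sentence.
    return sentence.rstrip() + " " + (IN_TOKEN if in_domain else OUT_TOKEN)

print(tag_source("der Patient erhielt eine hohe Dosis", in_domain=True))
# -> "der Patient erhielt eine hohe Dosis @IN-DOMAIN@"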

Each word in NMT is represented as a word embedding, which is a high-dimensional vector. The word embedding can easily be extended by an arbitrary number of cells designed to encode domain information. Under this feature framework, the sentence-level domain information is added on a word-by-word basis to all words in a sentence.
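A sketch of the domain-specific embedding, assuming a PyTorch encoder; module name, dimensions and the concatenation scheme are illustrative assumptions:

import torch
import torch.nn as nn

class DomainAwareEmbedding(nn.Module):
    def __init__(self, vocab_size, word_dim=512, num_domains=2, domain_dim=16):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.domain_emb = nn.Embedding(num_domains, domain_dim)

    def forward(self, token_ids, domain_id):
        # token_ids: (batch, seq_len); domain_id: (batch,)
        words = self.word_emb(token_ids)                    # (batch, seq_len, word_dim)
        domain = self.domain_emb(domain_id).unsqueeze(1)    # (batch, 1, domain_dim)
        domain = domain.expand(-1, token_ids.size(1), -1)   # repeat the domain cells for every word
        return torch.cat([words, domain], dim=-1)           # (batch, seq_len, word_dim + domain_dim)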

Experiments show that the domain-specific embedding achieves consistently better performance than the artificial token. Moreover, this method can easily be extended to the multi-domain adaptation problem.

Data selection can be classified into static data selection and dynamic data selection. Static data selection ranks the sentence pairs in a large training corpus according to their difference from an in-domain corpus and a general corpus, and the top-n sentence pairs with the highest rank are selected and used for training the NMT model. Hence, the selected top-n sentences are static and never change. Since static data selection discards the irrelevant data, it can exacerbate the problems of low vocabulary coverage and unreliable statistics for rare words, which are major issues in NMT. In addition, it has been shown that NMT performance drops tremendously in low-resource scenarios. To overcome this problem, researchers propose dynamic data selection. Contrary to static data selection, dynamic data selection samples n sentence pairs for each iteration using a distribution computed from the ranks. In practice, the top-ranked sentence pairs are selected in nearly every epoch while bottom-ranked sentence pairs are selected nearly only once.
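A minimal sketch contrasting the two selection schemes. The scores are assumed to already measure how close each sentence pair is to the in-domain corpus (higher means closer), and the rank-to-probability mapping is an illustrative choice, not the one from any particular paper:

import numpy as np

def static_selection(pairs, scores, n):
    order = np.argsort(scores)[::-1]          # rank by in-domain relevance, best first
    return [pairs[i] for i in order[:n]]      # the same top-n pairs for every epoch

def dynamic_selection(pairs, scores, n, rng=np.random.default_rng(0)):
    ranks = np.argsort(np.argsort(-np.asarray(scores)))   # 0 = most relevant pair
    probs = 1.0 / (ranks + 1)                             # top-ranked pairs get high sampling probability
    probs /= probs.sum()
    chosen = rng.choice(len(pairs), size=n, replace=False, p=probs)
    return [pairs[i] for i in chosen]         # a different, relevance-biased sample each epoch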

There are several ways to compute the difference between a sentence and an in-domain corpus versus a general corpus. The traditional method uses language models: two language models are trained on the in-domain corpus and the general corpus respectively, and the cross-entropy of each sentence under the two models is computed. The cross-entropy difference can be viewed as the difference of this sentence from the in-domain corpus relative to the general corpus. Researchers have also proposed a new method which uses sentence embeddings to compute the difference, where the sentence embedding is usually computed as the average of the hidden states of the encoder.
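A sketch of the two scoring measures described above. The language-model objects and their score method (returning per-word cross-entropy) are hypothetical placeholders, as is the in-domain centroid used for the embedding distance:

import numpy as np

def relevance(sentence, in_domain_lm, general_lm):
    # Cross-entropy difference: lower values mean the sentence looks more in-domain.
    return in_domain_lm.score(sentence) - general_lm.score(sentence)

def sentence_embedding(hidden_states):
    # hidden_states: (seq_len, hidden_dim) from the NMT encoder; average over time steps.
    return np.mean(hidden_states, axis=0)

def embedding_distance(sent_vec, in_domain_centroid):
    # Distance of a sentence embedding to the centroid of in-domain sentence embeddings.
    return np.linalg.norm(sent_vec - in_domain_centroid)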

In the multi-domain setting, we consider the following problem: given k fully trained domain experts, how can we train the (k+1)-th domain model rapidly? Typically, we can address this problem in a multi-task framework, where each domain is regarded as an individual task. A possible downside of this approach is that we need to re-train the model each time a new domain arrives. To overcome this shortcoming, researchers propose a solution based on attending over an ensemble of domain experts. Assuming k domain-specific intent and slot models trained on their respective domains, given domain k+1, the model uses a weighted combination of the k domain experts' feedback along with its own opinion to make predictions on the new domain. Experiments show that this model significantly outperforms baselines that do not use domain adaptation and also performs better than the full re-training approach.
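A minimal sketch of attending over an ensemble of domain experts, assuming each frozen expert exposes a fixed-size feature vector for the input; the module structure and the dot-product attention are illustrative assumptions, not the exact architecture of the cited work:

import torch
import torch.nn as nn

class DomainExpertAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, own_repr, expert_reprs):
        # own_repr: (batch, hidden_dim) from the new (k+1)-th domain model
        # expert_reprs: (batch, k, hidden_dim) from the k frozen domain experts
        q = self.query(own_repr).unsqueeze(1)                         # (batch, 1, hidden_dim)
        weights = torch.softmax((q * expert_reprs).sum(-1), dim=-1)   # attention weights over the k experts
        expert_feedback = (weights.unsqueeze(-1) * expert_reprs).sum(1)
        return own_repr + expert_feedback                             # combine own opinion with experts' feedback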

This informal essay summarizes recent approaches to domain adaptation in neural machine translation. We hope this essay is helpful to researchers who are interested in this area.

We list some important reference papers here.

[1] Chenhui Chu, Raj Dabre, and Sadao Kurohashi. An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation. ACL 2017.

[2] Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. Sentence Embedding for Neural Machine Translation Domain Adaptation. ACL 2017.

[3] Markus Freitag and Yaser Al-Onaizan. Fast Domain Adaptation for Neural Machine Translation. 2016.

[4] Christophe Servan, Josep Crego, and Jean Senellart. Domain Specialization: a Post-training Domain Adaptation for Neural Machine Translation. 2016.

[5] Catherine Kobus, Josep Crego, and Jean Senellart. Domain Control for Neural Machine Translation. 2016.

[6] Young-Bum Kim, Karl Stratos, and Dongchan Kim. Domain Attention with an Ensemble of Experts. ACL 2017.