Notes on Running Moses --- Building the Moses Language Model and Translation Model (Part 3)
2011-02-25 11:53
This time I want to translate English into Chinese, so the language model is built on the Chinese side. Go to the directory /home/tianliang/mosesdecoder/srilm/bin/i686-gcc4; we will use ngram-count to build a 5-gram model. The process looks like this:
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4$ mkdir test
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4$ cd test
tianliang@ubuntu:~/mosesdecoder/srilm/bin/i686-gcc4/test$ ../ngram-count -text clean.chn -lm chinese.gz -order 5 -unk -wbdiscount -interpolate
This builds the Chinese file clean.chn into a 5-gram language model chinese.gz, smoothed with Witten-Bell discounting and interpolated estimates. The header of chinese.gz lists the number of distinct n-grams of each order; for example, there are 594 distinct 3-grams in our corpus.
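Since I cannot reproduce the screenshot here, this is a sketch of what the \data\ header of an ARPA-format model such as chinese.gz looks like. Apart from the 594 3-grams mentioned above, the counts are invented placeholders, and I write the header into a local file instead of reading the real chinese.gz:

```shell
# Sketch of the \data\ header at the top of an ARPA-format LM.
# All counts except "ngram 3=594" are invented for illustration.
cat > arpa_header.txt <<'EOF'
\data\
ngram 1=312
ngram 2=501
ngram 3=594
ngram 4=587
ngram 5=560
EOF
# Pull out the 3-gram count; on the real file you would pipe from
# "zcat chinese.gz" instead of reading arpa_header.txt.
grep '^ngram 3=' arpa_header.txt | cut -d= -f2
```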
Moses' toolkit does a great job of wrapping the calls to mkcls and GIZA++ inside a training script and outputting the phrase and reordering tables needed for decoding. The script that does this is called train-factored-phrase-model.perl; in my installation it is located at /home/tianliang/mosesdecoder/moses/scripts/target/scripts-20100126-0922/training. Training takes place in nine steps, all of them executed by the script. The nine steps are:
- Prepare data
- Run GIZA++
- Align words
- Get lexical translation table
- Extract phrases
- Score phrases
- Build lexicalized reordering model
- Build generation models
- Create configuration file
If you want the details of the nine steps, refer to pages 66-78 of the User Manual. Now we build the translation model using the following command:
$ train-factored-phrase-model.perl --scripts-root-dir <scripts directory> --corpus <corpus directory/file prefix> --f <source-language extension> --e <target-language extension>
There should be two files in the corpus directory, the sentence-aligned halves of the parallel corpus. In my corpus, the English half is called clean.eng and the Chinese half clean.chn; clean.chn should contain only Chinese and clean.eng only English. The command I use is shown below.
First we should export the path to the scripts like this:
tianliang@ubuntu:~/mosesdecoder/moses/scripts/target/scripts-20100126-0922/training$ export
SCRIPTS_ROOTDIR=/home/tianliang/mosesdecoder/moses/scripts/target/scripts-20100126-0922
We can then refer to $SCRIPTS_ROOTDIR in the next command instead of typing the full path.
tianliang@ubuntu:~/mosesdecoder/moses/scripts/target/scripts-20100126-0922/training$ ./train-factored-phrase-model.perl --scripts-root-dir $SCRIPTS_ROOTDIR --corpus /home/tianliang/mosesdecoder/corpus/clean --f eng --e chn --alignment grow-diag-final-and --reordering msd-bidirectional-fe --lm 0:5:/home/tianliang/mosesdecoder/corpus/chinese.gz:0
Here I pass only some of the available parameters to train the model. "--alignment grow-diag-final-and" selects the heuristic used to symmetrize the two directional GIZA++ alignments, and "--reordering msd-bidirectional-fe" asks for a lexicalized reordering table to be built in addition to the phrase table. You can add the other parameters if you need them. The last step prints something like:
Since I did not use the --generation parameter, the script prints a warning that no generation model was requested, like below:
Now would be a good time to look at what we've done. Enter the training folder; you can see that four new folders have been made.
In the folder "corpus": two vocabulary files are generated and the parallel corpus is converted into a numberized format. The vocabulary files contain words, integer word identifiers, and word counts:
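As a sketch of the GIZA++ vocabulary (.vcb) format, each line holds an integer id, the word itself, and its corpus count; the low ids are reserved, so regular words usually start at 2. The words and counts below are invented, not taken from my corpus:

```shell
# Sketch of a GIZA++ .vcb vocabulary file: "id word count" per line.
# Entries are invented examples.
cat > eng.vcb.sample <<'EOF'
2 the 57
3 i 34
4 you 29
EOF
# List each word together with its count:
awk '{print $2, $3}' eng.vcb.sample
```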
The sentence-aligned corpus now looks like this:
These two files show the alignment between the Chinese and the English. A sentence pair now consists of three lines: first the frequency of this sentence, which in our training is always 1 (this number can be used to weight different parts of the training corpus differently); the two lines below it contain the word ids of the Chinese and the English sentence.
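A sketch of one such three-line record in the numberized (.snt) corpus, with invented word ids:

```shell
# Sketch of the numberized sentence-pair format: three lines per pair --
# the pair frequency (always 1 in our training), then the word ids of one
# sentence, then the word ids of the other. The ids are invented.
cat > chn-eng.snt.sample <<'EOF'
1
2 5 9
2 7 7 11
EOF
# The first line of each triple is the sentence-pair frequency/weight:
head -1 chn-eng.snt.sample
```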
GIZA++ also requires words to be placed into word classes. This is done automatically by calling the mkcls program. Word classes are only used for the IBM reordering model in GIZA++. A peek into the English word class file:
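The screenshot of the class file is not reproduced here; as a sketch, each line of the mkcls output pairs a word with an integer class id, and distributionally similar words tend to share a class. The words and class ids below are invented examples:

```shell
# Sketch of an mkcls word-class file (e.g. eng.vcb.classes):
# "word class-id" per line. All entries are invented for illustration.
cat > eng.vcb.classes.sample <<'EOF'
monday 17
tuesday 17
china 23
england 23
EOF
# List the distinct class ids appearing in the file:
awk '{print $2}' eng.vcb.classes.sample | sort -u
```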
In the folder giza.chn-eng:
This folder is generated by GIZA++ and holds the alignment of the Chinese and English. The most important file is called chn-eng.A3.final. It looks like:
In this file, after some statistical information and the Chinese sentence, the English sentence is listed word by word, with references to the aligned Chinese words: the first word, I ({ 1 }), is aligned to the first Chinese word 我; the second word, don ({ }) ' ({ }) t ({ 2 }), is aligned to the Chinese word 不; and so on.
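This bracket notation can be pulled apart mechanically. The record below is a simplified stand-in for a real A3.final entry (the statistics line and scores are made up, and only the two words from the example above are shown); each "({ ... })" lists the 1-based positions of the source words the target word aligns to, and an empty "({ })" marks an unaligned word:

```shell
# Simplified sketch of one record from chn-eng.A3.final.
# The header line and alignment score are invented.
cat > a3.sample <<'EOF'
# Sentence pair (1) source length 2 target length 4 alignment score : 0.01
我 不
NULL ({ }) I ({ 1 }) don ({ }) ' ({ }) t ({ 2 })
EOF
# Extract each target word together with its aligned source positions:
grep -o '[^ ]* ({ [0-9 ]*})' a3.sample
```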
In the folder giza.eng-chn:
This is the inverse of the previous folder, so I won't show its contents here; I am sure you can guess them.
Note:
What I want to mention is that Moses takes its word alignments from the intersection of the two directional GIZA++ runs, plus some additional alignment points from their union. That is why we get the two folders as an inverse pair, as shown above.
In the folder model:
This folder holds the actual model the decoder will use. In this directory we get the phrase table and moses.ini. We can copy the executable moses (in ~/moses/moses-cmd/src) into this folder. With these tools we can get the final result!
The final part will present Moses' translation output and the evaluation process; see Part 4, "Notes on Running Moses --- Results and Evaluation".