XGBoost: Binary Classification
2016-01-17 20:43
This article introduces the command-line usage of XGBoost. For the Python and R interfaces, see https://github.com/dmlc/xgboost/blob/master/doc/README.md.
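Once the model has been trained, it can be used from the command line to make predictions on the test data.
For binary classification, each prediction is a probability in [0,1]: the probability that the instance belongs to the positive class.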
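Because the predictions are probabilities rather than class labels, a threshold is needed to turn them into 0/1 decisions. The sketch below is only an illustration: the 0.5 threshold, the helper names `to_labels` and `error_rate`, and the sample numbers are all assumptions, not part of XGBoost.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each positive-class probability to a hard 0/1 label."""
    return [1 if p > threshold else 0 for p in probabilities]


def error_rate(labels, predicted):
    """Fraction of instances whose predicted label differs from the true one."""
    wrong = sum(1 for y, y_hat in zip(labels, predicted) if y != y_hat)
    return wrong / len(labels)


probs = [0.92, 0.08, 0.61, 0.30]   # made-up prediction scores
truth = [1, 0, 0, 0]               # made-up true labels
predicted = to_labels(probs)
print(predicted, error_rate(truth, predicted))  # prints [1, 0, 1, 0] 0.25
```

A different threshold may suit problems where the two kinds of mistake have different costs.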
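Dumping a model as text is still a basic feature, and only tree models are supported so far. XGBoost can display tree models in text form.
0003.model is dumped to dump.raw.txt and dump.nice.txt. The content of dump.nice.txt is easier to read because it uses the feature map file featmap.txt.
Each line of featmap.txt holds a feature id, a feature name, and a feature type:
- Feature ids start at 0 and run up to the number of features, in ascending order.
- i marks a binary indicator feature.
- q marks a quantitative feature, such as age or time; its value may be missing.
- int marks an integer feature (when int is hinted, the decision boundary will be an integer).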
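The featmap.txt rules above can be checked with a few lines of Python. This is a minimal sketch that assumes the tab-separated layout written by mapfeat.py (`id<TAB>name<TAB>type`); the helper names and the sample feature names are made up for illustration.

```python
def parse_featmap_line(line):
    """Split one featmap line into (feature_id, feature_name, feature_type)."""
    fid, name, ftype = line.rstrip("\n").split("\t")
    if ftype not in ("i", "q", "int"):
        raise ValueError("unknown feature type: " + ftype)
    return int(fid), name, ftype


def check_ids(entries):
    """Check that feature ids start at 0 and increase by one per line."""
    return [fid for fid, _, _ in entries] == list(range(len(entries)))


# Made-up sample lines in the assumed tab-separated layout
lines = ["0\tcap-shape=bell\ti\n", "1\tcap-shape=conical\ti\n", "2\tage\tq\n"]
entries = [parse_featmap_line(l) for l in lines]
print(entries[0], check_ids(entries))  # prints (0, 'cap-shape=bell', 'i') True
```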
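While the program runs, it prints progress information to the screen.
The model evaluation messages are written to the standard error stream, stderr. To keep a record of them, redirect stderr to a file such as log.txt, which will then contain the evaluation results.
Statistics for both the training data and the test data can be monitored at the same time by adding the corresponding entries to the configuration file.
The rule is eval[name-printed-in-log] = filename: the file filename is added to the monitoring process, and the model is evaluated on it at every iteration.
XGBoost also supports monitoring several metrics at once. For example, to monitor the average log-likelihood of each prediction during training, you only need to add one more setting to the configuration file.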
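To make the "average log-likelihood of each prediction" concrete, here is a small sketch of how such a statistic can be computed from labels and predicted probabilities. The function name and the clipping constant `eps` are assumptions for illustration; this is not XGBoost's own implementation. The metric commonly called log loss is the negative of this quantity.

```python
import math


def average_log_likelihood(labels, probabilities, eps=1e-15):
    """Mean of y*log(p) + (1-y)*log(1-p) over all instances."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = [1, 0, 1]          # made-up true labels
probs = [0.9, 0.2, 0.5]     # made-up predicted probabilities
print(average_log_likelihood(labels, probs))
```

Values closer to zero mean the predicted probabilities agree better with the labels.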
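To save a model every two rounds during training, set save_period=2. The model 0002.model will then appear in the current folder. To change the folder the models are written to, use the parameter model_dir=foldername. By default, XGBoost saves the model from the last iteration.
To continue training from an existing model, for example 0002.model, pass that model on the command line.
XGBoost loads 0002.model, runs two more boosting rounds, and saves the resulting model to continue.model. Note that the training and evaluation data defined in mushroom.conf must not change.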
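When working with a large dataset, you may want to compute in parallel. If your compiler supports OpenMP, XGBoost supports multi-threading natively, and this is controlled through a configuration parameter.
By default, XGBoost generates binary cache files for its input data (such as `agaricus.txt.test.buffer`).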
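Introduction
The following shows how to use XGBoost to solve a binary classification problem. The dataset used is the mushroom dataset.
Generating the input data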
XGBoost takes input data in the same format as LibSVM. Here is an example of the input format:

```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...
```
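Each line represents one instance. The first column is the class label of the instance; '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In binary classification, '1' denotes the positive class and '0' the negative class. Labels may also be probabilities in the range [0,1], indicating how likely the instance is to belong to the positive class.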
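As a quick check of the format described above, the following sketch splits one line into its label and its index:value pairs. The function `parse_libsvm_line` is a hypothetical helper written for this illustration; XGBoost reads the file itself and does not need it.

```python
def parse_libsvm_line(line):
    """Split one LibSVM-style line into (label, {feature_index: value})."""
    fields = line.split()
    label = float(fields[0])  # 1/0 for class labels, or a probability in [0, 1]
    features = {}
    for item in fields[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label, features)  # prints 1.0 {101: 1.2, 102: 0.03}
```

Features that do not appear on a line are simply absent from the dictionary, which is what makes the format convenient for sparse data.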
The first step is to convert the dataset into LibSVM format by running the following scripts:

```bash
python mapfeat.py
python mknfold.py agaricus.txt 1
```
mapfeat.py and mknfold.py are shown below, in that order:

```python
#!/usr/bin/python

def loadfmap(fname):
    """Read the feature description file.

    Returns fmap (attribute index -> {raw value -> feature id}) and
    nmap (feature id -> feature name).
    """
    fmap = {}
    nmap = {}
    with open(fname) as fi:
        for l in fi:
            arr = l.split()
            if arr[0].find('.') != -1:
                # A leading index such as "1." starts a new attribute
                idx = int(arr[0].strip('.'))
                assert idx not in fmap
                fmap[idx] = {}
                ftype = arr[1].strip(':')
                content = arr[2]
            else:
                # Otherwise the line continues the current attribute
                content = arr[0]
            for it in content.split(','):
                if it.strip() == '':
                    continue
                k, v = it.split('=')
                fmap[idx][v] = len(nmap)
                nmap[len(nmap)] = ftype + '=' + k
    return fmap, nmap

def write_nmap(fo, nmap):
    # One feature per line: id, name, and type "i" (binary indicator)
    for i in range(len(nmap)):
        fo.write('%d\t%s\ti\n' % (i, nmap[i]))

# start here
fmap, nmap = loadfmap('agaricus-lepiota.fmap')
with open('featmap.txt', 'w') as fo:
    write_nmap(fo, nmap)

with open('agaricus.txt', 'w') as fo, open('agaricus-lepiota.data') as fi:
    for l in fi:
        arr = l.split(',')
        # Label: 'p' (poisonous) is the positive class, 'e' (edible) the negative
        if arr[0] == 'p':
            fo.write('1')
        else:
            assert arr[0] == 'e'
            fo.write('0')
        # Each categorical value becomes an indicator feature with value 1
        for i in range(1, len(arr)):
            fo.write(' %d:1' % fmap[i][arr[i].strip()])
        fo.write('\n')
```
```python
#!/usr/bin/python
import sys
import random

# Both <filename> and <k> are required
if len(sys.argv) < 3:
    print('Usage:<filename> <k> [nfold = 5]')
    sys.exit(0)

random.seed(10)

k = int(sys.argv[2])
if len(sys.argv) > 3:
    nfold = int(sys.argv[3])
else:
    nfold = 5

with open(sys.argv[1], 'r') as fi, \
        open(sys.argv[1] + '.train', 'w') as ftr, \
        open(sys.argv[1] + '.test', 'w') as fte:
    for l in fi:
        # Each line goes to the test set with probability 1/nfold
        if random.randint(1, nfold) == k:
            fte.write(l)
        else:
            ftr.write(l)
```
Running the two Python scripts above produces the training set 'agaricus.txt.train' and the test set 'agaricus.txt.test'.
Training
Run the following command to train the model:

```bash
xgboost mushroom.conf
```

The mushroom.conf file holds the settings needed to train and test the model. Each line has the form [attribute]=[value]:
```conf
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of round to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"
```
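Here the booster is gbtree and the objective function is logistic regression. This means the computation uses classic gradient boosted regression trees (GBRT), an approach that handles binary classification well.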
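Gradient boosting fits each new tree using derivatives of the loss with respect to the current raw score. For the logistic loss, with p the sigmoid of the raw score and y the 0/1 label, the first derivative is p - y and the second is p(1 - p). The sketch below only illustrates this arithmetic; the function names are made up and it is not taken from the XGBoost source.

```python
import math


def sigmoid(margin):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def logistic_grad_hess(margin, label):
    """First and second derivative of the logistic loss w.r.t. the raw score."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


grad, hess = logistic_grad_hess(0.0, 1)
print(grad, hess)  # prints -0.5 0.25
```

The second derivative is the "hessian" that the comment on min_child_weight in the configuration file refers to.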
The configuration file above lists the most commonly used parameters. For the full list, see https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. If you would rather not set algorithm parameters in the configuration file, you can pass them on the command line, as follows:

```bash
xgboost mushroom.conf max_depth=6
```
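This sets max_depth to 6 instead of the 3 given in the configuration file. When passing parameters on the command line, make sure max_depth=6 is a single argument, that is, it contains no spaces. If a parameter appears both in the configuration file and on the command line, the command-line value overrides the one in the configuration file.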
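The precedence rule can be pictured as a dictionary update: read the file first, then let each name=value argument replace the file's value. The sketch below is an illustration only; `parse_config` and `apply_overrides` are hypothetical helpers, not XGBoost's actual parser, and the configuration text is a shortened made-up sample.

```python
def parse_config(text):
    """Parse lines of the form attribute = value, skipping comments and blanks."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip().strip('"')
    return params


def apply_overrides(params, args):
    """Return a copy of params where each name=value argument replaces the file value."""
    merged = dict(params)
    for arg in args:
        key, value = arg.split("=", 1)
        merged[key] = value
    return merged


conf_text = """
# maximum depth of a tree
max_depth = 3
data = "agaricus.txt.train"
"""
params = apply_overrides(parse_config(conf_text), ["max_depth=6"])
print(params["max_depth"])  # prints 6
```

Splitting each argument on the first "=" is also why an argument such as max_depth=6 must not contain spaces: the shell would otherwise pass it as several separate arguments.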
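The example above uses the tree booster for gradient boosting. To use the linear booster for regression instead, change the booster parameter to gblinear; the other parameters in the configuration file do not need to change. The configuration looks like this: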
<code class="language-conf hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"># General Parameters
# choose the linear booster
booster = gblinear
...
# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01
# Regression Parameters
...</code>
XGBoost checks whether the binary buffer file agaricus.txt.test.buffer exists and automatically loads from it if possible. This can speed up training when you train many times. You can disable it by setting use_buffer=0.
- A buffer file can also be used as standalone input: if the buffer file exists but the original agaricus.txt.test was removed, XGBoost will still run.
Deviation from the LibSVM input format: XGBoost is compatible with the LibSVM format, with the following minor differences:
- XGBoost allows feature indices to start from 0.
- For binary classification, the label is 1 for positive and 0 for negative, instead of +1 and -1.
- The feature indices in each line do not need to be sorted.
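As a rough illustration of the LibSVM-style input format, the following Python sketch parses one line into a label and a sparse feature dictionary. The helper name `parse_libsvm_line` is hypothetical and not part of XGBoost; it only shows that feature indices may start from 0 and need not be sorted.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value}).

    Hypothetical helper for illustration only; not part of XGBoost.
    """
    fields = line.split()
    # 1 = positive, 0 = negative, or a probability label in [0, 1]
    label = float(fields[0])
    features = {}
    for item in fields[1:]:
        index, value = item.split(":")
        # Indices may start at 0 and appear in any order.
        features[int(index)] = float(value)
    return label, features


# Made-up sample line using the index:value layout.
label, features = parse_libsvm_line("1 101:1.2 0:0.03")
print(label, features)
```

A dictionary is used here because the format is sparse: only the features listed on a line carry a value.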
Prediction
After the model is trained, you can make predictions on the test data by running:
<code class="hljs bash has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf task=pred model_in=0003.model</code>
For binary classification, each predicted value is a probability in [0,1] that the sample belongs to the positive class.
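If hard class labels are needed, the predicted probabilities can be thresholded. The sketch below is a minimal illustration: the function name `to_labels` and the 0.5 cutoff are assumptions, and the sample values are made up rather than real XGBoost output.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted positive-class probabilities to 0/1 labels.

    The 0.5 threshold is an assumed default; pick one that suits the task.
    """
    return [1 if p > threshold else 0 for p in probabilities]


# Illustrative values only, not real XGBoost output.
predictions = [0.93, 0.08, 0.51]
print(to_labels(predictions))
```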
Dumping the Model
This is still a preliminary feature, and only tree models can be dumped. XGBoost can display the tree model in text form by running:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt</code>
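0003.model will be dumped to dump.raw.txt and dump.nice.txt. The result in dump.nice.txt is easier to understand because it uses the feature map file featmap.txt.
The format of featmap.txt is
featmap.txt: <featureid> <featurename> <q or i or int>\n
Feature ids must run from 0 to the number of features, in ascending order.
i means the feature is a binary indicator feature.
q means the feature is a quantitative value, such as age or time; the value can be missing.
int means the feature is an integer value (when int is hinted, the decision boundary will be an integer).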
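The feature map format above can be read with a few lines of Python. This is only a sketch: `parse_featmap` is a hypothetical helper, it assumes the three fields are whitespace-separated and that feature names contain no spaces, and the sample entries are made up rather than taken from the real featmap.txt.

```python
def parse_featmap(lines):
    """Parse '<featureid> <featurename> <type>' lines into {id: (name, type)}.

    Hypothetical helper; assumes whitespace-separated fields and
    feature names without spaces.
    """
    featmap = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {line!r}")
        feature_id, name, feature_type = int(fields[0]), fields[1], fields[2]
        if feature_type not in ("q", "i", "int"):
            raise ValueError(f"unknown feature type: {feature_type}")
        featmap[feature_id] = (name, feature_type)
    return featmap


# Made-up entries for illustration only.
example = ["0 cap-shape=bell i", "1 age q", "2 count int"]
print(parse_featmap(example))
```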
Monitoring Progress
When you run the program, it prints messages like the following:
<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0]  test-error:0.016139
boosting round 1, 0 sec elapsed

tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1]  test-error:0.000000</code>
The model evaluation messages are written to the standard error stream (stderr). To record them, redirect stderr to a file:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf 2>log.txt</code>
The file log.txt will then contain:
<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0] test-error:0.016139
[1] test-error:0.000000</code>
You can also monitor statistics on both the training data and the test data at the same time, with the following configuration:
<code class="language-conf hljs bash has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"</code>
Running with this configuration produces the following output:
<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0] test-error:0.016139 trainname-error:0.014433
[1] test-error:0.000000 trainname-error:0.001228</code>
The rule is eval[name-printed-in-log] = filename: the file is added to the monitoring process, and the model is evaluated on it in every round.
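Evaluation lines of the form `[round] name:value ...` are easy to post-process, for example to plot the error per round. The following Python sketch parses such lines from a saved log; `parse_eval_line` is a hypothetical helper and the regular expression is an assumption based only on the sample output shown in this article.

```python
import re

# Matches lines such as "[0] test-error:0.016139 trainname-error:0.014433".
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def parse_eval_line(line):
    """Parse '[round] name:value ...' into (round, {name: value}).

    Returns None for lines that are not evaluation lines.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    metrics = {}
    for item in match.group(2).split():
        name, value = item.rsplit(":", 1)
        metrics[name] = float(value)
    return int(match.group(1)), metrics


log = """[0] test-error:0.016139 trainname-error:0.014433
[1] test-error:0.000000 trainname-error:0.001228"""
for line in log.splitlines():
    print(parse_eval_line(line))
```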
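XGBoost also supports monitoring several metrics at once. For example, to also monitor the average log-likelihood of the predictions in each round, add eval_metric=logloss to the configuration file. When you run again, the log file will contain: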
<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023
[1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457</code>
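For reference, the average negative log-likelihood of binary predictions can be written as $-\frac{1}{N}\sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$. The Python sketch below computes it directly; the function name `logloss` and the clipping constant `eps` are choices made for this illustration, and XGBoost's internal implementation may differ in details.

```python
import math


def logloss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood for binary labels in {0, 1}.

    eps is an assumed clipping constant that avoids log(0).
    """
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Illustrative values only, not taken from the mushroom dataset.
print(logloss([1, 0, 1], [0.9, 0.2, 0.7]))
```

Lower values are better: confident and correct predictions contribute almost nothing, while confident mistakes are penalized heavily.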
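Saving Progress Models
To save a model every two rounds, set save_period=2. You will then see the model 0002.model in the current folder. To change the model output directory, use the parameter dir=foldername. By default, XGBoost saves the model from the last round.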
Continuing from an Existing Model
To continue training from an existing model, for example from 0002.model, use the following command:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model</code>
XGBoost will load 0002.model, run two more boosting rounds, and save the resulting model to continue.model. Note that the training and evaluation data specified in mushroom.conf must not change.
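Using Multiple Threads
When working with large datasets, you may need parallel computation. If your compiler supports OpenMP, XGBoost supports multi-threading natively; for example, setting nthread=10 uses 10 threads.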
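Additional Notes
What are agaricus.txt.test.buffer and agaricus.txt.train.buffer?
By default, XGBoost generates binary cache files with the suffix buffer. The next time XGBoost runs, it loads the cache files instead of the original text files.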