XGBoost: Binary Classification

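2016-01-17 20:43
This article covers the command-line usage of XGBoost. For the Python and R interfaces, see https://github.com/dmlc/xgboost/blob/master/doc/README.md

The walkthrough below shows how to solve a binary classification problem with XGBoost, using the mushroom dataset.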


Introduction


Generating Input Data

XGBoost takes input data in the same format as LibSVM. An example of the input format:
<code class="hljs r has-numbering">1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...</code>


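Each line represents one instance. The number in the first column is the class label, which says which class the instance belongs to. '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In binary classification, '1' denotes the positive class and '0' the negative class. Probability labels are also supported: a label may take any value in [0,1], indicating how likely the instance is to belong to the positive class.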
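To make the line format concrete, here is a minimal Python sketch that parses one line of the format described above into a label and a sparse feature dictionary. The function name `parse_libsvm_line` is hypothetical and chosen for illustration; this is not how XGBoost itself reads the file.

```python
def parse_libsvm_line(line):
    """Parse one 'label idx:value idx:value ...' line into (label, features)."""
    tokens = line.split()
    # Use float so that probability labels in [0,1] are accepted as well.
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)  # sparse: only listed indices are stored
    return label, features


label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label, features)
```

Features that do not appear on a line are simply absent from the dictionary, which is why the format suits sparse data.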

The first step is to convert the dataset into LibSVM format by running the following scripts:
<code class="hljs avrasm has-numbering">python mapfeat.py
python mknfold.py agaricus.txt 1</code>


mapfeat.py and mknfold.py are listed below, in that order.
<code class="hljs python has-numbering">#!/usr/bin/python
# Convert the raw mushroom data to LibSVM format and write a feature map.

def loadfmap( fname ):
    # fmap: attribute index -> {category code -> feature id}
    # nmap: feature id -> readable name 'attribute=category'
    fmap = {}
    nmap = {}
    for l in open( fname ):
        arr = l.split()
        if arr[0].find('.') != -1:
            # a first token such as '1.' starts a new attribute
            idx = int( arr[0].strip('.') )
            assert idx not in fmap
            fmap[ idx ] = {}
            ftype = arr[1].strip(':')
            content = arr[2]
        else:
            # continuation line: more categories for the current attribute
            content = arr[0]
        for it in content.split(','):
            if it.strip() == '':
                continue
            k , v = it.split('=')
            fmap[ idx ][ v ] = len(nmap)
            nmap[ len(nmap) ] = ftype+'='+k
    return fmap, nmap

def write_nmap( fo, nmap ):
    # one line per feature: id, name, and type 'i' (indicator)
    for i in range( len(nmap) ):
        fo.write('%d\t%s\ti\n' % (i, nmap[i]) )

# start here
fmap, nmap = loadfmap( 'agaricus-lepiota.fmap' )
fo = open( 'featmap.txt', 'w' )
write_nmap( fo, nmap )
fo.close()

fo = open( 'agaricus.txt', 'w' )
for l in open( 'agaricus-lepiota.data' ):
    arr = l.split(',')
    # the first column is the class: 'p' (poisonous) -> 1, 'e' (edible) -> 0
    if arr[0] == 'p':
        fo.write('1')
    else:
        assert arr[0] == 'e'
        fo.write('0')
    # every categorical value becomes a one-hot feature 'id:1'
    for i in range( 1,len(arr) ):
        fo.write( ' %d:1' % fmap[i][arr[i].strip()] )
    fo.write('\n')
fo.close()</code>
<code class="hljs livecodeserver has-numbering">#!/usr/bin/python
# Split a data file into <filename>.train and <filename>.test.
import sys
import random

# both <filename> and <k> are required, so at least 3 entries in sys.argv
if len(sys.argv) < 3:
    print ('Usage:<filename> <k> [nfold = 5]')
    sys.exit(1)

random.seed( 10 )

k = int( sys.argv[2] )
if len(sys.argv) > 3:
    nfold = int( sys.argv[3] )
else:
    nfold = 5

fi = open( sys.argv[1], 'r' )
ftr = open( sys.argv[1]+'.train', 'w' )
fte = open( sys.argv[1]+'.test', 'w' )
for l in fi:
    # each line draws a random fold in 1..nfold; fold k goes to the test set
    if random.randint( 1 , nfold ) == k:
        fte.write( l )
    else:
        ftr.write( l )

fi.close()
ftr.close()
fte.close()</code>


Running the two Python scripts above produces the training set 'agaricus.txt.train' and the test set 'agaricus.txt.test'.


Training

Run the following command to train the model:
<code class="hljs avrasm has-numbering">xgboost mushroom.conf</code>


The mushroom.conf file holds the settings needed for training and testing the model. Each line has the form [attribute]=[value]:
<code class="language-conf hljs vala has-numbering"># General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of round to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"</code>


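Here the booster is gbtree and the objective function is logistic regression. This means the model is trained with classic gradient boosted regression trees (GBRT), an approach that handles binary classification well.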
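As a rough illustration of what a logistic objective means for gradient boosting, the sketch below computes the first and second derivatives of the logistic loss with respect to the raw score. These are the standard textbook formulas; the function names are hypothetical and this is not XGBoost's source code.

```python
import math


def sigmoid(margin):
    """Map a raw additive score (margin) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def logistic_grad_hess(margin, label):
    """Gradient and hessian of the logistic loss with respect to the margin."""
    p = sigmoid(margin)
    grad = p - label      # first derivative of -[y*log(p) + (1-y)*log(1-p)]
    hess = p * (1.0 - p)  # second derivative
    return grad, hess


# An untrained model (margin 0) on a positive instance.
grad, hess = logistic_grad_hess(0.0, 1)
print(grad, hess)
```

The hessian here is the per-instance quantity that the `min_child_weight` comment in the configuration file refers to when it mentions the sum of instance weight.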

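The configuration file above lists the most commonly used parameters. For the full list, see https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. If you would rather not set algorithm parameters in the configuration file, you can set them on the command line instead: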
<code class="hljs fix has-numbering">xgboost mushroom.conf max_depth=6</code>


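This sets max_depth to 6 instead of the 3 given in the configuration file. When passing parameters on the command line, make sure max_depth=6 is a single argument, that is, it contains no spaces. If a parameter is set both in the configuration file and on the command line, the command-line value overrides the one in the file.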
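The override rule can be sketched in a few lines of Python. This is only an illustration of the precedence described above, under the assumption that settings are plain `name=value` strings; `parse_conf` and `apply_overrides` are hypothetical helpers, not part of XGBoost.

```python
def parse_conf(text):
    """Parse '[attribute]=[value]' lines, skipping blanks and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip().strip('"')
    return params


def apply_overrides(params, argv):
    """Apply 'name=value' command-line arguments on top of the file settings."""
    merged = dict(params)
    for arg in argv:
        key, value = arg.split("=", 1)  # each override must be one token
        merged[key] = value
    return merged


conf = parse_conf("# maximum depth of a tree\nmax_depth = 3\nnum_round = 2\n")
print(apply_overrides(conf, ["max_depth=6"]))
```

In this sketch, writing `max_depth = 6` with spaces on a shell command line would arrive as three separate arguments, and the first one, `max_depth`, would fail to split. That is the reason the override has to be a single token.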

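The example above uses the tree booster for gradient boosting. To use the linear booster for regression instead, change the booster parameter to gblinear and replace the tree booster parameters with linear booster parameters; everything else in the configuration file stays the same. The configuration file then looks like this: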
<code class="language-conf hljs livecodeserver has-numbering"># General Parameters
# choose the linear booster
booster = gblinear
...

# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01

# Regression Parameters
...</code>

Additional notes on the input data:

- If agaricus.txt.test.buffer exists, XGBoost automatically loads from the binary buffer when possible, which can speed up training when you train many times. You can disable this by setting use_buffer=0.
- A buffer file can also be used as standalone input: if the buffer file exists but the original agaricus.txt.test has been removed, XGBoost will still run.
- Deviation from the LibSVM input format: XGBoost is compatible with the LibSVM format, with the following minor differences:
  - XGBoost allows feature indices to start from 0.
  - For binary classification, the label is 1 for positive and 0 for negative, instead of +1 and -1.
  - The feature indices in each line do not need to be sorted.


Prediction

Once the model is trained, you can run prediction on the test data with the following command:
<code class="hljs bash has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf task=pred model_in=0003.model</code>


For a binary classification problem, each prediction is a probability in [0,1]: the probability that the sample belongs to the positive class.
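Because the output is a probability rather than a class, a threshold is needed to obtain 0/1 labels. The following is a minimal sketch, assuming the predictions are available as text with one probability per line; the sample values and the 0.5 threshold are illustrative assumptions, not something the article prescribes.

```python
def read_predictions(lines):
    """Parse one probability per line, skipping blank lines."""
    return [float(line) for line in lines if line.strip()]


def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels.

    A sample is labeled 1 only when its probability exceeds the threshold.
    """
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical prediction output, one probability per line.
sample_output = ["0.9871", "0.0123", "", "0.5002"]
probs = read_predictions(sample_output)
labels = to_labels(probs)
print(labels)
```

Raising the threshold trades recall for precision on the positive class, so the right value depends on the cost of each kind of mistake.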


Dumping the Model

This is still a preliminary feature, and only tree models can be dumped. XGBoost can write a tree model out as text with the following commands:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt</code>


The contents of 0003.model are written to dump.raw.txt and dump.nice.txt. dump.nice.txt is easier to read because it uses the feature map file featmap.txt to show feature names.

Each line of featmap.txt has the format
<featureid> <featurename> <q or i or int>\n

Feature ids run from 0 up to the number of features, in ascending order.
i marks an indicator (binary) feature.
q marks a quantitative feature such as age or time; its value may be missing.
int marks an integer-valued feature (when int is hinted, the decision boundary will be an integer).
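To make the feature map format concrete, here is a small sketch that builds the contents of a featmap.txt from a list of (name, type) pairs. The feature names are hypothetical, and using a tab as the separator is an assumption; the format above only requires the three fields in order.

```python
VALID_TYPES = {"i", "q", "int"}


def format_featmap(features):
    """Return featmap.txt contents for a list of (name, type) pairs.

    Ids are assigned from 0 in list order, so the lines come out sorted.
    """
    lines = []
    for feature_id, (name, kind) in enumerate(features):
        if kind not in VALID_TYPES:
            raise ValueError(f"unknown feature type: {kind!r}")
        if any(ch.isspace() for ch in name):
            raise ValueError(f"feature name contains whitespace: {name!r}")
        lines.append(f"{feature_id}\t{name}\t{kind}")
    return "\n".join(lines) + "\n"


# Hypothetical features: one indicator, one quantitative, one integer.
text = format_featmap([("cap-shape=bell", "i"), ("age", "q"), ("count", "int")])
print(text, end="")
```

Names are rejected if they contain whitespace, because a space inside a name would make the three fields ambiguous.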


Monitoring Progress

While the program runs, it prints progress information such as the following:
<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0]  test-error:0.016139
boosting round 1, 0 sec elapsed

tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1]  test-error:0.000000</code>


The evaluation messages are written to the standard error stream, stderr. To keep a record of them, redirect stderr to a file:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf 2>log.txt</code>


log.txt then contains the following:
<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0]     test-error:0.016139
[1]     test-error:0.000000</code>


You can also monitor statistics on the training set and the test set at the same time by adding the following to the configuration file:
<code class="language-conf hljs bash has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"</code>


Running with this configuration produces the following output:
<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0]     test-error:0.016139     trainname-error:0.014433
[1]     test-error:0.000000     trainname-error:0.001228</code>


The rule is eval[name-printed-in-log] = filename: the file filename is added to the monitoring process, and the model is evaluated on it at every round.

XGBoost can also monitor several metrics at once. To track the average negative log-likelihood (log loss) of the predictions at each round as well, add eval_metric=logloss to the configuration file. After running again, the log contains the following, where the metric appears under the name negllik:
<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0]     test-error:0.016139     test-negllik:0.029795   trainname-error:0.014433        trainname-negllik:0.027023
[1]     test-error:0.000000     test-negllik:0.000000   trainname-error:0.001228        trainname-negllik:0.002457</code>
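Log lines in this shape are easy to turn into numbers for plotting or early-stopping checks. The sketch below assumes each evaluation line starts with the round index in square brackets, followed by whitespace-separated name:value fields, as in the output shown in this section; the function name is made up for illustration.

```python
import re

# "[round]" followed by the rest of the line.
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def parse_eval_line(line):
    """Return (round_index, {metric_name: value}) or None for other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    metrics = {}
    for field in match.group(2).split():
        # Split on the last colon so the whole metric name is kept.
        name, sep, value = field.rpartition(":")
        if not sep or not name:
            continue
        metrics[name] = float(value)
    return int(match.group(1)), metrics


log_lines = [
    "tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3",
    "[0]     test-error:0.016139     trainname-error:0.014433",
    "[1]     test-error:0.000000     trainname-error:0.001228",
]
parsed = [p for p in map(parse_eval_line, log_lines) if p is not None]
print(parsed)
```

Lines that do not start with a bracketed round index, such as the tree-training messages, are skipped rather than treated as errors.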


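Saving Models During Training

To save a model every two rounds during training, set save_period=2. The model 0002.model will then appear in the current folder. To change the folder the models are written to, set model_dir=foldername. By default XGBoost saves the model from the last round.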
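As a rough illustration of which checkpoint files to expect, the sketch below lists file names for a given number of rounds and save period. It assumes that a model is written after every save_period-th round and named with the four-digit round number, which is inferred from the 0002.model example; treating a period of 0 as "no periodic saves" is also an assumption.

```python
def checkpoint_names(num_round, save_period):
    """List the periodic checkpoint file names for a training run.

    Assumes a checkpoint after every save_period-th round, named
    with the zero-padded round number (an inference from 0002.model).
    """
    if save_period <= 0:
        return []
    return [
        f"{round_number:04d}.model"
        for round_number in range(save_period, num_round + 1, save_period)
    ]


print(checkpoint_names(num_round=6, save_period=2))
```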


Continuing from an Existing Model

To continue training from an existing model, for example from 0002.model, use the following command:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model</code>


XGBoost loads 0002.model, runs two more boosting rounds, and saves the resulting model to continue.model. Note that the training and evaluation data specified in mushroom.conf must not change when you do this.
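The commands in this article all follow the same pattern: the configuration file first, then key=value settings. A minimal sketch of a helper that assembles such a command as an argument list is shown below; the helper name is hypothetical, and it only builds the list without running anything.

```python
def build_command(conf_path, executable="xgboost", **settings):
    """Build an argument list: executable, config file, then key=value pairs.

    Keyword arguments keep their order, so the command reads as written.
    """
    args = [executable, conf_path]
    for key, value in settings.items():
        args.append(f"{key}={value}")
    return args


cmd = build_command(
    "mushroom.conf",
    model_in="0002.model",
    num_round=2,
    model_out="continue.model",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` if the xgboost executable is on the PATH.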


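Using Multiple Threads

When working with large datasets, you may want to train in parallel. If the compiler supports OpenMP, XGBoost supports multithreading natively. Setting nthread=10 makes it use 10 threads.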


Other Notes

What are agaricus.txt.test.buffer and agaricus.txt.train.buffer?

By default XGBoost generates binary cache files with the suffix buffer. The next time XGBoost runs, it loads the cache file instead of the original text file.
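Since the cache is loaded in place of the text file, a cache that is older than its data file may no longer match it. The sketch below, with a hypothetical helper name, lists such cache files by comparing modification times. It assumes the cache sits next to the data file with .buffer appended, as in the file names above.

```python
import os


def stale_buffers(data_paths, suffix=".buffer"):
    """Return cache files that are older than the data file they belong to."""
    stale = []
    for path in data_paths:
        buffer_path = path + suffix
        if not os.path.exists(buffer_path):
            continue
        # A data file modified after its cache was written makes the cache stale.
        if os.path.getmtime(buffer_path) < os.path.getmtime(path):
            stale.append(buffer_path)
    return stale


if __name__ == "__main__":
    candidates = [p for p in ("agaricus.txt.train", "agaricus.txt.test") if os.path.exists(p)]
    print(stale_buffers(candidates))
```

Deleting the files this reports is one way to make sure the next run reads the updated data.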