
Mahout Bayes Algorithm Source Code Analysis (7)

2013-09-05 21:11
First, a correction to the seq2sparse (6) analysis of TFIDFPartialVectorReducer: the final formula there should take the following form:

sqrt(e.get())*[ln(vectorCount/(df+1)) + 1]
Earlier I wrote about e.get() and assumed offhand that it returned the word count; in fact the value obtained here is simply 1. Also, that log function is base e, so it should be written as ln.
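To make the corrected formula concrete, here is a minimal Java sketch of the weight computation; the helper name is mine, and it assumes (per the correction above) that each record contributes a term frequency of 1 and that the log is natural:

// Hypothetical helper, not Mahout's actual TFIDF class: the corrected
// weight sqrt(tf) * (ln(vectorCount / (df + 1)) + 1).
static double tfidfWeight(double tf, long df, long vectorCount) {
  return Math.sqrt(tf) * (Math.log((double) vectorCount / (df + 1)) + 1.0);
}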
There is really nothing new in the PartialVectorMergeReducer of seq2sparse (7); it is practically identical to what came before, so I will not analyze it here. Continuing the analysis, the log output from the very beginning shows that the next step is:

+ echo 'Creating training and holdout set with a random 80-20 split of the generated vector dataset'
Creating training and holdout set with a random 80-20 split of the generated vector dataset
+ ./bin/mahout split -i /home/mahout/mahout-work-mahout/20news-vectors/tfidf-vectors --trainingOutput /home/mahout/mahout-work-mahout/20news-train-vectors --testOutput /home/mahout/mahout-work-mahout/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The class corresponding to split is SplitInput; the command executed above is analyzed below with reference to this class and the parameters shown.

First, here is the usage guide for this class:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                                 Path to job input          
                                                     directory.                 
  --trainingOutput (-tr) trainingOutput              The training data output   
                                                     directory                  
  --testOutput (-te) testOutput                      The test data output       
                                                     directory                  
  --testSplitSize (-ss) testSplitSize                The number of documents    
                                                     held back as test data for 
                                                     each category              
  --testSplitPct (-sp) testSplitPct                  The % of documents held    
                                                     back as test data for each 
                                                     category                   
  --splitLocation (-sl) splitLocation                Location for start of test 
                                                     data expressed as a        
                                                     percentage of the input    
                                                     file size (0=start,        
                                                     50=middle, 100=end         
  --randomSelectionSize (-rs) randomSelectionSize    The number of items to be  
                                                     randomly selected as test  
                                                     data                       
  --randomSelectionPct (-rp) randomSelectionPct      Percentage of items to be  
                                                     randomly selected as test  
                                                     data when using mapreduce  
                                                     mode                       
  --charset (-c) charset                             The name of the character  
                                                     encoding of the input      
                                                     files (not needed if using 
                                                     SequenceFiles)             
  --sequenceFiles (-seq)                             Set if the input files are 
                                                     sequence files.  Default   
                                                     is false                   
  --method (-xm) method                              The execution method to    
                                                     use: sequential or         
                                                     mapreduce. Default is      
                                                     mapreduce                  
  --overwrite (-ow)                                  If present, overwrite the  
                                                     output directory before    
                                                     running job                
  --keepPct (-k) keepPct                             The percentage of total    
                                                     data to keep in map-reduce 
                                                     mode, the rest will be     
                                                     ignored.  Default is 100%  
  --mapRedOutputDir (-mro) mapRedOutputDir           Output directory for map   
                                                     reduce jobs                
  --help (-h)                                        Print out help             
  --tempDir tempDir                                  Intermediate output        
                                                     directory                  
  --startPhase startPhase                            First phase to run         
  --endPhase endPhase                                Last phase to run          
Specify HDFS directories while running on hadoop; else specify local file       
system directories
I will skip the several path parameters at the front and look at the following ones:

--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The first parameter says 40% of the data will be used for testing and the rest for training; the second says the output directory will be cleared before the job runs; the third says the files in the input directory are sequence files; the fourth chooses the execution method, sequential or mapreduce, with mapreduce as the default (so a non-default method is chosen here). But the log message announces an 80-20 split, i.e. 20% test data rather than 40%, so what is setting 40% here for? And since -xm sequential means mapreduce mode is not used, why set --randomSelectionPct at all, when the help text ties it to mapreduce mode?
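The 40% puzzle resolves itself inside the sequential path: splitFile converts --randomSelectionPct into an absolute number of test lines before sampling, so the option is honored in sequential mode too, despite the help text mentioning mapreduce. A hedged paraphrase of that conversion (not the verbatim Mahout statement; lineCount is the file's line count, computed in splitFile as shown further below):

// Paraphrased from SplitInput.splitFile: turn the percentage into an
// absolute count of test lines for the current input file.
if (testRandomSelectionPct > 0) {
  testSplitSize = Math.round(lineCount * testRandomSelectionPct / 100.0f);
}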
And the observed result bears this out: the class splits the data into two parts, test and training, at a ratio of 2:3, which really is a 40% test share. Now let's analyze the class.
The front of the class is mostly parameter setup. In its run method:

if (parseArgs(args)) {
  splitDirectory();
}
The condition of the if performs the argument parsing; splitDirectory is the main execution body, and inside it is the choice between mapreduce and sequential:

if (useMapRed) {
  SplitInputJob.run(new Configuration(), inputDir, mapRedOutputDirectory,
      keepPct, testRandomSelectionPct);
} else {
  // input dir contains one file per category.
  FileStatus[] fileStats = fs.listStatus(inputDir, PathFilters.logsCRCFilter());
  for (FileStatus inputFile : fileStats) {
    if (!inputFile.isDir()) {
      splitFile(inputFile.getPath());
    }
  }
}
With -xm sequential, execution takes the else branch (PathFilters.logsCRCFilter() simply skips Hadoop's _logs directories and .crc checksum files). The main work is the splitFile method; stepping into it:
It performs three main operations:
1. Count all the lines of the input file:

int lineCount = countLines(fs, inputFile, charset);
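countLines just streams through the file and counts its lines. A minimal stand-in using only JDK classes, as an illustration (Mahout's version reads through the Hadoop FileSystem rather than java.nio):

// Count the newline-delimited lines of a local file.
static int countLines(java.nio.file.Path file, java.nio.charset.Charset charset) throws java.io.IOException {
  try (java.io.BufferedReader reader = java.nio.file.Files.newBufferedReader(file, charset)) {
    int lineCount = 0;
    while (reader.readLine() != null) {
      lineCount++;
    }
    return lineCount;
  }
}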
2. Randomly generate an array of testSplitSize (40% of lineCount) values lying between 0 and lineCount - 1:

long[] ridx = new long[testSplitSize];
RandomSampler.sample(testSplitSize, lineCount - 1, testSplitSize, 0, ridx, 0, RandomUtils.getRandom());
randomSel = new BitSet(lineCount);
for (long idx : ridx) {
  randomSel.set((int) idx + 1);
}
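What RandomSampler.sample does here: pick testSplitSize distinct values at random from the lineCount - 1 candidate positions and record them, shifted by one, in a BitSet of 1-based line positions. For readers without Mahout at hand, a sketch with the same effect using JDK classes only:

import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Shuffle all candidate indices and keep the first testSplitSize of
// them. Fine at these line counts; RandomSampler avoids materializing
// the whole index list.
static BitSet sampleTestLines(int lineCount, int testSplitSize, Random random) {
  List<Integer> indices = IntStream.range(0, lineCount - 1).boxed().collect(Collectors.toList());
  Collections.shuffle(indices, random);
  BitSet randomSel = new BitSet(lineCount);
  for (int i = 0; i < testSplitSize; i++) {
    randomSel.set(indices.get(i) + 1); // +1: line positions are 1-based, matching the loop above
  }
  return randomSel;
}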
3. Traverse the input file; when the current line number is among the indices produced in step 2, the line goes to the test output, otherwise to the train output:

writer = randomSel.get(pos) ? testWriter : trainingWriter;
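In context, that line sits in the copy loop. A sketch of that loop for a plain-text input (reader, pos, testWriter and trainingWriter mirror the snippet above; the real code also handles the SequenceFile case):

// Walk the input line by line; pos is 1-based, matching the +1 shift
// used when the BitSet was filled.
String line;
int pos = 0;
while ((line = reader.readLine()) != null) {
  pos++;
  Writer writer = randomSel.get(pos) ? testWriter : trainingWriter;
  writer.write(line);
  writer.write('\n');
}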

The only thing I do not understand is why a dataset of just 18846 records comes to 162419 lines?

Share, grow, be happy.
When reposting, please credit this blog: http://blog.csdn.net/fansy1990