
Integrating Weka with Java for Logistic Regression (continued)

2016-07-13 10:14
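Sample data is hard to find online, especially data for multi-class problems, and what does turn up tends to be too small to be any fun.

Fine, I'll generate the data myself, with rules of my own making.

Generate the data and write it out as an ARFF file.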

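// Imports needed by this snippet
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.util.Calendar;
import java.util.Random;

static private void genArffData(String arffPath, int numRows, int numFields, int numClasses) throws FileNotFoundException {

    // Generate random data with numFields numeric fields plus one class field,
    // to be used for multi-class classification
    Random random = new Random(Calendar.getInstance().getTimeInMillis());

    File arff = new File(arffPath);
    PrintWriter writer = new PrintWriter(new BufferedOutputStream(new FileOutputStream(arff)));

    writer.println("@RELATION \"LogisticRegression FakeData\"");
    writer.println();

    // Numeric attributes are named A, B, C, ...
    int i=0;
    for (; i<numFields; ++i) {
        writer.println("@ATTRIBUTE " + (char)('A'+i) + " REAL");
    }
    // The class attribute takes the next letter; its labels are the digits 0, 1, 2, ...
    // (single characters, so this naming only works for up to 10 classes)
    writer.print("@ATTRIBUTE " + (char)('A'+i) + " {");
    for (i=0; i<numClasses; ++i) {
        if (i>0) writer.print(',');
        writer.print((char)('0'+i));
    }
    writer.println('}');
    writer.println();

    writer.println("@DATA");

    // One row per instance: numFields random values in [0, 1), then the class label
    float [] values = new float[numFields];
    for (i=0; i<numRows; ++i) {
        for (int j=0; j<numFields; ++j) {
            values[j] = random.nextFloat();

            writer.print(values[j]);
            writer.print(',');
        }

        int classValue = computeClass(values, numClasses);
        writer.println(classValue);
    }

    writer.close();
}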


This code does nothing more than open a file and write content to it...
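The header layout can be inspected without writing a file. The following is a minimal sketch, assuming the same naming scheme as genArffData (letters for attributes, single digits for class labels). The class name ArffHeaderSketch and the helper buildHeader are illustrative names, not part of the original code, and the range checks are an addition that makes the limits of that naming scheme explicit.

```java
// Illustrative sketch: builds the same ARFF header that genArffData writes,
// but into a String so it can be inspected without touching the disk.
final class ArffHeaderSketch {

    // buildHeader is a hypothetical helper, not part of the original post.
    static String buildHeader(int numFields, int numClasses) {
        // Letters A..Z name the attributes and digits 0..9 name the classes,
        // so the scheme only works within these limits.
        if (numFields < 1 || numFields > 25) {
            throw new IllegalArgumentException("numFields must be in 1..25");
        }
        if (numClasses < 2 || numClasses > 10) {
            throw new IllegalArgumentException("numClasses must be in 2..10");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION \"LogisticRegression FakeData\"\n\n");
        for (int i = 0; i < numFields; ++i) {
            sb.append("@ATTRIBUTE ").append((char) ('A' + i)).append(" REAL\n");
        }
        // The class attribute takes the next letter after the numeric ones.
        sb.append("@ATTRIBUTE ").append((char) ('A' + numFields)).append(" {");
        for (int i = 0; i < numClasses; ++i) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append((char) ('0' + i));
        }
        sb.append("}\n\n@DATA\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same shape as the call in the post: 6 numeric fields, 4 classes.
        System.out.print(buildHeader(6, 4));
    }
}
```

With 6 fields and 4 classes, the class attribute comes out as G with labels 0 to 3.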

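The key part is the computeClass function, which defines the rule that assigns each row to a class. It mixes in a variety of functions (after all these years of using Java, this is the first time I have looked at what Math actually contains... embarrassing).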

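private static int computeClass(float[] values, int numClasses) {

    // Build a score cv: field 0 is added as-is, and each later field
    // contributes through a different function
    float cv = values[0];
    for(int i=1; i<values.length; ++i) {

        switch (i) {
        case 1:
            cv += values[i] * 5;
            break;
        case 2:
            // values are in [0, 1), so this term is negative (-Infinity for exactly 0)
            cv += Math.log10(values[i]);
            break;
        case 3:
            cv += Math.asin(values[i]);
            break;
        case 4:
            cv += Math.exp(values[i]);
            break;
        default:
            cv += values[i]*i;
            break;
        }
    }

    // Map the score to a class: below 3 is class 0, above numClasses+3 is the
    // last class, anything in between is (int)(cv*1.5) integer-divided by numClasses
    int c;
    if (cv<3) {
        c = 0;
    }
    else if (cv > (numClasses+3)) {
        c = numClasses-1;
    }
    else {
        c = ((int) ((cv)*1.5) / numClasses);
        if (c >= numClasses)
            c = numClasses-1;
    }
    return c;
}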


OK, let's try it out in a main function. How about 100,000 rows:

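// Imports needed by this snippet
import java.io.File;
import weka.classifiers.functions.Logistic;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public static void main(String[] args) throws Exception {

    final String arffFilePath = "data/LogisticRegressionFakeData.arff";
    genArffData(arffFilePath, 100000, 6, 4);

    // trainModel is not listed in this post
    Logistic logic = trainModel(arffFilePath, 6);

    ArffLoader loader = new ArffLoader();
    File inputFile = new File(arffFilePath); // test data file (the same file passed to trainModel)
    loader.setFile(inputFile);
    Instances insTest = loader.getDataSet(); // read the test file
    // Index of the class attribute (attributes are numbered from 0);
    // insTest.numAttributes() returns the total number of attributes
    insTest.setClassIndex(6);

    double sum = insTest.numInstances(); // number of test instances
    double right = 0.0f;
    for(int i=0;i<sum;i++){

        Instance ins = insTest.instance(i);

        if(logic.classifyInstance(ins)==ins.classValue()) {
            right++; // one more correct prediction

            System.out.println("No.\t" + i + "\t" + ins.classValue() + " RIGHT");
        }
        else {
            System.out.println("No.\t" + i + "\t" + ins.classValue() + " WRONG");
        }
    }
    // Fraction of all instances classified correctly
    System.out.println("classification precision:" + (right/sum));
}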
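The ratio printed as "classification precision" is the fraction of all instances classified correctly, that is, overall accuracy, and it is measured on the same file that was passed to trainModel. A per-class breakdown can be easier to read than 100,000 RIGHT/WRONG lines. Below is a minimal sketch that uses plain int arrays rather than any Weka API; ConfusionSketch and its methods are illustrative names, and the labels in its main are made-up values that only show the shape of the output.

```java
// Illustrative sketch: confusion matrix and accuracy from two label arrays.
final class ConfusionSketch {

    // counts[a][p] = number of instances with actual class a and predicted class p
    static int[][] confusion(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int[][] counts = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; ++i) {
            counts[actual[i]][predicted[i]]++;
        }
        return counts;
    }

    // Accuracy is the sum of the diagonal divided by the sum of all cells.
    static double accuracy(int[][] counts) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < counts.length; ++a) {
            for (int p = 0; p < counts[a].length; ++p) {
                total += counts[a][p];
                if (a == p) {
                    correct += counts[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up labels, for illustration only.
        int[] actual = {0, 1, 1, 2, 3, 3};
        int[] predicted = {0, 1, 2, 2, 3, 1};
        int[][] counts = confusion(actual, predicted, 4);
        for (int[] row : counts) {
            System.out.println(java.util.Arrays.toString(row));
        }
        System.out.println("accuracy: " + accuracy(counts));
    }
}
```

In the main function above, the two arrays could be filled from (int) ins.classValue() and (int) logic.classifyInstance(ins) inside the loop.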


The generated data:

@RELATION "LogisticRegression FakeData"

@ATTRIBUTE A REAL
@ATTRIBUTE B REAL
@ATTRIBUTE C REAL
@ATTRIBUTE D REAL
@ATTRIBUTE E REAL
@ATTRIBUTE F REAL
@ATTRIBUTE G {0,1,2,3}

@DATA
0.71897244,0.32674688,0.34844375,0.14773273,0.60203516,0.030885875,1
0.87727785,0.26676136,0.9318922,0.50508565,0.22496736,0.39517665,2
0.44499284,0.5905153,0.7953741,0.05966431,0.13777435,0.106003165,1
0.37487888,0.8418185,0.33143914,0.6179532,0.39359564,0.96861655,3
0.047727704,0.23949718,0.58549887,0.53503656,0.83233106,0.5622865,2
0.70024496,0.43123567,0.18669724,0.20847279,0.17981762,0.79000807,3
0.5998019,0.39879912,0.83340144,0.5890504,0.70057064,0.049901605,2
0.6422481,0.31674922,0.18628752,0.6275924,0.66154146,0.54778665,2
0.09535301,0.63388544,0.20779681,0.16196364,0.37264192,0.73777825,3
……
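The labels in this listing can be checked by hand against computeClass. For the first row the score is roughly 0.719 + 5 × 0.327 + log10(0.348) + asin(0.148) + exp(0.602) + 5 × 0.031 ≈ 4.02, which lands in the middle branch: (int)(4.02 × 1.5) / 4 = 6 / 4 = 1, matching the label in the last column. The sketch below repeats that check in code for three of the listed rows; ComputeClassCheck is an illustrative wrapper name, and its computeClass is a compact copy of the rule shown earlier.

```java
// Illustrative sketch: recompute the class label for rows from the listing.
final class ComputeClassCheck {

    // Rows 1, 2 and 4 of the listing (feature columns only) and their labels.
    static final float[][] ROWS = {
        {0.71897244f, 0.32674688f, 0.34844375f, 0.14773273f, 0.60203516f, 0.030885875f},
        {0.87727785f, 0.26676136f, 0.9318922f, 0.50508565f, 0.22496736f, 0.39517665f},
        {0.37487888f, 0.8418185f, 0.33143914f, 0.6179532f, 0.39359564f, 0.96861655f},
    };
    static final int[] LABELS = {1, 2, 3};

    // Same rule as computeClass in the post, copied so the check is self-contained.
    static int computeClass(float[] values, int numClasses) {
        float cv = values[0];
        for (int i = 1; i < values.length; ++i) {
            switch (i) {
                case 1: cv += values[i] * 5; break;
                case 2: cv += Math.log10(values[i]); break;
                case 3: cv += Math.asin(values[i]); break;
                case 4: cv += Math.exp(values[i]); break;
                default: cv += values[i] * i; break;
            }
        }
        if (cv < 3) {
            return 0;
        }
        if (cv > numClasses + 3) {
            return numClasses - 1;
        }
        // Truncate cv * 1.5 to an int, then integer-divide by numClasses.
        int c = (int) (cv * 1.5) / numClasses;
        return Math.min(c, numClasses - 1);
    }

    public static void main(String[] args) {
        for (int r = 0; r < ROWS.length; ++r) {
            int computed = computeClass(ROWS[r], 4);
            System.out.println("listed " + LABELS[r] + ", computed " + computed);
        }
    }
}
```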


The result of the run:

classification precision:0.9487


A look at the value distributions in the Weka tool (looks pretty? Of course, it was tuned to look that way...):



Ran it in Weka for a while...



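With made-up data, the trained model does turn out rather well... What if the class-generation rule were adjusted to be simpler, without functions such as log and asin: could it then reach 100% accuracy?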
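One way to try that experiment would be to swap computeClass for a rule that is a plain weighted sum cut into equal-width bins. The sketch below is only a hypothetical variant: the name computeClassLinear, the weights (i + 1) and the evenly spaced cut points are assumptions chosen for illustration, not something from the code above, and whether Logistic actually reaches 100% on such data would still have to be measured.

```java
// Illustrative sketch: a simpler labeling rule with no log, asin or exp.
final class LinearRuleSketch {

    // Hypothetical replacement for computeClass: weighted sum, equal-width bins.
    static int computeClassLinear(float[] values, int numClasses) {
        float score = 0f;
        float maxScore = 0f;
        for (int i = 0; i < values.length; ++i) {
            score += values[i] * (i + 1); // field i gets weight i + 1 (an assumption)
            maxScore += (i + 1);          // upper bound of the score for values in [0, 1)
        }
        // Scale the score to [0, numClasses) and truncate to get the bin index.
        int c = (int) (score / maxScore * numClasses);
        return Math.max(0, Math.min(c, numClasses - 1));
    }

    public static void main(String[] args) {
        float[] low = {0f, 0f, 0f, 0f, 0f, 0f};
        float[] mid = {0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f};
        float[] high = {0.999f, 0.999f, 0.999f, 0.999f, 0.999f, 0.999f};
        System.out.println(computeClassLinear(low, 4));
        System.out.println(computeClassLinear(mid, 4));
        System.out.println(computeClassLinear(high, 4));
    }
}
```

Because a sum of uniform random values concentrates around its middle, equal-width bins like these will produce far fewer rows in the first and last classes than in the central ones.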