
Text Classification with a Kernel Support Vector Machine (SVM) + TF-IDF

2020-06-29 21:58

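There are many algorithms for text classification; this post gives an example that classifies text with SVC. On one classification task I also tried a decision tree and an RNN, but the SVC classifier performed best, so only the SVC code is given below.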
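The script reads its training data from `data.txt` and accesses `tmp['txt']` and `tmp['label']` on each record, so the file needs to be a JSON list of objects with those two keys. Here is a minimal sketch of that layout. The sample sentences and the label names `refund` and `change` are invented for illustration, and the helper names `write_samples` and `read_samples` are hypothetical, not part of the original script.

```python
import json
import os
import tempfile

# Hypothetical records: the texts and the label names are made up for illustration.
samples = [
    {"txt": "客人取消订单后只收到一笔退款", "label": "refund"},
    {"txt": "客人要求修改入住日期", "label": "change"},
]


def write_samples(path, records):
    """Write the records as a JSON list, keeping the Chinese text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def read_samples(path):
    """Load the JSON list and split it into parallel text and label lists."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    texts = [r["txt"] for r in records]
    labels = [r["label"] for r in records]
    return texts, labels


# Round-trip through a temporary directory so no real data.txt is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")
    write_samples(path, samples)
    texts, labels = read_samples(path)

print(labels)
```

Two records only show the shape of the file. A real data set would presumably need many more examples per label, since the script holds out 10% for testing and runs 20-fold cross-validation on the rest.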

[code]import json

import jieba
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.utils import Bunch  # replaces the old sklearn.datasets.base import


def not_digit(token):
    """Return True for tokens that are not purely numeric; only these are kept."""
    return not token.isdigit()


def trans():
    bunch = Bunch(label=[], contents=[])

    # data.txt holds a JSON list of objects with 'txt' and 'label' keys
    with open("data.txt", 'r', encoding='utf-8') as f:
        records = json.load(f)

    for tmp in records:
        # Segment with jieba, drop numeric tokens, and join the rest with spaces
        bunch.contents.append(" ".join(filter(not_digit, jieba.cut(tmp['txt']))))
        bunch.label.append(tmp['label'])

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        bunch.contents, bunch.label, train_size=0.90)

    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w\w+\b")

    # RBF kernel by default; the grid search below overrides C and gamma
    svc = SVC(C=100, gamma=0.1)
    # Train through a pipeline
    pip = make_pipeline(vectorizer, svc)

    # Grid search to tune the parameters
    param_grid = {'svc__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000],
                  'svc__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
    grid_search = GridSearchCV(pip, param_grid, cv=20, n_jobs=-1)

    grid_search.fit(X_train, y_train)

    print('score', grid_search.score(X_test, y_test))

    joblib.dump(grid_search, "train_model.m")


def p(s):
    # Load the saved grid search and predict with its best estimator
    lm: GridSearchCV = joblib.load("train_model.m")
    b = lm.predict(s)
    print(b[0])


if __name__ == "__main__":
    trans()
    p(["客人 表示 操作 取消 订单 ， 但 银行卡 只 收到 一笔 退款 ， 由于 客人 没有 下载 银行 APP ， 现 无法 核实 提供 账单 信息 ， 要求 我司 自行 查询 ， 请 跟进"])
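One detail of the preprocessing is easy to miss. The digit filter removes purely numeric tokens, and the `token_pattern` passed to `TfidfVectorizer` only matches tokens of two or more word characters, so single-character Chinese words and punctuation never reach the classifier. The sketch below imitates those two steps with the standard library only. The word list is typed by hand instead of coming from jieba, and `surviving_tokens` is a hypothetical helper name. It also lowercases the text, which `TfidfVectorizer` does by default.

```python
import re

# Same pattern the script passes to TfidfVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def surviving_tokens(segmented_words):
    """Imitate the script's preprocessing on an already segmented word list."""
    # Step 1: drop purely numeric tokens, as the script's filter does.
    kept = [w for w in segmented_words if not w.isdigit()]
    # Step 2: join with spaces and lowercase, then tokenize with the same
    # regular expression the vectorizer is configured with.
    return TOKEN_PATTERN.findall(" ".join(kept).lower())


# Hand-written stand-in for jieba output (an assumption, not real jieba output).
words = ["客人", "取消", "订单", "，", "但", "只", "收到", "2", "笔", "退款", "APP"]
print(surviving_tokens(words))
```

In this sketch the single-character words `但`, `只` and `笔`, the comma, and the number are all dropped, while the two-character words and `APP` (lowercased) are kept. If single-character words matter for a particular task, a pattern such as `r"(?u)\b\w+\b"` would keep them, at the cost of a larger vocabulary.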
