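Rule-Embedding-Based Paper Comparison System: Innovation Training Log 12

2020-07-13 06:11

June 27: visualization refinements and fixes

Innovation Training Log 12

Removing stop words from the word cloud
Re-crawling paper citation counts and visualizing them again
Data crawling
Visualization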

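Removing stop words from the word cloud

Change: stop words are now removed while the keyword list is being built. The stop word list is the English stop word list from the nltk library, which we found online.

# Build the keyword list with stop words removed
key_path = 'D:/大学资料/大三下/项目实训/code+data/ACM数据集/keywords.txt'
key_file = open(key_path, 'r')
word_list = []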
stopword = set(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been',
'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any',
'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd',
'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven',
'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'])
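for line in key_file:
    # Build a new list instead of calling remove() while iterating,
    # because removing during iteration skips the element after each removal
    words = [word for word in line.strip().split() if word not in stopword]
    word_list.append(words)
key_file.close()
print(word_list[1])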
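
To illustrate the filtering step on its own, here is a minimal sketch. The tiny stop word set, the helper names and the lowercase comparison are assumptions made for this example and are not part of the project code. It also shows why calling remove() on a list while iterating over it leaves some stop words behind.

```python
# Illustrative stop word set (a tiny subset, for demonstration only)
STOPWORDS = {"a", "an", "the", "of", "and", "for", "in", "on"}


def remove_stopwords(words, stopwords=STOPWORDS):
    """Return a new list without stop words; the input list is not modified."""
    # Lowercasing before the lookup is an assumption, not something the post does
    return [w for w in words if w.lower() not in stopwords]


def remove_in_place_buggy(words, stopwords=STOPWORDS):
    """Anti-pattern: mutating the list being iterated skips elements."""
    words = list(words)  # copy so the caller's list is untouched
    for w in words:
        if w in stopwords:
            words.remove(w)
    return words


sample = "a survey of the state of the art".split()
print(remove_stopwords(sample))       # ['survey', 'state', 'art']
print(remove_in_place_buggy(sample))  # ['survey', 'the', 'state', 'the', 'art']
```

The list comprehension visits every word exactly once, so adjacent stop words such as "of the" are both dropped.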

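Word cloud before removing stop words:

Word cloud after removing stop words:

Re-crawling paper citation counts and visualizing them again

Data crawling

We use a paper search tool to look up each paper by its title; the result page then shows the paper's details. Here we use a Google Scholar mirror that is accessible from mainland China. The paper information looks like the figure below:

There are some special cases, however: the queried paper may not exist, or it may exist without any citation count information.

While crawling, we check for two situations: nothing could be fetched, or the fetched value is not a number. In both cases the citation count is recorded as empty.
The crawling approach is the same as the one we used earlier to obtain year and venue, so we do not repeat it here.
First we define the function that fetches the quote (citation count) information.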

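import requests
import urllib3
from bs4 import BeautifulSoup

# Suppress the InsecureRequestWarning caused by verify=False, before any request is made
urllib3.disable_warnings()

# Fetch the citation count of one paper
def get_Info(url):
    res = requests.get(url, verify=False)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # The page does not exist or the search found no paper
    cards = soup.select('.card-title')
    if not cards:
        return ''
    # The citation count is the text of the last link in the first result card
    text = cards[0].select('a')
    if not text:
        return ''
    quote = text[-1].text
    # A non-numeric value means no citation count is available
    if not quote.isdigit():
        quote = ''
    return quote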
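
The URLs passed to get_Info are search URLs. Judging from the base URL that the crawl loop checks for, each one is the mirror's search address followed by a query. A minimal sketch of how such a link could be built from a title follows; build_search_url and has_query are hypothetical helpers, and encoding the title with quote_plus is an assumption about the query format rather than something this log confirms.

```python
from urllib.parse import quote_plus

# Base URL taken from the crawl loop in this post; the query encoding is an assumption
BASE_URL = "https://proxy.niostack.com/scholar?q="


def build_search_url(title, base_url=BASE_URL):
    """Build a search URL for one paper title (hypothetical helper)."""
    return base_url + quote_plus(title.strip())


def has_query(url, base_url=BASE_URL):
    """True if the URL carries a non-empty query and is worth requesting."""
    return url.strip() != base_url


print(build_search_url("Attention Is All You Need"))
print(has_query(build_search_url("   ")))  # False: a blank title gives the bare base URL
```

A blank title produces the bare base URL, which is the case the crawl loop treats as "no citation count" without sending a request.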

Next we crawl the URLs listed in links.txt and save the results to quotes.txt, using the same format as the other files in the ACM dataset.

import numpy as np
from tqdm import tqdm

# Read all paper links
file_links = 'D:/大学资料/大三下/项目实训/code+data/ACM数据集/links.txt'
flinks = open(file_links, 'r')
links = []
for line in flinks:
    lines = line.strip('\n')
    links.append(lines)
#print(links[:3])
flinks.close()

nlinks = np.array(links)

# Open quotes.txt in append mode (it is created if it does not exist)
file_quotes = 'D:/大学资料/大三下/项目实训/code+data/ACM数据集/quotes.txt'
fquotes = open(file_quotes, 'a', encoding='utf-8')

# Crawl one batch of links, with a progress bar
for i in tqdm(range(43400, 43432), desc='Crawling'):
    link = nlinks[i]
    if link.strip() == 'https://proxy.niostack.com/scholar?q=':
        # Empty query (no title): write a blank line so the line order stays aligned
        quotes = '\n'
    else:
        quote = get_Info(link)
        quotes = quote + '\n'
    fquotes.write(quotes)

fquotes.close()
print('All data written')

We crawl the links in batches and use tqdm to monitor the progress of each batch.
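
Because the output file is opened in append mode and holds one line per link, the start of the next batch can be derived from the number of lines already written instead of editing the range by hand. The following is only a sketch of that idea: count_lines, crawl_batch and the stand-in fetch function are hypothetical, and the demo writes to a temporary file rather than the real quotes.txt.

```python
import os
import tempfile


def count_lines(path):
    """Number of lines already written, i.e. the index of the next link to crawl."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def crawl_batch(links, out_path, fetch, batch_size):
    """Crawl the next batch_size links, appending one line per link to out_path."""
    start = count_lines(out_path)
    end = min(start + batch_size, len(links))
    with open(out_path, "a", encoding="utf-8") as out:
        for link in links[start:end]:
            # An empty result still produces a line, keeping line numbers aligned
            out.write(fetch(link) + "\n")
    return start, end


# Demo with a stand-in fetch function instead of a real HTTP request
demo_links = ["url%d" % i for i in range(5)]
demo_path = os.path.join(tempfile.mkdtemp(), "quotes_demo.txt")


def fake_fetch(link):
    """Pretend that the link ending in 3 has no citation count."""
    return "" if link.endswith("3") else "7"


first = crawl_batch(demo_links, demo_path, fake_fetch, batch_size=2)
second = crawl_batch(demo_links, demo_path, fake_fetch, batch_size=10)
print(first, second)  # (0, 2) (2, 5)
```

In the real crawl, get_Info would take the place of fake_fetch. If a batch is interrupted partway, the next call resumes from the first link that has no line yet.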

Visualization

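Using the newly crawled data, we compute the average citation count of the papers for each year and for each venue, and then visualize the results. The data processing is similar to what we did before, so we do not repeat it here.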
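
Since the processing step is not shown in this log, here is a minimal sketch of one way to compute such averages. It assumes the keys (years or venues) and the citation counts are available as parallel lists in the same line order, and that blank or non-numeric citation entries are skipped rather than counted as zero. average_by_key and the demo values are hypothetical.

```python
from collections import defaultdict


def average_by_key(keys, quotes):
    """Average the numeric citation counts per key, skipping blank entries."""
    total = defaultdict(int)
    count = defaultdict(int)
    for key, quote in zip(keys, quotes):
        quote = quote.strip()
        # Skip papers with an unknown key or without a usable citation count
        if not key or not quote.isdigit():
            continue
        total[key] += int(quote)
        count[key] += 1
    return {k: total[k] / count[k] for k in total}


years = ["2015", "2015", "2016", "2016", ""]
quotes = ["10", "", "4", "8", "3"]
inavg_demo = average_by_key(years, quotes)
print(inavg_demo)  # {'2015': 10.0, '2016': 6.0}
```

Skipping blank entries means the average is taken over papers with a known citation count only; counting them as zero would lower the averages.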
Average citation count of papers by year:

# Build the x-axis and y-axis data, sorted by year
x_data = []
y_data = []
for k in sorted(inavg_dic):
    x_data.append(k)
    y_data.append(inavg_dic[k])

# Draw the line chart
import pyecharts.options as opts
from pyecharts.charts import Line
line = Line()\
    .add_xaxis(xaxis_data=x_data)\
    .add_yaxis(
        series_name="Average citations",
        y_axis=y_data,
        # Mark the maximum and minimum points
        markpoint_opts=opts.MarkPointOpts(
            data=[
                opts.MarkPointItem(type_="max", name="Max"),
                opts.MarkPointItem(type_="min", name="Min"),
            ]
        ),
        # Draw a horizontal line at the average
        markline_opts=opts.MarkLineOpts(
            data=[opts.MarkLineItem(type_="average", name="Average")]
        ),
    )\
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Average citation count by year", subtitle="Data as of 2020-06-26"),
        xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
        datazoom_opts=opts.DataZoomOpts(is_show=True, orient="horizontal")
    )
line.render_notebook()

Average citation count of papers by venue:

# Filter the data: keep the 60 venues with the highest average citation count
def sort_dict(data, reverse):
    """
    @param data: dictionary to sort
    @param reverse: whether to sort in descending order
    @return: list of single-entry dictionaries sorted by value
    """
    data_list = [{k: v} for k, v in data.items()]
    f = lambda x: list(x.values())[0]

    return sorted(data_list, key=f, reverse=reverse)

l = sort_dict(avg_dic, True)
filter_l = l[:60]
print(type(filter_l[0].keys()))
# Build the x and y data
x_data = []
y_data = []
for item in filter_l:
    for k in item:
        x_data.append(k)
        y_data.append(item[k])
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.globals import ThemeType

# Draw the pie chart (ring shape, labels and legend hidden)
venue_pie = Pie()\
    .add("Conference/Journal", [list(z) for z in zip(x_data, y_data)],
         label_opts=opts.LabelOpts(is_show=False),
         radius=[40, 120])\
    .set_colors(["blue", "green", "yellow", "red", "pink", "orange", "purple", "grey"])\
    .set_global_opts(title_opts=opts.TitleOpts(
        title="Average citation count by conference/journal (top 60)",),
        legend_opts=opts.LegendOpts(is_show=False))\
    .set_series_opts(tooltip_opts=opts.TooltipOpts(
        formatter="{a} <br/>{b}: {c} ({d}%)"
    ),)
venue_pie.render_notebook()
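
As a side note, the top-N selection can also be written without wrapping each entry in its own dictionary. The sketch below uses heapq.nlargest from the standard library as an alternative to sort_dict; top_n_items and the demo venue names and values are made up for illustration.

```python
import heapq


def top_n_items(data, n):
    """Return the n (key, value) pairs with the largest values, in descending order."""
    return heapq.nlargest(n, data.items(), key=lambda kv: kv[1])


avg_demo = {"VenueA": 12.5, "VenueB": 40.0, "VenueC": 3.2, "VenueD": 27.1}
top = top_n_items(avg_demo, 2)

# Split the pairs into the x and y lists expected by the chart code
x_demo = [k for k, _ in top]
y_demo = [v for _, v in top]
print(x_demo, y_demo)  # ['VenueB', 'VenueD'] [40.0, 27.1]
```

The resulting pairs can be passed to the pie chart directly, since it also expects a sequence of [name, value] items.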
