Processing the DBLP Dataset for Experiments

2016-04-12 14:12

Introduction to DBLP

XML Data Format

Parsing the XML

Introduction to DBLP

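DBLP is an English-language bibliographic database for computer science that indexes papers published in international journals and conference proceedings. It does not index or search Chinese-language literature; a comparable system in China, which integrates authoritative journals and major conference proceedings, is C-DBLP. DBLP is developed and maintained by Michael Ley at the University of Trier, Germany. It offers search over the computer science literature but stores only the metadata of each publication, such as title, authors, and publication date, and it stores that metadata as XML.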

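DBLP data is widely used in academic research, for example in author-topic analysis, community detection, relationship recommendation, link prediction, author influence analysis, and the study of research hotspots. It is highly regarded in academia, and many papers and experiments are based on it. It is also updated frequently: a new XML file is released at the beginning of each month. As of 2016-04-12 it covered more than 3.3 million papers and more than 1.7 million authors.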

XML Data Format

<inproceedings mdate="2012-09-18" key="persons/Codd74">
  <author>E. F. Codd</author>
  <title>Seven Steps to Rendezvous with the Casual User.</title>
  <year>1974</year>
  <booktitle>IFIP Working Conference Data Base Management</booktitle>
  <url>db/conf/ds/dbm74.html#Codd74</url>
  <note>IBM Research Report RJ 1333, San Jose, California</note>
</inproceedings>
<article mdate="2002-01-03" key="persons/Codd69">
  <author>E. F. Codd</author>
  <title>Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks.</title>
  <journal>IBM Research Report, San Jose, California</journal>
  <year>1969</year>
  <ee>db/labs/ibm/RJ599.html</ee>
</article>


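The XML declaration gives the encoding as ISO-8859-1 ("Latin-1"), but the file content itself consists only of ASCII characters: non-ASCII Latin characters are written as the corresponding named entities, for example é is written as &eacute;. The file contains the following record types: article, inproceedings, proceedings, book, incollection, phdthesis, mastersthesis, and www.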
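
To make the entity convention concrete, here is a minimal Python 3 sketch (the parser script further down is Python 2) that decodes such entities with the standard library's html.unescape. Treating DBLP's named entities as HTML entities is an assumption of this sketch; the authoritative definitions are in the DTD distributed with the data. The helper names and the sample author string are made up for illustration.

```python
import html


def decode_entities(raw):
    """Turn named entities such as &eacute; back into the characters they stand for."""
    return html.unescape(raw)


def is_ascii(text):
    """Return True if every character in text is plain ASCII."""
    return all(ord(ch) < 128 for ch in text)


# Hypothetical author name as it would be written in the XML file (ASCII only)
raw_author = "Jos&eacute; Example"

# The decoded form contains the real Latin-1 character
decoded = decode_entities(raw_author)
```

The raw string passes is_ascii, while the decoded one does not, which is the difference between what is stored in the file and what a parser hands back after resolving entities.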

For details on the XML format, see [the official PDF documentation] and [the DBLP XML download page].

This post shows how to parse the XML and save the result to a MySQL database.

The MySQL table used to store the data:

CREATE TABLE IF NOT EXISTS paper (
    id int(11) NOT NULL,
    ptag varchar(64) DEFAULT NULL,
    title varchar(512) DEFAULT NULL,
    author varchar(256) DEFAULT NULL,
    subtag varchar(64) DEFAULT NULL,
    sub_detail varchar(512) DEFAULT NULL,
    pyear int(11) DEFAULT NULL,
    url varchar(256) DEFAULT NULL,
    mdate varchar(32) DEFAULT NULL,
    pkey varchar(256) DEFAULT NULL,
    publtype varchar(256) DEFAULT NULL
);
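
As a rough illustration of how one parsed record maps onto these columns, the Python 3 sketch below creates the same columns in an in-memory SQLite database and inserts a batch of rows with executemany. SQLite stands in for MySQL only so that the snippet runs without a server; with a MySQL driver such as MySQLdb the placeholders would be %s instead of ?. The sample row reuses the Codd74 record shown earlier; the id value and the helper name insert_batch are made up for illustration.

```python
import sqlite3

# Same columns as the MySQL table above, written with SQLite-compatible types
SCHEMA = """
CREATE TABLE IF NOT EXISTS paper (
    id INTEGER NOT NULL,
    ptag TEXT,
    title TEXT,
    author TEXT,
    subtag TEXT,
    sub_detail TEXT,
    pyear INTEGER,
    url TEXT,
    mdate TEXT,
    pkey TEXT,
    publtype TEXT
)
"""

INSERT_SQL = (
    "INSERT INTO paper (id, ptag, title, author, subtag, sub_detail, "
    "pyear, url, mdate, pkey, publtype) VALUES (?,?,?,?,?,?,?,?,?,?,?)"
)


def insert_batch(conn, rows):
    """Insert a list of 11-tuples in one call and commit."""
    conn.executemany(INSERT_SQL, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# One row per record; None becomes SQL NULL for a missing attribute
rows = [
    (1, "inproceedings", "Seven Steps to Rendezvous with the Casual User.",
     "E._F._Codd", "booktitle", "IFIP Working Conference Data Base Management",
     1974, "db/conf/ds/dbm74.html#Codd74", "2012-09-18", "persons/Codd74", None),
]
insert_batch(conn, rows)
```

Passing the values as parameters, rather than formatting them into the SQL string, leaves quoting to the driver, which matters for titles that contain quotes.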


Parsing the XML

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import os,sys
import xml.sax
import re
import mysql_util
from mysql_util import mysqlutil
import MySQLdb

reload(sys)
sys.setdefaultencoding('utf-8')

#paper_tags = ('article','inproceedings','proceedings','book', 'incollection','phdthesis','mastersthesis','www')
paper_tags = ('article','inproceedings') ## only parse these tags
sub_tags = ('publisher', 'journal', 'booktitle')

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.id = 1            # running id assigned to each record
        self.kv = {}           # fields collected for the current record
        self.reset()
        self.util = mysqlutil()
        self.params = []       # rows waiting to be written in one batch
        self.batch_len = 10    # number of rows per batch insert

    def reset(self):
        # Clear all per-record state before a new record starts
        self.curtag = None
        self.pid = None
        self.ptag = None
        self.title = None
        self.author = None
        self.tag = None
        self.subtag = None
        self.subtext = None
        self.year = None
        self.url = None
        self.mdate = None
        self.key = None
        self.publtype = None
        self.kv = {}

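    # Handle element-start events
    def startElement(self, tag, attributes):
        if tag is not None and len(tag.strip()) > 0:
            self.curtag = tag

        if tag in paper_tags:
            # A new record starts: clear the old state and assign the next id
            self.reset()
            self.pid = self.id
            self.kv['ptag'] = str(tag)
            self.kv['id'] = self.id
            self.id += 1

            # Store the record attributes in kv under the names endElement reads
            if 'key' in attributes:
                self.key = str(attributes['key'])
                self.kv['pkey'] = self.key

            if 'mdate' in attributes:
                self.mdate = str(attributes['mdate'])
                self.kv['mdate'] = self.mdate

            if 'publtype' in attributes:
                self.publtype = str(attributes['publtype'])
                self.kv['publtype'] = self.publtype
        elif tag in sub_tags:
            # The key is 'subtag' so that it matches the lookup in endElement
            self.kv['subtag'] = str(tag)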

    # Handle element-end events
    def endElement(self, tag):
        if tag == 'title':
            self.kv['title'] = str(self.title)

        elif tag == 'author':
            # Replace spaces with underscores and collect every author of the record
            self.author = str(self.author).replace(' ', '_')
            self.kv.setdefault('author', []).append(self.author)

        elif tag in sub_tags:
            self.kv['sub_detail'] = str(self.subtext)

        elif tag == 'url':
            self.kv['url'] = str(self.url)

        elif tag == 'year':
            self.kv['year'] = str(self.year)

        elif tag in paper_tags:
            # The record is complete: build one row, with 'NULL' (or 0) for missing fields
            tid = int(self.kv.get('id', 0))
            ptag = self.kv.get('ptag', 'NULL')
            title = self.kv.get('title', 'NULL')
            # Join the collected authors; fall back to 'NULL' when there are none
            author = ','.join(self.kv['author']) if 'author' in self.kv else 'NULL'
            subtag = self.kv.get('subtag', 'NULL')
            sub_detail = self.kv.get('sub_detail', 'NULL')
            year = self.kv.get('year', 0)
            url = self.kv.get('url', 'NULL')
            mdate = self.kv.get('mdate', 'NULL')
            pkey = self.kv.get('pkey', 'NULL')
            publtype = self.kv.get('publtype', 'NULL')
            param = (str(tid), ptag, title, author, subtag, sub_detail, year, url, mdate, pkey, publtype)

            # Keep conference papers only
            if url.find('db/conf') >= 0:
                self.params.append(param)

            # Write out a full batch (never an empty one)
            if len(self.params) >= self.batch_len:
                self.flush()

    def flush(self):
        # Insert the buffered rows and clear the buffer.
        # NOTE: this writes to paper_conf with a year column, while the table defined above
        # is named paper and calls that column pyear; adjust the names to match your schema.
        print(len(self.params))
        sql = "insert into paper_conf(id, ptag, title, author, subtag, sub_detail, year, url, mdate, pkey, publtype) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        self.util.execute_sql_params(sql, self.params)
        self.params = []

    def endDocument(self):
        # Write the last, partially filled batch when parsing finishes
        if self.params:
            self.flush()

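    # Handle character data.
    # NOTE: a SAX parser may deliver one element's text in several calls
    # (for example around an entity such as &eacute;); each call here overwrites
    # the previous chunk, so only the last chunk of a split text is kept.
    def characters(self, content):
        if self.curtag == "title":
            self.title = content.strip()
        elif self.curtag == "author":
            self.author = content.strip()
        elif self.curtag in sub_tags:
            self.subtext = content.strip()
        elif self.curtag == "year":
            self.year = content.strip()
        elif self.curtag == "url":
            self.url = content.strip()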

## Usage: python parser.py dblp-2015-03-02.xml
if __name__ == "__main__":

    filename = 'test.xml'
    if len(sys.argv) == 2:
        filename = sys.argv[1]

    if not os.path.exists(filename):
        print('[%s] does not exist!' % filename)
        sys.exit(1)

    # Create an XMLReader
    parser = xml.sax.make_parser()

    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Install our ContentHandler
    handler = MovieHandler()
    parser.setContentHandler(handler)

    parser.parse(filename)
    print('Parsing complete!')
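
One thing a handler like the one above has to contend with is that a SAX parser may pass an element's text to characters() in several pieces, for instance on either side of an entity. The following Python 3 sketch is not the author's code; the class, the function, and the sample record are all made up for illustration. It shows one way to buffer the pieces and join them when the element closes. The sample declares &eacute; in an inline DTD so that it is self-contained, whereas the real dblp.xml relies on its accompanying DTD file.

```python
import xml.sax

RECORD_TAGS = ("article", "inproceedings")
FIELD_TAGS = ("author", "title", "year", "url")


class RecordHandler(xml.sax.ContentHandler):
    """Collects one dict per record, joining text that arrives in several chunks."""

    def __init__(self):
        super().__init__()
        self.records = []     # finished records
        self.current = None   # record being built, or None outside a record
        self.chunks = None    # text pieces of the field being read, or None

    def startElement(self, name, attrs):
        if name in RECORD_TAGS:
            self.current = {"ptag": name, "pkey": attrs.get("key"), "author": []}
        elif self.current is not None and name in FIELD_TAGS:
            self.chunks = []  # start buffering this field's text

    def characters(self, content):
        # Append every piece; nothing is stripped until the field is complete
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name in RECORD_TAGS:
            self.records.append(self.current)
            self.current = None
        elif self.current is not None and name in FIELD_TAGS and self.chunks is not None:
            text = "".join(self.chunks).strip()
            self.chunks = None
            if name == "author":
                self.current["author"].append(text)
            else:
                self.current[name] = text


def parse_records(xml_bytes):
    """Parse an XML document given as bytes and return the collected records."""
    handler = RecordHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.records


# Hypothetical record in the DBLP layout, with one entity in an author name
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [<!ENTITY eacute "&#233;">]>
<dblp>
<inproceedings mdate="2016-04-12" key="conf/demo/Example16">
<author>Jos&eacute; Example</author>
<author>A. N. Other</author>
<title>A Made-Up Title.</title>
<year>2016</year>
<url>db/conf/demo/demo2016.html#Example16</url>
</inproceedings>
</dblp>
"""

records = parse_records(SAMPLE)
```

Because the pieces are joined before stripping, the author name comes back whole even if the parser splits it around the entity. For the full dblp.xml file, the same handler could be passed to parser.setContentHandler() in place of the in-memory parseString call.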


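Full code: [download link, access password: ff52]

It contains the following files:

create_table.sql: creates the DBLP data table

parser.py: parses the XML into the MySQL database

mysql_util: connects to MySQL

gen_data.py: extracts the relevant subset of the data from the MySQL database

porter_stemmer.py: applies stemming to the text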

Note: this post only describes the dataset and provides the relevant links. If you repost it, please cite this post's link: /article/7727495.html