
Looking up locations in a large map POI dataset with Python: an idea borrowed from the R-tree

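2015-08-27 22:53
I was given a Tencent POI dataset for Beijing in which each concrete location carries a user-rating value (the rank field). The records look like this: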

10003977431176793344 北京东方美服装租赁公司
北京市朝阳区广渠路66号双井桥百环家园18号楼1008室(近优士阁) user
110105 116.46595
39.89033 132000
300 10003977431176793344#5927273234503450399#6055830013640664509#8088400840359825217#9374778665525650596
{"new_tag": {"category": {}, "show_tags": {}, "property": {}, "relation": {"location": {"poi": {"road": [["武圣北路", 95.57], ["八棵杨北街", 128.82], ["八棵杨中街", 160.1]], "aoi": [["双井", 0.0], ["劲松", 176.61], ["西大望路", 568.16], ["百子湾", 777.21], ["广渠门", 904.7]]}}}}}
1431008806 dianping_hezuo;landowner;raize;user
0 0
北京东方美服装租赁。东方美服装租赁 {"base_poi": "5927273234503450399", "base_poi_source": "dianping_hezuo", "alias_level": {"credible":"北京东方美服装租赁。东方美服装租赁"} }
{"short_address": "广渠路66号双井桥百环家园18号楼1008室(近优士阁)", "address_cut_tag": "1", "weak": {"latitude": "39.891405", "longitude": "116.466175"}, "rank": "203"}
北京市|11!朝阳区|12!广渠路|14!66号|16!双|0!井桥|20!百环家园|15!18号楼|16!1008室|16!(|4!近|2!优士阁|15!)|4

10012491990178971873 星宇昊超市(南定福社区卫生服务站北)
北京市平谷区 jiejing
110117 117.0225
40.13318 131200
300 10012491990178971873#11991039546089768294#12374368086444127907#5689739178564225798
{"new_tag": {"category": {}, "show_tags": {}, "property": {}, "relation": {"location": {"poi": {"town": [["马昌营镇", 112.69], ["大兴庄镇", 2121.69]], "road": [["密三路", 20.15], ["南定福庄东路", 59.05], ["南定福庄路", 76.28]], "aoi": [["马昌营", 0.0]], "village": [["南定福庄村", 305.25], ["北定福庄村", 486.73], ["薄各庄村", 1118.76]]}}}}}
1431445276 base;dianping_hezuo;jiejing;raize
0 2
{"base_poi": "5689739178564225798", "base_poi_source": "dianping_hezuo", "alias_level": {} }
{"short_address": "", "address_cut_tag": "1", "rank": "295"}
北京市|11!平谷区|12

10038066437969844323 北京盐合光仁餐饮管理有限公司
北京市朝阳区朝阳门北大街甲12号 raize
110105 116.43517
39.92814 102000
250 7997028911001436825
{"new_tag": {"category": {}, "show_tags": {}, "property": {}, "relation": {"location": {"poi": {"road": [["朝阳门北大街", 44.58], ["二环", 61.5], ["仓南胡同", 109.97]], "aoi": [["朝阳门", 29.88], ["朝外大街", 121.84], ["东四十条", 122.02], ["工体", 386.9], ["东直门", 621.44]]}}}}}
1430913867 raize
0 0
{"base_poi": "7997028911001436825", "base_poi_source": "raize", "alias_level": {} }
{"short_address": "朝阳门北大街甲12号", "address_cut_tag": "1", "rank": "136"}
北京市|11!朝阳区|12!朝阳门北大街|14!甲12号|16

10042148412976685409 中国西南资源联合开发有限公司驻北京联络处
北京市东城区新中西街2号新中大厦5层 siwei
110101 116.43829
39.93311 111000
500 1061660789693439435#15546759762400324474#17218910228363255433
{"new_tag": {"category": {}, "show_tags": {}, "property": {}, "relation": {"location": {"poi": {"road": [["新中西街", 18.7], ["工人体育场北路", 47.35], ["新中街", 118.24]], "aoi": [["工体", 0.0], ["东直门", 74.59], ["东四十条", 281.34], ["朝阳门", 521.51], ["海运仓", 547.05]]}}}}}
1431445296 base;raize;siwei
0 0
中国西南资源联合开发公司驻北京联络处 {"base_poi": "17218910228363255433", "base_poi_source": "siwei", "alias_level": {"credible":"中国西南资源联合开发公司驻北京联络处"} }
{"short_address": "新中西街2号新中大厦5层", "address_cut_tag": "1", "siweiinfo": {"kind": "A980", "name": "中国西南资源联合开发公司驻北京联络处", "srcfile": "POI", "chaincode": "", "Poi_id": "3595642", "dispalyx": "116.438290", "acenter": "", "dispalyy": "39.933110", "y": "39.933130", "x": "116.438550", "food_type": "", "prior_auth": "", "rawid": "17218910228363255433", "side": "L", "linkid": "575006"}, "master_node": {"imp_level": "-1", "subpoi_tag": "", "master_id": "5559623789279519122", "category_num": "9", "suffix": ""}, "rank": "532"}
北京市|11!东城区|12!新中西街|14!2号|16!新中大厦|15!5层|16

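The task is to take a second dataset, from Amap (高德地图), and add the rank field to each place in it that corresponds to a place in the Tencent data.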
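
In the sample records above, rank sits inside the JSON attribute blob that also holds short_address. A minimal sketch of pulling it out, assuming that blob has already been isolated from the rest of the record as a string (how the columns are separated in the raw file is not shown here, so that part is left out). The helper name extract_rank is made up for illustration.

```python
import json

# Attribute blob copied from the third sample record above.
attrs_text = '{"short_address": "朝阳门北大街甲12号", "address_cut_tag": "1", "rank": "136"}'


def extract_rank(attrs_json, default=None):
    """Return the "rank" value from one record's attribute JSON.

    Falls back to `default` when the text is not valid JSON, is not a
    JSON object, or has no "rank" key. (Hypothetical helper.)
    """
    try:
        attrs = json.loads(attrs_json)
    except ValueError:
        return default
    if not isinstance(attrs, dict):
        return default
    return attrs.get("rank", default)


print(extract_rank(attrs_text))  # 136
```

Note that rank is stored as a string in the data, so it comes back as "136" rather than a number.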

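This is harder than it sounds, for two reasons:

(1) The source text to search through is large: about 1 million lines.

(2) The two companies' POI datasets may not use exactly the same name for the same place.

How can we find the right place quickly and still accurately? One approach is the following.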

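(1) Reduce the Tencent POI data to the four fields that are actually needed (place name, longitude, latitude, rank) to cut down the data volume.

(2) Use the administrative region code to drop every shop in the Amap data that is not in Beijing.

(3) Divide Beijing into small geographic cells along the longitude and latitude axes, and put each point from step (1) into its cell on this big map. A Python dict is a hash table, so the cells are effectively stored under an index and finding the right cell takes constant time on average.

(4) For each location in the Amap data, find the cell it falls in and work out which entry in that cell has the most similar name. Similarity is measured with edit distance.

(5) If no entry is similar enough, take the one with the smallest Euclidean distance instead.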
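
Steps (3) to (5) can be sketched on a tiny scale as follows. This is an illustration under stated assumptions rather than the full script: it uses a flat dict keyed by (longitude cell, latitude cell) tuples, 0.01-degree cells, a pure-Python edit distance so that nothing has to be installed, and a 1/3 similarity threshold. The function names and the query names are invented for the example; the three records are taken from the sample data above.

```python
import math


def cell_key(lon, lat, scale=100):
    """Map a coordinate to its grid cell; scale=100 gives 0.01-degree cells."""
    return math.trunc(lon * scale), math.trunc(lat * scale)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1 minus the edit distance normalised by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / float(longest)


def build_grid(records):
    """Bucket (name, lon, lat, rank) tuples by grid cell."""
    grid = {}
    for name, lon, lat, rank in records:
        grid.setdefault(cell_key(lon, lat), {})[name] = (lon, lat, rank)
    return grid


def match_rank(grid, name, lon, lat, threshold=1.0 / 3, default_rank=None):
    """Pick a rank for a query point: best name match first, nearest entry otherwise."""
    cell = grid.get(cell_key(lon, lat), {})
    if not cell:
        return default_rank
    best = max(cell, key=lambda n: similarity(name, n))
    if similarity(name, best) > threshold:
        return cell[best][2]
    nearest = min(cell, key=lambda n: math.hypot(cell[n][0] - lon, cell[n][1] - lat))
    return cell[nearest][2]


# Three of the sample records: name, longitude, latitude, rank.
records = [
    ("北京东方美服装租赁公司", 116.46595, 39.89033, "203"),
    ("北京盐合光仁餐饮管理有限公司", 116.43517, 39.92814, "136"),
    ("中国西南资源联合开发有限公司驻北京联络处", 116.43829, 39.93311, "532"),
]
grid = build_grid(records)

print(cell_key(116.46595, 39.89033))                            # (11646, 3989)
print(match_rank(grid, "东方美服装租赁", 116.466175, 39.891405))  # 203
```

The shortened name differs from the stored one by four characters out of eleven, giving a similarity of about 0.64, which clears the threshold. A query whose name shares nothing with any entry in the cell would fall through to the nearest-point branch.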

Source code attached (feel free to laugh at it):

# -*- coding: utf-8 -*-
# Requires the third-party python-Levenshtein package.
import math

import Levenshtein

# Step 2: keep only the Amap rows whose region code (column 14) is '010', i.e. Beijing.
dirty_file = open('poi_beijing_dirty.txt')
new_file = open('poi_beijing_new.txt', 'w')
for each_line in dirty_file:
    s = each_line.split(',')
    if s[14] == '010':
        new_file.write(each_line)
dirty_file.close()
new_file.close()

# Step 3: pre-build an empty grid of 0.01-degree cells covering Beijing.
# Outer key: longitude * 100 (11520..11739); inner key: latitude * 100 (3925..4159).
Map = {}
for i in range(220):
    lon_key = i + 11520
    Map[lon_key] = {}
    for j in range(235):
        Map[lon_key][j + 3925] = {}

# Fill the grid from the reduced Tencent file: name,longitude,latitude,rank
f = open('poi_beijing.txt')
cnt = 0
for each_line in f:
    cnt = cnt + 1
    if cnt % 100 == 0:
        print(cnt)  # progress indicator
    s = each_line.split(',')
    lon_key = math.trunc(float(s[1]) * 100)
    lat_key = math.trunc(float(s[2]) * 100)
    rank = s[3].strip()
    if 11520 <= lon_key <= 11739 and 3925 <= lat_key <= 4159:
        Map[lon_key][lat_key][s[0]] = {'lon': s[1], 'lat': s[2], 'rank': rank}
f.close()

# Steps 4 and 5: for every Amap row, look up its cell and pick a rank.
g = open('poi_beijing_new.txt')
m = open('result.txt', 'w')
for each_line in g:
    # each_line keeps its trailing newline, so the rank lands on the following line.
    m.write(each_line)

    info = each_line.split(',')
    geo = info[7].split('|')  # column 7 is "longitude|latitude"
    lon = float(geo[0])
    lat = float(geo[1])
    lon_key = math.trunc(lon * 100)
    lat_key = math.trunc(lat * 100)
    similar = 0
    rank = '125'  # default used when the cell is empty
    dis = 1000000
    if 11520 <= lon_key <= 11739 and 3925 <= lat_key <= 4159:
        cell = Map[lon_key][lat_key]
        # Step 4: most similar name in the cell (normalised edit distance).
        for each_item in cell:
            leven = 1 - Levenshtein.distance(info[2], each_item) * 1.0 / max(len(info[2]), len(each_item))
            if leven > similar:
                similar = leven
                rank = cell[each_item]['rank']

        if similar > 0.3333333:
            m.write(rank)
        else:
            # Step 5: no name is close enough, so take the geographically nearest entry.
            for each_item in cell:
                x_dis = abs(float(cell[each_item]['lon']) - lon) * 10000
                y_dis = abs(float(cell[each_item]['lat']) - lat) * 10000
                O_dis = math.sqrt(x_dis * x_dis + y_dis * y_dis)
                if O_dis < dis:
                    dis = O_dis
                    rank = cell[each_item]['rank']
            m.write(rank)

    m.write('\n')
m.close()
g.close()
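
One limitation of the script above is that it only looks inside the single cell containing the query point, so a matching POI that sits just across a cell border is never considered. A possible refinement, which is not part of the original code, is to gather candidates from the surrounding block of cells as well. The sketch below assumes a flat dict keyed by (longitude cell, latitude cell) tuples instead of the nested Map, and the helper names are invented for illustration.

```python
import math


def cell_key(lon, lat, scale=100):
    """Map a coordinate to its grid cell; scale=100 gives 0.01-degree cells."""
    return math.trunc(lon * scale), math.trunc(lat * scale)


def nearby_entries(grid, lon, lat, radius=1):
    """Collect entries from the query's cell and every cell within `radius` of it.

    radius=0 reproduces the single-cell lookup; radius=1 scans a 3x3 block.
    Entries with the same name in different cells overwrite each other here.
    """
    cx, cy = cell_key(lon, lat)
    found = {}
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            found.update(grid.get((cx + dx, cy + dy), {}))
    return found


# One sample record, stored in the cell its coordinates fall into.
grid = {
    (11646, 3989): {"北京东方美服装租赁公司": {"rank": "203"}},
}

# A query point just south of that cell's border, about 40 m from the record.
same_cell_only = nearby_entries(grid, 116.4660, 39.8899, radius=0)
with_neighbours = nearby_entries(grid, 116.4660, 39.8899, radius=1)
print(len(same_cell_only), len(with_neighbours))  # 0 1
```

The candidates returned this way could then go through the same name-similarity and nearest-point checks as before, at the cost of comparing against roughly nine times as many entries per query.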