San Francisco Crime Classification (Kaggle)
2015-09-14 22:55
Predict the category of crimes that occurred in the city by the bay
From 1934 to 1963, San Francisco was infamous for housing some of the world’s most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
From Sunset to SOMA, and Marina to Excelsior, this competition’s dataset provides nearly 12 years of crime reports from across all of San Francisco’s neighborhoods. Given time and location, you must predict the category of crime that occurred.
We’re also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The top most up-voted scripts from this competition will receive official Kaggle swag as prizes.
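The task is to predict the category of crimes in San Francisco. The only available features are time and location, predictions must be submitted as probabilities, and the metric being optimized is log loss.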
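To make the metric concrete, here is a minimal sketch of multiclass log loss in plain Python. The function name `multiclass_log_loss`, the `eps` clipping value, and the toy labels and probabilities are assumptions made for illustration; the competition's exact clipping and normalization rules are not reproduced here.

```python
import math


def multiclass_log_loss(y_true, probs, classes, eps=1e-15):
    """Mean negative log of the probability assigned to each true class."""
    index = {c: i for i, c in enumerate(classes)}
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so that a probability of exactly 0 or 1 does not produce log(0).
        p = min(max(row[index[label]], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)


# Hypothetical toy example with three categories and two incidents.
classes = ["ASSAULT", "LARCENY/THEFT", "VANDALISM"]
y_true = ["LARCENY/THEFT", "ASSAULT"]
probs = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]

loss = multiclass_log_loss(y_true, probs, classes)

# A uniform guess over K classes always scores log(K).
uniform = multiclass_log_loss(y_true, [[1 / 3] * 3] * 2, classes)
```

Because the loss only looks at the probability given to the true class, a confident wrong answer is punished much more heavily than a cautious one, which is why the submission has to contain probabilities rather than hard labels.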
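The crime data is interesting in its own right. A little exploration shows, for example, that Friday has the most crimes, that certain crimes tend to occur in particular areas and at particular times, and that theft is among the most common crimes. The forum has many beautiful data visualizations.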
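This kind of quick exploration could be sketched with pandas roughly as follows. The rows below are hypothetical toy data that only reuse the column names from train.csv; they are not taken from the real dataset, so the counts illustrate the calls rather than the findings.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match train.csv.
toy = pd.DataFrame({
    "DayOfWeek": ["Friday", "Friday", "Monday", "Friday", "Sunday", "Monday"],
    "PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN", "SOUTHERN", "PARK", "MISSION"],
    "Category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT",
                 "LARCENY/THEFT", "VANDALISM", "ASSAULT"],
})

# Number of incidents per weekday and per category, most frequent first.
by_day = toy["DayOfWeek"].value_counts()
by_category = toy["Category"].value_counts()

# District-by-category table: which crimes cluster in which districts.
by_district = pd.crosstab(toy["PdDistrict"], toy["Category"])

print(by_day)
print(by_category)
print(by_district)
```

On the real data the same three calls would be applied to the DataFrame returned by `pd.read_csv('train.csv')`.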
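To predict probabilities I chose the simple naive Bayes algorithm. Many of the features are strings and need to be converted to numeric values. If each string were simply hashed to a single number, that number might not reflect the real relationship between the different categories of a feature, so I binarized them instead. Along the way I learned a very convenient way to do this: pandas' get_dummies.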
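The difference between the two encodings can be seen on a small example. The sketch below uses a hypothetical four-row series of district names, and uses pandas category codes as a stand-in for "one number per string"; the post itself only names `get_dummies`.

```python
import pandas as pd

# Hypothetical sample of district names.
districts = pd.Series(["MISSION", "PARK", "MISSION", "TENDERLOIN"],
                      name="PdDistrict")

# One number per string: this imposes an arbitrary order
# (MISSION < PARK < TENDERLOIN) that has no real meaning.
codes = districts.astype("category").cat.codes

# Binarized (one-hot) encoding: each district becomes its own 0/1 column.
# astype(int) turns the boolean columns of newer pandas versions into 0/1.
one_hot = pd.get_dummies(districts).astype(int)

print(codes.tolist())
print(one_hot)
```

With the one-hot columns, no district is numerically "between" two others, which is the property the binarized features are meant to provide.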
The final submission ranked at the 50% mark.
Code:
__author__ = 'will'

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import log_loss

"""
Data fields
Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident
X - Longitude
Y - Latitude
"""


def parse_date(date):
    # Split a 'YYYY-MM-DD HH:MM:SS' string into year, month, day, hour strings
    tem = date.split()
    day_part, time_part = tem[0], tem[1]
    tem = day_part.split('-')
    year, month, day = tem[0], tem[1], tem[2]
    tem = time_part.split(':')
    hour = tem[0]
    return year, month, day, hour


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Get binarized weekdays, districts, and parse Dates to year, month, day, hour
weekdays = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
# Materialize the map as a list so it can be iterated more than once (Python 3)
parse_res = list(map(parse_date, train.Dates.values))
years = pd.Series(data=[x[0] for x in parse_res], name='year')
months = pd.Series(data=[x[1] for x in parse_res], name='month')
days = pd.Series(data=[x[2] for x in parse_res], name='day')
hours = pd.Series(data=[x[3] for x in parse_res], name='hour')
categorys = train.Category

# Build new array
train_data = pd.concat([categorys, hours, days, months, years, weekdays, district], axis=1)

# Repeat for test data
weekdays = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
parse_res = list(map(parse_date, test.Dates.values))
years = pd.Series(data=[x[0] for x in parse_res], name='year')
months = pd.Series(data=[x[1] for x in parse_res], name='month')
days = pd.Series(data=[x[2] for x in parse_res], name='day')
hours = pd.Series(data=[x[3] for x in parse_res], name='hour')
test_data = pd.concat([hours, days, months, years, weekdays, district], axis=1)

mnb = MultinomialNB()
# Only the binarized weekday and district columns are used as features
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday',
            'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN', 'PARK', 'RICHMOND',
            'SOUTHERN', 'TARAVAL', 'TENDERLOIN']

# astype(int) converts the dummy columns to 0/1 counts for MultinomialNB
train_x = train_data[features].values.astype(int)
train_y = train_data.Category.values
mnb.fit(train_x, train_y)

# Log loss on the training set itself, as a rough sanity check
pred = mnb.predict_proba(train_x)
print(log_loss(train_y, pred))

test_x = test_data[features].values.astype(int)
predicted = mnb.predict_proba(test_x)

# Write results
result = pd.DataFrame(predicted, columns=mnb.classes_)
# result.to_csv('testResult.csv', index=True, index_label='Id')