CDMC 2016 Data Mining Competition Task: Android Malware Classification
2016-10-26 18:50
http://www.csmining.org/cdmc2016/
Data Mining Tasks Description
Task 1: 2016 e-News categorisation
This year, the dataset is sourced from six online news outlets: The New Zealand Herald (www.nzherald.co.nz), Reuters (www.reuters.com), The Times (www.timesonline.co.uk), Yahoo News (news.yahoo.com), BBC (www.bbc.co.uk) and The Press (www.stuff.co.nz).
The five selected news categories are business, entertainment, sport, technology, and travel. Each document was labelled manually by skimming the text and deciding its category. In the provided data files, each news item is a single line of plain text whose last character is the class label (training data only); all punctuation and symbols were removed when the data was prepared.
Note that the dataset text is encrypted for fair play, and decryption is not the aim of this task. Any use of decryption techniques is prohibited and must not form part of the methods used in the competition. Participants alleged to have engaged in this misconduct will have their results declared void.
The statistical information of the training dataset is summarised below:
Topic | No. of News |
Business | 361 |
Entertainment | 343 |
Sport | 363 |
Technology | 356 |
Travel | 362 |
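The line format described for Task 1 (one line per news item, last character is the class label) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_news_line` is hypothetical, the sample string is made up, and the actual symbols used for the labels are not stated in the task description.

```python
def parse_news_line(line, labelled=True):
    """Split one line of the e-News data into (text, label).

    Assumption: for training data the last character of the line is the
    class label. For unlabelled (test) data the label is returned as None.
    """
    line = line.rstrip("\r\n")
    if not labelled or not line:
        return line, None
    # Everything before the last character is the (encrypted) text.
    return line[:-1].rstrip(), line[-1]


# Made-up example line; real lines contain encrypted text.
text, label = parse_news_line("xq zzk plo 3\n")
```

The text is kept as an opaque string of tokens, which is all a bag-of-words style classifier needs, so no decryption is involved.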
Task 2: UniteCloud Operation Log for Anomaly Detection
UniteCloud is a resilient private cloud infrastructure built at Unitec Institute of Technology, New Zealand, using OpenNebula for cloud orchestration and KVM for virtualization. This dataset is operational data captured from a live UniteCloud server at a sampling interval of one minute. Each sample has 243 features, which correspond to operational measurements from 243 sensors on the UniteCloud servers. Samples are labelled according to the anomalous events and anomaly categories determined from the collected log data. The supplied training dataset contains 57,654 samples, each with 243 sensor values; the last column holds the label, and non-zero labels indicate the seven anomalous events.
The goal of this task is to identify the various abnormal events accurately from the sensor logs without high computational cost.
The statistical information of this dataset is summarized as:
No. of Sample | No. of Features | No. of Classes | No. of Training | No. of Testing |
82,363 | 243 | 8 | 57,654 | 24,709 |
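A quick way to get a feel for the Task 2 training data is to count how often each label occurs. The sketch below assumes a comma-separated text file with 243 numeric columns followed by the label column; the delimiter and the helper names `read_samples` and `label_counts` are assumptions, not part of the task description.

```python
import csv
import io
from collections import Counter

N_FEATURES = 243  # number of sensor values per sample, as stated in the task


def read_samples(handle, delimiter=","):
    """Yield (features, label) pairs; the label is taken from the last column."""
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        *values, label = row
        if len(values) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(values)}")
        yield [float(v) for v in values], int(float(label))


def label_counts(samples):
    """Count samples per label; non-zero labels mark anomalous events."""
    return Counter(label for _, label in samples)


# Synthetic rows for illustration only (not real UniteCloud data).
demo = io.StringIO(
    "\n".join(",".join(["0.5"] * N_FEATURES + [str(lbl)]) for lbl in (0, 0, 3))
)
counts = label_counts(read_samples(demo))
```

For the real file, replace `demo` with an opened file handle. The label counts show how imbalanced the anomaly classes are, which affects the choice of evaluation metric and classifier.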
Task 3: Android Malware Classification
This dataset was created from a set of APK (application package) files collected from the Opera Mobile Store between January and September 2014. Just as Windows (PC) systems use .exe files to install software, Android uses APK files to install software on the Android operating system. The permission system restricts access to privileged system resources and is considered the first barrier against malware. Application developers must explicitly declare the permissions an app needs in the AndroidManifest.xml file contained in the APK. All official Android permissions fall into four types: Normal, Dangerous, Signature and SignatureOrSystem. Because dangerous permissions grant access to restricted resources and can have a negative impact if used incorrectly, they require the user's approval at installation.
To serve as input to a machine-learning algorithm, permissions are commonly coded as binary variables, i.e., each element of the vector takes one of two values: 1 for a requested permission and 0 otherwise. The number of possible Android permissions varies with the version of the OS. In this task, for each APK file under consideration, we provide the list of permissions declared in its AndroidManifest.xml file. The class label of the APK file (+1 if it is regarded as malicious and -1 otherwise) is determined by the detection results of security appliances hosted by VirusTotal. Note that adware was not counted as malware in our setting. Participants in the CDMC 2016 competition are invited to design a classifier that best matches this result.
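The binary coding of permissions described above can be sketched as follows. The on-disk format of the permission lists is not specified here, so the example starts from plain Python lists; `build_vocabulary` and `encode` are hypothetical helper names.

```python
def build_vocabulary(apps):
    """Map every distinct permission to a column index (sorted for stability)."""
    distinct = sorted({perm for app in apps for perm in app})
    return {perm: index for index, perm in enumerate(distinct)}


def encode(permissions, vocabulary):
    """Return a 0/1 vector: 1 where the permission is requested, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for perm in permissions:
        index = vocabulary.get(perm)
        if index is not None:  # permissions not in the vocabulary are ignored
            vector[index] = 1
    return vector


# Two illustrative apps with their declared permissions.
apps = [
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    ["android.permission.INTERNET"],
]
vocabulary = build_vocabulary(apps)
vectors = [encode(app, vocabulary) for app in apps]
```

Building the vocabulary from the training set only, and ignoring unseen permissions at test time, keeps the feature space fixed between training and prediction.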
The statistical information of the dataset is summarized as:
No. of APK files | No. of Permissions | No. of Classes | No. of Training | No. of Testing |
61,730 | up to 583 | 2 | 30,920 | 30,810 |
The MD5 hashes are also provided in case you need them for checksum verification:
CDMC2016_AndroidPermissions.Train, md5(473f64d9e650e82325b1ce0216cc50c9)
CDMC2016_AndroidLabels.Train, md5(784b2ce7da61ff2935dca770c4bcbfb3)
CDMC2016_AndroidPermissions.Test, md5(192c70a8489c41fa95f5b95732fcdfb1)
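The MD5 values listed above can be checked against downloaded copies with Python's standard `hashlib` module. This is a minimal sketch that assumes the three files sit in the current directory under the names given; `md5_of_file` and `verify` are hypothetical helper names.

```python
import hashlib

# Expected digests, copied from the list above.
EXPECTED = {
    "CDMC2016_AndroidPermissions.Train": "473f64d9e650e82325b1ce0216cc50c9",
    "CDMC2016_AndroidLabels.Train": "784b2ce7da61ff2935dca770c4bcbfb3",
    "CDMC2016_AndroidPermissions.Test": "192c70a8489c41fa95f5b95732fcdfb1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_hex.lower()
```

A typical use would be `all(verify(name, digest) for name, digest in EXPECTED.items())` once the files have been downloaded.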