[Repost] A warning about the KDD Cup '99 dataset; colleagues working in this area, please take note
2014-11-29 09:57
From: Terry Brugger
Date: 15 Sep 2007
Subject: KDD Cup '99 dataset (Network Intrusion) considered harmful

Oftentimes in the scientific community, we become interested in new techniques or approaches based on characteristics of the technique or approach itself. While such investigation may be informative from a pure research standpoint, the general public -- and particularly most research sponsors -- tend to be more interested in the application of this technology. To this end, the KDD Cup Challenge has, for over ten years, provided the KDD community with datasets from real world problems to demonstrate the applicability and performance of different knowledge discovery techniques. Researchers in the computer security community (based on the tone of papers published at the time) were initially excited to see a problem from their domain adopted for the 1999 KDD Cup Challenge. Since then, however, the dataset has become widely discredited. This letter is intended to briefly outline the problems that have been cited with the KDD Cup '99 dataset, and discourage its further use.
The KDD Cup '99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by Lincoln Lab under contract to DARPA [Lippmann et al]. Since one cannot know the intention (benign or malicious) of every connection on a real world network (if we could, we would not need research in intrusion detection), the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks. It was intended to simulate the traffic seen in a medium-sized US Air Force base (and was created in collaboration with the AFRL in Rome, NY, which could be characterized as a medium-sized US Air Force base).
Based on the published description of how the data was generated, McHugh published a fairly harsh criticism of the dataset. Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic. Indeed, even a cursory examination of the data showed that the data rates were far below what would be experienced in a real medium-sized network. Nevertheless, IDS researchers continued to use the dataset (and the KDD Cup dataset that was derived from it) for lack of anything better.
In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254. This served to demonstrate to most people in the network security research community that the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and one could not draw any conclusions from any experiments run using them. Numerous researchers indicated to us (in personal conversations) that if they were reviewing a paper based solely on the DARPA dataset, they would reject it solely on that basis.
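To make the TTL artifact concrete, here is a minimal sketch of how such a "trivial" detector could look. It is not Mahoney and Chan's actual system. The records in TOY_PACKETS are hypothetical, hand-written values chosen only to mirror the TTL pattern described above, and the names classify_by_ttl, accuracy, and ttl_histogram are made up for this illustration.

```python
from collections import Counter

# Hypothetical toy records of the form (ttl, label). These are NOT taken
# from the DARPA data; they only mimic the pattern described in the letter.
TOY_PACKETS = [
    (127, "benign"), (254, "benign"), (127, "benign"),
    (126, "attack"), (253, "attack"), (254, "benign"),
]

# TTL values the letter reports for malicious packets.
ATTACK_TTLS = {126, 253}


def classify_by_ttl(ttl):
    """Label a packet using nothing but its TTL field."""
    return "attack" if ttl in ATTACK_TTLS else "benign"


def accuracy(packets):
    """Fraction of packets whose TTL-based label matches the given label."""
    correct = sum(1 for ttl, label in packets if classify_by_ttl(ttl) == label)
    return correct / len(packets)


def ttl_histogram(packets):
    """Count TTL values per label, which makes the artifact easy to spot."""
    hist = {}
    for ttl, label in packets:
        hist.setdefault(label, Counter())[ttl] += 1
    return hist


if __name__ == "__main__":
    for label, counts in ttl_histogram(TOY_PACKETS).items():
        print(label, dict(counts))
    print("accuracy:", accuracy(TOY_PACKETS))
```

On the toy records the rule is perfect by construction. The point is that a detector scoring well on data with this artifact may have learned the simulation setup rather than anything about attacks, so running a per-label histogram like this over each raw field is a cheap sanity check before trusting results on any synthetic dataset.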
Indeed, at the time we were conducting our own assessment of the DARPA dataset, using Snort [Caswell and Roesch]. Trivial detection using the TTL aside, we found that it was still useful to evaluate the true positive performance of a network IDS; however, any false positive results were meaningless [Brugger and Chow]. Anonymous reviewers at respectable information security conferences were unimpressed; one noted, "is there any interest to study the capacities of SNORT on such data?". A reviewer from another conference summarized their review with "The content of the paper is really out of date. If this paper appears five years ago, there is some value, but not much now."
While the DARPA (and KDD Cup '99) dataset has fallen from grace in the network security community, we still see it widely used in the greater KDD community. Examples from the past couple of years include [Kayacik et al.], [Sarasamma et al.], [Gao et al.], [Chan et al.], and [Zhang et al.]. While this sample doesn't necessarily represent the top-tier journals and conferences in the KDD community, they are, to the best of our knowledge, respectable, peer-reviewed publications. Obviously, the knowledge discovery researchers are well intentioned in wanting to show the usefulness of every technique imaginable to the network intrusion detection domain. Unfortunately, due to the problems with the dataset, such conclusions cannot be drawn. As a result, we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset, (2) the KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and (3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
S Terry Brugger, zow at acm dot org
UC Davis, Department of Computer Science
References
Brugger, S. T. and J. Chow (January 2007). An assessment of the DARPA IDS Evaluation Dataset using Snort. Technical Report CSE-2007-1, University of California, Davis, Department of Computer Science, Davis, CA. http://www.cs.ucdavis.edu/research/tech-reports/2007/CSE-2007-1.pdf.
Caswell, B. and M. Roesch (16 May 2004). Snort: The open source network intrusion detection system. http://www.snort.org/.
Chan, A. P., W. W. Y. Ng, D. S. Yeung, and E. C. C. Tsang (19-21 August 2005). Comparison of different fusion approaches for network intrusion detection using ensemble of RBFNN. In Proc. of 2005 Intl. Conf. on Machine Learning and Cybernetics, Volume 6, Guangzhou, China, pp. 3846-3851. IEEE.
Gao, H.-H., H.-H. Yang, and X.-Y. Wang (27-29 August 2005). Principal component neural networks based intrusion feature extraction and detection using SVM. In Advances in Natural Computation, Volume 3611 of Lecture Notes in Computer Science, Changsha, China, pp. 21-27. Springer.
Kayacik, H. G., A. N. Zincir-Heywood, and M. I. Heywood (June 2007). A hierarchical SOM-based intrusion detection system. Engineering Applications of Artificial Intelligence 20 (4), 439-451. Full text not available; analysis based on detailed abstract.
Lippmann, R. P., D. J. Fried, I. Graf, J. W. Haines, K. Kendall, D. McClung, D. Weber, S. Webster, D. Wyschogrod, R. K. Cunningham, and M. Zissman (January 2000). Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In Proc. of the DARPA Information Survivability Conference and Exposition, Los Alamitos, CA. IEEE Computer Society Press.
Mahoney, M. V. and P. K. Chan (8-10 September 2003). An analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for network anomaly detection. In G. Vigna, E. Jonsson, and C. Krugel (Eds.), Proc. 6th Intl. Symp. on Recent Advances in Intrusion Detection (RAID 2003), Volume 2820 of Lecture Notes in Computer Science, Pittsburgh, PA, pp. 220-237. Springer.
McHugh, J. (2000). Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Information System Security 3 (4), 262-294.
Sarasamma, S. T., Q. A. Zhu, and J. Huff (April 2005). Hierarchical Kohonenen net for anomaly detection in network security. IEEE Trans. Syst., Man, Cybern. B 35 (2), 302-312.
Zhang, C., J. Jiang, and M. Kamel (May 2005). Intrusion detection using hierarchical neural networks. Pattern Recognition Letters 26 (6), 779-791.
All opinions expressed are solely the view of the author(s), and are not necessarily shared or endorsed by The University of California, Davis, or their employer(s).
Original source:
http://www.kdnuggets.com/news/2007/n18/4i.html