Notes: De-anonymizing Programmers via Code Stylometry
2016-06-13 21:14
Essay Information
De-anonymizing Programmers via Code Stylometry
Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt.
Usenix Security Symposium, 2015
Source code stylometry
Everyone learns to code on an individual basis and, as a result, codes in a unique style, which makes de-anonymization possible.
Software engineering insights
Programmer style changes while implementing sophisticated functionality.
Coding style differs between programmers with different skill sets.
Identify malicious programmers.
Scenario 1 : Who wrote this code?
Alice analyzes a library with malicious source code.
Bob has a source code collection with known authors.
Bob will search his collection to find Alice's adversary.
Scenario 2 : Who wrote this code?
Alice got an extension for her programming assignment.
Bob, the teacher, has everyone else's code.
Bob wants to see if Alice plagiarized.
Comparison to related work
Machine learning workflow
Abstract Syntax Trees (AST)
Stylometry can be applied to source code to identify the author of a program.
Extract layout and lexical features from source code.
Abstract syntax trees (AST) in code represent the structure of the program.
Preprocess source code to obtain AST.
Parse AST to extract coding style features.
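The AST steps above can be sketched in a few lines. This is only an illustration under loud assumptions: it uses Python's built-in `ast` module on Python source rather than the tooling and language used in the paper, and the feature names (`freq_<NodeType>`, `max_depth`) and the helper `ast_features` are hypothetical.

```python
import ast
from collections import Counter


def ast_features(source: str) -> dict:
    """Return node-type frequencies and the maximum tree depth.

    Hypothetical helper: the feature names are illustrative and are not
    the feature set defined in the paper.
    """
    # Step 1: preprocess the source code to obtain its AST.
    tree = ast.parse(source)

    # Step 2: walk the AST and count how often each node type occurs.
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    features = {f"freq_{name}": n / total for name, n in counts.items()}

    # Depth of the tree: deeply nested code yields a larger value.
    def depth(node: ast.AST) -> int:
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    features["max_depth"] = depth(tree)
    return features


sample = "def add(a, b):\n    return a + b\n"
feats = ast_features(sample)
for name in sorted(feats):
    print(name, feats[name])
```

Two programs that compute the same result can still differ in these values, for example when one author nests expressions and another introduces temporary variables.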
Feature Extraction
Code Stylometry Feature Set (CSFS)
Lexical features (Extract from source code)
Layout features (Extract from source code)
Syntactic features (Extract from ASTs)
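Lexical and layout features can be computed directly from the raw text, without parsing. The sketch below shows a handful of such measurements on a C++-like snippet. The specific features and their names are assumptions chosen for illustration and do not reproduce the paper's exact feature list.

```python
import re


def layout_lexical_features(source: str) -> dict:
    """Compute a few illustrative layout and lexical features.

    All feature names here are hypothetical examples.
    """
    lines = source.splitlines()
    n_chars = max(len(source), 1)
    n_lines = max(len(lines), 1)
    words = re.findall(r"[A-Za-z_]\w*", source)
    n_words = max(len(words), 1)

    return {
        # Layout: how the code is arranged on the page.
        "tab_ratio": source.count("\t") / n_chars,
        "space_ratio": source.count(" ") / n_chars,
        "empty_line_ratio": sum(1 for l in lines if not l.strip()) / n_lines,
        "open_brace_own_line_ratio": sum(1 for l in lines if l.strip() == "{") / n_lines,
        # Lexical: what tokens the author uses and how often.
        "avg_line_length": sum(len(l) for l in lines) / n_lines,
        "if_per_word": words.count("if") / n_words,
        "for_per_word": words.count("for") / n_words,
        "unique_word_ratio": len(set(words)) / n_words,
    }


sample = (
    "int main()\n"
    "{\n"
    "\tfor (int i = 0; i < 3; i++) {\n"
    "\t\tif (i) return i;\n"
    "\t}\n"
    "\n"
    "\treturn 0;\n"
    "}\n"
)
feats = layout_lexical_features(sample)
for name, value in feats.items():
    print(f"{name}: {value:.3f}")
```

The sample mixes both brace placements on purpose: one opening brace sits on its own line and one trails the `for` statement, so `open_brace_own_line_ratio` reflects only the former.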
Feature Selection
Features are selected with WEKA's information gain criterion, which evaluates the difference between the entropy of the distribution of classes and the entropy of the conditional distribution of classes given a particular feature:
$$IG(A, M_i) = H(A) - H(A \mid M_i)$$
where $A$ is the class corresponding to an author, $H$ is Shannon entropy, and $M_i$ is the $i$-th feature of the dataset.
Intuitively, the information gain can be thought of as measuring the amount of information that the observation of the value of feature i gives about the class label associated with the example.
To reduce the total size and sparsity of the feature vector, we retained only those features that individually had non-zero information gain.
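A small worked example may make the criterion concrete. The sketch below computes the information gain of each feature as the class entropy minus the conditional class entropy given the feature, then keeps only features with non-zero gain. It assumes the feature values are already discrete, and the toy data, feature names, and function names are made up for illustration; it is not WEKA's implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, authors) -> float:
    """H(A) - H(A | M_i) for one discrete feature M_i."""
    groups = defaultdict(list)
    for value, author in zip(feature_values, authors):
        groups[value].append(author)
    total = len(authors)
    # Conditional entropy: entropy within each feature value, weighted by frequency.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(authors) - conditional


# Toy data (hypothetical): four files by two authors.
authors = ["alice", "alice", "bob", "bob"]
features = {
    "uses_tabs": [1, 1, 0, 0],  # separates the two authors perfectly
    "has_main": [1, 1, 1, 1],   # constant, so it says nothing about the author
}

gains = {name: information_gain(vals, authors) for name, vals in features.items()}
# Keep only features with non-zero gain (small tolerance for float error).
selected = [name for name, gain in gains.items() if gain > 1e-12]
print(gains)
print(selected)
```

Here `uses_tabs` removes all uncertainty about the author (a gain of 1 bit), while the constant feature `has_main` has zero gain and is dropped.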
Random Forest Classification
Method
Use random forest as the machine learning classifier
avoid over-fitting
multi-class classifier by nature
K-fold cross validation
Validate method on a different dataset
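The k-fold evaluation loop can be sketched without any external library. The code below splits samples into stratified folds, so every author appears in every fold, and reports the accuracy over all held-out samples. To stay dependency-free it uses a 1-nearest-neighbor placeholder as the classifier, which is an assumption of this sketch and not the paper's method; in practice a random forest implementation (for example scikit-learn's `RandomForestClassifier`) would be plugged in at that point. All function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each author across folds."""
    by_author = defaultdict(list)
    for idx, author in enumerate(labels):
        by_author[author].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_author.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def nearest_neighbor_predict(train_x, train_y, x):
    """Placeholder classifier (1-NN). NOT a random forest; swap in a real one."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]


def cross_validate(X, y, k):
    """Return accuracy over all samples, each predicted while held out."""
    correct = 0
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train_x = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        correct += sum(
            nearest_neighbor_predict(train_x, train_y, X[i]) == y[i]
            for i in held_out
        )
    return correct / len(X)


# Toy feature vectors (hypothetical): two authors with well-separated styles.
X = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [1.0, 0.9], [0.9, 1.0], [0.95, 0.95]]
y = ["alice", "alice", "alice", "bob", "bob", "bob"]
accuracy = cross_validate(X, y, k=3)
print(f"accuracy: {accuracy:.2f}")
```

Stratifying by author matters for attribution: if a fold held all of one author's files, the classifier would be tested on an author it had never seen during training.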
Future work
Multiple authorship detection
Multiple author identification
Anonymizing source code
obfuscation is not the answer
Stylometry in executable binaries
authorship attribution