
Notes: De-anonymizing Programmers via Code Stylometry

2016-06-13 21:14

Essay Information

De-anonymizing Programmers via Code Stylometry

Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt.

USENIX Security Symposium, 2015

Source code stylometry

Everyone learns to code individually and, as a result, codes in a unique style, which makes de-anonymization possible.

Software engineering insights

Programmer style changes when implementing sophisticated functionality.

Coding style differs between programmers with different skill sets.

Identify malicious programmers.

Scenario 1: Who wrote this code?

Alice analyzes a library containing malicious source code.

Bob has a source code collection with known authors.

Bob searches his collection to find Alice’s adversary.

Scenario 2: Who wrote this code?

Alice got an extension for her programming assignment.

Bob, the teacher, has everyone else’s code.

Bob wants to see whether Alice plagiarized.

Comparison to related work



Machine learning workflow



Abstract Syntax Trees (AST)



Stylometry can be applied to source code to identify the author of a program.

Extract layout and lexical features from source code.

An abstract syntax tree (AST) represents the syntactic structure of a program.

Preprocess the source code to obtain its AST.

Parse the AST to extract coding style features.
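The two steps above (obtain an AST, then walk it to collect style features) can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in `ast` module is swapped in because it ships with the language, and it is not the parser used in the paper. The function name `ast_features` and the two features chosen (node-type frequencies and maximum tree depth) are hypothetical picks for the example.

```python
import ast
from collections import Counter


def ast_features(source: str) -> dict:
    """Return simple syntactic features of a Python source string."""
    tree = ast.parse(source)

    # Frequency of each AST node type, e.g. "For", "If", "Call".
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

    # Maximum nesting depth of the tree, computed recursively.
    def depth(node: ast.AST) -> int:
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    return {"node_counts": node_counts, "max_depth": depth(tree)}


# Hypothetical sample program used only to exercise the function.
sample = "def f(xs):\n    for x in xs:\n        if x:\n            print(x)\n"
features = ast_features(sample)
```

Two authors who solve the same task with different control structures would produce different node-type counts and depths, which is the kind of signal the syntactic features are meant to capture.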

Feature Extraction

### Code Stylometry Feature Set (CSFS)

Lexical features (extracted from source code)



Layout features (extracted from source code)



Syntactic features (extracted from ASTs)
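As a rough illustration of the lexical and layout categories, the sketch below computes a handful of such features directly from the source text. The feature names, the keyword list in `KEYWORDS`, and the function `layout_and_lexical_features` are assumptions made for this example, not the paper's exact feature definitions. The tokenization is deliberately crude (it also counts words inside string literals).

```python
import re

# Hypothetical keyword list for the example; not the paper's full list.
KEYWORDS = ("if", "else", "for", "while", "do", "switch")


def layout_and_lexical_features(source: str) -> dict:
    lines = source.splitlines()
    n_chars = max(len(source), 1)  # guard against empty input
    n_lines = max(len(lines), 1)

    # Layout: how the author uses whitespace and indentation.
    indented = [ln for ln in lines if ln[:1] in (" ", "\t")]
    tab_indented = sum(1 for ln in indented if ln.startswith("\t"))
    layout = {
        "whitespace_ratio": sum(ch.isspace() for ch in source) / n_chars,
        "empty_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "tabs_lead_lines": tab_indented > len(indented) / 2,
    }

    # Lexical: word-level statistics taken from the raw text.
    words = re.findall(r"[A-Za-z_]\w*", source)
    lexical = {
        "avg_line_length": sum(len(ln) for ln in lines) / n_lines,
        "keyword_freq": {kw: words.count(kw) / n_chars for kw in KEYWORDS},
        "unique_words": len(set(words)),
    }
    return {**layout, **lexical}


# Hypothetical C-like snippet, indented with tabs.
sample = (
    "int main() {\n"
    "\tfor (int i = 0; i < 3; i++) {\n"
    "\t\tif (i) puts(\"x\");\n"
    "\t}\n"
    "\n"
    "\treturn 0;\n"
    "}\n"
)
feats = layout_and_lexical_features(sample)
```

Dividing the counts by file length keeps the features comparable between short and long files.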



Feature Selection

Features are selected with WEKA’s information gain criterion, which evaluates the difference between the entropy of the distribution of classes and the entropy of the conditional distribution of classes given a particular feature:

$$IG(A, M_i) = H(A) - H(A \mid M_i)$$

where $A$ is the class corresponding to an author, $H$ is Shannon entropy, and $M_i$ is the $i$-th feature of the dataset.

Intuitively, the information gain can be thought of as measuring the amount of information that the observation of the value of feature i gives about the class label associated with the example.

To reduce the total size and sparsity of the feature vector, the authors retained only those features that individually had non-zero information gain.
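The selection rule can be illustrated with a small, self-contained computation of information gain, defined as the entropy of the author labels minus their conditional entropy given one feature. This is a sketch, not WEKA's implementation: it assumes each feature already takes discrete values, and the toy data and names (`uses_tabs`, `has_main`) are invented for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, authors):
    """H(authors) - H(authors | feature) for one discrete feature."""
    groups = defaultdict(list)
    for value, author in zip(feature_values, authors):
        groups[value].append(author)
    total = len(authors)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(authors) - conditional


# Toy dataset: four files, two authors (hypothetical values).
authors = ["alice", "alice", "bob", "bob"]
candidate_features = {
    "uses_tabs": [1, 1, 0, 0],  # separates the two authors perfectly
    "has_main": [1, 1, 1, 1],   # constant, so it carries no information
}

gains = {
    name: information_gain(values, authors)
    for name, values in candidate_features.items()
}

# Keep only features with non-zero information gain (small tolerance
# for floating-point noise).
kept = [name for name, gain in gains.items() if gain > 1e-12]
```

Here `uses_tabs` removes all uncertainty about the author, while the constant `has_main` removes none and is dropped.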



Random Forest Classification

Method

Use random forest as the machine learning classifier

less prone to over-fitting

multi-class classifier by nature

K-fold cross validation

Validate method on a different dataset
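The evaluation step, k-fold cross validation, can be sketched without any third-party library. To keep the example dependency-free, a 1-nearest-neighbour rule stands in for the classifier; this is an assumption of the sketch and not the random forest the notes describe, which would take its place in `cross_validate`. All function names and the toy feature vectors are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbour_predict(train_X, train_y, x):
    """Stand-in classifier: return the label of the closest training vector."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]


def cross_validate(X, y, k):
    """Mean accuracy over k folds; each fold is the test set exactly once.

    Assumes len(X) >= k so that no fold is empty.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            nearest_neighbour_predict(train_X, train_y, X[i]) == y[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Toy feature vectors for two authors (hypothetical values).
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
y = ["alice", "alice", "alice", "bob", "bob", "bob"]
accuracy = cross_validate(X, y, k=3)
```

Because every file is held out exactly once, the reported accuracy is never measured on a file the classifier was trained on.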

Future work

Multiple authorship detection

Multiple author identification

Anonymizing source code

obfuscation is not the answer

Stylometry in executable binaries

authorship attribution