Learning from others' mistakes: Data-driven code analysis

Post on 18-Jul-2015

Data-driven code analysis: Learning from others' mistakes

Andreas Dewes (@japh44)

andreas@quantifiedcode.com

13.04.2015

PyCon 2015 – Montreal

About

Physicist and Python enthusiast

CTO of a spin-off of the University of Munich (LMU):

We develop software for data-driven code analysis.

Our mission

Tools & Techniques for Ensuring Code Quality

             static                       dynamic

automated    Static analysis /            Unit testing
             automated code reviews       System testing
                                          Integration testing

manual       Manual code reviews          Debugging
                                          Profiling
                                          ...

Discovering problems in code

def encode(obj):
    """Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON."""
    e = {}
    for key, value in obj:
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imaginary}
    return e

d = {'a': 1j+4, 's': {'d': 4+5j}}
print encode(d)

Iterating over obj yields only the keys of the dictionary (obj.items() is needed).

value.imaginary does not exist (value.imag would be correct).
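
For comparison, here is a minimal sketch of encode() with only the two fixes named above applied. It is written for Python 3 (print as a function), whereas the slides use Python 2. The json round trip at the end is an addition for illustration, not part of the talk.

```python
import json


def encode(obj):
    """Encode a (possibly nested) dict of complex values for JSON."""
    e = {}
    for key, value in obj.items():      # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}   # fix 2: the attribute is called .imag
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
encoded = encode(d)
print(json.dumps(encoded))
```

As in the original, values that are neither dicts nor complex numbers are silently dropped; the sketch does not change that behavior.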

Dynamic Analysis (e.g. unit testing)

def test_encode():
    d = {'a': 1j+4,
         's': {'d': 4+5j}}

    r = encode(d)  # this will fail...

    assert r['a'] == {'type': 'complex', 'r': 4, 'i': 1}
    assert r['s']['d'] == {'type': 'complex', 'r': 4, 'i': 5}
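
A test like this reports one failure per run. The following Python 3 sketch illustrates that: it runs the same assertions against the slide's function and against a half-fixed variant. The names encode_buggy, encode_half_fixed and run_test are hypothetical helpers introduced here, not part of the talk.

```python
def encode_buggy(obj):
    """The slide's encode() with both bugs left in place."""
    e = {}
    for key, value in obj:                  # bug 1: iterating a dict yields keys only
        if isinstance(value, dict):
            e[key] = encode_buggy(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imaginary}  # bug 2: no such attribute
    return e


def encode_half_fixed(obj):
    """Same function with only the iteration bug fixed."""
    e = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode_half_fixed(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imaginary}  # bug 2 is still present
    return e


def run_test(encode):
    """Run the slide's assertions; return the exception name, or 'ok'."""
    d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
    try:
        r = encode(d)
        assert r['a'] == {'type': 'complex', 'r': 4, 'i': 1}
        assert r['s']['d'] == {'type': 'complex', 'r': 4, 'i': 5}
    except Exception as exc:
        return type(exc).__name__
    return 'ok'


first_failure = run_test(encode_buggy)
second_failure = run_test(encode_half_fixed)
print(first_failure, second_failure)
```

The second bug only becomes visible once the first is fixed, because the test has to execute the faulty line before it can report it.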

Static Analysis (for humans)

encode is a function with 1 parameter which always returns a dict.

(I): obj should be an iterator/list of tuples with two elements.

encode gets called with a dict, which does not satisfy (I).

A value of type complex does not have an .imaginary attribute!

encode is called with a dict, which again does not satisfy (I).

How static analysis tools work (short version)

1. Compile the code into a data structure, typically an abstract syntax tree (AST).

2. (Optionally) annotate it with additional information to make analysis easier.

3. Traverse the (AST) data to find problems.
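
The three steps can be sketched with Python's standard ast module. This is a Python 3 illustration only, not how any particular tool is implemented. The SUSPICIOUS rule set and the enclosing_function helper are hypothetical, and the check is purely name-based because no type information is computed here.

```python
import ast

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj:
        if isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imaginary}
    return e
'''

# Step 1: compile the source text into an abstract syntax tree.
tree = ast.parse(SOURCE)

# Step 2: annotate the tree; here every node gets a link to its parent.
for parent in ast.walk(tree):
    for child in ast.iter_child_nodes(parent):
        child.parent = parent


def enclosing_function(node):
    """Follow the parent links up to the nearest function definition."""
    while node is not None and not isinstance(node, ast.FunctionDef):
        node = getattr(node, 'parent', None)
    return node.name if node is not None else None


# Step 3: traverse the tree and report suspicious attribute names.
SUSPICIOUS = {'imaginary'}  # hypothetical rule set, for illustration only
findings = [
    (node.lineno, node.attr, enclosing_function(node))
    for node in ast.walk(tree)
    if isinstance(node, ast.Attribute) and node.attr in SUSPICIOUS
]
print(findings)
```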

Python Tools for Static Analysis

PyLint (most comprehensive tool): http://www.pylint.org/

PyFlakes (smaller, less verbose): https://pypi.python.org/pypi/pyflakes

Pep8 (style and some structural checks): https://pypi.python.org/pypi/pep8

(... and many others)

Limitations of current tools & technologies

Checks are hard to create / modify... (example: PyLint code for analyzing 'try/except' statements)

Long feedback cycles

Rethinking code analysis for Python

Our approach

1. Code is data! Let's not keep it in text files but store it in a useful form that we can work with easily (e.g. a graph).

2. Make it super-easy to specify errors and bad code patterns.

3. Make it possible to learn from user feedback and publicly available code.

Building the Code Graph

[Figure: the encode example drawn as a graph. Node labels: functiondef, for, assign, name, dict. Edge labels: body, targets, iterator.]

[Figure: the same graph with properties attached, e.g. {name: 'encode', args: [...]}, {id: 'e'}, {i: 0} and {i: 1}; additional edge label: value.]

[Figure: the same graph with the identifiers e4fa76b..., a76fbc41..., c51fa291... and 74af219... attached to nodes, and the annotation $type: dict.]
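
One way to obtain such a structure is to convert the tree produced by Python's ast module into nested dicts and lists. The sketch below (Python 3) does this under an assumed layout: the node_type key and the lower-cased type names are chosen to resemble the labels on the slides, but the exact format used in the talk is not shown there.

```python
import ast


def to_graph(node):
    """Convert an ast node into nested dicts and lists (hypothetical layout)."""
    if isinstance(node, ast.AST):
        graph = {'node_type': type(node).__name__.lower()}
        for field, value in ast.iter_fields(node):
            graph[field] = to_graph(value)
        return graph
    if isinstance(node, list):
        return [to_graph(item) for item in node]
    return node  # plain values such as identifiers and constants


module = to_graph(ast.parse("e = {}"))
assign = module['body'][0]
print(assign['node_type'],
      assign['targets'][0]['node_type'],
      assign['targets'][0]['id'],
      assign['value']['node_type'])
```

Each dict is a node, each key that holds another dict or list is a labeled edge, and the remaining keys are node properties.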

Example: Tornado Project

[Figure: graph of 10 modules from the tornado project, showing modules, classes and functions.]

Advantages

- Simple detection of (exact) duplicates

- Semantic diffing of modules, classes, functions, ...

- Semantic code search on the whole tree
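
Exact-duplicate detection can be sketched by hashing each node together with the hashes of its children, so that identical subtrees get identical hashes. The talk does not say how its node identifiers are computed; the scheme below (SHA-1 over type, fields and child hashes) is an assumption made for illustration, written for Python 3.

```python
import ast
import hashlib


def node_hash(node):
    """Hash a node from its type, its plain fields and its children's hashes."""
    if isinstance(node, ast.AST):
        parts = [type(node).__name__]
        for field, value in ast.iter_fields(node):
            parts.append(field + '=' + node_hash(value))
        payload = '(' + ','.join(parts) + ')'
    elif isinstance(node, list):
        payload = '[' + ','.join(node_hash(item) for item in node) + ']'
    else:
        payload = repr(node)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()


SOURCE = '''
def first(x):
    return x[0]

def head(x):
    return  x[0]   # same body, different name and spacing

def last(x):
    return x[-1]
'''

# Group function names by the hash of their body.
by_body = {}
for func in ast.parse(SOURCE).body:
    by_body.setdefault(node_hash(func.body), []).append(func.name)

duplicates = [names for names in by_body.values() if len(names) > 1]
print(duplicates)
```

Line numbers, spacing and comments are not part of the hashed fields, so only the structure of the code decides whether two bodies count as duplicates.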

Describing Code Errors / Anti-Patterns

Code issues = patterns on the graph

[Figure: the expression value.imaginary as a pattern on the graph: an attribute node with attr {id: imaginary} and a value edge to a name node {id: value} whose $type is complex.]

Using YAML to describe graph patterns

node_type: attribute
value:
  $type: complex
attr: imaginary
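
How such a pattern might be applied can be sketched with a small recursive matcher over dict-based nodes. This is a Python 3 illustration, not the talk's implementation: the pattern is written as a Python dict instead of being loaded from YAML, the matches function is hypothetical, and the $type annotation on the node is assumed to have been filled in by an earlier analysis step.

```python
def matches(pattern, node):
    """True if every key of `pattern` is present in `node` with a matching value."""
    if isinstance(pattern, dict):
        return isinstance(node, dict) and all(
            key in node and matches(sub, node[key])
            for key, sub in pattern.items()
        )
    return pattern == node


# The YAML pattern above, written as a Python dict.
pattern = {
    'node_type': 'attribute',
    'value': {'$type': 'complex'},
    'attr': 'imaginary',
}

# Hand-built graph nodes for `value.imaginary` and `value.imag`.
bad_node = {
    'node_type': 'attribute',
    'attr': 'imaginary',
    'value': {'node_type': 'name', 'id': 'value', '$type': 'complex'},
}
good_node = dict(bad_node, attr='imag')

print(matches(pattern, bad_node), matches(pattern, good_node))
```

The node may carry more keys than the pattern; only the keys named in the pattern are compared.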

Generalizing patterns

node_type: attribute
value:
  $type: complex
attr:
  $not:
    $or: [real, imag]
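
The operators can be sketched as extra cases in a recursive matcher. The Python 3 code below is an assumption about their semantics ($not negates a sub-pattern, $or accepts any alternative), with the pattern written as a Python dict and the attribute names real and imag used as the allowed values.

```python
def matches(pattern, node):
    """Match `node` against `pattern`, supporting $not and $or (sketch)."""
    if isinstance(pattern, dict):
        if '$not' in pattern:
            return not matches(pattern['$not'], node)
        if '$or' in pattern:
            return any(matches(alt, node) for alt in pattern['$or'])
        return isinstance(node, dict) and all(
            key in node and matches(sub, node[key])
            for key, sub in pattern.items()
        )
    return pattern == node


pattern = {
    'node_type': 'attribute',
    'value': {'$type': 'complex'},
    'attr': {'$not': {'$or': ['real', 'imag']}},
}


def attribute_node(attr):
    """Hand-built graph node for `value.<attr>` on a complex value."""
    return {
        'node_type': 'attribute',
        'attr': attr,
        'value': {'node_type': 'name', 'id': 'value', '$type': 'complex'},
    }


results = {attr: matches(pattern, attribute_node(attr))
           for attr in ('real', 'imag', 'imaginary')}
print(results)
```

The generalized pattern now flags any attribute name other than the listed ones, instead of one specific misspelling.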

Learning from feedback / false positives

"else" in for loop without break statement

node_type: for
body:
  $not:
    $anywhere:
      node_type: break
orelse:
  $anything: {}

values = ["foo", "bar", ... ]

for i, value in enumerate(values):
    if value == 'baz':
        print "Found it!"
else:
    print "didn't find 'baz'!"
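
A sketch of how this pattern could be evaluated against real code, in Python 3. The dict layout produced by to_graph, the meaning given to $anywhere (search all descendants) and to $anything (at least one node present) are assumptions made here; the talk does not define them.

```python
import ast


def to_graph(node):
    """Convert an ast node into nested dicts and lists (hypothetical layout)."""
    if isinstance(node, ast.AST):
        graph = {'node_type': type(node).__name__.lower()}
        for field, value in ast.iter_fields(node):
            graph[field] = to_graph(value)
        return graph
    if isinstance(node, list):
        return [to_graph(item) for item in node]
    return node


def descendants(value):
    """Yield every dict node reachable from `value`, including itself."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from descendants(child)
    elif isinstance(value, list):
        for item in value:
            yield from descendants(item)


def matches(pattern, value):
    if isinstance(pattern, dict):
        if '$not' in pattern:
            return not matches(pattern['$not'], value)
        if '$anywhere' in pattern:
            return any(matches(pattern['$anywhere'], d)
                       for d in descendants(value))
        if '$anything' in pattern:
            return bool(value)  # assumption: "at least one node is present"
        return isinstance(value, dict) and all(
            key in value and matches(sub, value[key])
            for key, sub in pattern.items()
        )
    return pattern == value


PATTERN = {
    'node_type': 'for',
    'body': {'$not': {'$anywhere': {'node_type': 'break'}}},
    'orelse': {'$anything': {}},
}


def flagged(source):
    """Return all graph nodes in `source` that match PATTERN."""
    graph = to_graph(ast.parse(source))
    return [node for node in descendants(graph) if matches(PATTERN, node)]


NO_BREAK = '''
for i, value in enumerate(values):
    if value == 'baz':
        print("Found it!")
else:
    print("didn't find 'baz'!")
'''

WITH_BREAK = NO_BREAK.replace('print("Found it!")', 'break')

print(len(flagged(NO_BREAK)), len(flagged(WITH_BREAK)))
```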

Learning from false positives (I)

values = ["foo", "bar", ... ]

for i, value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        return value
else:
    print "didn't find 'baz'!"

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

Learning from false positives (II)

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
          exclude:
            node_type:
              $or: [while, for]
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

values = ["foo", "bar", ... ]

for i, value in enumerate(values):
    if value == 'baz':
        print "Found it!"
    for j in ...:
        #...
        break
else:
    print "didn't find 'baz'!"
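
The two refinements can also be expressed directly on Python's ast, which makes their intent easy to test. The sketch below (Python 3) is not the pattern engine from the talk: contains and useless_else_loops are hypothetical helpers. A break counts only if it is not nested inside another for or while loop, while a return counts anywhere in the body.

```python
import ast


def contains(stmts, wanted, skip=()):
    """True if a `wanted` node occurs under `stmts`, without entering `skip` nodes."""
    stack = list(stmts)
    while stack:
        node = stack.pop()
        if isinstance(node, wanted):
            return True
        if skip and isinstance(node, skip):
            continue  # do not look inside nested loops
        stack.extend(ast.iter_child_nodes(node))
    return False


def useless_else_loops(source):
    """Line numbers of for loops whose else branch always runs."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.For) and node.orelse:
            has_break = contains(node.body, ast.Break,
                                 skip=(ast.For, ast.While))
            has_return = contains(node.body, ast.Return)
            if not (has_break or has_return):
                hits.append(node.lineno)
    return hits


WITH_RETURN = '''
def find(values):
    for value in values:
        if value == 'baz':
            return value
    else:
        print("didn't find 'baz'!")
'''

NESTED_BREAK = '''
for value in values:
    for j in range(3):
        break
else:
    print("didn't find 'baz'!")
'''

print(useless_else_loops(WITH_RETURN), useless_else_loops(NESTED_BREAK))
```

The loop with a return is no longer reported, while the loop whose only break belongs to an inner loop still is.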

patterns vs. code

(no exception type specified)

handlers:
  node_type: excepthandler
  type: null
node_type: tryexcept

(empty exception handler)

handlers:
  - body:
      - node_type: pass
    node_type: excepthandler
node_type: tryexcept
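
For comparison, the same two checks written as ordinary code against Python's ast module. This is a Python 3 sketch: there the statement node is ast.Try rather than tryexcept, and check_handlers is a hypothetical helper, not code from PyLint or from the talk.

```python
import ast

SOURCE = '''
try:
    risky()
except:
    handle()

try:
    risky()
except ValueError:
    pass
'''


def check_handlers(source):
    """Report bare except clauses and handlers whose body is a single pass."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if node.type is None:
                issues.append((node.lineno, 'no exception type specified'))
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                issues.append((node.lineno, 'empty exception handler'))
    return issues


issues = check_handlers(SOURCE)
print(issues)
```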

Summary & Feedback

1. Storing code as a graph opens up many interesting possibilities. Let's stop thinking of code as text!

2. We can learn from user feedback or even use machine learning to create and adapt code patterns!

3. Everyone can write code checkers! => crowd-source code quality!

Thanks!

www.quantifiedcode.com

https://github.com/quantifiedcode

@quantifiedcode

Andreas Dewes (@japh44)

andreas@quantifiedcode.com

Visit us at booth 629!