
© 2016 Hakan Aksu, Ralf Lämmel, Software Languages Team

Mining Software Data

Hakan Aksu, Ralf Lämmel

University of Koblenz-Landau Faculty of Computer Science

Software Languages Team

Creative Commons License: softlang logos by Wojciech Kwasnik, Archina Void, Ralf Lämmel, Software Languages Team, Faculty of Computer Science, University of Koblenz-Landau is licensed under a Creative Commons Attribution 4.0 International License

SOFTLANG

© 2016-2017 Hakan Aksu, Ralf Lämmel, Software Languages Team

Mining Software Data

‘Mining Software Data’ concerns several fields of computer science:

• Information Retrieval (IR)
• Machine Learning
• Mining Software Repositories
• Natural Language Processing


Overview Information Retrieval (IR)


• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). (http://nlp.stanford.edu/IR-book/pdf/01bool.pdf)

• ‘Documents’ in 101project
  • wiki text (pages or sections) and
  • source-code units with
    • program identifiers and
    • comments


IR scenario

• Objective: find source-code units that implement a specific feature, e.g., ‘Total’.
• Method: search source code for characteristic terms, e.g., ‘total’.
• Challenges:
  • Distinguishing feature implementation from testing.
  • Dealing with variation in natural language usage.
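The scenario above can be sketched as a plain term search. This is only a minimal illustration: the file names and file contents are hypothetical, and `find_units` is a helper invented for this sketch, not part of 101project.

```python
def find_units(units, term):
    """Return the names of all units whose text contains the term (case-insensitive)."""
    term = term.lower()
    return [name for name, text in units.items() if term in text.lower()]

# Hypothetical source-code units (name -> content)
units = {
    "Company.java": "public double total() { /* sum of all salaries */ }",
    "CompanyTest.java": "assertEquals(399747.0, company.total(), 0);",
    "Employee.java": "public void setSalary(double salary) { }",
}

hits = find_units(units, "Total")
print(hits)
```

The search also returns the test class, which illustrates the first challenge: a characteristic term alone does not distinguish feature implementation from testing.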


Performance and correctness measures in IR

• Precision is the fraction of the retrieved documents that are relevant to the user's information need.
• Recall is the fraction of the documents relevant to the query that are successfully retrieved.
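Both measures can be computed from two sets: the retrieved documents and the relevant documents. The following is a minimal sketch; the document names are hypothetical, and returning 0.0 for an empty set is an assumption of this sketch.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved and a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: one of two retrieved units is relevant,
# and one of two relevant units was found.
precision, recall = precision_recall(
    retrieved=["Company.java", "CompanyTest.java"],
    relevant=["Company.java", "Department.java"],
)
print(precision, recall)
```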


Overview Machine Learning



Types of Machine Learning

• Supervised learning

• Unsupervised learning

• Reinforcement learning



Types of Machine Learning

Supervised learning (e.g., prediction models) is the machine learning task of inferring a function from labeled training data. The computer is presented with example inputs and their desired outputs. The goal is to learn a general rule that maps inputs to outputs.

[Figure: the labeled data is split into training data, used for learning rules, and test data, used to test the rules and determine, e.g., precision & recall.]
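A minimal sketch of this workflow, assuming a very simple learner (nearest centroid) and made-up data: each code unit is described by two hypothetical features (number of assertions, number of loops) and labeled as ‘test’ or ‘impl’. The rule is learned from the training data and then checked on the test data.

```python
def train(examples):
    """Learn one centroid (average feature vector) per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical labeled data: (assertions, loops) -> label
training_data = [((5, 0), "test"), ((4, 1), "test"), ((0, 3), "impl"), ((1, 4), "impl")]
test_data = [((6, 1), "test"), ((0, 2), "impl")]

model = train(training_data)
correct = sum(1 for features, label in test_data if predict(model, features) == label)
accuracy = correct / len(test_data)
print(accuracy)
```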


Types of Machine Learning

Unsupervised learning (e.g., cluster analysis) is the machine learning task of inferring a function to describe hidden structure from unlabeled data. No labels are given to the learning algorithm, leaving it on its own to find structure in its input.

[Figure: possible cluster structures; one color stands for one cluster.]

Source: https://de.wikipedia.org/wiki/Clusteranalyse


Types of Machine Learning

Reinforcement learning:

• Inspired by behaviorist psychology

• The algorithm learns by reward and punishment, like a human

• Example: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal.

• Another example is learning to play a game by playing against an opponent.


Overview Natural Language Processing (NLP)


Definition 1

Natural language processing is a branch of artificial intelligence that deals with analyzing, understanding and generating the languages that humans use naturally in order to interface with computers in both written and spoken contexts using natural human languages instead of computer languages. http://www.webopedia.com/TERM/N/NLP.html (April, 2016)



Definition 2

Natural language processing is a method to translate between computer and human languages. It is a method of getting a computer to understandably read a line of text without the computer being fed some sort of clue or calculation. In other words, NLP automates the translation process between computers and humans. https://www.techopedia.com/definition/653/natural-language-processing-nlp (April, 2016)



Definition 3

Computer understanding, analysis, manipulation, and/or generation of natural language. This can refer to anything from fairly simple string-manipulation tasks like stemming, or building concordances of natural language texts, to higher-level AI-like tasks like processing user queries in natural language.

Source: "natural language processing". Dictionary.com. The Free On-line Dictionary of Computing. Denis Howe. http://www.dictionary.com/browse/natural-language-processing (accessed: April, 2016).


Some tasks in NLP

• Machine translation
  • Translate text from one language to another
• Morphological segmentation
  • Separate words into individual morphemes and identify the class of the morphemes
• Natural language generation
  • Convert information from databases into readable human language
• Part-of-speech tagging
  • Given a sentence, determine the part of speech (e.g., noun, verb or adjective) for each word
• Question answering
  • Given a human-language question, determine its answer
• Relationship extraction
  • Given a text, identify the relationships among named entities (e.g., who is father to whom)
• Word sense disambiguation
  • Many words have more than one meaning; identify the meaning of words
• Text simplification
• Text-to-speech
• Natural language search
• Text-proofing
• …

Not much of this in this course!


Overview Mining Software Repositories (MSR)


Mining Software Repositories

• The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories.
• Analysis of
  • version control repositories
  • mailing list archives
  • bug tracking systems
  • issue tracking systems, etc.
• to uncover information about software systems, projects and software engineering.

https://en.wikipedia.org/wiki/Mining_Software_Repositories


Current MSR research in the Software Languages Team

• Analysis of software repositories to identify developer experience in different scopes, such as experience in API usage or skills in specific frameworks (e.g., Django)


Preparing programs for NLP and data mining



NL data in 101project

• Program identifiers
• Comments
• Wiki text
• Commit messages
• GitHub issues
• GitHub revisions


Program identifiers

• class/type/interface names
• method names
• variable/parameter names


Comments

• comment blocks
• single comments


Wiki text

[Figure: a wiki page consisting of a description in natural language, source code, and more information.]


NL preprocessing



NL preprocessing tasks

• After preparing programs for data mining (e.g., scanning, fact extraction and/or program identifier splitting), we need some NL preprocessing steps.
• Preprocessing is used to prepare the raw data for the analysis itself.
• There are many methods, such as
  • part-of-speech tagging
  • lemmatization
  • text simplification
  • word sense disambiguation
  • …
• We are interested in
  • ‘Tokenizing’ program identifiers (splitting)
  • ‘Tokenizing’
  • ‘Stemming’
  • Removing ‘Stop Words’


Splitting

Split a program identifier into single words.

Example: ‘getSalary’ is split into ‘get’ and ‘Salary’.
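Identifier splitting can be sketched with a regular expression. This is one possible heuristic, not the method used in the course: it assumes camelCase and snake_case conventions, and `split_identifier` is a name chosen for this sketch.

```python
import re

def split_identifier(identifier):
    """Split a program identifier into words at underscores and case changes."""
    words = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Acronyms (e.g. 'XML'), capitalized or lowercase words, and digit runs
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return words

print(split_identifier("getSalary"))
print(split_identifier("parseXMLFile"))
print(split_identifier("total_salary"))
```

The resulting words can then be lowercased and stemmed like ordinary tokens.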


Tokenizing

• Separate a text (string) into its tokens.
• Example input:
  • ‘Natural language processing is fun.’
• Result:
  • ‘Natural’, ‘language’, ‘processing’, ‘is’, ‘fun’, ‘.’
• Best practice is to work with lowercased tokens and without punctuation (normalization of tokens).
• Normalized result:
  • ‘natural’, ‘language’, ‘processing’, ‘is’, ‘fun’

Tokenizing in Python

• Tokenizing with NLTK (without punctuation, with lowercase letters)

    import nltk
    import string

    # Maps every punctuation character to None, so that translate() deletes it
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

    def tokenize(text):
        # Remove punctuation, tokenize, then lowercase every token
        tokens = nltk.word_tokenize(text.translate(remove_punctuation_map))
        tokens = [w.lower() for w in tokens]
        return tokens

  Limited alternative for removing punctuation (works only for Python 2 byte strings): text.translate(None, string.punctuation)

• Tokenizing with TextBlob (only tokenizing; normalization is also possible)

    from textblob import TextBlob

    def tokenize(text):
        text = TextBlob(text)
        tokens = text.words
        return tokens


Stemming

• Stemming is the process of reducing a word to its stem.
• The stem or root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.
• Examples
  • ‘fish’, ‘fishes’ and ‘fishing’ stem to ‘fish’, which is a correct word.
  • ‘study’, ‘studies’ and ‘studying’ stem to ‘studi’, which is not an English word.
• Most commonly, stemming algorithms (a.k.a. stemmers) are based on rules for suffix stripping.


Stemming

• The most famous algorithm is the Porter stemmer, introduced in 1979.
• A more aggressive stemming algorithm is the Lancaster stemmer, introduced in 1990.
• Python libraries
  • NLTK
    • doc: http://www.nltk.org/api/nltk.stem.html
    • demo: http://www.nltk.org/book/ch03.html (Section 3.6)
  • PyStemmer - https://github.com/snowballstem/pystemmer


Stemming in Python

• Stemming with NLTK

    from nltk.stem.porter import PorterStemmer

    def stem(tokens):
        stemmer = PorterStemmer()  # create the stemmer once and reuse it
        stems = []
        for item in tokens:
            stems.append(stemmer.stem(item))
        return stems

• Stemming with PyStemmer

    import Stemmer

    def stem(tokens):
        stemmer = Stemmer.Stemmer('english')
        stems = stemmer.stemWords(tokens)  # stems a list of words
        return stems

  stemmer.stemWord('WordXYZ') stems a single word.


Stop words

• Stop words are usually extremely common words in a language which are filtered out before processing of natural language data. There is no single universal list of stop words used by all NLP tools, but here are some common English stop words: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with, …
• Some stop word lists (a.k.a. stop lists):
  • http://snowball.tartarus.org/algorithms/english/stop.txt
  • http://xpo6.com/list-of-english-stop-words/
• Python library:
  • NLTK - http://www.nltk.org/book/ch02.html (Section 4.1)


Stop words in Python

• Remove stop words (with NLTK)

    from nltk.corpus import stopwords

    def remove_stopwords(tokens):
        # Distinct name, so that the imported 'stopwords' corpus is not shadowed
        stop_list = set(stopwords.words('english'))
        content = [w for w in tokens if w not in stop_list]
        return content

• You can use an alternative stop list; in that case you don't need NLTK.
• If you use the stop list of NLTK, install the 'stopwords' corpus with nltk.download('stopwords').

Source: http://www.nltk.org/book/ch02.html


Mining Software Data

• We apply some techniques of NLP, data mining (IR, machine learning), and metaprogramming to scenarios with program identifiers, program comments, or documentation as data.
• Selection of techniques:
  • Java Parser
  • Regular Expression Matching
  • IDF
  • Cosine similarity
  • Clustering
  • Prediction model
  • Sentiment analysis
  • Correlation
  • “Plotting”


Java Parser



Java Parser

Given a source file, JavaParser recognizes the different syntactic elements and produces an Abstract Syntax Tree (AST).

https://github.com/javaparser/javaparser


JavaParser

[Figure: JavaParser parses a .java file and returns an AST.]


Iterate over AST with JavaParser


Iterate over AST

• via recursive method
  • process the current node (instanceof queries and casts for specific nodes)
  • call the method recursively on the child nodes
• via Visitor
  • call the accept method with a specific Visitor
  • process specific nodes in the corresponding visit methods


Java Parser scenario

How can we print all method names of a .java file?


Via recursive method

[Code figure with the following annotations:]
• Current node: at the beginning, the AST root (CompilationUnit)
• instanceof query and cast for MethodDeclaration nodes
• prints the method names
• recursive call for the child nodes


Via Visitor

[Code figure with the following annotations:]
• calls the accept method with a specific Visitor
• the visit method for the specific node
• prints the method names
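The two traversal styles can be illustrated with an analogous sketch in Python. Note the assumptions: this uses Python's standard `ast` module on Python source, not JavaParser on Java source; `ast.FunctionDef` plays the role of MethodDeclaration, and `ast.NodeVisitor` plays the role of the Visitor. The sample class is made up.

```python
import ast

SOURCE = """
class Company:
    def total(self):
        return 0

    def cut(self):
        pass
"""

def method_names_recursive(node, found=None):
    """Recursive style: check the type of the current node, then recurse into the children."""
    found = [] if found is None else found
    if isinstance(node, ast.FunctionDef):
        found.append(node.name)
    for child in ast.iter_child_nodes(node):
        method_names_recursive(child, found)
    return found

class MethodNameVisitor(ast.NodeVisitor):
    """Visitor style: the visit method for a specific node type does the processing."""
    def __init__(self):
        self.names = []

    def visit_FunctionDef(self, node):
        self.names.append(node.name)
        self.generic_visit(node)  # continue with nested definitions

tree = ast.parse(SOURCE)  # the root node of the AST
visitor = MethodNameVisitor()
visitor.visit(tree)

print(method_names_recursive(tree))
print(visitor.names)
```

Both styles visit the same nodes; the Visitor avoids the explicit type checks at the cost of one class per analysis.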


Regular Expression Matching



Regular Expression with Java’s Regex-API

Characters
  x       The character x
  \\      The backslash character
  \t      The tab character ('\u0009')
  \n      The newline (line feed) character ('\u000A')
  \r      The carriage-return character ('\u000D')
  \f      The form-feed character ('\u000C')
  \a      The alert (bell) character ('\u0007')
  \e      The escape character ('\u001B')
  \cx     The control character corresponding to x

Character classes
  [abc]            a, b, or c (simple class)
  [^abc]           Any character except a, b, or c (negation)
  [a-zA-Z]         a through z or A through Z, inclusive (range)
  [a-d[m-p]]       a through d, or m through p: [a-dm-p] (union)
  [a-z&&[def]]     d, e, or f (intersection)
  [a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
  [a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)

Greedy quantifiers
  X?        X, once or not at all
  X*        X, zero or more times
  X+        X, one or more times
  X{n}      X, exactly n times
  X{n,}     X, at least n times
  X{n,m}    X, at least n but not more than m times

More: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html


Regular Expression with Java’s Regex-API

Typical invocation sequence:

    Pattern p = Pattern.compile("a*b");   // define regular expression
    Matcher m = p.matcher("aaaaab");      // define String input
    boolean b = m.matches();              // matching test

More: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
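For comparison, the same three steps can be sketched with Python's standard `re` module. This is an analogy, not the Java API: `fullmatch` is used here because, like `Matcher.matches()`, it requires the whole input to match. The second pattern and its input string are made up for illustration.

```python
import re

pattern = re.compile(r"a*b")          # define regular expression
match = pattern.fullmatch("aaaaab")   # define string input and test the whole input
is_match = match is not None
print(is_match)

# A character class with a greedy quantifier: runs of letters
words = re.findall(r"[a-zA-Z]+", "getSalary 42 total")
print(words)
```

Python's `re` module does not support the `&&` intersection syntax of Java character classes.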


Design Pattern Detection with JavaParser and Regular Expression Matching


Tools for Design Pattern Detection

[Figure: overview of tools for design pattern detection.]

Source: https://boa.unimib.it/retrieve/handle/10281/31515/43100/phd_unimib_055259.pdf


Design Pattern Detection

We focus on simple design pattern detection with
• Java Parser and
• Regular Expression Matching


Example

How can we detect the Singleton Pattern?



Constraints for Singleton Pattern

We identify a Singleton class by the following criteria. The class
1. has class constructors, regardless of accessibility,
2. has a static reference, regardless of accessibility, to the Singleton class, and
3. has a public static method that returns the Singleton class type.

http://dblp.uni-trier.de/rec/html/conf/kbse/ShiO06


Singleton Pattern Detection with Java Parser


DPD with Java Parser

We identify a Singleton class by the following criteria. The class
1. has class constructors, regardless of accessibility,
2. has a static reference, regardless of accessibility, to the Singleton class, and
3. has a public static method that returns the Singleton class type.

[Code figure with the following annotated steps: parse the file, use the Visitor, check constraints 1 to 3.]


DPD with Java Parser

[Code figure: specific Visitor for Singleton pattern detection, with the following annotated parts: save the class name, constraint 1, constraint 3, constraint 2.]


Singleton Pattern Detection with regular expression matching


DPD with regular expression matching

[Code figure with the following annotations: regular expression for constraint 1; match the regular expressions for constraints 1 to 3 in the file.]

API: java.util.regex.*
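The three Singleton constraints can be approximated with regular expressions over the source text. The following is a rough sketch in Python's `re` module rather than java.util.regex; the patterns are heuristics written for this illustration (they ignore comments, string literals, nested classes and unusual modifier orders), and the Java sample is made up.

```python
import re

JAVA_SOURCE = """
public class Registry {
    private static Registry instance;
    private Registry() { }
    public static Registry getInstance() {
        if (instance == null) { instance = new Registry(); }
        return instance;
    }
}
"""

def looks_like_singleton(source):
    """Check the three Singleton constraints with heuristic regular expressions."""
    class_match = re.search(r"\bclass\s+(\w+)", source)
    if class_match is None:
        return False
    name = re.escape(class_match.group(1))  # save the class name
    access = r"(?:(?:public|protected|private)\s+)?"
    constraints = [
        # 1. a constructor, regardless of accessibility
        r"\b" + access + name + r"\s*\([^)]*\)\s*\{",
        # 2. a static reference to the class, regardless of accessibility
        r"\b" + access + r"static\s+(?:final\s+)?" + name + r"\s+\w+\s*[;=]",
        # 3. a public static method that returns the class type
        r"\bpublic\s+static\s+(?:final\s+|synchronized\s+)*" + name + r"\s+\w+\s*\(",
    ]
    return all(re.search(pattern, source) for pattern in constraints)

print(looks_like_singleton(JAVA_SOURCE))
```

Compared with the parser-based approach, this needs no AST but is easier to fool, e.g., by code inside comments.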


Inverse Document Frequency (IDF)



TF-IDF scenario

• Extract the vocabulary from wiki text (including stemming and stop list application).
• Compute TF-IDF from wiki pages (‘documents’).
• We get a ranked list for every document.
  • Do the first ranked terms ‘characterize’ the document?
  • Are the first ranked terms important for all documents?

[Figure: one ranked list per document (document 1, document 2, document 3), for example:]

  word     TF-IDF
  word1    5.4
  word2    2.9
  …        …


Inverse Document Frequency (IDF)

$$idf_t = \log \frac{N}{df_t}$$

N = total number of documents
df_t = number of documents that contain a term t
idf_t = inverse document frequency of a term t

If a word appears in many documents, then it is not a unique identifier; this word has a low IDF score.


TF-IDF

• TF-IDF stands for "Term Frequency, Inverse Document Frequency". It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
• Intuitively...
  • If a word appears frequently in a document, it's important. Give the word a high score.
  • If a word appears in many documents, it's not a unique identifier. Give the word a low score.

$$tfidf_{d,t} = tf_{d,t} \cdot \log \frac{N}{df_t}$$

tf_{d,t} = number of occurrences of term t in a document d
N = total number of documents
df_t = number of documents that contain a term t
tfidf_{d,t} = TF-IDF score of a term t in document d
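A small worked example of the two formulas, as a sketch under assumptions: the three tokenized ‘documents’ are made up, and the natural logarithm is used because the slides do not fix the base of the logarithm.

```python
import math

# Hypothetical tokenized documents (already stemmed, stop words removed)
documents = [
    ["total", "salary", "company"],
    ["cut", "salary", "salary"],
    ["company", "department", "employee"],
]

def idf(term, docs):
    """idf_t = log(N / df_t); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    """tfidf_{d,t} = tf_{d,t} * log(N / df_t)"""
    return doc.count(term) * idf(term, docs)

# 'salary' occurs in 2 of 3 documents, 'total' in only 1 of 3
print(idf("salary", documents))
print(idf("total", documents))
# 'salary' occurs twice in the second document
print(tfidf("salary", documents[1], documents))
```

The rarer term ‘total’ gets the higher IDF, and a term that occurred in every document would get an IDF of log(1) = 0.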


TF-IDF in Python

Different ways to calculate the ‘term frequency’:

    import math

    def term_frequency(term, tokenized_document):
        return tokenized_document.count(term)

    def sublinear_term_frequency(term, tokenized_document):
        count = tokenized_document.count(term)
        if count == 0:
            return 0  # avoids math.log(0) for terms that are absent from the document
        return 1 + math.log(count)

    def augmented_term_frequency(term, tokenized_document):
        max_count = max([term_frequency(t, tokenized_document) for t in tokenized_document])
        return 0.5 + (0.5 * term_frequency(term, tokenized_document)) / max_count

Source: http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html


TF-IDF in Python

Calculating IDF

    import math

    def inverse_document_frequencies(tokenized_documents):
        idf_values = {}
        all_tokens_set = set([item for sublist in tokenized_documents for item in sublist])
        for tkn in all_tokens_set:
            # One boolean per document; summing them counts the documents that contain tkn
            contains_token = map(lambda doc: tkn in doc, tokenized_documents)
            # Variant that adds 1 to log(N / df_t); the true division assumes Python 3
            idf_values[tkn] = 1 + math.log(len(tokenized_documents) / sum(contains_token))
        return idf_values

[Figure: tokenized_documents is a list of sublists, one per document; each sublist contains the items (tokens) of that document.]

TF-IDF in Python

Calculating TF-IDF for every term in all documents

    def tfidf(documents):
        tokenized_documents = [tokenize(d) for d in documents]
        idf = inverse_document_frequencies(tokenized_documents)
        tfidf_documents = []
        for document in tokenized_documents:
            doc_tfidf = []
            for term in idf.keys():
                tf = sublinear_term_frequency(term, document)
                doc_tfidf.append(tf * idf[term])
            tfidf_documents.append(doc_tfidf)
        return tfidf_documents




TF-IDF in Python

Calculating TF-IDF with scikit-learn

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tokenizer: reference to the tokenizing and stemming method
    # stop_words: defines the stop words
    sklearn_tfidf = TfidfVectorizer(tokenizer=process, stop_words='english')

    sklearn_representation = sklearn_tfidf.fit_transform(all_documents)



Cosine similarity



Cosine scenario

For each method or class scope of each contribution, find the most similar scope in another contribution. The vector of a scope is based on the term frequency (after preprocessing) of the program identifiers (or comments) in the scope. The terms to be included in the vector could be selected in different ways.

We could consider the top-n terms from a (TF-)IDF analysis.
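The cosine similarity of two term-frequency vectors is their dot product divided by the product of their lengths. A minimal sketch with plain term frequencies, assuming made-up token lists for two scopes; returning 0.0 for an empty scope is an assumption of this sketch.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms that occur in both scopes contribute to the dot product
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical preprocessed identifiers of three method scopes
scope_1 = ["get", "salari", "employe"]
scope_2 = ["get", "salari", "manag"]
scope_3 = ["cut", "depart"]

print(cosine_similarity(scope_1, scope_2))
print(cosine_similarity(scope_1, scope_3))
```

Scopes that share no terms get a similarity of 0; identical term distributions get a similarity of 1.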


Cosine similarity in Python

    import nltk, string
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = nltk.stem.porter.PorterStemmer()
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

    def stem_tokens(tokens):
        return [stemmer.stem(item) for item in tokens]

    def normalize(text):
        # Lowercase, remove punctuation, tokenize, and stem
        return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

    vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

    def cosine_sim(text1, text2):
        # Calculate the TF-IDF vectors (L2-normalized by default)
        tfidf = vectorizer.fit_transform([text1, text2])
        # Calculate the cosine similarity: dot product of the normalized vectors
        return (tfidf @ tfidf.T).toarray()[0, 1]

    print(cosine_sim('a little bird', 'a little bird'))
    print(cosine_sim('a little bird', 'a little bird chirps'))
    print(cosine_sim('a little bird', 'a big dog barks'))
