Mining Software Data - Uni Koblenz-Landau

Mining Software Data

Hakan Aksu, Ralf Lämmel

University of Koblenz-Landau Faculty of Computer Science

Software Languages Team

Creative Commons License: softlang logos by Wojciech Kwasnik, Archina Void, Ralf Lämmel, Software Languages Team, Faculty of Computer Science, University of Koblenz-Landau is licensed under a Creative Commons Attribution 4.0 International License

SOFTLANG

Mining Software Data

Information Retrieval (IR)

Machine Learning

Mining Software Repositories

Natural Language Processing

‘Mining Software Data’ concerns several fields of computer science

Overview Information Retrieval (IR)

• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). (http://nlp.stanford.edu/IR-book/pdf/01bool.pdf)

• ‘Documents’ in 101project
  • wiki text (pages or sections) and
  • source-code units with
    • program identifiers and
    • comments

IR scenario

• Objective: find source-code units that implement a specific feature, e.g., ‘Total’.

• Method: search source code for characteristic terms, e.g., ‘total’.

• Challenges:

• Distinguish feature implementation and testing.

• Dealing with variation in natural language usage.

Performance and correctness measures in IR

• Precision is the fraction of the retrieved documents that are relevant to the user's information need.

• Recall is the fraction of the relevant documents that are successfully retrieved.
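The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a minimal sketch; the file names are hypothetical and only illustrate the ‘Total’ scenario.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result for the term 'total' and a hand-made relevance list
retrieved = ["Total.java", "TotalTest.java", "Cut.java", "Company.java"]
relevant = ["Total.java", "Company.java", "Employee.java"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved files are relevant (precision 0.5), and two of the three relevant files were found (recall 2/3).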

Overview Machine Learning

Types of Machine Learning

• Supervised learning

• Unsupervised learning

• Reinforcement learning

Types of Machine Learning

Supervised learning (e.g., prediction models) is the machine learning task of inferring a function from labeled training data. The computer is presented with example inputs and their desired outputs. The goal is to learn a general rule that maps inputs to outputs.

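[Figure: the labeled data is split into training data and test data. Rules are learned from the training data; the rules are then tested on the test data to determine, e.g., precision & recall.]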
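The split into training data and test data can be sketched with a deliberately tiny learner. Everything below is an illustrative assumption: the single numeric feature, the data points, and the threshold rule are made up and are not part of the course material.

```python
def learn_threshold(training_data):
    """Learn the rule 'value >= threshold means positive' from (value, label) pairs."""
    positives = [x for x, label in training_data if label]
    negatives = [x for x, label in training_data if not label]
    # Place the threshold halfway between the two class means
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict(threshold, value):
    return value >= threshold


# Hypothetical labeled data: (feature value, label)
training_data = [(1.0, False), (2.0, False), (3.0, False),
                 (7.0, True), (8.0, True), (9.0, True)]
test_data = [(2.5, False), (4.0, False), (6.0, True), (8.5, True)]

threshold = learn_threshold(training_data)            # learn the rule
predictions = [predict(threshold, x) for x, _ in test_data]
correct = sum(p == label for p, (_, label) in zip(predictions, test_data))
accuracy = correct / len(test_data)                   # test the rule on unseen data
print(threshold, accuracy)
```

The rule is learned only from the training data and evaluated only on the test data, so the score says something about how well the rule generalizes.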

Types of Machine Learning

Unsupervised learning (e.g., cluster analysis) is the machine learning task of inferring a function to describe hidden structure from unlabeled data. No labels are given to the learning algorithm, leaving it on its own to find structure in its input.

[Figure: possible cluster structures; one color stands for a cluster. Source: https://de.wikipedia.org/wiki/Clusteranalyse]

Types of Machine Learning

Reinforcement learning:

• Inspired by behaviorist psychology

• The algorithm learns by reward and punishment, much like a human

• Example: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal.

• Another example is learning to play a game by playing against an opponent.

Overview Natural Language Processing

Definition 1

Natural language processing is a branch of artificial intelligence that deals with analyzing, understanding and generating the languages that humans use naturally in order to interface with computers in both written and spoken contexts using natural human languages instead of computer languages. http://www.webopedia.com/TERM/N/NLP.html (April, 2016)

Definition 2

Natural language processing is a method to translate between computer and human languages. It is a method of getting a computer to understandably read a line of text without the computer being fed some sort of clue or calculation. In other words, NLP automates the translation process between computers and humans. https://www.techopedia.com/definition/653/natural-language-processing-nlp (April, 2016)

Definition 3

Computer understanding, analysis, manipulation, and/or generation of natural language. This can refer to anything from fairly simple string-manipulation tasks like stemming, or building concordances of natural language texts, to higher-level AI-like tasks like processing user queries in natural language. natural language processing. Dictionary.com. The Free On-line Dictionary of Computing. Denis Howe. http://www.dictionary.com/browse/natural-language-processing (accessed: April, 2016).

Some tasks in NLP

• Machine translation
  • Translate text from one language to another
• Morphological segmentation
  • Separate words into individual morphemes and identify the class of the morphemes
• Natural language generation
  • Convert information from databases into readable human language
• Part-of-speech tagging
  • Given a sentence, determine the part of speech (e.g., noun, verb or adjective) for each word
• Question answering
  • Given a human-language question, determine its answer
• Relationship extraction
  • Given a text, identify the relationships among named entities (e.g., who is father to whom)
• Word sense disambiguation
  • Many words have more than one meaning; identify the meaning of words in context
• Text simplification
• Text-to-speech
• Natural language search
• Text-proofing
• …

Not much of this in this course!

Overview Mining Software Repositories

Mining Software Repositories

• The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories.

• Analysis of
  • version control repositories
  • mailing list archives
  • bug tracking systems
  • issue tracking systems, etc.

• to uncover information about software systems, projects and software engineering.

https://en.wikipedia.org/wiki/Mining_Software_Repositories

Current MSR research in the Software Languages Team

• Analysis of Software Repositories to identify developer experience in different scopes like experience in API usage or skills in specific frameworks (e.g., Django)

Preparing programs for NLP and data mining

NL data in 101project

• Program identifiers
• Comments
• Wiki text
• Commit messages
• GitHub issues
• GitHub revisions

Program identifier

• Class/Type/Interface names
• method names
• variable/parameter names

Comments

comment blocks

single comments

Wiki text

Description in natural language

Source Code

More information

NL preprocessing

NL preprocessing tasks

• After preparing programs for data mining (e.g., scanning, fact extraction and/or program identifier splitting), we need some NL preprocessing steps.

• Preprocessing is used to prepare the raw data for the analysis itself.

• There are many methods, such as
  • part-of-speech tagging
  • lemmatization
  • text simplification
  • word sense disambiguation
  • …

• We are interested in
  • Program identifier ‘Tokenizing’ (i.e., splitting)
  • ‘Tokenizing’
  • ‘Stemming’
  • Removing ‘Stop Words’

Splitting

Split program identifiers into single words.

Example: „getSalary“ is split into „get“ and „Salary“.
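Identifier splitting can be sketched with regular expressions. The function name `split_identifier` and the exact splitting rules (camelCase, underscores, digit runs, acronyms) are assumptions made for illustration; the slides do not prescribe a particular algorithm.

```python
import re


def split_identifier(identifier):
    """Split a camelCase, PascalCase or snake_case identifier into its words."""
    words = []
    # First cut at underscores and other non-word characters
    for part in re.split(r"[_\W]+", identifier):
        # Then cut at case changes: acronyms, capitalized or lowercase words, digit runs
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return words


print(split_identifier("getSalary"))
print(split_identifier("parseXMLFile"))
print(split_identifier("total_salary"))
```

The first call reproduces the slide example and yields `get` and `Salary`; the other two show how acronyms and underscores are handled under the rules assumed here.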

Tokenizing

• Separate a text (string) into its tokens

• Example input:
  • „Natural language processing makes fun.“

• Result:
  • „Natural“, „language“, „processing“, „makes“, „fun“, „.“

• A common practice is to work with lowercased tokens and without punctuation (normalization of tokens).

• Normalized result:
  • „natural“, „language“, „processing“, „makes“, „fun“

Tokenizing in Python

• Tokenizing with NLTK (with lowercase letters, without punctuation)

    import nltk
    import string

    # Map every punctuation character to None so that translate() deletes it.
    # Limited alternative (Python 2 str only): text.translate(None, string.punctuation)
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

    def tokenize(text):
        # tokenize without punctuation
        tokens = nltk.word_tokenize(text.translate(remove_punctuation_map))
        # with lowercase letters
        tokens = [w.lower() for w in tokens]
        return tokens

• Tokenizing with TextBlob (only tokenizing; normalization also possible)

    from textblob import TextBlob

    def tokenize(text):
        text = TextBlob(text)
        tokens = text.words
        return tokens

Stemming

• Stemming is the process of reducing a word to its stem.

• The stem or root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.

• Example
  • fish, fishes and fishing stem to fish, which is a correct word.
  • study, studies and studying stem to studi, which is not an English word.

• Most commonly, stemming algorithms (a.k.a. stemmers) are based on rules for suffix stripping.
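To make the idea of rule-based suffix stripping concrete, here is a toy stemmer. It is not the Porter algorithm; the rule table and the minimum stem length are assumptions chosen only so that the two examples above come out as stated.

```python
# (suffix, replacement) pairs, tried in order; an illustrative rule set only
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", "")]


def toy_stem(word):
    stem = word
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            break
    # Final rule, loosely modelled on Porter's "y -> i" step
    if stem.endswith("y") and len(stem) > 2:
        stem = stem[:-1] + "i"
    return stem


for w in ["fish", "fishes", "fishing", "study", "studies", "studying"]:
    print(w, "->", toy_stem(w))
```

A real stemmer has many more rules and conditions; for actual analyses, a library stemmer is the better choice.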

Stemming

• The most famous algorithm is the Porter stemmer, introduced in 1979.

• A more aggressive stemming algorithm is the Lancaster stemmer, introduced in 1990.

• Python libraries
  • NLTK
    • doc: http://www.nltk.org/api/nltk.stem.html
    • demo: http://www.nltk.org/book/ch03.html (Section 3.6)
  • PyStemmer - https://github.com/snowballstem/pystemmer

Stemming in Python

• Stemming with NLTK

    from nltk.stem.porter import PorterStemmer

    def stem(tokens):
        stemmer = PorterStemmer()
        stems = []
        for item in tokens:
            stems.append(stemmer.stem(item))
        return stems

• Stemming with PyStemmer

    import Stemmer

    def stem(tokens):
        stemmer = Stemmer.Stemmer('english')
        stems = stemmer.stemWords(tokens)
        return stems

    # stemmer.stemWord('WordXYZ') stems single words

Stop words

• Stop words are usually extremely common words in a language that are filtered out before natural language data is processed. There is no single universal list of stop words used by all NLP tools, but here are some common English stop words: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with, …

• Some stop word lists (a.k.a. stop lists):
  • http://snowball.tartarus.org/algorithms/english/stop.txt
  • http://xpo6.com/list-of-english-stop-words/

• Python library:
  • NLTK - http://www.nltk.org/book/ch02.html (Section 4.1)

Stop words in Python

Remove stop words (with NLTK)

    from nltk.corpus import stopwords

    def remove_stopwords(tokens):
        # A set makes the membership test fast; the name avoids shadowing the import
        stop_list = set(stopwords.words('english'))
        content = [w for w in tokens if w not in stop_list]
        return content

• You can use an alternative stop list; in that case you don't need NLTK.

• If you use the stop list of NLTK, install ‘stopwords’ from the NLTK corpus with

    nltk.download('stopwords')

Source: http://www.nltk.org/book/ch02.html
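Stop-word removal without NLTK can be sketched with a hand-written stop list. The list below is just the short example list of common English stop words; the token list is hypothetical.

```python
# Small example stop list; a real analysis would likely use a longer one
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its "
    "of on that the to was were will with".split()
)


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [w for w in tokens if w not in stop_words]


tokens = ["the", "total", "of", "all", "salaries", "in", "a", "company"]
content = remove_stopwords(tokens)
print(content)
```

Tokens are expected to be lowercased already, since the stop list contains lowercase words only.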

Mining Software Data

Mining Software Data

• We apply some techniques of NLP, data mining (IR, machine learning), and metaprogramming to scenarios with program identifiers, program comments, or documentation as data.

• Selection of techniques:
  • Java Parser
  • Regular Expression Matching
  • IDF
  • Cosine similarity
  • Clustering
  • Prediction model
  • Sentiment analysis
  • Correlation
  • “Plotting”

Java Parser

Given a source file, JavaParser recognizes the different syntactic elements and produces an Abstract Syntax Tree (AST).

https://github.com/javaparser/javaparser

[Figure: a .java file is parsed by JavaParser, which returns an AST.]

Iterate over AST with JavaParser

Iterate over AST

• via recursive method
  • process the current node (instanceof checks and casts for specific node types)
  • call the method again with the child nodes

• via Visitor
  • call the accept method with a specific Visitor
  • do the processing in the visit methods for specific node types

Java Parser scenario

How can we print all method names of a .java file?

Via recursive method

[Code listing; annotations:]
• current node: at the beginning, the AST root (CompilationUnit)
• instanceof check and cast for MethodDeclaration nodes
• prints the method names
• recursive call for the child nodes

Via Visitor

[Code listing; annotations:]
• calls the accept method with a specific Visitor
• the specific node
• prints the method names
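The two traversal styles can be illustrated by analogy in Python. Note the assumption: this sketch uses Python's standard `ast` module on Python source, not JavaParser on Java source, so `ast.FunctionDef` and `ast.NodeVisitor` merely stand in for `MethodDeclaration` and a JavaParser visitor.

```python
import ast

# Hypothetical input program
SOURCE = '''
class Company:
    def total(self):
        return 0

    def cut(self):
        pass
'''


def method_names_recursive(node):
    """Recursive style: inspect the current node, then recurse into the children."""
    names = []
    if isinstance(node, ast.FunctionDef):      # type check for the node of interest
        names.append(node.name)
    for child in ast.iter_child_nodes(node):   # recursive call for child nodes
        names.extend(method_names_recursive(child))
    return names


class MethodNameVisitor(ast.NodeVisitor):
    """Visitor style: the framework dispatches to visit_<NodeType> methods."""

    def __init__(self):
        self.names = []

    def visit_FunctionDef(self, node):
        self.names.append(node.name)
        self.generic_visit(node)               # keep walking into nested nodes


tree = ast.parse(SOURCE)                       # the AST root
visitor = MethodNameVisitor()
visitor.visit(tree)
print(method_names_recursive(tree), visitor.names)
```

Both styles find the same names; the visitor version keeps the traversal logic out of the processing code.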

Regular Expression Matching

Regular Expression with Java’s Regex-API

Characters

    x       The character x
    \\      The backslash character
    \t      The tab character ('\u0009')
    \n      The newline (line feed) character ('\u000A')
    \r      The carriage-return character ('\u000D')
    \f      The form-feed character ('\u000C')
    \a      The alert (bell) character ('\u0007')
    \e      The escape character ('\u001B')
    \cx     The control character corresponding to x

Character classes

    [abc]           a, b, or c (simple class)
    [^abc]          Any character except a, b, or c (negation)
    [a-zA-Z]        a through z or A through Z, inclusive (range)
    [a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
    [a-z&&[def]]    d, e, or f (intersection)
    [a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
    [a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)

Greedy quantifiers

    X?        X, once or not at all
    X*        X, zero or more times
    X+        X, one or more times
    X{n}      X, exactly n times
    X{n,}     X, at least n times
    X{n,m}    X, at least n but not more than m times

Typical invocation sequence:

    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();

More: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

Regular Expression with Java’s Regex-API

[Code listing; annotations: define the regular expression, define the input string, run the matching test]
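For readers following along in Python, the same three steps look roughly like this with the standard `re` module. This is an analogy, not the Java API: Java's `Matcher.matches()` tests the entire input, which corresponds to `fullmatch` in Python, while `search` looks for a match anywhere in the input.

```python
import re

pattern = re.compile(r"a*b")      # define the regular expression
text = "aaaaab"                   # define the input string

# Matching test on the entire input (comparable to Matcher.matches() in Java)
whole_match = pattern.fullmatch(text) is not None

# A pattern can match part of an input without matching all of it
partial_only = (pattern.search("xxaab yy") is not None
                and pattern.fullmatch("xxaab yy") is None)

print(whole_match, partial_only)
```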

Design Pattern Detection with JavaParser and Regular Expression Matching

https://boa.unimib.it/retrieve/handle/10281/31515/43100/phd_unimib_055259.pdf

Tools for Design Pattern Detection

Design Pattern Detection

We focus on simple design pattern detection with

• Java Parser and
• Regular Expression Matching

Example

How can we detect the Singleton Pattern?

Constraints for Singleton Pattern

We identify a Singleton class by the following criteria:

1. has class constructors, regardless of accessibility,
2. has a static reference, regardless of accessibility, to the Singleton class, and
3. has a public static method that returns the Singleton class type

http://dblp.uni-trier.de/rec/html/conf/kbse/ShiO06

Singleton Pattern Detection with Java Parser

DPD with Java Parser

We identify a Singleton class by the following criteria:

1. has class constructors, regardless of accessibility,
2. has a static reference, regardless of accessibility, to the Singleton class, and
3. has a public static method that returns the Singleton class type

[Code listing; annotations: parse the file, use the Visitor, check the constraints]

DPD with Java Parser

Specific Visitor for Singleton Pattern Detection

[Code listing; annotations: save the class name, check constraint 1, check constraint 3, check constraint 2]

Singleton Pattern Detection with regular expression matching

DPD with regular expression matching

[Code listing; annotations: regular expression for constraint 1; match the regular expressions in the file for constraints 1 to 3]

API: java.util.regex.*
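The three Singleton constraints can be approximated with regular expressions. The slides use `java.util.regex`; the sketch below uses Python's `re` instead so that it matches the other Python examples. The patterns, the function name `looks_like_singleton` and the two Java snippets are assumptions for illustration. The approach is heuristic: it ignores comments, string literals, generics and files with several classes.

```python
import re

SINGLETON = """
public class Registry {
    private static Registry instance;
    private Registry() { }
    public static Registry getInstance() {
        if (instance == null) { instance = new Registry(); }
        return instance;
    }
}
"""

NOT_SINGLETON = """
public class Employee {
    private String name;
    public Employee(String name) { this.name = name; }
    public String getName() { return name; }
}
"""


def looks_like_singleton(java_source):
    class_match = re.search(r"\bclass\s+(\w+)", java_source)
    if class_match is None:
        return False
    name = re.escape(class_match.group(1))  # save the class name
    patterns = [
        # 1. a constructor, regardless of accessibility (but not 'new Name() {')
        r"(?<!new\s)\b" + name + r"\s*\([^)]*\)\s*\{",
        # 2. a static field of the class type, regardless of accessibility
        r"\bstatic\s+(?:final\s+)?" + name + r"\s+\w+\s*[=;]",
        # 3. a public static method that returns the class type
        r"\bpublic\s+static\s+(?:final\s+)?(?:synchronized\s+)?" + name + r"\s+\w+\s*\(",
    ]
    return all(re.search(p, java_source) for p in patterns)


print(looks_like_singleton(SINGLETON), looks_like_singleton(NOT_SINGLETON))
```

Compared with the parser-based detection, the regular expressions are shorter to write but easier to fool, which is one reason to check the same constraints on the AST as well.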

Inverse Document Frequency (IDF)

TF-IDF scenario

• Extract vocabulary from wiki text (including stemming and stop list application)
• Compute TF-IDF from wiki pages („documents“).

• We get ranked lists for every document.
• Do the first ranked terms ‘characterize’ the document?
• Are the first ranked terms important for all documents?

Ranked list (one per document: document 1, document 2, document 3):

    word     TF-IDF
    word1    5.4
    word2    2.9
    …        …

Inverse Document Frequency (IDF)

$N$ = total number of documents
$\mathrm{df}_t$ = number of documents that contain a term $t$
$\mathrm{idf}_t$ = inverse document frequency of a term $t$

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

If a word appears in many documents, then it is not a unique identifier; this word has a low IDF score.
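A small numeric check of this definition, using the natural logarithm and a hypothetical collection of four tokenized documents:

```python
import math


def idf(term, tokenized_documents):
    """Inverse document frequency: log(N / df_t)."""
    n = len(tokenized_documents)
    df = sum(1 for doc in tokenized_documents if term in doc)
    if df == 0:
        raise ValueError("term does not occur in any document")
    return math.log(n / df)


# Hypothetical tokenized documents
docs = [
    ["total", "salary", "company"],
    ["cut", "salary", "company"],
    ["company", "department"],
    ["total", "company"],
]

for term in ["company", "salary", "cut"]:
    print(term, idf(term, docs))
```

`company` occurs in all four documents and gets an IDF of 0, `salary` occurs in two and gets log 2, and `cut` occurs in only one and gets the highest score, log 4.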

• TF-IDF stands for "Term Frequency, Inverse Document Frequency". It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.

• Intuitively...
  • If a word appears frequently in a document, it's important. Give the word a high score.
  • If a word appears in many documents, it's not a unique identifier. Give the word a low score.

TF-IDF

$\mathrm{tf}_{d,t}$ = number of occurrences of term $t$ in a document $d$
$N$ = total number of documents
$\mathrm{df}_t$ = number of documents that contain a term $t$
$\mathrm{tfidf}_{d,t}$ = TF-IDF of a term $t$ in document $d$

$$\mathrm{tfidf}_{d,t} = \mathrm{tf}_{d,t} \cdot \log \frac{N}{\mathrm{df}_t}$$
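The formula leads directly to the ranked list per document from the TF-IDF scenario. The following sketch applies it with raw term counts and the natural logarithm; the function name `tfidf_ranking` and the three documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_ranking(tokenized_documents):
    """For each document, return its terms sorted by descending TF-IDF score."""
    n = len(tokenized_documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_documents for term in set(doc))
    rankings = []
    for doc in tokenized_documents:
        tf = Counter(doc)  # raw term frequency within this document
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Sort by score (descending), then alphabetically for a stable order
        rankings.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])))
    return rankings


docs = [
    ["total", "total", "salary", "company"],
    ["cut", "salary", "company"],
    ["company", "department", "department"],
]

rankings = tfidf_ranking(docs)
for ranking in rankings:
    print(ranking)
```

In the first document `total` ranks first because it is frequent there and absent elsewhere, while `company` scores 0 in every document because it occurs in all of them.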

TF-IDF in Python

Different ways to calculate the ‘term frequency’:

    import math

    def term_frequency(term, tokenized_document):
        return tokenized_document.count(term)

    def sublinear_term_frequency(term, tokenized_document):
        count = tokenized_document.count(term)
        # log(0) is undefined, so a term that does not occur gets frequency 0
        if count == 0:
            return 0
        return 1 + math.log(count)

    def augmented_term_frequency(term, tokenized_document):
        max_count = max([term_frequency(t, tokenized_document) for t in tokenized_document])
        return (0.5 + ((0.5 * term_frequency(term, tokenized_document)) / max_count))

Source: http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html

TF-IDF in Python

Calculating IDF

    def inverse_document_frequencies(tokenized_documents):
        idf_values = {}
        # tokenized_documents is a list of token lists; each sublist is one document
        all_tokens_set = set([item for sublist in tokenized_documents for item in sublist])
        for tkn in all_tokens_set:
            contains_token = map(lambda doc: tkn in doc, tokenized_documents)
            # This variant adds 1 to log(N / df)
            idf_values[tkn] = 1 + math.log(len(tokenized_documents) / (sum(contains_token)))
        return idf_values

TF-IDF in Python

Calculating TF-IDF for every term in all documents

    def tfidf(documents):
        tokenized_documents = [tokenize(d) for d in documents]
        idf = inverse_document_frequencies(tokenized_documents)
        tfidf_documents = []
        for document in tokenized_documents:
            doc_tfidf = []
            for term in idf.keys():
                tf = sublinear_term_frequency(term, document)
                doc_tfidf.append(tf * idf[term])
            tfidf_documents.append(doc_tfidf)
        return tfidf_documents

TF-IDF in Python

Calculating TF-IDF with scikit-learn

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tokenizer: reference to the tokenizing and stemming method (here called process)
    # stop_words: defines the stop words
    sklearn_tfidf = TfidfVectorizer(tokenizer=process, stop_words='english')

    sklearn_representation = sklearn_tfidf.fit_transform(all_documents)

Cosine similarity

Cosine scenario

For each method or class scope of each contribution, find the most similar scope in another contribution, where the vector is based on the term frequency (after preprocessing) of the program identifiers (or comments) in the scope. The terms to be included in the vector could be selected in different ways.

We could consider the top-n terms from an (TF-)IDF analysis.
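The scenario can be sketched without any library by building term-frequency vectors with `Counter`. The scope names and token lists below are hypothetical, and the tokens are assumed to be preprocessed (split, lowercased, stemmed) already.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms that occur in both vectors contribute to the dot product
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(scope_tokens, candidates):
    """Return the name of the candidate scope with the highest cosine similarity."""
    return max(candidates, key=lambda name: cosine_similarity(scope_tokens, candidates[name]))


# Hypothetical scopes from another contribution, mapped to their preprocessed terms
candidates = {
    "Employee.getSalary": ["get", "salari"],
    "Company.total": ["total", "salari", "employe"],
    "Company.cut": ["cut", "employe"],
}

best = most_similar(["total", "salari"], candidates)
print(best)
```

Restricting the vectors to the top-n terms of a TF-IDF analysis would only change which tokens are passed in; the similarity computation stays the same.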

Cosine similarity in Python

    import nltk, string
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = nltk.stem.porter.PorterStemmer()
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

    def stem_tokens(tokens):
        return [stemmer.stem(item) for item in tokens]

    def normalize(text):
        return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

    vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

    def cosine_sim(text1, text2):
        # calculate the TF-IDF vectors (one row per text)
        tfidf = vectorizer.fit_transform([text1, text2])
        # calculate the cosine similarity: the rows are L2-normalized by default,
        # so the matrix product holds the pairwise cosine similarities
        return (tfidf @ tfidf.T).toarray()[0, 1]

    print(cosine_sim('a little bird', 'a little bird'))
    print(cosine_sim('a little bird', 'a little bird chirps'))
    print(cosine_sim('a little bird', 'a big dog barks'))
