Development Emails Content Analyzer: Intention Mining in Developer Discussions


Andrea Di Sorbo

Sebastiano Panichella

Corrado Visaggio

Massimiliano Di Penta

Gerardo Canfora

Harald Gall

Outline

Context: Written Development Discussions

Case Study: Development Mailing Lists of 2 Open Source Projects

Results: Automatic Classification of Relevant Contents in Developers’ Communication

Open Source (OS) and Industrial Projects

Development Communication Means

Recommender systems:
- Bug Triaging [1]
- Suggest Mentors [2]
- Code re-documentation [3]
- Etc.

[1] Anvik et al., “Who should fix this bug?”
[2] Canfora et al., “Who is going to mentor newcomers in open source projects?”
[3] Panichella et al., “Mining source code descriptions from developer communications”

Development Communication Means

[1] Bacchelli et al., “Content classification of development emails”
[2] Cerulo et al., “A Hidden Markov Model to detect coded information islands in free text”

Different Kinds of Data

Structured

Semi-Structured

Unstructured

A Considerable Effort for Developers

Many messages

Developers get lost in unnecessary details, missing potentially useful information…

Previous Work

Hana et al.

“… ‘Lazy’ RTC occurs when a core developer posts a change to a mailing list and nobody responds; it is assumed that other developers reviewed the code…”

Approaches for:
- Generating summaries of emails
  → Lam et al.
  → Rambow et al.
- Generating summaries of bug reports
  → Rastkar et al.

Different Purposes

Feature requests

Bug disclosures

Project Management

DECA (Development Email Content Analyzer)

An Approach to Classify Paragraphs According to Intentions

http://www.ifi.uzh.ch/seal/people/panichella/tools/DECA.html

Why use NLP for Classifying Paragraphs According to Intentions?

Example

i. We could use a leaky bucket algorithm to limit the bandwidth

ii. The leaky bucket algorithm fails in limiting the bandwidth

- A high percentage of words in common
- Discuss the same topics
- Have different intentions

“Techniques based on lexicon analysis, such as VSM [1], LSI [2], or LDA [3] would not be sufficient to classify paragraphs according to intentions.”

[1] Baeza-Yates et al., “Modern Information Retrieval”.
[2] de Marneffe et al., “The Stanford typed dependencies representation”.
[3] Blei et al., “Latent Dirichlet allocation”.
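The point about shared vocabulary can be checked with a few lines of Python. The sketch below is only an illustration: it uses a plain term-frequency cosine as a stand-in for a VSM-style comparison, with a naive regex tokenizer and no stemming or stop-word removal. The function names are hypothetical and this is not the setup used in the study.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def shared_terms(a, b):
    """Distinct words that appear in both sentences."""
    return set(tokenize(a)) & set(tokenize(b))


def cosine_similarity(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "We could use a leaky bucket algorithm to limit the bandwidth"
s2 = "The leaky bucket algorithm fails in limiting the bandwidth"

print(sorted(shared_terms(s1, s2)))
print(round(cosine_similarity(s1, s2), 3))
```

The two sentences share every topic word, so a purely lexical score treats them as close even though one proposes something and the other reports a failure.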

Perspective

Case Study

Goal: Understanding to what extent NL parsing could be used in recognizing informative text fragments in emails from a software maintenance and evolution perspective.

Quality focus: Detection of text paragraphs in development discussions containing helpful information for developers.

Perspective: Guide developers in maintaining and evolving their products.

Research Questions

RQ1: Can an NLP approach (i.e., DECA) be effective in classifying writers’ intentions in development emails?

RQ2: Is DECA more effective than existing Machine Learning techniques in classifying development emails content?

Context

Qt

Ubuntu

STEPS:

1) Taxonomy Definition

2) Classification Based on DECA (NLP Analyzer)

Taxonomy Definition

Sampling

We selected 100 [...] of the [...] Project

Clustering

Clusters:
- Implementation
- Technical Infrastructure
- Project Status
- Social Interactions
- Usage
- Discarded

Guzzi et al., MSR 2013

Guzzi et al., ICSE 2012

The final taxonomy

Differences with Guzzi et al.

Examples

Natural Language Parsing

DECA (Development Email Content Analyzer)

Recurrent Linguistic Patterns

Why NL parsing? Well-defined predicate-argument structures

Dependency tree of sentence i:

use
  nsubj: we
  aux: could
  dobj: algorithm
    det: a
    amod: leaky
    nn: bucket
  xcomp: limit
    aux: to
    dobj: bandwidth
      det: the

Dependency tree of sentence ii:

fails
  nsubj: algorithm
    det: the
    amod: leaky
    nn: bucket
  prep: in
    pcomp: limiting
      dobj: bandwidth
        det: the

NL parsing: Natural Language Templates

[someone] could use [something]

use
  nsubj: [someone]
  aux: could
  dobj: [something]

[something] fails

fails
  nsubj: [something]

NLP Heuristics

NLP Parser

raw text → NLP parser → NLP heuristics
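One way to picture how a heuristic could act on parser output is a template match over typed dependencies. The sketch below is a guess at the general idea, not DECA's implementation: the dependency triples are copied by hand from the two example sentences rather than produced by a parser, and the heuristic format, the function names and the two category labels are hypothetical.

```python
# Typed dependencies as (relation, governor, dependent), written out by hand
# for the two example sentences. A real pipeline would get them from a parser.
DEPS_I = [
    ("nsubj", "use", "we"), ("aux", "use", "could"),
    ("dobj", "use", "algorithm"), ("xcomp", "use", "limit"),
    ("det", "algorithm", "a"), ("amod", "algorithm", "leaky"),
    ("nn", "algorithm", "bucket"), ("aux", "limit", "to"),
    ("dobj", "limit", "bandwidth"), ("det", "bandwidth", "the"),
]
DEPS_II = [
    ("nsubj", "fails", "algorithm"), ("prep", "fails", "in"),
    ("det", "algorithm", "the"), ("amod", "algorithm", "leaky"),
    ("nn", "algorithm", "bucket"), ("pcomp", "in", "limiting"),
    ("dobj", "limiting", "bandwidth"), ("det", "bandwidth", "the"),
]

# Hypothetical heuristics: a governor word plus the relations it must have.
# None accepts any dependent; a string requires that exact dependent.
# The labels are placeholders, not the categories of the final taxonomy.
HEURISTICS = [
    {"label": "suggests-solution", "governor": "use",
     "required": {"nsubj": None, "aux": "could", "dobj": None}},
    {"label": "reports-problem", "governor": "fails",
     "required": {"nsubj": None}},
]


def matches(heuristic, deps):
    """True if the governor has every required relation in the parse."""
    found = {rel: dep for rel, gov, dep in deps
             if gov == heuristic["governor"]}
    return all(rel in found and (want is None or found[rel] == want)
               for rel, want in heuristic["required"].items())


def classify(deps):
    """Return the labels of all heuristics that match the parse."""
    return [h["label"] for h in HEURISTICS if matches(h, deps)]


print(classify(DEPS_I))
print(classify(DEPS_II))
```

Because the match depends on the predicate and its arguments rather than on the shared nouns, the two sentences end up with different labels despite their word overlap.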


RQ1: Is DECA effective in classifying writers’ intentions in development emails?

Experiment I
- training: 102, 87
- test: 100
- False Negative → Experiment II

Experiment II
- training: 100, 169
- test: 100
- False Negative → Experiment III

Experiment III
- training: 100, 231
- test: 100


RQ2: Is the proposed approach more effective than existing ML in classifying development emails content?

ML for Email Classification

An Approach Based on ML for Email Content Classification

1) Text Features
2) Split training and test sets
3) Oracle building
4) Classification (training, prediction)

→ Antoniol et al., CASCON 2008
→ Zhou et al., ICSME 2014
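The four steps can be sketched end to end in a few lines. The slides name the steps but not the features or the classifier, so everything concrete here is an assumption: bag-of-words counts as text features, a tiny hand-labelled oracle with invented sentences and placeholder labels, and a multinomial Naive Bayes with add-one smoothing as the classifier. It is not the setup of the cited works.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Step 1: bag-of-words term counts (assumed feature set)."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed classifier)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(features(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for word, count in features(text).items():
                seen = self.word_counts[label][word]
                score += count * math.log((seen + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Step 3: a toy oracle. Sentences and labels are invented for illustration.
ORACLE = [
    ("we could use a cache here", "proposal"),
    ("we could add a retry option", "proposal"),
    ("the build fails on startup", "problem"),
    ("the parser fails with empty input", "problem"),
    ("we could use a leaky bucket algorithm to limit the bandwidth", "proposal"),
    ("the leaky bucket algorithm fails in limiting the bandwidth", "problem"),
]

# Step 2: split into training and test sets.
train, test = ORACLE[:4], ORACLE[4:]

# Step 4: training, then prediction on the held-out paragraphs.
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
predictions = [model.predict(t) for t, _ in test]
print(predictions)
```

On this toy data the classifier separates the two test sentences only because cue words such as "could" and "fails" happen to occur in the training set; the example says nothing about how such a baseline behaves on real mailing lists.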


Summary

• RQ1: The automatic classification performed by DECA achieves very good results in terms of precision, recall, and F-measure (over all the experiments).

• RQ2: DECA outperforms traditional ML techniques in terms of recall, precision, and F-measure when classifying e-mail content.

“…it took the MSR community more than 10 years to figure out that machine learning is not the best method for analyzing human-written text. Thank you for helping move the field forward…” [One of the ASE Reviewers]
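For reference, the three measures named in the summary can be computed per category from predicted and oracle labels. The sketch below assumes the balanced F-measure (F1); the function name, labels and data are invented for illustration and are not results from the study.

```python
def precision_recall_f1(predicted, actual, label):
    """Precision, recall and balanced F-measure for one category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)  # true positives
    fp = sum(p == label and a != label for p, a in pairs)  # false positives
    fn = sum(p != label and a == label for p, a in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented labels: one "problem" paragraph is missed by the classifier.
actual = ["problem", "problem", "proposal", "proposal", "problem"]
predicted = ["problem", "proposal", "proposal", "proposal", "problem"]

p, r, f = precision_recall_f1(predicted, actual, "problem")
print(round(p, 2), round(r, 2), round(f, 2))
```

A missed paragraph is a false negative and lowers recall, while precision is unaffected as long as every paragraph assigned to the category really belongs to it.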


Code re-documentation

→ Panichella et al., ICPC 2012: Extract methods’ descriptions from developers’ discussions
→ Vector Space Models
→ Ad hoc heuristics

“… several are the discourse patterns that characterize false negative method descriptions …”

Conclusion

Future work

1) DECA as preprocessing support to discard irrelevant sentences in summarization approaches

2) DECA in combination with topic models for mining contents with the same intentions and the same topics