
Page 1: Towards Content-Driven Reputation for Collaborative Code Repositories

Andrew G. West and Insup Lee
August 28, 2012

Page 2: Big Concept

Apply reputation algorithms developed for wikis in collaborative code repositories:

1. Do the computed reputations accurately reflect user behavior? If so, how could such a system be useful in practice?

2. What do inaccuracies teach us about differences in the evolution of code vs. natural-language content? Adaptation?

Page 3: Motivations

Platform equivalence:
• Purely collaborative
• Increasingly distributed; collaboration between unknown/un-trusted parties

VehicleForge.mil [1]:
• Crowdsourcing a next-generation military vehicle
• Trust implications!

Page 4: CONTENT-DRIVEN REPUTATION

Page 5: Content-Driven Rep.

[Diagram: article version history V0 → V1; author A1; V1: "Mr. Franklin flew a kite" (Initialization)]

IDEA: Content that survives is good content. Good content is written/maintained by good authors.

V1: No reputation changes; no survival

Page 6: Content-Driven Rep.

[Diagram: article version history V0 → V1 → V2 → V3 → V4; authors A1, A2, A3, A4; V1: "Mr. Franklin flew a kite" (Initialization); V2: "Your mom flew a plane" (Damage)]

IDEA: When a subsequent editor allows content to survive, it has his/her implicit approval (and vice versa)

V2: Author A2 deletes most of A1’s content. Reputation of A1 is negatively impacted.

Page 7: Content-Driven Rep.

[Diagram: article version history V0 → V1 → V2 → V3 → V4; authors A1, A2, A3, A4; V1: "Mr. Franklin flew a kite" (Initialization); V2: "Your mom flew a plane" (Damage); V3: "Mr. Franklin flew a kite" (Content Restoration)]

IDEA: Survival is examined at depth

V3: Author A3 reverts A2’s content. Editor A1 gains reputation as his content is restored, A2 loses rep.

Page 8: Content-Driven Rep.

[Diagram: article version history V0 → V1 → V2 → V3 → V4; authors A1, A2, A3, A4; V1: "Mr. Franklin flew a kite" (Initialization); V2: "Your mom flew a plane" (Damage); V3: "Mr. Franklin flew a kite" (Content Restoration); V4: "Mr. Franklin flew a kite and …" (Content Persistence)]

IDEA: … and the process continues (depth=10)

V4: Authors A1 and A3 accrue reputation, while A2 continues to receive reputation decrements.
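To make the idea concrete, here is a toy sketch (not the WikiTrust algorithm itself) that replays the V0..V4 example above: each new version implicitly judges every earlier author by how much of that author's text it keeps. The difflib survival metric and the ±(survival − 0.5) scoring rule are illustrative assumptions.

import difflib

history = [                               # (author, version text), as on pages 5-8
    ("A1", "Mr. Franklin flew a kite"),                 # V1: initialization
    ("A2", "Your mom flew a plane"),                    # V2: damage
    ("A3", "Mr. Franklin flew a kite"),                 # V3: content restoration
    ("A4", "Mr. Franklin flew a kite and ..."),         # V4: content persistence
]

rep = {}                                  # author -> toy reputation score

def survival(old, new):
    """Fraction of the old version's tokens that survive into the new one."""
    sm = difflib.SequenceMatcher(None, old.split(), new.split())
    kept = sum(block.size for block in sm.get_matching_blocks())
    return kept / max(len(old.split()), 1)

for i, (judge, new_text) in enumerate(history[1:], start=1):
    for j in range(max(0, i - 10), i):                  # survival examined at depth 10
        prior_author, prior_text = history[j]
        if prior_author == judge:                       # no self-approval
            continue
        s = survival(prior_text, new_text)
        rep[prior_author] = rep.get(prior_author, 0.0) + (s - 0.5)

print(rep)   # A1 ends up positive, A2 negative, A3 positive, matching V2-V4 above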

Page 9: In Practice

Implemented as WikiTrust [2, 3]:
• Token survival + edit distance captures novel content as well as maintenance actions
• Size of ∆ is: (1) proportional to the degree of change, (2) weighted by the rep. of the editor (a sketch follows)
• Nice security properties:
  – Implicit feedback
  – Symmetric evaluation
  – No self-approval
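A hedged sketch of how a single update might be sized, following the bullets above; this mirrors the spirit of WikiTrust [2, 3] rather than its exact formula. The 20,000 ceiling comes from the results later in the deck; the names and the linear weighting are assumptions.

def reputation_delta(kept_fraction, edit_size, judge_rep, max_rep=20_000.0):
    """kept_fraction in [0,1]: share of the judged edit that survives (token
    survival + edit distance); edit_size: degree of change the judged edit made;
    judge_rep: reputation of the later editor doing the implicit judging."""
    direction = kept_fraction - 0.5      # survival rewards, removal punishes (symmetric)
    weight = judge_rep / max_rep         # judgments by reputable editors count for more
    return direction * edit_size * weight

def apply_delta(rep, author, delta, ceiling=20_000.0):
    """Clamp reputations to [0, ceiling], matching the observed [0, 20k] range."""
    rep[author] = min(ceiling, max(0.0, rep.get(author, 0.0) + delta))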


Page 10: WikiTrust Success

Live processing several language editions of Wikipedia; portable!

[Screenshot: vandalism example]

Implementation [4] works on any MediaWiki installation

Page 11: REPRESENTING A REPOSITORY ON A WIKI PLATFORM

Page 12: Repo. ↔ Wiki Model

[Diagram: SVN revision graph (revisions 1-9) spanning trunk/, branches/, and tags/, with a merge back into trunk/]

Just replay history in a sequential fashion:
• Repository ↔ wiki
• Check-in ↔ edit
• File ↔ article

Page 13: Repo. ↔ Wiki Model

Minor accommodations:
• Ignore tags
• Ignore branches (merge as a recommendation)
• Multi-file check-in


Page 14: Replay in Practice

1. [svnsync] produces a local copy (not a checkout)
2. [svn log] yields a metadata script (see table)
3. Pipe file versions into the wiki via its API (a sketch of this replay loop follows the table):
   1. Log in the user (create account if needed)
   2. Use the [svn cat path@id] syntax to yield content
   3. Make an edit to article "path". Log out.

ID  USR  COMMENT            MOD  PATH
1   U1   Initial check-in.  A    /trunk/core/header.c
                            A    /trunk/core/misc.c
2   U2   Compilation error  M    /trunk/core/header.c
3   U1   Don't need this    D    /trunk/core/misc.c
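A minimal sketch of the replay loop above, under some assumptions: a local svnsync mirror at REPO, one pre-provisioned wiki account per committer (PASSWORDS is a hypothetical lookup), and the third-party mwclient package for the MediaWiki API calls (scheme/path may need adjusting for a real install).

import subprocess
import xml.etree.ElementTree as ET
import mwclient

REPO = "file:///srv/mirror/mediawiki-svn"    # hypothetical svnsync mirror URL
PASSWORDS = {}                               # committer -> wiki password (pre-provisioned)
wiki = mwclient.Site("hincapie.cis.upenn.edu", path="/wiki_mediawiki/")

def revisions():
    """Yield (rev, author, comment, [(action, path), ...]) from `svn log --xml -v`."""
    log = subprocess.run(["svn", "log", "--xml", "-v", REPO],
                         capture_output=True, text=True, check=True).stdout
    for entry in ET.fromstring(log).iter("logentry"):
        paths = [(p.get("action"), p.text) for p in entry.iter("path")]
        yield (int(entry.get("revision")), entry.findtext("author"),
               entry.findtext("msg") or "", paths)

def cat(path, rev):
    """File content at a revision, via the `svn cat path@id` syntax from step 2."""
    return subprocess.run(["svn", "cat", f"{REPO}{path}@{rev}"],
                          capture_output=True, text=True, check=True).stdout

for rev, author, comment, paths in sorted(revisions()):    # replay oldest-first
    wiki.login(author, PASSWORDS[author])                   # one wiki account per committer
    for action, path in paths:
        text = "" if action == "D" else cat(path, rev)      # a deletion blanks the article
        wiki.pages[path].save(text, summary=comment)        # article title = repository path

Per the accommodations on page 13, tags/ and branches/ paths would be filtered out before this loop.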

Page 15: CASE STUDY INTRODUCTION

Page 16: Mediawiki SVN

• Case study repository: Mediawiki SVN [5]
• http://hincapie.cis.upenn.edu/wiki_mediawiki/

PROPERTY        ORIG     MOD
Authors         326      271
Check-ins       91,808   53,715
File versions   585,629  117,432
… in trunk/     420,613  117,432
Unique paths    138,741  7,521
… to PHP file   56,063   7,521

Further filtering (per late 2011):
• Only PHP files
• Core language
• No binary files
• Tokenization
• Toss out i18n files

Page 17: Mediawiki SVN (cont.)


Wiki database is givento WikiTrust implementation:

Revision #A by J had ∆+0.75 on reputation of X=12.05
Revision #B by K had ∆-42.00 on reputation of Y=0.5
Revision #B by K had ∆+16.75 on reputation of Z=1000.1
… … …

Recall: An edit can change up to 10 reputations!

Page 18: General Results (1)

[Plot: distribution of final user reputations]
• Reputations lie on [0, 20k]
• 0.0 is the initial rep.
• ≈15 users w/ max. rep.; not always those w/ most revs.

Page 19: General Results (2)

[Plot: distribution of update ∆s, by magnitude]
• Majority of updates are positive; evidence of a healthy community
• Most freq. update is a 1-10 pt. increment

Page 20: Example Reputations

Page 21: EVALUATING REPUTATION ACCURACY

Page 22: Evaluation Process

Find edits (Ex) where:
• The subsequent edit (Ex+1) resulted in a non-trivial rep. loss for the author
• Manually inspect the comment, Bugzilla, and diffs, and ask: "Would editor Ax+1 consider the previous change CONSTRUCTIVE, or UNCONSTRUCTIVE?" (candidate selection is sketched after the diagram below)
• Could be a subjective mess, but…

[Diagram: edit Ex is followed by edit Ex+1, which removes a non-trivial amount of content. Was this removal the result of ineptitude by the prior editor?]
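A small sketch of how the candidate set for this inspection might be pulled from the update log shown on page 17; the exact log-line format and the -10 threshold are assumptions, not WikiTrust's actual output.

import re

LINE = re.compile(r"Revision #(?P<rev>\S+) by (?P<judge>\S+) had ∆(?P<delta>[-+]?\d+\.\d+)"
                  r" on reputation of (?P<author>\w+)")

def candidates(log_lines, threshold=-10.0):
    """Yield (judged revision, judging editor, delta, judged author) for every
    update that cost some author a non-trivial amount of reputation."""
    for line in log_lines:
        m = LINE.search(line)
        if m and float(m.group("delta")) <= threshold:
            yield m.group("rev"), m.group("judge"), float(m.group("delta")), m.group("author")

# Example: the page-17 line "Revision #B by K had ∆-42.00 on reputation of Y=0.5"
# is selected; each hit is then inspected by hand (comment, Bugzilla, diff).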

Page 23: Classifying Rep. Loss (1)

A surprising number of obviously "bad" actions result in reverts. The editor calls out the previous edit and/or editor explicitly:

"Password in plaintext! … DOESN'T WORK … don't put it in trunk!"
"massive breakage with incomplete experimental changes"
"revert … spewing giant red HTML all over everything"
"failed, possibly other problems. NEEDS PARSER TESTS"
"ten billion extra callouts … clutter things up and trigger errors"
"… no apparent purpose … more complex and prone to breakage"

Page 24: Classifying Rep. Loss (2)

Some cases are more ambiguous. The editor erred, but it's not immediately clear there should be a significant penalty (NON-FATAL):

Code showing no immediate errors:
• But reverted (or branched) for testing

Issues unrelated to functional code:
• Whitespace, comment/string changes

Page 25: Evaluation Results

Per a conservative approach, anything not in the other two sets is CONSTRUCTIVE:

UNCONSTRUCTIVE  NON-FATAL  CONSTRUCTIVE
51%             19%        30%

63% accuracy if we discount the "non-fatal" cases (51 / (51 + 30) ≈ 63%)
70% accuracy if we interpret them as "unconstructive" ((51 + 19) / 100 = 70%)
Interpret how you wish; purposely a naïve application

Concentrate on false positives: can the algorithm be improved?

Page 26: IDENTIFYING & FIXING FALSE POSITIVES + EVALUATION

Page 27: False Positives (1)

SVN does not handle RENAME elegantly:

[Diagram: file.c is DELeted and file_renamed.c is ADDed in its place]

Consequences: Authors of [file.c] are punished; provenance is lost; the renamer gets all the credit.

Solutions: Detect via hash; simple wiki "move" (sketched below)
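A hedged sketch of the hash-based rename detection suggested above; cat() is the hypothetical content-fetch helper from the page-14 replay sketch, and issuing the actual wiki "move" is left to the replay layer.

import hashlib

def sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def detect_renames(rev, paths, cat):
    """paths: [(action, path), ...] for one check-in, as in the svn log table.
    Returns (old_path, new_path) pairs whose content hashes match, i.e. DEL+ADD
    pairs that should be replayed as a wiki "move" so provenance is preserved."""
    deleted = {sha1(cat(p, rev - 1)): p for action, p in paths if action == "D"}
    renames = []
    for action, path in paths:
        if action == "A":
            digest = sha1(cat(path, rev))
            if digest in deleted:
                renames.append((deleted[digest], path))
    return renames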

Page 28: False Positives (2.1)

INTER-DOCUMENT REORGANIZATION is problematic for WikiTrust

[Diagram: func_c(){…} is removed from file_1.c (--- ∆) and reappears in file_2.c (+++ ∆)]

Treat the entire code base as one giant doc (file1.c >> file2.c >> file3.c >> ...): a global diff!

Solution: Examine all diff ∆s; sub-string matching; replay history. Intra-doc reorg. is a non-issue! (a sketch follows)
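A rough sketch of the global-diff / sub-string matching idea: before charging an author for removed code, check whether the removed block simply reappears elsewhere in the code base. The 0.9 similarity cut-off and the 30-character floor are illustrative assumptions.

import difflib

def moved_not_deleted(removed_block, other_files, min_len=30, min_ratio=0.9):
    """removed_block: text deleted from one file in this check-in.
    other_files: iterable of (path, new content) for the rest of the code base.
    Returns the path the block appears to have moved to, or None if it was
    really deleted (and the usual reputation penalty should apply)."""
    if len(removed_block) < min_len:                      # ignore trivial fragments
        return None
    for path, content in other_files:
        if removed_block in content:                      # exact sub-string match
            return path
        sm = difflib.SequenceMatcher(None, removed_block, content)
        match = sm.find_longest_match(0, len(removed_block), 0, len(content))
        if match.size >= min_ratio * len(removed_block):  # near-complete fuzzy match
            return path
    return None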

Page 29: False Positives (2.2)


[Diagram: the content block being moved is checked against the destination doc.'s history (A1 – V1, A2 – V2, A3 – V3); the same block appears 3 edits ago]

Page 30: False Positives (2.3)


[Diagram: TRANSCLUSION! The new doc. includes the old doc.'s section text via a {{sect}} reference ("text {{sect}} text"), so section authorship (A1, A2, A3) carries over]

Page 31: False Positives (3)

REVERT CHAINS cause big penalties:

[Diagram: V0 → V1 (+++ BIG CODE CHANGES) → V2 ("Revert: Needs testing first") → V3 (testing done; the BIG CODE CHANGES are re-applied); the V0/V2 and V1/V3 pairs are identical or nearly so]

Consequences: At V2, A1 loses reputation (a NON-FATAL). At V3, A2 is wrongly punished.

Solution: Revert chains are rare; manual inspection?

Page 32: False Positives (4)

• Initially 30 false-positive cases
  – If the "solutions" above were implemented, this number would be just 10
  – Suggesting accuracies of 80-90%
• And those 10 cases?
  – Benign code evolution
  – Feature requests; method deprecation; no fault
• Results similar for [ruby] and [httpd]

Page 33: Better Evaluation

• The proof-of-concept (POC) evaluation is lacking in many ways
  – Not enough examples. Subjective.
  – Says nothing about true negatives
• Bug attribution is extremely difficult
  – Corpus: "X erred at rev. Y with severity {L,M,H}"
  – If it could be automated, the problem would be solved!
  – Work backwards from Bugzilla? Developers?
  – Reputation as a predictor of future loss events.
• Qualitative instead of quantitative measures

Page 34: Other Optimization

• Lots of free variables, weights, and ceilings, e.g., how code is canonicalized and tokenized (a sketch follows the example):

Canonical code:
  // this is a loop
  for(int i=0;i<10;i++) print("Some text");
    →
  for ( int i = 0 ; i < 10 ; i++ ){
    print( "" );
  }

Tokenization:
  for ( int i = 0; i < 10; i++ ){ print( "" );}
    →
  for ( int i = 0 ; i < 10 ; i++ ){
    print( "" );
  }
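One possible canonicalization pass for those free variables, as a hedged sketch: strip comments, blank out string literals, and reduce the result to a whitespace-insensitive token stream, so purely cosmetic edits cost nobody reputation. The regexes are deliberately simplistic (they ignore, e.g., comment markers inside strings).

import re

def canonicalize(source):
    source = re.sub(r"//[^\n]*", "", source)                 # drop line comments
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.S)    # drop block comments
    source = re.sub(r'"(?:\\.|[^"\\])*"', '""', source)      # empty out string literals
    tokens = re.findall(r'""|[A-Za-z_]\w*|\d+|\S', source)   # whitespace-insensitive tokens
    return " ".join(tokens)

print(canonicalize('// this is a loop\nfor(int i=0;i<10;i++) print("Some text");'))
# -> for ( int i = 0 ; i < 10 ; i + + ) print ( "" ) ;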

Page 35: USE-CASES & CONCLUSIONS

Page 36: Use-case: Small Projects

• Small/non-production projects
  – Conflict, not just tokens!
• Undergraduate research
  – Who did all the work?
• Academic paper repositories
  – Automatic author order!
• Collaboration or conflict?
  – Graph of reputation events (see the diagram and sketch below)

[Diagram: graph of reputation events among authors A, B, C, D; Faction #1 (A, B) and Faction #2 (C, D) exchange positive events within each faction and negative events across factions]
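A small sketch of the "graph of reputation events" idea: aggregate the signed deltas each editor's edits cause for every other editor's reputation, then look for groups that approve of themselves but consistently undo each other. The event tuples below are hypothetical.

from collections import defaultdict

def reputation_graph(events):
    """events: iterable of (judging editor, judged author, reputation delta).
    Returns net signed edge weights between editors."""
    edges = defaultdict(float)
    for judge, author, delta in events:
        edges[(judge, author)] += delta
    return edges

events = [("A", "B", +4.0), ("B", "A", +2.5),    # faction #1 preserves its own content
          ("C", "D", +3.0), ("D", "C", +1.0),    # faction #2 preserves its own content
          ("A", "C", -5.0), ("D", "B", -6.0)]    # the factions keep undoing each other
graph = reputation_graph(events)
conflict_pairs = [pair for pair, weight in graph.items() if weight < 0]
print(conflict_pairs)   # cross-faction pairs with net-negative judgments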

Page 37: Use-cases (2)

MEDIAWIKI
• Alert service/warnings (anti-vandal style)
• Expediting code review
• Permission granting/revocation

Page 38: Use-cases (2), cont.

VEHICLEFORGE.MIL
• Access control for users/commits
• Wrap content-persistent reputation with metadata features for a stronger classifier [6] (a sketch follows)
• Robustness considerations (i.e., reachability)
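A hedged sketch of wrapping the content-persistent reputation score with metadata features, in the spirit of [6]; the feature set, the tiny training sample, and the use of scikit-learn are all illustrative assumptions, not the actual VehicleForge design.

from sklearn.linear_model import LogisticRegression

# One row per check-in: [author reputation, lines changed, files touched,
# commit-comment length]; label 1 = later judged unconstructive, 0 = constructive.
X = [[12.05,   40,  2, 35],
     [ 0.50,  900, 14,  4],
     [1000.1,  12,  1, 60],
     [ 0.00,  300,  7,  0]]
y = [0, 1, 0, 1]

classifier = LogisticRegression().fit(X, y)
risk = classifier.predict_proba([[5.0, 250, 6, 10]])[0][1]   # P(unconstructive) for a new check-in
print(risk)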


Page 39: Conclusions

• Despite high(er) barriers to entry, bad things still happen in production repositories!
• Content persistence is a reasonably accurate way to identify these instances ex post facto
• False positives highlight what makes code unique:
  1. Non-functional aspects are non-trivial (whitespace, comments)
  2. Inter-document reorganization is common
  3. Quality assurance is more than surface-level
• Evaluation needs to be more rigorous
• A variety of use-cases if it becomes production-ready

Page 40: References

[1] Lohr, Steve. “Pentagon Pushes Crowdsourced Manufacturing”. New York Times “Bits Blog”. April 5, 2012.

[2] Adler, B.T. and L. de Alfaro. “A Content-Driven Reputation System for Wikipedia”. In WWW 2007: Proc. of the 16th Intl. World Wide Web Conference.

[3] Adler, B.T., et al. “Measuring Author Contributions to Wikipedia”. In WikiSym 2008: Proc. of the 3rd Intl. Symposium on Wikis and Open Collaboration.

[4] WikiTrust online. http://www.wikitrust.net/

[5] Mediawiki SVN. http://svn.wikimedia.org/viewvc/mediawiki/ (note: this is an archive of that resource; Git is the currently used repository software)

[6] Adler, B.T. et al. “Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features”. In CICLing 2011: Proc. of the 12th Intl. Conference on Intelligent Text Processing and Computational Linguistics.

[Ø] Mediawiki Developer Hub. http://www.mediawiki.org/wiki/Developer_hub