Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering
Transcript of Software Analytics: Data Analytics for Software Engineering
Software Analytics:
Data Analytics for Software EngineeringAchievements and Opportunities
Tao XieDepartment of Computer Science
University of Illinois at Urbana-Champaign, USA
In Collaboration with Microsoft Research; Most Slides Adapted from Slides Co-Presented with
Dongmei Zhang (Microsoft Research, http://research.microsoft.com/en-us/groups/sa/)
New Era…Software itself is changing...
Software Services
How people use software is changing…
Individual Isolated
Not much data/content
generation
How people use software is changing…
How people use software is changing…
Individual Isolated
Not much data/content
generation
How people use software is changing…
Individual
Social
Isolated
Not much data/content
generation
Collaborative
Huge amount of data/artifacts
generated anywhere anytime
How software is built & operated is changing…
How software is built & operated is changing…
Data pervasive
Long product cycle
Experience & gut-feeling
In-lab testing
Informed decision making
Centralized development
Code centric
Debugging in the large
Distributed development
Continuous release
… …
How software is built & operated is changing…
Data pervasive
Long product cycle
Experience & gut-feeling
In-lab testing
Informed decision making
Centralized development
Code centric
Debugging in the large
Distributed development
Continuous release
… …
Software Analytics
Software analytics is to enable software
practitioners to perform data exploration and
analysis in order to obtain insightful and
actionable information for data-driven tasks
around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software
Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Software Analytics
Software analytics is to enable software
practitioners to perform data exploration and
analysis in order to obtain insightful and
actionable information for data-driven tasks
around software and services.
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Research topics – the trinity view
Software
Users
Software
Development
Process
Software
System
• Covering different areas of
software domain
• Throughout entire development
cycle
• Enabling practitioners to obtain
insights
Data sources
Runtime traces
Program logs
System events
Perf counters
…
Usage log
User surveys
Online forum posts
Blog & Twitter
…
Source code
Bug history
Check-in history
Test cases
Eye tracking
MRI/EMG
…
Target audience – software practitioners
Target audience – software practitioners
Developer
Tester
Target audience – software practitioners
Developer
Tester
Program Manager
Usability engineer
Designer
Support engineer
Management personnel
Operation engineer
Output – insightful information
• Conveys meaningful and useful understanding or
knowledge towards completing the target task
• Not easily attainable via directly investigating raw data
without aid of analytics technologies
• Examples
– It is easy to count the number of re-opened bugs, but how to
find out the primary reasons for these re-opened bugs?
– When the availability of an online service drops below a
threshold, how to localize the problem?
Output – actionable information
• Enables software practitioners to come up with concrete
solutions towards completing the target task
• Examples
– Why bugs were re-opened?
• A list of bug groups each with the same reason of re-
opening
– Why availability of online services dropped?
• A list of problematic areas with associated confidence values
Research topics & technology pillars
Software
Users
Software
Development
Process
Software
System Vertical
Horizontal
Information Visualization
Data Analysis Algorithms
Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software
development practice
• Getting real
Connection to practice
• Software Analytics is naturally tied with software
development practice
• Getting real
Real
Data
Real
Problems
Real
Users
Real
Tools
Outline
• Software Engineering for Big Data
• Overview of Software Analytics
• Selected Projects
– StackMine: Performance debugging in the large
– XIAO: Scalable code clone analysis
– SAS: Incident management of online services
– Code Hunt/Pex4Fun: Teaching & learning via interactive gaming
StackMinePerformance debugging in the large
via mining millions of stack traces
http://research.microsoft.com/en-us/groups/sa/stackmine_icse2012.pdf
Performance issues in the real world• One of top user complaints
• Impacting large number of users every day
• High impact on usability and productivity
High Disk I/O High CPU consumption
Performance issues in the real world• One of top user complaints
• Impacting large number of users every day
• High impact on usability and productivity
High Disk I/O High CPU consumption
As modern software systems tend to get more and more complex, given limited time and resource before software release, development-site testing and debugging become more and more insufficient to ensure satisfactory software performance.
Performance debugging in the large
Performance debugging in the large
Trace StorageTrace collection
Network
Performance debugging in the large
Trace StorageTrace collection
Network
Trace analysis
Performance debugging in the large
Trace StorageTrace collection
Bug DatabaseNetwork
Trace analysis
Bug filing
Performance debugging in the large
Trace StorageTrace collection
Problematic Pattern
Repository Bug DatabaseNetwork
Trace analysis
Bug filing
Performance debugging in the large
Pattern Matching
Trace StorageTrace collection
Bug update
Problematic Pattern
Repository Bug DatabaseNetwork
Trace analysis
Bug filing
Performance debugging in the large
Pattern Matching
Trace StorageTrace collection
Bug update
Problematic Pattern
Repository Bug DatabaseNetwork
Trace analysis
Bug filing
Key to issue
discovery
Performance debugging in the large
Pattern Matching
Trace StorageTrace collection
Bug update
Problematic Pattern
Repository Bug DatabaseNetwork
Trace analysis
Bug filing
Key to issue
discovery
Bottleneck of
scalability
Performance debugging in the large
Pattern Matching
Trace StorageTrace collection
Bug update
Problematic Pattern
Repository Bug DatabaseNetwork
Trace analysis
How many issues are
still unknown?
Bug filing
Key to issue
discovery
Bottleneck of
scalability
Performance debugging in the large
Pattern Matching
Trace StorageTrace collection
Bug update
Problematic Pattern
Repository Bug DatabaseNetwork
Trace analysis
How many issues are
still unknown?
Which trace file should I
investigate first?
Bug filing
Key to issue
discovery
Bottleneck of
scalability
Problem definition
Given OS traces collected from tens of thousands
(potentially millions) of users,
help domain experts identify impactful program execution
patterns
(that cause the most impactful underlying performance problems)
with limited time and resource.
Challenges
Combination of expertise• Generic machine learning tools without domain
knowledge guidance do not work well
Highly complex analysis• Numerous program runtime combinations triggering
performance problems
• Multi-layer runtime components from application to
kernel being intertwined
Large-scale trace data• TBs of trace files and increasing
• Millions of events in single trace streamInternet
Technical highlights
• Machine learning for system domain
– Formulate the discovery of problematic execution patterns as
callstack mining & clustering
– Systematic mechanism to incorporate domain knowledge
• Interactive performance analysis system
– Parallel mining infrastructure based on HPC + MPI
– Visualization aided interactive exploration
Impact“We believe that the MSRA tool is highly valuable and much more
efficient for mass trace (100+ traces) analysis. For 1000 traces, we
believe the tool saves us 4-6 weeks of time to create new signatures,
which is quite a significant productivity boost.”
Highly effective new issue discovery on Windows
mini-hang
Continuous impact on future Windows
versions
XIAOScalable code clone analysis
2012
http://research.microsoft.com/jump/175199
XIAO: Code Clone Analysis
• Motivation
– Copy-and-paste is a common developer behavior
– A real tool widely adopted internally and externally
• XIAO enables code clone analysis in the following way
– High tunability
– High scalability
– High compatibility
– High explorability
High tunability – what you tune is what you get
• Intuitive similarity metric: effective control of the
degree of syntactical differences between two code
snippets
for (i = 0; i < n; i ++) {
a ++;
b ++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c; }
for (i = 0; i < n; i ++) {
c = foo(a, b);
a ++;
b ++;
d = bar(a, b, c);
e = a + d;
e ++; }
High scalability• Four-step analysis process
• Easily parallelizable based on source code partition
Pre-processing
Pruning Fine Matching
Coarse Matching
High compatibility
• Compiler independent
• Light-weight built-in parsers for C/C++ and C#
• Open architecture for plug-in parsers to support
different languages
• Easy adoption by product teams
– Different build environment
– Almost zero cost for trial
High explorability
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
1 2 3 4 5 6
7
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
1
2
3
4
1
6
5
Scenarios & SolutionsQuality gates at milestones• Architecture refactoring
• Code clone clean up
• Bug fixing
Post-release maintenance• Security bug investigation
• Bug investigation for sustained engineering
Development and testing• Checking for similar issues before check-in
• Reference info for code review
• Supporting tool for bug triage
Online code clone search
Offline code clone analysis
Benefiting developer community
Available in Visual Studio 2012 RC
Searching similar snippets
for fixing bug onceFinding refactoring
opportunity
More secure Microsoft products
Code Clone Search service integrated into
workflow of Microsoft Security Response Center
Over 590 million lines of code indexed across
multiple products
Real security issues proactively identified and addressed
Example – MS Security Bulletin MS12-034Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published: Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (office), Silver Light, Windows Journal viewer
Microsoft Technet Blog about this bulletin
However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.
SASIncident management of online services
http://research.microsoft.com/apps/pubs/?id=202451
Motivation
Incident Management (IcM) is a critical task to
assure service quality
• Online services are increasingly popular & important
• High service quality is the key
Incident Management: Workflow
Detect a
service
issue
Alert On-
Call
Engineers
(OCEs)
Investigate
the problem
Restore
the
service
Fix root cause
via
postmortem
analysis
Incident Management: Characteristics
Shrink-Wrapped
Software Debugging
Root Cause and Fix
Debugger
Controlled
Environment
Online Service
Incident
Management
Workaround
No Debugger
Live Data
Incident Management: Challenges
Large volume and noisy data
Highly complex problem space
No knowledge of entire system
Knowledge not well organized
SAS: Incident management of online services
SAS, developed and deployed to effectively reduce MTTR
(Mean Time To Restore) via automatically analyzing
monitoring data
5
5
Design Principle of SAS
Automating Analysis
Handling Heterogeneity
Accumulating Knowledge
Supporting human-in-the-loop (HITL)
Techniques Overview
• System metrics
– Identifying Incident Beacons
• Transaction logs
– Mining Suspicious Execution Patterns
• Historical incidents
– Mining Historical Workaround Solutions
Industry Impact of SAS
Deployment
• SAS deployed to
worldwide datacenters for
Service X (serving
hundreds of millions of
users) since June 2011
• OCEs now heavily depend
on SAS
Usage
• SAS helped successfully
diagnose ~76% of the
service incidents assisted
with SAS
Code Hunt/Pex4FunTeaching and learning via interactive gaming
http://research.microsoft.com/pubs/189240/icse13see-pex4fun.pdf
http://pex4fun.com/https://www.codehunt.com/
Code Hunt: Resigned As Gamehttps://www.codehunt.com/
Behind the Scene
Secret Implementation
class Secret {
public static int Puzzle(int
x) {
if (x <= 0) return 1;
return x * Puzzle(x-1);
}
}
Player Implementation
class Player {
public static int Puzzle(int x) {
return x;
}
}
class Test {
public static void Driver(int x) {
if (Secret.Puzzle(x) != Player.Puzzle(x))
throw new Exception(“Mismatch”);
}
}
behaviorSecret Impl == Player Impl
62
Coding Duels: Fun and Engaging
Iterative gameplay
Adaptive
Personalized
No cheating
Clear winning criterion
Teaching and Learning
Coding Duels for Course Assignments@Grad Software Engineering Course
http://pexforfun.com/gradsofteng
Observed Benefits
• Automatic Grading
• Real-time Feedback (for Both Students and Teachers)
• Fun Learning Experiences
Example User Feedback
“It really got me *excited*. The part that got me most
is about spreading interest in teaching CS: I do think
that it’s REALLY great for teaching | learning!”
“I used to love the first person shooters and
the satisfaction of blowing away a whole team
of Noobies playing Rainbow Six, but this is far
more fun.”
“I’m afraid I’ll have to constrain myself to spend just an
hour or so a day on this really exciting stuff, as I’m
really stuffed with work.”
Released since 2010
X
Usage ScenariosStudents: proceed through a sequence on puzzles to learn and practice.
Educators: exercise different parts of a curriculum, and track students’ progress
Recruiters: use contests to inspire communities and results to guide hiring
Researchers: mine extensive data in Azure to evaluate how people code and learn
http://taoxie.cs.illinois.edu/publications/icse15jseet-codehunt.pdfICSE 2015 JSEET:
http://taoxie.cs.illinois.edu/publications/icse13see-pex4fun.pdfICSE 2013 SEE:
Conclusion
Software
Users
Software
Development
Process
Software
System Vertical
Horizontal
Information Visualization
Data Analysis Algorithms
Large-scale Computing
Conclusion
Software
Users
Software
Development
Process
Software
System Vertical
Horizontal
Information Visualization
Data Analysis Algorithms
Large-scale Computing
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Code Hunt Programming Game
Leaderboard and Dashboard
Publically visible, updated during the contest
Visible only to the organizer
It’s a game!
iterative gameplay
adaptive
personalized
no cheating
clear winning criterion
code
test cases