A Case for an Open Source Data Repository
Archana Ganapathi
Department of EECS, UC Berkeley ([email protected])
Why do we study failure data?
Understand cause->effect relationship between configurations and system behavior
Still don’t have a complete understanding of failures in systems
  Can’t worry about fixing problems if we don’t understand them in the first place
Gauge behavioral changes over time
Need realistic workload/faultload data to test/evaluate systems
Success stories…people have benefited from failure data analysis
Crash data collection success stories
Berkeley EECS
BOINC
2 Unnamed Companies
…So Why Does Windows Crash?
Definitions
Crash
  Event caused by a problem in the operating system (OS) or application (app).
  Requires OS or app restart.
Application Crash
  A crash occurring at user-level, caused by one or more components (.exe/.dll files).
  Requires an application restart.
Application Hang
  An application crash caused as a result of the user terminating a process that is potentially deadlocked or running an infinite loop.
  Component (.exe/.dll file routine) causing the loop/deadlock cannot be identified (yet).
OS Crash
  A crash occurring at kernel-level, caused by memory corruption, bad drivers or faulty system-level routines.
  Blue-screen-generating crashes require a machine reboot.
  Windows explorer crashes require restarting the explorer process.
Bluescreen
  An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.
Procedure
Collect crash dumps from two different sources
  UC Berkeley EECS department
  BOINC volunteers
Filter data/form crash clusters to avoid double-counting
  Account for shared resources, dependent processes, system instability, user retry
Parse/interpret crash dumps using Debugging Tools for Windows
Study both application crash behavior and operating system crashes
Supplement crash data with usage data
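The filtering step in the procedure can be sketched as follows. This is a minimal illustration, not the study's actual method: it assumes each crash record carries a machine identifier and a timestamp, and it merges crashes on the same machine that occur within a hypothetical 10-minute window of the previous crash, so that a user retrying a failing application is counted once. The record layout, the window length and the sample data are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical window: crashes on one machine closer together than this
# are treated as one crash cluster (e.g. user retry, lingering instability).
WINDOW = timedelta(minutes=10)

def cluster_crashes(crashes, window=WINDOW):
    """Group (machine, timestamp) crash records into per-machine clusters.

    A crash joins the current cluster when it occurs within `window`
    of the previous crash on the same machine; otherwise it starts a new one.
    """
    clusters = []
    last_seen = {}  # machine -> (timestamp of latest crash, its cluster)
    for machine, ts in sorted(crashes, key=lambda c: (c[0], c[1])):
        prev = last_seen.get(machine)
        if prev is not None and ts - prev[0] <= window:
            cluster = prev[1]
            cluster.append((machine, ts))
        else:
            cluster = [(machine, ts)]
            clusters.append(cluster)
        last_seen[machine] = (ts, cluster)
    return clusters

# Made-up sample records for illustration only.
sample = [
    ("pc-01", datetime(2004, 7, 1, 9, 0)),
    ("pc-01", datetime(2004, 7, 1, 9, 4)),   # retry shortly after the first crash
    ("pc-01", datetime(2004, 7, 1, 14, 30)),
    ("pc-02", datetime(2004, 7, 1, 9, 2)),
]
clusters = cluster_crashes(sample)
print(len(clusters), "clusters from", len(sample), "crash records")
```

A time window is only one possible criterion; accounting for shared resources or dependent processes would need extra fields such as the faulting component.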
EECS Dataset
Type of User    Number of Users    Number of Crashes
graduate student 30 621
staff 28 414
unknown 16 197
faculty 14 191
undergraduate 4 19
visitor 3 51
guest 1 9
postdoc 1 19
TOTAL 97 1521
Crashes reported per month
Number of Crashes per Month
Month              # crashes
Jun 14-30, 2004    54
Jul 1-31, 2004     204
Aug 1-31, 2004     248
Sep 1-30, 2004     282
Oct 1-31, 2004     320
Nov 1-30, 2004     220
Dec 1-31, 2004     184
Jan 1-31, 2005     201
Feb 1-28, 2005     191
Mar 1-31, 2005     237
Apr 1-14, 2005     113
Usage/Crashes per day of week
[Figure: Percentage of Computer Users per Day of Week. x-axis: day of week (Monday through Sunday); y-axis: % users.]
Number of Crashes per Day of Week
Day of Week    # crashes
Monday         446
Tuesday        371
Wednesday      482
Thursday       410
Friday         297
Saturday       116
Sunday         132
EECS department users use their EECS computers Monday through Friday.
Few users use computers on weekends.
Crashes do not occur uniformly across the five days of the working week.
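One way to check the claim about non-uniform weekday crashes is a chi-square goodness-of-fit test. The sketch below assumes the seven bar labels in the per-day crash chart (446, 371, 482, 410, 297, 116, 132) run Monday through Sunday in order, and it uses only the five weekday counts. It looks at crash counts alone and does not normalize by how many users were active each day, so it is an illustration rather than the authors' analysis. The value 9.488 is the standard 5% critical value for 4 degrees of freedom.

```python
# Weekday crash counts, assuming the chart labels are in Monday..Friday order.
weekday_crashes = {"Mon": 446, "Tue": 371, "Wed": 482, "Thu": 410, "Fri": 297}

def chi_square_uniform(counts):
    """Chi-square statistic against the hypothesis of equal counts per category."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)

statistic = chi_square_uniform(list(weekday_crashes.values()))
CRITICAL_5PCT_DF4 = 9.488  # chi-square critical value, alpha = 0.05, 4 degrees of freedom

print(f"chi-square = {statistic:.1f}")
print("uniform hypothesis rejected:", statistic > CRITICAL_5PCT_DF4)
```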
Usage/Crashes per hour of day
[Figure: Percentage of Computer Users per Hour of Day. x-axis: hour of day (12am through 11pm, in one-hour bins); y-axis: % users.]
Number of Crashes per Hour of Day
Hour of Day          # crashes
12:00am-12:59am      21
1:00am-1:59am        13
2:00am-2:59am        4
3:00am-3:59am        2
4:00am-4:59am        4
5:00am-5:59am        2
6:00am-6:59am        8
7:00am-7:59am        26
8:00am-8:59am        114
9:00am-9:59am        161
10:00am-10:59am      164
11:00am-11:59am      171
12:00pm-12:59pm      167
1:00pm-1:59pm        214
2:00pm-2:59pm        196
3:00pm-3:59pm        208
4:00pm-4:59pm        204
5:00pm-5:59pm        139
6:00pm-6:59pm        107
7:00pm-7:59pm        95
8:00pm-8:59pm        87
9:00pm-9:59pm        58
10:00pm-10:59pm      43
11:00pm-11:59pm      46
Most people work during the typical hours of 9am to 5pm.
Our data set involves users of various affiliations to the department, hence the wider spectrum of work schedules
Reboot Frequency
[Figure: Percentage of Users Rebooting their Computer at Specified Frequency. x-axis: interval (days), with ticks at 1, 2.5, 5, 7, 10, 14, 30, 60, 365; y-axis: percentage of users, 0 to 30.]
Automatic Clustering Experiment for Categorizing Apps
Augment the crash data with information about usage patterns and program dependencies
Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related
We determined that we did not have enough data to derive a method to categorize applications in our data set
  Need several instances of every (application, component, error code) combo
As a last resort, we chose to categorize apps based on application functionality
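For readers unfamiliar with the clustering step, here is a minimal k-means sketch in plain Python. It is not the code used in the experiment: the application names, the two features (hours used per week, crashes per week) and their values are invented for illustration, and the centroids are seeded with the first k points only to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on equal-length numeric tuples.

    Centroids start at the first k points to keep the sketch deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

# Hypothetical features per application: (hours used per week, crashes per week).
apps = {
    "browser_a": (20.0, 6.0),
    "browser_b": (18.0, 5.0),
    "editor_a": (5.0, 0.5),
    "editor_b": (6.0, 1.0),
}
labels = kmeans(list(apps.values()), k=2)
for name, label in zip(apps, labels):
    print(name, "-> cluster", label)
```

With only a handful of observations per application, clusters like these are unstable, which is consistent with the slide's conclusion that the data set was too small for automatic categorization.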
Crash Cause by Application Category
Application Category # Crashes Crash % Usage %
web browsing 598 41% 18%
unknown 185 13% n/a
document preparation 152 11% 22%
email 130 9% 24%
scientific computing 95 7% 7%
document viewer 84 6% 8%
multimedia 57 4% 6%
code development 26 2% 10%
document archiving 23 2% n/a
remote connection 23 2% n/a
instant messaging 17 1% n/a
i/o 15 1% n/a
other 14 1% 1%
database 8 1% n/a
system management 8 1% 4%
security 7 0% n/a
Application Hang vs Crashes due to Faulty Component
Crash Cause
application hang 48%
faulty component 52%
Which applications hang?
Application # hangs % hangs % Running Total
iexplore.exe 185 25% 25%
matlab.exe 68 9% 34%
winword.exe 67 9% 43%
outlook.exe 60 8% 51%
firefox.exe 47 6% 57%
netscape.exe 41 6% 63%
unknown 25 3% 66%
powerarc.exe 19 3% 69%
powerpnt.exe 13 2% 71%
thunderbird.exe 13 2% 73%
excel.exe 12 2% 75%
acrobat.exe 11 1% 76%
explorer.exe 11 1% 77%
mozilla.exe 11 1% 78%
acrord32.exe 10 1% 79%
msimn.exe 10 1% 80%
AdDestroyer.exe 7 1% 81%
wmplayer.exe 7 1% 82%
notepad.exe 6 1% 83%
rundll32.exe 5 1% 84%
hp precisionscan pro.exe 4 1% 85%
mathematica.exe 4 1% 86%
msaccess.exe 4 1% 87%
msdev.exe 4 1% 88%
photosle.exe 4 1% 89%
winamp.exe 4 1% 90%
apps causing <1% of crashes each 84 11% 101%
Total 736
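The running total in the table above ends at 101%, which appears to come from adding up per-row percentages that were already rounded. The sketch below recomputes the percentage and running-total columns from the raw hang counts in the table; the key "other" is shorthand introduced here for the "apps causing <1% of crashes each" row.

```python
# Hang counts copied from the table above; "other" stands for the
# "apps causing <1% of crashes each" row.
hang_counts = {
    "iexplore.exe": 185, "matlab.exe": 68, "winword.exe": 67, "outlook.exe": 60,
    "firefox.exe": 47, "netscape.exe": 41, "unknown": 25, "powerarc.exe": 19,
    "powerpnt.exe": 13, "thunderbird.exe": 13, "excel.exe": 12, "acrobat.exe": 11,
    "explorer.exe": 11, "mozilla.exe": 11, "acrord32.exe": 10, "msimn.exe": 10,
    "AdDestroyer.exe": 7, "wmplayer.exe": 7, "notepad.exe": 6, "rundll32.exe": 5,
    "hp precisionscan pro.exe": 4, "mathematica.exe": 4, "msaccess.exe": 4,
    "msdev.exe": 4, "photosle.exe": 4, "winamp.exe": 4, "other": 84,
}

def running_totals(counts):
    """Return (name, count, percent, cumulative percent) rows from raw counts."""
    total = sum(counts.values())
    rows, cumulative = [], 0
    for name, count in counts.items():
        cumulative += count
        rows.append((name, count, 100 * count / total, 100 * cumulative / total))
    return rows

rows = running_totals(hang_counts)
for name, count, pct, cum in rows[:5]:
    print(f"{name:15s} {count:4d} {pct:5.1f}% {cum:5.1f}%")
```

Accumulating the raw counts and rounding only for display keeps the final running total at exactly 100%.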
Which components cause crashes?
Component | Description | Author | Apps invoking component | % crash
ntdll.dll | NT system functions | MS | Internet Explorer, Matlab | 11% (86)
msvcrt.dll | Microsoft C runtime library | MS | Acrobat, Netscape | 5% (37)
acrord32.exe | Acrobat Reader | 3rd party | Acrobat Reader | 4% (29)
pdm.dll | Scripting component functions | MS | Visual Studio, Internet Explorer | 3% (23)
firefox.exe | Web browser | 3rd party | Firefox | 2% (19)
user32.dll | Communication, message handler, timer functions | MS | Firefox, Internet Explorer | 2% (17)
ray_tracing.exe | User application | 3rd party | -- | 2% (16)
winword.exe | Windows document editor | MS | Word, Outlook | 2% (15)
mshtml.dll | HTML related functions | MS | Internet Explorer, Netscape | 2% (15)
tempest.exe | Unknown | 3rd party | -- | 2% (15)
gklayout.dll | Mozilla layout library | 3rd party | Thunderbird, Firefox | 2% (14)
kernel32.dll | Microsoft memory management, I/O and interrupts library | MS | Acrobat, Firefox, Internet Explorer | 2% (14)
simpl_fox_gl.exe | User application | 3rd party | -- | 2% (14)
rpcl3260.dll | Real Player component | 3rd party | Real Player | 2% (13)
thunderbird.exe | Mozilla e-mail program | 3rd party | Thunderbird | 2% (13)
BOINC
http://winerror.cs.berkeley.edu/crashcollection/
Berkeley Open Infrastructure for Network Computing
Users download boinc client app
Crash dumps are scraped/sent to boinc servers
Currently 791 accounts created for crash collection + resource management
492 users for crash collection
OS Crashes
Driver faults
  asynchronous events
  code must follow kernel programming etiquette
  exceedingly difficult to debug
Memory corruption
  Hardware problems (e.g. non-ECC mem)
  Software-related
  47 of these in our dataset so far…don’t have tools to analyze these in detail
OS crash causing images (based on 150 boinc users, 562 crashes)
Image Name | Image Description | Num crashes | % crashes | % Running Total
ntoskrnl.exe | NT kernel and system | 150 | 27% | 27%
GDFSHK.SYS | McAfee Privacy Service File Guardian | 42 | 8% | 35%
ALCXWDM.SYS | Windows (R) WDM driver for Realtek AC'97 | 40 | 7% | 42%
kmixer.sys | kernel audio mixer of Microsoft Windows | 28 | 5% | 47%
win32k.sys | multi user win32 driver | 19 | 3% | 50%
ati3d2ag.dll | ATI Technologies Inc. Radeon DirectX Universal Driver | 18 | 3% | 53%
Brwgate.sys | NAT/Proxy/Firewall system | 16 | 3% | 56%
HSF_CNXT.sys | Conexant Systems SoftK or SoftK56 Modem Driver | 10 | 2% | 58%
Ialmdev5.DLL | Intel graphics driver | 10 | 2% | 60%
ati2dvag.dll | ATI Radeon WinNT display driver | 8 | 1% | 61%
nv4_disp.dll | NVIDIA Compatible Windows 2000 display driver | 8 | 1% | 62%
V7.SYS | IBM V7 Driver for Windows NT/2000 | 8 | 1% | 63%
usbscan.sys | Microsoft usb driver | 7 | 1% | 64%
ALCXSENS.SYS | Windows (R) WDM driver for Realtek AC'97 | 6 | 1% | 65%
ar5211.sys | driver for dual band WIFI wireless mini pci adapter | 6 | 1% | 66%
pcx500.sys | NDIS5.1 Miniport Driver for 32 bit Windows | 6 | 1% | 67%
Unknown_Image | -- | 6 | 1% | 68%
ati3duag.dll | ATI Technologies Inc. Radeon DirectX Universal Driver | 5 | 1% | 69%
AVGNTDD.SYS | Filter Device for Windows XP/2000/NT | 5 | 1% | 70%
nv4_mini.sys | NVIDIA Compatible Windows 2000 Miniport Driver | 5 | 1% | 71%
Crash generating driver fault type
Driver Fault Type Num Crashes
PAGE FAULT IN NONPAGED AREA 118
IRQL NOT LESS OR EQUAL 105
KERNEL MODE EXCEPTION NOT HANDLED 67
UNEXPECTED KERNEL MODE TRAP 63
BAD POOL CALLER 46
THREAD STUCK IN DEVICE DRIVER 36
SYSTEM THREAD EXCEPTION NOT HANDLED 29
Unknown bugcheck code 16
Other (each caused 1 crash) 14
PFN LIST CORRUPT 13
DRIVER CORRUPTED EXPOOL 12
DRIVER UNLOADED WITHOUT CANCELLING PENDING OPERATIONS 8
MANUALLY INITIATED CRASH 5
File Corruption - Unreadable File 4
BAD POOL HEADER 4
KERNEL DATA INPAGE ERROR 4
NTFS FILE SYSTEM 4
CRITICAL OBJECT TERMINATION 3
FAT FILE SYSTEM 3
DRIVER POWER STATE FAILURE 2
KERNEL STACK INPAGE ERROR 2
MEMORY MANAGEMENT 2
MULTIPLE IRP COMPLETE REQUESTS 2
Summary of crash analysis
Application crashes are caused by both faulty, non-robust dll files and impatient users
OS crashes are predominantly caused by poorly-written device driver code
Commonly used core components are blamed for most crashes
  Need to improve reliability of these components
Practical techniques to reduce crashes
Software-Based Fault Isolation
Nooks
Separate protection level for drivers
Move driver code to user libraries
Virtual Machine for each unsafe/distrusted app
Lessons from crash data study
Clearly people want to know what’s wrong and how to fix it
The more feedback we give, the more data sets we receive
...but it’s not as easy as it sounds
What kinds of data should we collect?
Failure data
Configuration information
Logs of normal behavior
Usage data
Performance logs
Annotations of data
Collect data for Individual Machines + Services
Why are people afraid of sharing data?
Fear of public humiliation (reverse engineering what user was doing)
Revealing problems within their organization
Fear of competitors using data against them
Revealing loopholes through which malware can easily propagate
Revealing dependability problems in third party products (MS)
Non-technical challenges to getting data
Collecting (useful) data is tedious
  What information is “necessary and sufficient” to understand data trends?
Privacy concerns
  Especially with usage data
Finding the person with access to data
  No central location that can be queried for data
Legal agreements take a long time to draft
  Researchers are more willing to share data than lawyers
Publicity
Technical solution
Amortize the cost of data collection by building an open source repository
Provide a set of tools to cleanse and mine the data
What tools should we implement?
Collect
  BOINC
  Instrumentation (MS, Pinpoint)
  Pre-aggregated data from companies
Anonymize/Preprocess
  Pre-written anonymization tools
  Company-specific privacy requirements
  Hash values of certain fields
  Drop irrelevant fields
  Mask part of data
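The three anonymization operations named on this slide (hash, drop, mask) can be sketched as below. This is an illustration under stated assumptions, not a tool from the project: the field names, the sample record, the secret key and the choice of keyed SHA-256 are all hypothetical. Hashing names with a key keeps records from the same user or machine correlatable while making it harder to recover the original names by guessing.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; keying the hash prevents
# outsiders from recomputing hashes of guessed user or machine names.
SECRET_KEY = b"replace-with-a-private-key"

HASH_FIELDS = {"user", "machine"}   # keep correlatable, hide identity
DROP_FIELDS = {"window_title"}      # assumed irrelevant to the analysis
MASK_FIELDS = {"ip_address"}        # keep only a coarse prefix

def pseudonym(value):
    """Keyed SHA-256 digest, truncated to 16 hex characters for readability."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def mask_ip(address):
    """Keep the first two octets of an IPv4 address and mask the rest."""
    octets = address.split(".")
    return ".".join(octets[:2] + ["x", "x"])

def anonymize(record):
    """Return a copy of `record` with fields hashed, dropped, or masked."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in HASH_FIELDS:
            cleaned[field] = pseudonym(value)
        elif field in MASK_FIELDS:
            cleaned[field] = mask_ip(value)
        else:
            cleaned[field] = value
    return cleaned

# Made-up crash record for illustration.
record = {
    "user": "alice",
    "machine": "eecs-pc-17",
    "ip_address": "192.0.2.44",
    "window_title": "draft.doc - Word",
    "component": "ntdll.dll",
}
clean = anonymize(record)
print(clean)
```

Company-specific privacy requirements would change which fields land in each set; the three sets are the only configuration this sketch exposes.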
Tools cont’d
Store
  Open-source repository schema
  Common log format/data descriptor headers
  Tools to convert log metadata to common format to cross-link data tables
  Sample queries: data mining ~ asking questions about data as it is
Analyze/Experiment
  SLT algorithms
  Visualization
  Stream processing
  Other tools (e.g. WinDbg)
Thoughts on Collection/Anonymization
Defining necessary and sufficient
  Bad example: Cannot correlate crashes if we get rid of all user/machine names
  Good example: Hash user/machine names
Default: hide if not necessary?
What would it take for you not to invoke the legal dept?
Thoughts on Storage/Analysis
Use time/data source as primary key?
How domain-specific should the common format be?
Management logistics… Access control…
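To make the primary-key question concrete, here is a small sketch using SQLite. The table name, column names and sample rows are hypothetical, and adding the hashed machine identifier to the key is an assumption made here: time and data source alone would collide whenever two machines at one site report an event in the same second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        event_time  TEXT NOT NULL,   -- ISO 8601 timestamp
        data_source TEXT NOT NULL,   -- contributing site or collector
        machine_id  TEXT NOT NULL,   -- hashed machine name
        event_type  TEXT NOT NULL,   -- crash, hang, usage sample, ...
        detail      TEXT,
        PRIMARY KEY (event_time, data_source, machine_id)
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("2004-07-01T09:00:00", "eecs", "a1b2", "crash", "ntdll.dll"),
    ("2004-07-01T09:00:00", "boinc", "c3d4", "crash", "ntoskrnl.exe"),
    ("2004-07-01T09:05:00", "eecs", "a1b2", "hang", "iexplore.exe"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

# Cross-source query: number of events contributed by each data source.
per_source = dict(
    conn.execute("SELECT data_source, COUNT(*) FROM events GROUP BY data_source")
)
print(per_source)
```

A domain-specific format would add columns such as bugcheck code or faulting component; a generic one would push those into `detail`, which is the trade-off the slide raises.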
Acronym Suggestions???
Open Source (Failure) Data Repository