Predicting the winner of C.Y. award
Posted: 21-Dec-2015
1
Predicting the winner of C.Y. award
Advisor: Dr. 黃三益
Team members: 尹川, 陳隆賢, 陳偉聖
2
Introduction
Baseball in Taiwan: CPBL (Chinese Professional Baseball League)
Baseball in the USA: MLB (Major League Baseball)
Cy Young Award
  Awarded since 1956
  Voted on by the Baseball Writers' Association of America
  Decided by weighted scores
  One winner per league per year (a single MLB-wide winner before 1967)
3
Measurements
There are no definite rules for judging the award.
Nevertheless, many measurements can be used to judge whether a pitcher is good or not:
  Wins
  ERA
  WHIP
  G/F
  etc.
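As a rough illustration of two of these measurements, the standard definitions are ERA = 9 x earned runs / innings pitched and WHIP = (walks + hits) / innings pitched. The sketch below computes both for a made-up stat line; the function names and the sample numbers are illustrative assumptions, not data from this study.

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned run average: earned runs allowed per nine innings."""
    return 9.0 * earned_runs / innings_pitched


def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings_pitched


# Hypothetical season line (not a real pitcher): 200 IP, 60 ER, 50 BB, 170 H.
sample_era = era(earned_runs=60, innings_pitched=200.0)
sample_whip = whip(walks=50, hits=170, innings_pitched=200.0)
print(f"ERA  = {sample_era:.2f}")   # 9 * 60 / 200 = 2.70
print(f"WHIP = {sample_whip:.2f}")  # (50 + 170) / 200 = 1.10
```

Lower is better for both values, which is one reason a tree-based classifier can split on them with simple thresholds.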
4
Aim of the study
To analyze the historical statistics of pitchers.
To build a predictive model.
To predict the Cy Young Award winner in future years.
5
Data mining procedure
Ten data mining methodology steps
6
Step 1 : Translate the Problem
Directed data mining problem
  Target variable: Cy Young Award
  Classification
  Decision tree
Purposes
  Gambling game
  Predictive activities
7
Step 2 : Select Appropriate Data
MLB statistics data only (1871 ~ 2006)
  Cy Young Award years: 1956 ~ 2006, 21456 records in total
  List of Cy Young Award winners
“Time” factor
  1999 is the dividing year, because some statistics only appear from that year on.
Variables: remove the items that are not representative of a pitcher.
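The selection described on this slide can be sketched as a filter on year plus a filter on columns. This is only a minimal sketch under assumptions: the record layout, the field names, and the "Jersey" column standing in for an unrepresentative item are all hypothetical, and the values are made up.

```python
# Hypothetical records; field names and values are assumptions,
# not the study's actual schema or data.
records = [
    {"Year": 1955, "W": 10, "ERA": 3.9, "Jersey": 17},
    {"Year": 1956, "W": 12, "ERA": 3.5, "Jersey": 36},
    {"Year": 1998, "W": 14, "ERA": 3.2, "Jersey": 31},
    {"Year": 1999, "W": 15, "ERA": 3.0, "Jersey": 45},
    {"Year": 2006, "W": 11, "ERA": 4.1, "Jersey": 57},
]

AWARD_START, DIVIDING_YEAR, LAST_YEAR = 1956, 1999, 2006
NOT_REPRESENTATIVE = {"Jersey"}  # assumed example of an item unrelated to pitching


def select(rows, first_year, last_year):
    """Keep rows within [first_year, last_year] and drop unrepresentative items."""
    return [
        {k: v for k, v in row.items() if k not in NOT_REPRESENTATIVE}
        for row in rows
        if first_year <= row["Year"] <= last_year
    ]


award_era = select(records, AWARD_START, LAST_YEAR)     # 1956~2006
recent_era = select(records, DIVIDING_YEAR, LAST_YEAR)  # 1999~2006
print(len(award_era), len(recent_era))
```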
8
Step 3 : Get to know the data
All the data we used come from the MLB official site.
These data have been publicly available for many years.
The quality of the data is very good.
Some attributes have values only from 1999 onward.
9
Step 4 : Create a model set
We divide the data into training data and testing data.
We do not create a balanced sample.
The MLB records are not seasonal data.
We also use the subset of data from 1999 onward.
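The slides do not say how the training/testing split was made. One possible approach, shown here purely as an assumption, is a stratified split: without a balanced sample the WINNER class is very rare, so sampling each class separately keeps some winners in both parts. The 30% test fraction, the seed, and the function name are all hypothetical; the class counts are the 1999~2006 totals implied by the confusion matrix in these slides.

```python
import random


def stratified_split(rows, label_key, test_fraction=0.3, seed=0):
    """Split rows into (train, test), sampling each class separately so the
    rare WINNER class shows up in both parts at roughly its natural rate."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Synthetic stand-in with the 1999~2006 class counts:
# 5093 NONWINNER rows and 16 WINNER rows.
rows = [{"label": "NONWINNER"}] * 5093 + [{"label": "WINNER"}] * 16
train, test = stratified_split(rows, "label")
winners_in_test = sum(r["label"] == "WINNER" for r in test)
print(len(train), len(test), winners_in_test)
```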
10
Step 5 : Fix problems with the data
These data are taken from the MLB official site.
  No missing values
  Single source
11
Step 6 : Transform data to bring information to the surface
There are no combinations of attributes.
We delete some attributes.
We add an attribute: Year.
We add an attribute (CyYoungAward_Winner) for classification.
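The two added attributes can be sketched as follows. Only the attribute names Year and CyYoungAward_Winner and the class values WINNER and NONWINNER come from the slides; the row layout, the `Player` and `League` fields, the placeholder pitcher names, and the `winners` lookup are assumptions for illustration.

```python
# Hypothetical winners list keyed by (year, league). Values are sets so that
# a tied year with two winners in one league can be represented.
winners = {(2005, "AL"): {"Pitcher A"}, (2005, "NL"): {"Pitcher B"}}

# Hypothetical rows as they might come from one season's stat table.
season_2005 = [
    {"Player": "Pitcher A", "League": "AL", "W": 21},
    {"Player": "Pitcher C", "League": "AL", "W": 16},
    {"Player": "Pitcher B", "League": "NL", "W": 21},
]


def add_attributes(rows, year, winners):
    """Return copies of rows with Year and CyYoungAward_Winner added."""
    labelled = []
    for row in rows:
        is_winner = row["Player"] in winners.get((year, row["League"]), set())
        labelled.append({
            **row,
            "Year": year,
            "CyYoungAward_Winner": "WINNER" if is_winner else "NONWINNER",
        })
    return labelled


labelled_2005 = add_attributes(season_2005, 2005, winners)
for row in labelled_2005:
    print(row["Player"], row["Year"], row["CyYoungAward_Winner"])
```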
12
Step 7 : Build Models
  Tools Used
  Weka Crash Problem
  Blank Attributes
  Build Model
  Handling Blank Attributes
13
Tools Used
14
Weka Crash Problem
Raw data
  21456 data instances
  42 attributes
Weka crashed during model construction.
Fix: give Weka more memory.
15
Blank Attributes
16
Build Model
MLB 1956~2006, with blank attributes: ADTree
MLB 1956~2006, without blank attributes: ADTree
MLB 1999~2006: ADTree
17
Handling Blank Attributes
18
1956~2006, with blank attributes, ADTree
19
1956~2006, with blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21343      21   NONWINNER
       58      34   WINNER
20
1956~2006, without blank attributes, ADTree
21
1956~2006, without blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21350      14   NONWINNER
       62      30   WINNER
22
1999~2006, ADTree
23
1999~2006, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
     5090       3   NONWINNER
       13       3   WINNER
24
Step 8 : Assess Models (1/2)
Not good enough for gambling
1956~2006, without blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21350      14   NONWINNER
       62      30   WINNER
1999~2006, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
     5090       3   NONWINNER
       13       3   WINNER
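To see why the models fall short, it helps to turn the confusion matrices into accuracy, precision, and recall for the WINNER class. The sketch below does that arithmetic on the counts reported in the slides; the `metrics` helper and the dictionary layout are illustrative assumptions, not part of the original work.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, plus precision and recall for the WINNER class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from the confusion matrices in the slides:
# (TN, FP, FN, TP) with WINNER as the positive class.
models = {
    "1956~2006, with blank attributes": (21343, 21, 58, 34),
    "1956~2006, without blank attributes": (21350, 14, 62, 30),
    "1999~2006": (5090, 3, 13, 3),
}

results = {name: metrics(*counts) for name, counts in models.items()}
for name, m in results.items():
    print(f"{name}: accuracy={m['accuracy']:.4f} "
          f"precision={m['precision']:.3f} recall={m['recall']:.3f}")
```

Because NONWINNER rows dominate, accuracy is above 99% for every model, while recall for WINNER stays below 40%: most actual winners are missed, which is what matters for a gambling use.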
25
Step 8 : Assess Models (2/2)
Some attributes are more important than others.
Number of Appearances of Attributes in Different Models
Attributes: W, BB, WPCT, OBA, WHIP, K/9, ERA, GF
Counts per model (non-empty cells only, left to right):
  1956~2006, ADTree: 2 3 1 1
  1956~2006, without blank attributes, ADTree: 2 1 1 1 1 1
  1999~2006, ADTree: 2 1 1 1 1
  1956~2006, without blank attributes, J48: 3 2 1 1
26
Step 9 : Deploy Models
Implement a computer program based on the built model.
Use it to predict the Cy Young Award winner more easily.
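The slides do not describe the deployed program. One way such a program might work, given here only as an assumption, is to take a numeric score per pitcher from the trained model and name the top-scoring pitcher in each league, since each league has a single winner. The `score` field, the placeholder names, and `predict_winners` are hypothetical.

```python
def predict_winners(scored_pitchers):
    """Return {league: player}, choosing the highest model score per league."""
    best = {}
    for p in scored_pitchers:
        current = best.get(p["League"])
        if current is None or p["score"] > current["score"]:
            best[p["League"]] = p
    return {league: p["Player"] for league, p in best.items()}


# Hypothetical model scores for one season (higher = more WINNER-like).
season = [
    {"Player": "Pitcher A", "League": "AL", "score": 1.2},
    {"Player": "Pitcher C", "League": "AL", "score": -0.4},
    {"Player": "Pitcher B", "League": "NL", "score": 0.3},
    {"Player": "Pitcher D", "League": "NL", "score": 0.9},
]
predicted = predict_winners(season)
print(predicted)
```

Ranking within a league avoids the case where a fixed threshold labels no pitcher, or several pitchers, as WINNER in the same league and year.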
27
Step 10 : Assess Results
Compare the predicted winner directly with the actual Cy Young Award winner.
The motivation is “interest”, not “business”.
The assessment rests on personal judgment.
28
Conclusions
We used classification techniques to build a model that predicts the Cy Young Award winner.
We find that the predictive accuracy of the built model is not high: most actual winners are classified as NONWINNER.
There are some factors that we did not consider.
The model cannot be used where real stakes are involved.
Just for fun.