Seoul Fine Dust Data Analysis
Principles and Practice in Data Mining
2012314261 LEE DONG HEE
Seoul City Weather Data Analysis
2016. 12. 09.
Prof. Seo yuran
INDEX
01 PROBLEM
02 ANALYSIS PROCESS
03 CONCLUSION
01 | PROBLEM
1.1 Background
In recent years, high concentrations of local air pollution have occurred because of regional characteristics. It is therefore necessary to analyze the causes on a scientific basis.
1.2 Purpose
The relationship between fine dust and meteorological factors is identified and formulated through statistical techniques. The results also provide a basis for prediction in fine dust management in Seoul.
1.3 Data Source
ASOS (Automated Synoptic Observing System): 59 variables
• Temperature
• Wind Speed
• Sunshine
• …
PM10: 1 variable
• Fine dust
▪ Area : Seoul City
▪ Period : 2010 - 2015
▪ Rows : 2190 (365 X 6)
▪ Columns : 60 (59 + 1)
02 | ANALYSIS PROCESS
1. Exploring Data
2. Refining Data
3. Creating Model & Verification
2.1 Exploring Data
Basic Statistics
Statistic  Mean Temperature  Mean Wind Speed  Precipitation  Mean Relative Humidity  Radiation  Sunshine  PM
NA         0                 0                1320           0                       13         1         8
MIN        -14.5             1.1              0              20.1                    0.25       0         3.9
MEDIAN     14                2.5              1.5            60                      11.57      7.2       41.05
MEAN       12.68             2.69             10.03          60.33                   12.24      6.29      45.95
MAX        31.8              7.5              301.5          99.8                    29.74      13.5      658.2
[Figure: PM Trend Graph; x-axis: Date, y-axis: PM]
Missing Values (NA)
Number of NA values per variable, before ☞ after replacement:
Precipitation 1320 ☞ 0
Radiation 13 ☞ 0
Sunshine 1 ☞ 0
PM 8 ☞ 0
All missing values are changed to 0, because NA means that no value was recorded, which indicates that the meteorological instrument observed nothing.
Outliers
[Figure: Boxplot of PM]
The outlier's value is 658.2, and the next highest value is 292. The outlier is therefore changed to 300, because an outlier is generally replaced with the upper limit of the data.
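The two cleaning rules described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample lists are made up, the helper names `fill_missing` and `cap_upper` are hypothetical, and only the fill value 0, the cap of 300, and the values 658.2 and 292 come from the slides.

```python
# Sketch of the cleaning step: NA -> 0, then cap the PM outlier at 300.
# The sample data below is hypothetical; it is not the ASOS/PM10 dataset.

PM_UPPER_LIMIT = 300.0  # upper limit used on the slide for the PM outlier


def fill_missing(values, fill=0.0):
    """Replace missing observations (None) with a fixed fill value."""
    return [fill if v is None else v for v in values]


def cap_upper(values, limit):
    """Replace any value above `limit` with `limit` itself."""
    return [min(v, limit) for v in values]


precipitation = [None, 1.5, None, 10.0]  # hypothetical sample with NAs
pm = [41.0, None, 658.2, 292.0]          # hypothetical sample with the outlier

precipitation_clean = fill_missing(precipitation)
pm_clean = cap_upper(fill_missing(pm), PM_UPPER_LIMIT)

print(precipitation_clean)
print(pm_clean)
```

Filling before capping keeps the two rules independent: the cap only ever sees numeric values.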
[Figure: Correlation Coefficient]
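For readers who want to reproduce a correlation coefficient between PM and one meteorological factor, the sketch below computes Pearson's coefficient by hand. The slides do not say which coefficient was used, so Pearson is an assumption here, and the two sample series are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily values, not taken from the dataset.
wind_speed = [1.1, 2.5, 3.0, 4.2, 5.0]
pm = [80.0, 60.0, 55.0, 40.0, 30.0]

r = pearson(wind_speed, pm)
print(round(r, 3))
```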
2.2 Refining Data
Add a variable: Degree (factor type)
Fine Dust Level   PM (µg/m³)
Good              0 ~ 30
Normal            31 ~ 80
Bad               81 ~ 150
Very Bad          151 ~
Variables: Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine, PM, Degree
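The mapping from PM to the Degree factor can be written as a small function. This is a sketch, not the author's code: the function name `pm_to_degree` is hypothetical, and because the data contains non-integer PM values, the cut points are assumed to be inclusive upper bounds (30, 80, 150).

```python
def pm_to_degree(pm):
    """Map a PM value (µg/m³) to the four fine dust levels from the table.

    Assumption: the upper bound of each band is inclusive, so a value
    such as 30.5 falls into "Normal".
    """
    if pm <= 30:
        return "Good"
    if pm <= 80:
        return "Normal"
    if pm <= 150:
        return "Bad"
    return "Very Bad"


# Values taken from the basic statistics: MIN, MEAN, and the capped MAX of PM.
for value in (3.9, 45.95, 300.0):
    print(value, pm_to_degree(value))
```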
Create Standardized Data
Because the variables are measured on different scales, they are standardized for accurate modeling.
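A common way to standardize is the z-score: subtract the mean and divide by the standard deviation. The slides do not state which method was used, so the sketch below is an assumption, including the choice of the sample standard deviation (n - 1 denominator). The input values are made up.

```python
import math


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator); this choice is an assumption.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


temperature = [2.0, 4.0, 6.0]  # hypothetical values
z = standardize(temperature)
print(z)
```

After this step every variable has mean 0 and unit standard deviation, so no single variable dominates a model simply because of its units.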
The dataset is split randomly into a Train Set and a Test Set.
Train Set: 80%
Test Set: 20%
• The Train Set is used to fit the model.
• The Test Set is used to evaluate the performance of the fitted model.
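The random 80/20 split can be sketched as follows. The function name `train_test_split`, the fixed seed, and the use of row indices in place of real records are all assumptions made for illustration; only the 80/20 ratio and the row count of 2190 come from the slides.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(2190))  # stand-in for the 2190 daily records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters for a time-ordered dataset like this one; cutting without shuffling would put only the most recent period in the Test Set.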
2.3 Creating Model
Variable Selection
• These variables were selected from the 60 variables in the raw data, based on a relevant paper about weather. General variable selection methods were therefore not used in this process.
Selected variables: Mean Temperature, Mean Wind Speed, Precipitation, Mean Relative Humidity, Radiation, Sunshine, PM, Degree
Analysis Methods
• PCA
• Multinomial Logistic Regression
• Neural Network
PCA
PCA: Parallel Analysis
Parallel analysis suggests that the number of factors = 3.
PCA: 3 Principal Components
• PC1 = 0.08425922*Mean Temperature + (-0.01601653)*Mean Wind Speed + 0.36003954*Precipitation + 0.52249043*Mean Relative Humidity + (-0.51271657)*Radiation + (-0.57196228)*Sunshine
• PC2 = (-0.7977584)*Mean Temperature + 0.2431814*Mean Wind Speed + (-0.1878170)*Precipitation + (-0.2777519)*Mean Relative Humidity + (-0.4220239)*Radiation + (-0.1179781)*Sunshine
• PC3 = 0.080656658*Mean Temperature + 0.899196707*Mean Wind Speed + 0.387333525*Precipitation + 0.009005798*Mean Relative Humidity + 0.160962973*Radiation + 0.094458149*Sunshine
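Each principal component is a weighted sum of the six meteorological variables, so the scores for one day can be computed directly from the coefficients listed here. The sketch below assumes the inputs are the standardized variables (the slides do not say this explicitly for PCA), and the function name `pc_scores` and the sample row are hypothetical.

```python
# Coefficients copied from the three component formulas.
# Variable order: Mean Temperature, Mean Wind Speed, Precipitation,
# Mean Relative Humidity, Radiation, Sunshine.
LOADINGS = {
    "PC1": [0.08425922, -0.01601653, 0.36003954, 0.52249043, -0.51271657, -0.57196228],
    "PC2": [-0.7977584, 0.2431814, -0.1878170, -0.2777519, -0.4220239, -0.1179781],
    "PC3": [0.080656658, 0.899196707, 0.387333525, 0.009005798, 0.160962973, 0.094458149],
}


def pc_scores(row):
    """Return the three component scores for one row of six variables."""
    return {
        name: sum(w * x for w, x in zip(weights, row))
        for name, weights in LOADINGS.items()
    }


# Hypothetical standardized observation for a single day.
day = [0.5, -1.0, 0.0, 1.2, -0.3, -0.8]
print(pc_scores(day))
```

The signs are consistent with reading PC1 as a humid, cloudy-day axis (high humidity and precipitation, low radiation and sunshine), PC2 as mainly a temperature axis, and PC3 as mainly a wind speed axis.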
a. Multinomial Logistic Regression
b. Neural Network (Basic Variables)
c. Neural Network (Principal Component)
2.3 Verification
Confusion Matrix
Model                                  Accuracy
Multinomial Logistic Regression        62.01%
Neural Network (Basic Variable)        62.92%
Neural Network (Principal Component)   68.27%
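Accuracy is read off a confusion matrix as the share of predictions on the diagonal. The sketch below shows that calculation; the 4x4 matrix is invented for illustration (rows are actual Degree levels, columns are predicted levels) and does not reproduce any of the reported results.

```python
def accuracy(confusion):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical confusion matrix for Good, Normal, Bad, Very Bad.
confusion = [
    [50, 20, 0, 0],
    [15, 200, 10, 0],
    [0, 30, 40, 2],
    [0, 1, 3, 5],
]

print(f"{accuracy(confusion):.2%}")
```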
03 | CONCLUSION
Correlation coefficient analysis showed that the correlation coefficient between PM10 concentration and meteorological factors ranged from -0.200 to 0.058.
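In the model results, the neural network using principal components achieved higher prediction accuracy than the neural network using the basic variables (68.27% versus 62.92%), and both neural networks achieved higher prediction accuracy than multinomial logistic regression (62.01%).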