2013 Fall Conference (print version)
- 2 -
•( , )
SAS
1: Text Mining Applications
2: Feature Selection & Efficient Computing
• Density-Based Graph Partitioning Algorithm ... 7
• Scalable Variational Bayesian Matrix Factorization (POSTECH) ... 31
• A Hybrid Genetic Algorithm for Accelerating Feature Selection and Parameter Optimization of SVM ... 43
• Documents Recommendation Using Large Citation Data ... 53
• ... 75
• ... 91
• Document Indexing by Ensemble Model, Yanshan Wang ... 107
• ... 117
• Fused Lasso ... 131
• Classification with discrete and continuous variables via Markov Blanket Feature Selection (POSTECH) ... 147
• R ... 167
• Revisiting the Bradley-Terry Model and Its Application for Information Retrieval ... 177
3: Visualization & Text Analytics
4: SNS and Bibliography Analytics
5: Recommendation Systems
6: Data Mining Applications
• ... 207
• ... 221
• ... 237
• ... 257
• Modified LDA with Bibliography Information ... 265
• ... 273
• ... 319
• MovieRank: Combining Structural and Feature Information Ranking Measure ... 329
• A New Approach to Recommend Novel Items ... 343
• ... 361
• ... 365
• ... 377
SAS
- 5 -
- 6 -
•
•
•
•
- 7 -
[1] Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining.
•
•
•
•
•
[2] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
- 8 -
[3] Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27-64.
•
[4] Boutin, F., & Hascoet, M. (2004, July). Cluster validity indices for graph partitioning. In Information Visualisation, 2004. IV 2004. Proceedings. Eighth International Conference on (pp. 376-381). IEEE.
[5] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
[6] Patkar, S. B., & Narayanan, H. (2003, January). An efficient practical heuristic for good ratio-cut partitioning. In VLSI Design, 2003. Proceedings. 16th International Conference on (pp. 64-69). IEEE.
$\min \sum_{i=1}^{k} \sum_{j \in C_i,\ l \notin C_i} w_{jl}$
where k is the number of clusters
•
•
•
•
•
- 9 -
•
•
•
•
•
[7] Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34.
[8] Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366.
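• Laplacian matrix: $L = D - A$
• Adjacency matrix: $A = [a_{ij}],\ i, j = 1, 2, \ldots, n$
• Degree matrix: $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, with $d_i = \sum_{j=1}^{n} a_{ij}$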
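The definition L = D - A can be illustrated with a small sketch. The function name `graph_laplacian` and the 4-node path graph are assumptions made for illustration only; they do not come from the slides.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Return (L, D) with L = D - A for a symmetric adjacency matrix A."""
    a = np.asarray(adjacency, dtype=float)
    degrees = a.sum(axis=1)      # d_i = sum_j a_ij
    d = np.diag(degrees)         # D = diag(d_1, ..., d_n)
    return d - a, d

# Hypothetical 4-node path graph 0-1-2-3 with unit edge weights.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
L, D = graph_laplacian(A)
```

Every row of L sums to zero, which is the property spectral clustering methods such as [9] and [10] build on.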
•
•
•
•
[9] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4), 395-416.
[10] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2, 849-856.
- 10 -
•
$Q = \dfrac{1}{2m} \sum_{i=1}^{k} \sum_{j \in C_i,\ l \in C_i} \left( A_{jl} - \dfrac{d_j d_l}{2m} \right)$
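As a hedged illustration of the modularity measure of [11], the sketch below evaluates Q = (1/2m) Σ_i Σ_{j,l ∈ C_i} (A_jl − d_j d_l / 2m) for a given partition. The function name `modularity` and the two-triangle graph are assumptions chosen for the example.

```python
import numpy as np

def modularity(adjacency, labels):
    """Newman-Girvan modularity of a partition given as one label per node."""
    a = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = a.sum(axis=1)
    two_m = degrees.sum()                            # 2m = total degree
    same = labels[:, None] == labels[None, :]        # True where j and l share a cluster
    expected = np.outer(degrees, degrees) / two_m    # d_j * d_l / 2m
    return float(((a - expected) * same).sum() / two_m)

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for j, l in edges:
    A[j, l] = A[l, j] = 1.0

q_two_clusters = modularity(A, [0, 0, 0, 1, 1, 1])
q_one_cluster = modularity(A, [0, 0, 0, 0, 0, 0])
```

Putting every node in one cluster gives Q = 0, while the natural two-triangle split gives a positive value.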
[11] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
[12] Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical review E, 70(6), 066111.
[13] Kehagias, A. (2012). Bad Communities with High Modularity. arXiv preprint arXiv:1209.2678.
•
•
•
•
•
•
•
•
[14] Daszykowski, M., Walczak, B., & Massart, D. L. (2001). Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems, 56(2), 83-92.
- 11 -
•
•
- 12 -
•
•
$w_{ij} = \begin{cases} \exp\left( -\dfrac{d(x_i, x_j)^2}{d_i^k\, d_j^k} \right) & \text{if } x_j \in x_i^k \text{ and } x_i \in x_j^k \\ 0 & \text{otherwise} \end{cases}$
•
• i j i j
•
[15] Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in neural information processing systems (pp. 1601-1608).
[16] Ertoz, L., Steinbach, M., & Kumar, V. (2002, April). A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining (pp. 105-115).
where $x_i^k$ is the k-nearest set of point i and $d_i^k$ is the distance between point i and the k-th neighbor of point i
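A minimal sketch of this edge weight, assuming Euclidean distance for d(x_i, x_j) and distinct points (so that each point is its own nearest point at distance zero). The function name `mutual_knn_weights` and the four sample points are hypothetical.

```python
import numpy as np

def mutual_knn_weights(points, k):
    """w_ij = exp(-d(x_i, x_j)^2 / (d_i^k * d_j^k)) for mutual k-nearest neighbors, else 0."""
    x = np.asarray(points, dtype=float)
    n = len(x)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    # Column 0 of the sorted order is the point itself (assumes no duplicate points).
    order = np.argsort(dist, axis=1)[:, 1:k + 1]
    local_scale = dist[np.arange(n), order[:, -1]]   # d_i^k: distance to the k-th neighbor
    is_neighbor = np.zeros((n, n), dtype=bool)
    is_neighbor[np.arange(n)[:, None], order] = True
    mutual = is_neighbor & is_neighbor.T             # x_j in x_i^k and x_i in x_j^k
    w = np.exp(-dist ** 2 / np.outer(local_scale, local_scale))
    return np.where(mutual, w, 0.0)

# Hypothetical 1-D points; with k = 1 only points 0 and 1 are mutual neighbors.
W = mutual_knn_weights([[0.0], [1.0], [2.5], [10.0]], k=1)
```

Dividing by the local scales d_i^k and d_j^k follows the self-tuning idea of [15]; the mutual-neighbor condition keeps the graph sparse.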
•
- 13 -
•
•
•
$d_i = \sum_{j \in x_i^k} w_{ij} + \sum_{j,\, k \in x_i^k} w_{jk}$
• ii
•
where $x_i^k$ is the k-nearest set of point i
•
• αα
•
•
- 14 -
•
•
- 15 -
•
•
•
•
- 16 -
•
•
- 17 -
•
•
- 18 -
•
•
- 19 -
•
•
- 20 -
$\sum_{j \in C_1} w_{ij} = 0.9$
$\sum_{j \in C_2} w_{ij} = 0$
•
•
$\sum_{j \in C_1} w_{ij} = 0.9$
$\sum_{j \in C_2} w_{ij} = 0$
- 21 -
•
$\sum_{j \in C_1} w_{ij} = 0.2$
$\sum_{j \in C_2} w_{ij} = 0.3$
•
$\sum_{j \in C_1} w_{ij} = 0.2$
$\sum_{j \in C_2} w_{ij} = 0.3$
- 22 -
•
•
•
- 23 -
•
•
$\sum_{j \in C_1} w_{ij} = 2.1$
$\sum_{j \in C_2} w_{ij} = 0.5$
- 24 -
•
$\sum_{j \in C_1} w_{ij} = 0.7$
$\sum_{j \in C_2} w_{ij} = 1.4$
•
- 25 -
•
•
•
•
- 26 -
•
•
•
•
[Figure: four scatter-plot panels; three of them carry the legend Cluster 1, Cluster 2, Cluster 3]
- 27 -
[Figure: eight scatter-plot panels in two groups of four; panel legends range from Cluster 1 to Cluster 2 up to Cluster 1 to Cluster 6]
- 28 -
[Figure: four scatter-plot panels; panel legends range from Cluster 1 to Cluster 3 up to Cluster 1 to Cluster 7]
•
•
•
•
•
- 29 -
•
•
•
•
•
Thank you for Listening
Any Questions?
- 30 -
Scalable Variational Bayesian Matrix Factorization
KDMS 2013
Outline
1. Matrix Factorization for Collaborative Prediction
2. Regularized Matrix Factorization vs. Bayesian Matrix Factorization
3. Scalable Variational Bayesian Matrix Factorization
4. Related Works
5. Numerical Experiments
6. Conclusion
- 31 -
Matrix Factorization for Collaborative Prediction
• Collaborative prediction: filling in missing entries of the user-item rating matrix
• Matrix factorization: predicting an unknown rating by the product of a user factor vector and an item factor vector

Example (rating matrix = User Factor Matrix × Item Factor Matrix; "?" marks a missing rating):

          Item 1  Item 2  Item 3  Item 4
User 1      6       9       3       ?
User 2      4       ?       2       0
User 3      0       0       2       3
User 4      0       ?       4       ?

User Factor Matrix (4 × 2):
  3 0
  2 0
  0 1
  0 2

Item Factor Matrix (2 × 4):
  2 3 1 0
  0 0 2 3
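The slide's toy example can be checked numerically. In the sketch below the missing ratings are encoded as NaN, which is an assumption of this illustration; the factor matrices are the ones shown on the slide.

```python
import numpy as np

# User factor matrix (4 x 2) and item factor matrix (2 x 4) from the slide.
U = np.array([[3, 0], [2, 0], [0, 1], [0, 2]])
V = np.array([[2, 3, 1, 0], [0, 0, 2, 3]])

# Observed ratings; NaN stands for the "?" entries.
X = np.array([[6, 9, 3, np.nan],
              [4, np.nan, 2, 0],
              [0, 0, 2, 3],
              [0, np.nan, 4, np.nan]])

X_hat = U @ V                      # every rating predicted as u_i . v_j
observed = ~np.isnan(X)

# Predictions for the missing entries, keyed by 1-based (user, item).
predictions = {(int(i) + 1, int(j) + 1): int(X_hat[i, j])
               for i, j in zip(*np.where(~observed))}
```

The product reproduces every observed rating exactly, and the four unknown entries are filled with the corresponding entries of the product.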
Regularized Matrix Factorization
• Minimize the regularized squared error loss

Alternating Least Squares (ALS)
  Time complexity: $O(2|\Omega|K^2 + (I+J)K^3)$
  Parallelization: Easy
  Tuning parameter: λ (regularization)
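A minimal ALS sketch follows. The loss formula did not survive in the transcript, so the code assumes the common form Σ_{(i,j)∈Ω} (x_ij − u_i·v_j)² + λ(‖U‖² + ‖V‖²). The function names, the storage of V as an items × rank array, and the toy data are illustrative assumptions, not the speakers' implementation.

```python
import numpy as np

def objective(x, mask, u, v, lam):
    """Assumed loss: squared error on observed entries plus L2 penalty on U and V."""
    resid = (x - u @ v.T)[mask]
    return float(resid @ resid + lam * ((u ** 2).sum() + (v ** 2).sum()))

def als(x, mask, rank, lam, n_iters=20, seed=0):
    """Alternate exact ridge-regression solves for each user row and each item row."""
    rng = np.random.default_rng(seed)
    n_users, n_items = x.shape
    u = rng.normal(scale=0.1, size=(n_users, rank))
    v = rng.normal(scale=0.1, size=(n_items, rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n_users):          # K x K solve per user: the (I+J)K^3 term
            vi = v[mask[i]]
            u[i] = np.linalg.solve(vi.T @ vi + reg, vi.T @ x[i, mask[i]])
        for j in range(n_items):
            uj = u[mask[:, j]]
            v[j] = np.linalg.solve(uj.T @ uj + reg, uj.T @ x[mask[:, j], j])
    return u, v

# Toy data: the 4 x 4 example matrix, with zeros as placeholders where the rating is missing.
X_toy = np.array([[6, 9, 3, 0], [4, 0, 2, 0], [0, 0, 2, 3], [0, 0, 4, 0]], dtype=float)
M_toy = np.array([[1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 0]], dtype=bool)

u1, v1 = als(X_toy, M_toy, rank=2, lam=0.1, n_iters=1)
u20, v20 = als(X_toy, M_toy, rank=2, lam=0.1, n_iters=20)
loss_1 = objective(X_toy, M_toy, u1, v1, 0.1)
loss_20 = objective(X_toy, M_toy, u20, v20, 0.1)
```

Each row update minimizes the assumed loss exactly with the other factor held fixed, so the loss cannot increase from one sweep to the next. The user solves are independent of each other, which is why the slide rates parallelization as easy.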
- 32 -
Regularized Matrix Factorization
• Minimize the regularized squared error loss

Stochastic Gradient Descent (SGD)
  Time complexity: $O(2|\Omega|K)$
  Parallelization: Possible, but not easy
  Tuning parameters: λ (regularization), learning rate
Problem of parameter tuning
• Too small λ: overfitting
• Too large λ: underfitting
- 33 -
Problem of parameter tuning
7
Regularization parameter chosen by cross-validation on various datasets and rank K (Kim & Choi, IEEE SPL 2013)
• The optimal value of the regularization parameter depends on the dataset and the rank K.

Problem of parameter tuning
• SGD requires tuning of the regularization parameter, the learning rate, and even the number of epochs.
8
         0.005       0.007       0.010       0.015       0.020
0.005    0.9061/13   0.9079/15   0.9117/19   0.9168/28   0.9168/44
0.007    0.9056/10   0.9074/11   0.9112/13   0.9168/19   0.9169/31
0.010    0.9064/7    0.9077/8    0.9113/10   0.9174/13   0.9186/21
0.015    0.9099/5    0.9011/6    0.9152/6    0.9257/7    0.9390/7
0.020    0.9166/4    0.9175/4    0.9217/4    0.9314/4    0.9431/3

Netflix probe10 RMSE / optimal number of epochs of the BRISMF for various learning rate and regularization parameter values (K = 40). (Takács et al., JMLR 2009)
- 34 -
Bayesian Matrix Factorization

Prior: P(U), P(V)
Likelihood: P(X | U, V)
Posterior: P(U, V | X)

MCMC on Netflix

• Approximate the posterior by MCMC (Salakhutdinov & Mnih, ICML 2008)
• Variational method (Lim & Teh, KDDcup 2007)
• No parameter tuning
• No overfitting
• High accuracy
• Huge computational cost: $O(2|\Omega|K^2 + (I+J)K^3)$
Scalable Variational Bayesian Matrix Factorization
• No parameter tuning
• Linear space complexity: $O(2(I+J)K)$
• Linear time complexity: $O(6|\Omega|K)$
• Easily parallelized on multi-core systems
• Optimizes an element-wise factorized variational distribution with the coordinate descent method.

- 35 -
Variational Bayesian Matrix Factorization
• Likelihood is given by
• Gaussian priors on factor matrices U and V:
• Approximate the posterior by a variational distribution, by maximizing the variational lower bound or, equivalently, minimizing the KL-divergence

VBMF-BCD (Lim & Teh, KDDcup 2007)
• Matrix-wise factorized variational distribution

VBMF-BCD
  Space complexity: $O((I+J)(K+K^2))$
  Time complexity: $O(2|\Omega|K^2 + (I+J)K^3)$
  Parallelization: Easy
- 36 -
Scalable VBMF: linear space complexity

Element-wise factorized variational distribution

Space requirement for K = 100:

                                           O((I+J)(K+K^2))   O(2(I+J)K)
Netflix (I = 480,189, J = 17,770)          4.4 GB            0.8 GB
Yahoo-music (I = 1,000,990, J = 624,961)   131 GB            2.6 GB
Scalable VBMF: quadratic time complexity
14
Updating rules for q(uki)
Updating all variational parameters
- 37 -
Scalable VBMF: linear time complexity
15
Let Rij denote the residual on ( i, j ) observation:
With Rij , updating rule can be rewritten as
When is changed to , can be easily updated to
Scalable VBMF: linear time complexity
16 - 38 -
Scalable VBMF: parallelization
17
K
I
• Each column of variational parameters can be updated independently from the updates of other columns.
• Parallelization can be easily done in a column-by-column manner.
• Easy implementation with the OpenMP library on multi-core system.
Related work (Pilászy et al., RecSys 2010)
• A similar idea is used to reduce the cubic time complexity of ALS to a linear one.

With small extra effort, a more accurate model is obtainable without tuning the regularization parameter
RMF
Scalable VBMF
- 39 -
Related Work (Raiko et al., ECML 2007)
• Consider element-wisely factorized variational distribution
• Update U and V by scaled gradient descent method
• Require tuning of learning rate • Learning speed is slower than our algorithm
19
Numerical Experiments
• Compare VBMF-CD, VBMF-BCD (Lim & Teh, KDDcup 2007), VBMF-GD (Raiko et al., ECML 2007)
• Experimental environment
  – Quad-core Intel® Core™ i7-3820 @ 3.6GHz
  – 64 GB memory
  – Implemented in Matlab 2011a, where main computational modules are implemented in C++ as mex files
  – Parallelized with the OpenMP library
• Datasets

               MovieLens10M   Netflix       Yahoo-music
# of users     69,878         480,189       1,000,990
# of items     10,677         17,770        624,961
# of ratings   10,000,054     100,480,507   262,810,275
- 40 -
Numerical Experiments: K = 20

RMSE versus computation time on a quad-core system for each dataset: (a) MovieLens10M, (b) Netflix, (c) Yahoo-music

            MovieLens10M   Netflix   Yahoo-music
VBMF-CD     0.8589         0.9065    22.3425
VBMF-BCD    0.8671         0.9070    22.3671
VBMF-GD     0.8591         0.9167    22.5883
Numerical Experiments: Netflix, K = 50

            Time per iter.
VBMF-BCD    66 min.
VBMF-CD     77 sec.
VBMF-GD     29 sec.

          VBMF-BCD          VBMF-CD
RMSE      Iter.   Time      Iter.   Time
0.9005    19      21 h      63      74 m
0.9004    21      23 h      70      82 m
0.9003    22      24 h      84      98 m
0.9002    25      28 h      108     2 h
0.9001    27      31 h      680     13 h
0.9000    30      33 h
- 41 -
Conclusion
• We presented a scalable learning algorithm for VBMF, VBMF-CD.
• VBMF-CD optimizes element-wise factorized variational distributions with the coordinate descent method.
• The space and time complexity of VBMF-CD are linear.
• VBMF-CD can be easily parallelized.
• Experimental results confirmed the useful behavior of VBMF-CD, such as scalability, fast learning, and prediction accuracy.
- 42 -
A hybrid genetic algorithm for accelerating feature selection and parameter optimization of support vector machine
2013. 11. 29.
Introduction
• Support Vector Machine (SVM)
  – One of the most popular state-of-the-art classification algorithms.
  – Efficiently finds non-linear solutions by exploiting kernel functions.
  – Training time complexity is $O(N^3)$.
• "Very important" issues in training SVM
  – Feature selection
    • SVM is a distance-based algorithm (kernel matrix computation) and does not include any feature selection mechanism.
    • Irrelevant features degrade the model performance.
  – Parameter optimization
    • Model trade-off parameter C, kernel parameter σ (for the RBF kernel).
    • SVM is very sensitive to the parameter settings.
  – For SVM, feature selection and parameter optimization should be performed simultaneously.
- 43 -
Introduction
• Genetic algorithm (GA)
  – A stochastic algorithm that mimics natural evolution.
  – Easy, but very effective!
• GA-based feature selection and parameter selection of SVM [1-4]
  – GA effectively finds near-optimal feature subsets and parameters.
  – But slow (although much better than a grid-search mechanism).
• GA cycle (diagram): Population → Selection → Parents → Genetic operation (Crossover, Mutation) → Offspring → Replacement → Population
Introduction
If the SVM has to be re-trained periodically, fast feature selection and parameter optimization are required.
This study aims to avoid producing bad offspring in the "Genetic Operation" step of GA.
This study proposes a chromosome filtering method, based on a Decision Tree (DT), for faster convergence of GA in feature selection and parameter optimization of SVM.
- 44 -
The proposed method
• Flowchart (diagram nodes): Initialization; Population; Evaluate fitness; Termination condition? (yes / no); Do genetic operations; Chromosome Filtering (yes / no); Population Replacement; Optimized parameters and feature subset
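The proposed method
• Chromosome design (Genotype → Phenotype)
  – Parameters: binary representation
      Genotype: 1 0 0 1 0 1
      Bit weights for C: $10^{-2},\ 10^{-1},\ 1,\ 10^{1},\ 10^{2},\ 10^{3}$
      Phenotype: $C = 1 \times 10^{-2} + 1 \times 10^{1}$
      σ: $2^{-5}, \ldots, 2^{5}$
  – Feature subset: binary representation
      Genotype: 1 0 0 1 0 … 1 0 (one bit per feature $f_1, f_2, f_3, f_4, f_5, \ldots, f_{p-1}, f_p$)
      Phenotype: $\{f_1, f_4, \ldots, f_{p-1}\}$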
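A rough sketch of the genotype-to-phenotype decoding, assuming that C is the sum of the powers of ten whose bits are set and that feature bit i selects feature f_i. The function name `decode_chromosome` and the sample bit strings are hypothetical, and the encoding of σ is left out because the slide does not spell it out.

```python
def decode_chromosome(c_bits, feature_bits, c_exponents=(-2, -1, 0, 1, 2, 3)):
    """Decode a binary chromosome into (C, list of selected 1-based feature indices)."""
    # Assumption: each set bit contributes 10**exponent to C.
    c_value = sum(bit * 10.0 ** e for bit, e in zip(c_bits, c_exponents))
    # Assumption: bit i (1-based) equal to 1 means feature f_i is selected.
    features = [i + 1 for i, bit in enumerate(feature_bits) if bit]
    return c_value, features

# Hypothetical chromosome: bits set for 10^-2 and 10^1, features f1 and f4 selected.
c, selected = decode_chromosome([1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0])
```

With this reading, the bit string above decodes to C = 10^-2 + 10^1 and the feature subset {f1, f4}.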
- 45 -
The proposed method
• Fitness evaluation
  – Decode the chromosome and obtain C, σ, and a feature subset.
    • Genotype → Phenotype
  – Train SVM on the dataset with the selected C, σ, and feature subset.
  – Fitness value: cross-validation accuracy
The proposed method
• Genetic operation
  – Parent selection
    • Roulette-wheel scheme: fitness proportional selection (FPS)
    • Probability that the i-th chromosome $c_i$ in the population is selected $= f(i) / \sum_{j} f(j)$
      – where f(i) is the fitness of $c_i$
  – Crossover: N-point crossover
    • Choose N random crossover points and split along those points.
  – Mutation: bit-flipping mutation
    • Bitwise bit-flipping with fixed probability.
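The three operators can be sketched as follows. This is an illustrative reading of the slide, assuming chromosomes are lists of 0/1 bits and that the crossover child alternates between the two parents at each cut point; all function names are hypothetical.

```python
import random

def roulette_select(population, fitness, rng):
    """Fitness proportional selection: chromosome i is picked with probability f(i) / sum_j f(j)."""
    pick = rng.uniform(0, sum(fitness))
    running = 0.0
    for chromosome, f in zip(population, fitness):
        running += f
        if pick < running:
            return chromosome
    return population[-1]   # guards against floating-point round-off at the upper end

def n_point_crossover(a, b, n_points, rng):
    """Cut both parents at n_points random positions and alternate the segments."""
    cuts = sorted(rng.sample(range(1, len(a)), n_points))
    child, source, start = [], 0, 0
    for cut in cuts + [len(a)]:
        child.extend((a, b)[source][start:cut])
        source, start = 1 - source, cut
    return child

def bit_flip_mutation(chromosome, p_mutation, rng):
    """Flip each bit independently with probability p_mutation."""
    return [1 - bit if rng.random() < p_mutation else bit for bit in chromosome]

rng = random.Random(0)
child = n_point_crossover([0] * 8, [1] * 8, n_points=2, rng=rng)
mutated = bit_flip_mutation(child, p_mutation=0.05, rng=rng)
```

Passing an explicit `random.Random` instance keeps runs reproducible, in the spirit of the fixed random seed set mentioned in the experimental design.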
8
- 46 -
The proposed method
• Chromosome Filtering
  – In each generation, chromosomes and their fitness values are stored in the knowledge base. A DT is trained periodically on the knowledge base. Using the DT, offspring chromosomes that are likely to have bad fitness are removed before the fitness evaluation step.
  – Assumption
    • Some features and parameter settings improve (or degrade) the model performance.
    • DT can find these rules.
The proposed method
• Chromosome Filtering (continued)
  – Why DT?
    • Deals effectively with categorical features.
    • Finds non-linear relationships.
    • Uses a few relevant features in the classification procedure.
  – DT training
    • Each $c_i$ (i-th chromosome) in the knowledge base, sorted by fitness, is labeled as follows:
      – the M highest fitness values ($c_1, \ldots, c_M$) → GOOD (likely to yield a good fitness value)
      – the next M highest fitness values ($c_{M+1}, \ldots, c_{2M}$) → NORMAL
      – the remaining ($c_{2M+1}, \ldots$) → BAD (likely to yield a bad fitness value)
    • Input feature: chromosome (in phenotype)
    • Output feature: label {GOOD, NORMAL, BAD}
- 47 -
The proposed method
• Chromosome Filtering (continued)
  – Filtering
    • A DT gives rules that assess a chromosome before fitness evaluation: is the chromosome GOOD, NORMAL, or BAD?
    • Each label has a different survival probability, e.g. GOOD: 1.0, NORMAL: 0.5, BAD: 0.2.
    • The DT is updated periodically, so the criteria for a good chromosome change over the generations.
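The filtering step can be sketched as a probabilistic gate in front of the fitness evaluation. In the sketch below `classify` stands in for the trained decision tree; it is a hypothetical callable returning "GOOD", "NORMAL", or "BAD", and the stub rule used in the example is invented for illustration.

```python
import random

# Survival probabilities quoted on the slide as an example.
SURVIVAL = {"GOOD": 1.0, "NORMAL": 0.5, "BAD": 0.2}

def filter_offspring(offspring, classify, rng, survival=SURVIVAL):
    """Keep each offspring with the survival probability of its predicted label."""
    kept = []
    for chromosome in offspring:
        label = classify(chromosome)           # stand-in for the DT prediction
        if rng.random() < survival[label]:     # survives with probability survival[label]
            kept.append(chromosome)
    return kept

def stub_classifier(chromosome):
    """Hypothetical rule replacing the trained DT: first bit set means GOOD, otherwise BAD."""
    return "GOOD" if chromosome[0] == 1 else "BAD"

candidates = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]]
survivors = filter_offspring(candidates, stub_classifier, random.Random(0))
```

Only the survivors go on to the SVM training and cross-validation step, which is where the time saving is meant to come from.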
11
The proposed method
• Chromosome Filtering (continued)
  – DT example (diagram): decision nodes C > 100, Contain F1?, σ > 0.25, Contain F3?, σ > 1; leaf labels GOOD, NORMAL, BAD
- 48 -
The proposed method
• Population Replacement: Steady state model
  → used to verify the effectiveness of the proposed method in the initial period of GA.
  – Only one chromosome in the population is updated per generation.
  – Replacement scheme [5, 6]: the offspring replaces one of its parents or the lowest-fitness chromosome in the population.
    • If the offspring is superior to both parents, it replaces the more similar parent.
    • If it is in between the two parents, it replaces the inferior parent.
    • Otherwise, the most inferior chromosome in the population is replaced.
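The three replacement rules can be written as a small decision function. The sketch assumes that "similar" means smallest Hamming distance between bit strings, which the slide does not state; the function names and the sample population are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def replacement_index(population, fitness, parent_idx, child, child_fitness):
    """Return the index of the chromosome that the offspring should replace."""
    p, q = parent_idx
    better, worse = (p, q) if fitness[p] >= fitness[q] else (q, p)
    if child_fitness > fitness[better]:
        # Superior to both parents: replace the more similar parent.
        return min((p, q), key=lambda i: hamming(population[i], child))
    if child_fitness > fitness[worse]:
        # In between the two parents: replace the inferior parent.
        return worse
    # Otherwise: replace the most inferior chromosome in the population.
    return min(range(len(population)), key=fitness.__getitem__)

population = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
fitness = [0.9, 0.5, 0.1]
idx_superior = replacement_index(population, fitness, (0, 1), [0, 0, 1], 0.95)
idx_between = replacement_index(population, fitness, (0, 1), [0, 0, 1], 0.70)
idx_otherwise = replacement_index(population, fitness, (0, 1), [0, 0, 1], 0.30)
```

Because exactly one index is returned per call, the scheme matches the steady state model in which a single chromosome changes per generation.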
13
Experiments
• Experimental Design
  – 10 datasets from the UCI repository; all datasets were normalized to [-1, 1].
  – 5 independent runs; a random seed set was used for fairness.
  – In SVM training, 10-fold cross-validation was used.
  – Parameter settings
    • GA parameters
      – population size $N_{pop}$ = 30
      – crossover probability $p_c$ = 0.9
      – mutation probability $p_m$ = 0.05
      – max iteration = 300
      – $p_{good}$ = 1; $p_{normal}$ = 0.5; $p_{bad}$ = 0.2
    • DT parameters
      – CART
      – Labeling: good = 10, normal = 10, bad = remaining
      – Training starting point: 30th generation / period = 10
- 49 -
Experiments
• Results
  – Maximum fitness in the population

                                          50th generation    100th generation   200th generation
Datasets       #data  #feature  #class    GA      GA+DT      GA      GA+DT      GA      GA+DT
Iris             150      4        3      97.067  97.067     97.333  97.200     97.333  97.333
Wine             178     13        3      98.876  98.539     99.213  99.101     99.551  99.438
Sonar            208     60        2      87.596  87.596     88.462  90.769     91.731  92.788
Glass            214      9        6      71.682  71.963     72.336  73.364     73.645  74.112
Ionosphere       351     34        2      93.333  94.017     94.131  94.758     95.100  95.499
BreastCancer     683      9        2      97.160  97.160     97.247  97.277     97.306  97.365
Vehicle          846     18        4      81.773  81.939     82.648  83.948     85.201  85.721
Vowel            990     10       11      96.990  97.253     98.000  99.030     99.051  99.434
Yeast           1484      8       10      57.318  57.547     58.814  59.111     59.690  60.000
Segment         2310     19        7      96.736  97.004     97.212  97.411     97.740  97.818
Experiments
• Results
  – Maximum fitness in the population
  [Figure: CV accuracy (%) versus # generation (50 to 300) for GA+DT and GA, one panel per dataset: iris, wine, sonar, glass, ionosphere, breastcancer, vehicle, vowel, yeast, segment]
- 50 -
Concluding Remarks We presented a chromosome filtering method for GA-based feature selection and parameter optimization of SVM.
The proposed method employed a DT as a chromosome filter to remove the offspring chromosomes that are likely to have bad fitness before the fitness evaluation step of GA.
On most datasets, the proposed method showed faster improvement of fitness than standard GA.
17
Acknowledgements This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2011-0030814), and the Brain Korea 21 Program for Leading Universities & Students. This work was also supported by the Engineering Research Institute of SNU.
18 - 51 -
References
1. Frohlich, H., Chapelle, O., & Scholkopf, B. (2003, November). Feature selection for support vector machines by means of genetic algorithm. In Tools with Artificial Intelligence, 2003. Proceedings. 15th IEEE International Conference on (pp. 142-148). IEEE.
2. Huang, C. L., & Wang, C. J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.
3. Min, S. H., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.
4. Zhao, M., Fu, C., Ji, L., Tang, K., & Zhou, M. (2011). Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes. Expert Systems with Applications, 38(5), 5197-5204.
5. Bui, T. N., & Moon, B. R. (1996). Genetic algorithm and graph partitioning. Computers, IEEE Transactions on, 45(7), 841-855.
6. Oh, I. S., Lee, J. S., & Moon, B. R. (2004). Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11), 1424-1437.
19
20
End of Document
- 52 -
1: Text Mining Applications
Document Indexing by Ensemble Model
Yanshan Wang and In-Chan Choi
Korea University
System Optimization Lab
November 25, 2013
Yanshan Wang and In-Chan Choi (KU) Indexing by EnM November 25, 2013 1 / 18
Overview

1 The Basics
    Information Retrieval and Document Indexing
    Topic Modelling
    Indexing by Latent Dirichlet Allocation
2 Indexing by Ensemble Model
    Introduction to Ensemble Model
    Algorithms
    Experimental Results
3 Conclusions and Discussion

- 107 -
The problem in Information Retrieval
Source: www.betaversion.org/ ste-fano/linotype/news/26/
As more information (Big Data) becomes available, it is more difficult to access what users are looking for.

We need new tools to help us understand and search among vast amounts of information.
Document Indexing is Important
Users can get desired information by indexing (or ranking) documents (or items). The higher a document is ranked, the more valuable it is to users.

- 108 -
Problems in Conventional Methods: Word Representation

The majority of rule-based and statistical Natural Language Processing (NLP) models regard words as atomic symbols.

In Vector Space Models (VSM), a word is represented by a single 1 and many zeros. For example,

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

Its problem:

motel [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] AND hotel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0] = 0

The conceptual meaning of words is ignored.
Topic Modeling

Latent Dirichlet Allocation (LDA) [Blei et al. (2003)].

Uncover the hidden topics that generate the collection.
Words and documents can be represented according to those topics.
Use the representation to organize, index and search the text.

$\text{apple} = [0.325,\ 0.792,\ 0.214,\ 0.107,\ 0.109,\ 0.612,\ 0.314,\ 0.245]^{\top}$

- 109 -
LDA [Blei et al. (2003)]

1 Choose the number of words $N \sim \mathrm{Poisson}(\xi)$.
2 Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$.
3 For $n = 1, 2, \ldots, N$:
    Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$;
    Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on the topic $z_n$.

Joint distribution: $p(\theta, \mathbf{z}, \mathbf{d} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
Indexing by LDA (LDI) [Choi and Lee (2010)]

With adequate assumptions, the probability of a word $w_j$ embodying the concept $z_k$ is

$W_j^k = p(z_k = 1 \mid w_j = 1) = \dfrac{\beta_{jk}}{\sum_{h=1}^{K} \beta_{jh}}$

The document (or query) probability can be defined within the topic space

$D_i^k\ (Q_i^k) = \dfrac{\sum_{j=1}^{V} W_j^k\, n_{ij}}{N_{d_i}},$

where $n_{ij}$ denotes the number of occurrences of word $w_j$ in document $d_i$ and $N_{d_i}$ denotes the number of words in the document $d_i$, i.e. $N_{d_i} = \sum_{j=1}^{V} n_{ij}$.

Similarity between document and query

$\rho(\vec{D}, \vec{Q}) = \vec{D} \cdot \vec{Q}$

where $\vec{D} \cdot \vec{Q} = \left\langle \dfrac{D}{\|D\|}, \dfrac{Q}{\|Q\|} \right\rangle$.

- 110 -
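The LDI formulas can be traced with a small numeric sketch. The word-topic matrix `beta` (V words by K topics) and the word counts below are made-up values for illustration; in practice β would come from a fitted LDA model. The function names are hypothetical.

```python
import numpy as np

def word_topic_probabilities(beta):
    """W[j, k] = beta[j, k] / sum_h beta[j, h], with beta of shape (V words, K topics)."""
    beta = np.asarray(beta, dtype=float)
    return beta / beta.sum(axis=1, keepdims=True)

def topic_representation(counts, w):
    """D[i, k] = sum_j W[j, k] * n[i, j] / N_{d_i}, with counts of shape (documents, V)."""
    counts = np.asarray(counts, dtype=float)
    return (counts @ w) / counts.sum(axis=1, keepdims=True)

def similarity(d, q):
    """Inner product of the normalized vectors D / ||D|| and Q / ||Q||."""
    return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))

# Hypothetical model with V = 3 words and K = 2 topics.
beta = [[0.4, 0.1], [0.1, 0.4], [0.25, 0.25]]
W = word_topic_probabilities(beta)

docs = topic_representation([[2, 0, 0], [0, 1, 1]], W)   # two documents
query = topic_representation([[1, 0, 0]], W)[0]          # one query

scores = [similarity(d, query) for d in docs]
```

Documents are then indexed by sorting them in decreasing order of `scores`.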
Indexing by Ensemble Model (EnM) [Wang et al. (2013)]

Motivation: There exist optimal weights over constituent models.

Table: A toy example. The values in the table represent similarities of documents with respect to a given query. The scores of Ensemble 1 and 2 are defined by 0.5*Model 1 + 0.5*Model 2 and 0.7*Model 1 + 0.3*Model 2, respectively. The relevant document list is assumed to be {2,3}.

              Model 1   Model 2   Ensemble 1   Ensemble 2
Document 1    0.35      0.2       0.55         0.305
Document 2    0.4       0.1       0.5          0.31
Document 3    0.25      0.7       0.95         0.385
(M)AP         0.72      0.72      0.72         0.89
AP and MAP

Average Precision (AP) and Mean Average Precision (MAP)

Notation

$|Q|$: the number of queries in the query set;
$|D_i|$: the number of documents in the relevant document set w.r.t. the ith query;
$d_{ij} \in D_i$: the jth document in $D_i$;
$\phi_{ki}$: the relevance score returned by the kth model w.r.t. the ith query;
$R(d_{ij}, \phi_{ki})$: the indexing position of the jth document for the ith query returned by the kth model;
$H = \sum \alpha_k \phi_k$: the ensemble model, a linear combination of the constituent models, where $\alpha_k \ge 0$.

Definition

$E(H, Q) = \dfrac{1}{|Q|} \sum_{i=1}^{|Q|} AP(H, D_i), \qquad AP(H, D_i) = \dfrac{1}{|D_i|} \sum_{j=1}^{|D_i|} \dfrac{j}{R(d_{ij}, H)}.$

- 111 -
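The definition can be turned into a short sketch, assuming the relevant documents are enumerated in order of their ranked position so that the j-th relevant document contributes j / R. The function names and the three-document example are hypothetical and are not taken from the toy table.

```python
def average_precision(scores, relevant):
    """AP = (1/|D_i|) * sum_j j / R(d_ij), with relevant documents taken in ranked order."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    position = {doc: rank for rank, doc in enumerate(ranking, start=1)}
    ranks = sorted(position[doc] for doc in relevant)
    return sum(j / r for j, r in enumerate(ranks, start=1)) / len(ranks)

def mean_average_precision(all_scores, all_relevant):
    """MAP: the mean of AP over all queries."""
    aps = [average_precision(s, rel) for s, rel in zip(all_scores, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical query: documents "a" and "c" are relevant and are ranked 1st and 3rd.
ap = average_precision({"a": 0.9, "b": 0.5, "c": 0.1}, {"a", "c"})
map_value = mean_average_precision(
    [{"a": 0.9, "b": 0.5, "c": 0.1}, {"a": 0.2, "b": 0.8}],
    [{"a", "c"}, {"b"}],
)
```

Here `scores` maps a document identifier to the similarity returned by a model or by the ensemble H for a single query.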
Formulation

Formulation of the Optimization Problem

Since $0 \le AP \le 1$, we can define the empirical loss as follows:

$\min \sum_{i=1}^{|Q|} \left( 1 - AP(H, D_i) \right)$, or

$\min \sum_{i=1}^{|Q|} \left( 1 - \dfrac{1}{|D_i|} \sum_{j=1}^{|D_i|} \dfrac{j}{R(d_{ij}, H)} \right).$

Our goal is to uncover optimal weights α's that minimize the empirical loss.

Difficulty

The position function $R(d_{ij}, H)$ is nonconvex, nondifferentiable and noncontinuous w.r.t. α's.
Boosting Scheme

1 Select model:

$\phi_j = \arg\max_j \sum_{i=1}^{|Q|} D_i\, AP(\phi_{ji})$;

2 Update the weight:

$\alpha_j^t = \alpha_j^t + \delta_j^t$,

where $\delta_j = \dfrac{1}{2} \log \dfrac{\sum_{i=1}^{|Q|} D_i \left( 1 + AP(\phi_{ji}) \right)}{\sum_{i=1}^{|Q|} D_i \left( 1 - AP(\phi_{ji}) \right)}$;

3 Update distribution on queries:

$D_i = \dfrac{\exp(-AP(H_i))}{Z}$,

where Z is a normalizer.

- 112 -
Coordinate Descent

Since the objective is nonconvex, not every coordinate will reduce the loss.

1 Select model:

$\phi_j = \arg\max_j E(Q, \phi_j)$;

2 Update the weight:

$\alpha_j = \dfrac{1}{2} \log \dfrac{1 + AP(\phi_{ji})}{1 - AP(\phi_{ji})}$;

3 If $E^t \le E^{t-1}$, delete this coordinate.
Parallel Coordinate Descent

The coordinate descent algorithm can be parallelized on cores.

1: parfor p = 1, 2, ..., K_φ do
2:   Update the weights using α_p = (1/2) log((1 + AP(φ_pi)) / (1 − AP(φ_pi)));
3: end parfor
4: return Ensemble model H.

- 113 -
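The parfor loop can be mimicked with a thread pool. This sketch assumes one scalar AP value per constituent model, strictly below 1; the function names are hypothetical, and the MAP values of TFIDF, LSI, pLSI and LDI from the results table are used only as sample inputs. A thread pool mirrors the structure of the loop, but in CPython it gives no real speed-up for such a cheap computation.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def model_weight(ap):
    """alpha = (1/2) * log((1 + AP) / (1 - AP)); requires AP < 1."""
    return 0.5 * math.log((1.0 + ap) / (1.0 - ap))

def parallel_weights(model_aps, max_workers=4):
    """Compute every model weight independently, one task per model."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(model_weight, model_aps))

# Sample inputs: the MAP values reported for TFIDF, LSI, pLSI and LDI.
weights = parallel_weights([0.4605, 0.5026, 0.5334, 0.5738])
```

Each weight depends only on its own model, which is what allows the updates to run concurrently and also why, as the discussion slide notes, the coupling between variables is ignored.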
Experimental Results on EnM

Data: MED corpus [1]. 1033 documents from the National Library of Medicine. 30 queries.

Results.

Table: MAP of various methods for MED corpus.

Method     MAP      improvement (%)
TFIDF      0.4605
LSI        0.5026   9.1
pLSI       0.5334   15.8
LDI        0.5738   24.6
EnM.B      0.6420   39.4
EnM.CD     0.6461   40.3
EnM.PCD    0.6414   39.3

Figure: Precision-Recall curves for various methods (TFIDF, LSA, pLSI, LDI, EnM).

[1] ftp://ftp.cs.cornell.edu/pub/smart.
Conclusions and Discussion

Conclusion

An ensemble model (EnM) is proposed, and three algorithms are introduced for solving the optimization problem.
The EnM outperformed every basis model across all recall regimes.

Discussion

The algorithms are not guaranteed to converge to the global optimum because the objective is nonconvex.
The parallel coordinate descent algorithm cannot guarantee an optimum, not even a local one, because of the coupling between variables.

Future Works

Approximate the objective with convex functions.
Use stochastic gradient descent for stochastic sequences and large-scale data sets.

- 114 -
References

Yanshan Wang and In-Chan Choi (2013). Indexing by ensemble model. Working Paper. arXiv preprint arXiv:1309.3421.
David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.
In-Chan Choi and Jae-Sung Lee (2010). Document indexing by latent Dirichlet allocation. DMIN, 409-414.
Y. Freund and R. E. Schapire (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, Springer, 23-37.

My Homepage: http://optlab.korea.ac.kr/~sam/

The End

- 115 -
- 116 -
, , , ,
Page 2
� Mobile device computing • Context-aware
- 117 -
Page 3
,
( )
Communication, web
history
App
As is
Page 4
�• Context-aware
•
� : /
�•
�• –
• : • :
�••
- 118 -
Page 5
–
?
?
10:00 39
12:00 3
13:00
15:00
16:00
18:00
18:30 39 ?
? ? ?
Page 6
:
�• ( , )
�• A ,
5511 .
- 119 -
Page 7
:
�• ( , , )
�• B 1
. , eTL ..
Page 8
�• 2 4• ,
•• /
••
• GPS: • : • : • : /
•• / /
•• ,
• , • -
• : • On/Off:
- 120 -
Page 9
–
1.
5.
/
/
6-1. ( /)
//
SNS( , )
2. /
6-2. ( )
TV
( , )
/
3. /
// /
7. /
/
4. / 8.
1.2.3.4.5.6.
Page 10
� 2• 1 : 10 / 2012 9 ~10• 2 : 50 / 2012 11 ~12
�• (OS 2.3 )• : 2 4 , 10~30• : 1
�• 5Mbytes•
- 121 -
Page 11
� 47• 20
• , , , , • cross validation 40~50%
• 1/3 /
•• / / �
- 9562 - / 332 - / 97
/ - / 3350 ( )- / 331 ( )- 93
- 2908 - / 248 / - 91
- 2388 ( / )-SNS( ) 229 / - 88
/ - 2012 ( )- 227 ( / )- 74
- 906 / - ( , ) 210 / - 71
- 766 - 160 ( )- 71
- / 744 / - / 146 ( )- 66
- 717 - / 145 - / 62
( / )- 692 / - 142 / - / 57
- 625 - 121 ( / )- 50
( / )- 610 / - 118 - / 40
( / )- 525 ( )- / 113 - 32
( )-TV 422 - 106 - 25
/ - 408 - 104 ( )- 17
/ - 341 - 98
Page 12
–
Figure: Location (x-axis: longitude, 126.4 to 127.4; y-axis: latitude, 36.7 to 37.6).
- 122 -
Page 13
� : �
1.• API • �
2.•
3.••
GPS trajectory
Sensors
Misc. context
Date/Time
Page 14
–
� ( , ) �• 1)
•• (merging)• 50 518
- 123 -
Page 15
–� : ,
�•
•• ) DBSCAN trajectory clustering (CB-SMoT)
••
• CHAMELEON
••
• Multivariate Gaussian, Gaussian mixture, kernel density estimation, …
•• ,
Page 16
– HMM & Adaboost�
• Hidden Markov model• / /
••
• >> API ( 15~20% )
� �• AdaBoost
• AdaBoost•
• precision• : 80%, : 74.6%, : 64.7%
- 124 -
Page 17
– Multiple instance learning� Multiple instance learning*
• Supervised learning•
• , ,
•
� MIL • MIL
• ?• ,
• : 10~20
Page 18
– Multiple instance learning� mi-SVM*
• Multiple instance learning SVM•
�•
• : , • Chunk
• , , • ,
••
� RBF linear kernel• kernel • RBF, linear kernel : 63:37
- 125 -
Page 19
– Multiple instance learning� mi-SVM cross validation
�••
•
�• mi-SVM
• Viterbi • 52.2%
• 43.9% / 60.8%
••
Sensitivity(Recall)
/ //
SVM 76.0% 5.3% 77.5% 43.8% 47.9% 36.0% 16.4% 70.0% 20.1%
77.0% 22.1% 80.0% 48.6% 67.3% 51.2% 50.5% 90.0% 72.1%
56.9% 10.0% 60.0% 0.0% 35.7% 10.0% 37.8% 90.0% 45.5%
87.6% 54.9% 100.0% 89.3% 93.6% 82.4% 73.3% 90.0% 90.3%
(Precision) 82.6% 34.0% 86.7% 73.9% 68.9% 50.3% 45.2% 91.7% 34.9%
(F-measure) 79.7% 26.8% 83.2% 58.6% 68.1% 50.8% 47.7% 90.8% 47.0%
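The F-measure row can be checked against the precision row and one of the sensitivity rows. The sketch below assumes that the second sensitivity row (77.0%, 22.1%, ...) is the one paired with the reported precision; this pairing is inferred from the numbers only, since the row labels did not survive in the transcript.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Values copied from the table, one entry per activity column.
RECALL = [77.0, 22.1, 80.0, 48.6, 67.3, 51.2, 50.5, 90.0, 72.1]
PRECISION = [82.6, 34.0, 86.7, 73.9, 68.9, 50.3, 45.2, 91.7, 34.9]
REPORTED_F = [79.7, 26.8, 83.2, 58.6, 68.1, 50.8, 47.7, 90.8, 47.0]

if __name__ == "__main__":
    for p, r, reported in zip(PRECISION, RECALL, REPORTED_F):
        print(f"P={p:5.1f} R={r:5.1f} F={f_measure(p, r):5.1f} (reported {reported})")
```

The recomputed values agree with the reported F-measures to within rounding of the inputs, which supports the assumed pairing.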
Page 20
�•
• (activity) (behavior) •
•
•• coverage • 1.7%
•••
(2 4 )
•
- 126 -
Page 21
�• (Transfer learning)
• (reusable) •
•(cold start )
• /• Noise/Error• ,
•••
••• ( )• , •
Page 22
THANK YOU!
- 127 -
- 128 -
2: Feature Selection& Efficient Computing
- 129 -
- 130 -
1
Fused Lasso
2
Contents
Motivation••
Introduction••
Algorithm••
Results•••
Conclusion
- 131 -
3
Motivation��
4- 132 -
5
6
Reference : http://ghanahealthnest.com/2013/04/23/ghana-revenue-authority-rolls-out-a-new-malaria-control-strategy/
•
•
•
- 133 -
7
Introduction��
8
Fused Lasso
- 134 -
9
•
•
•
•
10
VMales1
MMales7
VFmales1
� � ��
•••
VMales14
MFmales7
VFmales14
- 135 -
11
VMales1
MMales7
VFmales1
VMales14
MFmales7
VFmales14
�
VMales1Y/N
MMales7Y/N
VFmales1Y/N
VMales14Y/N
MFmales7Y/N
VFmales14Y/N
�
12
� �
VMales1Y/N
MMales7Y/N
VFmales1Y/N
VMales14Y/N
MFmales7Y/N
VFmales14Y/N
�
- 136 -
13
Suarez, Estrella, et al. "Matrix-assisted laser desorption/ionization-mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes." Analytica Chimica Acta 706.1 (2011): 157-163.
•
•
•
•
14
- 137 -
15
16
•
•
•
•
Li, Lihua, et al. "Data mining techniques for cancer detection using serum proteomic profiling." Artificial intelligence in medicine 32.2 (2004): 71-83.
- 138 -
17
�
�
3. Better Performance
�
�
1. Sparsity
�
�
2. Smoothness
18
Algorithm��
- 139 -
19
(1)
Tibshirani, Robert, et al. "Sparsity and smoothness via the fused lasso." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.
���
20
(1)
m/z    Fused Lasso  Lasso    m/z    Fused Lasso  Lasso    m/z    Fused Lasso  Lasso
654.9  0.002        0.000    656.3  1.647        0.000    657.7  1.403        0.000
655    0.002        0.000    656.4  1.647        0.000    657.8  1.403        0.000
655.1  1.123        0.662    656.5  1.647        0.000    657.9  1.403        0.000
655.2  4.159        3.169    656.6  1.647        0.000    658    1.403        0.010
655.3  1.791        0.002    656.7  1.647        0.000    658.1  1.403        -0.015
655.4  1.791        0.000    656.8  1.647        0.000    658.2  1.403        0.225
655.5  1.791        0.000    656.9  1.647        0.000    658.3  0.252        0.000
655.6  1.791        1.300    657    1.650        0.000    658.4  0.252        0.000
655.7  1.791        0.000    657.1  1.650        0.344    658.5  0.252        0.000
655.8  1.791        0.000    657.2  1.639        2.609    658.6  0.252        0.000
655.9  1.791        0.143    657.3  1.403        0.010    658.7  0.252        0.000
656    1.791        0.001    657.4  1.403        0.000    658.8  0.252        0.000
656.1  1.791        1.495    657.5  1.403        0.000    658.9  0.252        0.000
656.2  1.647        0.149    657.6  1.403        1.613    659    0.252        0.000
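The table shows the fused lasso coefficients staying constant over runs of neighboring m/z values, while the lasso coefficients jump between zero and nonzero. A minimal sketch of the fused lasso penalty from Tibshirani et al. (2005), an L1 term on the coefficients plus an L1 term on differences of adjacent coefficients, makes this concrete. The weight names `lam1` and `lam2` are assumptions for illustration; the weights used in the talk are not given.

```python
def fused_lasso_penalty(beta, lam1, lam2):
    """lam1 * sum_j |beta_j| + lam2 * sum_j |beta_j - beta_{j-1}|."""
    sparsity = sum(abs(b) for b in beta)
    smoothness = sum(abs(b - a) for a, b in zip(beta, beta[1:]))
    return lam1 * sparsity + lam2 * smoothness


# Coefficients for m/z 655.3 to 656.1, copied from the table.
FUSED = [1.791] * 9
LASSO = [0.002, 0.000, 0.000, 1.300, 0.000, 0.000, 0.143, 0.001, 1.495]

if __name__ == "__main__":
    # With lam1=0 and lam2=1 the penalty is the total variation of the coefficients.
    print("fused lasso total variation:", fused_lasso_penalty(FUSED, 0.0, 1.0))
    print("lasso total variation:      ", fused_lasso_penalty(LASSO, 0.0, 1.0))
```

Over this window the fused lasso coefficients have zero total variation, whereas the lasso coefficients change repeatedly. This is the smoothness property the difference term is meant to encourage.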
- 140 -
21
(1)
��
22
Liu, Jun, Lei Yuan, and Jieping Ye. "An efficient algorithm for a class of fused lasso problems." Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
- 141 -
23
(2)
24
Results���
- 142 -
25
Performance Comparison
Average misclass. rate
Average selected features
26
Figure (panels A, B): scatter plot on the 1st and 2nd principal components (legend: Others, MFemale7); "Fused lasso coefficient abs" (x-axis: m/z, y-axis: Coefficient); "Fused lasso selected features intensity value" (x-axis: m/z, y-axis: Intensity).
- 143 -
27
Figure: Coefficient vs. m/z (0 to 12000) for Fused lasso and Lasso.
28
•
•
•
- 144 -
29
30
Conclusion
- 145 -
31
�
3. More sophisticated
�
1. More accurate
�
2. More appropriate
32
Thank You
Where New Challenges Begin!
- 146 -
������������� ����������������������������������������������������������������
�������
!"�#$�%�&'()*+,
!�������-��������������.���-����/!.�#.0�1��2�
3456�78�9:�;<�=>?+@AB
�����
� 9�������
� ���������������������������
� �������������C������-�.�D���E�
� "�������������������E����
� �����������
� $FG���E�����H�����
� ��������
- 147 -
9�������
�������������#���
.GG�����!���I
"���������J����E����
# ��������J����E����� ��D�
#������������J����E����� ��D������F�/KL0
�����������J����E����� ��D������F�����������/(L0�
������������J����E����� ��D������F����������������
�9�������E����������������������-������������ �����GG���I�
- 148 -
� �����D�������EG�����M�
� $FG�������D� �� �����E��������-���������E�������F�EG������N���������������-����E�����������
��������C�E��������-�
!��������!����E����
C�E��������-
O2�2
9���E��������-����������D��E����������
������������E�GG��D������� �����E�������G���
C�E������H������
P .��������D�������������������������E��������-�
��EG����G��������G����E����
Q ���������D������������M�����������������������D
�2D2� �������������EG������������R�� ���-��������������������
S H������D�����EG��F�-������E�����E������E�E�-�
T $�������D�E����D�������M���
!�������������
# �U�������.GG������
� ��������������
�������������������� ������� �
��������� ������� ��������D
� ��������F�����
���������� ����������������� �
���������� ������ �#������E��D
!�.��1C.�/������V�0
- 149 -
���������������������������
��-������W� ��
��-������W� ���
W��J�H���E���������
.��J�C�G������-��� ��������
� !�����������D��G������E����� H�G���������X���G�������-��������������������������E����������
- 150 -
������������
52 U��G��������� G��
� U������������������������EG�������
���������������� ��������������V��G����������������������������V�G������/�G����0�
�2�2��� � � �����������������
32 !�������������� G��
� U������������������������EG�������
���������������� �����E���E����������������� ����������������������
����G����������2�2��� � � ����� ���! "#$%� ��! �&� � � �
!��������
���������
�G�������
�������������������D������E������FG����������D����������
���������������E����
�������������C������-�.�D���E�
- 151 -
� ���������/���������0
� C������������-
� #�E���EG��F�-
�������������C������-�.�D���E�
W�E� !��2 Y��� �EE���
9.�� 3446
� �����������-� ��EG�����EG��E��� #�E����������� O��-�G�����������������-
���� 3446
� W������ #��������������������������D�G�D-�����E���� �����E�����������������EG������9.��� ������� ����EG������9.��
%9#"WZ�� 3446 � .��������������������G�D-�����E���
9.��
�������� �� � �� ��������� ���� �
� ��� '� ������� ( � �)* + ,-./��0 1 2 345� $%� ��! �&� ��� �� � 6 (&����� ��� ��7 3(5� ������ ���� ��� �����
� ����� !�� ��������� �� �� �" � � ��� � � ��� # �� � 6 ( �� 38 ��$ ��� �� 3�5�%&&&�������
���������9���G����������
U� �������������GG�����
� ���D���������������J�Z��������9� ��
� �����������������J�Z��������: ��
����D���������������G�������
- 152 -
� .�E�������������������N����E������9.���
� [���D�G�D-�����E���52 9�����-��D�;<�=�32 9�����-��D�>;�=� ���9
�����R�%9#"WZ��
C������������N�����GG����
!��������
���������
�G�������
�EG�������G����E����������������������������������� ����������������������E�����
"������������������������.������ �����������������������N�������������E�����/���-���344\0
- 153 -
��������Z���������������������/���0
� �������������������������������������������� H���������������������������������?� H��������-��E�D���������@
���
�A� � ,-.BC)DEF
C G C C H I )DFF>���������� ������J �� �����KLM� �� ��� ���� ������� ����� ������� �������� ������N � ��KL�� �� ��� ���� ������� ����� �������� ��������� ����
.���������������G�����������������������������������������������������G������-
����E�E�H��������-���F�E�E�H���������/EH�H0�
� ����������������������������� H���������������������������������?� H��������-��E�D���������@
EH�H
� "O � ,-.FP�QR"�NS! T� H
I� U " NS! NV
FW�B
� "X � ,-.FP�QR" NS! T �Y I
� U " NS! NVFW�B
"�Z! [������ � ������ ����" ����� � �� �����Z��[" Z! [ � \]�^ \_�/� Z! [ `ab c _!]
c _ c ]
.������������H�H�G���������������������������������G������-��������������������
- 154 -
�����������
�����������
� W�������-���/W�0� ��EG������������������-��V����������������������
� TV Nd! Ne!�Nc f � TV � Nd! Ne!�Nc TV � ��TV�g��Nh&TV�c
hid
� ��GG��O������������/�O�0
� ������D� �G���������-G��G������ �����E�F�E�M������E��D��
,jkl!m!n �Io p e G T
o q r4 rst ut ��[S p q v�ZS� G w � I H rS
� ]Z�������W��D������/]WW0
� W�ZG���E�����E������������� ��������������� #���E���EE������������������������������D���������� ����
- 155 -
$FG���E�����H�����������D������C������
C���C�����G��
� .����D-J������-�����
� !�E��^D���J� ������G�E�����������D�D������N������
� �G����^D���J �������G�����X��������������D�D������N������
� 1��D�������J� ���������D�������
_�.����D-��!�E��^D������G����^D��� ��E�[�9���G���-�_�1��D����������E��������-�`����������G���-�
� ���D�������������
aZ������������������b�a���G������
C����� c���������� c������������ c����������
.����D- de 33d 3f
!�E��^D��� a\ 54d 3
�G����^D��� d4 344 3
1��D������� 5f6 544 3
- 156 -
H�������.����D-
C����� �������������� c������������������/.����g0
W� �O� ]WW
.����D-
33d
.���������� de d62d4�/\\23f0 \h2f4�/ee2530 ae2e5�/ee2af0�
9.��4245 h hf23a�/hh2de0 ed25h�/5442440 ef24h�/5442440
424a 56 h62\f�/h\2e60 ea2\d�/5442440 he234�/5442440
����i�
%9#"WZ��
42454 Wi. Wi. Wi.
424a
��� 5a df264�/\52560 df24f�/\32eh0 d626f�/\f2h50
EH�H�j �9C
34 \52h3�/ha23d0 \f2eh�/e\2a50 da2hh�/eh2dd0
54 dh24h�/\e26d0 df26d�/hd2f50 ah2f3�/e42640
a dh2ef�/\d2350 de243�/\h26a0 dh2de�/\62660
EH�H�j �9k
34 \52hh�/h523e0 \a2dh�/e62640 da24d�/ed2fe0
54 d62a\�/de2h60 d5234�/\42ea0 ah2a5�/d\25a0
a d524d�/dd2540 aa2e\�/d32aa0 a\2a\�/d526e0
� 9.��l mnopqrstuvwxyz{t|}~������{��
� �����pl������(���7 ���t���:�;�pl����t|L 9.���
��7op��q��p�����l���m� ¡¢
H�������!�E��
C����� �������������� c������������������/.����g0
W� �O� ]WW
!�E��
54d
.���������� a\ e325a�/eh2ea0 \3266�/5442440 \\2e5�/h326e0
9.��4245 f e32ha�/ef2640 de2ha�/\52d60 ha2de�/e\2da0
424a 6 e523a�/ef2ff0 \52fa�/\52630 hd243�/ed2\d0
����4245 3f e42\6�/e\2d\0 \d2da�/ed2af0 \62df�/h32\40
424a a6 h\2a5�/eh2d30 \a24a�/5442440 h42\6�/h42e30
%9#"WZ��4245 3f e5265�/e\2e40 \62e5�/ed2fd0 \f2ad�/ha2330
424a a6 he2ha�/eh2\30 \62f5�/5442440 \d24a�/hh2e50
��� d ea23\�/ea2f\0 dh23\�/\623a0 h62ah�/5442440
EH�H�j �9C
a4 e32f\�/eh2\h0 \42ha�/5440 \62ad�/h42hh0
34 ea24f�/ed2ha0 \32ha�/e62ff0 h62a3�/\d2fa0
a ea23e�/ed2350 d\25d�/\32\\0 ha2e5�/ed2560
EH�H�j �9k
a4 e42\a�/eh2\f0 \f2h3�/5442440 \623a�/hd2d30
34 ef266�/e\2ff0 \524\�/e62e40 \d234�/h326f0
a e6256�/ed2a30 dh2ed�/\32\e0 \d234�/h326f0
� £yz{<ml¤��¥,¦§�opqrs�m¨�©���ª��«�¬
� <®¯°±²opqrs�mnopqrs,u³vL(´���
� <®¯°±²µ¶·�¸¹tº»mn¥,¦L¼��qr�p��(��pl�
- 157 -
H��������G�����½�����
C����� �������������� c������������������/.����g0
W� �O� ]WW
�G����
544
.���������� d4 e32d\�/ee2he0 h4244�/5442440 e\2\\�/eh2dh0
9.�� 4245i424a f e62hd�/e\2450 h525e�/hf2d40 e\2e4�/eh2hd0
����i
%9#"WZ��
42445 5f ed2ff�/ee2540 hd2\f�/e\2d30 eh236�/eh2ee0
4244a 3f ea24a�/ee2\\0 hf2eh�/5442440 eh26f�/ee2540
��� h ea2fa�/ee24e0 ha253�/ef25d0 eh2f6�/eh2e40
EH�H�j �9C
34 e\236�/ee26h0 h52d3�/5442440 e\2he�/eh2\60
54 ed23f�/ee2350 h\2\5�/ed2h60 e\2af�/eh2a60
a ed24f�/ee2340 h424a�/h\2d60 eh2ad�/eh2e40
EH�H�j �9k
34 e\236�/ee2\30 hf244�/5442440 eh26a�/ee2560
54 ea2f\�/ee26f0 h\2af�/ed26f0 eh23f�/eh2ef0
a e62hh�/eh24\0 h32e\�/he2d40 eh2f6�/eh2ha0
� £yz{<mEH�HZ�9C�¾s�l¤��¥,¦¿
� ��©mn¾s�LuÀ7pÁ���� Â
� �Ã:�;t|L���opqrs�Ä,�ÅÆ
H�������1��D�������
C����� �������������� c������������������/.����g0
W� �O� ]WW
1��D�������
544
.���������� 5f6 e6236�/ee2440 h\2h5�/5442440 e\2f4�/eh2fe0
9.�� 4245i424a d e62h5�/eh2he0 ef23�/ef2ha0 eh2a4�/ee2\30
����4245 a e32d4�/eh2hh0 e62f�/ea2ef0 ea2h3�/ed25d0
424a a e52d4�/eh26a0 e5�/ea2aa0 e62af�/ef24d0
%9#"WZ��4245 f e42h5�/e\2h40 e42d�/e32a50 he2h\�/e425h0
424a h ef245�/eh2h40 hh23�/ed2da0 ea2fh�/ea2ee0
��� 5\ ed244�/ee2\\0 he2d�/ee2df0 e\2h6�/eh2\30
EH�H�j �9C54 ea246�/ee2440 he23�/ed2h50 ed2h\�/e\2f50
a e6245�/e\2h30 e42h�/ef2630 e62h4�/ef23\0
EH�H�j �9k54 e52h5�/eh2\f0 he�/e62af0 ef2e\�/ed23a0
a e5244�/ed2ee0 e42h�/e326a0 e526e�/e52d40
� W��yz{¦ÇÈ�É� ���ll¤��¥,¦ ��mnÊyz{t|����¾s��Ë�
� W��yz{t|L�p¦�Ì7mÍ�� ¾sL�K���¥,»�p�
� ��©mn¾s�LuÀ7pÁ���� Â
� �Ã:�;t|L<®¯°±²opqrs�Ä,�ÅÆ
- 158 -
H�������!�E���/�E������EG�����M�0
C����� �������������� c������������������/.����g0
W� �O� ]WW
!�E��
a4
.���������� a\ h\235�/e52a40 \d2f6�/5442440 ed255�/e\2h40
9.�� 4245i424a 6 e623\�/e\2660 h32h\�/hf2\40 eh24e�/eh2hh0
����42445 e ef244�/eh2440 \h2ha�/e62550 ea266�/e\25e0
4244a 53 ea2da�/ee2530 \4246�/e\2330 ea2ae�/ed26a0
%9#"WZ��42445 e ef255�/eh2da0 \h2h3�/e62\e0 ea26a�/ed2ef0
4244a 5f e42h5�/ef2d\0 \426a�/ea2d50 e62\e�/ed26a0
��� f \52de�/h42ee0 fe233�/d52e30 e52ad�/ea2350
EH�H�j �9Ci�9k54 hh243�/e62a40 fh2hf�/\\2660 e52a5�/ef2ef0
a ha2da�/e323h0 af2fa�/dh25f0 e42e\�/e62h40
� m��p�op©�Î{��¡����� %9#"WZ��� Ï�pÁ� 5i54Ð o(
� :�;�p¦op�p mÑ�Ò©x(����opqrs�mnopqrstuv}~7¥,
�ÐÓ�Ô /H���0�¾s
H��������G�����½������/�E������EG�����M�0
C����� �������������� c������������������/.����g0
W� �O� ]WW
�G����
a4
.���������� d4 ha2e\�/e62360 h4253�/5442440 ed25d�/e\2\\0
9.�� 4245i424a 6 h\2hh�/ef26d0 hd26e�/hh2h50 ee2d5�/ee2h50
����i
%9#"WZ��
42445 \ ef2h4�/ed2ae0 he233�/eh2550 eh26h�/ee24\0
4244a e ed2f\�/eh2aa0 h62d5�/eh2\30 e\2e5�/eh2h50
��� \ e62d5�/ed2350 \e2ah�/ef2650 eh2ea�/ee2f30
EH�H�j �9C�54 e6235�/eh2630 hf26e�/eh2\60 eh244�/eh2hf0
a e32f4�/e\2hd0 \e25e�/e42440� ed2e6�/e\2fe0
EH�H�j �9k54 ea233�/eh2\d0 h62ah�/eh2d50 eh2h5�/ee2650
a e52d5�/ea2d60 h426\�/he2\30 eh2ae�/ee25\0
� :�;��p¦op�p m�Ò©x(�mnopqrs�����Õ�Ö�(×� ��©
���opqrs��Ø7��t|Ù�Ì�Ë7(×� Â
��Ú� !�E���:�;�ÛÜ���¥,L¼
� ��©Ý:�;t|� !�E���:�;��Þ¸mnopqrs�����ßà�o���¬
- 159 -
H�������1��D��������/�E������EG�����M�0
C����� �������������� c������������������/.����g0
W� �O� ]WW
1��D�������
a4
.���������� 5f6 e42h5�/ea2540 hd2f5�/5442440 e\24f�/eh2330
9.�� 4245i424a f e32f3�/e\2330 he235�/e\2a50 ee2h3�/ee2e30
����i�
%9#"WZ��
4245 6 ed244�/ee25e0 ef2hf�/ed2630 ed25d�/ed2440
424a 5d e32h\�/e\2eh0 ha2d6�/5442440 e\2da�/eh2aa0
��� 5a ea2da�/eh2h30 he25e�/5442440 ee2f6�/ee2\60
EH�H�j �9C5a e62eh�/eh2\h0 hf2\e�/5442440 eh23d�/eh2e60
a ea2h4�/eh2he0 e62d5�/ea2\f0 ed244�/ed2ah0
EH�H�j �9k5a ed244�/eh2hh0 e4244�/5442440 ed25\�/ed2e50
a e6235�/e\2fe0 hd2f5�/ed2650 ea2de�/ed2430
� :�;��p¦op��p mÑ�Ò©x(���{Aáâ©ã����¾s,mn¾sÇ��
ª��äåæpç
� Ý:�;t|���¾s�l¤��¥,¦ �è��©mnopqrs�L�t�é��¥,
� �Ã:�;t|L���opqrs�Ä,�ÅÆ
$FG���E�����H����������������C������
- 160 -
C���C�����G��
_���������!��������������E�]���H��D����ZE������������
� ���������������
54Z������������������b�a���G������
C����� c���������� c������������ c����������
��� ������ 3444 d3 3
!���� ������ 53d44 543 3
� ����������J� ���������������������/E�������-����0
� !�����������J� ������G����������������/E�������-����0
H������������������
- 161 -
H�������!������������
��������
- 162 -
��������
� ��EE��-
� yzyê�opqrsëÐ|�<®¯°±²
� fl����D�����:�;ì3l���������:�;t<®¯°±²opqrsíÈ
� mî7yz{tA7mnopqrs�,�u³yê
� ��������
� mî7:�;��mî7yz{tA�¡mnopqrs�,<®¯°±²opqrs
�ïðíëÐu³yê�¡<®¯°±²opqrs�Ä,ÅÆ
� ����������E��������-�
� mnopqrs�, Þ¸<®¯°±²opqrs�ªñAu:�;p�òót �7)
×�ôõ ��H�������
� 1�E����
� ö÷�op��pìøù�úLÐvê��û�üt£opqrs¾st| qr
ýoptA7þ�í�yê��
kR.
- 163 -
.GG����F
W�E� !��2 Y��� �EE���
9.�� 3446
� �����������-� ��EG�����EG��E��� #�E����������� O��-�G�����������������-
���� 3446
� W������ #��������������������������D�G�D-�����E���� �����E�����������������EG������9.��� ������� ����EG������9.��
%9#"WZ�� 3446 � .��������������������G�D-�����E���
!��� 344d
� �����������-� C�������������-�E����D�������G�D-�����E���� !�����E�����������-� C����D������G�������E�G�����i��������� C����D������E�������������E�G�����i��������
9!�Z�� 344\i344h
� �����������-
� #�E�������������������-�9EG�����
� ���������������!�������EG���D�/�������M�V0
� C����D������G�������E�G�����i��������
� C����D������E�������������E�G�����i��������
C"� 3455
� �����������-
� �����������������EG����� ���G����������
� ���������������9!�������EG���D
� C����D������G�������E�G�����i��������
� C����D������E�������������E�G�����i��������
� ��������Z����E�D�������E��-���.�D���E�
- 164 -
����
��'(���� �� � �� ��������� �� �� �
� �� � '� ������� � � ��� � � �0 �� 3�5� � � ��)* + �)* + ,-.xyz{ $%� ��! �&x�� ( � �)* + ,-./��0 z{ 345� $%� ��! �&�%� |�}�� �� � 6 (&�%� |(} ��" ��� � ��� 7 3(5# ���� �� ���� ��� �����
� ����� !�� ��������� �� '( �$ � � ��� � � �� � �% �� � 6 �&x � ���� x y �� 385 ���� �� � �� 3�5�� ����� ��
��������� �� ���� �������� �� �� ��
� �� � �������� �� � ��� � ���� ��� 7/�z{ ����� � 3�5
� �� �� � � �� ��������� ������ � � ��� � � ��#�� �� � � ��� �, � ���� ��� � 6 �&x �� �! � ~ x� � � ��� ( � �� � " �� � 6 �&x 7 3�5 ��# ��� �� 3�5$ ����� ��
C������������N�����GG����
%9#"WZ��
-�./01'(���� �� � '� ��#�� � � 3�5� ������
� �� ��� 2��� ������� �� �� �� �( � �)* + ,-./�{��z{ $%� ��! �&'�� ��� � ��� 7 3(5� ��#�� � ��#�� 3(5
� ����� !�� ��������� �� '( �" � � ��� � � �� � # �� � 6 �&x � ���� x y �� 385 ��$ �� � �� 3�5�% ���� ��#�� �� ����,�� ����� ��
-�./01������� �� ���� �������� �� �� ��
� �� � ������������ ��� ��� 7/�z{ ����� � 3�5
� ����� !�� ��������� �� ���� � � ��� � � ��� � � � ���( � �� � � �� � 6 �&x � ���� x y ( 7 �0 3�! �! (5���� ��� �� 3�5" ����� ��
C������������N�����GG����
- 165 -
- 166 -
- 167 -
- 168 -
- 169 -
Data Step, Statistical Summary, Tables/Cubes, Covariance, Linear & Logistic Regression, GLM, K-means clustering, …
- 170 -
- 171 -
- 172 -
- 173 -
- 174 -
- 175 -
- 176 -
- 177 -
- 178 -
- 179 -
- 180 -
- 181 -
- 182 -
- 183 -
- 184 -
- 185 -
- 186 -
- 187 -
- 188 -
- 189 -
- 190 -
- 191 -
- 192 -
- 193 -
- 194 -
- 195 -
- 196 -
- 197 -
- 198 -
- 199 -
- 200 -
- 201 -
- 202 -
- 203 -
- 204 -
3: Visualization & Text Analytics
- 205 -
- 206 -
- 207 -
- 208 -
�
� �
� �
- 209 -
�
Country City Latitude Longitude Year DataType DataType2 DataType3 Institution Purpose Scope-Collection
Scope-Application Time Lag Count Ratio
�
- 210 -
�
- 211 -
�
•
•
- 212 -
- 213 -
, vs
,
- 214 -
, 2,
, 3
- 215 -
vs ,
,
- 216 -
- , 3
--
1
1
- 217 -
- 218 -
- 219 -
- 220 -
18
2- 221 -
�
3
•
•
•
4
�
•
•
- 222 -
5
�
•
•
6
�
•
- 223 -
7
�
•
∙
8- 224 -
9
10
•
- 225 -
11
•
•
•
12
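\[
\text{Valid Voting Ratio}_i = \frac{N_{cy} + N_{cn} + N_{py} + N_{pn}}{N_{all}}
\]
• Nall = Total number of conservative/progressive parties
• Nall ≥ Ncy + Ncn + Npy + Npn
\[
\text{Yes-No Diversity}_i = -\sum_{k \in \{y, n\}} P_k \log_2 P_k,
\qquad
P_y = \frac{N_{cy} + N_{py}}{N_{cy} + N_{cn} + N_{py} + N_{pn}},
\qquad
P_n = \frac{N_{cn} + N_{pn}}{N_{cy} + N_{cn} + N_{py} + N_{pn}}
\]
\[
\text{Political Orientation Diversity}_i = -\sum_{i \in \{c, p\},\, j \in \{y, n\}} P_{ij} \log_4 P_{ij},
\qquad
P_{cy} = \frac{N_{cy}}{N_{cy} + N_{cn} + N_{py} + N_{pn}},
\qquad
P_{cn} = \frac{N_{cn}}{N_{cy} + N_{cn} + N_{py} + N_{pn}},
\ \ldots
\]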
- 226 -
�
�
13
Bill Contentiousness
Valid Voting Ratio (c1)
Yes-No Diversity (c2)
Political Orientation Diversity (c3)
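The three components of bill contentiousness can be sketched as follows. This is a minimal illustration of one reading of the slide's formulas: both diversity terms are taken to be entropies, with log base 2 over yes/no and log base 4 over the four party-by-vote cells, so each lies in [0, 1]. The function names are assumptions, and how c1, c2 and c3 are combined into a single score is not shown because the transcript does not state it.

```python
import math


def _entropy(probs, base):
    """Entropy with the given log base; zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def valid_voting_ratio(n_cy, n_cn, n_py, n_pn, n_all):
    """Share of members who cast a yes or no vote (c1)."""
    return (n_cy + n_cn + n_py + n_pn) / n_all


def yes_no_diversity(n_cy, n_cn, n_py, n_pn):
    """Entropy of the yes/no split, log base 2 (c2)."""
    total = n_cy + n_cn + n_py + n_pn
    p_yes = (n_cy + n_py) / total
    p_no = (n_cn + n_pn) / total
    return _entropy([p_yes, p_no], 2)


def political_orientation_diversity(n_cy, n_cn, n_py, n_pn):
    """Entropy over the four (party, vote) cells, log base 4 (c3)."""
    total = n_cy + n_cn + n_py + n_pn
    return _entropy([n / total for n in (n_cy, n_cn, n_py, n_pn)], 4)


if __name__ == "__main__":
    # Hypothetical bill: conservatives all vote yes, progressives all vote no.
    counts = dict(n_cy=50, n_cn=0, n_py=0, n_pn=50)
    print(valid_voting_ratio(**counts, n_all=100))
    print(yes_no_diversity(**counts))
    print(political_orientation_diversity(**counts))
```

In the hypothetical party-line vote above, the yes/no split is maximally diverse while the party-by-vote diversity is only half of its maximum, since two of the four cells are empty.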
14
•
X X
Y Y
- 227 -
15
i
j
nn
nnnnn
yy
yyyyy
Fynnn
nnnynn
nn
Fnyyy
yyynyy
yy
PrecisionRecallPrecisionRecall21,Precision,Recall
PrecisionRecallPrecisionRecall2
1,Precision,Recall
���
��
��
�
���
��
��
�
� nyyn FFF 11211 ��
16
�
�
Voting Similarity (VS)
Shared Voting Ratio (c1)
F1yn(c2, c3)
- 228 -
17
•
•
18
•
•
- 229 -
19
•
•
20
•
- 230 -
21
•
22
•
- 231 -
23
•
24
•
- 232 -
25
•
26
•
- 233 -
27
•
28
- 234 -
29
- 235 -
- 236 -
�
- 237 -
�
•
•
•
•
�
•
•
�
•
•
- 238 -
�
�
�
- 239 -
�
�
- 240 -
�
�
•
�
•
•
•
�
•
- 241 -
�
•
•
- 242 -
�
•
�
•
•
�
•
0
50
100
150
200
250
300
Total
- 243 -
�
•
�
�
•
•
•
�
�
�
- 244 -
�
�
•
•
–
•
- 245 -
�
•
•
�
�
�
•
•
None
Dove
Neutral
Hawk
Bernanke Bies Duke Ferguson Gramlich Greenspan Kohn Kroszner Meyer Olson
'notably' 'auditors' 'economic-recovery' 'internationally' 'low-income' 'doubtless' 'board-staff-contributed' 'kroszner' 'nairu' 'gramm-leach-bliley-act'
'ben' 'internal-control' 'foreclosures' 'obviously' 'social-security' 'standards-living' 'own-necessarily' 'randall' 'trend-growth' 'bank-charters'
'likewise' 'corporate-governance' 'foreclosure' 'thank' 'income-borrowers' 'century' 'board-staff-contributed-remarks' 'randall-kroszner' 'supply-shocks' 'consolidation''bernanke' 'internal-audit' 'neighborhood-stabilization' 'regarding' 'social' 'half-century' 'board-staff' 'footnotes' 'utilization-rates' 'provisions'
'working-paper' 'internal-controls' 'crisis' 'inviting-me' 'getting' 'technologies' 'odds' 'am-delighted' 'nevertheless' 'net-interest-margins''financial-crisis' 'mitigating-controls' 'credit-availability' 'private-sector' 'involves' 'consequence' 'asset-prices' 'speech-delivered' 'favorable-supply-shocks' 'retail'
'pp' 'control-processes' servicers' 'communication' 'reinvestment' 'evident' 'footnotes-views' 'protect-consumers' 'cyclical' 'largest-banks''text' 'auditor' 'text-see' 'sectors' 'income-households' 'societies' 'open-market-committee' 'vol' 'leaves' 'commercial-banks'
'text-see' processes' 'financial-crisis' 'similarly' 'national-saving' 'surely' 'inflation-expectations' 'during-market' 'limits' 'safety-soundness''february' 'committee-sponsoring-organizations' 'community-banker' 'thank-inviting-me' 'lower-income' 'arguably' 'footnotes' 'turbulence' 'trend-rate' 'financial-services''university' 'treadway-commission' 'recovery' 'technology' 'predatory-lending' 'war' 'remarks-text' 'subprime' 'above-trend-growth' 'lines''washington' 'risk-management-practices' 'across-country' 'therefore' 'budget' 'owing' 'views-expressed' 'subprime-mortgages' 'disciplined' 'margins'
'economic-financial' 'officers' 'vacant' 'contingency' 'low-moderate-income' 'century-ago' 'unusually' 'journal' 'consensus' 'establishing''notwithstanding' 'risk-exposures' 'text-information' 'preparations' 'standpoint' 'ever' 'open-market-committee-fomc' 'home-ownership-equity-protection-act' 'full-employment' 'marketplace'
'economic-recovery' 'enterprise' 'purchase' 'economies' 'desirable' 'world-war-ii' 'households-businesses' 'unfair-deceptive' 'below-trend' 'asset-quality''leading' 'financial-reporting' 'board-governors-system' 'occur' 'neighborhood-reinvestment-corporation' 'goods-services' 'stance' 'undertake' 'avoiding' 'charter''pdf' 'accounting-standards' 'small-business' 'active' 'slightly' 'edge' 'resource-utilization' 'text-randall-kroszner' 'absence' 'profile'
'march' 'enterprise-risk-management' 'expenses' 'observers' 'saving' 'wholly' 'asset-price' 'due-diligence' 'bank-supervision' 'providers''board-governors-system' 'coso' 'neighborhoods' 'sharing' 'huge' 'presumably' 'colleagues-open-market-committee' 'trust-verify' 'appreciation-dollar' 'deposit'
'broader-economy' 'customer' 'solutions' 'probably' 'seven' 'readily' 'elevated' 'strahan' 'financial-modernization' 'securities-insurance''principal' 'basel-ii' 'foreclosed' 'asian' 'numbers' 'human' 'central-banks' 'penalties' 'therefore' 'banking-industry'
'vol' 'complex-organizations' 'small-businesses' 'nations' 'retirement' 'enabled' 'anchored' 'subprime-mortgage' 'principle' 'banking-securities''journal' 'audit-committee' 'demand-credit' 'let-me' 'gets' 'vast' 'members-board' 'simultaneously' 'estimate-nairu' 'inviting-me''robert' 'operational-risk' 'owner' 'coordination' 'passed' 'conceptual' 'policy-actions' 'prepayment-penalties' 'viewed' 'competitive'
'j' 'internal' 'see-board-governors-system' 'accurately' 'neighborhood' 'necessity' 'price-stability' 'pp' 'precisely' 'deposit-insurance''september' 'management-internal' 'weak' 'computer' 'get' 'nonetheless' 'markets-economy' 'text' 'judgment' 'legal''november' 'directors' 'loans-small' 'domestic' 'really' 'generations' 'contributed' 'verify' 'excessive' 'compete''sharp' 'operational' 'website' 'topic' 'save' 'creative-destruction' 'views' 'textreturn' 'structural' 'internal-controls''bureau' 'determine-whether' 'communities' 'speak' 'political' 'fear' 'textreturn' 'mortgage-market' 'theme' 'financial-holding'
'financial-economic' 'invitation' 'economic-conditions' 'transparency' 'matter' 'living' 'damp' 'subprime-market' 'rising-inflation' 'offices''national-bureau-economic-research' 'exposure' 'real-estate-owned' 'shared' 'counseling' 'advances' 'path' 'taxes-insurance' 'monetary-policy-cannot' 'allowed'
'recovery' 'controls' 'owners' 'date' 'finances' 'market-forces' 'resource' 'structured' 'higher-inflation' 'strategic''c' 'appetite' 'real-estate' 'play' 'home-ownership' 'technological' 'views-expressed-own-necessarily' 'issued' 'banking-financial' 'reporting'
'david' 'auditing' 'board-governors' 'asia' 'studies' 'accordingly' 'remarks-textreturn' 'under-home-ownership-equity-protection' 'challenges-monetary-policy' 'safety''april' 'compliance' 'stabilization' 'recognizing' 'fine' 'productive' 'movements' 'ownership-equity-protection-act-hoepa' 'prevailing' 'laws'
'recession' 'risk-appetite' 'equally' 'complete' 'difference' 'decades' 'impaired' 'e' 'unchanged' 'opportunities''nevertheless' 'risk-management-processes' 'don-t' 'accurate' 'suppose' 'inevitable' 'anticipated' 'prudent' 'forecasting' 'bank-holding-companies'
'speech' 'lines-business' 'foreclosure-crisis' 'news' 'involving' 'perceived' 'contributed-remarks-text' 'washington' 'direction' 'community-banks''october' 'guidance' 'press-release' 'additionally' 'main' 'hence' 'own-necessarily-members' 'alt' 'relation' 'banks-united-states''john' 'board-directors' 'eligible' 'emerge' 'communities' 'markedly' 'economic-stability' 'correlated' 'specifically' 'banking-organizations''press' 'conflicts-interest' 'delinquent' 'financial-sector' 'home-mortgage-disclosure-act-hmda' 'valued' 'house-prices' 'foreclosures' 'interpretation' 'monitor''papers' 'credit-quality' 'during-financial-crisis' 'technologies' 'urban' 'produced' 'downward' 'proposed-rules' 'interpreted' 'established''m' 'profile' 'properties' 'forces' 'programs' 'endeavor' 'economic-activity' 'kb-pdf' 'constant' 'risk-profile'
'textreturn' 'invitation-speak' 'qualify' 'operating' 'present' 'largely' 'boosted' 'p' 'initially' 'reviews''helping' 'enterprise-wide' 'senior-loan' 'financial-market' 'minority' 'technology' 'outlook' 'www' 'via' 'conducted'
'american-economic-review' 'relating' 'before-begin' 'firm' 'shown' 'history' 'circumstances' 'association' 'appreciate' 'integrity''sharply' 'entity' 'inventory' 'computers' 'apart' 'newer' 'judging' 'words' 'call' 'regulator'
'remainder' 'monitoring' 'www' 'industries' 'cuts' 'emergence' 'macroeconomic' 'market-turbulence' 'otherwise' 'efficiency''stabilize' 'sarbanes-oxley-act' 'nonprofit' 'prove' 'community-development' 'virtually' 'resilient' 'rigorous' 'swings' 'vehicles'
'financial-stability' 'draft' 'small-business-owners' 'communications' 'groups' 'telecommunications' 'slack' 'causes' 'inflation-rate' 'ownership'
�
•
•
•
•
- 246 -
�
•
•
•
•
•
•
•
•
�
- 247 -
�
�
�
- 248 -
- 249 -
- 250 -
�
�
�
�
�
•
•
�
•
•
•
�
�
�
- 251 -
�
•
•
•
•
•
�
•
•
•
•
•
- 252 -
�
'standpoint' 'see-board-governors-system' 'saving' 'members-board' 'foreclosures' 'laws'
'financial-crisis' 'economic-conditions' 'retirement' 'policy-actions' 'proposed-rules' 'bank-holding-companies'
'economic-financial' 'owners' 'save' 'markets-economy' 'market-turbulence' 'community-banks''board-governors-system' 'real-estate' 'counseling' 'damp' 'supply-shocks' 'banks-united-states'
'broader-economy' 'board-governors' 'finances' 'resource' 'utilization-rates' 'banking-organizations''financial-economic' 'foreclosure-crisis' 'urban' 'views-expressed-own-necessarily' 'favorable-supply-shocks'
'american-economic-review' 'during-financial-crisis' 'minority' 'movements' 'full-employment''auditors' 'nonprofit' 'standards-living' 'contributed-remarks-text' 'below-trend'
'corporate-governance' 'private-sector' 'technologies' 'own-necessarily-members' 'appreciation-dollar''financial-reporting' 'sectors' 'world-war-ii' 'house-prices' 'financial-modernization'
'accounting-standards' 'economies' 'goods-services' 'downward' 'excessive''complex-organizations' 'observers' 'creative-destruction' 'outlook' 'banking-financial'
'audit-committee' 'asian' 'market-forces' 'circumstances' 'challenges-monetary-policy''invitation' 'nations' 'emergence' 'macroeconomic' 'provisions''exposure' 'domestic' 'board-staff-contributed' 'resilient' 'net-interest-margins''auditing' 'transparency' 'own-necessarily' 'slack' 'retail'
'compliance' 'asia' 'board-staff-contributed-remarks' 'footnotes' 'largest-banks''board-directors' 'financial-sector' 'board-staff' 'am-delighted' 'safety-soundness''monitoring' 'financial-market' 'asset-prices' 'during-market' 'financial-services'
'draft' 'firm' 'footnotes-views' 'turbulence' 'establishing'foreclosures' 'industries' 'open-market-committee' 'subprime' 'marketplace''foreclosure' 'low-income' 'footnotes' 'subprime-mortgages' 'asset-quality'
'crisis' 'social-security' 'views-expressed' 'unfair-deceptive' 'providers'servicers' 'income-borrowers' 'open-market-committee-fomc' 'strahan' 'deposit'
'financial-crisis' 'income-households' 'households-businesses' 'penalties' 'securities-insurance''community-banker' 'national-saving' 'stance' 'subprime-mortgage' 'banking-industry''across-country' 'lower-income' 'resource-utilization' 'prepayment-penalties' 'banking-securities'
'board-governors-system' 'predatory-lending' 'asset-price' 'mortgage-market' 'deposit-insurance'
'expenses' 'budget' 'colleagues-open-market-committee' 'subprime-market' 'financial-holding'
'foreclosed' 'low-moderate-income' 'central-banks' 'taxes-insurance' 'safety'
�
'economic-recovery' 'small-business' 'recovery' 'ownership-equity-protection-act-hoepa'
'national-bureau-economic-research' 'small-businesses' 'home-mortgage-disclosure-act-hmda' 'trend-growth'
'recovery' 'demand-credit' 'community-development' 'trend-rate'
'recession' 'loans-small' 'economic-activity' 'above-trend-growth'
'lines-business' 'real-estate-owned' 'protect-consumers' 'bank-charters'
'conflicts-interest' 'senior-loan' 'home-ownership-equity-protection-act' 'commercial-banks'
'credit-quality' 'small-business-owners' 'due-diligence' 'charter'
'enterprise-wide' 'reinvestment' 'trust-verify'
'economic-recovery' 'neighborhood-reinvestment-corporation' 'verify'
'credit-availability' 'home-ownership' 'under-home-ownership-equity-protection'
'stabilize' 'stabilization' 'consolidation' 'bank-supervision'
'financial-stability' 'technology' 'internal-controls' 'estimate-nairu'
'internal-control' 'coordination' 'monitor' 'structural'
'internal-controls' 'technologies' 'risk-profile' 'rising-inflation'
'mitigating-controls' 'technological' 'regulator' 'monetary-policy-cannot'
'control-processes' 'technology' 'operational-risk' 'higher-inflation'
'committee-sponsoring-organizations' 'inflation-expectations' 'management-internal' 'inflation-rate'
'risk-management-practices' 'price-stability' 'controls' 'gramm-leach-bliley-act'
'risk-exposures' 'economic-stability' 'risk-appetite' 'neighborhood-stabilization'
'enterprise-risk-management' 'structured' 'risk-management-processes'
'coso' 'nairu' 'guidance'
'basel-ii' 'disciplined' 'sarbanes-oxley-act'
4: SNS and Bibliography Analytics
Modified LDA with Bibliography Information
Young Min, Jun
System Optimization Lab., Korea University
Korea BI Data Mining Society (한국 BI 데이터마이닝 학회), 2013 Fall Conference
Contents
1. LDA
1.1 Topic Model
1.2 LDA
2. Modified LDA with Bibliography Information
2.1 Limitation of LDA
2.2 Introduction
2.3 Preliminary
2.4 Model
2.5 Expected Impacts
1.1 Topic Model
“Topic modeling provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus.” (DM Blei, 2012)
Example
• What are the “topics” in the New York Times?
• How do the “topics” on Twitter change over time?
• How similar are these articles?
Research on Topic Models
• LSA: based on dimensionality reduction via singular value decomposition (SVD)
• pLSA: mixture decomposition
• LDA: the most frequently studied model
Geometric Interpretation
• Three topics over three words.
• LDA yields a smooth distribution over the topics.
1.2 LDA
“LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.” (DM Blei, 2003)
Graphical Model
Example
Generative Process
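The generative process of standard LDA (Blei et al., 2003) can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the document length is fixed here rather than drawn from a Poisson distribution, and the names `generate_document`, `alpha`, and `beta`, as well as the toy sizes, are assumptions chosen for the example.

```python
import numpy as np


def generate_document(alpha, beta, doc_length, rng):
    """Sample one document from the standard LDA generative process.

    alpha: (K,) Dirichlet prior over topics.
    beta:  (K, V) per-topic word distributions, each row summing to 1.
    """
    # 1. Draw the document's topic proportions theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    # 2. For each word position, draw a topic z ~ Multinomial(theta).
    topics = rng.choice(len(alpha), size=doc_length, p=theta)
    # 3. For each position, draw a word w ~ Multinomial(beta[z]).
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, topics, words


rng = np.random.default_rng(0)
K, V = 3, 3                                # toy setting: three topics, three words
alpha = np.full(K, 0.5)                    # hypothetical symmetric prior
beta = rng.dirichlet(np.ones(V), size=K)   # hypothetical topic-word distributions
theta, topics, words = generate_document(alpha, beta, doc_length=20, rng=rng)
```

With `K = V = 3`, each row of `beta` is a point on the word simplex, which matches the "three topics for three words" geometric picture.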
2.1 Limitation of LDA
LDA is an effective tool for discovering topic structure, but there is room for further research to improve it. This research focuses on three aspects: individual, reference, and explanation.
Individual
• LDA is a generative model for a corpus, so it provides information about the document collection as a whole.
• For an individual document, it provides only a single vector of information.
• In this study, the modified LDA gives more information about each document.
Reference
• LDA does not consider references to the literature in its generative process.
• The modified LDA provides the bibliography of a document and its distribution.
Explanation
• LDA often gives results that are hard to understand.
• The modified LDA is expected to provide more explainable results.
2.2 Introduction
LDA is motivated by the process of writing a document. Similarly, the modified LDA is motivated by the process of writing a document in a library.
Generic Generative Process
More detail
• A place in the library carries information about the probability that each reference is selected.
• References in the same category have similar topics and words.
2.3 Preliminary
In this research, we use the language of text collections and introduce the terms “parent corpus”, “category”, and “document distribution”.
Parent Corpus
• A set of documents used as references.
• The parent corpus consists of parent documents.
• Parent documents influence the topics and words of the new document.
Document Distribution
• Each parent document has its own place.
• The document distribution is the probability distribution over which parent document is selected from the parent corpus.
Category
• A category is a cluster of parent documents.
• Parent documents in the same category share the same topic and word priors.
• Each parent document in a category has a probability of being selected.
2.3 Preliminary
The document distribution represents the information of a new document.
Document Distribution
Mixture of Gaussian Distributions
• The number of mixture components gives the number of categories.
• It gives a probabilistic as well as a deterministic representation of a document.
• Probabilistic: a distribution over the parent documents, giving the probability that each one is used in generating the document.
• Deterministic: a list of the documents with a high probability of selection.
2.3 Preliminary
This slide lists the assumptions of Modified LDA with Bibliography Information.
Parent Corpus
• The parent corpus is assumed to have its own alpha and beta.
• Each parent document is placed at a point in the document distribution.
LDA
• Bag-of-words assumption.
Document Distribution
• The probability of a parent document follows a Gaussian mixture distribution.
• The number of mixture components is assumed to be known (this assumption can be relaxed).
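One way to read the document distribution described in this section is sketched below. It assumes, hypothetically, that each parent document sits at a 2-D point and that selection probabilities are obtained by evaluating a Gaussian mixture density at those points and normalizing. The positions, mixture parameters, and function names are all illustrative and do not come from the slides.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[-1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    quad = np.einsum("...i,ij,...j->...", diff, inv, diff)
    return np.exp(-0.5 * quad) / norm


def document_distribution(positions, weights, means, covs):
    """Probabilistic representation: selection probability of each parent document."""
    density = sum(w * gaussian_pdf(positions, m, c)
                  for w, m, c in zip(weights, means, covs))
    return density / density.sum()


# Hypothetical parent documents: two near one category centre, two near another.
positions = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
weights = [0.7, 0.3]                                  # two mixture components = two categories
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

probs = document_distribution(positions, weights, means, covs)
# Deterministic representation: the parent documents most likely to be selected.
top = np.argsort(probs)[::-1][:2]
```

Here `probs` plays the role of the probabilistic representation and `top` the deterministic list of likely references.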
2.4 Model
This slide contains the notation, terminology, and generative process of Modified LDA with Bibliography Information.
Notation and Terminology
Generative Process
2.4 Model
This slide contains the graphical model and the probability of a document.
Graphical Model
Probability of Document
2.4 Model
Estimation
2.5 Expected Impacts
This research focuses on three aspects: individual, reference, and explanation.
• Verifying that a document is well classified.
• Representing information about the important references.
• Providing a variety of views for analyzing text data.
Explanation
Individual
• Including the bibliography in the probabilistic representation of a document.
• Detecting plagiarism by comparing document distributions.
Reference
2.5 Expected Impacts
Drawbacks
Dependency on LDA
• The model depends on LDA, for example in its perplexity and complexity.
Assumption
• The number of mixture components in the document distribution is assumed to be known (this assumption can be relaxed).
Computational Complexity
• The total number of operations is roughly on the order of O(N⁴k²).
References
[1] Jeff A Bilmes et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, 4(510):126, 1998.
[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] DM Blei. Topic modeling and digital humanities. Journal of Digital Humanities, 2(1):8–11, 2012.
[4] Nikos Vlassis and Aristidis Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77–87, 2002.
5: Recommendation Systems
6: Data Mining Applications
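A Study on the Use of Weather Information in Retail Marketing
김문욱a and 진서훈b
a Department of Applied Statistics, College of Science and Technology, Korea University
2511 Sejong-ro, Sejong Special Self-Governing City 339-700, Korea
Tel: +82-44-860-1762, E-mail: [email protected]
b Department of Applied Statistics, College of Science and Technology, Korea University
2511 Sejong-ro, Sejong Special Self-Governing City 339-700, Korea
Tel: +82-44-860-1555, E-mail: [email protected]
Abstract
To forecast the daily sales of a hypermarket, we first forecast the sales of each week from the sales trend. We then developed a model that forecasts daily sales using weather information. The model that uses weather information has higher explanatory power than the model that uses the sales trend alone.
Keywords:
correlation coefficient; regression analysis; weather information; hypermarket; sales forecasting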
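Introduction
Hypermarket sales can be forecast from a variety of factors. In practice, hypermarket companies forecast a year's sales on a weekly basis using data on seasonality, brand awareness, competitors, brands, and trade areas. These forecasts are used to improve product ordering systems and to plan marketing. If sales could be forecast on a daily basis, more efficient inventory management and better-targeted marketing would make it possible to achieve high efficiency at low cost.
In particular, the growth of hypermarkets began to slow in 2012, as weakened consumer sentiment kept consumers from spending and the revised Distribution Industry Development Act imposed mandatory closures. Under these conditions, more detailed daily sales forecasts are needed, both for marketing that increases visits and spending by existing customers and for efficient store management.
Daily sales are hard to forecast, however, because they are influenced by many variables that are not directly visible. In this study, we use one of those variables, weather information, to forecast daily sales.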
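Main Body
Data Mart
The data used in this study are the sales records of hypermarket A by store and by category from July 2011 through December 2012, together with weather information extracted for the same period. The weather variables are listed in Table 1.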
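Table 1 - List of weather variables
Variable   Description
STN_ID     Station number
TM         Observation time (year, month, day)
CA_MID     Mid- and low-level cloud amount (1/10)
CA_TOT     Total cloud amount (1/10)
CH_MIN     Lowest cloud height (100 m)
CT         Cloud type code
HM         Humidity (%)
PA         Station pressure (hPa)
PS         Sea-level pressure (hPa)
PV         Vapor pressure (hPa)
RN         Precipitation (mm)
SD_HR3     3-hour fresh snowfall (cm)
SD_TOT     Snow depth (cm)
SI         Solar radiation (MJ/m2)
SS         Sunshine duration (hr)
ST_GD      Ground state code
TA         Air temperature (℃)
TD         Dew point temperature (℃)
TE_005     Soil temperature at 0.05 m (℃)
TE_01      Soil temperature at 0.1 m (℃)
TE_02      Soil temperature at 0.2 m (℃)
TE_03      Soil temperature at 0.3 m (℃)
TS         Ground surface temperature (℃)
VS         Visibility (10 m)
WS         Wind speed (m/s)
WW         Weather phenomenon code (domestic format)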
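Modeling
Trend
As Figure 1 shows, the weekly sales of all stores combined follow the same trend every year. From 2012, however, the mandatory closures give the sales trend a zigzag shape, and because the dates of Seollal and Chuseok are set by the lunar calendar, the associated sales increase appears in different weeks from year to year. In addition, strong trends appear in the weeks that fall in the company's founding anniversary month, in Family Month (May), and around Christmas.
Figure 1 - Weekly trend graph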
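At the level of individual stores, however, each store shows a different trend. Because each store repeats its own trend every year, the trend in total sales is also the same from year to year. To group stores with similar trends, we examined the correlation coefficients between store sales, as shown in Table 2, identified similar stores, and derived a trend for each group of stores.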
Table 2 - Correlation coefficients between store sales
Code      Store_1  Store_2  Store_3  Store_4  Store_5  Store_6  Store_7
Store_1   100%     60%      57%      54%      60%      57%      63%
Store_2   60%      100%     85%      84%      77%      87%      86%
Store_3   57%      85%      100%     97%      89%      98%      97%
Store_4   54%      84%      97%      100%     93%      93%      96%
Store_5   60%      77%      89%      93%      100%     90%      92%
Store_6   57%      87%      98%      93%      90%      100%     96%
Store_7   63%      86%      97%      96%      92%      96%      100%
Stores with similar trends are then grouped together, as shown in Figure 2.
Figure 2 - Sales changes of stores with similar trends
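The grouping step can be illustrated with the correlation matrix from Table 2. The paper does not state how similar stores were identified, so the sketch below makes two assumptions: a hypothetical threshold of 0.90, and grouping by connected components (two stores share a group if a chain of pairwise correlations at or above the threshold links them). The function name `group_by_correlation` is likewise illustrative.

```python
import numpy as np

stores = [f"Store_{i}" for i in range(1, 8)]

# Correlation coefficients between store sales, copied from Table 2.
corr = np.array([
    [100, 60, 57, 54, 60, 57, 63],
    [60, 100, 85, 84, 77, 87, 86],
    [57, 85, 100, 97, 89, 98, 97],
    [54, 84, 97, 100, 93, 93, 96],
    [60, 77, 89, 93, 100, 90, 92],
    [57, 87, 98, 93, 90, 100, 96],
    [63, 86, 97, 96, 92, 96, 100],
]) / 100.0


def group_by_correlation(corr, threshold):
    """Group items linked by correlations >= threshold (connected components)."""
    n = len(corr)
    group_of = [-1] * n
    groups = []
    for start in range(n):
        if group_of[start] != -1:
            continue
        stack, members = [start], []
        group_of[start] = len(groups)
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if group_of[j] == -1 and corr[i, j] >= threshold:
                    group_of[j] = len(groups)
                    stack.append(j)
        groups.append(sorted(members))
    return groups


groups = group_by_correlation(corr, threshold=0.90)  # threshold is an assumption
named = [[stores[i] for i in g] for g in groups]
print(named)
```

Under this assumed threshold, Store_3 through Store_7 fall into one group while Store_1 and Store_2 each remain alone. With raw weekly sales instead of a ready-made matrix, `np.corrcoef` would produce the matrix first.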
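Daily Sales Forecasting
Using the results above, we obtain the weekly trend, apply the trends for holidays and mandatory closures, and arrive at a forecast of the sales of each week.
We then compute the share of sales by day of the week and apply it to the forecast weekly sales. Applying the share of sales by category completes the forecast of sales by day and by category.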
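The allocation described above amounts to multiplying the weekly forecast by a day-of-week share and a category share. The sketch below uses invented shares and an invented weekly total, and it assumes the category mix is the same on every weekday, which the paper does not specify.

```python
def daily_category_forecast(weekly_sales, weekday_share, category_share):
    """Split one weekly sales forecast into (weekday, category) cells."""
    forecast = {}
    for day, day_share in weekday_share.items():
        for category, cat_share in category_share.items():
            forecast[(day, category)] = weekly_sales * day_share * cat_share
    return forecast


# Hypothetical shares; each set sums to 1.
weekday_share = {"Mon": 0.10, "Tue": 0.10, "Wed": 0.10, "Thu": 0.10,
                 "Fri": 0.15, "Sat": 0.25, "Sun": 0.20}
category_share = {"liquor": 0.3, "toys": 0.2, "other": 0.5}

forecast = daily_category_forecast(1_000_000, weekday_share, category_share)
```

Because both sets of shares sum to 1, the cells add back up to the weekly forecast.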
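Modeling with Weather Information
We ran a regression with the daily forecast sales obtained above and the weather information as independent variables and actual sales as the dependent variable. In almost every category, explanatory power was higher than in the regression that used the daily forecast sales alone.
Table 3 shows the simple regression that uses only the daily forecast sales, without weather information. Table 4 shows the result of adding weather information to the independent variables under otherwise identical conditions. Adding weather information raises R-Square by about 0.13.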
Table 3 - Simple regression results
R-Square: 0.6384
Variable    Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1634778              699310           2.34      0.0217
SALE_62     0.55257              0.04434          12.46     <.0001
Table 4 - Regression results with weather information added
R-Square: 0.769
Variable           Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          -1860973             1109163          -1.68     0.097
SALE_62            0.57618              0.03774          15.27     <.0001
Mean temperature   336647               66869            5.03      <.0001
Mean humidity      67049                17918            3.74      0.0003
Explanatory power also rose in most of the other categories. In most cases, sales rose as the temperature rose in winter and as the temperature fell in summer.
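The comparison between the two regressions can be reproduced in outline as follows. The data here are synthetic and every coefficient is invented, so the resulting R-Square values say nothing about the paper's results. The sketch only shows the mechanics of fitting ordinary least squares with and without weather variables and comparing the fit.

```python
import numpy as np


def r_squared(X, y):
    """Fit OLS with an intercept and return the in-sample R-Square."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


rng = np.random.default_rng(0)
n = 200
forecast = rng.normal(100, 10, n)     # stand-in for the daily forecast sales
temperature = rng.normal(15, 8, n)    # stand-in for mean temperature
humidity = rng.normal(60, 10, n)      # stand-in for mean humidity
# Synthetic "actual sales" that depend on both the forecast and the weather.
actual = (5 + 0.6 * forecast + 0.8 * temperature + 0.2 * humidity
          + rng.normal(0, 5, n))

r2_trend = r_squared(forecast[:, None], actual)
r2_weather = r_squared(np.column_stack([forecast, temperature, humidity]), actual)
print(f"forecast only: {r2_trend:.3f}, with weather: {r2_weather:.3f}")
```

In-sample R-Square cannot decrease when regressors are added, so a gain such as the one reported between Table 3 and Table 4 is best read together with the significance of the added weather coefficients.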
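Notable Results
The entertainment and toy categories behaved differently. Table 5 and Table 6 show that on snowy weekends, sales increase as snow accumulation increases. A hypermarket is a place for family shopping, so this appears to reflect families with children going out more on snowy weekends, which raises sales as well.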
Table 5 - Regression results for the entertainment category on weekends
R-Square: 0.7524
Variable            Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept           421452               132309           3.19      0.0021
Entertainment       0.55766              0.03661          15.23     <.0001
Snow accumulation   503835               203283           2.48      0.0154
Table 6 - Regression results for the toy category on weekends
R-Square: 0.6911
Variable            Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept           18173896             4666139          3.89      0.0002
Toys                1.04387              0.12251          8.52      <.0001
Snow accumulation   17362609             2444824          7.1       <.0001
Humidity            -284168              82179            -3.46     0.0009
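There was also a category that is not affected by the weather at all: tobacco. In Table 7, none of the weather variables is statistically significant, which indicates that tobacco sales remain steady regardless of the weather.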
Table 7 - Regression results for the tobacco category
R-Square: 0.5649
Variable           Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          376084               2751863          0.14      0.892
SALE_33            0.91692              0.09141          10.03     <.0001
Mean temperature   -19109               58539            -0.33     0.7459
Wind speed         -19572               154368           -0.13     0.8998
Precipitation      -16605               21960            -0.76     0.4544
Humidity           4740.2652            14832            0.32      0.7511
In addition, Table 8 and Table 9 show that liquor sales increase as the temperature rises in summer and as the temperature falls in winter.
Table 8 - Regression results for the liquor category in summer
R-Square: 0.8475
Variable      Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     -12250912            2484309          -4.93     <.0001
SALE_26       0.88892              0.02195          40.5      <.0001
Temperature   304146               66443            4.58      <.0001
Humidity      47137                13641            3.46      0.0006
Table 9 - Regression results for the liquor category in winter
R-Square: 0.8979
Variable      Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     716022               257562           2.78      0.0058
SALE_26       0.83811              0.01724          48.6      <.0001
Temperature   -119691              32307            -3.7      0.0003
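Conclusion
This study produced a model that can forecast daily sales using weather information. Some of the modeling results were as expected, such as sales increasing as the temperature falls in summer and as the temperature rises in winter, but we also found categories whose sales increase when it snows on a weekend.
This means that the ordering system can be adjusted according to the next day's weather forecast, store operating costs can be reduced by determining an appropriate staffing level, and marketing spending can be timed to the points at which the weather is expected to decrease or increase sales, so that high returns can be achieved at low cost.
The coming era is the era of big data, in which large volumes of data can be processed and analyzed freely. In preparation, if next-day sales are forecast by product, rather than by day and category as now, and are used for more effective store operation and differentiated marketing, we expect that the growth curve of hypermarkets, which is projected to be negative every year, can be turned positive.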
(Product Life Cycle)
(Unsupervised learning)
(Supervised learning)
• Window size
• K-means clustering
$$J = \sum_{j=1}^{k} \sum_{n \in S_j} \lVert x_n - u_j \rVert^2$$
where $x_n$ is an observation, $u_j$ is the centroid of cluster $j$, and $S_j$ is the set of observations assigned to cluster $j$.
Kanungo et al., “A Local Search Approximation Algorithm for k-Means Clustering”.
H. Zha, C. Ding, M. Gu, X. He and H.D. Simon. "Spectral Relaxation for K-means Clustering", Neural Information Processing Systems vol.14 (NIPS 2001), pp. 1057-1064, Vancouver, Canada, Dec. 2001.
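As an illustration of the K-means objective J, the sketch below runs plain Lloyd iterations from explicitly given starting centroids and then evaluates J. This is the textbook procedure rather than necessarily the variant used in the study, and the toy points and starting centroids are invented. In the setting of the slides each row of `X` would presumably be one item's sales curve over the window, which is an assumption.

```python
import numpy as np


def kmeans(X, init_centroids, n_iter=20):
    """Lloyd's algorithm; returns labels, centroids, and the objective J."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    # J: sum over clusters of squared distances to the cluster centroid.
    J = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, J


X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels, centroids, J = kmeans(X, init_centroids=[[0.0, 0.0], [10.0, 1.0]])
print(labels, J)  # two clusters of two points each, J = 1.0
```

The result depends on the starting centroids, since Lloyd's algorithm only finds a local minimum of J. That limitation is what motivates the approximation and relaxation methods cited on the slide.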
[Figure: K-means clustering result with three groups labeled Class1, Class2, Class3]
• Decision tree (C5.0)
Entropy and information gain:
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
$$IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
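A small worked example may help with the entropy and information gain formulas. The rows and class labels below are invented; the attribute name `product_attribute` echoes the decision-tree example on the slides but is otherwise an assumption. Note that C5.0 itself ranks splits by gain ratio, a normalized form of information gain, so this sketch covers only the quantities shown in the formulas.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """IG(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical items: the attribute separates the two classes perfectly.
rows = [{"product_attribute": "A"}, {"product_attribute": "A"},
        {"product_attribute": "B"}, {"product_attribute": "B"}]
labels = ["Class1", "Class1", "Class2", "Class2"]

print(entropy(labels))                                      # 1.0
print(information_gain(rows, labels, "product_attribute"))  # 1.0
```

A split that leaves the class mix unchanged in every branch would give an information gain of 0.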
[Figure: example decision tree with two splits: "Product attribute?" (A / B) and "First week sales?" (10 or more / Less than 10)]
      C1   C2   C3
C1    49    3    1
C2     0   60    4
C3     2   12   38
[Figure: panels "Training" and "Test"; decision tree ("Product attribute?" A / B; "First week sales?" 10 or more / Less than 10) applied to Future item [Product attribute=B, First week sales=13]]
• ‘A’ 2011 ~ 2012
• 977 (2011 489 ,2012 488 )
21263865
39,000 ~1,490,000
• ‘A’: W = 17 (cumulative share reaches 90%)
Period   Share (%)   Cumulative share (%)
1        2           2
2        4           7
3        4           11
4        5           16
5        6           22
6        7           29
7        8           36
8        8           45
9        8           53
10       8           60
11       7           67
12       6           72
13       5           77
14       4           81
15       4           85
16       3           88
17       2           90
18       2           92
19       2           94
20       1           96
21       1           97
22       1           98
23       1           99
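The window size can be read off the cumulative shares above as the first period at which the cumulative share reaches a target level. The sketch below assumes that this is how W = 17 was chosen and that the target was 90%. The slide shows both numbers but its explanatory text did not survive, so the rule itself is an inference.

```python
# Cumulative shares (%) for periods 1..23, as listed on the slide.
cumulative = [2, 7, 11, 16, 22, 29, 36, 45, 53, 60, 67, 72,
              77, 81, 85, 88, 90, 92, 94, 96, 97, 98, 99]


def window_size(cumulative_share, target):
    """Return the first 1-based period whose cumulative share reaches target."""
    for period, share in enumerate(cumulative_share, start=1):
        if share >= target:
            return period
    return None  # target never reached within the observed periods


W = window_size(cumulative, target=90)
print(W)  # 17
```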
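• Silhouette statistic
$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad -1 \le S_i \le 1, \qquad S = \frac{1}{n}\sum_{i=1}^{n} S_i$$
• $a_i$: mean distance from observation $i$ to the other members of its own cluster
• $b_i$: mean distance from observation $i$ to the members of the nearest other cluster
Rousseeuw, Peter J. "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis." Journal of Computational and Applied Mathematics 20 (1987): 53-65.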
• 1• 0.5
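The silhouette statistic can be computed directly from its definition (Rousseeuw, 1987). The sketch below uses Euclidean distance and invented toy data, and it assigns a value of 0 to observations in singleton clusters, following the usual convention. None of the numbers relate to the study's data.

```python
import numpy as np


def silhouette_values(X, labels):
    """Per-observation silhouette S_i = (b_i - a_i) / max(a_i, b_i)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    values = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                  # exclude the observation itself
        if not same.any():
            continue                     # singleton cluster: S_i stays 0
        a = dist[i, same].mean()         # mean distance within its own cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        values[i] = (b - a) / max(a, b)
    return values


X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
values = silhouette_values(X, labels)
S = values.mean()                        # overall silhouette statistic
print(round(S, 3))
```

Values of S close to 1 indicate compact, well-separated clusters, so comparing S across candidate numbers of clusters is one way to choose k.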
[Figure: sales curves of the four clusters (2011): Slow-up, Fast-up, High Peak, Never-up]
• (2011 )
• : 76%
• 2012