2013 Fall Academic Conference (print version)

SAS

1: Text Mining Applications

2: Feature Selection & Efficient Computing

• Density-Based Graph Partitioning Algorithm , ( ) ················································································································· 7

• Scalable Variational Bayesian Matrix Factorization , (POSTECH) ················································································································· 31

• A Hybrid Genetic Algorithm for Accelerating Feature Selection and Parameter Optimization of SVM

, ( ) ················································································································· 43

• Documents Recommendation Using Large Citation Data , ( ), ( ) ······························································· 53

•, ( ) ················································································································ 75

•, ( ) ················································································································ 91

• Document Indexing by Ensemble Model Yanshan Wang, ( ) ································································································· 107

•, , , , ( ) ··································································· 117

• Fused Lasso, ( ) ·············································································································· 131

• Classification with discrete and continuous variables via Markov Blanket Feature Selection, (POSTECH) ·················································································································· 147

• R, ( ) ·············································································································· 167

• Revisiting the Bradley-Terry Model and Its Application for Information Retrieval, ( ) ·············································································································· 177

3: Visualization & Text Analytics

4: SNS and Bibliography Analytics

5: Recommendation Systems

6: Data Mining Applications

•( ) ····························································································································· 207

•( ), , , ( ) ··········································· 221

•, , ( ) ······························································································· 237

•, , , ( ) ····················································································· 257

• Modified LDA with Bibliography Information, ( ) ············································································································· 265

• : , ( ) ·············································································································· 273

•, , ( ) ··································································································· 319

• MovieRank: Combining Structural and Feature information Ranking Measure*, *, **, * (* , ** ) ···························· 329

• A New Approach to Recommend Novel Items, ( ) ············································································································· 343

•, ( ) ············································································································· 361

•, , ( ) ······························································································· 365

•, ( ) ············································································································· 377

SAS

- 5 -

- 7 -

[1] Pang-Ning, T., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. In Library of Congress.

[2] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.

- 8 -

[3] Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27-64.

•

[4] Boutin, F., & Hascoet, M. (2004, July). Cluster validity indices for graph partitioning. In Information Visualisation, 2004. IV 2004. Proceedings. Eighth International Conference on (pp. 376-381). IEEE.

[5] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

[6] Patkar, S. B., & Narayanan, H. (2003, January). An efficient practical heuristic for good ratio-cut partitioning. In VLSI Design, 2003. Proceedings. 16th International Conference on (pp. 64-69). IEEE.

$$\min \sum_{i=1}^{k} \sum_{j \in C_i,\; l \notin C_i} w_{jl}$$

where $k$ is the number of clusters
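As a concrete reading of the cut objective on this slide, the sketch below sums, for every cluster, the weights of edges that leave it. The weight matrix `w` and both partitions are made-up illustrations, not data from the talk.

```python
def cut_objective(w, clusters):
    """Sum over clusters C_i of w[j][l] for j in C_i and l outside C_i."""
    total = 0.0
    for cluster in clusters:
        members = set(cluster)
        for j in members:
            for l in range(len(w)):
                if l not in members:
                    total += w[j][l]
    return total


# Hypothetical symmetric weights: nodes 0-1 and 2-3 are tightly linked.
w = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]
good = cut_objective(w, [[0, 1], [2, 3]])  # cuts only the weak edges
bad = cut_objective(w, [[0, 2], [1, 3]])   # cuts the strong edges
```

A partition that keeps the heavy edges inside clusters gives the smaller value, which is what the minimization asks for.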

- 9 -

[7] Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34.
[8] Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366.

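• Graph Laplacian $L$

• Adjacency matrix $A$ and degree matrix $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$

$$L = D - A, \qquad A = [a_{ij}],\; i, j = 1, 2, \ldots, n, \qquad d_i = \sum_{j=1}^{n} a_{ij}$$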
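The definition L = D - A can be checked on a tiny graph. The following is a minimal sketch; the 3-node path graph is a hypothetical example, and plain nested lists stand in for matrices.

```python
def graph_laplacian(a):
    """Return L = D - A, where D is the diagonal matrix of row sums of A."""
    n = len(a)
    degrees = [sum(row) for row in a]
    return [
        [(degrees[i] if i == j else 0) - a[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical path graph 0 - 1 - 2 with unit edge weights.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
laplacian = graph_laplacian(adjacency)
```

Every row of the result sums to zero, a basic property that spectral clustering methods rely on.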

�

[9] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4), 395-416.
[10] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2, 849-856.

- 10 -

•

$$Q = \frac{1}{2m} \sum_{i=1}^{k} \sum_{j \in C_i,\; l \in C_i} \left( A_{jl} - \frac{d_j d_l}{2m} \right)$$
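To make the modularity formula concrete, the sketch below evaluates Q for a small unweighted graph. The graph (two triangles joined by one edge) and the partitions are hypothetical; `two_m` is the sum of all degrees, i.e. 2m.

```python
def modularity(a, clusters):
    """Q = (1/2m) * sum over clusters of (A_jl - d_j * d_l / 2m)."""
    degrees = [sum(row) for row in a]
    two_m = sum(degrees)
    q = 0.0
    for cluster in clusters:
        for j in cluster:
            for l in cluster:
                q += a[j][l] - degrees[j] * degrees[l] / two_m
    return q / two_m


def adjacency_from_edges(n, edges):
    a = [[0] * n for _ in range(n)]
    for s, t in edges:
        a[s][t] = a[t][s] = 1
    return a


# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
a = adjacency_from_edges(6, edges)
q_two_triangles = modularity(a, [[0, 1, 2], [3, 4, 5]])
q_single_cluster = modularity(a, [[0, 1, 2, 3, 4, 5]])
```

Placing all nodes in one cluster always gives Q = 0, so a positive value indicates more within-cluster edges than the degree-based null model expects.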

[11] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
[12] Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical review E, 70(6), 066111.
[13] Kehagias, A. (2012). Bad Communities with High Modularity. arXiv preprint arXiv:1209.2678.

[14] Daszykowski, M., Walczak, B., & Massart, D. L. (2001). Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems, 56(2), 83-92.

- 11 -

- 12 -

•

•

$$w_{ij} = \begin{cases} \exp\left( -\dfrac{d(x_i, x_j)^2}{d_i^k \, d_j^k} \right) & \text{if } x_i \in x_j^k \text{ and } x_j \in x_i^k \\ 0 & \text{otherwise} \end{cases}$$

•

• i j i j

•

[15] Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in neural information processing systems (pp. 1601-1608).
[16] Ertoz, L., Steinbach, M., & Kumar, V. (2002, April). A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining (pp. 105-115).

where $x_i^k$ is the k-nearest set of point $i$ and $d_i^k$ is the distance between point $i$ and the $k$-th neighbor of point $i$
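The edge weight on this slide combines a mutual k-nearest-neighbor test with a locally scaled Gaussian kernel. The sketch below is one possible reading of it, assuming Euclidean distance; the five 2-D points and `k = 2` are hypothetical.

```python
import math


def knn_graph_weights(points, k):
    """w_ij = exp(-d(x_i, x_j)^2 / (d_i^k * d_j^k)) for mutual k-nearest neighbors, else 0."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, scale = [], []
    for i in range(n):
        nearest = sorted((dist[i][j], j) for j in range(n) if j != i)[:k]
        neighbors.append({j for _, j in nearest})
        scale.append(nearest[-1][0])  # distance to the k-th neighbor
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and j in neighbors[i] and i in neighbors[j]:
                w[i][j] = math.exp(-dist[i][j] ** 2 / (scale[i] * scale[j]))
    return w


# Hypothetical data: a tight group of three points and a distant pair.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
w = knn_graph_weights(points, k=2)
```

Point 3 lists point 1 among its two nearest neighbors, but not the other way round, so no edge is created between the two groups.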

•

- 13 -

•

•

•

$$d_i = \sum_{j \in x_i^k} w_{ij} + \sum_{j,\,k \in x_i^k} w_{jk}$$

• ii

•

where $x_i^k$ is the k-nearest set of point $i$

•

• αα

•

•

- 14 -

- 20 -

$$\sum_{j \in C_1} w_{ij} = 0.9, \qquad \sum_{j \in C_2} w_{ij} = 0$$

- 21 -

•

$$\sum_{j \in C_1} w_{ij} = 0.2, \qquad \sum_{j \in C_2} w_{ij} = 0.3$$

- 22 -

•

•

•

- 23 -

•

•

$$\sum_{j \in C_1} w_{ij} = 2.1, \qquad \sum_{j \in C_2} w_{ij} = 0.5$$

- 24 -

•

$$\sum_{j \in C_1} w_{ij} = 0.7, \qquad \sum_{j \in C_2} w_{ij} = 1.4$$

•

- 25 -

•

•

•

•

- 26 -

•

•

•

•

[Figure: four scatter-plot panels (x-axis from -6 to 4, y-axis from -2 to 6); legend: Cluster 1, Cluster 2, Cluster 3]


- 27 -

[Figure: four scatter-plot panels (x-axis from -8 to 8, y-axis from -1 to 5); legend: Cluster 1 to Cluster 6]

[Figure: four scatter-plot panels (x-axis from -12 to 6, y-axis from -8 to 6); legend: Cluster 1, Cluster 2]


- 28 -

[Figure: four scatter-plot panels (x-axis from -1.5 to 1.5, y-axis from -1.5 to 3.5); legend: Cluster 1 to Cluster 7]

- 29 -

Thank you for Listening

Any Questions?

- 30 -

Scalable Variational Bayesian Matrix Factorization

KDMS 2013

Outline
1. Matrix Factorization for Collaborative Prediction
2. Regularized Matrix Factorization vs. Bayesian Matrix Factorization
3. Scalable Variational Bayesian Matrix Factorization
4. Related Works
5. Numerical Experiments
6. Conclusion

- 31 -

Matrix Factorization for Collaborative Prediction

• Collaborative prediction: filling in the missing entries of the user-item rating matrix

• Matrix factorization: predicting an unknown rating by the inner product of a user factor vector and an item factor vector

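Rating matrix (rows: users, columns: items; "?" marks a missing rating):

          Item 1  Item 2  Item 3  Item 4
User 1      6       9       3       ?
User 2      4       ?       2       0
User 3      0       0       2       3
User 4      0       ?       4       ?

User Factor Matrix (one row per user):

  3  0
  2  0
  0  1
  0  2

Item Factor Matrix (one column per item):

  2  3  1  0
  0  0  2  3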
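The toy numbers on this slide can be checked directly. The sketch below assumes the flattened figure reads as a 4x4 rating matrix, a 4x2 user factor matrix and a 2x4 item factor matrix (that layout is an assumption), and fills each missing rating with the inner product of the corresponding factor vectors.

```python
# Values as read from the slide; None marks a missing rating.
X = [
    [6, 9, 3, None],
    [4, None, 2, 0],
    [0, 0, 2, 3],
    [0, None, 4, None],
]
U = [[3, 0], [2, 0], [0, 1], [0, 2]]   # one row per user
V = [[2, 3, 1, 0], [0, 0, 2, 3]]       # one column per item


def predict(i, j):
    """Inner product of user i's factor vector and item j's factor vector."""
    return sum(U[i][k] * V[k][j] for k in range(len(V)))


# Keep observed ratings and fill the gaps with predictions.
filled = [
    [X[i][j] if X[i][j] is not None else predict(i, j) for j in range(4)]
    for i in range(4)
]
observed_match = all(
    X[i][j] == predict(i, j)
    for i in range(4) for j in range(4) if X[i][j] is not None
)
```

Under this reading every observed rating is reproduced exactly, which supports the assumed layout of the two factor matrices.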

��

Regularized Matrix Factorization
• Minimize the regularized squared error loss

Alternating Least Squares (ALS)
  Time complexity:  O(2|Ω|K^2 + (I+J)K^3)
  Parallelization:  Easy
  Tuning parameter: λ (regularization)
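A minimal ALS sketch follows, assuming the usual loss: the sum over observed (i, j) of (x_ij - u_i·v_j)^2 plus λ(‖U‖² + ‖V‖²). The ratings are the observed entries of the toy matrix from the earlier slide; `solve`, `update_side` and the hyperparameters are illustrative choices, not the speaker's implementation. Each row update solves a K×K system, which is where the K^3 term comes from.

```python
import random


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def solve(a, b):
    """Solve the small system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [a[r][:] + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[r][n] / m[r][r] for r in range(n)]


def objective(ratings, u, v, lam):
    err = sum((x - dot(u[i], v[j])) ** 2 for i, j, x in ratings)
    return err + lam * sum(dot(row, row) for row in u + v)


def update_side(ratings, fixed, n_rows, k, lam, row_pos, col_pos):
    """Re-solve every row factor by ridge regression with the other side held fixed.
    Scanning all ratings per row keeps the sketch short; real code would group them first."""
    new = []
    for r in range(n_rows):
        a = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for rating in ratings:
            if rating[row_pos] != r:
                continue
            f = fixed[rating[col_pos]]
            for p in range(k):
                b[p] += rating[2] * f[p]
                for q in range(k):
                    a[p][q] += f[p] * f[q]
        new.append(solve(a, b))
    return new


ratings = [(0, 0, 6), (0, 1, 9), (0, 2, 3), (1, 0, 4), (1, 2, 2), (1, 3, 0),
           (2, 0, 0), (2, 1, 0), (2, 2, 2), (2, 3, 3), (3, 0, 0), (3, 2, 4)]
rng = random.Random(0)
k, lam = 2, 0.1
u = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(4)]
v = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(4)]
before = objective(ratings, u, v, lam)
for _ in range(10):
    u = update_side(ratings, v, 4, k, lam, 0, 1)  # update user factors
    v = update_side(ratings, u, 4, k, lam, 1, 0)  # update item factors
after = objective(ratings, u, v, lam)
```

Because each half-step minimizes the objective exactly for one side, the objective never increases from sweep to sweep.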

- 32 -

Regularized Matrix Factorization
• Minimize the regularized squared error loss

Stochastic Gradient Descent (SGD)
  Time complexity:   O(2|Ω|K)
  Parallelization:   Possible, but not easy
  Tuning parameters: λ (regularization), learning rate
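The SGD variant can be sketched in a few lines, again assuming the usual regularized squared error loss. The learning rate `lr`, the value of `lam`, the epoch count and the toy ratings are hypothetical choices for illustration; each rating touches only 2K parameters, which matches the per-epoch cost quoted on the slide.

```python
import math
import random


def predict(u_row, v_row):
    return sum(a * b for a, b in zip(u_row, v_row))


def rmse(ratings, u, v):
    se = sum((x - predict(u[i], v[j])) ** 2 for i, j, x in ratings)
    return math.sqrt(se / len(ratings))


def sgd_epoch(ratings, u, v, lam, lr, rng):
    """One pass over the observed ratings in random order."""
    order = ratings[:]
    rng.shuffle(order)
    for i, j, x in order:
        err = x - predict(u[i], v[j])
        for k in range(len(u[i])):
            u_old = u[i][k]
            u[i][k] += lr * (err * v[j][k] - lam * u_old)
            v[j][k] += lr * (err * u_old - lam * v[j][k])


ratings = [(0, 0, 6), (0, 1, 9), (0, 2, 3), (1, 0, 4), (1, 2, 2), (1, 3, 0),
           (2, 0, 0), (2, 1, 0), (2, 2, 2), (2, 3, 3), (3, 0, 0), (3, 2, 4)]
rng = random.Random(0)
u = [[rng.uniform(0.1, 1.0) for _ in range(2)] for _ in range(4)]
v = [[rng.uniform(0.1, 1.0) for _ in range(2)] for _ in range(4)]
before = rmse(ratings, u, v)
for _ in range(200):
    sgd_epoch(ratings, u, v, lam=0.05, lr=0.01, rng=rng)
after = rmse(ratings, u, v)
```

The three knobs in the loop (`lam`, `lr`, number of epochs) are exactly the quantities whose tuning the next slides discuss.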

Problem of parameter tuning

• Regularization parameter too small: overfitting
• Regularization parameter too large: underfitting

- 33 -

Problem of parameter tuning

Regularization parameter chosen by cross-validation on various datasets and ranks K (Kim & Choi, IEEE SPL 2013)

• The optimal value of the regularization parameter differs depending on the dataset and the rank K.

Problem of parameter tuning
• SGD requires tuning of the regularization parameter, the learning rate, and even the number of epochs.

          0.005       0.007       0.010       0.015       0.020
0.005   0.9061/13   0.9079/15   0.9117/19   0.9168/28   0.9168/44
0.007   0.9056/10   0.9074/11   0.9112/13   0.9168/19   0.9169/31
0.010   0.9064/ 7   0.9077/ 8   0.9113/10   0.9174/13   0.9186/21
0.015   0.9099/ 5   0.9011/ 6   0.9152/ 6   0.9257/ 7   0.9390/ 7
0.020   0.9166/ 4   0.9175/ 4   0.9217/ 4   0.9314/ 4   0.9431/ 3

Netflix probe10 RMSE / optimal number of epochs of BRISMF for various learning rate and regularization parameter values (K = 40). (Takács et al., JMLR 2009)

- 34 -

Bayesian Matrix Factorization

Prior: P(U), P(V)
Likelihood: P(X | U, V)
Posterior: P(U, V | X)

MCMC on Netflix

• Approximate the posterior by MCMC (Salakhutdinov & Mnih, ICML 2008)
• Variational method (Lim & Teh, KDDcup 2007)

No parameter tuning; no overfitting; high accuracy
Huge computational cost: O(2|Ω|K^2 + (I+J)K^3)

Scalable Variational Bayesian Matrix Factorization
• No parameter tuning
• Linear space complexity: O(2(I+J)K)
• Linear time complexity: O(6|Ω|K)
• Easily parallelized on multi-core systems
• Optimizes an element-wise factorized variational distribution with the coordinate descent method.

- 35 -

Variational Bayesian Matrix Factorization
• The likelihood is given by

• Gaussian priors on the factor matrices U and V:

• Approximate the posterior with a variational distribution by maximizing the variational lower bound or, equivalently, minimizing the KL divergence

VBMF-BCD (Lim & Teh, KDDcup 2007)

• Matrix-wise factorized variational distribution

VBMF-BCD
  Space complexity: O((I+J)(K+K^2))
  Time complexity:  O(2|Ω|K^2 + (I+J)K^3)
  Parallelization:  Easy

- 36 -

Scalable VBMF: linear space complexity

Element-wise factorized variational distribution

K = 100                                     O((I+J)(K+K^2))   O(2(I+J)K)
Netflix (I = 480,189, J = 17,770)           4.4 GB            0.8 GB
Yahoo-music (I = 1,000,990, J = 624,961)    131 GB            2.6 GB

Scalable VBMF: quadratic time complexity

14

Updating rules for q(uki)

Updating all variational parameters

- 37 -

Scalable VBMF: linear time complexity

15

Let Rij denote the residual on ( i, j ) observation:

With Rij , updating rule can be rewritten as

When is changed to , can be easily updated to
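The idea behind the linear-time update can be illustrated with cached residuals. The sketch below assumes R_ij = x_ij - Σ_k u_ki v_kj and that changing a single u_ki to a new value shifts each affected residual by (new - old) · v_kj; the exact variational update rules are not reproduced here, and all numbers are hypothetical.

```python
def full_residual(x_ij, u, v, i, j):
    """Recompute R_ij from scratch: costs O(K) per observation."""
    return x_ij - sum(u[k][i] * v[k][j] for k in range(len(u)))


# Hypothetical factors, indexed as u[k][i] and v[k][j] to mirror u_ki and v_kj.
u = [[1.0, 2.0], [0.5, -1.0]]            # K = 2 rows, I = 2 users
v = [[1.0, 0.0, 2.0], [3.0, 1.0, -1.0]]  # K = 2 rows, J = 3 items
observed = {(0, 0): 4.0, (0, 2): 1.0, (1, 1): -2.0}

# Cache the residual of every observed entry once.
residuals = {(i, j): full_residual(x, u, v, i, j) for (i, j), x in observed.items()}

# Change a single element u_ki and patch only the residuals of user i: O(1) per observation.
k, i, new_value = 0, 0, 1.5
delta = new_value - u[k][i]
u[k][i] = new_value
for (row, j) in residuals:
    if row == i:
        residuals[(row, j)] -= delta * v[k][j]

recomputed = {(i_, j_): full_residual(x, u, v, i_, j_) for (i_, j_), x in observed.items()}
```

Since the patched residuals agree with a full recomputation, one element update costs time proportional to the number of ratings of that user rather than that number times K.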

Scalable VBMF: linear time complexity

16 - 38 -

Scalable VBMF: parallelization

17

K

I

• Each column of variational parameters can be updated independently from the updates of other columns.

• Parallelization can be easily done in a column-by-column manner.

• Easy implementation with the OpenMP library on multi-core system.

Related work (Pilászy et al., RecSys 2010)

• A similar idea is used to reduce the cubic time complexity of ALS to a linear one.

With a small extra effort, a more accurate model is obtainable without tuning the regularization parameter.

RMF

Scalable VBMF

- 39 -

Related Work (Raiko et al., ECML 2007)

• Considers an element-wise factorized variational distribution
• Updates U and V by a scaled gradient descent method
• Requires tuning of the learning rate
• Learning speed is slower than that of our algorithm

Numerical Experiments
• Compare VBMF-CD, VBMF-BCD (Lim & Teh, KDDcup 2007), and VBMF-GD (Raiko et al., ECML 2007)

• Experimental environment
  – Quad-core Intel® Core™ i7-3820 @ 3.6 GHz
  – 64 GB memory
  – Implemented in Matlab 2011a, with the main computational modules implemented in C++ as mex files
  – Parallelized with the OpenMP library

• Datasets

               MovieLens10M   Netflix       Yahoo-music
# of users     69,878         480,189       1,000,990
# of items     10,677         17,770        624,961
# of ratings   10,000,054     100,480,507   262,810,275

- 40 -

Numerical Experiments: K = 20

RMSE versus computation time on a quad-core system for each dataset: (a) MovieLens10M, (b) Netflix, (c) Yahoo-music

            MovieLens10M   Netflix   Yahoo-music
VBMF-CD     0.8589         0.9065    22.3425
VBMF-BCD    0.8671         0.9070    22.3671
VBMF-GD     0.8591         0.9167    22.5883

Numerical Experiments: Netflix, K = 50

            Time per iter.
VBMF-BCD    66 min.
VBMF-CD     77 sec.
VBMF-GD     29 sec.

          VBMF-BCD         VBMF-CD
RMSE      Iter.   Time     Iter.   Time
0.9005    19      21 h     63      74 m
0.9004    21      23 h     70      82 m
0.9003    22      24 h     84      98 m
0.9002    25      28 h     108     2 h
0.9001    27      31 h     680     13 h
0.9000    30      33 h

- 41 -

Conclusion
• We presented VBMF-CD, a scalable learning algorithm for VBMF.
• VBMF-CD optimizes element-wise factorized variational distributions with the coordinate descent method.
• The space and time complexity of VBMF-CD are linear.
• VBMF-CD can be easily parallelized.
• Experimental results confirmed the useful behavior of VBMF-CD, such as scalability, fast learning, and prediction accuracy.

- 42 -

A hybrid genetic algorithm for accelerating feature selection and parameter optimization of support vector machine

2013. 11. 29.

Introduction
• Support Vector Machine (SVM)
  – One of the most popular state-of-the-art classification algorithms.
  – Efficiently finds non-linear solutions by exploiting kernel functions.
  – Training time complexity is O(N^3).
• “Very important” issues in training an SVM
  – Feature selection
    • SVM is a distance-based algorithm (kernel matrix computation) and does not include any feature selection mechanism.
    • Irrelevant features degrade the model performance.
  – Parameter optimization
    • Model trade-off parameter C and kernel parameter σ (for the RBF kernel).
    • SVM is very sensitive to the parameter settings.
  – For SVM, feature selection and parameter optimization should be performed simultaneously.

- 43 -

Introduction
• Genetic algorithm (GA)
  – A stochastic algorithm that mimics natural evolution.
  – Easy, but very effective!
• GA-based feature selection and parameter selection of SVM [1-4]
  – GA effectively finds near-optimal feature subsets and parameters.
  – But it is slow (though MUCH better than a grid-search mechanism).

[Figure: GA cycle with the labels Population, Selection, Parents, Genetic operation (Crossover, Mutation), Offspring, Replacement]

Introduction
If the SVM has to be re-trained periodically, fast feature selection and parameter optimization are required.

This study aims to avoid producing bad offspring in the “Genetic Operation” step of GA.

This study proposes a chromosome filtering method, based on a Decision Tree (DT), for faster convergence of GA in feature selection and parameter optimization of SVM.

- 44 -

The proposed method
• Flowchart

[Figure: flowchart with the boxes Initialization, Population, Evaluate fitness, Termination condition? (yes/no), Optimized parameters and feature subset, Do genetic operations, Chromosome Filtering (yes/no), Population Replacement]

The proposed method
• Chromosome design (Genotype → Phenotype)
  – Parameters: binary representation
      C: bits 1 0 0 1 0 1 with bit weights 10^-2, 10^-1, 1, 10^1, 10^2, 10^3
         C = 1 × 10^-2 + 1 × 10^1
      σ: 2^-5, …, 2^5
  – Feature subset: binary representation
      bits:      1   0   0   1   0   …  1        0
      features:  f1  f2  f3  f4  f5  …  f_{p-1}  f_p
      selected subset: {f1, f4, …, f_{p-1}}
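Decoding a chromosome (genotype) into parameters and a feature subset (phenotype) might look like the sketch below. It assumes that each C bit switches on one power of ten from 10^-2 to 10^3; the bit string is chosen to reproduce the slide's C = 1 × 10^-2 + 1 × 10^1, the five feature names are hypothetical, and the σ gene is omitted because its encoding is not spelled out.

```python
# Assumed bit weights for the C gene (one power of ten per bit).
C_WEIGHTS = [1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3]


def decode_c(bits):
    """Sum the weights of the bits that are switched on."""
    return sum(bit * weight for bit, weight in zip(bits, C_WEIGHTS))


def decode_features(mask, names):
    """Keep the features whose mask bit is 1."""
    return [name for bit, name in zip(mask, names) if bit]


c_value = decode_c([1, 0, 0, 1, 0, 0])  # 10^-2 + 10^1
selected = decode_features([1, 0, 0, 1, 0], ["f1", "f2", "f3", "f4", "f5"])
```

The decoded values would then be passed to SVM training in the fitness evaluation step.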

- 45 -

The proposed method
• Fitness evaluation
  – Decode the chromosome to obtain C, σ, and a feature subset.
    • Genotype → Phenotype
  – Train an SVM on the dataset with the selected C, σ, and feature subset.
  – Fitness value: cross-validation accuracy

The proposed method
• Genetic operation
  – Parent selection
    • Roulette-wheel scheme: fitness proportional selection (FPS)
    • Probability that the i-th chromosome c_i in the population is selected = f(i) / Σ_j f(j)
      – where f(i) is the fitness of c_i
  – Crossover: N-point crossover
    • Choose N random crossover points and split the parents along those points.
  – Mutation: bit-flipping mutation
    • Each bit is flipped independently with a fixed probability.
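The three genetic operators named on this slide can be sketched as follows. This is a generic textbook formulation, not the authors' code; function names, the example parents and the seed are illustrative.

```python
import random


def roulette_select(fitness, rng):
    """Fitness proportional selection: index i is chosen with probability f(i) / sum(f)."""
    threshold = rng.random() * sum(fitness)
    cumulative = 0.0
    for i, f in enumerate(fitness):
        cumulative += f
        if threshold < cumulative:
            return i
    return len(fitness) - 1  # guard against rounding at the upper end


def n_point_crossover(p1, p2, n_points, rng):
    """Cut both parents at n random points and alternate the segments."""
    cuts = sorted(rng.sample(range(1, len(p1)), n_points))
    c1, c2, prev, swap = [], [], 0, False
    for cut in cuts + [len(p1)]:
        a, b = (p2, p1) if swap else (p1, p2)
        c1 += a[prev:cut]
        c2 += b[prev:cut]
        prev, swap = cut, not swap
    return c1, c2


def bit_flip(chromosome, pm, rng):
    """Flip each bit independently with probability pm."""
    return [1 - g if rng.random() < pm else g for g in chromosome]


rng = random.Random(42)
parent_a, parent_b = [0] * 8, [1] * 8
child_a, child_b = n_point_crossover(parent_a, parent_b, 2, rng)
mutant = bit_flip(child_a, 0.05, rng)
```

With all-zero and all-one parents, the two children are bitwise complements, which makes the alternating segments easy to inspect.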

8

- 46 -

The proposed method
• Chromosome Filtering
  – In each generation, chromosomes and their fitness values are stored in the knowledgebase. A DT is trained periodically on the knowledgebase. Using the DT, offspring chromosomes that are likely to have bad fitness are removed before the fitness evaluation step.
  – Assumption
    • Some features and parameter settings improve (or degrade) the model performance.
    • A DT can find these rules.

The proposed method
• Chromosome Filtering (continued)
  – Why DT?
    • Deals effectively with categorical features.
    • Finds non-linear relationships.
    • Uses a few relevant features in the classification procedure.
  – DT Training
    • Each c_i (i-th chromosome) in the knowledgebase is labeled by its fitness rank:
      – highest M fitness values → GOOD (likely to yield a good fitness value)
      – next highest M fitness values → NORMAL
      – remaining → BAD (likely to yield a bad fitness value)
    • Input feature: chromosome (in phenotype)
    • Output feature: label {GOOD, NORMAL, BAD}

Knowledgebase (sorted by fitness):
  c_1, c_2, c_3, …, c_M                   GOOD
  c_{M+1}, c_{M+2}, c_{M+3}, …, c_{2M}    NORMAL
  c_{2M+1}, c_{2M+2}, c_{2M+3}, …         BAD

- 47 -


– Filtering
  • A DT gives rules that assess a chromosome before fitness evaluation: is the chromosome GOOD, NORMAL, or BAD?
  • Each label has a different survival probability, e.g. GOOD: 1.0, NORMAL: 0.5, BAD: 0.2.
  • The DT is updated periodically, so the criteria for a good chromosome change over the generations.
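The labeling and filtering steps can be sketched without the decision tree itself. Below, `label_knowledgebase` assigns GOOD/NORMAL/BAD by fitness rank and `survives` applies the example survival probabilities from the slide; in the method described, the label of a new offspring would come from the trained DT, which is left out here. The knowledgebase entries are made up.

```python
import random

# Survival probabilities taken from the example on the slide.
SURVIVAL = {"GOOD": 1.0, "NORMAL": 0.5, "BAD": 0.2}


def label_knowledgebase(entries, m):
    """entries: list of (chromosome, fitness). Top m are GOOD, next m NORMAL, rest BAD."""
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    labeled = []
    for rank, (chromosome, _) in enumerate(ranked):
        label = "GOOD" if rank < m else "NORMAL" if rank < 2 * m else "BAD"
        labeled.append((chromosome, label))
    return labeled


def survives(label, rng):
    """Keep an offspring with the survival probability of its predicted label."""
    return rng.random() < SURVIVAL[label]


# Hypothetical knowledgebase of five chromosomes with CV-accuracy fitness.
knowledgebase = [([1, 0, 1], 0.91), ([0, 0, 1], 0.85), ([1, 1, 1], 0.95),
                 ([0, 1, 0], 0.70), ([1, 1, 0], 0.88)]
labels = [label for _, label in label_knowledgebase(knowledgebase, m=2)]
```

The labeled pairs are what a classifier such as CART would be trained on; offspring predicted as BAD are then evaluated only about one time in five.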

11


– DT example

[Figure: example decision tree with the internal nodes C>100, Contain F1?, σ>0.25, Contain F3?, σ>1 and leaves labeled GOOD, NORMAL, or BAD]

- 48 -

The proposed method
• Population Replacement: steady-state model
  → used to verify the effectiveness of the proposed method in the initial period of GA.
  – Only one chromosome in the population is updated per generation.
  – Replacement scheme [5, 6]: the offspring replaces one of its parents or the lowest-fitness chromosome in the population.
    • If the offspring is superior to both parents, it replaces the more similar parent.
    • If its fitness is between those of the two parents, it replaces the inferior parent.
    • Otherwise, the most inferior chromosome in the population is replaced.
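The three replacement rules translate into a short decision function. The sketch assumes that "similar" means the smaller Hamming distance between bit strings, which the slide does not specify; the population and fitness values are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def replacement_index(population, fitness, p1, p2, child, child_fitness):
    """Return the index of the chromosome that the offspring should replace."""
    f1, f2 = fitness[p1], fitness[p2]
    if child_fitness > max(f1, f2):
        # Better than both parents: replace the more similar parent.
        d1, d2 = hamming(child, population[p1]), hamming(child, population[p2])
        return p1 if d1 <= d2 else p2
    if child_fitness > min(f1, f2):
        # Between the parents: replace the inferior parent.
        return p1 if f1 < f2 else p2
    # Otherwise: replace the worst chromosome in the population.
    return min(range(len(population)), key=lambda i: fitness[i])


population = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
fitness = [0.6, 0.8, 0.5]
child = [1, 1, 1, 0]
better_than_both = replacement_index(population, fitness, 0, 1, child, 0.9)
in_between = replacement_index(population, fitness, 0, 1, child, 0.7)
worse_than_both = replacement_index(population, fitness, 0, 1, child, 0.55)
```

Only one slot changes per call, which matches the steady-state model in which a single chromosome is updated per generation.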

13

Experiments
• Experimental Design
  – 10 datasets from the UCI repository; all datasets were normalized to [-1, 1].
  – 5 independent runs; a set of random seeds was used for fairness.
  – In SVM training, 10-fold cross-validation was used.
  – Parameter settings
    • GA parameters
      – population size N_pop = 30
      – crossover probability p_c = 0.9
      – mutation probability p_m = 0.05
      – max iteration = 300
      – p_good = 1; p_normal = 0.5; p_bad = 0.2
    • DT parameters
      – CART
      – Labeling: good = 10, normal = 10, bad = remaining
      – Training starting point: 30th generation / period = 10

- 49 -

Experiments
• Results
  – Maximum fitness in the population

                                           50th generation   100th generation   200th generation
Dataset        #data  #feature  #class     GA       GA+DT    GA       GA+DT     GA       GA+DT
Iris           150    4         3          97.067   97.067   97.333   97.200    97.333   97.333
Wine           178    13        3          98.876   98.539   99.213   99.101    99.551   99.438
Sonar          208    60        2          87.596   87.596   88.462   90.769    91.731   92.788
Glass          214    9         6          71.682   71.963   72.336   73.364    73.645   74.112
Ionosphere     351    34        2          93.333   94.017   94.131   94.758    95.100   95.499
BreastCancer   683    9         2          97.160   97.160   97.247   97.277    97.306   97.365
Vehicle        846    18        4          81.773   81.939   82.648   83.948    85.201   85.721
Vowel          990    10        11         96.990   97.253   98.000   99.030    99.051   99.434
Yeast          1484   8         10         57.318   57.547   58.814   59.111    59.690   60.000
Segment        2310   19        7          96.736   97.004   97.212   97.411    97.740   97.818

Experiments
• Results
  – Maximum fitness in the population

[Figure: CV accuracy (%) versus number of generations (50 to 300) for GA+DT and GA, one panel per dataset: iris, wine, sonar, glass, ionosphere, breastcancer, vehicle, vowel, yeast, segment]

- 50 -

Concluding Remarks We presented a chromosome filtering method for GA-based feature selection and parameter optimization of SVM.

The proposed method employed a DT as a chromosome filter to remove the offspring chromosomes that are likely to have bad fitness before the fitness evaluation step of GA.

On most datasets, the proposed method showed faster improvement of fitness than standard GA.

17

Acknowledgements This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2011-0030814), and the Brain Korea 21 Program for Leading Universities & Students. This work was also supported by the Engineering Research Institute of SNU.

18 - 51 -

References
1. Frohlich, H., Chapelle, O., & Scholkopf, B. (2003, November). Feature selection for support vector machines by means of genetic algorithm. In Tools with Artificial Intelligence, 2003. Proceedings. 15th IEEE International Conference on (pp. 142-148). IEEE.

2. Huang, C. L., & Wang, C. J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.

3. Min, S. H., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.

4. Zhao, M., Fu, C., Ji, L., Tang, K., & Zhou, M. (2011). Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes. Expert Systems with Applications, 38(5), 5197-5204.

5. Bui, T. N., & Moon, B. R. (1996). Genetic algorithm and graph partitioning. Computers, IEEE Transactions on, 45(7), 841-855.

6. Oh, I. S., Lee, J. S., & Moon, B. R. (2004). Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11), 1424-1437.

- 52 -

- 53 -

- 72 -

1: Text Mining Applications

- 73 -

- 74 -

- 75 -

� Text MiningData Mining

�

�

Text Mining

- 106 -

Document Indexing by Ensemble Model

Yanshan Wang and In-Chan Choi

Korea University
System Optimization Lab

[email protected]

November 25, 2013

Yanshan Wang and In-Chan Choi (KU) Indexing by EnM November 25, 2013 1 / 18

Overview

1 The Basics
    Information Retrieval and Document Indexing
    Topic Modelling
    Indexing by Latent Dirichlet Allocation

2 Indexing by Ensemble Model
    Introduction to Ensemble Model
    Algorithms
    Experimental Results

3 Conclusions and Discussion

- 107 -

The problem in Information Retrieval

Source: www.betaversion.org/ ste-fano/linotype/news/26/

As more information (Big Data) becomes available, it is more difficult to access what users are looking for.

We need new tools to help us understand and search among vast amounts of information.


Document Indexing is Important

Users can get the desired information when documents (or items) are indexed (or ranked). The higher the position of a document, the more valuable it is to users.


Problems in Conventional Methods: Word Representation

The majority of rule-based and statistical Natural Language Processing (NLP) models regard words as atomic symbols.

In Vector Space Models (VSM), a word is represented by a single 1 and many zeros. For example,

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

The problem:

motel [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] AND hotel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0] = 0

The conceptual meaning of words is ignored.
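The motel/hotel example can be reproduced in a few lines. The positions of the 1s (index 14 and index 10 in a 20-dimensional vector) follow the vectors shown on the slide; the helper name `one_hot` is illustrative.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    return [1 if i == index else 0 for i in range(size)]


motel = one_hot(14, 20)
hotel = one_hot(10, 20)

# Element-wise AND, summed: counts the positions where both vectors have a 1.
overlap = sum(a & b for a, b in zip(motel, hotel))
```

Any two distinct words give an overlap of zero, so this representation cannot express that "motel" and "hotel" are related.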


Topic Modeling

Latent Dirichlet Allocation (LDA) [Blei et al. (2003)].

Uncover the hidden topics that generate the collection.
Words and documents can be represented according to those topics.
Use the representation to organize, index and search the text.

$$\text{apple} = \begin{bmatrix} 0.325 & 0.792 & 0.214 & 0.107 & 0.109 & 0.612 & 0.314 & 0.245 \end{bmatrix}^{\top}$$


LDA [Blei et al. (2003)]

� � � �

�

��

1 Choose the number of words N ∼ Poisson(ξ).

2 Choose θ ∼ Dirichlet(α).

3 For n = 1, 2, ..., N:
    Choose a topic z_n ∼ Multinomial(θ);
    Choose a word w_n ∼ Multinomial(w_n | z_n, β), a multinomial distribution conditioned on the topic z_n.

Joint distribution:

$$p(\theta, \mathbf{z}, \mathbf{d} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
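The generative process can be simulated with the standard library alone. In the sketch below, `beta[z]` is the word distribution of topic z; the Poisson sampler uses Knuth's multiplication method and the Dirichlet draw normalizes Gamma samples. All parameter values are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit, count, product = math.exp(-lam), 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def sample_dirichlet(alpha, rng):
    """Normalize independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(xi, alpha, beta, rng):
    n_words = sample_poisson(xi, rng)        # step 1
    theta = sample_dirichlet(alpha, rng)     # step 2
    topics, words = [], []
    for _ in range(n_words):                 # step 3
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical model: 2 topics over a 3-word vocabulary.
beta = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
theta, topics, words = generate_document(8, [0.5, 0.5], beta, random.Random(1))
```

Each call produces one document as a list of word indices together with the hidden topic of every word, which is the structure that inference later tries to recover.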


Indexing by LDA (LDI) [Choi and Lee (2010)]

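With adequate assumptions, the probability of a word $w_j$ embodying the concept $z_k$ is

$$W_j^k = p(z_k = 1 \mid w_j = 1) = \frac{\beta_{jk}}{\sum_{h=1}^{K} \beta_{jh}}$$

The document (or query) probability can be defined within the topic space:

$$D_i^k \; (Q_i^k) = \frac{\sum_{j=1}^{V} W_j^k \, n_{ij}}{N_{d_i}},$$

where $n_{ij}$ denotes the number of occurrences of word $w_j$ in document $d_i$ and $N_{d_i}$ denotes the number of words in the document $d_i$, i.e. $N_{d_i} = \sum_{j=1}^{V} n_{ij}$.

Similarity between document and query:

$$\rho(\vec{D}, \vec{Q}) = \vec{D} \cdot \vec{Q}, \qquad \text{where } \vec{D} \cdot \vec{Q} = \left\langle \frac{D}{\|D\|}, \frac{Q}{\|Q\|} \right\rangle.$$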
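A small numeric walk-through of the three LDI formulas follows. Here `beta[j][k]` plays the role of β_jk (word j, topic k); the matrix, the word counts and the query are hypothetical.

```python
import math


def word_topic_probs(beta):
    """W[j][k] = beta[j][k] / sum_h beta[j][h]."""
    return [[b / sum(row) for b in row] for row in beta]


def topic_representation(counts, w):
    """D^k = sum_j W[j][k] * n_j / N, with N the total word count."""
    n_words = sum(counts)
    n_topics = len(w[0])
    return [
        sum(w[j][k] * counts[j] for j in range(len(counts))) / n_words
        for k in range(n_topics)
    ]


def similarity(d, q):
    """Inner product of the normalized topic vectors (cosine similarity)."""
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(a * a for a in q))
    return sum(a * b for a, b in zip(d, q)) / (norm_d * norm_q)


# Hypothetical 3-word vocabulary and 2 topics.
beta = [[0.8, 0.1], [0.1, 0.1], [0.1, 0.8]]
w = word_topic_probs(beta)
doc_a = topic_representation([2, 0, 0], w)   # only word 0
doc_b = topic_representation([0, 0, 3], w)   # only word 2
query = topic_representation([1, 0, 0], w)
sim_a, sim_b = similarity(doc_a, query), similarity(doc_b, query)
```

Ranking documents by this similarity puts `doc_a` ahead of `doc_b` for the query, since both the query and `doc_a` load on the first topic.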


Indexing by Ensemble Model (EnM)[Wang et al. (2013)]

Motivation: There exist optimal weights over the constituent models.

Table: A toy example. The values in the table represent similarities of documents with respect to a given query. The scores of Ensemble 1 and 2 are defined by 0.5*Model 1 + 0.5*Model 2 and 0.7*Model 1 + 0.3*Model 2, respectively. The relevant document list is assumed to be {2,3}.

             Model 1   Model 2   Ensemble 1   Ensemble 2
Document 1   0.35      0.2       0.55         0.305
Document 2   0.4       0.1       0.5          0.31
Document 3   0.25      0.7       0.95         0.385
(M)AP        0.72      0.72      0.72         0.89


AP and MAP

Average Precision (AP) and Mean Average Precision (MAP)

Notation

$|Q|$: the number of queries in the query set;
$|D_i|$: the number of documents in the relevant document set w.r.t. the $i$th query;
$d_{ij} \in D_i$: the $j$th document in $D_i$;
$\phi_{ki}$: the relevance score returned by the $k$th model w.r.t. the $i$th query;
$R(d_{ij}, \phi_{ki})$: the indexing position of the $j$th document for the $i$th query returned by the $k$th model;
$H = \sum_k \alpha_k \phi_k$: the ensemble model, a linear combination of the constituent models, where $\alpha_k \ge 0$.

Definition

$$E(H, Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AP(H, D_i), \qquad AP(H, D_i) = \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, H)}.$$
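The AP and MAP definitions can be computed as below. The sketch assumes the relevant documents are ordered by their position in the ranking, so that j counts relevant documents retrieved so far; the scores and relevance sets are hypothetical and unrelated to the toy table.

```python
def average_precision(scores, relevant):
    """AP = (1/|D|) * sum_j j / R_j, with R_j the rank of the j-th relevant document."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    positions = sorted(ranking.index(doc) + 1 for doc in relevant)
    return sum(j / r for j, r in enumerate(positions, start=1)) / len(positions)


def mean_average_precision(runs):
    """runs: list of (scores, relevant) pairs, one per query."""
    return sum(average_precision(s, rel) for s, rel in runs) / len(runs)


# Hypothetical scores for two queries.
query_1 = ({"d1": 0.9, "d2": 0.8, "d3": 0.3, "d4": 0.1}, {"d1", "d3"})
query_2 = ({"d1": 0.2, "d2": 0.7, "d3": 0.5}, {"d2"})
ap_1 = average_precision(*query_1)   # relevant documents at ranks 1 and 3
ap_2 = average_precision(*query_2)   # relevant document at rank 1
map_score = mean_average_precision([query_1, query_2])
```

Because AP depends on the scores only through the ranking, small changes in ensemble weights either leave it unchanged or make it jump.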


Formulation

Formulation of the Optimization Problem

Since $0 \le AP \le 1$, we can define the empirical loss as follows:

$$\min \sum_{i=1}^{|Q|} \bigl(1 - AP(H, D_i)\bigr), \quad \text{or} \quad \min \sum_{i=1}^{|Q|} \Bigl(1 - \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, H)}\Bigr).$$

Our goal is to uncover the optimal weights $\alpha$ that minimize the empirical loss.

Difficulty

The position function $R(d_{ij}, H)$ is nonconvex, nondifferentiable and discontinuous w.r.t. the $\alpha$'s.


Boosting Scheme

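1 Select model:

$$\phi_j = \arg\max_j \sum_{i=1}^{|Q|} D_i \, AP(\phi_{ji});$$

2 Update the weight:

$$\alpha_j^t = \alpha_j^t + \delta_j^t, \qquad \text{where } \delta_j = \frac{1}{2} \log \frac{\sum_{i=1}^{|Q|} D_i \bigl(1 + AP(\phi_{ji})\bigr)}{\sum_{i=1}^{|Q|} D_i \bigl(1 - AP(\phi_{ji})\bigr)};$$

3 Update distribution on queries:

$$D_i = \frac{\exp(-AP(H_i))}{Z},$$

where $Z$ is a normalizer.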
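One round of the boosting scheme can be sketched as follows, with per-query AP values supplied directly. The AP numbers are hypothetical; in the actual method they would come from ranking documents with each constituent model and with the current ensemble.

```python
import math


def select_model(dist, ap_by_model):
    """Step 1: pick the model with the largest distribution-weighted AP."""
    return max(
        range(len(ap_by_model)),
        key=lambda j: sum(d * ap for d, ap in zip(dist, ap_by_model[j])),
    )


def weight_increment(dist, ap_model):
    """Step 2: delta = 0.5 * log(sum D_i (1 + AP_i) / sum D_i (1 - AP_i))."""
    num = sum(d * (1 + ap) for d, ap in zip(dist, ap_model))
    den = sum(d * (1 - ap) for d, ap in zip(dist, ap_model))
    return 0.5 * math.log(num / den)


def update_distribution(ap_ensemble):
    """Step 3: D_i proportional to exp(-AP(H_i)), normalized to sum to 1."""
    raw = [math.exp(-ap) for ap in ap_ensemble]
    z = sum(raw)
    return [r / z for r in raw]


# Hypothetical per-query AP of two models on two queries.
dist = [0.5, 0.5]
ap_by_model = [[0.5, 0.5], [0.2, 0.4]]
best = select_model(dist, ap_by_model)
delta = weight_increment(dist, ap_by_model[best])
new_dist = update_distribution([0.9, 0.3])  # assumed AP of the ensemble per query
```

Queries on which the ensemble still ranks poorly receive more weight in the next round, as in AdaBoost.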


Coordinate Descent

Since the objective is nonconvex, not every coordinate will reduce the loss.

1 Select model:

$$\phi_j = \arg\max_j E(Q, \phi_j);$$

2 Update the weight:

$$\alpha_j = \frac{1}{2} \log \frac{1 + AP(\phi_{ji})}{1 - AP(\phi_{ji})};$$

3 If $E^t \le E^{t-1}$, delete this coordinate.


Parallel Coordinate Descent

The coordinate descent algorithm can be parallelized across cores.

parfor p = 1, 2, ..., K_φ do
    Update the weights using α_p = (1/2) log[(1 + AP(φ_pi)) / (1 − AP(φ_pi))];
end parfor
return Ensemble model H.
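Because each weight depends only on its own model's AP, the updates are independent and can be dispatched to a worker pool. The sketch below uses one aggregate AP value per model, which is an assumption, and a thread pool from the standard library; in CPython this illustrates the independence of the updates rather than a real speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def coordinate_weight(ap):
    """alpha_p = 0.5 * log((1 + AP) / (1 - AP))."""
    return 0.5 * math.log((1 + ap) / (1 - ap))


def parallel_weights(ap_per_model, workers=4):
    """Compute all model weights independently; the result order follows the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(coordinate_weight, ap_per_model))


# Hypothetical AP values for three constituent models.
weights = parallel_weights([0.5, 0.0, 0.25])
```

A model with AP = 0 receives weight 0, and the weight grows without bound as AP approaches 1.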


Experimental Results on EnM

Data: MED corpus¹. 1033 documents from the National Library of Medicine. 30 queries.

Results.

Table: MAP of various methods for MED corpus.

Method    MAP      improvement (%)
TFIDF     0.4605
LSI       0.5026   9.1
pLSI      0.5334   15.8
LDI       0.5738   24.6
EnM.B     0.6420   39.4
EnM.CD    0.6461   40.3
EnM.PCD   0.6414   39.3

Figure: Precision-Recall Curves for various methods (x-axis: Recall, y-axis: Precision; curves: TFIDF, LSA, pLSI, LDI, EnM).


1: ftp://ftp.cs.cornell.edu/pub/smart.

Conclusions and Discussion

Conclusion

An ensemble model (EnM) is proposed, and three algorithms are introduced for solving the optimization problem.
The EnM outperformed all basis models across all recall regimes.

Discussion

The algorithms cannot guarantee convergence to the global optimum due to the nonconvexity of the objective.
The parallel coordinate descent algorithm cannot guarantee an optimum, even a local one, due to the coupling between variables.

Future Works

Approximate the objective with convex functions.
Use stochastic gradient descent for stochastic sequences and large-scale data sets.


References

Yanshan Wang and In-Chan Choi (2013)

Indexing by ensemble model

Working Paper. arXiv preprint arXiv:1309.3421.

David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003)

Latent Dirichlet allocation

The Journal of Machine Learning Research, 3, 993-1022.

In-Chan Choi and Jae-Sung Lee (2010)

Document indexing by latent Dirichlet allocation

DMIN, 409-414.

Y. Freund and R. E. Schapire (1995)

A decision-theoretic generalization of on-line learning and an application to boosting

Computational Learning Theory, Springer, 23-37.

My Homepage: http://optlab.korea.ac.kr/~sam/


The End


- 116 -

, , , ,

Page 2

� Mobile device computing • Context-aware

- 117 -

,

( )

Communication, web

history

App

As is

Page 4

�• Context-aware

•

� : /

�•

�• –

• : • :

�••

- 118 -

?

?

10:00 39

12:00 3

13:00

15:00

16:00

18:00

18:30 39 ?

? ? ?

Page 6

:

�• ( , )

�• A ,

5511 .

- 119 -

�• ( , , )

�• B 1

. , eTL ..

Page 8

�• 2 4• ,

•• /

••

• GPS: • : • : • : /

•• / /

•• ,

• , • -

• : • On/Off:

- 120 -

1.

5.

/

/

6-1. ( /)

//

SNS( , )

2. /

6-2. ( )

TV

( , )

/

3. /

// /

7. /

/

4. / 8.

1.2.3.4.5.6.

Page 10

� 2• 1 : 10 / 2012 9 ~10• 2 : 50 / 2012 11 ~12

�• (OS 2.3 )• : 2 4 , 10~30• : 1

�• 5Mbytes•

- 121 -

� 47• 20

• , , , , • cross validation 40~50%

• 1/3 /

•• / / �

- 9562 - / 332 - / 97

/ - / 3350 ( )- / 331 ( )- 93

- 2908 - / 248 / - 91

- 2388 ( / )-SNS( ) 229 / - 88

/ - 2012 ( )- 227 ( / )- 74

- 906 / - ( , ) 210 / - 71

- 766 - 160 ( )- 71

- / 744 / - / 146 ( )- 66

- 717 - / 145 - / 62

( / )- 692 / - 142 / - / 57

- 625 - 121 ( / )- 50

( / )- 610 / - 118 - / 40

( / )- 525 ( )- / 113 - 32

( )-TV 422 - 106 - 25

/ - 408 - 104 ( )- 17

/ - 341 - 98

Page 12

–

Figure: Location (x-axis: longitude, 126.4 to 127.4; y-axis: latitude, 36.7 to 37.6).

- 122 -

� : �

1.• API • �

2.•

3.••

GPS trajectory

Sensors

Misc. context

Date/Time

Page 14

–

� ( , ) �• 1)

•• (merging)• 50 518

- 123 -

� : ,

�•

•• ) DBSCAN trajectory clustering (CB-SMoT)

••

• CHAMELEON

••

• Multivariate Gaussian, Gaussian mixture, kernel density estimation, …

•• ,

Page 16

– HMM & Adaboost�

• Hidden Markov model• / /

••

• >> API ( 15~20% )

� �• AdaBoost

• AdaBoost•

• precision• : 80%, : 74.6%, : 64.7%

- 124 -

Multiple instance learning� Multiple instance learning*

• Supervised learning•

• , ,

•

� MIL • MIL

• ?• ,

• : 10~20

Page 18

– Multiple instance learning� mi-SVM*

• Multiple instance learning SVM•

�•

• : , • Chunk

• , , • ,

••

� RBF linear kernel• kernel • RBF, linear kernel : 63:37

- 125 -

Multiple instance learning� mi-SVM cross validation

�••

•

�• mi-SVM

• Viterbi • 52.2%

• 43.9% / 60.8%

••

Sensitivity(Recall)

/ //

SVM 76.0% 5.3% 77.5% 43.8% 47.9% 36.0% 16.4% 70.0% 20.1%

77.0% 22.1% 80.0% 48.6% 67.3% 51.2% 50.5% 90.0% 72.1%

56.9% 10.0% 60.0% 0.0% 35.7% 10.0% 37.8% 90.0% 45.5%

87.6% 54.9% 100.0% 89.3% 93.6% 82.4% 73.3% 90.0% 90.3%

(Precision) 82.6% 34.0% 86.7% 73.9% 68.9% 50.3% 45.2% 91.7% 34.9%

(F-measure) 79.7% 26.8% 83.2% 58.6% 68.1% 50.8% 47.7% 90.8% 47.0%
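The F-measure row can be reproduced as the harmonic mean of precision and recall. The sketch below assumes that the recall values are those of the second row of the sensitivity table (77.0%, 22.1%, ...), whose row label is lost in this copy; the list and function names are illustrative.

```python
# Values copied from the table, in percent. Which method the recall row
# belongs to is an assumption: its label did not survive extraction.
precision = [82.6, 34.0, 86.7, 73.9, 68.9, 50.3, 45.2, 91.7, 34.9]
recall = [77.0, 22.1, 80.0, 48.6, 67.3, 51.2, 50.5, 90.0, 72.1]
reported_f = [79.7, 26.8, 83.2, 58.6, 68.1, 50.8, 47.7, 90.8, 47.0]


def f_measure(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 0.0 if p + r == 0 else 2.0 * p * r / (p + r)


computed_f = [f_measure(p, r) for p, r in zip(precision, recall)]

for got, want in zip(computed_f, reported_f):
    print(f"computed {got:.2f}  reported {want:.1f}")
```

With these inputs every computed value lies within 0.1 percentage points of the reported one; the small gaps are consistent with the inputs having been rounded to one decimal place.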

Page 20

�•

• (activity) (behavior) •

•

•• coverage • 1.7%

•••

(2 4 )

•

- 126 -

�• (Transfer learning)

• (reusable) •

•(cold start )

• /• Noise/Error• ,

•••

••• ( )• , •

Page 22

THANK YOU!

- 127 -

- 128 -

2: Feature Selection & Efficient Computing

- 129 -

- 130 -

1

Fused Lasso

2

Contents

Motivation••

Introduction••

Algorithm••

Results•••

Conclusion

- 131 -

3

Motivation��

4- 132 -

5

6

Reference : http://ghanahealthnest.com/2013/04/23/ghana-revenue-authority-rolls-out-a-new-malaria-control-strategy/

•

•

•

- 133 -

7

Introduction��

8

Fused Lasso

- 134 -

9

•

•

•

•

10

VMales1

MMales7

VFmales1

� � ��

•••

VMales14

MFmales7

VFmales14

- 135 -

11

VMales1

MMales7

VFmales1

VMales14

MFmales7

VFmales14

�

VMales1Y/N

MMales7Y/N

VFmales1Y/N

VMales14Y/N

MFmales7Y/N

VFmales14Y/N

�

12

� �

VMales1Y/N

MMales7Y/N

VFmales1Y/N

VMales14Y/N

MFmales7Y/N

VFmales14Y/N

�

- 136 -

13

Suarez, Estrella, et al. "Matrix-assisted laser desorption/ionization-mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes." Analytica chimica acta 706.1 (2011): 157-163.

•

•

•

•

14


- 137 -

15



16

•

•

•

•

Li, Lihua, et al. "Data mining techniques for cancer detection using serum proteomic profiling." Artificial intelligence in medicine 32.2 (2004): 71-83.

- 138 -

17

�

�

3. Better Performance

�

�

1. Sparsity

�

�

2. Smoothness

18

Algorithm��

- 139 -

19

(1)

Tibshirani, Robert, et al. "Sparsity and smoothness via the fused lasso." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.
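As a rough illustration of the two penalties in the fused lasso of Tibshirani et al. (2005), the sketch below evaluates a least-squares objective with an L1 penalty on the coefficients (sparsity) and an L1 penalty on the differences between adjacent coefficients (smoothness). It uses the Lagrangian form with hypothetical weights `lam1` and `lam2` and a 1/2 factor on the squared error; the exact loss and solver used in the talk are not stated here, so this is only a sketch under those assumptions.

```python
def fused_lasso_objective(X, y, beta, lam1, lam2):
    """Least-squares loss plus the two fused lasso penalties.

    X is a list of rows, y a list of targets, beta the coefficient list.
    lam1 weights sum |beta_j|; lam2 weights sum |beta_j - beta_{j-1}|.
    """
    residuals = [
        yi - sum(xij * bj for xij, bj in zip(row, beta))
        for row, yi in zip(X, y)
    ]
    loss = 0.5 * sum(r * r for r in residuals)
    sparsity = sum(abs(b) for b in beta)
    smoothness = sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    return loss + lam1 * sparsity + lam2 * smoothness


# Toy example: identity design, so beta = y gives zero loss.
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
y = [0.0, 1.0, 1.0, 3.0]
beta = [0.0, 1.0, 1.0, 3.0]
value = fused_lasso_objective(X, y, beta, lam1=1.0, lam2=2.0)
print(value)
```

The second penalty is what makes the ordering of features matter: it is only meaningful when neighbouring features, such as adjacent m/z values in a mass spectrum, are expected to have similar coefficients.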

��

20

(1)

m/z     Fused Lasso   Lasso     m/z     Fused Lasso   Lasso     m/z     Fused Lasso   Lasso
654.9   0.002         0.000     656.3   1.647         0.000     657.7   1.403         0.000
655     0.002         0.000     656.4   1.647         0.000     657.8   1.403         0.000
655.1   1.123         0.662     656.5   1.647         0.000     657.9   1.403         0.000
655.2   4.159         3.169     656.6   1.647         0.000     658     1.403         0.010
655.3   1.791         0.002     656.7   1.647         0.000     658.1   1.403         -0.015
655.4   1.791         0.000     656.8   1.647         0.000     658.2   1.403         0.225
655.5   1.791         0.000     656.9   1.647         0.000     658.3   0.252         0.000
655.6   1.791         1.300     657     1.650         0.000     658.4   0.252         0.000
655.7   1.791         0.000     657.1   1.650         0.344     658.5   0.252         0.000
655.8   1.791         0.000     657.2   1.639         2.609     658.6   0.252         0.000
655.9   1.791         0.143     657.3   1.403         0.010     658.7   0.252         0.000
656     1.791         0.001     657.4   1.403         0.000     658.8   0.252         0.000
656.1   1.791         1.495     657.5   1.403         0.000     658.9   0.252         0.000
656.2   1.647         0.149     657.6   1.403         1.613     659     0.252         0.000
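The contrast between the two coefficient columns can be summarized with two simple counts. The sketch below uses the first block of the table above (m/z 654.9 to 656.2) and counts exact zeros and changes between adjacent coefficients; the helper names are illustrative, and treating the printed three-decimal values as exact is an assumption.

```python
# Coefficients for m/z 654.9 ... 656.2, copied from the table.
fused = [0.002, 0.002, 1.123, 4.159, 1.791, 1.791, 1.791,
         1.791, 1.791, 1.791, 1.791, 1.791, 1.791, 1.647]
lasso = [0.000, 0.000, 0.662, 3.169, 0.002, 0.000, 0.000,
         1.300, 0.000, 0.000, 0.143, 0.001, 1.495, 0.149]


def count_zeros(coefficients):
    """Number of coefficients that are exactly zero."""
    return sum(1 for c in coefficients if c == 0.0)


def count_jumps(coefficients):
    """Number of adjacent pairs whose values differ."""
    return sum(1 for a, b in zip(coefficients, coefficients[1:]) if a != b)


print("fused lasso:", count_zeros(fused), "zeros,", count_jumps(fused), "jumps")
print("lasso:      ", count_zeros(lasso), "zeros,", count_jumps(lasso), "jumps")
```

On this block the lasso column has more exact zeros but changes value far more often, while the fused lasso column stays on a few plateaus, which matches the sparsity and smoothness properties named in the talk.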

- 140 -

21

(1)

��

Tibshirani, Robert, et al. "Sparsity and smoothness via the fused lasso." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.

22

Liu, Jun, Lei Yuan, and Jieping Ye. "An efficient algorithm for a class of fused lasso problems." Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.

- 141 -

23

(2)

Liu, Jun, Lei Yuan, and Jieping Ye. "An efficient algorithm for a class of fused lasso problems." Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.

24

Results��

- 142 -

25

Performance Comparison

Average misclass. rate

Average selected features

26

Figure: 2nd principal component (y-axis) against 1st principal component (x-axis); legend: Others, MFemale7.

Figure, panels A and B: "Fused lasso coefficient abs" (x-axis: m/z, y-axis: Coefficient) and "Fused lasso selected features intensity value" (x-axis: m/z, y-axis: Intensity).

- 143 -

27

Figure: Coefficient (y-axis) against m/z (x-axis); legend: Fused lasso, Lasso.

28

•

•

•

- 144 -

29

30

Conclusion

- 145 -

31

�

3. More sophisticated

�

1. More accurate

�

2. More appropriate

32

Thank You

Where New Challenges Begin!

- 146 -

��

��

!"�#$�%�&'()*+,

!��-��.��-��/!.�#.0�1��2�

3456�78�9:�;<�=>?+@AB

��

� 9��

� ��

� ��C��-�.�D��E�

� "��E��

� ��

� $FG��E��H��

� ��

- 147 -

9��

��#��

.GG��!��I

"��J��E��

# ��J��E�� D�

#��J��E�� D��F�/KL0

��J��E�� D��F��/(L0�

��J��E�� D��F��

�9��E��-�� GG��I�

- 148 -

� ��D��EG��M�

� $FG��D� �� E��-��E��F�EG��N��-��E��

��C�E��-�

!��!��E��

C�E��-

O2�2

9��E��-��D��E��

��E�GG��D�� E��G��

C�E��H��

P .��D��E��-�

��EG��G��G��E��

Q ��D��M��D

�2D2� ��EG��R�� -��

S H��D��EG��F�-��E��E��E�E�-�

T $��D�E��D��M��

!��

# �U��.GG��

� ��

��

�� D

� ��F��

��

�� #��E��D

!�.��1C.�/��V�0

- 149 -

��

��-��W� ��

��-��W� ��

W��J�H��E��

.��J�C�G��-��

� !��D��G��E�� H�G��X��G��-��E��

- 150 -

��

52 U��G�� G��

� U��EG��

�� V��G��V�G��/�G��0�

�2�2��

32 !�� G��

� U��EG��

�� E��E��

��G��2�2�� ! "#$%� ��! �&� � � �

!��

��

�G��

��D��E��FG��D��

��E��

��C��-�.�D��E�

- 151 -

� ��/��0

� C��-

� #�E��EG��F�-

��C��-�.�D��E�

W�E� !��2 Y�� EE��

9.�� 3446

� ��-� ��EG��EG��E�� #�E�� O��-�G��-

�� 3446

� W�� #��D�G�D-��E�� E��EG��9.�� EG��9.��

%9#"WZ�� 3446 � .��G�D-��E��

9.��

��

� �� '� �� ( � �)* + ,-./��0 1 2 345� $%� ��! �&� �� 6 (&�� 7 3(5� ��

� �� !�� " � � �� # �� 6 ( �� 38 ��$ �� 3�5�%&&&��

��9��G��

U� ��GG��

� ��D��J�Z��9� ��

� ��J�Z��: ��

��D��G��

- 152 -

� .�E��N��E��9.��

� [��D�G�D-��E��52 9��-��D�;<�=�32 9��-��D�>;�=� ��9

��R�%9#"WZ��

C��N��GG��

!��

��

�G��

�EG��G��E�� E��

"��.�� N��E��/��-��344\0

- 153 -

��Z��/��0

� �� H��?� H��-��E�D��@

��

�A� � ,-.BC)DEF

C G C C H I )DFF>�� J �� KLM� �� N � ��KL��

.��G��G��-

��E�E�H��-��F�E�E�H��/EH�H0�

� �� H��?� H��-��E�D��@

EH�H

� "O � ,-.FP�QR"�NS! T� H

I� U " NS! NV

FW�B

� "X � ,-.FP�QR" NS! T �Y I

� U " NS! NVFW�B

"�Z! [�� " �� Z��[" Z! [ � \]�^ \_�/� Z! [ `ab c _!]

c _ c ]

.��H�H�G��G��-��

- 154 -

��

��

� W��-��/W�0� ��EG��-��V��

� TV Nd! Ne!�Nc f � TV � Nd! Ne!�Nc TV � ��TV�g��Nh&TV�c

hid

� ��GG��O��/�O�0

� ��D� �G��-G��G�� E�F�E�M��E��D��

,jkl!m!n �Io p e G T

o q r4 rst ut ��[S p q v�ZS� G w � I H rS

� ]Z��W��D��/]WW0

� W�ZG��E��E�� #��E��EE��D��

- 155 -

$FG��E��H��D��C��

C��C��G��

� .��D-J��-��

� !�E��^D��J� ��G�E��D�D��N��

� �G��^D��J ��G��X��D�D��N��

� 1��D��J� ��D��

_�.��D-��!�E��^D��G��^D�� E�[�9��G��-�_�1��D��E��-�`��G��-�

� ��D��

aZ��b�a��G��

C�� c�� c�� c��

.��D- de 33d 3f

!�E��^D�� a\ 54d 3

�G��^D�� d4 344 3

1��D�� 5f6 544 3

- 156 -

H��.��D-

C�� c��/.��g0

W� �O� ]WW

.��D-

33d

.�� de d62d4�/\\23f0 \h2f4�/ee2530 ae2e5�/ee2af0�

9.��4245 h hf23a�/hh2de0 ed25h�/5442440 ef24h�/5442440

424a 56 h62\f�/h\2e60 ea2\d�/5442440 he234�/5442440

��i�

%9#"WZ��

42454 Wi. Wi. Wi.

424a

�� 5a df264�/\52560 df24f�/\32eh0 d626f�/\f2h50

EH�H�j �9C

34 \52h3�/ha23d0 \f2eh�/e\2a50 da2hh�/eh2dd0

54 dh24h�/\e26d0 df26d�/hd2f50 ah2f3�/e42640

a dh2ef�/\d2350 de243�/\h26a0 dh2de�/\62660

EH�H�j �9k

34 \52hh�/h523e0 \a2dh�/e62640 da24d�/ed2fe0

54 d62a\�/de2h60 d5234�/\42ea0 ah2a5�/d\25a0

a d524d�/dd2540 aa2e\�/d32aa0 a\2a\�/d526e0

� 9.��l mnopqrstuvwxyz{t|}~��{��

� ��pl��(��7 ��t��:�;�pl��t|L 9.��

��7op��q��p��l��m� ¡¢

H��!�E��

C�� c��/.��g0

W� �O� ]WW

!�E��

54d

.�� a\ e325a�/eh2ea0 \3266�/5442440 \\2e5�/h326e0

9.��4245 f e32ha�/ef2640 de2ha�/\52d60 ha2de�/e\2da0

424a 6 e523a�/ef2ff0 \52fa�/\52630 hd243�/ed2\d0

��4245 3f e42\6�/e\2d\0 \d2da�/ed2af0 \62df�/h32\40

424a a6 h\2a5�/eh2d30 \a24a�/5442440 h42\6�/h42e30

%9#"WZ��4245 3f e5265�/e\2e40 \62e5�/ed2fd0 \f2ad�/ha2330

424a a6 he2ha�/eh2\30 \62f5�/5442440 \d24a�/hh2e50

�� d ea23\�/ea2f\0 dh23\�/\623a0 h62ah�/5442440

EH�H�j �9C

a4 e32f\�/eh2\h0 \42ha�/5440 \62ad�/h42hh0

34 ea24f�/ed2ha0 \32ha�/e62ff0 h62a3�/\d2fa0

a ea23e�/ed2350 d\25d�/\32\\0 ha2e5�/ed2560

EH�H�j �9k

a4 e42\a�/eh2\f0 \f2h3�/5442440 \623a�/hd2d30

34 ef266�/e\2ff0 \524\�/e62e40 \d234�/h326f0

a e6256�/ed2a30 dh2ed�/\32\e0 \d234�/h326f0

� £yz{<ml¤��¥,¦§�opqrs�m¨�©��ª��«�¬

� <®¯°±²opqrs�mnopqrs,u³vL(´��

� <®¯°±²µ¶·�¸¹tº»mn¥,¦L¼��qr�p��(��pl�

- 157 -

H��G��½��

C�� c��/.��g0

W� �O� ]WW

�G��

544

.�� d4 e32d\�/ee2he0 h4244�/5442440 e\2\\�/eh2dh0

9.�� 4245i424a f e62hd�/e\2450 h525e�/hf2d40 e\2e4�/eh2hd0

��i

%9#"WZ��

42445 5f ed2ff�/ee2540 hd2\f�/e\2d30 eh236�/eh2ee0

4244a 3f ea24a�/ee2\\0 hf2eh�/5442440 eh26f�/ee2540

�� h ea2fa�/ee24e0 ha253�/ef25d0 eh2f6�/eh2e40

EH�H�j �9C

34 e\236�/ee26h0 h52d3�/5442440 e\2he�/eh2\60

54 ed23f�/ee2350 h\2\5�/ed2h60 e\2af�/eh2a60

a ed24f�/ee2340 h424a�/h\2d60 eh2ad�/eh2e40

EH�H�j �9k

34 e\236�/ee2\30 hf244�/5442440 eh26a�/ee2560

54 ea2f\�/ee26f0 h\2af�/ed26f0 eh23f�/eh2ef0

a e62hh�/eh24\0 h32e\�/he2d40 eh2f6�/eh2ha0

� £yz{<mEH�HZ�9C�¾s�l¤��¥,¦¿

� ��©mn¾s�LuÀ7pÁ�� Â

� �Ã:�;t|L��opqrs�Ä,�ÅÆ

H��1��D��

C�� c��/.��g0

W� �O� ]WW

1��D��

544

.�� 5f6 e6236�/ee2440 h\2h5�/5442440 e\2f4�/eh2fe0

9.�� 4245i424a d e62h5�/eh2he0 ef23�/ef2ha0 eh2a4�/ee2\30

��4245 a e32d4�/eh2hh0 e62f�/ea2ef0 ea2h3�/ed25d0

424a a e52d4�/eh26a0 e5�/ea2aa0 e62af�/ef24d0

%9#"WZ��4245 f e42h5�/e\2h40 e42d�/e32a50 he2h\�/e425h0

424a h ef245�/eh2h40 hh23�/ed2da0 ea2fh�/ea2ee0

�� 5\ ed244�/ee2\\0 he2d�/ee2df0 e\2h6�/eh2\30

EH�H�j �9C54 ea246�/ee2440 he23�/ed2h50 ed2h\�/e\2f50

a e6245�/e\2h30 e42h�/ef2630 e62h4�/ef23\0

EH�H�j �9k54 e52h5�/eh2\f0 he�/e62af0 ef2e\�/ed23a0

a e5244�/ed2ee0 e42h�/e326a0 e526e�/e52d40

� W��yz{¦ÇÈ�É� ��ll¤��¥,¦ ��mnÊyz{t|��¾s��Ë�

� W��yz{t|L�p¦�Ì7mÍ�� ¾sL�K��¥,»�p�

� ��©mn¾s�LuÀ7pÁ�� Â

� �Ã:�;t|L<®¯°±²opqrs�Ä,�ÅÆ

- 158 -

H��!�E��/�E��EG��M�0

C�� c��/.��g0

W� �O� ]WW

!�E��

a4

.�� a\ h\235�/e52a40 \d2f6�/5442440 ed255�/e\2h40

9.�� 4245i424a 6 e623\�/e\2660 h32h\�/hf2\40 eh24e�/eh2hh0

��42445 e ef244�/eh2440 \h2ha�/e62550 ea266�/e\25e0

4244a 53 ea2da�/ee2530 \4246�/e\2330 ea2ae�/ed26a0

%9#"WZ��42445 e ef255�/eh2da0 \h2h3�/e62\e0 ea26a�/ed2ef0

4244a 5f e42h5�/ef2d\0 \426a�/ea2d50 e62\e�/ed26a0

�� f \52de�/h42ee0 fe233�/d52e30 e52ad�/ea2350

EH�H�j �9Ci�9k54 hh243�/e62a40 fh2hf�/\\2660 e52a5�/ef2ef0

a ha2da�/e323h0 af2fa�/dh25f0 e42e\�/e62h40

� m��p�op©�Î{��¡�� %9#"WZ�� Ï�pÁ� 5i54Ð o(

� :�;�p¦op�p mÑ�Ò©x(��opqrs�mnopqrstuv}~7¥,

�ÐÓ�Ô /H��0�¾s

H��G��½��/�E��EG��M�0

C�� c��/.��g0

W� �O� ]WW

�G��

a4

.�� d4 ha2e\�/e62360 h4253�/5442440 ed25d�/e\2\\0

9.�� 4245i424a 6 h\2hh�/ef26d0 hd26e�/hh2h50 ee2d5�/ee2h50

��i

%9#"WZ��

42445 \ ef2h4�/ed2ae0 he233�/eh2550 eh26h�/ee24\0

4244a e ed2f\�/eh2aa0 h62d5�/eh2\30 e\2e5�/eh2h50

�� \ e62d5�/ed2350 \e2ah�/ef2650 eh2ea�/ee2f30

EH�H�j �9C�54 e6235�/eh2630 hf26e�/eh2\60 eh244�/eh2hf0

a e32f4�/e\2hd0 \e25e�/e42440� ed2e6�/e\2fe0

EH�H�j �9k54 ea233�/eh2\d0 h62ah�/eh2d50 eh2h5�/ee2650

a e52d5�/ea2d60 h426\�/he2\30 eh2ae�/ee25\0

� :�;��p¦op�p m�Ò©x(�mnopqrs��Õ�Ö�(×� ��©

��opqrs��Ø7��t|Ù�Ì�Ë7(×� Â

��Ú� !�E��:�;�ÛÜ��¥,L¼

� ��©Ý:�;t|� !�E��:�;��Þ¸mnopqrs��ßà�o��¬

- 159 -

H��1��D��/�E��EG��M�0

C�� c��/.��g0

W� �O� ]WW

1��D��

a4

.�� 5f6 e42h5�/ea2540 hd2f5�/5442440 e\24f�/eh2330

9.�� 4245i424a f e32f3�/e\2330 he235�/e\2a50 ee2h3�/ee2e30

��i�

%9#"WZ��

4245 6 ed244�/ee25e0 ef2hf�/ed2630 ed25d�/ed2440

424a 5d e32h\�/e\2eh0 ha2d6�/5442440 e\2da�/eh2aa0

�� 5a ea2da�/eh2h30 he25e�/5442440 ee2f6�/ee2\60

EH�H�j �9C5a e62eh�/eh2\h0 hf2\e�/5442440 eh23d�/eh2e60

a ea2h4�/eh2he0 e62d5�/ea2\f0 ed244�/ed2ah0

EH�H�j �9k5a ed244�/eh2hh0 e4244�/5442440 ed25\�/ed2e50

a e6235�/e\2fe0 hd2f5�/ed2650 ea2de�/ed2430

� :�;��p¦op��p mÑ�Ò©x(��{Aáâ©ã��¾s,mn¾sÇ��

ª��äåæpç

� Ý:�;t|��¾s�l¤��¥,¦ �è��©mnopqrs�L�t�é��¥,

� �Ã:�;t|L��opqrs�Ä,�ÅÆ

$FG��E��H��C��

- 160 -

C��C��G��

_��!��E�]��H��D��ZE��

� ��

54Z��b�a��G��

C�� c�� c�� c��

�� 3444 d3 3

!�� 53d44 543 3

� ��J� ��/E��-��0

� !��J� ��G��/E��-��0

H��

- 161 -

H��!��

��

- 162 -

��

� ��EE��-

� yzyê�opqrsëÐ|�<®¯°±²

� fl��D��:�;ì3l��:�;t<®¯°±²opqrsíÈ

� mî7yz{tA7mnopqrs�,�u³yê

� ��

� mî7:�;��mî7yz{tA�¡mnopqrs�,<®¯°±²opqrs

�ïðíëÐu³yê�¡<®¯°±²opqrs�Ä,ÅÆ

� ��E��-�

� mnopqrs�, Þ¸<®¯°±²opqrs�ªñAu:�;p�òót �7)

×�ôõ ��H��

� 1�E��

� ö÷�op��pìøù�úLÐvê��û�üt£opqrs¾st| qr

ýoptA7þ�í�yê��

kR.

- 163 -

.GG��F

W�E� !��2 Y�� EE��

9.�� 3446

� ��-� ��EG��EG��E�� #�E�� O��-�G��-

�� 3446

� W�� #��D�G�D-��E�� E��EG��9.�� EG��9.��

%9#"WZ�� 3446 � .��G�D-��E��

!�� 344d

� ��-� C��-�E��D��G�D-��E�� !��E��-� C��D��G��E�G��i�� C��D��E��E�G��i��

9!�Z�� 344\i344h

� ��-

� #�E��-�9EG��

� ��!��EG��D�/��M�V0

� C��D��G��E�G��i��

� C��D��E��E�G��i��

C"� 3455

� ��-

� ��EG�� G��

� ��9!��EG��D

� C��D��G��E�G��i��

� C��D��E��E�G��i��

� ��Z��E�D��E��-��.�D��E�

- 164 -

��

��'(��

� �� '� �� 0 �� 3�5� � � ��)* + �)* + ,-.xyz{ $%� ��! �&x�� ( � �)* + ,-./��0 z{ 345� $%� ��! �&�%� |�}�� 6 (&�%� |(} ��" �� 7 3(5# ��

� �� !�� '( �$ � � �� % �� 6 �&x � �� x y �� 385 �� 3�5��

��

� �� 7/�z{ �� 3�5

� �� #�� , � �� 6 �&x �� ! � ~ x� � � �� ( � �� " �� 6 �&x 7 3�5 ��# �� 3�5$ ��


%9#"WZ��

-�./01'(�� '� ��#�� 3�5� ��

� �� 2�� ( � �)* + ,-./�{��z{ $%� ��! �&'�� 7 3(5� ��#�� #�� 3(5

� �� !�� '( �" � � �� # �� 6 �&x � �� x y �� 385 ��$ �� 3�5�% �� #�� ,��

-�./01��

� �� 7/�z{ �� 3�5

� �� !�� ( � �� 6 �&x � �� x y ( 7 �0 3�! �! (5�� 3�5" ��



Data Step, Statistical Summary, Tables/Cubes, Covariance, Linear & Logistic Regression, GLM, K-means clustering, …


3: Visualization & Text Analytics


Country City Latitude Longitude Year DataType DataType2 DataType3 Institution Purpose Scope-Collection

Scope-Application Time Lag Count Ratio

�


12

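\[
\text{Valid Voting Ratio}_i = \frac{N_{cy} + N_{cn} + N_{py} + N_{pn}}{N_{all}}
\]

• Nall = Total number of conservative/progressive parties

• Nall ≥ Ncy + Ncn + Npy + Npn

\[
\text{Yes-No Diversity}_i = -\sum_{k \in \{y, n\}} P_k \log_2 P_k,
\qquad
P_y = \frac{N_{cy} + N_{py}}{N_{cy} + N_{cn} + N_{py} + N_{pn}},
\quad
P_n = \frac{N_{cn} + N_{pn}}{N_{cy} + N_{cn} + N_{py} + N_{pn}}
\]

\[
\text{Political Orientation Diversity} = -\sum_{i \in \{c, p\}} \sum_{j \in \{y, n\}} P_{ij} \log_4 P_{ij},
\qquad
P_{cy} = \frac{N_{cy}}{N_{cy} + N_{cn} + N_{py} + N_{pn}},
\quad
P_{cn} = \frac{N_{cn}}{N_{cy} + N_{cn} + N_{py} + N_{pn}},
\quad \ldots
\]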

- 226 -

�

�

13

Bill Contentiousness

Valid Voting Ratio (c1)

Yes-No Diversity (c2)

Political Orientation Diversity (c3)
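The three components (c1, c2, c3) can be sketched as follows. This is a minimal illustration, assuming that c/p denote conservative/progressive members, that y/n denote yes/no votes, that the two diversities are entropies with base 2 and base 4, and that 0·log 0 is treated as 0. The function and argument names are hypothetical, and how the components are combined into Bill Contentiousness is not shown in this copy, so it is not attempted here.

```python
import math


def entropy(probabilities, base):
    """Shannon entropy in the given base, skipping zero probabilities."""
    return -sum(p * math.log(p) / math.log(base) for p in probabilities if p > 0)


def bill_components(n_cy, n_cn, n_py, n_pn, n_all):
    """Return (valid voting ratio, yes-no diversity, orientation diversity)."""
    voted = n_cy + n_cn + n_py + n_pn
    if voted == 0 or n_all < voted:
        raise ValueError("need 0 < N_cy + N_cn + N_py + N_pn <= N_all")
    valid_voting_ratio = voted / n_all
    p_yes = (n_cy + n_py) / voted
    p_no = (n_cn + n_pn) / voted
    yes_no_diversity = entropy([p_yes, p_no], base=2)
    cells = [n / voted for n in (n_cy, n_cn, n_py, n_pn)]
    orientation_diversity = entropy(cells, base=4)
    return valid_voting_ratio, yes_no_diversity, orientation_diversity


# An evenly split vote versus a unanimous one (made-up counts).
split = bill_components(10, 10, 10, 10, n_all=50)
unanimous = bill_components(20, 0, 20, 0, n_all=40)
print(split)
print(unanimous)
```

Using base 2 for the two-way split and base 4 for the four-way split keeps both diversities in the range 0 to 1, so the three components are on a comparable scale.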

14

•

X X

Y Y

- 227 -

15

�

�

�

�

�

�

�

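For the votes of members i and j, Precision and Recall are computed separately for yes (y) and no (n) votes from the counts of their joint votes, and combined as

\[
F1_y = \frac{2 \cdot \text{Precision}_y \cdot \text{Recall}_y}{\text{Precision}_y + \text{Recall}_y},
\qquad
F1_n = \frac{2 \cdot \text{Precision}_n \cdot \text{Recall}_n}{\text{Precision}_n + \text{Recall}_n}
\]

\[
F1_{yn} = \frac{1}{2}\left(F1_y + F1_n\right)
\]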

16

�

�

Voting Similarity (VS)

Shared Voting Ratio (c1)

F1yn (c2, c3)


None

Dove

Neutral

Hawk

Bernanke Bies Duke Ferguson Gramlich Greenspan Kohn Kroszner Meyer Olson
'notably' 'auditors' 'economic-recovery' 'internationally' 'low-income' 'doubtless' 'board-staff-contributed' 'kroszner' 'nairu' 'gramm-leach-bliley-act'
'ben' 'internal-control' 'foreclosures' 'obviously' 'social-security' 'standards-living' 'own-necessarily' 'randall' 'trend-growth' 'bank-charters'
'likewise' 'corporate-governance' 'foreclosure' 'thank' 'income-borrowers' 'century' 'board-staff-contributed-remarks' 'randall-kroszner' 'supply-shocks' 'consolidation'
'bernanke' 'internal-audit' 'neighborhood-stabilization' 'regarding' 'social' 'half-century' 'board-staff' 'footnotes' 'utilization-rates' 'provisions'
'working-paper' 'internal-controls' 'crisis' 'inviting-me' 'getting' 'technologies' 'odds' 'am-delighted' 'nevertheless' 'net-interest-margins'
'financial-crisis' 'mitigating-controls' 'credit-availability' 'private-sector' 'involves' 'consequence' 'asset-prices' 'speech-delivered' 'favorable-supply-shocks' 'retail'
'pp' 'control-processes' 'servicers' 'communication' 'reinvestment' 'evident' 'footnotes-views' 'protect-consumers' 'cyclical' 'largest-banks'
'text' 'auditor' 'text-see' 'sectors' 'income-households' 'societies' 'open-market-committee' 'vol' 'leaves' 'commercial-banks'
'text-see' 'processes' 'financial-crisis' 'similarly' 'national-saving' 'surely' 'inflation-expectations' 'during-market' 'limits' 'safety-soundness'
'february' 'committee-sponsoring-organizations' 'community-banker' 'thank-inviting-me' 'lower-income' 'arguably' 'footnotes' 'turbulence' 'trend-rate' 'financial-services'
'university' 'treadway-commission' 'recovery' 'technology' 'predatory-lending' 'war' 'remarks-text' 'subprime' 'above-trend-growth' 'lines'
'washington' 'risk-management-practices' 'across-country' 'therefore' 'budget' 'owing' 'views-expressed' 'subprime-mortgages' 'disciplined' 'margins'
'economic-financial' 'officers' 'vacant' 'contingency' 'low-moderate-income' 'century-ago' 'unusually' 'journal' 'consensus' 'establishing'
'notwithstanding' 'risk-exposures' 'text-information' 'preparations' 'standpoint' 'ever' 'open-market-committee-fomc' 'home-ownership-equity-protection-act' 'full-employment' 'marketplace'
'economic-recovery' 'enterprise' 'purchase' 'economies' 'desirable' 'world-war-ii' 'households-businesses' 'unfair-deceptive' 'below-trend' 'asset-quality'
'leading' 'financial-reporting' 'board-governors-system' 'occur' 'neighborhood-reinvestment-corporation' 'goods-services' 'stance' 'undertake' 'avoiding' 'charter'
'pdf' 'accounting-standards' 'small-business' 'active' 'slightly' 'edge' 'resource-utilization' 'text-randall-kroszner' 'absence' 'profile'
'march' 'enterprise-risk-management' 'expenses' 'observers' 'saving' 'wholly' 'asset-price' 'due-diligence' 'bank-supervision' 'providers'
'board-governors-system' 'coso' 'neighborhoods' 'sharing' 'huge' 'presumably' 'colleagues-open-market-committee' 'trust-verify' 'appreciation-dollar' 'deposit'
'broader-economy' 'customer' 'solutions' 'probably' 'seven' 'readily' 'elevated' 'strahan' 'financial-modernization' 'securities-insurance'
'principal' 'basel-ii' 'foreclosed' 'asian' 'numbers' 'human' 'central-banks' 'penalties' 'therefore' 'banking-industry'
'vol' 'complex-organizations' 'small-businesses' 'nations' 'retirement' 'enabled' 'anchored' 'subprime-mortgage' 'principle' 'banking-securities'
'journal' 'audit-committee' 'demand-credit' 'let-me' 'gets' 'vast' 'members-board' 'simultaneously' 'estimate-nairu' 'inviting-me'
'robert' 'operational-risk' 'owner' 'coordination' 'passed' 'conceptual' 'policy-actions' 'prepayment-penalties' 'viewed' 'competitive'
'j' 'internal' 'see-board-governors-system' 'accurately' 'neighborhood' 'necessity' 'price-stability' 'pp' 'precisely' 'deposit-insurance'
'september' 'management-internal' 'weak' 'computer' 'get' 'nonetheless' 'markets-economy' 'text' 'judgment' 'legal'
'november' 'directors' 'loans-small' 'domestic' 'really' 'generations' 'contributed' 'verify' 'excessive' 'compete'
'sharp' 'operational' 'website' 'topic' 'save' 'creative-destruction' 'views' 'textreturn' 'structural' 'internal-controls'
'bureau' 'determine-whether' 'communities' 'speak' 'political' 'fear' 'textreturn' 'mortgage-market' 'theme' 'financial-holding'
'financial-economic' 'invitation' 'economic-conditions' 'transparency' 'matter' 'living' 'damp' 'subprime-market' 'rising-inflation' 'offices'
'national-bureau-economic-research' 'exposure' 'real-estate-owned' 'shared' 'counseling' 'advances' 'path' 'taxes-insurance' 'monetary-policy-cannot' 'allowed'
'recovery' 'controls' 'owners' 'date' 'finances' 'market-forces' 'resource' 'structured' 'higher-inflation' 'strategic'
'c' 'appetite' 'real-estate' 'play' 'home-ownership' 'technological' 'views-expressed-own-necessarily' 'issued' 'banking-financial' 'reporting'
'david' 'auditing' 'board-governors' 'asia' 'studies' 'accordingly' 'remarks-textreturn' 'under-home-ownership-equity-protection' 'challenges-monetary-policy' 'safety'
'april' 'compliance' 'stabilization' 'recognizing' 'fine' 'productive' 'movements' 'ownership-equity-protection-act-hoepa' 'prevailing' 'laws'
'recession' 'risk-appetite' 'equally' 'complete' 'difference' 'decades' 'impaired' 'e' 'unchanged' 'opportunities'
'nevertheless' 'risk-management-processes' 'don-t' 'accurate' 'suppose' 'inevitable' 'anticipated' 'prudent' 'forecasting' 'bank-holding-companies'
'speech' 'lines-business' 'foreclosure-crisis' 'news' 'involving' 'perceived' 'contributed-remarks-text' 'washington' 'direction' 'community-banks'
'october' 'guidance' 'press-release' 'additionally' 'main' 'hence' 'own-necessarily-members' 'alt' 'relation' 'banks-united-states'
'john' 'board-directors' 'eligible' 'emerge' 'communities' 'markedly' 'economic-stability' 'correlated' 'specifically' 'banking-organizations'
'press' 'conflicts-interest' 'delinquent' 'financial-sector' 'home-mortgage-disclosure-act-hmda' 'valued' 'house-prices' 'foreclosures' 'interpretation' 'monitor'
'papers' 'credit-quality' 'during-financial-crisis' 'technologies' 'urban' 'produced' 'downward' 'proposed-rules' 'interpreted' 'established'
'm' 'profile' 'properties' 'forces' 'programs' 'endeavor' 'economic-activity' 'kb-pdf' 'constant' 'risk-profile'
'textreturn' 'invitation-speak' 'qualify' 'operating' 'present' 'largely' 'boosted' 'p' 'initially' 'reviews'
'helping' 'enterprise-wide' 'senior-loan' 'financial-market' 'minority' 'technology' 'outlook' 'www' 'via' 'conducted'
'american-economic-review' 'relating' 'before-begin' 'firm' 'shown' 'history' 'circumstances' 'association' 'appreciate' 'integrity'
'sharply' 'entity' 'inventory' 'computers' 'apart' 'newer' 'judging' 'words' 'call' 'regulator'
'remainder' 'monitoring' 'www' 'industries' 'cuts' 'emergence' 'macroeconomic' 'market-turbulence' 'otherwise' 'efficiency'
'stabilize' 'sarbanes-oxley-act' 'nonprofit' 'prove' 'community-development' 'virtually' 'resilient' 'rigorous' 'swings' 'vehicles'
'financial-stability' 'draft' 'small-business-owners' 'communications' 'groups' 'telecommunications' 'slack' 'causes' 'inflation-rate' 'ownership'


'standpoint' 'see-board-governors-system' 'saving' 'members-board' 'foreclosures' 'laws'
'financial-crisis' 'economic-conditions' 'retirement' 'policy-actions' 'proposed-rules' 'bank-holding-companies'
'economic-financial' 'owners' 'save' 'markets-economy' 'market-turbulence' 'community-banks'
'board-governors-system' 'real-estate' 'counseling' 'damp' 'supply-shocks' 'banks-united-states'
'broader-economy' 'board-governors' 'finances' 'resource' 'utilization-rates' 'banking-organizations'
'financial-economic' 'foreclosure-crisis' 'urban' 'views-expressed-own-necessarily' 'favorable-supply-shocks'
'american-economic-review' 'during-financial-crisis' 'minority' 'movements' 'full-employment'
'auditors' 'nonprofit' 'standards-living' 'contributed-remarks-text' 'below-trend'
'corporate-governance' 'private-sector' 'technologies' 'own-necessarily-members' 'appreciation-dollar'
'financial-reporting' 'sectors' 'world-war-ii' 'house-prices' 'financial-modernization'
'accounting-standards' 'economies' 'goods-services' 'downward' 'excessive'
'complex-organizations' 'observers' 'creative-destruction' 'outlook' 'banking-financial'
'audit-committee' 'asian' 'market-forces' 'circumstances' 'challenges-monetary-policy'
'invitation' 'nations' 'emergence' 'macroeconomic' 'provisions'
'exposure' 'domestic' 'board-staff-contributed' 'resilient' 'net-interest-margins'
'auditing' 'transparency' 'own-necessarily' 'slack' 'retail'
'compliance' 'asia' 'board-staff-contributed-remarks' 'footnotes' 'largest-banks'
'board-directors' 'financial-sector' 'board-staff' 'am-delighted' 'safety-soundness'
'monitoring' 'financial-market' 'asset-prices' 'during-market' 'financial-services'
'draft' 'firm' 'footnotes-views' 'turbulence' 'establishing'
'foreclosures' 'industries' 'open-market-committee' 'subprime' 'marketplace'
'foreclosure' 'low-income' 'footnotes' 'subprime-mortgages' 'asset-quality'
'crisis' 'social-security' 'views-expressed' 'unfair-deceptive' 'providers'
'servicers' 'income-borrowers' 'open-market-committee-fomc' 'strahan' 'deposit'
'financial-crisis' 'income-households' 'households-businesses' 'penalties' 'securities-insurance'
'community-banker' 'national-saving' 'stance' 'subprime-mortgage' 'banking-industry'
'across-country' 'lower-income' 'resource-utilization' 'prepayment-penalties' 'banking-securities'

'board-governors-system' 'predatory-lending' 'asset-price' 'mortgage-market' 'deposit-insurance'

'expenses' 'budget' 'colleagues-open-market-committee' 'subprime-market' 'financial-holding'

'foreclosed' 'low-moderate-income' 'central-banks' 'taxes-insurance' 'safety'

�

'economic-recovery' 'small-business' 'recovery' 'ownership-equity-protection-act-hoepa'

'national-bureau-economic-research' 'small-businesses' 'home-mortgage-disclosure-act-hmda' 'trend-growth'

'recovery' 'demand-credit' 'community-development' 'trend-rate'

'recession' 'loans-small' 'economic-activity' 'above-trend-growth'

'lines-business' 'real-estate-owned' 'protect-consumers' 'bank-charters'

'conflicts-interest' 'senior-loan' 'home-ownership-equity-protection-act' 'commercial-banks'

'credit-quality' 'small-business-owners' 'due-diligence' 'charter'

'enterprise-wide' 'reinvestment' 'trust-verify'

'economic-recovery' 'neighborhood-reinvestment-corporation' 'verify'

'credit-availability' 'home-ownership' 'under-home-ownership-equity-protection'

�

'stabilize' 'stabilization' 'consolidation' 'bank-supervision'

'financial-stability' 'technology' 'internal-controls' 'estimate-nairu'

'internal-control' 'coordination' 'monitor' 'structural'

'internal-controls' 'technologies' 'risk-profile' 'rising-inflation'

'mitigating-controls' 'technological' 'regulator' 'monetary-policy-cannot'

'control-processes' 'technology' 'operational-risk' 'higher-inflation'

'committee-sponsoring-organizations' 'inflation-expectations' 'management-internal' 'inflation-rate'

'risk-management-practices' 'price-stability' 'controls' 'gramm-leach-bliley-act'

'risk-exposures' 'economic-stability' 'risk-appetite' 'neighborhood-stabilization'

'enterprise-risk-management' 'structured' 'risk-management-processes'

'coso' 'nairu' 'guidance'

'basel-ii' 'disciplined' 'sarbanes-oxley-act'


4: SNS and Bibliography Analytics


System Optimization Lab.

Korea University

Young Min, Jun

Modified LDA with Bibliography Information

1

Korea BI Data Mining Society, 2013 Fall Conference

Contents

1. LDA

1.1 Topic Model

1.2 LDA

2. Modified LDA with Bibliography Information

2.1 Limitation of LDA

2.2 Introduction

2.3 Preliminary

2.4 Model

2.5 Expected Impacts

2- 265 -

1.1 Topic Model

“Topic modeling provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus.” (DM Blei, 2012)

3

Research on Topic Models

• LSA: based on dimensionality reduction (singular value decomposition)

• pLSA: mixture decomposition

• LDA: the most frequently studied model

Example

• What are the “topics” in the New York Times?

• How do the “topics” on Twitter change?

• How similar are these articles?

Geometric Interpretation

• Three topics over three words.

• LDA places a smooth distribution over the topics.

1.2 LDA

“LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.” (DM Blei, 2003)

4

Graphical Model

Example

Generative Process
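The standard LDA generative process (Blei et al., 2003) can be sketched in a few lines: draw a topic mixture for the document from a Dirichlet prior, then for each word draw a topic from that mixture and a word from the topic's word distribution. The sketch below is a simplified illustration using only the Python standard library; the document length is fixed rather than drawn from a Poisson distribution, and the toy `alpha` and `beta` values (three topics over three words) are made up.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, n_words, rng):
    """Generate one document as a list of word ids.

    alpha: Dirichlet parameters, one per topic.
    beta:  beta[k][v] is the probability of word v under topic k.
    """
    theta = sample_dirichlet(alpha, rng)          # topic mixture of the document
    topic_ids = list(range(len(alpha)))
    word_ids = list(range(len(beta[0])))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]    # topic of this word
        w = rng.choices(word_ids, weights=beta[z])[0]   # word given the topic
        words.append(w)
    return theta, words


rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8]]
theta, words = generate_document(alpha, beta, n_words=20, rng=rng)
print(theta)
print(words)
```

Because the word order plays no role in the sampling loop, the sketch also makes the bag-of-words assumption of LDA visible.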

- 266 -

2.1 Limitation of LDA

LDA is an effective tool for discovering topic structure, but there is room for further research to improve it. This research focuses on three aspects: individual, reference, and explanation.

5

Individual

• LDA is a generative model for a corpus, so it provides information about the whole document collection.

• It provides a single vector as the information about each document.

• In this study, the modified LDA gives more information about each document.

Reference

• LDA does not consider references to the literature in its generative process.

• The modified LDA provides the bibliography of a document and its distribution.

Explanation

• LDA often gives results that are hard to understand.

• The modified LDA is expected to provide more explainable results.

2.2 Introduction

LDA is motivated by the process of writing a document. Similarly, the modified LDA is motivated by the process of writing a document in a library.

6

Generic Generative Process

More detail

• A place in the library carries the probabilities with which each reference is selected.

• References in the same category have similar topics and words.

- 267 -

2.3 Preliminary

In this research, we use the language of text collections and introduce the terms “parent corpus”, “category”, and “document distribution”.

Parent Corpus

• A set of documents used for reference.

• The parent corpus consists of parent documents.

• Parent documents influence the topics and words of the new document.

Document Distribution

• Each parent document has its own place.

• The document distribution is the probability distribution with which documents of the parent corpus are selected.

Category

• A category is a cluster of parent documents.

• Parent documents in the same category share the same topic and word priors.

• Each parent document in a category has a probability of selection.

7

2.3 Preliminary

The document distribution represents the information of a new document.

Document Distribution

• A probabilistic as well as a deterministic representation of a document.

• Probabilistic

• A distribution over the parent documents that gives the probability of each being used in generating a document.

• Deterministic

• A list of the documents with a high probability of selection.

Mixture of Gaussian Distribution

• The number of mixture components gives the number of categories.
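The two representations above can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration, not the model from the slides: parent documents are placed at one-dimensional positions, the mixture has two components with made-up weights, means, and standard deviations, and the selection probability of a parent document is taken to be proportional to the mixture density at its position.

```python
import numpy as np

def mixture_density(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture evaluated at each point in x."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (n_docs, 1)
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comp @ weights                            # shape (n_docs,)

# Hypothetical positions of six parent documents and a two-category mixture.
positions = np.array([-2.1, -1.9, -0.2, 1.8, 2.0, 2.3])
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 2.0])
stds = np.array([0.5, 0.5])

density = mixture_density(positions, weights, means, stds)

# Probabilistic representation: a distribution over the parent documents.
selection_prob = density / density.sum()

# Deterministic representation: the parent documents most likely to be selected.
top_k = np.argsort(selection_prob)[::-1][:3]
```

Here the three documents near the heavier component (indices 4, 3, and 5) form the deterministic list, while `selection_prob` keeps the full probabilistic view.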


2.3 Preliminary

This slide lists the assumptions of the modified LDA with bibliography information.

Parent Corpus

• The parent corpus is assumed to have its own alpha and beta.

• Each parent document is placed at a point in the document distribution.

LDA

• Bag-of-words assumption

Document Distribution

• The probability of selecting a parent document follows a mixture of Gaussian distributions.

• The number of mixture components is assumed to be known (this can be relaxed).

2.4 Model

This slide contains the notation, terminology, and generative process of the modified LDA with bibliography information.

Notation and Terminology

Generative Process

2.4 Model

This slide contains the graphical model and the probability of a document.

Graphical Model

Probability of Document

2.4 Model

Estimation

• Verifying that a document is well classified.

• Representing the information of the important references.

• Providing a variety of views for analyzing text data.

Explanation

2.5 Expected Impacts

This research focuses on three aspects: individual, reference, and explanation.

Individual

• A bibliography as part of the probabilistic representation of a document.

• Detecting plagiarism by comparing document distributions.

Reference

2.5 Expected Impacts

Drawbacks

Dependency on LDA

• This model depends on LDA, including its perplexity and complexity.

Assumption

• The number of mixture components in the document distribution is assumed to be known (this can be relaxed).

Computational Complexity

• The total number of operations is roughly on the order of O(N⁴k²).

References

[1] Jeff A Bilmes et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, 4(510):126, 1998.

[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[3] DM Blei. Topic modeling and digital humanities. Journal of Digital Humanities, 2(1):8–11, 2012.

[4] Nikos Vlassis and Aristidis Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77–87, 2002.


5: Recommendation Systems


6: Data Mining Applications


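A Study on the Use of Weather Information in Retail Marketing

김문욱a and 진서훈b

a Department of Applied Statistics, College of Science and Technology, Korea University

2511 Sejong-ro, Sejong Special Self-Governing City, 339-700

Tel: +82-44-860-1762, E-mail: [email protected]

b Department of Applied Statistics, College of Science and Technology, Korea University

2511 Sejong-ro, Sejong Special Self-Governing City, 339-700

Tel: +82-44-860-1555, E-mail: [email protected]

Abstract

To forecast the daily sales of a hypermarket, we first forecast the sales of each week from the sales trend. We then developed a model that forecasts daily sales using weather information. The model that uses weather information has higher explanatory power than the model that uses the sales trend alone.

Keywords:

Correlation coefficient; regression analysis; weather information; hypermarket; sales forecasting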

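Introduction

Hypermarket sales can be forecast from a number of factors. In practice, hypermarket companies forecast a year's sales on a weekly basis using data on seasonality, awareness, competitors, brand, and trade area. The forecasts are used to improve the product ordering system and to plan marketing. If sales could be forecast on a daily basis, more efficient inventory management and better-targeted marketing would deliver high efficiency at low cost.

Since 2012 in particular, weakened consumer sentiment has kept consumers from spending, and the mandatory closures introduced by the amendment of the Distribution Industry Development Act have put the brakes on hypermarket growth. Under these conditions, more detailed daily sales forecasts are needed both for marketing that increases visits and spending by existing customers and for efficient store management.

Daily sales are hard to forecast, however, because they are influenced by many variables that are not directly visible. In this study, we use one such variable, weather information, to forecast daily sales.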

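Main Body

Data Mart

The data used in this study are the sales of hypermarket A by store and by category from July 2011 to December 2012, together with weather information extracted for the same period. The weather variables are listed in Table 1.

Table 1 - List of weather variables

Variable   Description
STN_ID     Station number
TM         Observation time (year, month, day)
CA_MID     Mid- and low-level cloud amount (1/10)
CA_TOT     Total cloud amount (1/10)
CH_MIN     Minimum cloud height (100 m)
CT         Cloud type code
HM         Humidity (%)
PA         Station pressure (hPa)
PS         Sea-level pressure (hPa)
PV         Vapor pressure (hPa)
RN         Precipitation (mm)
SD_HR3     3-hour fresh snowfall (cm)
SD_TOT     Snow depth (cm)
SI         Solar radiation (MJ/m2)
SS         Sunshine duration (hr)
ST_GD      Ground state code
TA         Air temperature (℃)
TD         Dew point temperature (℃)
TE_005     Soil temperature at 0.05 m (℃)
TE_01      Soil temperature at 0.1 m (℃)
TS         Ground surface temperature (℃)
VS         Visibility (10 m)
WS         Wind speed (m/s)
WW         Weather phenomenon number (domestic code)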

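Modeling

Trend

As Figure 1 shows, the weekly sales of all stores combined follow the same trend every year. The mandatory closures in force since 2012, however, give the sales trend a zigzag shape, and because the dates of Seollal (Lunar New Year) and Chuseok are set by the lunar calendar, the associated sales increases fall in different weeks from year to year. Strong trends also appear in the weeks that contain the company's founding anniversary month, Family Month, and Christmas.

Figure 1 - Weekly trend graph

At the level of individual stores, however, each store shows a different trend. Because each store repeats its own trend every year, the trend of total sales also repeats. To group stores with similar trends, we examined the correlation coefficients between store sales, as in Table 2, identified similar stores, and derived the trend for each store.

Table 2 - Correlation coefficients between store sales

Code      Store_1  Store_2  Store_3  Store_4  Store_5  Store_6  Store_7
Store_1   100%     60%      57%      54%      60%      57%      63%
Store_2   60%      100%     85%      84%      77%      87%      86%
Store_3   57%      85%      100%     97%      89%      98%      97%
Store_4   54%      84%      97%      100%     93%      93%      96%
Store_5   60%      77%      89%      93%      100%     90%      92%
Store_6   57%      87%      98%      93%      90%      100%     96%
Store_7   63%      86%      97%      96%      92%      96%      100%

Stores with similar trends are then grouped together, as shown below.

Figure 2 - Sales changes of stores with similar trends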
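One way to turn the correlation matrix in Table 2 into store groups is sketched below. The paper does not state its grouping rule, so the rule used here (link two stores when their correlation is at least a threshold, then take connected groups) and the threshold of 0.88 are assumptions for illustration; `group_by_correlation` is a hypothetical helper.

```python
import numpy as np

stores = [f"Store_{i}" for i in range(1, 8)]

# Correlation coefficients from Table 2, converted from percent.
corr = np.array([
    [100, 60, 57, 54, 60, 57, 63],
    [60, 100, 85, 84, 77, 87, 86],
    [57, 85, 100, 97, 89, 98, 97],
    [54, 84, 97, 100, 93, 93, 96],
    [60, 77, 89, 93, 100, 90, 92],
    [57, 87, 98, 93, 90, 100, 96],
    [63, 86, 97, 96, 92, 96, 100],
]) / 100.0

def group_by_correlation(corr, names, threshold):
    """Group items whose pairwise correlation chain stays at or above threshold."""
    unvisited = set(range(len(names)))
    groups = []
    for start in range(len(names)):
        if start not in unvisited:
            continue
        unvisited.discard(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(names[i])
            # Pull in every not-yet-grouped store that correlates strongly with i.
            for j in sorted(unvisited):
                if corr[i, j] >= threshold:
                    unvisited.discard(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups

groups = group_by_correlation(corr, stores, threshold=0.88)
```

With this assumed threshold, Store_1 and Store_2 each stand alone and Store_3 through Store_7 form one group; a lower threshold such as 0.85 would pull Store_2 into the large group.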

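Daily Sales Forecast

From the results above, we obtain the weekly trend and then apply the trends for holidays and mandatory closures to arrive at the final forecast of sales for each week.

Next, we compute the share of sales for each day of the week and apply it to the forecast weekly sales. Applying the share of sales for each category then completes the forecast of sales by day and by category.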
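The disaggregation step described above amounts to multiplying the weekly forecast by a day-of-week share and a category share. The sketch below shows the arithmetic only; the weekly total, the shares, and the category names are made-up values, and `disaggregate` is a hypothetical helper, not the authors' code.

```python
# Hypothetical weekly forecast and shares (each set of shares sums to 1).
weekly_forecast = 700.0
dow_share = {"Mon": 0.10, "Tue": 0.10, "Wed": 0.10, "Thu": 0.10,
             "Fri": 0.15, "Sat": 0.25, "Sun": 0.20}
category_share = {"food": 0.6, "toys": 0.1, "liquor": 0.3}

def disaggregate(weekly_total, dow_share, category_share):
    """Split a weekly forecast into (day, category) forecasts."""
    return {
        (day, cat): weekly_total * d * c
        for day, d in dow_share.items()
        for cat, c in category_share.items()
    }

daily_by_category = disaggregate(weekly_forecast, dow_share, category_share)
saturday_toys = daily_by_category[("Sat", "toys")]   # 700 * 0.25 * 0.1
```

Because both sets of shares sum to one, the day-and-category forecasts add back up to the weekly forecast.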

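Modeling with Weather Information

We ran a regression with the daily forecast sales completed above and the weather information as independent variables and actual sales as the dependent variable. In nearly all categories, the explanatory power was higher than in the regression that used the daily forecast sales alone.

Table 3 shows the result of a simple regression that uses only the daily forecast sales, without weather information. Table 4 shows the result under the same conditions with weather information added to the independent variables. Adding weather information raises the explanatory power (R-Square) by about 0.13.

Table 3 - Simple regression result

R-Square 0.6384

Variable    Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1634778              699310           2.34      0.0217
SALE_62     0.55257              0.04434          12.46     <.0001

Table 4 - Regression result with weather information added

R-Square 0.769

Variable           Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          -1860973             1109163          -1.68     0.097
SALE_62            0.57618              0.03774          15.27     <.0001
Mean temperature   336647               66869            5.03      <.0001
Mean humidity      67049                17918            3.74      0.0003

Explanatory power also rose in most of the other categories. In most cases, sales rose as the temperature rose in winter and as the temperature fell in summer.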
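The comparison between Table 3 and Table 4 can be reproduced in outline as follows. The paper's data are not available here, so the sketch fits ordinary least squares to synthetic data; the variable names, the coefficients used to simulate the data, and the noise level are all assumptions, and only the comparison of R-Square between the two models mirrors the text.

```python
import numpy as np

def r_squared(X, y):
    """R-Square of an ordinary least squares fit of y on X with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

# Synthetic stand-ins for the paper's variables (not the real data).
rng = np.random.default_rng(0)
n = 200
forecast = rng.uniform(5e6, 2e7, n)        # daily forecast sales
temperature = rng.uniform(-5, 30, n)       # mean temperature
humidity = rng.uniform(30, 90, n)          # mean humidity
actual = (0.6 * forecast + 3e5 * temperature + 5e4 * humidity
          + rng.normal(0, 2e6, n))

# Model 1: forecast sales only (as in Table 3).
r2_base = r_squared(forecast[:, None], actual)
# Model 2: forecast sales plus weather variables (as in Table 4).
r2_weather = r_squared(np.column_stack([forecast, temperature, humidity]), actual)
```

Because the second model contains the first, its R-Square cannot be lower; whether the gain is meaningful is judged from the significance of the added variables, as the t values in Table 4 do.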

Notable Results

The entertainment and toy categories showed a different result. Tables 5 and 6 show that on snowy weekends, sales increased as the amount of snow increased. A hypermarket is a place where families shop together, so the increase appears to reflect more outings by families with children on snowy weekends.

Table 5 - Regression result for the entertainment category on weekends

R-Square 0.7524

Variable        Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       421452               132309           3.19      0.0021
Entertainment   0.55766              0.03661          15.23     <.0001
Snowfall        503835               203283           2.48      0.0154

Table 6 - Regression result for the toy category on weekends

R-Square 0.6911

Variable    Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   18173896             4666139          3.89      0.0002
Toys        1.04387              0.12251          8.52      <.0001
Snowfall    17362609             2444824          7.1       <.0001
Humidity    -284168              82179            -3.46     0.0009

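There was also a category that was not affected by the weather at all: tobacco. In Table 7, none of the weather variables is statistically significant, which indicates that tobacco sales stay steady regardless of the weather.

Table 7 - Regression result for the tobacco category

R-Square 0.5649

Variable           Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          376084               2751863          0.14      0.892
SALE_33            0.91692              0.09141          10.03     <.0001
Mean temperature   -19109               58539            -0.33     0.7459
Wind speed         -19572               154368           -0.13     0.8998
Precipitation      -16605               21960            -0.76     0.4544
Humidity           4740.2652            14832            0.32      0.7511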

Tables 8 and 9 show that for alcoholic beverages, sales increased as the temperature rose in summer and as the temperature fell in winter.

Table 8 - Regression result for the alcoholic beverage category in summer

R-Square 0.8475

Variable      Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     -12250912            2484309          -4.93     <.0001
SALE_26       0.88892              0.02195          40.5      <.0001
Temperature   304146               66443            4.58      <.0001
Humidity      47137                13641            3.46      0.0006

Table 9 - Regression result for the alcoholic beverage category in winter

R-Square 0.8979

Variable      Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     716022               257562           2.78      0.0058
SALE_26       0.83811              0.01724          48.6      <.0001
Temperature   -119691              32307            -3.7      0.0003
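To make the opposite temperature effects concrete, the coefficients from Tables 8 and 9 can be plugged into the fitted linear equations. The coefficients below are taken from the tables; the input values (forecast sales, temperatures, humidity) are hypothetical, and `predict` is a helper written for this illustration.

```python
# Coefficients copied from Table 8 (summer) and Table 9 (winter).
SUMMER = {"intercept": -12250912, "forecast": 0.88892,
          "temperature": 304146, "humidity": 47137}
WINTER = {"intercept": 716022, "forecast": 0.83811,
          "temperature": -119691}

def predict(coef, forecast, temperature, humidity=0.0):
    """Evaluate the fitted linear model for one day."""
    return (coef["intercept"]
            + coef["forecast"] * forecast
            + coef["temperature"] * temperature
            + coef.get("humidity", 0.0) * humidity)

# Hypothetical inputs: same forecast sales, temperature raised by 1 degree.
summer_gain = predict(SUMMER, 20_000_000, 29, 70) - predict(SUMMER, 20_000_000, 28, 70)
winter_gain = predict(WINTER, 20_000_000, 1) - predict(WINTER, 20_000_000, 0)
```

Holding the other inputs fixed, a one-degree rise changes predicted sales by the temperature coefficient: positive in the summer model and negative in the winter model.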

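Conclusion

This study produced a model that forecasts daily sales using weather information. Some of the modeling results were as expected, such as sales increasing as the temperature falls in summer and as it rises in winter, but we also found categories whose sales increase when it snows on weekends.

In other words, the ordering system can be adjusted according to the next day's weather forecast, store operating costs can be reduced by determining an appropriate number of staff, and marketing spending can be directed to the times when weather is expected to decrease or increase sales, which should yield high returns at low cost.

The coming era is the era of big data, in which large volumes of data can be processed and analyzed freely. In preparation, if next-day sales are forecast by product rather than by day and category as they are now, and this is used for more effective store operation and differentiated marketing, the growth curve of hypermarkets, which is expected to be negative every year, could be turned into a positive one.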



(Product Life Cycle)


(Unsupervised learning)

(Supervised learning)


• Window size


• K-means clustering

“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.

H. Zha, C. Ding, M. Gu, X. He and H.D. Simon. "Spectral Relaxation for K-means Clustering", Neural Information Processing Systems vol.14 (NIPS 2001). pp. 1057-1064, Vancouver, Canada. Dec. 2001.

$$J = \sum_{j=1}^{k} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where $x_n$ is a data point, $\mu_j$ is the center of cluster $j$, and $S_j$ is the set of points assigned to cluster $j$.

[Figure: K-means clustering example with three classes (Class1, Class2, Class3)]
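The K-means objective, the within-cluster sum of squared distances to the cluster centers, can be evaluated directly for a given assignment. The sketch below is a minimal illustration on made-up points; `kmeans_objective` is a hypothetical helper and the data are not from the slides.

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum over clusters of squared distances from each point to its center."""
    diffs = X - centers[labels]      # each point minus its own cluster center
    return float(np.sum(diffs ** 2))

# Four toy points in two obvious clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# The center of each cluster is the mean of its assigned points.
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

J = kmeans_objective(X, labels, centers)
```

Each point lies at distance 1 from its center, so the objective equals 4.0; the K-means algorithm alternates between reassigning points and recomputing centers so that this value does not increase.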

• Decision tree - C5.0

$$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Entropy and information gain
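Entropy and information gain can be computed with a few lines of standard-library Python. This is a sketch of the textbook definitions, not of C5.0 itself (which additionally uses refinements such as gain ratio and pruning); the class labels and attribute values are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, attribute_values):
        subsets.setdefault(v, []).append(y)   # S_v: labels of items with A = v
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical items: two classes and two candidate attributes.
classes = ["C1", "C1", "C2", "C2"]
informative = ["A", "A", "B", "B"]      # separates the classes perfectly
uninformative = ["A", "B", "A", "B"]    # tells nothing about the class

ig_good = information_gain(classes, informative)
ig_bad = information_gain(classes, uninformative)
```

The attribute that separates the classes perfectly recovers the full entropy of 1 bit, while the uninformative attribute gains nothing, so a tree builder would split on the former.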

[Figure: example decision tree. Node “Product attribute?” with branches A and B; node “First week sales?” with branches “10 or more” and “Less than 10”; leaves assign a class]

      C1   C2   C3
C1    49    3    1
C2     0   60    4
C3     2   12   38

[Figure: training and test phases. A future item [Product attribute=B, First week sales=13] is passed through the decision tree (“Product attribute?” A / B; “First week sales?” 10 or more / Less than 10)]
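The overall accuracy implied by the 3×3 matrix above can be computed as the share of items on the diagonal. The slide does not say which axis holds the actual classes, but overall accuracy does not depend on that; treating the matrix as a confusion matrix is itself an assumption here.

```python
import numpy as np

# Counts from the C1/C2/C3 matrix above.
confusion = np.array([
    [49, 3, 1],
    [0, 60, 4],
    [2, 12, 38],
])

# Correctly classified items sit on the diagonal.
accuracy = np.trace(confusion) / confusion.sum()
```

Here 147 of 169 items fall on the diagonal, an accuracy of about 0.87.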

• ‘A’ 2011 ~ 2012

• 977 (2011: 489, 2012: 488)

21263865

39,000 ~1,490,000


• ‘A’ W- 17 90% -

[Figure: share (%) and cumulative share (%) by week, weeks 1 to 23]

Week            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Share (%)       2  4  4  5  6  7  8  8  8  8  7  6  5  4  4  3  2  2  2  1  1  1  1
Cumulative (%)  2  7 11 16 22 29 36 45 53 60 67 72 77 81 85 88 90 92 94 96 97 98 99

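• Silhouette statistic

Rousseeuw, Peter J. "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis." Journal of computational and applied mathematics 20 (1987): 53-65.

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad -1 \le S_i \le 1$$

• $a_i$: average distance from item $i$ to the other items in its own cluster
• $b_i$: smallest average distance from item $i$ to the items of any other cluster

$$\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S_i$$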
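The silhouette statistic can be computed directly from its definition. The sketch below uses Euclidean distance and made-up one-dimensional points; `silhouette_values` is a hypothetical helper that assumes at least two clusters and assigns 0 to items in singleton clusters, following the convention in Rousseeuw (1987).

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-item silhouette S_i = (b_i - a_i) / max(a_i, b_i)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all items.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the item itself
        if not same.any():
            continue                         # singleton cluster: S_i stays 0
        a = dist[i, same].mean()             # mean distance within own cluster
        b = min(dist[i, labels == c].mean()  # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated toy clusters on a line.
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
s = silhouette_values(X, labels)
mean_silhouette = s.mean()
```

Values near 1 indicate items that sit well inside their cluster; comparing the mean silhouette across different numbers of clusters is a common way to choose the number of clusters.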

,

• 1• 0.5

[Figure: weekly curves of the four clusters over weeks 1 to 18 (2011): Slow-up, Fast-up, High Peak, Never-up]


• (2011 )

• : 76%


• 2012
