
Universität des Saarlandes
Max-Planck-Institut für Informatik

AG5

Boolean Matrix Factorization with

missing values

Masterarbeit im Fach Informatik

Master’s Thesis in Computer Science

von / by

Prashant Yadava

angefertigt unter der Leitung von / supervised by

Dr. Pauli Miettinen

begutachtet von / reviewer

Prof. Dr. Gerhard Weikum

October 2012


Hilfsmittelerklärung

Hiermit versichere ich, die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt zu haben.

Non-plagiarism Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Saarbrücken, den 31. Oktober 2012,

(Prashant Yadava)

Einverständniserklärung

Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department at Saarland University.

Saarbrücken, den 31. Oktober 2012,

(Prashant Yadava)


“Your time is limited, so don't waste it living someone else's life. Don't be trapped by dogma, which is living with the results of other people's thinking. Don't let the noise of others' opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition. They somehow already know what you truly want to become. Everything else is secondary.”

- Steve Jobs (1955-2011)


Abstract

Is it possible to meaningfully analyze the structure of a Boolean matrix for which 99% of the data is missing?

Real-life data sets usually contain a high percentage of missing values, which hampers structure estimation from the data, and the difficulty only increases when the missing values dominate the known elements in the data set. There are good real-valued factorization methods for such scenarios, but there exists another class of data, Boolean data, which demands a different handling strategy than its real-valued counterpart. Many applications find a logical representation only via Boolean matrices, and for these, real-valued factorization methods do not provide correct and intuitive solutions. Currently, no method exists that can factorize a Boolean matrix containing the percentage of missing values usually associated with non-trivial real-world data sets. In this thesis, we introduce a method to fill this gap. Our method is based on the correlation among the data records and is not restricted by the percentage of unknowns in the matrix. It performs a greedy selection of the basis vectors, which represent the underlying structure in the data.

This thesis also presents several experiments on a variety of synthetic and real-world data, and discusses the performance of the algorithm for a range of data properties. Obtaining comparison statistics against existing methods was not straightforward, for the simple reason that none exist. Hence we present indirect comparisons with existing matrix completion methods that work with real-valued data sets.


Acknowledgements

I would like to thank my adviser Dr. Pauli Miettinen for providing me the opportunity to work with him for this thesis. I cannot thank him enough for his guidance, encouragement and, naturally, his patience. His constant support and belief in me, coupled with his vision and careful planning, instilled confidence in me throughout the duration of the thesis. It has been a privilege working with him and I got to learn a lot from him during this period. I could not have asked for a better captain for the ship!

A special note of thanks to Prof. Dr. Gerhard Weikum for giving me the opportunity to pursue this thesis at the D5 group at the Max Planck Institute for Informatics. I am grateful to him for reviewing my master's thesis.

I would like to take this opportunity to thank my family for giving me love, joy, comfort, strength, and purpose, every single day, every single moment. I am grateful to my parents for bringing me up in an atmosphere where education and learning were always prioritized. I am thankful to them for trusting me with the choices I made in my career and supporting me in achieving my goals.

Hey Sis! I did it!

I cannot imagine coming to this stage without the unconditional love, support, and motivation of my elder brother, Saurabh, who means so much to me. It would never have been possible without you, Bhaiya!

Dharneesh, Prasoon, this goes to you guys as well! Thanks for everything!

I would like to thank all my friends at Saarland University and the Max Planck Institute who made every single day interesting, fun and exciting.

Special thanks to Niket and Gaurav for being there whenever I needed them and for making my stay at Saarbrücken a pleasant one.


Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Notation
  1.3 Problem definitions
  1.4 Boolean matrix factorization with missing values
  1.5 Organization
  1.6 Contribution

2 Related Work
  2.1 Matrix Factorizations
    2.1.1 Real valued matrix factorizations
    2.1.2 Boolean matrix factorizations for complete matrix
      2.1.2.1 The Discrete Basis Problem
    2.1.3 Boolean matrix factorizations for incomplete matrix

3 Model and algorithm
  3.1 The Model
  3.2 The Concepts
    3.2.1 Calculating association with missing values
    3.2.2 The cover value and the mask
    3.2.3 The fit matrix
  3.3 The Algorithm

4 Experiments and Results
  4.1 Synthetic Data
    4.1.1 Density
      4.1.1.1 Expected Output
      4.1.1.2 The Approach
      4.1.1.3 Choice of remaining data parameters
    4.1.2 Unknowns
      4.1.2.1 Expected Output
      4.1.2.2 The Approach
      4.1.2.3 Choice of data parameters
    4.1.3 Noise
      4.1.3.1 Expected Output
      4.1.3.2 The Approach
      4.1.3.3 Choice of data parameters
    4.1.4 Rank
      4.1.4.1 Expected Output
      4.1.4.2 The Approach
      4.1.4.3 Choice of data parameters
  4.2 Results for synthetic data
    4.2.1 Varying density
    4.2.2 Varying unknowns
    4.2.3 Varying noise
      4.2.3.1 Low unknowns
      4.2.3.2 High unknowns
    4.2.4 Varying rank
  4.3 Real Data
    4.3.1 Movie Lens Data
      4.3.1.1 Comparison with other methods
      4.3.1.2 Results and discussion
    4.3.2 KDD Cup Data
      4.3.2.1 Results and discussion
      4.3.2.2 Comparison with other methods

5 Conclusions
  5.1 Conclusion
  5.2 Future Work

Bibliography


List of Figures

1.1 An example of Boolean data with unknowns. Users are presented in rows, while items are in columns.
4.1 Varying density: (a) best and (b) worst error rates. Matrix dimensions (300, 500), rank 10, unknowns 20%, noise 5%.
4.2 Varying unknowns: (a) best and (b) worst error rates. Matrix dimensions (300, 500), rank 10, density 10%, noise 5%.
4.3 Varying noise (with low unknowns): (a) best and (b) worst error rates. Matrix dimensions (300, 500), rank 10, density 10%, unknowns 20%.
4.4 Varying noise (with high unknowns): (a) best and (b) worst error rates. Matrix dimensions (300, 500), rank 10, density 10%, unknowns 70%.
4.5 Varying rank: (a) best and (b) worst error rates. Matrix dimensions (300, 500), density 10%, unknowns 20%, noise 5%.


List of Tables

4.1 Summary of the data characteristics used for the synthetic data experiments: τ = 0.2:0.1:0.8, rows = 300, cols = 500.
4.2 The properties of the Movie Lens data set after pruning.
4.3 Factorization results for the Movie Lens data for the BMF with missing values method. hamming denotes the Hamming distance between the original and reconstructed matrices, while rmse denotes the root-mean-squared error.
4.4 Factorization results for the Movie Lens data with the OptSpace method. r denotes the rank; test and train denote, respectively, the test and training error rates; Hamming denotes the Hamming error distance metric, while RMSE denotes the root-mean-squared error.
4.5 Factorization results for the Movie Lens data with the ALM method. r denotes the rank; test and train denote, respectively, the test and training error rates; Hamming denotes the Hamming error distance metric, while RMSE denotes the root-mean-squared error.
4.6 The properties of the Half-KDD Cup data set.
4.7 The reconstruction error statistics for the experiment to choose the association threshold τ at rank 50 for the Half-KDD Cup data. errors represents the errors for known values, while errors-knowns-true those for the Boolean TRUE values among the known values. train and test denote, respectively, the training and testing error rates.
4.8 The density of the factor matrices in the experiment to choose the best τ.
4.9 The reconstruction error statistics for the experiment for the Half-KDD Cup data. τ represents the association threshold; rank is the factorization rank; errors represents the errors for known values, while errors-knowns-true those for the Boolean TRUE values among the known values. train and test denote, respectively, the training and testing error rates.
4.10 Factorization results for the Half-KDD Cup data with the OptSpace method. r denotes the rank; test and train denote, respectively, the test and training error rates; Hamming denotes the Hamming error distance metric, while RMSE denotes the root-mean-squared error; errors-known-true represents the errors when matching Boolean TRUE values among the known values.
4.11 Factorization results for the Half-KDD Cup data with the ALM method. r denotes the rank; test and train denote, respectively, the test and training error rates; Hamming denotes the Hamming error distance metric, while RMSE denotes the root-mean-squared error; errors-known-true represents the errors when matching Boolean TRUE values among the known values.


List of Algorithms

1 The DBP (ASSO) algorithm provided by Miettinen et al. [2008]
2 Algorithm to construct the association matrix for a data matrix
3 Algorithm to calculate the association confidence among two vectors
4 Algorithm for finding out the factors for a data matrix
5 Algorithm for initializing the Fit matrix
6 Algorithm for calculating the cover value
7 Algorithm for finding the best basis and usage vector
8 Algorithm for updating the mask matrix
9 Algorithm for updating the Fit matrix

To mummy and papa...


Chapter 1

Introduction

1.1 Motivation

Matrix factorization has proven to be of immense utility when it comes to the analysis of large-scale data, by creating low-dimensional models of the problem. Such models facilitate a better understanding of the data observations. Matrix factorizations also help uncover latent information present in the data which usually cannot be perceived from the original data. Modeling decisions in many engineering disciplines assume the distribution of the data along a low-dimensional linear subspace. Many factorization methods, such as principal component analysis (PCA) and its underlying singular value decomposition (SVD), use this principle to provide compact representations of the data while still preserving its essential inherent structure. The data that these methods work on is usually real-valued. However, of late, Boolean data have obtained a special and important place in the domain of data analysis [Li, 2005]. The data in many research and application fields, such as market basket data, document-term data, Web click-stream data (users vs. websites), DNA microarray expression profiles, or protein-protein complex interaction networks [Zhang et al., 2007], find natural representations in the realm of binary matrices. Such matrices, upon factorization, are also expected to yield binary matrices which are interpretable in the problem domain. However, methods such as PCA, SVD, and non-negative matrix factorization (NMF) only work with real-valued matrices and do not provide interpretable factors for Boolean matrices.

Lately, however, the problem of Boolean matrix factorization has been gaining increasing attention in the data mining, knowledge discovery and related research communities. Many research problems in Boolean data analysis can be reduced to Boolean matrix factorization [Lu et al., 2009]. To cite a few examples, Geerts et al. [2004] have proposed the tiling databases problem, which aims to find a set of tiles to cover a binary database.


Miettinen et al. [2008] have recently proposed the discrete basis problem (DBP), where, for a given Boolean matrix D, a lower-dimensional Boolean matrix B consisting of basis vectors is obtained. The basis vectors, in turn, can be used to reconstruct the matrix D. This problem is further developed by Miettinen [2008] to limit the matrix B to a subset of the columns of D. Lu et al. [2008] have used Boolean matrix factorization to solve the role-based access control problem.

Real-life data sets are usually sparse, containing a relatively small percentage of entries whose values are known. The rest of the values are missing from the data set and, hence, are unknown. The available methods for Boolean matrix factorization do not yet handle data sets containing a high percentage of missing values, and are useful only in scenarios where a complete data set is available. Vreeken and Siebes [2008] have proposed a matrix completion method for Boolean matrices containing a limited percentage of missing data, but this method does not scale to real-world data sets. In this thesis, we try to bridge this gap by introducing an algorithm for Boolean matrix factorization with missing values.

1.2 Notation

We use upper-case boldface letters to represent matrices, e.g., D or Dm×n, while vectors are represented by lower-case boldface letters, such as d. A matrix transpose is represented by D^T, while the transpose of a vector is represented as d^T. We assume readers are familiar with MathWorks' MATLAB® notation, as we use it as a shorthand at many points in this thesis. Specifically, we use D(k,:) and D(:,k), respectively, to denote the kth row and column of D. We also use the range operator ':' to denote a range of equally spaced values; for example, v = 10:2:16 is used to imply v = {10, 12, 14, 16}. We use nnz(D) to denote the number of non-zero elements in D.

We use ? to represent the unknowns or missing values in Boolean matrices. Among the known values in the matrices, we use known true to imply Boolean 1 and known false to denote 0. If a matrix Dm×n is a Boolean matrix containing unknowns, then Dm×n ∈ {0, 1, ?}.

Before coming to the Boolean matrix product, we first define regular matrix multiplication. Multiplying matrices Bm×k ∈ R and Xk×n ∈ R gives Dm×n ∈ R, whose entries are given by the dot product of the corresponding row of B and the corresponding column of X:


$$d_{ij} = \sum_{r=1}^{k} b_{ir} x_{rj} \qquad (1.1)$$

where $d_{ij}$ denotes the element at the intersection of D(i,:) and D(:,j), and so on.

Boolean matrix multiplication, denoted by ◦, is based on the same principles as regular matrix multiplication, but uses Boolean addition for performing summations, viz., the rule 1 + 1 = 1. Considering Boolean matrices Bm×k ∈ {0, 1} and Xk×n ∈ {0, 1}, their product B ◦ X gives Dm×n ∈ {0, 1}, where the elements of the product matrix D are given by

$$d_{ij} = \bigvee_{r=1}^{k} b_{ir} \wedge x_{rj} \qquad (1.2)$$
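As a quick illustration (a sketch of ours, not code from the thesis), Equation (1.2) can be computed over 0/1 NumPy arrays by taking the regular integer product of Equation (1.1) and thresholding it, which realizes the rule 1 + 1 = 1:

```python
import numpy as np

def bool_product(B, X):
    """Boolean matrix product B ∘ X: d_ij = OR_r (b_ir AND x_rj)."""
    # The integer product counts the agreements; any positive count becomes 1.
    return (B.astype(int) @ X.astype(int) > 0).astype(int)

B = np.array([[1, 0], [0, 1], [1, 1]])
X = np.array([[1, 0, 1], [0, 1, 1]])
print(bool_product(B, X))  # [[1 0 1] [0 1 1] [1 1 1]]
```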

In this thesis, we use an error metric, which we call the Hamming error rate, based on the Hamming norm, to judge the quality of factorizations. The Hamming norm is defined as the sum of the absolute values of the element-wise differences between a pair of matrices. For two matrices, Am×n and Bm×n, it is given as

$$\|A - B\|_1 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij} - b_{ij}| \qquad (1.3)$$

Coming back to the Hamming error rate: intuitively, it is the rate of disagreements between two matrices in terms of the known values in the first matrix. It involves first finding the number of indices where the pair of matrices disagree in their values, among those indices where the first matrix has non-zero values (these are the indices containing known values). This sum is then divided by the number of non-zero indices in the first matrix, giving the rate of disagreements.

The Hamming error rate for the reconstructed matrix Rm×n ∈ {0, 1} corresponding to a matrix Dm×n ∈ {0, 1, ?} is defined as

$$\text{Hamming error rate} = \frac{\|D - R\|_1}{\operatorname{nnz}(D)} \qquad (1.4)$$

where $\|D - R\|_1$ is calculated only over the known elements in D.
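A minimal sketch of this metric, under the NaN-for-? encoding assumed earlier (the helper name hamming_error_rate is ours):

```python
import numpy as np

def hamming_error_rate(D, R):
    """Rate of disagreements on the known entries of D (Eq. 1.4).

    D: data matrix with NaN marking unknown (?) entries.
    R: reconstructed 0/1 matrix.
    """
    known = ~np.isnan(D)
    disagreements = np.sum(np.abs(D[known] - R[known]))
    return disagreements / known.sum()

D = np.array([[1.0, np.nan, 0.0], [0.0, 1.0, np.nan]])
R = np.array([[1, 1, 1], [0, 1, 0]])
print(hamming_error_rate(D, R))  # 1 disagreement out of 4 known entries -> 0.25
```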

Another error metric used for matrices is the Frobenius norm, given for matrices A and B as

$$\|A - B\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} (a_{ij} - b_{ij})^2} \qquad (1.5)$$


For comparing our results with other methods, we need to calculate the root-mean-squared error for the matrices. For matrices D and R, it is given as

$$\operatorname{rmse}(D, R) = \sqrt{\frac{\sum_{i=1}^{m} \sum_{j=1}^{n} (d_{ij} - r_{ij})^2}{\operatorname{nnz}(D)}} \qquad (1.6)$$

where the sum in the numerator is the squared Frobenius norm, $\|\cdot\|_F^2$.

It must be noted that when the matrices are Boolean, $\|A - B\|_F = \sqrt{\|A - B\|_1}$; hence the rmse can be obtained using the equation

$$\operatorname{rmse} = \sqrt{\text{Hamming error rate}} \qquad (1.7)$$

Thus the main error metric for us is the Hamming error rate.

It must be noted that the matrices dealt with in this thesis are essentially Boolean, as the elements marked ? also stand for either Boolean true or false. The ternary notation is merely a convenience introduced to accommodate missing values. In the applications employing the algorithm proposed in this thesis, such as matrix completion, each ? gives way to either 1 or 0 in the reconstructed matrix.

Let Dm×n be a Boolean matrix; then its Boolean rank, denoted by b(D), is the least r for which there exist Boolean matrices Bm×r and Xr×n such that D = B ◦ X. The Boolean rank is different from the regular rank. The rank of a matrix is the minimum number of linearly independent rows or columns which can be combined to construct the rows or columns of the matrix. With Boolean matrices we use Boolean arithmetic, where the addition operation uses the rule 1 + 1 = 1, which is the basic reason why the Boolean rank differs from the regular rank. Consider the matrix

$$D = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

The regular rank of D is 3, but its Boolean rank is 2, because the column $(1, 1, 1)^T$ can be expressed as the Boolean sum of the remaining columns, $(1, 0, 1)^T$ and $(0, 1, 1)^T$. In this thesis, by rank we mean the Boolean rank, unless otherwise stated.
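To make the distinction concrete, the following sketch (ours) verifies a rank-2 Boolean factorization of this D, even though its regular rank is 3:

```python
import numpy as np

D = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1]])

# A rank-2 Boolean factorization: the two basis columns are
# (1,0,1)^T and (0,1,1)^T; the third data column is their Boolean sum.
B = np.array([[1, 0], [0, 1], [1, 1]])
X = np.array([[1, 0, 1], [0, 1, 1]])

reconstruction = (B @ X > 0).astype(int)  # Boolean product, 1 + 1 = 1
print(np.array_equal(reconstruction, D))  # True
print(np.linalg.matrix_rank(D))           # 3: the regular rank is higher
```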


1.3 Problem definitions

Problem 1 (Boolean matrix factorization with missing values). Given a matrix Dm×n ∈ {0, 1, ?}, find factor matrices Bm×r ∈ {0, 1} and Xr×n ∈ {0, 1} that minimize

$$\|D - B \circ X\|_1 \qquad (1.8)$$

where, as in Equation (1.4), the norm is computed only over the known elements of D.

1.4 Boolean matrix factorization with missing values

A good example of real-life Boolean data with missing values is the 2012 KDD Cup data released by Tencent Weibo [kdd, 2012a], which offers a wealth of social-networking information. The data represents a sampled snapshot of the social-networking website's users' preferences for various items: the recommendations made to users, and their responses, i.e., whether a user accepted to follow the recommended item or rejected it. The goal of the competition is to predict which items a user will follow, based on the revealed entries. This data is clearly Boolean in nature. In addition, we also have the possibility of an unknown recommendation result, arising if the user was never presented with a particular item's recommendation, or if the user did not take an action on a recommendation presented. When this data is captured in a matrix, we have a Boolean matrix along with some unknowns, as presented in Figure 1.1, where a 1 represents that a user agreed with the recommendation to follow the item, and a 0 that the user did not. A '?' implies we do not yet have any information for the user-item combination.

         A  B  C  D  E  F
Amy      1  1  ?  0  ?  ?
William  ?  1  ?  1  0  0
Glenn    0  ?  0  ?  1  ?
Irina    ?  0  0  1  1  1
Danny    1  ?  ?  0  ?  0
Marjika  0  1  ?  1  0  ?
Alan     ?  ?  0  0  ?  ?
Evelyn   1  1  1  ?  0  0

Figure 1.1: An example of Boolean data with unknowns. Users are presented in rows, while items are in columns.


$$D = \begin{pmatrix}
1 & 1 & ? & 0 & ? & ? \\
? & 1 & ? & 1 & 0 & 0 \\
0 & ? & 0 & ? & 1 & ? \\
? & 0 & 0 & 1 & 1 & 1 \\
1 & ? & ? & 0 & ? & 0 \\
0 & 1 & ? & 1 & 0 & ? \\
? & ? & 0 & 0 & ? & ? \\
1 & 1 & 1 & ? & 0 & 0
\end{pmatrix}$$

Let the data in Figure 1.1 be represented in the matrix D. The goal of this thesis is to obtain low-rank factorizations for matrices such as D, i.e., Boolean matrices containing unknowns, ?. Real-world data sets are usually sparse, with a high percentage of ? dominating the matrices; the 2012 KDD Cup data in question here contains over 99% ?. Clearly, the ? entries do not reveal any information regarding the underlying structure of the data, and we depend on the known values, the 1s and 0s, to obtain the basis vectors that can explain the elements of the data.

An example of a good factorization for the matrix in Figure 1.1 is given by the matrices B and X presented below:

$$B = \begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0
\end{pmatrix}, \qquad
X = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}$$

Performing Boolean multiplication between B and X results in the matrix D′ as given below:


$$D' = B \circ X = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0
\end{pmatrix}$$

Here, the matrices D and D′ agree on all the known values, and there are no reconstruction errors; in general, however, this would not be true for matrices of larger dimensions. Ideally, we are interested in factorizations that minimize such errors.
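The agreement above can be checked mechanically. A small sketch (ours), reusing the NaN-for-? encoding and the thresholded Boolean product from Section 1.2:

```python
import numpy as np

q = np.nan  # shorthand for an unknown (?) entry
D = np.array([
    [1, 1, q, 0, q, q],
    [q, 1, q, 1, 0, 0],
    [0, q, 0, q, 1, q],
    [q, 0, 0, 1, 1, 1],
    [1, q, q, 0, q, 0],
    [0, 1, q, 1, 0, q],
    [q, q, 0, 0, q, q],
    [1, 1, 1, q, 0, 0],
])
B = np.array([[1,1,0,0],[1,0,1,0],[0,0,0,1],[0,0,0,1],
              [1,1,0,0],[0,0,1,0],[1,1,0,0],[0,1,1,0]])
X = np.array([[1,1,0,0,0,0],[1,0,0,0,0,0],[0,1,1,1,0,0],[0,0,0,1,1,1]])

D_prime = (B @ X > 0).astype(float)               # Boolean product B ∘ X
known = ~np.isnan(D)
print(np.sum(np.abs(D[known] - D_prime[known])))  # 0.0: no errors on known values
```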

1.5 Organization

The rest of the thesis is organized as follows. Chapter 2 covers earlier research work on related topics. Chapter 3 details the algorithms this thesis provides, while Chapter 4 discusses the experimental setup; the results of the experiments are also presented and discussed in the same chapter. Chapter 5 concludes the thesis with a summary of the contributions and further research directions.

1.6 Contribution

The major contributions made in this thesis are:

• Current matrix factorization methods for Boolean data do not handle missing values. This thesis fills that gap by introducing a method that handles a high percentage of missing values in a Boolean data set.

• The method introduced provides matrix completion with low reconstruction errors and high precision.

• We present experimental results to demonstrate the performance of the algorithm.

  – Results for a set of synthetic data matrices are presented.

  – Results on real-life data sets, including the 2012 KDD Cup¹ and MovieLens² data sets, are presented.

• Direct comparison with existing methods is not possible, as there is no method that works with Boolean data matrices with a majority of unknown values. Still, we try to achieve an indirect comparison with an existing matrix completion method, namely OptSpace [Keshavan et al., 2009], to gain a measure of where our method stands with respect to the existing benchmarks.

• We improve a few aspects of the discrete basis problem proposed by Miettinen et al. [2008]. The most important of these is, of course, handling missing values in the Boolean matrices. The other improvements optimize some aspects of the algorithm, such as the efficient calculation of the cover function and the introduction of what we call the Fit matrix, which records the cover values. These concepts are explained in detail in Chapter 3.

¹ http://www.kddcup2012.org/c/kddcup2012-track1 [Last accessed: 26-10-2012]
² http://movielens.umn.edu/ [Last accessed: 26-10-2012]


Chapter 2

Related Work

In this chapter, prior work related to the topics this thesis covers is presented. In the following sections, we move from a general discussion of matrix factorizations to the core of the thesis: Boolean matrix factorization with missing values.

2.1 Matrix Factorizations

Factorization methods are commonly used in the analysis of multivariate data for various applications, ranging from gene expression data (Zhang et al. [2010]), neuroscience (Van Hamme [2012]), web indexing (Caicedo and Gonzalez [2012]), meteorology (Bergemann et al. [2009]), oceanography (Willis A.J. [1994]) and computer graphics (Gotsman [1994]) to many other applications (Ruiters et al. [2009], Pei and Liu [2008], Klema and Laub [1980]). Matrix factorization methods provide low-dimensional representations capturing the important aspects of the original data. Factorization helps in uncovering the underlying structure of the data, dimensionality reduction, and data compression, among other uses. One of the best-known methods for such factorization is the singular value decomposition (SVD), while non-negative matrix factorization (NMF) is preferred when the data is constrained to be non-negative (Lee and Seung [2000], Devarajan [2008]). Since these methods bear relevance to the problem addressed in this thesis, we discuss them further.

2.1.1 Real valued matrix factorizations

The singular value decomposition (SVD) factorizes a matrix A into three matrices, given by


$$A = U \Sigma V^T \qquad (2.1)$$

where U and V are orthogonal matrices, and Σ is a diagonal matrix with non-negative entries on the diagonal, which are the singular values of A, denoted by σᵢ. In various applications, such as modeling problems and data compression, a low-rank approximation of a matrix is desired. This is a minimization problem in which the cost function measures the fit between a given matrix and an approximating matrix, subject to the constraint that the approximating matrix has a reduced rank. When the fit is measured with respect to the Frobenius norm, i.e.,

$$\min_{A}\ \|D - A\|_F \quad \text{subject to} \quad \operatorname{rank}(A) \le r$$

the optimum solution is provided by the SVD of D, D = UΣV^T, by setting all but the top r diagonal entries of Σ, namely σ₁, ..., σᵣ, to 0. This result is known as the Eckart-Young theorem and is generally attributed to Eckart and Young [1936] (though it is now acknowledged that this result was known to researchers even earlier, as presented by Stewart [1993]).
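A sketch (ours) of this standard rank-r truncation in NumPy, for illustration only:

```python
import numpy as np

def best_rank_r(D, r):
    """Best rank-r approximation of D in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    s[r:] = 0.0                  # zero out all but the top-r singular values
    return (U * s) @ Vt          # reassemble U Σ_r V^T

rng = np.random.default_rng(0)
D = rng.standard_normal((300, 500))
A = best_rank_r(D, 10)
print(np.linalg.matrix_rank(A), np.linalg.norm(D - A))
```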

For those cases where the data is inherently non-negative, for instance by virtue of the application requirements, SVD cannot be used, because the matrices U and V can contain negative values, which are non-intuitive in the context of the non-negative data requirements of the problem. For such cases, NMF (Paatero and Tapper [1994], Lee and Seung [1999]) is used, which allows only non-negative values in its factor matrices. NMF of a non-negative matrix Vm×n provides two non-negative factor matrices Wm×r and Hr×n, whose product provides an approximation to the original matrix V, i.e.,

$$V \approx WH \qquad (2.2)$$

where usually $r \ll m$ and $r \ll n$.

It models the generation of the directly visible variables of V from hidden variables represented in the matrix H. Since an additive linear combination of the hidden variables leads to the original data matrix, the matrix H is regarded as the basis matrix. The matrix W represents the weights by which this linear combination is performed.

Though any cost function could be used to formulate the NMF problem, the original NMF algorithm involved minimization of the Frobenius norm,

$$\|V - WH\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \left(v_{ij} - (WH)_{ij}\right)^2 \qquad (2.3)$$
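For concreteness, here is a minimal sketch of NMF under this cost, using the classical Lee-Seung multiplicative update rules (our illustration; the thesis itself does not implement NMF):

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9):
    """Sketch of NMF via multiplicative updates minimizing Eq. (2.3)."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update basis matrix H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update weight matrix W
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((30, 20)))
W, H = nmf(V, r=5)
print(np.linalg.norm(V - W @ H))  # reconstruction error shrinks over iterations
```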


NMF provides good results with real-valued non-negative data matrices, providing non-negative factor matrices. But when it comes to Boolean data, it too fails to provide useful factors, as its factor matrices contain real values, which are hard to interpret in a binary data context ([Snasel et al., 2008]).

2.1.2 Boolean matrix factorizations for complete matrix

Recent years have seen some good methods for Boolean matrix factorization, such as those provided by Miettinen et al. [2008], Lu et al. [2009] and Snasel et al. [2008]. All of these methods assume complete matrices. Vreeken and Siebes [2008] assume partially filled Boolean matrices, but their method performs well only with a small percentage of missing values. In the following sections we discuss the methods that are related to our work.

2.1.2.1 The Discrete Basis Problem

The main work on which this thesis is based is the discrete basis problem, DBP, by Miettinen et al. [2008], where the authors have introduced a matrix decomposition formulation, providing a greedy solution for Boolean matrix factorization. Note that we discuss this work in more detail in Chapter 3.

The DBP, for a given binary matrix Cm×n and a positive integer k, provides binary matrices Sm×k and Bk×n which minimize

$$|C - S \circ B| = \sum_{i=1}^{m} \sum_{j=1}^{n} |c_{ij} - (S \circ B)_{ij}| \qquad (2.4)$$

The matrix B contains the basis vectors in its rows, each representing a set of correlated attributes, while the rows of S contain information on how each row of C can be expressed using the basis vectors.

The basis vectors represent the correlations among the data vectors, and are calculated using the association rule mining principles given by Agrawal et al. [1993]. However, association rule mining can be used only with data sets all of whose entries are known. As a result, when the DBP is presented with data containing missing values, it fails to obtain factorizations.

In this thesis, we build on the discrete basis problem and propose an algorithm for determining the association among the data vectors even in the presence of missing values. This algorithm forms a key element of our work in solving the problem of Boolean matrix factorization with missing values.


2.1.3 Boolean matrix factorizations for incomplete matrix

Vreeken and Siebes [2008] present a method called Krimp-minimization, KM, to solve the problem of imputation of incomplete data sets, which is similar to the problem this thesis focuses on. They focus on binary databases in particular and categorical databases in general. KM is based on the MDL-based KRIMP algorithm, which provides high-quality data descriptions through compression of the data using frequent item sets [Siebes et al., 2006]. The MDL principle implies that, for an incomplete database with multiple completed databases available, the best among those is the one which can be compressed most, simply because it adheres to the patterns present in the actual database.

Vreeken and Siebes [2008] define a quantitative measure based on the supports of items over the databases to measure the quality of the reconstructed databases. Their algorithm uses a set of item-sets, called a code table (CT), to achieve lossless encoding and decoding of the data. The table associates item-sets with optimal codes, with more frequently occurring item-sets assigned shorter codes. The database is represented in terms of item-sets, which are selected in order to minimize the total encoded size. The encoded size of a transaction represents the likelihood of the transaction, given the data the code table was induced on.

In order to complete a transaction, say t, containing a missing item, one version of the algorithm provided by Vreeken and Siebes [2008] replaces t by the item-set containing t that has the minimum encoded length. However, with this strategy of always choosing the most likely candidate, the algorithm overestimates some item-sets at the expense of others which have a higher encoded length. They present an improved version of the algorithm, called Krimp completion, KC, which chooses the item-set for completion based on a probability distribution proportional to the encoded length of the item-set. However, both these algorithms need enough complete data to produce decent results, and produce arbitrarily bad results otherwise.

The best version of the algorithm presented by Vreeken and Siebes [2008] is called Krimp-minimization, KM, and follows an EM-like approach (Dempster et al. [1977]). KM starts with a random completion of the incomplete database. It iterates through a number of Krimp and KC steps. In the Krimp step, it compresses the current completed database. In the KC step, it completes the incomplete database D using the code table computed in the Krimp step. This is continued as long as the total encoded length of the completed database shrinks. The algorithm returns the final completed database, which will always have the shortest encoded length among the considered completions. The algorithm always terminates, with the total encoded length shrinking at every step; since the encoded size of any database is finite, KM can only execute a finite number of iterations.

The results presented by Vreeken and Siebes [2008], though promising, are good only up to 24% missing values in the data. The method better addresses the class of problems where the known values dominate the missing values. Most real-world data sets, however, have a much higher percentage of unknowns, for which we clearly need a method that does not have such limitations.

The work presented in this thesis is similar to the authors' work in the sense that we also model the data based on local patterns in the data. The authors achieve this by compressing the database: the better the compression, the closer their model is to the underlying distribution. In our work, we try to achieve the same by determining the association accuracies among the data observations: the more correct the determination of the association accuracies, the closer the model is to the underlying distribution.


Chapter 3

Model and algorithm

3.1 The Model

Miettinen et al. [2008], in solving the discrete basis problem given by Equation 2.4 for complete Boolean matrices, exploit the correlation among the columns of the data matrix. The core of their method relies on the association confidence measure as defined by Agrawal et al. [1993]. For a matrix D, the association confidence between the ith and jth columns is represented as c(i ⇒ j, D), and is defined as

$$c\big(D(:,i) \Rightarrow D(:,j)\big) = \frac{\langle D(:,i),\, D(:,j)\rangle}{\langle D(:,i),\, D(:,i)\rangle}, \qquad (3.1)$$

where ⟨·, ·⟩ denotes the vector (dot) product. Here, the column i is called the antecedent, while j is called the consequent of the association rule. Given a threshold τ, an association between columns i and j is τ-strong if c(i ⇒ j, D) ≥ τ.
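For complete 0/1 data this measure is a one-liner; a sketch (ours) over the NumPy encoding used throughout:

```python
import numpy as np

def confidence(D, i, j):
    """Association confidence c(i => j, D) between columns of a 0/1 matrix D."""
    antecedent, consequent = D[:, i], D[:, j]
    return antecedent @ consequent / (antecedent @ antecedent)

D = np.array([[1, 1], [1, 0], [0, 1]])
print(confidence(D, 0, 1))  # <(1,1,0),(1,0,1)> / <(1,1,0),(1,1,0)> = 0.5
```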

The DBP, for a Boolean matrix D, finds two factors: B, called the basis matrix, and X, called the usage matrix. The columns of B are called the basis vectors, whose combination results in an approximate matrix C. The rows of the other factor matrix X, called the usage vectors, determine how the basis vectors are combined while generating C.

Miettinen et al. [2008] determine the factorization for a given Boolean matrix as a three-step process. First, they calculate the association confidence among all vectors of the data matrix D and store them in a matrix called the association matrix, A. Second, the associations are used to form a pool of candidate basis vectors, from which, in the last step, a small set of basis vectors is selected in a greedy way to form the basis matrix. The selection of the basis vectors determines the corresponding usage vector in the usage matrix.

A row A(i,:) of the association matrix has a 1 at index j when the association confidence between the ith and jth columns satisfies c(i ⇒ j, D) ≥ τ. Each row of A is considered a candidate for being a basis vector. The threshold τ controls the level of confidence required to include an attribute in a basis vector candidate, and is assumed to be a parameter of the algorithm. It is from the matrix A that the best k vectors are chosen, in a greedy fashion, to form the basis matrix for a rank-k factorization.

Miettinen et al. [2008], while selecting the best basis vectors, use the concept of a cover function to determine how well a given basis vector can explain the elements in the data matrix. The cover function can be intuitively understood as the number of agreements on 1s between the indices of the vector being covered, say d, and the candidate basis vector covering it, say b. The algorithm provided by the authors is mindful of the fact that, while trying to make a basis vector cover a particular column of a data matrix, it might introduce a 1 at an index of the approximate matrix where there was a 0 in the original data matrix, or vice versa. Since this scenario degrades the factorization quality and ultimately increases the reconstruction error, the algorithm penalizes such negative cover using the weight w−, while at the same time rewarding the covering of 1s in the data matrix using the weight w+. Hence the algorithm greedily chooses the best basis vector on the basis of its net cover value, after adjusting for the reward and penalty. Algorithm 1 presents the algorithm for the DBP, called ASSO, provided by Miettinen et al. [2008] to factorize a Boolean matrix C into Boolean matrices B and X.

Algorithm 1 The DBP (ASSO) algorithm provided by Miettinen et al. [2008]

Input: A matrix C ∈ {0, 1}^{n×m} of data, a positive integer k, a threshold value τ ∈ (0, 1], and real-valued weights w+ and w−.
Output: Matrices B ∈ {0, 1}^{k×m} and X ∈ {0, 1}^{n×k}.

 1: function ASSO(C, k, τ, w+, w−)
 2:   for i = 1, ..., m do                        ▷ Construct matrix A row by row.
 3:     a_i ← (1(c(i ⇒ j, C) ≥ τ))_{j=1}^{m}
 4:   end for
 5:   B ← [ ], X ← [ ]                            ▷ B and X are empty matrices.
 6:   for l = 1, ..., k do                        ▷ Select the k basis vectors from A.
 7:     (a_i, x) ← argmax_{a_i, x ∈ {0,1}^{n×1}} cover([B; a_i], [X, x], C, w+, w−)
 8:     B ← [B; a_i], X ← [X, x]
 9:   end for
10:   return B and X
11: end function
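To make the greedy selection concrete, here is a compact Python sketch of the scheme in Algorithm 1 for complete matrices (our reading of ASSO; the names are ours, and we let w− carry its own sign, matching the convention w− = −1 used in Section 3.2.2 below):

```python
import numpy as np

def asso(C, k, tau, w_plus=1.0, w_minus=-1.0):
    """Sketch of greedy ASSO for a complete 0/1 matrix C of shape (n, m)."""
    n, m = C.shape
    # Candidate basis vectors: row i of A has a 1 at j iff c(i => j, C) >= tau.
    co = C.T.astype(float) @ C                 # <C(:,i), C(:,j)> for all pairs
    self_assoc = np.maximum(np.diag(co), 1.0)  # guard against division by zero
    A = (co / self_assoc[:, None] >= tau).astype(int)

    B = np.zeros((0, m), dtype=int)
    X = np.zeros((n, 0), dtype=int)
    covered = np.zeros_like(C, dtype=bool)     # cells already explained
    for _ in range(k):
        best_gain, best = -np.inf, None
        for i in range(m):
            a = A[i].astype(bool)
            # Net cover gain of candidate a for each data row, counting only
            # not-yet-covered cells: reward covered 1s, penalize covered 0s.
            new = a[None, :] & ~covered
            gains = (w_plus * ((C == 1) & new).sum(axis=1)
                     + w_minus * ((C == 0) & new).sum(axis=1))
            x = gains > 0                      # use candidate a for row r iff it helps
            total = gains[x].sum()
            if total > best_gain:
                best_gain, best = total, (A[i], x.astype(int))
        a, x = best
        B = np.vstack([B, a])                  # append the chosen basis vector
        X = np.hstack([X, x[:, None]])         # append the matching usage vector
        covered |= x[:, None].astype(bool) & (a == 1)[None, :]
    return B, X
```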


Before presenting our algorithm to factorize a Boolean matrix containing missing values, let us take a small example to understand how the method provided by Miettinen et al. [2008] works for a complete Boolean matrix. It must be noted that our algorithm makes some minor modifications to the approach used by Miettinen et al. [2008]: we calculate the association among the rows of the data matrix D, rather than the columns, and as a result, the candidate basis vectors are available as the columns of the association matrix, not the rows. This is more of an implementation distinction towards solving the same problem, and does not modify the conceptual basis of their method. Other than this, the approach to factorizing a complete Boolean matrix remains the same between the two methods. Our method diverges when missing values are present in the Boolean data matrices; we explain our algorithm later in the chapter.

Suppose we have the data set represented by D, and we wish to obtain a rank-3 factorization of it.

$$D = \begin{pmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1
\end{pmatrix}$$

We calculate the association confidence among all pairs of rows of D and store it in the matrix A′, where A′(j, i) denotes the association confidence score of D(i,:) with D(j,:).

$$A' = \begin{pmatrix}
1.00 & 0.75 & 0.5 & 1.00 & 0.67 \\
1.00 & 1.00 & 1.0 & 1.00 & 1.00 \\
0.33 & 0.50 & 1.0 & 0.33 & 0.67 \\
1.00 & 0.75 & 0.5 & 1.00 & 0.67 \\
0.67 & 0.75 & 1.0 & 0.67 & 1.00
\end{pmatrix}$$

Using the association threshold τ = 0.8, we obtain a Boolean association matrix A from A′, which contains the candidate basis vectors. All values greater than or equal to τ in A′ are stored in A as a 1, and 0 otherwise.

$$A = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1
\end{pmatrix}$$


Now, to obtain the rank-3 factorization of D, we first see how each column of D can be constructed using the columns of A. The algorithm, using the cover values, comes up with the following combinations:

D(:,1) = A(:,1) + A(:,3)
D(:,2) = A(:,1) + A(:,5)
D(:,3) = A(:,1)
D(:,4) = A(:,3)

Representing the same in matrix product form,

$$D = \begin{pmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1
\end{pmatrix} = \begin{pmatrix}
1 & 0 & 0 \\
1 & 1 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{pmatrix} \circ \begin{pmatrix}
1 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0
\end{pmatrix} = B \circ X$$

where the columns of B are the basis vectors chosen to reconstruct the data matrix D, while the value $x_{ij}$ is non-zero if the cover value of B(:,i) for the column D(:,j) is non-zero. It can be observed that the non-zero values in a column of X contain the information on which basis vectors have been used to create a particular column of D. For example, X(:,1) is $(1, 1, 0)^T$, implying that in the reconstructed matrix, D(:,1) is represented using the Boolean addition of B(:,1) and B(:,2).

The premise behind using the above approach to factorize a matrix is that each basis vector can be seen as comprising a set of features, represented by attributes, which can be used to explain the features of the observed data points. For example, all the points in the y-z plane can be represented using the basis vector b = (0, 1, 1); in other words, the basis vector b can explain all points in the y-z plane. However, b cannot completely explain the points in the z-x plane, or in 3-D space, as it lacks the attribute necessary for the x-axis. Of course, b can partly explain a point lying in 3-D space. By calculating the association confidence among the rows of the data matrix, we try to determine whether a particular attribute, say v, can be used to explain the corresponding portion of an observation point. If yes, then all those basis vectors that contain the attribute v are candidates to explain not only that particular observation, but all the observations which contain v. Once a set of candidate basis vectors is determined, we choose the best r such basis vectors for a rank-r factorization. The greedy solution always chooses the basis vector that has the maximum number of attributes, provided it can explain more observations in the data set than any other basis vector. However, as will be described later, in trying to explain a part of an observation, even the 'best' candidate basis vector can wrongly and eagerly introduce some attributes that were actually absent from the observation. On the other hand, for the same reason, there might be attributes of some data points that no basis vector can explain. These are the reasons for the divergence of a reconstructed matrix from the original. Our goal is to obtain factorizations such that this divergence is minimized.

3.2 The Concepts

We start by describing the concepts that will help better explain the algorithm we present later in the chapter.

3.2.1 Calculating association with missing values

A major contribution of this thesis is to determine the correlation among data vectors even when ?s occur alongside Boolean values in the matrix. We modify the original definition of association confidence to include a probabilistic component, which we call the record-bias. It is directly proportional to the amount of knowledge contained in a vector, and it extrapolates the information already provided by a particular record (e.g., a user) in determining the value to be used for calculating the association matrix. Intuitively, it predicts what would have been in the place of a ?, given the user's bias to positively rate items. Quantitatively, it is the ratio between the known true elements and all the known elements in a Boolean vector. For Dm×n ∈ {0, 1, ?}, record-bias(D(k,:)) is defined as

$$\text{record-bias}\big(D(k,:)\big) = \frac{\sum_{j=1}^{n} (d_{kj} = 1)}{\sum_{j=1}^{n} (d_{kj} = 0) + \sum_{j=1}^{n} (d_{kj} = 1)} \qquad (3.2)$$


Next, we detail the strategy we adopt to handle unknowns, ?, while calculating the association confidence scores. In each case pair below, the first symbol is the antecedent's entry and the second the consequent's; a sketch implementing these rules follows the worked example below.

1. (1 & ?) This case contributes the maximum score of 1 to the self-association (the denominator component of the association confidence calculation), while towards the inter-row association score (the numerator component) it contributes the value obtained by replacing the ? with the record-bias (of the consequent's row).

2. (0 & ?) In this case, the antecedent itself is absent; hence this case makes no contribution to either of the association calculations.

3. (? & 0) This case does not contribute to the inter-row association score. This is because we have two possibilities for the ?: 0 or 1. Suppose there was a 0 in place of the ?; then we are left with a pair of 0s, which implies both the antecedent and the consequent are absent, and we can ignore the contribution to the inter-row association score. On the other hand, if we had a 1 in place of the ?, the consequent would still be absent; hence no contribution is made to the inter-row association score in either case. However, to the self-association it does contribute the value obtained by replacing the ? with the record-bias.

4. (? & 1) In this case, we could have two values for the ?. In case there was a 1, we would have an equal contribution to both the self-association and the inter-row association score. Hence we could either add the same component (the record-bias of d1) to both scores, or take a conservative approach and not do anything. We decided to settle for the latter and not add it.

5. (? & ?) If we have to calculate the association between a pair of ?s, we choose to ignore the case, as we do not have enough information to deal with it in an informed manner.

6. All cases not involving ? For the cases not involving a ?, we follow the regular rules of association mining.

Let us understand this with an example. Suppose we are calculating the association confidence of d1 with d2,

d1 = (?  1  1  0  ?  ?)
d2 = (0  ?  1  ?  ?  0)

It can be observed that only four index pairs contribute to the association confidence score between the two vectors: (? & 0) at indices 1 and 6, (1 & ?) at index 2, and (1 & 1) at index 3. The remaining index pairs do not contribute to the score.

Hence the confidence between the vectors, using record-bias(d1) = 2/3 and record-bias(d2) = 1/3, is

$$c(d_1 \Rightarrow d_2) = \frac{\langle d_1, d_2\rangle}{\langle d_1, d_1\rangle} = \frac{\left(1 \cdot \tfrac{1}{3}\right) + (1 \cdot 1)}{\left(\tfrac{2}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2 + 1^2 + 1^2} \approx 0.462$$
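A sketch (ours) of these case rules, under the NaN-for-? encoding; it reproduces the value above:

```python
import numpy as np

def record_bias(v):
    """Fraction of known-true entries among the known entries (Eq. 3.2)."""
    known = ~np.isnan(v)
    return np.sum(v[known] == 1) / known.sum()

def assoc_confidence(d1, d2):
    """Association confidence c(d1 => d2) under the missing-value case rules."""
    rb1, rb2 = record_bias(d1), record_bias(d2)
    num = den = 0.0
    for a, c in zip(d1, d2):
        if np.isnan(a) and np.isnan(c):      # (? & ?): ignored
            continue
        if np.isnan(a):                      # antecedent unknown
            if c == 0:                       # (? & 0): self-association only
                den += rb1 ** 2
            # (? & 1): conservatively ignored
        elif np.isnan(c):                    # consequent unknown
            if a == 1:                       # (1 & ?): impute the consequent
                den += 1.0
                num += 1.0 * rb2
            # (0 & ?): no contribution
        else:                                # both known: regular rules
            den += a * a
            num += a * c
    return num / den

q = np.nan
d1 = np.array([q, 1, 1, 0, q, q])
d2 = np.array([0, q, 1, q, q, 0])
print(round(assoc_confidence(d1, d2), 3))  # 0.462
```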

3.2.2 The cover value and the mask

Each basis vector that gets selected as part of the factorization explains a few entries of the data matrix, which the later basis vectors need not attempt to explain. One reason for this is that we do not gain anything by trying to explain an already covered entry with a new basis vector: according to Boolean addition, once we attain a 1, it will remain a 1 no matter how many quantities are added to it. There is another, more serious reason why we keep the already covered indices out of bounds for the later basis vectors: the risk of flipping a value in the data matrix, i.e., turning a 1 into a 0 and vice versa. The mask matrix maintains this information and, during the initial steps of the algorithm, masks away all the indices where the data matrix has unknowns, as we cannot cover such values in the matrix. The algorithm provides a reward weight, w+, for estimating a 1 where the original matrix also had a 1 at the same index, and a penalizing weight, w−, where a 0 in the original matrix has been estimated to be a 1. Currently we assume equal weightage, and use w+ = 1 and w− = −1. This, however, can be tuned, if needed, according to the distribution of known true and known false values in the data.

3.2.3 The fit matrix

In this thesis, we introduce the concept of the fit matrix, which maintains a current snapshot of the net cover values of all the candidate basis vectors for each data column. Suppose we denote the cover value of the column A(:, j) for a column D(:, k) by cov(k, j). For a given matrix D, the value stored at any index of the fit matrix F is then F(k, j) = cov(k, j).

Hence, a given column F(:, j) contains the cover values of all the columns of D, stored at their respective indices. This implies that the best basis vector at any given time is the vector A(:, j) whose corresponding column F(:, j) has the highest column-wise sum. This not only makes the task of picking the best basis vector straightforward, but also helps in identifying those basis vectors whose cover values need to be re-calculated. Suppose a particular element dlm got covered by some basis vector. It is easy to see that we need to update only the cover values stored in the row F(m, :), which in turn involves calculating the cover values of all basis vectors for the column D(:, m). This approach is a major improvement over the earlier approach adopted by Miettinen et al. [2008], where the cover values of all the candidate basis vectors associated with all the data vectors were re-calculated with each basis vector selection. With this improvement, as the algorithm progresses, it takes progressively less time to update the fit matrix and proceed to find the next basis vectors. The improvement is more noticeable with high-dimensional matrices.

3.3 The Algorithm

Given the data matrix Dm×n ∈ {0, 1, ?}, the algorithms presented in this section factorize the matrix into two factor matrices, Bm×r ∈ {0, 1} and Xr×n ∈ {0, 1}. The first step is to obtain the vector ratio1×n ∈ R, which is then used by Algorithm 2 to construct the association matrix. The association matrix returned by the algorithm comprises real-valued entries, which denote the association between each pair of rows of D. The maximum possible value in this matrix is 1.0, occurring for highly correlated vectors, such as when the association of a vector with itself is computed. As a result, the diagonal entries of the association matrix are always 1.

Algorithm 2 uses Algorithm 3 to compute the association confidence scores for Boolean vectors containing unknowns (?).

Algorithm 2 Algorithm to construct the association matrix for a data matrix

Input: Data matrix Dm×n ∈ {0, 1, ?}
Output: Matrix Am×m ∈ R
1: function CreateAssociationMatrix(D)
2:     ratio ← CalculateRatioVector(D)
3:     for i = 1 … m do
4:         for j = 1 … m do
5:             A(j, i) ← calcAssociationScore(D(i, :), D(j, :), ratio)
6:         end for
7:     end for
8: end function

Once the association matrix is constructed, it is converted to Boolean format using the threshold parameter τ. The association matrix then contains the candidate basis vectors, from which the vectors with the highest cover values for the columns of the data matrix D are chosen in a greedy manner.


Algorithm 3 Algorithm to calculate the association confidence between two vectors

Input: Row vectors D(i, :), D(j, :) ∈ {0, 1, ?} and the ratio vectors ratio_i, ratio_j, each 1×n and real-valued
Output: associationScore ∈ R
1: function calcAssociationScore(D(i, :), D(j, :), ratio)
2:     cumulativeScore ← 0
3:     pairAssociation ← 0
4:     selfAssociation ← 0
5:     for l = 1 … n do
6:         if di(l) = 1 & dj(l) = 1 then
7:             pairAssociation ← 1
8:             selfAssociation ← selfAssociation + 1
9:         else if di(l) = 1 & dj(l) = 0 then
10:            pairAssociation ← 0
11:            selfAssociation ← selfAssociation + 1
12:        else if di(l) = 1 & dj(l) = ? then
13:            pairAssociation ← ratio_j(l)
14:            selfAssociation ← selfAssociation + 1
15:        else if di(l) = ? & dj(l) = 0 then
16:            pairAssociation ← 0
17:            selfAssociation ← selfAssociation + ratio_i(l)
18:        else if di(l) = ? & dj(l) = ? then
19:            pairAssociation ← 0
20:        else if di(l) = ? & dj(l) = 1 then
21:            pairAssociation ← 0
22:        else if di(l) = 0 & dj(l) = ? then
23:            pairAssociation ← 0
24:        else if di(l) = 0 & dj(l) = 1 then
25:            pairAssociation ← 0
26:        else if di(l) = 0 & dj(l) = 0 then
27:            pairAssociation ← 0
28:        end if
29:        cumulativeScore ← cumulativeScore + pairAssociation
30:    end for
31:    if selfAssociation ≠ 0 then
32:        associationScore ← cumulativeScore / selfAssociation
33:    else
34:        associationScore ← 0
35:    end if
36: end function


Algorithm 4 describes the steps involved. It starts by invoking the procedure InitializeMaskMatrix, which initializes the mask matrix Mm×n ∈ {0, 1}. The matrix M has a 0 at every index for which the cover value need not be computed while constructing or updating the fit matrix. It is initialized to contain a 1 at every index where D has a known value (0 or 1) and a 0 where D contains ?, and it is subsequently updated as the algorithm proceeds.

Algorithm 4 Algorithm for finding the factors of a data matrix

Input: Data matrix Dm×n ∈ {0, 1, ?}, Association matrix Am×m ∈ {0, 1}, a positive integer r ≤ min{m, n}
Output: Basis matrix Bm×r ∈ {0, 1} and the usage matrix Xr×n ∈ {0, 1}
1: function FactorDataMatrix(D, A, r)
2:     InitializeMaskMatrix( )
3:     InitializeFitMatrix( )
4:     repeat
5:         [bestColB, bestRowX] ← FindBestVectorsBandX( )
6:         add bestColB to B    ▷ bestColB: current best basis vector
7:         add bestRowX to X    ▷ bestRowX: current best usage vector
8:         UpdateMaskMatrix(bestColB, bestRowX)
9:         UpdateFitMatrix(bestRowX)
10:    until (r factors found)
11: end function

Algorithm 5 Algorithm for initializing the fit matrix

Input: Data matrix Dm×n ∈ {0, 1, ?}, Association matrix Am×m ∈ {0, 1}, Mask matrix Mm×n ∈ {0, 1}
Output: Fit matrix Fn×m ∈ R
1: function InitializeFitMatrix
2:     for s = 1 … n do
3:         for t = 1 … m do
4:             F(s, t) ← calculateCover(D(:, s), A(:, t), M(:, s))
5:         end for
6:     end for
7: end function

In the next step, the fit matrix F is initialized using the procedure InitializeFitMatrix, described in Algorithm 5. The entry F(s, t) stores the cover value associated with the candidate basis vector A(:, t) for the column D(:, s). The function calculateCover, which calculates the cover value, is presented in Algorithm 6. The mask matrix M is used to determine whether the cover value associated with a particular index needs to be computed.

The algorithm iterates to find the best r basis vectors and the corresponding usage vectors. The procedure FindBestVectorsBandX, described in Algorithm 7, returns the basis vector that covers the maximum number of yet-uncovered elements of D.


Algorithm 6 Algorithm for calculating the cover value

Input: Data column vector D(:, k), Association column vector A(:, l), Mask column vector M(:, k), reward weight w+ = 1, and penalizing weight w− = −1
Output: cover value, cover ≥ 0
1: function calculateCover
2:     p ← 0    ▷ count of rewarded indices
3:     n ← 0    ▷ count of penalized indices
4:     len ← length(D(:, k))
5:     for i ∈ 1 : len do
6:         p ← p + 1, if M(i, k) = 1 and D(i, k) = 1 and A(i, l) = 1
7:         n ← n + 1, if M(i, k) = 1 and D(i, k) = 0 and A(i, l) = 1
8:     end for
9:     if ((w+ ∗ p) + (w− ∗ n)) < 0 then
10:        cover ← 0
11:    else
12:        cover ← (w+ ∗ p) + (w− ∗ n)
13:    end if
14: end function

Algorithm 4 then adds the current best basis column vector and usage row vector to B and X, respectively, using the procedure FindBestVectorsBandX described in Algorithm 7. Which elements of the data matrix the algorithm attempts to cover is decided by the mask matrix M. The procedure FindBestVectorsBandX chooses the best column A(:, k) to be added to B, together with the corresponding usage vector X(k, :).

Algorithm 7 Algorithm for finding the best basis and usage vectors

Input: The fit matrix Fn×m ∈ R
Output: vectors bestB, bestX
1: function FindBestVectorsBandX
2:     bestIndex ← k′ : sum(F(:, k′)) is maximum among all F(:, k)
3:     bestB ← A(:, bestIndex)
4:     bestX ← rowVec^T, where rowVec ← (F(:, bestIndex) > 0)    ▷ making a Boolean row out of F(:, bestIndex)
5: end function

After one pair of bestB and bestX has been chosen and added to B and X, respectively, M needs to be updated to reflect the recently covered indices in D; the procedure UpdateMaskMatrix performs this task. The columns D(:, k) that were partially covered during the last iteration need their cover values recalculated, which is performed by the procedure UpdateFitMatrix.


Algorithm 8 Algorithm for updating the mask matrix

Input: The mask matrix Mm×n ∈ {0, 1}, vectors B(:, k) and X(k, :)
Output: The mask matrix Mm×n ∈ {0, 1}
1: procedure UpdateMaskMatrix(B(:, k), X(k, :))
2:     tempMatrix ← B(:, k) X(k, :)
3:     M(tempMatrix) ← 0    ▷ the non-zero indices get covered in the mask matrix
4: end procedure
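In the array sketch used so far, this update is a one-liner over the rank-1 block (our own illustration):

```python
import numpy as np

def update_mask_matrix(M, b, x):
    """Mask out the indices covered by the rank-1 block b x (Algorithm 8 sketch)."""
    M &= ~np.outer(b, x).astype(bool)   # covered indices leave the game
```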

Algorithm 9 Algorithm for updating the fit matrix

Input: The fit matrix Fn×m and the vector X(k, :)
Output: The fit matrix Fn×m
1: procedure UpdateFitMatrix(X(k, :))
2:     nonZeroIndices ← find(X(k, :))    ▷ the non-zero indices in X(k, :)
3:     for k ∈ nonZeroIndices do
4:         for l = 1 … m do    ▷ m is the number of candidate columns in A
5:             F(k, l) ← calculateCover(D(:, k), A(:, l), M(:, k))
               ▷ re-calculate the cover values of all candidates for each column D(:, k) that is non-zero in X(k, :)
6:         end for
7:     end for
8: end procedure
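And the selective re-scoring of Algorithm 9, continuing the same sketch (calculate_cover as before):

```python
import numpy as np

def update_fit_matrix(F, D, A, M, x):
    """Re-score only the data columns the last usage row touched (Algorithm 9 sketch)."""
    for k in np.flatnonzero(x):              # columns D(:, k) covered last round
        for j in range(A.shape[1]):
            F[k, j] = calculate_cover(D[:, k], A[:, j], M[:, k])
```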

The algorithm halts after selecting r such pairs of bestB and bestX.


Chapter 4

Experiments and Results


The algorithms introduced in this thesis were tested on both real-world and synthetic data sets. The first real-world data set is the track1 data set from the 2012 KDD Cup competition^1, related to Tencent Weibo, a Chinese microblogging website launched in 2010. The official problem statement of the competition is: "Predict which users (or information sources) one user might follow in Tencent Weibo." This data set was chosen because it is the largest data set that inherently shares the features this thesis addresses, namely Boolean matrix factorization with missing values.

The second real-world data set was obtained from the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)^2. Among the data sets released by the workshop, the movie-ratings data set of the MovieLens^3 system was chosen for the experiments in this thesis, as its characteristics fit the problem statement. The MovieLens system is a movie recommender system managed by GroupLens Research^4, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities.

To put our results into perspective, we compare them with existing low-rank matrix-completion methods. However, since no methods exist that factorize Boolean matrices with a high number of missing values, we selected a couple of methods that solve a closely related problem: the factorization of real-valued matrices with hidden or missing values. We reduce their reconstructed matrices to Boolean format to achieve a comparison with our results, while acknowledging that this is not a naturally comparable scenario.

1. http://www.kddcup2012.org/c/kddcup2012-track1 [Last accessed: 26-10-2012]
2. http://ir.ii.uam.es/hetrec2011 [Last accessed: 26-10-2012]
3. http://movielens.umn.edu/ [Last accessed: 26-10-2012]
4. http://www.grouplens.org/ [Last accessed: 26-10-2012]


Table 4.1: Summary of the data characteristics used for the synthetic data experiments
(for all batches: τ = 0.2:0.1:0.8, rows = 300, cols = 500)

Batch            Density (%)   Unknown (%)        Noise (%)   Rank
Density Batch    10:10:90      20                 5           10
Unknown Batch    10            30:10:90, 94, 98   5           10
Noise Batch I    10            20                 5:5:20      10
Noise Batch II   10            70                 5:5:20      10
Rank Batch       10            20                 5           10:10:50

4.1 Synthetic Data

We perform experiments with synthetic data to observe how the quality of the factorization changes as important properties of the data are varied. We identified density, unknowns, noise, and rank as the features that are interesting from the algorithm's perspective. Though there could be many permutations of these features, we usually modify one feature at a time, keeping all others constant. The rate of errors in the reconstructed matrix is used to judge the algorithm's performance on synthetic data; for the experiments with the real-world data, we also calculate the root-mean-square error.
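Both measures can be computed over the known entries only; a minimal sketch, assuming ? is encoded as −1:

```python
import numpy as np

UNKNOWN = -1

def error_rates(D, R):
    """Hamming error rate and RMSE of reconstruction R on known entries of D."""
    known = D != UNKNOWN
    hamming = float(np.mean(D[known] != R[known]))
    rmse = float(np.sqrt(np.mean((D[known] - R[known]) ** 2.0)))
    return hamming, rmse   # for 0/1 data, rmse == sqrt(hamming)
```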

All synthetic matrices are constructed to be of dimension 300 × 500. The association threshold τ is a parameter of the algorithm. In these experiments we try to understand how the algorithm's behavior changes with different values of τ; hence, we repeat the experiments with τ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. Later, in the results section, we present and discuss the representative cases with the best and worst τ.

4.1.1 Density

The density of a matrix roughly translates to its information content, providing valuable insight into its underlying structure, all the more so when the majority of its values are unknown. Our algorithm depends on the principles of association rule mining to build the association matrix, from which the basis vectors are chosen during factorization. A higher-density matrix invariably boosts the probability of 1s in the association matrix, which enables a better segregation of good basis vectors from not-so-good ones based on their cover values. Hence a high density should lead to better factors.


4.1.1.1 Expected Output

A lower-density matrix does not help much in revealing the correct underlying structure of the matrix. The cover function has fewer 1s to work with, and even the best available basis vectors have low cover values. Hence the algorithm cannot effectively distinguish good basis vectors from less optimal ones, leading to higher reconstruction errors on account of poor factorizations.

4.1.1.2 The Approach

Using the combination of parameters listed in Table 4.1, 10 random matrices are generated for each density value. Each matrix is split into a training data matrix (with 80% of the known values) and a testing data matrix (with the remaining 20% of the known values). Two factor matrices are obtained for the training data, which are then used to explain the known values in the testing data. The Hamming error rates are calculated for the testing and training data matrices; a sketch of this procedure follows.
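The thesis does not spell the generation and splitting out in code, but one plausible reading is the following sketch (hypothetical helper names; a random rank-r Boolean product at roughly the target density, followed by noise flips, hidden ?s, and an 80-20 split of the known entries):

```python
import numpy as np

UNKNOWN = -1
rng = np.random.default_rng(42)

def make_synthetic(m=300, n=500, r=10, density=0.20, noise=0.05, unknown=0.20):
    """Rank-r Boolean matrix with roughly the target density, plus noise and ?s."""
    p = 1.0 - (1.0 - density) ** (1.0 / r)    # per-factor probability of a 1
    B = (rng.random((m, r)) < np.sqrt(p)).astype(np.int8)
    X = (rng.random((r, n)) < np.sqrt(p)).astype(np.int8)
    D = (B @ X > 0).astype(np.int8)           # Boolean product of the factors
    flip = rng.random(D.shape) < noise        # flip a random fraction of entries
    D[flip] = 1 - D[flip]
    D[rng.random(D.shape) < unknown] = UNKNOWN
    return D

def split_known(D, train_frac=0.80):
    """Hide a random 20% of the known entries in a disjoint test matrix."""
    train, test = D.copy(), np.full_like(D, UNKNOWN)
    rows, cols = np.nonzero(D != UNKNOWN)
    held = rng.random(rows.size) >= train_frac
    train[rows[held], cols[held]] = UNKNOWN
    test[rows[held], cols[held]] = D[rows[held], cols[held]]
    return train, test
```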

4.1.1.3 Choice of remaining data parameters

• Unknowns
Since the goal of the experiments is to measure the performance of the algorithm on matrices containing missing values, 20% missing (unknown) values are introduced in all the matrices. Combined with low density values, this might prove to be the dominant data property, impacting the factor quality significantly, but as we move to higher density values its effect is bound to be less pronounced.

• Noise
All the experimental data sets are generated to contain a small percentage of random noise to simulate real-world data. We introduce 5% randomly distributed noisy indices in all experiments, where the indices containing known values have their values flipped.

• Rank and Dimension
Our algorithm requires the factorization rank in advance, unlike some other methods employing SVD, which 'guess' it from the data itself (e.g., Keshavan et al. [2009]). The rank corresponds to the number of variables that have gone into generating the data; for the size of the synthetic data sets (300 × 500), we chose a rank of 10.


4.1.2 Unknowns

Here we observe the performance of the algorithm as the proportion of unknowns in the matrices varies. When the association confidence of a row, say r1, is calculated with another row, r2, the ?s in r1 are replaced by the ratio of Boolean true values in r1; thereafter, the association calculation proceeds according to Algorithm 3. Hence, with unknowns in the matrices, the association confidence calculation comes to depend on such replacements, which are, at best, approximations. As the percentage of ?s increases, the generation of the candidate basis vectors is influenced more by these approximations than by certain information in the data, leading to poorer factor matrices.

4.1.2.1 Expected Output

The reconstruction errors are expected to increase linearly with an increase in the percentage of unknowns.

4.1.2.2 The Approach

The approach is similar to that of Section 4.1.1.2, but here the focus is on the percentage of ?s.

4.1.2.3 Choice of data parameters

• Density
As discussed, the density of a row, which is directly proportional to the overall density of the matrix, is used to replace the ?s in the association confidence calculation. We keep the density at 10% in these experiments to ensure the algorithm has just enough Boolean true values to assist in this replacement.

The rationale for choosing the other parameter values in this part of the experiment remains the same as in Section 4.1.1.3.

4.1.3 Noise

Here we observe the change in the quality of the factorization as the noise in the data matrices varies. Real-life data sets are noisy, which affects the underlying structure in the data. Since noise is usually randomly distributed in the matrix, a very high noise percentage can distort the structure so much that the algorithm ends up trying to determine structure where none exists. Noisy data is different from data containing unknowns, where the algorithm is forced to make approximations; here, the algorithm is presented with data in which random values have been flipped, leading to incorrect calculation of the association scores.

4.1.3.1 Expected Output

With increasing noise, the factor matrices obtained tend to focus increasingly on explaining noisy bits, and as a result the estimated structure drifts further away from the real one, resulting in an increase in the reconstruction errors.

4.1.3.2 The Approach

This part of the experiments is divided into two parts: one where we vary the noise with a low percentage of ?s in the data, and another where we vary the noise at the same rate but with a higher percentage of unknowns. A high percentage of unknowns is expected to offset the distortion caused by the noise to an extent; on the other hand, data containing higher noise percentages along with a high unknown percentage is expected to produce much higher errors. Table 4.1 lists the parameters chosen.

4.1.3.3 Choice of data parameters

• Noise
The noise value is varied from a low of 5% to a maximum of 20%; beyond this, the algorithm is very likely to infer a wrong structure, and errors are expected to proliferate.

• Unknowns
A low value of 20% unknowns provides a benchmark against which the error figures for each noise value can be compared with their higher-unknown counterpart of 70%, which more closely simulates real data, where high percentages of unknowns are common.

The rationale for choosing the other parameter values in this part of the experiment remains the same as in Section 4.1.1.3.

4.1.4 Rank

Usually the rank of a data matrix is not known in advance, and even if the rank r is somehow known, as for the synthetic data generated here, there remains the problem of finding the exact combination of the r vectors that produced the data matrix. In this experiment, we focus on how the reconstruction errors behave when the algorithm has increasingly many possibilities for combining the linearly independent vectors, owing to the increasing rank.

4.1.4.1 Expected Output

For a matrix of constant size, the reconstruction errors are expected to increase with increasing rank. By factorizing a rank-r matrix, we obtain r basis vectors, and different combinations of these vectors produce different matrices. The higher the value of r, the more such combinations, and hence the more resulting matrices are possible. Correspondingly, it becomes increasingly difficult to find the particular combination of vectors that generated the original matrix, and hence the probability of reconstruction errors increases.

4.1.4.2 The Approach

The approach is similar to that of Section 4.1.1.2, but here the focus is on the rank.

4.1.4.3 Choice of data parameters

• Rank
Very high rank values are not chosen, in order to avoid over-fitting.

The rationale for choosing the other parameter values in this part of the experiment remains the same as in Section 4.1.1.3.

4.2 Results for synthetic data

We now present and discuss the results obtained for the synthetic data. All the plots in this section depict average reconstruction errors, and each figure shows a pair of contrasting results obtained with different association threshold values τ.

4.2.1 Varying density

Figure 4.1 shows how the reconstruction error rates change when the density of the matrices is varied. Figure 4.1a presents the good reconstruction obtained with τ = 0.5. Low-density matrices contain few 1s, and their associated association matrix is also sparse, containing very few 1s. The probability of introducing a 1 where the original matrix had a 0 is reduced when the basis vectors themselves are sparse; hence the low errors at low density.

As the density is increased, we have more information when calculating the correlation among the rows, but at the same time the probability of making an error in the reconstructed matrix also increases, as there are increasingly many indices to cover. This explains the rise in the curve. At higher densities, the 1s dominate the data matrices and, as a result, the basis vectors too. In such circumstances the probability of failing to cover a 1 is very low, because even if one basis vector misses a few 1s, some other basis vector will cover those values.

In Figure 4.1b, where τ = 0.8, the training and testing error rates lie farther apart, implying over-fitting. With a higher τ, the association matrix is sparser, and when choosing the best basis vectors, those which explain the training data best are preferred, leading to higher errors when we attempt to explain the testing data using those factor matrices.

[Figure 4.1: Varying density: (a) best (τ = 0.5) and (b) worst (τ = 0.8) error rates. Matrix dimensions (300, 500), rank 10, unknowns 20%, noise 5%. Both panels plot the rate of errors against density (%), with Test (µ ± σ) and Train (µ ± σ) curves.]

4.2.2 Varying unknowns

Figure 4.2a shows the best reconstruction for varying unknowns, obtained using τ = 0.8. When the data matrix contains a high percentage of unknowns, the row's density is used to replace the ?s while constructing the association matrix.


[Figure 4.2: Varying unknowns: (a) best (τ = 0.8) and (b) worst (τ = 0.2) error rates. Matrix dimensions (300, 500), rank 10, density 10%, noise 5%. Both panels plot the rate of errors against unknowns (%), with Test (µ ± σ) and Train (µ ± σ) curves.]

Since the known values in the data rows are few, the basis vectors embed far more of the global pattern information of the data. The local pattern knowledge, which comes from the known values when calculating the association confidences, is low in such cases. Coupled with a high threshold value, this leads to fewer 1s in the association matrix. Hence, when the testing data is explained using such basis vectors, the error rate is almost constant. The training error is low presumably at those points where some rows had known values dominating the unknowns, leading to over-fitting.

On the other hand, when the threshold value is relaxed, as in Figure 4.2b with τ = 0.2, the number of 1s in the association matrix increases. The training error is lower than in the previous case (Figure 4.2a), as there are now more 1s in the basis vectors to cover the 1s in the data matrix. The large difference between the training and testing errors is due to heavy over-fitting.

4.2.3 Varying noise

4.2.3.1 Low unknowns

Figure 4.3 shows the results with varying noise in the data. Figure 4.3a shows the plot obtained using τ = 0.3 when the matrices contain a low percentage of unknowns (20%). The association confidence calculations are then generally based on known information (0s and 1s), and the approximations arising from the presence of ?s are few. Hence the basis vectors are able to explain the data well, as can be seen from the near overlap of the training and testing error curves. The error increases with increasing noise because, even though the factor matrices explain the data well, the reconstructions differ at the noisy indices.

[Figure 4.3: Varying noise (with low unknowns): (a) best (τ = 0.3) and (b) worst (τ = 0.8) error rates. Matrix dimensions (300, 500), rank 10, density 10%, unknowns 20%. Both panels plot the rate of errors against noise (%), with Test (µ ± σ) and Train (µ ± σ) curves.]

Comparison with Figure 4.3b shows that both types of errors increase when the association matrix has fewer 1s due to the higher τ value of 0.8. The basis vectors then do not represent the true underlying structure of the data. The factor matrices are constructed to explain the training data using such basis vectors, leading to over-fitting and a corresponding increase in reconstruction errors on the testing data.

4.2.3.2 High unknowns

Figure 4.4a shows that good reconstructions are obtained by using a high τ = 0.8 when the noisy data has a high percentage of unknowns (70%). This is consistent with the behavior in Figure 4.2a, which also showed good reconstructions with a high τ and high unknowns. The increasing noise in the matrices leads to a corresponding increase in both error rates, as explained in Section 4.2.3.1. However, when the association matrix is constructed using a low threshold (τ = 0.3 here), it has a much higher number of 1s, which, though leading to low training errors compared with Figure 4.3a, causes higher testing errors due to over-fitting.


[Figure 4.4: Varying noise (with high unknowns): (a) best (τ = 0.8) and (b) worst (τ = 0.3) error rates. Matrix dimensions (300, 500), rank 10, density 10%, unknowns 70%. Both panels plot the rate of errors against noise (%), with Test (µ ± σ) and Train (µ ± σ) curves.]

4.2.4 Varying rank

Figure 4.5a shows that when a low τ of 0.3 is used to obtain low-rank factorizations, the errors remain low and the training and testing errors are close, implying good factorizations. However, when the rank is increased, the testing error rate increases even though the training error rate remains low, implying over-fitting. The reason is that with higher ranks, the generated model has a large number of parameters relative to the number of observations (here, 300).

As shown in Figure 4.5b, for the higher τ of 0.8, the reconstructions get worse as over-fitting increases. At higher ranks, both error rates increase compared with Figure 4.5a, which can be attributed to having fewer 1s in the association matrix. The divide between training and testing gets wider, while the testing error rate stays more or less at the same level. The decline of the training errors with increasing rank is due to the fact that, even though the rank is increased, the density stays at 10% for all the matrices, and with more basis vectors available to explain such sparse data, the error decreases. However, this also brings the danger of over-fitting, as the basis vectors picked to explain the sparse matrices tend not to have much predictive capability for unseen data.


[Figure 4.5: Varying rank: (a) best (τ = 0.3) and (b) worst (τ = 0.8) error rates. Matrix dimensions (300, 500), density 10%, unknowns 20%, noise 5%. Both panels plot the rate of errors against rank, with Test (µ ± σ) and Train (µ ± σ) curves.]


4.3 Real Data

4.3.1 Movie Lens Data

The movie-ratings data set of MovieLens^5 is referred to as the Movie Lens data set in this thesis. The original data set contains 2113 users providing ratings for 10197 movies. Movies and users are represented by integer identifiers, while the ratings range from 0.5 to 5.0 in increments of 0.5. The data was preprocessed and reduced to Boolean format for the experiments; we used the 80th-percentile rating of each record as the threshold for the Boolean reduction. As part of removing outliers, we removed movies that were rated fewer than 5 times. In the original data, the minimum user frequency was already 20, so no pruning of users was performed. Table 4.2 gives the final statistics after this preprocessing.

Table 4.2: The properties of the Movie Lens data set after pruning

Feature                                    Value
Percentages (%)
  Unknowns                                 95.90
  Known Ratings                             4.10
  True Ratings                              1.04
  False Ratings                             3.06
  True among known ratings (Density)       25.39
  False among known ratings                74.61
Counts
  Movies pruned (frequency less than 5)     1954
  Final movies retained                     9831
  Final users retained                      2113
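The record-wise Boolean reduction could be sketched as follows (our own illustration, with unrated entries as NaN; the exact handling of ties at the 80th percentile is an assumption):

```python
import numpy as np

def booleanize_record(ratings):
    """Reduce one user's real-valued ratings to {1, 0, ?} (? encoded as -1).

    Ratings at or above the user's own 80th percentile become Boolean true,
    the remaining known ratings become false, and unrated entries stay unknown.
    """
    out = np.full(ratings.shape, -1, dtype=np.int8)
    known = ~np.isnan(ratings)
    if known.any():
        thresh = np.percentile(ratings[known], 80)  # record-wise threshold
        out[known] = (ratings[known] >= thresh).astype(np.int8)
    return out
```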

Neither the rank of the matrix nor a good estimate of the association threshold τ is known. We therefore ran the experiments on several combinations of parameters and chose the one giving the lower error rates. We chose τ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, while the rank values are {3, 10, 20, 30, 40, 60, 80}. For a matrix of size 2113 × 9831, a higher rank might lead to over-fitting and hence to poor factorizations. The data set was split into disjoint training and testing subsets in an 80-20% ratio with respect to the known values (true and false).

5. http://movielens.umn.edu/ [Last accessed: 26-10-2012]


4.3.1.1 Comparison with other methods

The algorithms selected for comparison with the BMF-with-missing-values method are discussed below:

1. OptSpace: A Matrix Completion Algorithm^6
To obtain comparison statistics with existing methods solving problems similar to the one in this thesis, we chose the OptSpace method by Keshavan et al. [2009]. The method solves the low-rank matrix-completion problem: it first obtains a factorization using SVD, followed by local manifold optimization. It first trims the matrix containing the hidden or unknown values to remove heavily weighted rows and columns that do not contain much information about the hidden entries, essentially discarding some information in the process. The trimmed matrix is then adjusted, via a gradient descent procedure, to minimize the error made at the entries whose values are known.

2. Augmented Lagrange Multiplier (ALM) Method
To obtain further comparison statistics, we also obtained reconstruction results with the ALM method by Lin et al. [2010]. The authors propose algorithms for recovering a low-rank matrix with an unknown fraction of its entries arbitrarily corrupted. In their paper, the method of augmented Lagrange multipliers is used to solve the Robust PCA problem via convex optimization, minimizing a combination of the nuclear norm and the ℓ1-norm. The authors report promising results for the related problem of matrix completion using the ALM technique, which we compare against our proposed algorithm.

4.3.1.2 Results and discussion

Table 4.3 presents the results for the movieLens data set. The same data set is factorized using different ranks, and for each rank multiple association threshold values are used. The general trend is that with increasing rank, the training error rate gradually decreases, but at the cost of a corresponding increase in the testing error; the same pattern is observed for the root-mean-square error (rmse). This hints at over-fitting, and the likely reason is the skewed nature of the data in terms of the distribution of known true and known false values. As can be seen from Table 4.2, which summarizes the movieLens data, the Boolean true values constitute only around 25% of the known values in the data set.

6. http://www.stanford.edu/~raghuram/optspace/index.html [Last accessed: 26-10-2012]


Our method, however, uses an equal-cost loss function, i.e., it applies equal weights w+ and w− while calculating the cover value.

Table 4.3: Factorization results for the movieLens data with the BMF-with-missing-values method. hamming denotes the Hamming distance between the original and reconstructed matrices, while rmse denotes the root-mean-square error; train and test denote the training and testing error rates.

              hamming             rmse
r     τ     train    test       train    test
3     0.1   0.1915   0.2057     0.4376   0.4535
      0.3   0.1867   0.2003     0.4321   0.4476
      0.5   0.1907   0.2044     0.4367   0.4521
      0.7   0.2046   0.2183     0.4523   0.4672
      0.9   0.2349   0.2404     0.4847   0.4903
10    0.1   0.1860   0.2069     0.4313   0.4548
      0.3   0.1807   0.2009     0.4251   0.4482
      0.5   0.1820   0.2040     0.4266   0.4517
      0.7   0.1968   0.2168     0.4437   0.4656
      0.9   0.2289   0.2402     0.4784   0.4901
20    0.1   0.1829   0.2081     0.4277   0.4562
      0.3   0.1765   0.2026     0.4201   0.4501
      0.5   0.1759   0.2044     0.4194   0.4521
      0.7   0.1906   0.2168     0.4366   0.4656
      0.9   0.2225   0.2402     0.4717   0.4901
30    0.1   0.1810   0.2089     0.4255   0.4571
      0.3   0.1739   0.2031     0.4170   0.4507
      0.5   0.1714   0.2044     0.4140   0.4521
      0.7   0.1856   0.2166     0.4308   0.4655
      0.9   0.2172   0.2401     0.4660   0.4900
40    0.1   0.1797   0.2100     0.4239   0.4583
      0.3   0.1718   0.2035     0.4145   0.4511
      0.5   0.1678   0.2043     0.4096   0.4520
      0.7   0.1813   0.2168     0.4258   0.4656
      0.9   0.2123   0.2403     0.4608   0.4902
60    0.1   0.1780   0.2108     0.4219   0.4591
      0.3   0.1683   0.2040     0.4103   0.4517
      0.5   0.1616   0.2044     0.4020   0.4521
      0.7   0.1738   0.2169     0.4168   0.4657
      0.9   0.2036   0.2403     0.4512   0.4902
80    0.1   0.1768   0.2113     0.4205   0.4597
      0.3   0.1655   0.2042     0.4069   0.4519
      0.5   0.1564   0.2048     0.3954   0.4526
      0.7   0.1672   0.2169     0.4089   0.4657
      0.9   0.1960   0.2403     0.4428   0.4902

Table 4.4 presents the results for the same data set using the OptSpace factorization [Keshavan et al., 2009]. As discussed before, the method works with real-valued data, so we provided it with the movieLens data set in binary format, using the value 2 to represent Boolean true and 1 for Boolean false, while the unknowns were represented by 0. OptSpace naturally treated the data as real-valued and produced real-valued factor matrices. The next step was to reduce the reconstructed matrix to Boolean format before comparing the results with our method; we used a threshold value of 1.5 for this reduction. As Table 4.4 shows, the results of our method are close to, if not always better than, those provided by OptSpace. The testing error of OptSpace remains lower than that of our method even where our method provides a lower training error. In comparing the results, we are aware that methods working with continuous values have more elbow room when deciding problem statements such as ours (will a user like a movie or not?), whereas for Boolean methods it is a hard decision, with more scope to commit an error when a particular threshold is crossed. Moreover, our method uses Boolean arithmetic when combining the basis vectors to explain an observation: once we obtain a 1 where there was a 0 in the data, it stays.
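For reference, the encoding and the Boolean reduction used around OptSpace amount to a pair of one-line mappings (a sketch of the procedure described above, with ? encoded as −1):

```python
import numpy as np

def encode_for_optspace(D):
    """Map {1, 0, ?} (? as -1) to the real values 2 / 1 / 0 described above."""
    E = np.zeros(D.shape)
    E[D == 1] = 2.0
    E[D == 0] = 1.0
    return E

def booleanize_reconstruction(R, thresh=1.5):
    """Reduce a real-valued reconstruction back to Boolean at threshold 1.5."""
    return (R >= thresh).astype(np.uint8)
```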

Table 4.5 presents the results for the movieLens data set with the ALM method [Lin et al., 2010]. This method returns real-valued factor matrices, which it finds by treating the Boolean data as real-valued. We used a threshold value, similar to the one used in the OptSpace experiment above, to reduce the reconstructed matrix to Boolean format for obtaining the result statistics. This method does not take the rank as a parameter but guesses it from the data. As the table shows, the method determined the best rank for the data to be 1, stopping at the first value it tested after finding good reconstruction results. A rank of 1 obviously does not provide much latent information about any data set.


Table 4.4: Factorization results for the Movie Lens data with the OptSpace method. r denotes the rank; train and test denote the training and test error rates; hamming denotes the Hamming error metric, rmse the root-mean-square error.

        hamming            rmse
r     train    test      train    test
3     0.1823   0.1886    0.4270   0.4343
10    0.1793   0.1865    0.4235   0.4318
30    0.1763   0.1866    0.4199   0.4320

The reasons for this are not hard to understand. For any Boolean data set, as an extreme case, full rank-1 factor matrices would always provide a good reconstruction, but at the cost of missing the patterns that actually exist in the data set. For the movieLens data set it can be understood thus: if users on average dislike more movies than they like, then the blanket conclusion that all users dislike all movies would still commit few prediction errors, because the users are generally found to dislike the movies; but such a conclusion does not truly reflect the pattern behind users' preferences.

Table 4.5: Factorization results for the Movie Lens data with the ALM method. r denotes the rank; train and test denote the training and test error rates; hamming denotes the Hamming error metric, rmse the root-mean-square error.

        hamming            rmse
r     train    test      train    test
1     0.1827   0.1896    0.4274   0.4354


4.3.2 KDD Cup Data

The track1 data set from the 2012 KDD Cup competition [kdd, 2012b] is referred to in this thesis as the KDD Cup data. The data consists of real responses of users to recommendations to follow an item on the microblogging website Tencent Weibo^7. The items belong to various categories such as news, games, advertisements, products, etc. If a user ignored a suggestion and did not provide an explicit response, the result is an unknown value in the matrix; otherwise we have a Boolean true for follow and a false when the user does not want to follow the item. We also have unknowns for all the items that were never presented to a user. The goal of the KDD Cup 2012 for this data is to predict which items a user will follow among all potential items. Clearly, this is Boolean data with missing values that requires the matrix to be completed, and it is thus an exact match to the problem this thesis addresses.

Some preprocessing was required before experiments could be performed on this data set. The first issue was the huge size of the data; to handle it, we decided to prune users with frequency less than 5, as we feel their contribution is insignificant for modeling purposes. Next, among the surviving records, all items with frequency less than 5 were removed as well. These steps reduced the number of users by 16% and the number of items by 3%.
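A sketch of this frequency-based pruning over (user, item, response) triples (our own illustrative code):

```python
from collections import Counter

def prune_infrequent(triples, min_freq=5):
    """Drop users with fewer than min_freq records, then items likewise."""
    user_freq = Counter(u for u, _, _ in triples)
    kept = [t for t in triples if user_freq[t[0]] >= min_freq]
    item_freq = Counter(i for _, i, _ in kept)
    return [t for t in kept if item_freq[t[1]] >= min_freq]
```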

Even so, the data remained huge, containing 5.3 billion entries, of which over 99% were missing values. By random sampling of the records, we obtained a smaller data set containing 2.6 billion entries and half the number of users (583113); we refer to it as the Half-KDD Cup data throughout. Owing to the random sampling, the characteristics of the data, in terms of density, percentage of known values, etc., were observed to remain the same after the reduction. Table 4.6 summarizes the properties of the Half-KDD Cup data set.

Owing to size considerations, it was not possible to try a full range of association threshold values on the full data. Instead, the threshold value that gives the minimum error on a small subset of the data is chosen for the Half-KDD Cup data. We created a training subset containing a random sample of 5% of the known values of Half-KDD Cup, and a disjoint testing subset containing 1.25% of the remaining known values, such that the two subsets contain known values in the ratio 80:20. We then executed the algorithm on these subsets using the association thresholds τ ∈ {0.2, 0.4, 0.6, 0.8}. The training and testing data were constructed from the Half-KDD Cup data using an 80-20% split with respect to the known values. The τ from this set giving the best reconstruction results is then used for obtaining the factorization of the training subset of the Half-KDD Cup data.

7. http://t.qq.com/ [Last accessed: 26-10-2012]


Table 4.6: The properties of the Half-KDD Cup data set

Feature                                    Value
Percentages (%)
  Unknowns                                 99.24
  Known Ratings                             0.76
  True Ratings                              0.06
  False Ratings                             0.70
  True among known ratings (Density)        8.8
  False among known ratings                91.2
Counts
  Final items retained                      4551
  Final users retained                    583113

It is important to choose an appropriate factorization rank for the data set, as our algorithm does not predict the rank from the data. We feel that for the 4551 items in the Half-KDD Cup data matrix, a rank of 50 should provide an efficient low-rank factorization summarizing the most important characteristics of the data.

4.3.2.1 Results and discussion

Table 4.7 presents the results of the experiment to choose the best τ. The training errors are extremely low, on the order of 3%, while the testing errors, though only around 8%, are relatively much higher. The reason for the low errors is the extremely low density of the matrices: the training data has approximately 0.00038% known values among all entries, and among the known values only 8.75% are known true. In this scenario, the basis vectors obtained by our algorithm explain the training data extremely well, because even a few 1s in the factor matrices can cover the given 1s in the data matrix. It is interesting to observe the density of the factor matrices themselves, presented in Table 4.8: using the lower τ = 0.2 allows a somewhat higher percentage of 1s in the association matrix, and hence in the basis matrix, but overall the factor matrices are extremely sparse.

Coming back to Table 4.7, we also present the error rates with respect to the known true values, to better understand the behavior of the algorithm. The density of the matrices being so low, we focus on the prediction of the known true values in the data and find that the rate of test errors is extremely high, exceeding 99% for the higher values of τ. Given such a skewed distribution of the known true and false values as in this data, using unequal weights w+ and w−, which reward covering 1s and penalize covering 0s respectively, would help. Here, however, we used weights of equal magnitude. The choice of optimal weights is not a direct decision and requires trying out a few candidate values; the size of the data in question prevented us from exercising this option, but in general it could be used for data with a skewed distribution among the known values.

Table 4.7: The reconstruction error statistics for the experiment to choose the association threshold τ at rank 50 for the Half-KDD Cup data. hamming gives the error rates over the known values, errors-known-true over the Boolean TRUE values among the known values; train and test denote the training and testing error rates.

               hamming            errors-known-true
rank    τ    train    test      train    test
50      0.2  0.0042   0.0977    0.0394   0.9234
        0.6  0.0332   0.0882    0.3779   0.9954
        0.8  0.0351   0.0881    0.4001   0.9967

Table 4.8: The density of the factor matrices in the experiment to choose the best τ

              density (in %)
τ      basis matrix   usage matrix
0.2    5.21           0.28
0.6    0.36           0.19
0.8    0.30           0.18

We choose τ = 0.8 for the experiment with the Half-KDD Cup data, as this value provides balanced results on the training and testing data. Had we chosen τ = 0.2, we would have introduced far too many 1s in the association matrix, which would have resulted in a lower training error, but at the expense of the testing error.

Table 4.9 presents the result of the experiment with the Half-KDD Cup data. The low testing errors in terms of known values for our method are largely due to predicting the Boolean false values (0s) in the matrix correctly. Our method could match very few Boolean true values (1s) in the original matrix, committing 99.94% errors. The reason is the low density of the data, which directly results in a very sparse association matrix; consequently, the candidate basis vectors have very few 1s with which to cover the 1s in the data matrix.


Table 4.9: The reconstruction error statistics for the experiment with the Half-KDD Cup data. τ is the association threshold and rank the factorization rank; hamming gives the error rates over the known values, errors-known-true over the Boolean TRUE values among the known values; train and test denote the training and testing error rates.

               hamming            errors-known-true
rank    τ    train    test      train    test
50      0.8  0.0525   0.0877    0.5995   0.9994

4.3.2.2 Comparison with other methods

Table 4.10 presents the results obtained for the Half-KDD Cup data using the OptSpace method. When we let the algorithm guess the rank from the data and obtain the best training and testing errors, it produced a factorization of rank 3. Probably the Boolean data was too sparse for the method to determine much structure in the matrix, hence the low rank. The training and testing errors for OptSpace are close to each other, so the method found similar structure in the training and testing data. Compared with our results in Table 4.9, however, our method obtains a lower error rate using a higher factorization rank: it achieved lower training and testing errors in terms of the known values, but higher testing errors in terms of the known true values.

Table 4.10: Factorization results for the Half-KDD Cup data with the OptSpace method. r denotes the rank; train and test denote the training and test error rates; hamming denotes the Hamming error metric, rmse the root-mean-square error; errors-known-true gives the errors when matching Boolean TRUE values among the known values.

        hamming           errors-known-true     rmse
r     train    test      train    test        train    test
3     0.1164   0.1169    0.7685   0.7756      0.3412   0.3419


Table 4.11 presents the results obtained for the same Half-KDD Cup data using the ALM method. We allowed the algorithm to determine the rank from the data itself and present the best estimates for the training and testing data sets. Again the method settles on a rank of 1 for the matrix, and the factor matrices were almost full, giving low error rates; this is similar to the algorithm's behavior on the MovieLens data set (Table 4.5). Here the number of Boolean true values in the matrix was extremely low, and if a method treats every unseen data value as a Boolean false, its errors will obviously be low. Hence we could not obtain much information from these factorization results. This method also shows higher error rates in terms of the known true values, for both the training and the testing data sets.

Table 4.11: Factorization results for the Half-KDD Cup data with the ALM method. r denotes the rank; train and test denote the training and test error rates; hamming denotes the Hamming error metric, rmse the root-mean-square error; errors-known-true gives the errors when matching Boolean TRUE values among the known values.

        hamming           errors-known-true     rmse
r     train    test      train    test        train    test
1     0.0796   0.0822    0.8219   0.8346      0.2822   0.2867


Chapter 5

Conclusions

5.1 Conclusion

This thesis introduces a factorization method for Boolean matrices containing unknown or missing values, based on the correlation among the records of the data matrix. Our method uses Boolean arithmetic to obtain intuitive basis vectors that represent the underlying structure of Boolean data sets containing missing values. This is an improvement over the method on which this work is based, the Discrete Basis Problem of Miettinen et al. [2008], which was limited to providing factorizations for complete Boolean matrices only.

The biggest contribution of this work is a method that enables the calculation of association confidence scores even in the presence of missing values in the data matrix, which cannot be done using the classical methods of association rule mining. Our method expresses a given matrix as the product of a basis matrix and an associated usage matrix: the basis matrix consists of the basis vectors from which the data can be reconstructed, while the usage matrix records the combinations in which those basis vectors should be used to obtain the reconstruction.

We provide an optimized approach to finding the best basis vectors by introducing the concept of the fit matrix, which enables the efficient determination of the best factors of a matrix. The benefits of this approach are more visible with higher-dimensional matrices, as it takes progressively less time to calculate the later factors.

Our method provides good results on synthetic data, where we demonstrated the algorithm's performance on various combinations of data parameters. The algorithm produced good matrix completions with an increasing percentage of unknowns, even when the matrix had a density of only 10%. Our algorithm also obtains good results under changes to the rank, density, and noise of the data and to the association threshold.

We performed experiments with two real-world data sets. The first is the movieLens data set, which has almost 96% unknowns; our algorithm achieved a reconstruction error rate of around 20% on its testing subset.

The second data set is the KDD Cup data, which has an extremely low density of 0.76%. While processing a subset of this data to choose the best τ, the algorithm worked with a matrix density as low as 0.00038%. We observed relatively higher testing errors in this scenario, especially in terms of the known true values, but this can be attributed to the extremely low density of the matrices; with higher-density matrices, as in the movieLens data set, we see good reconstruction results.

We obtained comparable results with the OptSpace method [Keshavan et al., 2009]. OptSpace provides a matrix-completion solution for real-valued matrices, and the method did not produce factors that are intuitive for Boolean data sets; hence we were only able to obtain approximate comparison statistics.

5.2 Future Work

When the known values in the data are skewed, so that the percentage of known true values is much smaller than that of known false values, the algorithm could employ unequal weights w+ and w− to bias the choice of the best basis vectors: putting more emphasis on covering 1s without worrying as much about introducing 1s where the matrix originally had 0s. A hedged sketch of such a scoring function follows.
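As an illustration of such a weighting, the score of a candidate reconstruction could be computed asymmetrically over the known entries only; this is a sketch in our own notation (function and parameter names are hypothetical), not the algorithm's current scoring:

    import numpy as np

    def weighted_gain(X, known, recon, w_plus=1.0, w_minus=0.2):
        """Asymmetric cover score restricted to the known entries.

        X     : Boolean data matrix.
        known : Boolean mask, True where the entry of X is known.
        recon : Boolean reconstruction being scored.
        """
        hits   = (X & recon & known).sum()    # known 1s that are covered
        errors = (~X & recon & known).sum()   # known 0s overwritten by 1s
        # w_plus > w_minus rewards covering the scarce 1s more than it
        # penalizes introducing 1s over the abundant 0s.
        return w_plus * hits - w_minus * errors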

We could improve the algorithm's predictions on unseen data by using regularizers while constructing the factor matrices. Among different sets of factor matrices that yield the same error, choosing the sparser ones should induce fewer errors on unseen data; one way to write this down is sketched below.
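One way to formalize this preference is to add a sparsity penalty to the reconstruction error. In the sketch below, X is the data matrix, K the mask of known entries (so unknowns contribute no error), U ∘ B the Boolean product of the usage and basis matrices, ⊕ element-wise exclusive-or, |·| the number of 1s, and λ a penalty weight; the penalty term is our notation, not part of the current algorithm:

    \min_{B,\,U} \; \bigl| \bigl( X \oplus (U \circ B) \bigr) \wedge K \bigr|
        \;+\; \lambda \bigl( |B| + |U| \bigr)

Among factorizations with equal error, a larger λ steers the search toward the sparser pair (B, U).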

During the preparation of Boolean data from real-valued data sets, e.g., the MovieLens data, we used the user's bias to reduce the user's real-valued ratings to Boolean format. Using only the user's preference may be too narrow a view for normalizing the ratings; incorporating both the user's and the movie's normalized ratings should yield a better Boolean reduction of the data. This would let us better estimate the structure in the data when computing the factorizations, and perhaps obtain better results in comparison with real-valued methods.
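For instance, the reduction rule could compare each rating against a blend of the per-user and per-movie means instead of the user mean alone; a hypothetical sketch (the names and the equal 50/50 blend are our assumptions):

    def binarize(ratings, user_mean, movie_mean):
        """Reduce real-valued ratings to Boolean 'liked' flags.

        ratings    : dict mapping (user, movie) -> real-valued rating
        user_mean  : dict mapping user  -> that user's mean rating
        movie_mean : dict mapping movie -> that movie's mean rating
        """
        boolean = {}
        for (u, m), r in ratings.items():
            # A rating becomes a 1 only if it exceeds the average of
            # the user's own bias and the movie's overall reception.
            threshold = 0.5 * (user_mean[u] + movie_mean[m])
            boolean[(u, m)] = r > threshold
        return boolean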


Bibliography

Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348–1362, 2008.

T. Li. A general model for clustering binary data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 188–197. ACM, 2005.

Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. Binary matrix factorization with applications. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, pages 391–400, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-3018-4.

Haibing Lu, Jaideep Vaidya, Vijayalakshmi Atluri, and Yuan Hong. Extended Boolean matrix decomposition. In Wei Wang, Hillol Kargupta, Sanjay Ranka, Philip S. Yu, and Xindong Wu, editors, ICDM, pages 317–326. IEEE Computer Society, 2009. ISBN 978-0-7695-3895-2.

Floris Geerts, Bart Goethals, and Taneli Mielikäinen. Tiling databases. In Einoshin Suzuki and Setsuo Arikawa, editors, Discovery Science, volume 3245 of Lecture Notes in Computer Science, pages 278–289. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-23357-2.

Pauli Miettinen. The Boolean column and column-row matrix decompositions. Data Min. Knowl. Discov., 17(1):39–56, August 2008. ISSN 1384-5810.

Haibing Lu, Jaideep Vaidya, and Vijayalakshmi Atluri. Optimal Boolean matrix decomposition: Application to role engineering. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE '08, pages 297–306, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-1-4244-1836-7. doi: 10.1109/ICDE.2008.4497438.

Jilles Vreeken and Arno Siebes. Filling in the blanks: Krimp minimisation for missing data. In ICDM, pages 1067–1072. IEEE Computer Society, 2008.


KDD Cup: User Modeling based on Microblog Data and Search Click Data. Website, 2012a. URL http://kdd2012.sigkdd.org/kddcup.shtml. Last checked: 26.10.2012.

Raghunandan H. Keshavan, Sewoong Oh, and Andrea Montanari. Matrix completion from a few entries. In Proceedings of the 2009 IEEE International Conference on Symposium on Information Theory - Volume 1, ISIT '09, pages 324–328, Piscataway, NJ, USA, 2009. IEEE Press. ISBN 978-1-4244-4312-3. URL http://dl.acm.org/citation.cfm?id=1701495.1701562.

Zhong-Yuan Zhang, Tao Li, Chris Ding, Xian-Wen Ren, and Xiang-Sun Zhang. Binary matrix factorization for analyzing gene expression data. Data Min. Knowl. Discov., 20(1):28–52, January 2010. ISSN 1384-5810. doi: 10.1007/s10618-009-0145-2.

Hugo Van Hamme. An on-line NMF model for temporal pattern learning: Theory with application to automatic speech recognition. In Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA '12, pages 306–313, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-28550-9. doi: 10.1007/978-3-642-28551-6_38.

Juan C. Caicedo and Fabio A. Gonzalez. Online matrix factorization for multimodal image retrieval. In CIARP, pages 340–347, 2012.

K. Bergemann, G. Gottwald, and S. Reich. Ensemble propagation and continuous matrix factorization algorithms. Quarterly Journal of the Royal Meteorological Society, 135(643):1560–1572, 2009.

A. J. Willis and R. De Mello Koch. A minimum norm array processing algorithm for super resolution target profiling. Volume 1, pages 217–222, September 1994.

Craig Gotsman. Constant-time filtering by singular value decomposition. Computer Graphics Forum, 13(2):153–163, 1994. ISSN 1467-8659. doi: 10.1111/1467-8659.1320153.

Roland Ruiters, Martin Rump, and Reinhard Klein. Parallelized matrix factorization for fast BTF compression. In Eurographics Symposium on Parallel Graphics and Visualization, pages 25–32, March 2009.

Soo-Chang Pei and Hsin-Hua Liu. Improved SVD-based watermarking for digital images. In Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, ICVGIP '08, pages 273–280, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3476-3. doi: 10.1109/ICVGIP.2008.99.

V. Klema and A. Laub. The singular value decomposition: Its computation and some applications. IEEE Transactions on Automatic Control, 25(2):164–176, 1980.


Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press, 2000.

Karthik Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Comput Biol, 4(7):e1000029, July 2008. doi: 10.1371/journal.pcbi.1000029.

C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

G. Stewart. On the early history of the singular value decomposition. SIAM Review, 35(4):551–566, 1993. doi: 10.1137/1035134.

Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994. doi: 10.1002/env.3170050203.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999. ISSN 0028-0836. doi: 10.1038/44565.

V. Snasel, J. Platos, P. Kromer, D. Husek, R. Neruda, and A. A. Frolov. Investigating Boolean matrix factorization. 2008.

Vaclav Snasel, Jan Platos, and Pavel Kromer. Developing genetic algorithms for Boolean matrix factorization. In DATESO, volume 330 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.

Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 26-28, 1993, pages 207–216. ACM Press, 1993.

Arno Siebes, Jilles Vreeken, and Matthijs van Leeuwen. Item sets that compress. In SDM, 2006.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

Zhouchen Lin, Minming Chen, Leqin Wu, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical Report UILU-ENG-09-2215 (arXiv:1009.5055), University of Illinois at Urbana-Champaign, September 2010.

KDD Cup 2012, Track 1. Website, 2012b. URL http://www.kddcup2012.org/c/kddcup2012-track1. Last checked: 26.10.2012.