Machine Learning and Data Mining 19 Mining Text and Web Data 26716

download Machine Learning and Data Mining 19 Mining Text and Web Data 26716

of 75

Transcript of Machine Learning and Data Mining 19 Mining Text and Web Data 26716

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    1/75

    Prof. Pier Luca Lanzi

    Text and Web MiningMachine Learning and Data Mining (Unit 19)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    2/75

    2

    Prof. Pier Luca Lanzi

    References

    Jiawei Han and Micheline Kamber,"Data Mining: Concepts andTechniques", The Morgan KaufmannSeries in Data Management Systems(Second Edition)

    Chapter 10, part 2

    Web Mining Course byGregory-Platesky Shapiro available atwww.kdnuggets.com

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    3/75

    3

    Prof. Pier Luca Lanzi

    Mining Text Data: An Introduction

    Data Mining / Knowledge Discovery

    Structured Data Multimedia Free Text Hypertext

    HomeLoan (Loanee: Frank RizzoLender: MWF

    Agency: Lake View Amount: $200,000

    Term: 15 years)

    Frank Rizzo boughthis home from LakeView Real Estate in1992.

    He paid $200,000

    under a15-year loanfrom MW Financial.

    Frank Rizzo Bought this home from LakeView Real Estate In 1992 .

    ...Loans( $200K ,[map ],... )

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    4/75

    4

    Prof. Pier Luca Lanzi

    Bag-of-Tokens Approaches

    Four score and sevenyears ago our fathers broughtforth on this continent, a newnation , conceived in Liberty,and dedicated to theproposition that all men arecreated equal.

    Now we are engaged in agreat civil war, testing

    whether that nation , or

    nation 5civil - 1war 2men 2

    died 4people 5Liberty 1God 1

    FeatureExtraction

    Loses all order-specific information!Severely limits context!

    Documents Token Sets

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    5/75

    5

    Prof. Pier Luca Lanzi

    Natural Language Processing

    A dog is chasing a boy on the playgroundDet Noun Aux Verb Det Noun Prep Det Noun

    Noun Phrase Complex Verb Noun PhraseNoun Phrase

    Prep PhraseVerb Phrase

    Verb Phrase

    Sentence

    Dog(d1).Boy(b1).Playground(p1).Chasing(d1,b1,p1).

    Semantic analysis

    Lexicalanalysis

    (part-of-speechtagging)

    Syntactic analysis

    (Parsing)

    A person saying this maybe reminding another person to

    get the dog back

    Pragmatic analysis(speech act)

    Scared(x) if Chasing(_,x,_).

    +

    Scared(b1)Inference

    (Taken from ChengXiang Zhai, CS 397cxz Fall 2003)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    6/75

    6

    Prof. Pier Luca Lanzi

    General NLPToo Difficult!

    Word-level ambiguity design can be a noun or a verb (Ambiguous POS) root has multiple meanings (Ambiguous sense)

    Syntactic ambiguity natural language processing (Modification) A man saw a boy with a telescope. (PP Attachment)

    Anaphora resolution John persuaded Bill to buy a TV for himself. (himself = John or Bill?)

    Presupposition

    He has quit smoking. implies that he smoked before.

    (Taken from ChengXiang Zhai, CS 397cxz Fall 2003)

    Humans rely on context to interpret (when possible).This context may extend beyond a given document!

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    7/75

    7

    Prof. Pier Luca Lanzi

    Shallow Linguistics

    English LexiconPart-of-Speech TaggingWord Sense Disambiguation

    Phrase Detection / Parsing

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    8/75

    8

    Prof. Pier Luca Lanzi

    WordNet

    An extensive lexical network for the English languageContains over 138,838 words.Several graphs, one for each part-of-speech.Synsets (synonym sets), each defining a semantic sense.Relationship information (antonym, hyponym, meronym )Downloadable for free (UNIX, Windows)Expanding to other languages (Global WordNet Association)

    Funded >$3 million, mainly government (translation interest)Founder George Miller, National Medal of Science, 1991.

    wet dry

    watery

    moist

    damp

    parched

    anhydrous

    aridsynonym

    antonym

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    9/75

    9

    Prof. Pier Luca Lanzi

    1 1

    1 1 1

    11

    ( ,..., , ,..., )

    ( | )... ( | ) ( )... ( )

    ( | ) ( | )

    k k

    k k k

    k

    i i i ii

    p w w t t

    p t w p t w p w p w

    p w t p t t =

    =

    1 1

    1 1 1

    11

    ( ,..., , ,..., )

    ( | )... ( | ) ( )... ( )

    ( | ) ( | )

    k k

    k k k

    k

    i i i ii

    p w w t t

    p t w p t w p w p w

    p w t p t t =

    =

    Pick the most likely tag sequence.

    Part-of-Speech Tagging

    This sentence serves as an example of annotated textDet N V1 P Det N P V2 N

    Training data (Annotated text)

    POS TaggerThis is a new sentence. This is a new sentence.Det Aux Det Adj N

    Partial dependency(HMM)

    Independent assignmentMost common tag

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    10/75

    10

    Prof. Pier Luca Lanzi

    Word Sense Disambiguation

    Supervised LearningFeatures:

    Neighboring POS tags (N Aux V P N)

    Neighboring words (linguistics are rooted in ambiguity)Stemmed form (root)Dictionary/Thesaurus entries of neighboring wordsHigh co-occurrence words (plant, tree, origin,)Other senses of word within discourse

    Algorithms:Rule-based Learning (e.g. IG guided)Statistical Learning (i.e. Nave Bayes)Unsupervised Learning (i.e. Nearest Neighbor)

    The difficulties of computational linguistics are rooted in ambiguity .N Aux V P N

    ?

    (Adapted from ChengXiang Zhai, CS 397cxz Fall 2003)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    11/75

    11

    Prof. Pier Luca Lanzi

    Parsing

    (Adapted from ChengXiang Zhai, CS 397cxz Fall 2003)

    Choose most likely parse tree

    the playground

    S

    NP VP

    BNP

    N

    Det

    A

    dog

    VP PPAux V

    is ona boy

    chasing

    NP P NP

    Probabil ity of this tree=0.000015

    ...S

    NP VP

    BNP

    N

    dog

    PPAux V

    is

    ona boy

    chasing

    NP

    P NP

    DetA

    the playground

    NP

    Probabil ity of this tree=0.000011

    S NP VPNP Det BNPNP BNPNP NP PPBNP NVP VVP Aux V NPVP VP PPPP P NP

    V chasingAux isN dogN boyN playgroundDet theDet aP on

    Grammar

    Lexicon

    1.0

    0.30.40.3

    1.0

    0.01

    0.003

    Probabilistic CFG

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    12/75

    12

    Prof. Pier Luca Lanzi

    Obstacles

    Ambiguity

    A man saw a boy with a telescope.

    Computational Intensity

    Imposes a context horizon.

    Text Mining NLP ApproachLocate promising fragments using fast IR methods

    (bag-of-tokens)Only apply slow NLP techniques to promising fragments

    S Sh ll NLP

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    13/75

    13

    Prof. Pier Luca Lanzi

    Summary: Shallow NLP

    However, shallow NLP techniques are feasible and useful :Lexicon machine understandable linguistic knowledge

    possible senses, definitions, synonyms, antonyms, typeof,etc.

    POS Tagging limit ambiguity (word/POS), entity extractionWSD stem/synonym/hyponym matches (doc and query)

    Query: Foreign cars Document: Im selling a 1976 Jaguar

    Parsing logical view of information (inference?,translation?)

    A man saw a boy with a telescope. Even without complete NLP, any additional knowledgeextracted from text data can only be beneficial .Ingenuity will determine the applications .

    R f f NLP

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    14/75

    14

    Prof. Pier Luca Lanzi

    References for NLP

    C. D. Manning and H. Schutze, Foundations of Natural LanguageProcessing, MIT Press, 1999.S. Russell and P. Norvig, Artificial Intelligence: A ModernApproach, Prentice Hall, 1995.S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext

    and Semi-Structured Data, Morgan Kaufmann, 2002.G. Miller, R. Beckwith, C. FellBaum, D. Gross, K. Miller, and R.Tengi. Five papers on WordNet. Princeton University, August 1993.C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC,Fall 2003.

    M. Hearst, Untangling Text Data Mining, ACL99, invited pap er.http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.htmlR. Sproat, Introduction to Computational Linguistics, LING 306,UIUC, Fall 2003.

    A Road Map to Text Mining and Web Mining, University of Texa sresource page. http://www.cs.utexas.edu/users/pebronia/text-mining/Compu tational Linguistics and Text Min ing Group, IBM Research,http:// www.research.ibm.com/dssgrp/

    T t D t b d I f ti R t i l

    http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.htmlhttp://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.htmlhttp://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.htmlhttp://www.cs.utexas.edu/users/pebronia/text-mining/http://www.cs.utexas.edu/users/pebronia/text-mining/http://www.cs.utexas.edu/users/pebronia/text-mining/http://www.cs.utexas.edu/users/pebronia/text-mining/http://www.research.ibm.com/dssgrp/http://www.research.ibm.com/dssgrp/http://www.research.ibm.com/dssgrp/http://www.cs.utexas.edu/users/pebronia/text-mining/http://www.cs.utexas.edu/users/pebronia/text-mining/http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.htmlhttp://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    15/75

    15

    Prof. Pier Luca Lanzi

    Text Databases and Information Retrieval

    Text databases (document databases)Large collections of documents from various sources:news articles, research papers, books, digital libraries,E-mail messages, and Web pages, library database, etc.Data stored is usually semi-structuredTraditional information retrieval techniques becomeinadequate for the increasingly vast amounts of text data

    Information retrievalA field developed in parallel with database systemsInformation is organized into (a large number of)documentsInformation retrieval problem: locating relevantdocuments based on user input, such as keywords or

    example documents

    16Information Retrieval (IR)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    16/75

    16

    Prof. Pier Luca Lanzi

    Information Retrieval (IR)

    Typical Information Retrieval systemsOnline library catalogsOnline document management systems

    Information retrieval vs. database systemsSome DB problems are not present in IR, e.g.,update, transaction management, complex objectsSome IR problems are not addressed well in DBMS, e.g.,unstructured documents, approximate search usingkeywords and relevance

    17Basic Measures for Text Retrieval

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    17/75

    17

    Prof. Pier Luca Lanzi

    Basic Measures for Text Retrieval

    Precision : the percentage of retrieved documents that are in factrelevant to the query (i.e., correct responses)

    Recall : the percentage of documents that are relevant to the queryand were, in fact, retrieved

    |}{||}|

    RelevantRetrievedRelevant

    precision

    =

    |}{||}|

    RetrievedRetrievedRelevantprecision

    =

    Relevant Relevant & Retrieved Retrieved

    All Documents

    18Information Retrieval Techniques

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    18/75

    18

    Prof. Pier Luca Lanzi

    Information Retrieval Techniques

    Basic ConceptsA document can be described by a set of representativekeywords called index terms.Different index terms have varying relevance when usedto describe document contents.This effect is captured through the assignment of numerical weights to each index term of a document.

    (e.g.: frequency, tf-idf)

    DBMS AnalogyIndex Terms AttributesWeights Attribute Values

    19Information Retrieval Techniques

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    19/75

    19

    Prof. Pier Luca Lanzi

    Information Retrieval Techniques

    Index Terms (Attribute) Selection:Stop listWord stem

    Index terms weighting methods

    Terms U Documents Frequency Matrices

    Information Retrieval Models:Boolean ModelVector Model

    Probabilistic Model

    20Boolean Model

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    20/75

    20

    Prof. Pier Luca Lanzi

    Boolean Model

    Consider that index terms are either presentor absent in a documentAs a result, the index term weights are assumedto be all binariesA query is composed of index terms linked by threeconnectives: not , and , and or

    E.g.: car and repair, plane or airplane

    The Boolean model predicts that each document is eitherrelevant or non-relevant based on the match of a documentto the query

    21Keyword-Based Retrieval

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    21/75

    21

    Prof. Pier Luca Lanzi

    Keyword Based Retrieval

    A document is represented by a string, which can beidentified by a set of keywords

    Queries may use expressions of keywordsE.g., car and repair shop, tea or coffee,DBMS but not OracleQueries and retrieval should consider synonyms,

    e.g., repair and maintenance

    Major difficulties of the modelSynonymy: A keyword T does not appear anywhere in thedocument, even though the document is closely related toT, e.g., data miningPolysemy: The same keyword may mean different things

    in different contexts, e.g., mining

    22Similarity-Based Retrieval in Text Data

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    22/75

    22

    Prof. Pier Luca Lanzi

    Similarity Based Retrieval in Text Data

    Finds similar documents based on a set of common keywords

    Answer should be based on the degree of relevance based onthe nearness of the keywords, relative frequency of thekeywords, etc.

    Basic techniques

    Stop listSet of words that are deemed irrelevant, even thoughthey may appear frequentlyE.g., a, the, of, for, to, with, etc.Stop lists may vary when document set varies

    23Similarity-Based Retrieval in Text Data

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    23/75

    Prof. Pier Luca Lanzi

    Similarity Based Retrieval in Text Data

    Word stemSeveral words are small syntactic variants of each other sincethey share a common word stemE.g., drug, drugs, drugged

    A term frequency tableEach entry frequent_table(i, j) = # of occurrences of the wordti in document diUsually, the ratio instead of the absolute number of occurrencesis used

    Similarity metrics: measure the closeness of a document to a query(a set of keywords)

    Relative term occurrencesCosine distance:

    ||||),( 2121

    21 vvvv

    vvsim

    =

    24Indexing Techniques

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    24/75

    Prof. Pier Luca Lanzi

    g q

    Inverted indexMaintains two hash- or B+-tree indexed tables: document_table: a set of document records term_table: a set of term records,

    Answer query: Find all docs associated with one or a set of terms+ easy to implement do not handle well synonymy and polysemy, and posting listscould be too long (storage could be very large)

    Signature fileAssociate a signature with each documentA signature is a representation of an ordered list of terms that

    describe the documentOrder is obtained by frequency analysis, stemming and stop lists

    25Vector Space Model

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    25/75

    Prof. Pier Luca Lanzi

    p

    Documents and user queries are represented as m-dimensionalvectors, where m is the total number of index terms in thedocument collection.The degree of similarity of the document d with regard to the query

    q is calculated as the correlation between the vectors that representthem, using measures such as the Euclidian distance or the cosineof the angle between these two vectors.

    26Latent Semantic Indexing (1)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    26/75

    Prof. Pier Luca Lanzi

    g ( )

    Basic ideaSimilar documents have similar word frequenciesDifficulty: the size of the term frequency matrix is very largeUse a singular value decomposition (SVD) techniques to reduce

    the size of frequency tableRetain the K most significant rows of the frequency table

    Method

    Create a term x document weighted frequency matrix A

    SVD construction: A = U * S * V

    Define K and obtain U k , , S k , and V k.

    Create query vector q .

    Project q into the term-document space: Dq = q * U k * S k-1

    Calculate similarities: cos = Dq . D / ||Dq|| * ||D||

    27Latent Semantic Indexing (2)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    27/75

    Prof. Pier Luca Lanzi

    Weighted Frequency Matrix

    Query Terms:- Insulation

    - Joint

    28Probabilistic Model

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    28/75

    Prof. Pier Luca Lanzi

    Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documentsand no other (ideal answer set)Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known atquery time, an initial guess is madeThis initial guess allows the generation of a preliminaryprobabilistic description of the ideal answer set which is usedto retrieve the first set of documentsAn interaction with the user is then initiated with the purposeof improving the probabilistic description of the answer set

    29Types of Text Data Mining

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    29/75

    Prof. Pier Luca Lanzi

    Keyword-based association analysisAutomatic document classificationSimilarity detection

    Cluster documents by a common authorCluster documents containing informationfrom a common source

    Link analysis: unusual correlation between entities

    Sequence analysis: predicting a recurring eventAnomaly detection: find information that violates usualpatterns

    Hypertext analysisPatterns in anchors/linksAnchor text correlations with linked objects

    30Keyword-Based Association Analysis

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    30/75

    Prof. Pier Luca Lanzi

    MotivationCollect sets of keywords or terms that occur frequentlytogether and then find the association or correlationrelationships among them

    Association Analysis ProcessPreprocess the text data by parsing, stemming,removing stop words, etc.

    Evoke association mining algorithms Consider each document as a transaction View a set of keywords in the document as

    a set of items in the transaction

    Term level association mining No need for human effort in tagging documents The number of meaningless results and

    the execution time is greatly reduced

    31Text Classification (1)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    31/75

    Prof. Pier Luca Lanzi

    MotivationAutomatic classification for the large number of on-linetext documents (Web pages, e-mails, corporate intranets,etc.)

    Classification ProcessData preprocessingDefinition of training set and test setsCreation of the classification model using the selectedclassification algorithmClassification model validationClassification of new/unknown text documents

    Text document classification differsfrom the classification of relational data

    Document databases are not structured according toattribute-value pairs

    32Text Classification (2)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    32/75

    Prof. Pier Luca Lanzi

    Classification Algorithms:Support Vector MachinesK-Nearest NeighborsNave BayesNeural NetworksDecision TreesAssociation rule-basedBoosting

    33Document Clustering

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    33/75

    Prof. Pier Luca Lanzi

    MotivationAutomatically group related documents based on their contentsNo predetermined training sets or taxonomiesGenerate a taxonomy at runtime

    Clustering ProcessData preprocessing:remove stop words, stem, feature extraction, lexical analysis,etc.Hierarchical clustering:compute similarities applying clustering algorithms.Model-Based clustering (Neural Network Approach):clusters are represented by exemplars. (e.g.: SOM)

    34Text Categorization

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    34/75

    Prof. Pier Luca Lanzi

    Pre-given categories and labeled document examples(Categories may form hierarchy)Classify new documentsA standard classification (supervised learning ) problem

    CategorizationSystem

    Sports

    Business

    Education

    ScienceSportsBusiness

    Education

    35Applications

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    35/75

    Prof. Pier Luca Lanzi

    News article classificationAutomatic email filteringWebpage classificationWord sense disambiguation

    36Categorization Methods

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    36/75

    Prof. Pier Luca Lanzi

    Manual : Typically rule-basedDoes not scale up (labor-intensive, rule inconsistency)May be appropriate for special data on a particulardomain

    Automatic : Typically exploiting machine learning techniquesVector space model based

    Prototype-based (Rocchio) K-nearest neighbor (KNN) Decision-tree (learn rules) Neural Networks (learn non-linear classifier) Support Vector Machines (SVM)

    Probabilistic or generative model based Nave Bayes classifier

    37Vector Space Model

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    37/75

    Prof. Pier Luca Lanzi

    Represent a doc by a term vectorTerm: basic concept, e.g., word or phraseEach term defines one dimensionN terms define a N-dimensional spaceElement of vector corresponds to term weightE.g., d = (x 1 ,,x N), x i is importance of term I

    New document is assigned to the most likely category basedon vector similarity

    38Illustration of the Vector Space Model

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    38/75

    Prof. Pier Luca Lanzi

    Java

    Microsoft

    StarbucksC2 Category 2

    C1 Category 1

    C3

    Category 3

    new doc

    39What VS Model Does Not Specify?

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    39/75

    Prof. Pier Luca Lanzi

    How to select terms to capture basic concepts Word stopping

    e.g. a, the, always, along

    Word stemming e.g. computer, computing, computerize => compute

    Latent semantic indexingHow to assign weights

    Not all words are equally important: Some are moreindicative than others

    e.g. algebra vs. science

    How to measure the similarity

    40How to Assign Weights

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    40/75

    Prof. Pier Luca Lanzi

    Two-fold heuristics based on frequency

    TF (Term frequency)More frequent within a document more relevant tosemanticsE.g., query vs. commercial

    IDF (Inverse document frequency)Less frequent among documents more discriminativeE.g. algebra vs. science

    41Term Frequency Weighting

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    41/75

    Prof. Pier Luca Lanzi

    WeightingThe more frequent, the more relevant to topicE.g. query vs. commercial Raw TF= f(t,d): how many times term t appears in doc d

    NormalizationWhen document length varies,

    then relative frequency is preferredE.g., Maximum frequency normalization

    42Inverse Document Frequency Weighting

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    42/75

    Prof. Pier Luca Lanzi

    Idea: The less frequent terms among documents are themore discriminative

    Formula:

    Givenn, total number of docsk, the number of docs with term t appearing(the DF document frequency)

    43TF-IDF Weighting

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    43/75

    Prof. Pier Luca Lanzi

    TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)Frequent within doc high tf high weightSelective among docs high idf high weight

    Recall VS modelEach selected term represents one dimensionEach doc is represented by a feature vector

    Its t-term coordinate of document d is the TF-IDF weightThis is more reasonable

    Just for illustration Many complex and more effective weighting variants existin practice

    44How to Measure Similarity?

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    44/75

    Prof. Pier Luca Lanzi

    Given two document

    Similarity definitiondot product

    normalized dot product (or cosine)

    45Illustrative Example

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    45/75

    Prof. Pier Luca Lanzi

    text mining travel map search engine govern president congressIDF(faked) 2.4 4.5 2.8 3.3 2.1 5.4 2.2 3.2 4.3

    doc1 2(4.8) 1(4.5) 1(2.1) 1(5.4)doc2 1(2.4 ) 2 (5.6) 1(3.3)doc3 1 (2.2) 1(3.2) 1(4.3)

    newdoc 1(2.4) 1(4.5)

    doc3

    textminingsearch

    enginetext

    traveltext

    maptravel

    governmentpresidentcongress

    doc1

    doc2

    To whom is newdocmore similar?

    Sim(newdoc,doc1)=4.8*2.4+4.5*4.5

    Sim(newdoc,doc2)=2.4*2.4

    Sim(newdoc,doc3)=0

    46Vector Space Model-Based Classifiers

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    46/75

    Prof. Pier Luca Lanzi

    What do we have so far?A feature space with similarity measureThis is a classic supervised learning problemSearch for an approximation to classification hyper plane

    VS model based classifiersK-NN

    Decision tree basedNeural networksSupport vector machine

    47Probabilistic Model

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    47/75

    Prof. Pier Luca Lanzi

    Category C is modeled as a probability distributionof predefined random eventsRandom events model the processof generating documents

    Therefore, how likely a document d belongsto category C is measured through the probabilityfor category C to generate d.

    48Quick Revisit of Bayes Rule

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    48/75

    Prof. Pier Luca Lanzi

    Category Hypothesis space: H = {C 1 , , C n}One document: D

    As we want to pick the most likely category C*, we can dropp(D)

    ( | ) ( )( | )( )

    i ii

    P D C P CP C DP D

    =

    Posterior probability of C i

    Document model for category C

    * argmax ( | ) argmax ( | ) ( )C CC P C D P D C P C= =

    49Probabilistic Model: Multi-Bernoulli

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    49/75

    Prof. Pier Luca Lanzi

    Event: word presence or absence

    D = (x 1 , , x |V| )x i =1 for presence of word w ix i =0 for absence

    Parameters

    {p(wi=1|C), p(wi=0|C)}p(wi=1|C)+ p(wi=0|C)=1

    | | | | | |

    1 | |1 1, 1 1, 0

    ( ( ,..., ) | ) ( | ) ( 1| ) ( 0| )i i

    V V V

    V i i i ii i x i x

    p D x x C p w x C p w C p w C= = = = =

    = = = = = =

    50Probabilistic Model: Multinomial

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    50/75

    Prof. Pier Luca Lanzi

    Event: word selection/sampling

    D = (n 1 , , n |V| )n i: frequency of word w in=n 1 ,++ n |V|

    Parameters

    {p(wi|C)}p(w 1 |C)+ p(w |v| |C) = 1

    | |

    1 | |1 | | 1( ( ,..., ) | ) ( | ) ( | )...

    i

    Vn

    v iV i

    np D n n C p n C p w Cn n =

    = =

    51Parameter Estimation

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    51/75

    Prof. Pier Luca Lanzi

    Category prior

    Multi-Bernoulli Doc model

    Multinomial doc model

    Training examples:

    C1 C2

    Ck

    E(C 1)

    E(C k)

    E(C 2)

    Vocabulary: V = {w 1 , , w |V| }

    1

    | ( ) |( )

    | ( ) |

    ii k

    j j

    E Cp C

    E C=

    =

    ( )( , ) 0.5

    1( 1| ) ( , )

    | ( )| 1 0i

    j jd E C

    j i ji

    w dif w occursind

    pw C w dEC otherwise

    +

    = = =+

    ( )| |

    1 ( )

    ( , ) 1( | ) ( , )

    ( , ) | |

    i

    i

    jd E C

    j i j jV

    m

    m d E C

    cw dpw C cw d countsof w ind

    cw d V

    =

    += =

    +

    52Classification of New Document

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    52/75

    Prof. Pier Luca Lanzi

    1 | |

    | |

    1| |

    1

    ( ,..., ) {0,1}* argmax ( | ) ( )

    argmax ( | ) ( )

    argmax log ( ) log ( | )

    V

    C

    V

    C i i

    iV

    C i ii

    d x x xC P D C P C

    pw x C P C

    pC p w x C

    =

    =

    = =

    = =

    = + =

    1 | | 1 | |

    | |

    1

    | |

    1

    | |

    1

    ( ,..., ) | | ...* argmax ( | ) ( )

    argmax ( | ) ( | ) ( )

    argmax log ( | ) log ( ) log ( | )

    argmax log ( ) log ( | )

    i

    V V

    C

    Vn

    C ii

    V

    C i ii

    V

    C i ii

    d n n d n n nC P D C P C

    pn C p w C P C

    pn C pC n p w C

    pC n p w C

    =

    =

    =

    = = = + +=

    =

    = + +

    +

    Multi-Bernoulli Multinomial

    53Categorization Methods

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    53/75

    Prof. Pier Luca Lanzi

    Vector space modelK-NNDecision treeNeural networkSupport vector machine

    Probabilistic model

    Nave Bayes classifier

    Many, many others and variants exist

    e.g. Bim, Nb, Ind, Swap-1, LLSF, Widrow-Hoff, Rocchio,Gis-W,

    54Evaluation (1)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    54/75

    Prof. Pier Luca Lanzi

    Effectiveness measureClassic: Precision & Recall

    Precision

    Recall

    55Evaluation (2)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    55/75

    Prof. Pier Luca Lanzi

    BenchmarksClassic: Reuters collection A set of newswire stories classified under categories related

    to economics.

    EffectivenessDifficulties of strict comparison

    different parameter setting different split (or selection) between training and testing various optimizations

    However widely recognizable Best: Boosting-based committee classifier & SVM

    Worst: Nave Bayes classifierNeed to consider other factors, especially efficiency

    56Summary: Text Categorization

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    56/75

    Prof. Pier Luca Lanzi

    Wide application domainComparable effectiveness to professionalsManual Text Classification is not 100% and unlikely toimprove substantially

    Automatic Text Classification is growing at a steady paceProspects and extensions

    Very noisy text, such as text from O.C.R.

    Speech transcripts

    57Research Problems in Text Mining

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    57/75

    Prof. Pier Luca Lanzi

    Google: what is the next step?How to find the pages that match approximately thesophisticated documents, with incorporation of user-profilesor preferences?

    Look back of Google: inverted indiciesConstruction of indicies for the sophisticated documents,with incorporation of user-profiles or preferencesSimilarity search of such pages using such indicies

    58References for Text Mining

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    58/75

    Prof. Pier Luca Lanzi

    Fabrizio Sebastiani, Machine Learning in Automated TextCategorization, ACM Computing Surveys, Vol. 34, No.1,March 2002Soumen Chakrabarti, Data mining for hypertext: A tutorial

    survey, ACM SIGKDD Explorations, 2000.Cleverdon, Optimizing convenient online accesss tobibliographic databases, Information Survey, Use4, 1, 37-47, 1984Yiming Yang, An evaluation of statistical approaches to textcategorization, Journal of Information Retrieval, 1:67-88,1999.Yiming Yang and Xin Liu A re-examination of textcategorization methods. Proceedings of ACM SIGIRConference on Research and Development in InformationRetrieval (SIGIR'99, pp 42--49), 1999.

    59World Wide Web: a brief history

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    59/75

    Prof. Pier Luca Lanzi

    Who invented the wheel is unknownWho invented the World-Wide Web ?

    (Sir) Tim Berners-Leein 1989, while working at CERN,invented the World Wide Web,including URL scheme, HTML, and in1990 wrote the first server and thefirst browserMosaic browser developed by MarcAndreessen and Eric Bina at NCSA

    (National Center for SupercomputingApplications) in 1993; helped rapidweb spreadMosaic was basis for Netscape

    60What is Web Mining?

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    60/75

    Prof. Pier Luca Lanzi

    Discovering interesting and useful informationfrom Web content and usage

    ExamplesWeb search, e.g. Google, Yahoo, MSN, Ask, Specialized search: e.g. Froogle (comparison shopping),

    job ads (Flipdog)eCommerceRecommendations (Netflix, Amazon, etc.)Improving conversion rate: next best product to offer

    Advertising, e.g. Google AdsenseFraud detection: click fraud detection, Improving Web site design and performance

    61How does it differ from classicalData Mining?

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    61/75

    Prof. Pier Luca Lanzi

    The web is not a relationTextual information and linkage structure

    Usage data is huge and growing rapidlyGoogles usage logs are bigger than their web crawlData generated per day is comparable to largestconventional data warehouses

    Ability to react in real-time to usage patterns

    No human in the loop

    Reproduced from Ullman & Rajaramanwith permission

    62How big is the Web?

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    62/75

    Prof. Pier Luca Lanzi

    The number of pages is technically, infiniteBecause of dynamically generated contentLots of duplication (30-40%)

    Best estimate of unique static HTML pages comes fromsearch engine claims

    Google = 8 billion

    Yahoo = 20 billionLots of marketing hype

    Reproduced from Ullman & Rajaramanwith permission

    63Netcraft Survey:76,184,000 web sites (Feb 2006)

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    63/75

    Prof. Pier Luca Lanzi

    http://news.netcraft.com/archives/web_server_survey.html

    Netcraft survey

    64Mining the World-Wide Web

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    64/75

    Prof. Pier Luca Lanzi

    The WWW is huge, widely distributed, global informationservice center for

    Information services: news, advertisements, consumerinformation, financial management, education,

    government, e-commerce, etc.Hyper-link informationAccess and usage information

    WWW provides rich sources for data miningChallenges

    Too huge for effective data warehousing and data miningToo complex and heterogeneous: no standards and

    structure

    65Web Mining: A more challenging task

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    65/75

    Prof. Pier Luca Lanzi

    Searches forWeb access patternsWeb structuresRegularity and dynamics of Web contents

    ProblemsThe abundance problemLimited coverage of the Web: hidden Web sources,

    majority of data in DBMSLimited query interface based on keyword-oriented searchLimited customization to individual users

    66Web Mining Taxonomy

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    66/75

    Prof. Pier Luca Lanzi

    Web Mining

    Web StructureMiningWeb ContentMining

    Web Page

    Content MiningSearch Result

    Mining

    Web UsageMining

    General Access

    Pattern Tracking

    Customized

    Usage Tracking

    67Mining the World-Wide Web

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    67/75

    Prof. Pier Luca Lanzi

    Web Mining

    Web StructureMining

    Web Page Content Mining

    Web Page SummarizationWebLog , WebOQL :Web Structuring query languages;Can identify information within given webpagesAhoy! :Uses heuristics to distinguish personalhome pages from other web pagesShopBot : Looks for product prices within webpages

    Web Content

    Mining

    Search ResultMining

    Web UsageMining

    General AccessPattern Tracking

    CustomizedUsage Tracking

    68Mining the World-Wide Web

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    68/75

    Prof. Pier Luca Lanzi

    Web Mining

    Web UsageMining

    General AccessPattern Tracking

    CustomizedUsage Tracking

    Web StructureMining

    Web ContentMining

    Web PageContent Mining

    Search Result Mining

    Search Engine Result

    SummarizationClustering Search Result :Categorizes documents usingphrases in titles and snippets

    69Mining the World-Wide Web

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    69/75

    Prof. Pier Luca Lanzi

    Web Mining

    Web Structure MiningUsing LinksPageRankCLEVERUse interconnections between web pages to giveweight to pages.

    Using GeneralizationMLDB, VWVUses a multi-level database representation of theWeb. Counters (popularity) and link lists are usedfor capturing structure.

    Web ContentMining

    Web PageContent Mining

    Search ResultMining

    Web UsageMining

    General AccessPattern Tracking

    CustomizedUsage Tracking

    70Mining the World-Wide Web

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    70/75

    Prof. Pier Luca Lanzi

    Web Mining

    Web StructureMining

    Web ContentMining

    Web PageContent Mining

    Search ResultMining

    General Access Pattern Tracking

    Web Log MiningUses KDD techniques to understandgeneral access patterns and trends.Can shed light on better structure andgrouping of resource providers.

    Web UsageMining

    CustomizedUsage Tracking

    71Mining the World-Wide Web

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    71/75

    Prof. Pier Luca Lanzi

    Web Mining

    Customized Usage Tracking

    Adaptive Sites

    Analyzes access patterns of each user at atime. Web site restructures itself automaticallyby learning from user access patterns.

    Web UsageMining

    General AccessPattern Tracking

    Web StructureMining

    Web ContentMining

    Web PageContent Mining

    Search ResultMining

    72Web Usage Mining

    Mi i g W b l g d t di tt f

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    72/75

    Prof. Pier Luca Lanzi

    Mining Web log records to discover user access patterns of Web pagesApplications

    Target potential customers for electronic commerce

    Enhance the quality and delivery of Internet informationservices to the end userImprove Web server system performanceIdentify potential prime advertisement locations

    Web logs provide rich information about Web dynamicsTypical Web log entry includes the URL requested, the IPaddress from which the request originated, and a

    timestamp

    73Web Usage Mining

    Understanding is a pre requisite to improvement

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    73/75

    Prof. Pier Luca Lanzi

    Understanding is a pre-requisite to improvement1 Google, but 70,000,000+ web sites

    What Applications?

    Simple and Basic Monitor performance, bandwidth usage Catch errors (404 errors- pages not found)

    Improve web site design (shortcuts for frequent paths,remove links not used, etc)

    Advanced and Business Critical

    eCommerce: improve conversion, sales, profit Fraud detection: click stream fraud,

    74Techniques for Web usage mining

    Construct multidimensional view on the Weblog database

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    74/75

    Prof. Pier Luca Lanzi

    Construct multidimensional view on the Weblog databasePerform multidimensional OLAP analysis to find the top Nusers, top N accessed Web pages, most frequentlyaccessed time periods, etc.

    Perform data mining on Weblog recordsFind association patterns, sequential patterns, and trendsof Web accessing

    May need additional information,e.g., user browsingsequences of the Web pages in the Web server buffer

    Conduct studies to

    Analyze system performance, improve system design byWeb caching, Web page prefetching, and Web pageswapping

    75References

    Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Extracting Content Structure forb b d l h f h f b f

  • 7/28/2019 Machine Learning and Data Mining 19 Mining Text and Web Data 26716

    75/75

    Prof. Pier Luca Lanzi

    Web Pages based on Visual Representation, The Fifth Asia Pacific Web Conference,2003.Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, VIPS: a Vision-based PageSegmentation Algorithm, Microsoft Technical Report (MSR-TR-2003-79), 2003.Shipeng Yu, Deng Cai, Ji-Rong Wen and Wei-Ying Ma, Improving Pseudo-RelevanceFeedback in Web Information Retrieval Using Web Page Segmentation, 12thInternational World Wide Web Conference (WWW2003), May 2003.

    Ruihua Song, Haifeng Liu, Ji-Rong Wen and Wei-Ying Ma, Learning Block ImportanceModels for Web Pages, 13th International World Wide Web Conference (WWW2004),May 2004.Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Block-based Web Search,SIGIR 2004, July 2004 .Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma, Block-Level Link Analysis, SIGIR

    2004, July 2004 .Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen and Hong-Jiang Zhang, OrganizingWWW Images Based on The Analysis of Page Layout and Web Link Structure, TheIEEE International Conference on Multimedia and EXPO (ICME'2004) , June 2004Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen, Hierarchical Clusteringof WWW Image Search Results Using Visual, Textual and Link Analysis,12th ACMInternational Conference on Multimedia, Oct. 2004 .