
Web Mining through the Online Analyst

Alessandro Zanasi
IBM Global Services
Bologna KDD Center, Italy

Abstract

While the amount of data available to us through the Web and intranets is increasing, our capacity for reading and analyzing this information remains constant. Search engines, instead of reducing the problem, aggravate it by making more and more documents quickly available to us. Web mining is a new research area that tries to solve the information overload problem by exploiting recent advances in different fields of technology. Documents and web pages are a source of knowledge in an unstructured data format that can be decoded, analyzed and turned into actionable intelligence through online analysis by text mining.

In this paper we give an example of a real, operational application to competitive intelligence (CI): the Online Analyst. The final objective of this application is to give end users an intelligent agent that reads and quickly analyzes huge volumes of documents retrieved online, especially from the web. Its typical users are intelligence analysts, whether in the military, business or political field.

The most exciting conclusion is that the approach shown here, like the prototype presented in a CI application case, can already be used easily in all e-commerce applications (currently under development worldwide) that need to work with web and/or text data.

1 Overview

1.1 Introduction

This paper extends the subject of a previous work [Za98], especially in three areas, taking into account the possibility
• of utilizing online technology (with the results available to a non-specialist end user in a few seconds),

Data Mining II, C.A. Brebbia & N.F.F. Ebecken (Editors) © 2000 WIT Press, www.witpress.com, ISBN 1-85312-821-X


• of using free public sources such as press surveys and web pages,

• of analyzing the content of these documents taking into account the free text (using linguistic analysis to solve ambiguity and meaning problems) instead of limiting our analysis to codified information.

Despite its name, this solution (the Online Analyst) does not aim to replace the analyst. On the contrary, it gives the analyst a powerful new tool for staying within the analyst's proper domain: analysis, rather than data gathering and summarizing. It is especially well suited to web mining.

1.2 KDD/Data, Text and Web Mining

1.2.1 KDD/Data Mining

Knowledge Discovery in Databases (or KDD) through data mining is the process of extracting previously unknown, valid, and actionable knowledge from large databases and then using it to make crucial business decisions [Za97]. The mined data are generally structured data, very often concerning marketing problems [CFZ98], [PTZOO].

1.2.2 Text Mining

1.2.2.1 Introduction Text mining is a particular form of data mining where the data are in textual format. Since the data are in an unstructured format, the data preparation step is longer than usual and requires linguistic preprocessing [GG96], [GG98] to resolve the ambiguities (even if only partially). In this sense text mining extends the concepts of content analysis [We90], which did not take into account the possibility of working on the text before analyzing it, in the direction shown by [Man97]. The objective remains, in any case, the same: discovering the knowledge contained in the database without defining our topic of research in advance.

The proposed solution differs from other solutions aimed at offering more efficient ways of searching textual databases once we have defined our topic of research (even though it takes advantage of the most advanced tools in this field, e.g. by integrating web search engines).

This second approach, aimed at offering better search and navigation, rests on the consideration that many corpora, such as Internet directories, digital libraries and patent databases, are manually organized into topic hierarchies, also called taxonomies. The key concept in this line of research is defining techniques to automatically separate the feature words, or discriminants, from the noise words at each node of the taxonomy (see [CDAR98]).

Just as in data mining the data consist of subjects (e.g. customers) defined by a set of features (e.g. age, salary, accounts...), in text mining the data consist of documents defined by a set of words.

The principal difference is that, in data mining, the features that characterize a subject number in the tens or hundreds, with limited variability, whereas in text mining the words that characterize a document number in the thousands, with a variability of hundreds of thousands.



Feature selection in the case of text mining needs particular attention and elaboration, with help coming from automatic filtering and/or human intervention.

1.2.2.2 Linguistic step In this linguistic step (using dictionaries, syntactic taggers, lemmatization engines and a specific IBM software product: Intelligent Miner for Text), the Online Analyst:

• resolves the ambiguities linked to the language (of course, some level of ambiguity cannot be eliminated). In this step, for instance, the words:
- 'record' in the English sentence "we record this record"
- 'couvent' in the French sentence "les poules du couvent couvent"
- 'i', 'vitelli', 'dei', 'romani', 'sono', 'belli' in the Italian or Latin sentence "i vitelli dei romani sono belli"
are recognized in their meaning;
• recognizes the semantic value of the words (using electronic dictionaries, with synonym maps, for the languages we want to manage);
• lemmatizes the words ("International Business Machines" is transformed into "IBM"; "loves", "love", "loved" are transformed into "to love");
• and automatically indexes the documents (i.e. associates the key concepts/best indexes with the chosen document).

1.2.2.3 Data mining step Then, in the data mining step, the documents are collected in clusters, depending on their content. Connections/links/similarities among the authors or different research/political trends are discovered.

1.2.2.4 Rules discovery step In this step the concepts identified as interesting signals are detected. For example, if an interesting signal to be recognized is the 'partnership' concept, the Online Analyst, after recognizing the names of the actors of the partnership (e.g. the company names X and Y, even if they were not previously defined) and the concept of partnership (even if the word "partnership" is not used), lists all of them in the format: 'Partnership (X,Y)'.

1.2.3 Web mining

Web mining is the coupling of web crawling with online text mining.

1.3 The Data sources

1.3.1 Web data

The Internet is becoming one of the main media through which documents, data and information can be obtained for CI targets [Za95]. The main advantage is that this flood of information is often free. The main disadvantage is that there is generally too much of this information, and it is not easy to detect where the important piece of information is located.



1.3.2 On line data banks

The on line data banks are collections of specialized information, generally accessible through the Internet by subscription. Examples are the data banks dedicated to press surveys, to patents or to scientific articles (chemical, physical or mathematical).

Some of the best known on line data banks are:
• WPIL - World Patent Index Latest (Derwent)
• Chemical Abstracts
• INSPEC
• Medline
• RBB (Reuters Business Briefings)
• Tulsa
available directly or through information brokers such as:
• EINS (European Information Network Services)
• The Dialog Corporation
• Reuters
• Bloomberg

1.3.3 Private sources

A company may have a private electronic data bank (built up over the years) that can be merged with other sources. The format and content of the documents in this private data bank are generally completely different from those of documents coming from online data banks.

2 The information overload problem

2.1 Competitive intelligence (CI)

The idea of applying supercomputer power and the text mining process to analyze thousands of documents at the same time for CI purposes arose from the following consideration. In our advanced Information Age, everything that has some relevance (VIPs' declarations, technology advancements, political decisions, scientific findings...) is probably present in electronic format in an on-line data bank or on a web site, hidden by too many other pieces of information [Za99].

The problem lies in how to bring out this hidden piece of relevant information, or weak signal. One solution consists in relating the weak signals to one another, automatically through text mining, in order to detect the relevant information. So, to understand the competitive arena (the CI target), we need to analyze quickly the maximum possible amount of the knowledge available through the Net, accessible remotely through a TCP/IP connection.



2.2 The problems

To profit fully from this knowledge, two problems must be solved.
1) The available documents are too many to be read quickly by only one person.
2) The knowledge contained in an article is present at different, sometimes hidden, levels.

Of course, the embedded knowledge can be brought out only by placing this huge volume of data in the correct strategic framework (search for time trends; invisible networks [Pr62], i.e. networks of speakers and/or researchers, journalists, newspapers, pools of companies or research centers; discovery of the correct relations among the documents...).

Several problems come from the need of
- taking into account documents from different sources, media, languages and formats, with the capacity to distinguish synonyms (different words with the same meaning) and words with different meanings, and to analyze context,
- reducing the complexity by dividing the total document mass into topics, without having previously defined what these topics are,
- presenting a rough description of what these topics are about, with indicators of their content (volume, context...),
- discovering networks, giving an outlook of the relations that exist among the topics (which at first sight might not seem linked to one another when in reality they are),
- discovering signals of change, such as alliances or forecasts, that may be "drowned" in the text,
- discovering the names and actions of new entrants in a market,
- discovering the different levels of knowledge hidden inside.

2.3 The required solution

The solution to the previous problems needs
1. to have available all the documents really of interest for the analyst's activity (by profiling), coming from different and heterogeneous sources (such as Web sites, press news, proprietary data...) (data selection and collection), shown as a single source, in huge volumes (merging), updated regularly (updating),
2. to access the documents regarding a subject of interest (query),
3. to have these documents grouped in different clusters according to the different subjects they treat, without having previously defined these subjects (clustering),
4. to have the connections among these groups made evident (visualization),
5. to discover trends and trend breaks (time trends),
6. to have the principal concepts expressed in them made evident (rules discovery),
7. to retrieve the final document (retrieval);
ensuring that all these activities are performed within a few seconds, with the capability of detecting in the data the important strategic signals referring to Alliances, JVs, Partnerships, New Entrants, Technology.



3 The Online Analyst solution

3.1 Overview

The CI Online Analyst is a solution consisting of software, hardware and consulting services (for text analysis, architecture design, competitive intelligence expertise). It extracts information and insight from large collections of documents through search, analysis, categorization (without the need to define the categories beforehand) and navigation capabilities.

It establishes relationships, extracts concepts and stimulates new ideas through navigation and visualization of textual data from databases originating from various sources, which can be combined in order to enable cross-fertilization of knowledge.

The CI Online Analyst can quickly categorize and discover the factual content of document collections, and explore relationships between the selected categories, highlighting the strategic signals, with the capacity to trigger an alert system.

To do this, it uses some of the most recent methods for automatically analyzing a downloaded data set [HQD91], [Hu96] in successive steps. An innovative approach, applicable to text mining, has been described in [Ma86].

3.2 The process

We describe here the different steps necessary to start the Online Analyst solution.

3.2.1 Constitution of the KDD/Data Mining Center

It is necessary to have a center where the documents (or their abstracts) are collected and analyzed and where the CPU power and software engine are installed. This center may be in-house or outsourced.

The Online Analyst was developed using the resources of the Bologna KDD Center [BKC95], to which we refer in what follows. This Center was established in 1995 as a partnership between IBM and Cineca (Italian Universities Supercomputing Center).

Currently this center is based on the following resources:
• a group of highly skilled professionals, specialized in data mining, mathematics, statistics and computer science;
• know-how in competitive intelligence;
• computing power (a parallel computer: IBM SP2, with 36 nodes);
• software (of different nature and sources);
• connections to the most important worldwide online data banks (Cineca is the EINS [EINS98] National Center), to easily collect data.



3.2.2 Profiling

In this step the target of interest is established. For instance, a profile may be prepared regarding:
A) companies to be tracked (for each company, its controlled companies and the names of its best known executives/gurus/owners may also be taken into account: e.g., IBM+Lou Gerstner),
B) topics to be tracked (for each topic, its most common synonyms are also taken into account: e.g., CRM+Customer Relationship Management),
C) some typical CI concepts (e.g., Alliances, JVs, Partnerships, Forecasts, Particular actions...). A particular type of search on these concepts is performed later.
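As a minimal sketch of how such a profile might be represented and applied, the Python fragment below stores each tracked entry together with its associated names or synonyms and reports which entries a document mentions. The PROFILE layout and the match_profile helper are illustrative assumptions, not the Online Analyst's actual data model; only the two example entries come from the text.

```python
import re

# Hypothetical profile layout: kind -> tracked entry -> associated names/synonyms.
PROFILE = {
    "companies": {"IBM": ["IBM", "Lou Gerstner"]},
    "topics": {"CRM": ["CRM", "Customer Relationship Management"]},
}

def match_profile(text, profile=PROFILE):
    """Return {kind: sorted list of tracked entries mentioned in text}."""
    hits = {}
    for kind, entries in profile.items():
        found = []
        for entry, variants in entries.items():
            # Whole-word, case-insensitive match on any variant of the entry.
            pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(entry)
        if found:
            hits[kind] = sorted(found)
    return hits
```

A document mentioning only "Lou Gerstner" is thus attributed to the tracked company IBM, which is the effect the profile is meant to achieve.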

3.2.3 Data selection

In this step the sources from which the data will be collected are established. They are of two types:

3.2.3.1 Press news A documentation center (e.g. the enterprise library) selects the articles of interest by applying the previously defined profile to several sources (even more than 3000): magazines, newspapers, database syndicators, press agencies. All the articles matching the characteristics defined in the profile are periodically and automatically retrieved and sent to the central repository located in the KDD Center, where they are put in the specific database for the specific subject.

3.2.3.2 The web A group of web crawlers installed in the KDD Center looks for, selects and sends to the central repository all the web pages that appear on the profiled public web sites.

3.2.4 Data filtering/linguistic analysis

3.2.4.1 Document receiving When the documents (from the press or the web) arrive at the KDD Center, the activity of data filtering starts.

We must recognize two different approaches to document filtering, depending on the characteristics of the two possible types of documents: highly structured documents and unstructured documents.

3.2.4.2 Structured documents On receiving "highly structured" documents, the Online Analyst employs 'Filters' (written in the Perl language) to break the document(s) down into their respective components, such as Title, Abstract, Main Document (possibly consisting of chapters and conclusions), and Bibliography/List of References. In this step the various original formats of the documents (e.g. Medline or WPIL format, PDF pictures, images...) are converted into a new, homogeneous format in XML, recognizing and separating the structured content of the best known data bank formats (such as Medline, WPIL...) from the unstructured content (free text).

In this step the original documents, written in heterogeneous formats (e.g. in Medline format), are transformed into new documents with fields such as <TITLE>, <AUTHOR>, <ABSTRACT>, <FULL TEXT>, ...
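The following sketch illustrates what such a filter might do. It is written in Python rather than Perl and is not the Online Analyst's own filter: it assumes a simplified tagged record layout ("TI  - ...", "AU  - ...", "AB  - ...", with indented continuation lines) in the style of Medline exports, and the FIELD_MAP table and the <DOCUMENT> wrapper element are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed mapping from source field tags to the homogeneous XML field names.
FIELD_MAP = {"TI": "TITLE", "AU": "AUTHOR", "AB": "ABSTRACT"}
# A field line starts at column 0 with a 2-4 letter tag followed by a dash.
FIELD_LINE = re.compile(r"^([A-Z]{2,4})\s*-\s?(.*)$")

def record_to_xml(record):
    """Convert one simplified tagged record into a homogeneous XML string."""
    doc = ET.Element("DOCUMENT")
    current = None
    for line in record.splitlines():
        m = FIELD_LINE.match(line)
        if m:
            name = FIELD_MAP.get(m.group(1))
            # Unknown tags are skipped, together with their continuation lines.
            current = ET.SubElement(doc, name) if name else None
            if current is not None:
                current.text = m.group(2).strip()
        elif current is not None and line.strip():
            # Indented continuation line: append to the field being read.
            current.text += " " + line.strip()
    return ET.tostring(doc, encoding="unicode")
```

One filter of this kind would be needed per source format; all of them emit the same set of XML fields, so that the later steps can ignore where a document came from.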



3.2.4.3 Unstructured documents In this step the unstructured textual material is transformed into a format that can be analyzed mathematically. The objective of this step is to extract the features [DGS99] from the document automatically, that is, to recognize and classify the items present in natural language texts. This objective is reached through linguistic analysis, whose three basic components are:
1. A language model, offered in the principal languages, that provides the grammatical knowledge to break sentences into their basic components, including nouns, verbs, adjectives, dates etc., selecting the new keywords from the title, the abstract or the whole document.
2. A generic dictionary of words and multi-word expressions, which can be enhanced with industry-specific dictionaries to support the process of categorizing highly technical documents.
3. A "Relations Extraction" engine. All the sentences containing relations of the form "subject X" "verb or expression meaning" "subject Y" are detected. This component allows the detection of the so-called "signals" (e.g., alliances).

The extracted information is automatically classified into the following categories:
• Names of persons, organizations and places (like Mr. Bill Gates, Microsoft Inc., Seattle, USA)
• Multiword terms (Asset Liability Management, British Telecom)
• Abbreviations (ALM, for Asset Liability Management; BT, for British Telecom)
• Relations (Bill Gates-President-Microsoft, Compaq-owns-DEC)
• Others: dates, currencies, textual forms of numbers, etc.

As a result of the linguistic analysis, a matrix is created with rows associated with the individual documents to be analyzed and with columns associated with the keywords contained in the 'dictionary': <WORD1>, <WORD2>, ..., <WORD N>. The implementation of feature extraction relies on linguistically motivated heuristics and pattern matching with part-of-speech information.
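To make the shape of this matrix concrete, here is a minimal Python sketch. It assumes single-word keywords and binary cells (1 if the keyword occurs in the document, 0 otherwise); the text does not say whether the real cells hold presence flags, counts or weights, and the build_matrix name is hypothetical.

```python
import re

def build_matrix(documents, dictionary):
    """Return one row per document and one 0/1 column per dictionary keyword."""
    rows = []
    for text in documents:
        # Crude tokenization; the real system relies on linguistic analysis instead.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        rows.append([1 if word.lower() in tokens else 0 for word in dictionary])
    return rows
```

With thousands of dictionary words per collection, most cells of such a matrix are zero, which is why feature selection matters so much in text mining.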

A canonical form is always assigned to each feature that has been found in the document. A canonical form abstracts from the different morphological variants of a single term (e.g., singular and plural forms are mapped to the same canonical form). It is the most explicit, least ambiguous name among the variants referring to the subject. This approach reduces the ambiguity of the variants: in one document IBM is associated with International_Business_Machines, in another with Industrie_Bolognesi_Meccaniche.
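The abbreviation case can be illustrated with a small Python sketch. It is only a heuristic chosen for illustration, not the engine's actual method: it looks, inside the document itself, for a run of capitalized words whose initials spell the abbreviation, and returns that run as the canonical form. The canonical_form name is hypothetical, and morphological variants (singular/plural) are not covered.

```python
import re

def canonical_form(abbreviation, text):
    """Return the expansion of abbreviation found in text, joined by underscores, or None."""
    n = len(abbreviation)
    words = re.findall(r"[A-Za-z]+", text)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # Candidate expansion: capitalized words whose initials spell the abbreviation.
        if all(w[0].isupper() for w in window) and \
                "".join(w[0] for w in window) == abbreviation:
            return "_".join(window)
    return None
```

Because the lookup is done per document, the same abbreviation can resolve to different canonical forms in different documents, as in the IBM example above.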

3.2.4.4 Data enrichment Then we add other columns associated with chosen keywords, e.g. <COMPETITORS>. This means that all the competitors' names (previously declared as competitors) are recognized not as simple names but as 'competitors' names, and are put in this column too. Other columns are the <COMPANIES>, <PRODUCTS>, <SPECIAL TOPICS> (e.g. CRM), ..., <KEYWORD> fields.

A column for each CI topic is added (e.g. <ALLIANCE>...).

3.2.4.5 Link connection A link between this document and the original document (on the web site or in the data bank) is added. Thanks to this link, the original document may be retrieved when desired.

3.2.5 Optional creation of document subset

Although the CI Online Analyst can handle tens of thousands of documents, it is often desirable and/or practical to create subsets of the total document collection, for instance by identifying "all documents/parts of documents containing a certain keyword, or combination of keywords". This means that a specific dataset is created for each competitor (if we are facing CI problems) or business topic, and stored in the central repository. At the end of these steps, several different datasets are organized, each consisting of thousands of documents. Each dataset covers a different topic to be investigated. The same document may be contained in more than one dataset.

3.2.6 Access

The only software the user needs to interact with the data mining center is a web browser. A unique user name and password are provided to the user to view results via the web browser. Multiple customers can have concurrent HTTP access to the system. To extract a subset of documents from the total collection, specific queries are run through a search engine (such as IBM Intelligent Miner for Text).

The result may consist of, for instance, 1000 documents: too many to be read in one day or one week. So we must mine them to extract the knowledge they contain without having to read them. One text mining technique that we may use is clustering.

3.2.7 Summarization

The matrix created by linguistic analysis is analyzed by a mathematical/statistical method named "Relational Analysis" [Ma91], which automatically creates clusters (an unsupervised categorization method) of documents exhibiting a high degree of similarity, as measured by the significant overlap of their lists of keywords. Each document is assigned to one and only one cluster, the one containing the documents most similar to it. Each cluster is labeled uniquely by a list of the most significant keywords it contains. Relationships between clusters are shown by colored bars.

A special group, named "Others", collects all the documents that are not related to the other ones: they contain new companies, or references to new technologies. It acts as an "alert system", where we can check the novelties that have appeared on the market.
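The behaviour described in this step (one cluster per document, no predefined categories, an "Others" group for unrelated documents) can be illustrated with a deliberately simple stand-in. The sketch below is not Relational Analysis: it is a greedy single-pass clustering based on Jaccard keyword overlap, and the threshold value, the function names and the input layout are assumptions made for illustration.

```python
def jaccard(a, b):
    """Keyword overlap: shared keywords divided by all keywords of both sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(doc_keywords, threshold=0.3):
    """doc_keywords: {doc_id: set of keywords}. Returns (clusters, others)."""
    clusters = []  # each cluster is a list of document ids
    for doc_id, keywords in doc_keywords.items():
        best, best_sim = None, threshold
        for members in clusters:
            # Similarity to a cluster = best similarity to any of its members.
            sim = max(jaccard(keywords, doc_keywords[m]) for m in members)
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([doc_id])  # no cluster is similar enough: start a new one
        else:
            best.append(doc_id)        # join the single most similar cluster
    # Documents left alone in their cluster form the "Others" alert group.
    others = [c[0] for c in clusters if len(c) == 1]
    return [c for c in clusters if len(c) > 1], others
```

The result of a greedy pass depends on the order in which documents arrive, which is one reason a global method such as Relational Analysis is preferable in practice.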



3.2.8 Rules discovery

Since an important aspect of the analyst's job is to recognize the weak market signals that make it possible to forecast future actions, the Online Analyst recognizes signals of change in the competitive environment by looking for particular signals, such as top executives' declarations, partnerships, alliances or joint ventures, and forecasts published in the media. The Online Analyst detects these signals after they have been transformed into rules. E.g.:
IF you discover <company_X><trigger_sentence><company_Y>
THEN write Trigger (X,Y).
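A rule of this form can be sketched in Python with a regular expression. The sketch is a simplification under stated assumptions: company names are taken from a list supplied by the caller (whereas the text says the real system can also recognize names not previously defined), and the TRIGGERS phrases and the find_partnerships name are hypothetical.

```python
import re

# Hypothetical trigger phrases that signal a partnership without naming it.
TRIGGERS = ["teams up with", "signs an agreement with", "joins forces with"]

def find_partnerships(text, companies):
    """Apply: IF <company_X><trigger_sentence><company_Y> THEN Partnership (X,Y)."""
    names = "|".join(re.escape(c) for c in companies)
    triggers = "|".join(re.escape(t) for t in TRIGGERS)
    rule = re.compile(rf"\b({names})\s+(?:{triggers})\s+({names})\b")
    return [f"Partnership ({x},{y})" for x, y in rule.findall(text)]
```

Note that none of the trigger phrases contains the word "partnership": the concept is recognized through the expressions that convey it.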

3.2.9 Viewing the results

Advanced interactive graphics are used to make the documents and the related clusters easier to manage. Results can be visualized with a cluster map that provides an overview of the categories and allows the end user to drill down into each category or 'topic' to find individual documents. Web browser technology can be used as a graphical tool to display the Online Analyst results, using Java applets.

3.2.10 Examples of results

3.2.10.1 Relationships between business and/or research areas Two areas which, from a superficial point of view, are completely different can turn out to be very close, sharing the same basic technologies.

Is this the signal of a possible synergy? Is someone already trying to profit from it?

This is useful for discovering opportunities, e.g. areas where it is sufficient to change our mindset or approach to profit from them at once.

The relations are automatically highlighted through colored bars.

3.2.10.2 Time trends Using the "time" variable, we can assess how strategies change over time. We can discover that some areas are disappearing and that others are growing in importance or interest.
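A minimal way to expose such trends is to count the documents per topic and per year and compare the ends of each series. The Python sketch below assumes that each document has already been reduced to a (year, topic) pair; the function names and the first-versus-last comparison are illustrative choices, not the method used by the Online Analyst.

```python
from collections import Counter

def topic_trends(documents):
    """documents: iterable of (year, topic) pairs. Returns {topic: {year: count}}."""
    counts = Counter(documents)
    trends = {}
    for (year, topic), n in sorted(counts.items()):
        trends.setdefault(topic, {})[year] = n
    return trends

def direction(series):
    """Label a {year: count} series by comparing its last year with its first."""
    years = sorted(series)
    first, last = series[years[0]], series[years[-1]]
    return "growing" if last > first else "declining" if last < first else "stable"
```

A trend break would show up as a change of direction between consecutive periods, which this two-point comparison does not detect.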

3.2.10.3 Competitors We can discover that our most dangerous competitor isn't the best known one (which, on the contrary, may be leaving the sector!) but a new one, perhaps located far from our business sector or geographic area, that is preparing new tools.

If the target of the analysis is a business area, a list of the companies involved is presented.

3.2.10.4 Alliances In what area and with whom is our most dangerous competitor forming alliances?

If we flag this key (<ALLIANCE>) on the first page, all the documents where alliances are described are collected and a summary is given. E.g. we can discover that in 25 articles there are mentions of an alliance between IBM and CISCO, in 20 between IBM and KPMG... The documents of interest may easily be accessed.
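The summary described here amounts to counting, for each pair of companies, the number of distinct documents that mention an alliance between them. A possible Python sketch follows; it assumes the alliance mentions have already been extracted as (document id, company, company) triples, and the alliance_summary name and input layout are hypothetical.

```python
from collections import Counter

def alliance_summary(mentions):
    """mentions: iterable of (doc_id, company_a, company_b) triples.
    Returns [(pair, number of distinct documents)] sorted by decreasing count."""
    docs_per_pair = {}
    for doc_id, a, b in mentions:
        # (IBM, CISCO) and (CISCO, IBM) describe the same alliance.
        pair = tuple(sorted((a, b)))
        docs_per_pair.setdefault(pair, set()).add(doc_id)
    counts = Counter({pair: len(docs) for pair, docs in docs_per_pair.items()})
    return counts.most_common()
```

Keeping the set of document ids for each pair, rather than only the count, is what would let the user jump from the summary to the underlying documents.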

3.2.10.5 Forecasting A well known forecasting method is the Delphi method. It consists in interviewing all the experts in a defined domain and recording/averaging their forecasts. An "automated Delphi method" may be implemented, using textual analysis to collect all the forecasts that have appeared in the press and then to present a summary of them.

3.2.10.6 Scientists/experts/top executives Another possible objective is detecting the names of scientists, experts or top executives who work on a topic, and their particulars.

If a scientist who has always worked in a certain area starts working in another one, is it a signal that he changed his interests or that he found a new methodology to solve an old problem?

References

[BKC95] - Bologna KDD Center - http://open.cineca.it/datamining
[CDAR98] - S.Chakrabarti, B.Dom, R.Agrawal, P.Raghavan, 1998 - Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies - The VLDB Journal (1998) 7:163-178
[CFZ98] - S.Castaldo, M.R.Fallarino, A.Zanasi, 1998 - La gestione delle risorse customer based mediante la fidelity card - Trade Marketing - Franco Angeli
[DGS99] - Doerre, Gerstl, Seiffert, 1999 - Text Mining: Finding Nuggets in Mountains of Textual Data - KDD99 Proceedings - ACM
[EINS98] - http://www.eins.org
[Fa99] - Liam Fahey, 1999 - Competitors - John Wiley & Sons
[GG98] - C.Gouttas, J.M.Granier, 1998 - The Death of Lady D - A Semiotic and Psycho-sociological Understanding of the Funeral Eulogy - ESOMAR Seminar on the Internet and Market Research
[GG96] - C.Gouttas, J.M.Granier, 1996 - Les mots de l'Entreprise: Analyse Textuelle Automatique et Semiotique - JADT Rome 96 Seminar
[HQD91] - C.Huot, L.Quoniam, H.Dou, 1991 - A new Method for Analyzing Downloaded Data for Strategic Decision - Scientometrics, Vol.22-No.2
[Hu96] - C.Huot, 1996 - IBM Technology Watch - IBM Internal Report
[Ma86] - J.F.Marcotorchino, 1986 - Maximal Associations as a tool for Classification - W.Gaul & M.Schader, North Holland, pp.275-288
[Ma91] - J.F.Marcotorchino, 1991 - L'Analyse Factorielle Relationnelle (partie I et II) - Etude du CEMAP IBM France - N°MAP-003
[Man97] - I.Mani et al., 1997 - Towards Content-Based Browsing of Broadcast News Video - in Intelligent Multimedia Information Retrieval, Ed. AAAI Press
[Pr62] - D.J. de Solla Price, 1962 - "Is Technology Historically Independent of Science? A Study in Statistical Historiography"
[PTZOO] - G.Pedrazzi, R.Turra, A.Zanasi - CRM in an Insurance Company - Data Mining 2000 Proceedings
[We90] - R.P.Weber, 1990 - Basic Content Analysis - SAGE Publications
[Za95] - A.Zanasi, 1995 - Data Mining and Competitive Intelligence through Internet - III NIR-IT-95 (Third Network Information Retrieval Conference Proceedings - Milan - Italy)
[Za97] - A.Zanasi, 1997, in Discovering Data Mining - Prentice Hall
[Za98] - A.Zanasi, 1998 - Competitive Intelligence Thru Data Mining Public Sources - Competitive Intelligence Review - Vol.9(1) - John Wiley & Sons, Inc.
[Za99] - A.Zanasi, 1999, in Competitive Technical Intelligence - John Wiley & Sons

