
  • 8/7/2019 winter project main


    A REPORT ON

St. Francis Institute of Management and Research, Mount Poinsur, S.V.P. Road, Borivali (West)

    Mumbai-400103

    pg. 1


St. Francis Institute of Management and Research,

Mount Poinsur, S.V.P. Road, Borivali (West), Mumbai-400103.

    Winter Project (Information technology studies)

    Report

    Title:

Prepared for Mumbai University in partial fulfillment

of the requirements for the award of the degree of

    MASTER IN MANAGEMENT STUDIES

    SUBMITTED BY

Patel Pramod Rameshchandra, Roll No: 38, Year: 2009-11

Under the Guidance of Prof. Manoj Mathew.


St. Francis Institute of Management and Research

    Certificate Of Merit

This is to certify that the work entered in this project is the work of an individual,

Mr. Patel Pramod Rameshchandra, Roll No: 38, MMS-II,

who has worked for Semester IV of the year 2010-2011 in the college. Date:


Acknowledgment

I would like to express my sincere gratitude toward the MBA department of the St. Francis Institute of Management and Research for encouraging me in the development of this project. I would like to thank our Director Dr. Thomas Mathew, my internal project guide Prof. Manoj Mathew and our faculty coordinator Prof. Vaishali Kulkarni for all their help and co-operation. Above this, I would not like to miss this precious opportunity to thank Prof. Thomas Mathew, Prof. Sinimole, M.F. Kumbar, Sherli Biju, Mohini Ozarkar and Steve Halge, our librarian, my

friends, Mr. Subandu K. Maity, Mr. Durgesh Tanna, Miss Hiral Shah, Mr. Narinder Singh Kabo, Miss Radhika S. Appaswamy, Miss Payal P. Patel, Miss Bhagyalaxmi Subramaniam, Mrs. Soma L. Joshua, and my parents for helping, guiding and supporting us in all problems.




    Executive Summary

Data mining is a process that uses a variety of data analysis tools to discover knowledge, patterns and relationships in data that may be used to make valid predictions. With the popularity of object-oriented database systems in database applications, it is important to study data mining methods for object-oriented databases. Traditional Database Management Systems (DBMSs) have limitations when handling complex information and user-defined data types, which could be addressed by incorporating object-oriented programming concepts into existing databases. Classification is a well-established data mining task that has been extensively studied in statistics, decision theory and the machine learning literature.

This study focuses on the design of an object-oriented database through the incorporation of object-oriented programming concepts into existing relational databases. In the design of the database, the object-oriented programming concepts of inheritance and polymorphism are employed. The object-oriented database is designed in such a way that the design itself aids efficient data mining. Our main objective is to reduce the implementation overhead and the memory space required for storage when compared to traditional databases.


    Purpose of the study

The purpose of this study is to find an effective way of data mining using an object-oriented database, and to improve CRM using data mining.

    Significance of the study

This work will help provide additional information for the database administrator who is engaged in improving the way data is mined from a data warehouse, as well as in handling data mining effectively. This research work is not intended to replace or duplicate existing work; rather, its outcome can help to complement the work of the Business Analyst.


    Objective of the project

The general objective of this project is to investigate and recommend a suitable way of data mining. The data mining solution proposed in this study could help support a data mining process as well as contribute to building a smooth way of data handling within an organization. In this work, the data mining implementations of other companies were investigated through CRM magazine issues from the current and previous year.

In order to meet the general objective of this project, the following key activities must be carried out:

To study and understand the basic concepts of database, data warehouse and data mining.

To study and understand the object-oriented database.

To design a simple object-oriented database.

To do effective data mining in the designed object-oriented database.

To hit upon an effective, memory-saving way of data mining using an object-oriented database.

To find an effective way of data mining to succeed in CRM.

To build profitable customer relationships with data mining.


    Limitation of the project

This project does not focus on the whole database design; it focuses only on three tables, namely the Customers, Suppliers and Employees tables. In a real scenario there are not only three tables; a database has many tables.


    Need for study

Data mining's roots are traced back along three family lines. The longest of these three lines is classical statistics. Without statistics, there would be no data mining, as statistics are the foundation of most technologies on which data mining is built. Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the very building blocks with which more advanced statistical analyses are underpinned. Certainly, within the heart of today's data mining tools and techniques, classical statistical analysis plays a significant role.

Data mining's second longest family line is artificial intelligence, or AI. This discipline, which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Because this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. AI found a few applications at the very high-end scientific/government markets, but the required supercomputers of the era priced AI out of the reach of virtually everyone else. The notable exceptions were certain AI concepts which were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).


The third family line of data mining is machine learning, which is more accurately described as the union of statistics and AI. While AI was not a commercial success, its techniques were largely co-opted by machine learning. Machine learning, able to take advantage of the ever-improving price/performance ratios offered by computers of the 80s and 90s, found more applications because the entry price was lower than AI's. Machine learning could be considered an evolution of AI, because it blends AI heuristics with advanced statistical analysis. Machine learning attempts to let computer programs learn about the data they study, such that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms to achieve their goals.

Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. It is best described as the union of historical and recent developments in statistics, AI, and machine learning. These techniques are then used together to study data and find previously hidden trends or patterns within it. Data mining is finding increasing acceptance in science and business areas which need to analyze large amounts of data to discover trends that analysts could not otherwise find.


    Methodology

This is primary research. In this project I used the exploratory research technique. This technique is closely related to tracking and is used in qualitative research projects. Exploratory research provides insights into and comprehension of an issue or situation; it should draw definitive conclusions only with extreme caution. The exploratory research technique is used because the problem has not been clearly defined. Secondary data was collected by reviewing magazines and articles.

The Internet was used as a source for most of the material relevant to the issues involved in the study.


    Analysis

    The following are the major activities of this project:

Task I: Literature / Computer Weekly Magazines / Articles Review

To study the significance of having a good object-oriented database design:

Review the literature, computer monthly newspapers, and CRN magazines/articles.

Review other relevant ways of data mining for object-oriented databases.

Task II: Problem Analysis

This is the first and base stage of the project. At this stage, requirement elicitation is conducted. Potential problem areas in designing the database are identified.

Technological, social, and educational elements are identified and examined. Alternatives are explored.

The information and data collected are analyzed.

An object-oriented database design criterion is developed.

An effective way of data mining using the object-oriented database is evaluated.

Task III: Proposed Effective Way of Data Mining Using an Object-Oriented Database

Propose an effective way of data mining in an object-oriented database.


    Introduction to Database

    The Database Management System

A Database Management System is a collection of software tools intended for the efficient storage and retrieval of data in a computer system. Some of the important concepts involved in the design and implementation of a Database Management System are discussed below.

    The Database

A database is an integrated collection of automated data files related to one another in support of a common purpose.

A database is a collection of information that is organized so that it can easily be accessed, managed, and updated. In one view, databases can be classified according to types of content: bibliographic, full-text, numeric, and images.

Each file in a database is made up of data elements: numbers, dates, amounts, quantities, names, addresses and other identifiable items of data.

The smallest component of data in a computer is the bit, a binary element with the values 0 and 1. Bits are used to build bytes, which are used to build data elements. Data files contain records that are made up of data elements, and a database consists of files. Starting from the highest level, the hierarchy is as follows:

1. Database
2. File
3. Record
4. Data element
5. Character (byte)
6. Bit
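This hierarchy can be sketched with ordinary Python data structures; the file and field names below are illustrative only, not taken from the report:

```python
# A minimal sketch of the storage hierarchy: a database holds files,
# a file holds records, and a record holds named data elements.
database = {
    "customers": [                      # file: a collection of records
        {"id": 1, "name": "Asha"},      # record: a set of data elements
        {"id": 2, "name": "Ravi"},
    ],
}

record = database["customers"][0]       # one record from the file
element = record["name"]                # the value of one data element
print(element)                          # -> Asha
```

Each level contains the one below it, down to the characters and bits that encode each element's value.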


    The Data Element

A data element is a place in a file used to store an item of information that is uniquely identifiable by its purpose and contents. A data value is the information stored in a data element. The data element has functional relevance to the application being supported by the database.

    The Data Element Dictionary

A data element dictionary is a table of data elements including at least the names, data types and lengths of every data element in the subject database.

The data element dictionary is central to the application of the database management tools. It forms the basic database schema, or meta-data, which is the description of the database. The DBMS constantly refers to this data element dictionary to interpret the data stored in the database.

    The Data Element Types

Database management systems support a variety of data types. Examples of common data element types are numeric, alphanumeric, character strings, date and time.
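A data element dictionary of the kind described above might be sketched as follows; the element names, types and lengths are invented for illustration:

```python
# Toy data element dictionary: for each element, at least its name,
# data type and length, as the text describes.
data_element_dictionary = {
    "customer_name": {"type": "alphanumeric", "length": 40},
    "order_date":    {"type": "date",         "length": 10},
    "quantity":      {"type": "numeric",      "length": 8},
}

# A DBMS would consult this meta-data to interpret stored values.
for name, spec in data_element_dictionary.items():
    print(f"{name}: {spec['type']}({spec['length']})")
```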


    Files

A database contains a set of files related to one another by a common purpose. A file is a collection of records. The records are alike in format but each record is unique in content; therefore the records in a file have the same data elements but different data element values.

    A file is a set of records where the records have the same dataelements in the same format.

The organization of the file provides functional storage of data related to the purpose of the system that the database supports. Interfile relationships are based on the functional relationships of their purposes.


    Database Schemas

A schema is the expression of the database in terms of the files it stores, the data elements in each file, the key data elements used for record identification, and the relationships between files.

The translation of a schema into a database management software system usually involves using a language to describe the schema to the database management system.

    The Key Data Elements

The primary key data element in a file is the data element used to uniquely describe and locate a desired record. The key can be a combination of more than one data element.

The definition of the file includes the specification of the data element or elements that are the key to the file. A file key logically points to the record that it indexes.
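How a key "logically points to" a record can be illustrated with a simple in-memory index; the field names here are hypothetical:

```python
# A sketch of a file indexed by its primary key: build a map from
# key value to record, so the key locates the record directly.
records = [
    {"emp_id": 101, "name": "Patel"},
    {"emp_id": 102, "name": "Mathew"},
]

index = {r["emp_id"]: r for r in records}        # key -> record
print(index[102]["name"])                        # -> Mathew

# A composite key combines more than one data element:
composite = {(r["emp_id"], r["name"]): r for r in records}
```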


    An Interfile Relationship

In a database, it is possible to relate one file to another in one of the following ways:

One to one

One to many

Many to one

Many to many

In such interfile relationships, the database management system may or may not enforce a form of data integrity called referential integrity.

Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets.

One to one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.

One to many: An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.

Many to one: An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.

Many to many: An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.
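One of these cardinality rules can be checked mechanically on sample data. The sketch below tests the one-to-many constraint that each entity in B has at most one associated entity in A; the sample pairs are invented:

```python
# Sample (A, B) relationship pairs, e.g. (department, employee).
pairs = [("dept1", "emp1"), ("dept1", "emp2"), ("dept2", "emp3")]

def is_one_to_many(pairs):
    """True if each B entity is associated with at most one A entity."""
    parent = {}
    for a, b in pairs:
        if b in parent and parent[b] != a:
            return False          # some B entity has two A parents
        parent[b] = a
    return True

print(is_one_to_many(pairs))      # -> True
```

Swapping the roles of A and B in the same check would test many-to-one instead.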


    The Data Models

The data in a database may be organized according to three principal models:

Hierarchical Data Model: The relationships between the files form a hierarchy.

Network Data Model: This model is similar to the hierarchical model except that a file can have multiple parents.

Relational Data Model: Here, the files have no parents and no children; files are unrelated. The relationships are explicitly defined by the user and maintained internally by the database.

    The Data Definition Language

The format of the database and the format of the tables must be specified in a form that the computer can translate into the actual physical storage characteristics of the data. The Data Definition Language (DDL) is used for such a specification.

    {CREATE, ALTER, DROP}

    The Data Manipulation Language

While the Data Definition Language is used to describe the database to the DBMS, there is also a need for a corresponding language for programs to use to communicate with the DBMS. Such a language is called the Data Manipulation Language (DML). The DDL describes the records to the application programs, and the DML provides an interface to the DBMS; the first uses the record format, and the second uses external function calls.

    {SELECT, INSERT, UPDATE, DELETE}
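The division of labor between DDL and DML can be illustrated with SQLite, which ships in Python's standard library; the table and column names are invented for this example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# DDL: describe the table's format to the DBMS.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# DML: manipulate the stored records.
cur.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Asha"))
cur.execute("UPDATE customers SET name = ? WHERE id = ?", ("Asha P.", 1))
row = cur.execute("SELECT name FROM customers WHERE id = 1").fetchone()
print(row[0])                     # -> Asha P.
con.close()
```

The SELECT statement at the end is also an example of the query language discussed next: a retrieval command that the DBMS interprets and processes.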


    The Query Language

The Query Language is used primarily for retrieving data stored in a database. The data is retrieved by issuing query commands to the DBMS, which in turn interprets and appropriately processes them.

    Figure 1: The Database System


Introduction to Data Warehouse and Data Mining

    The Data Warehouse

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. The term was coined by W. H. Inmon. IBM sometimes uses the term "information warehouse."

"A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context."

- Barry Devlin

Typically, a data warehouse is housed on an enterprise mainframe server. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized in the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point of view of the end user or knowledge worker who may need access to specialized, sometimes local databases. The latter idea is known as the data mart.

    Applications of data warehouses include data mining, Web Mining,and decision support systems (DSS).


    The Data Mining

Data mining is sorting through data to identify patterns and establish relationships.

It means looking for hidden patterns and trends in data that are not immediately apparent from summarizing the data.

Data mining parameters include:

Association: Looking for patterns where one event is connected to another event.

Sequence or path analysis: Looking for patterns where one event leads to another, later event.

Classification: Looking for new patterns (may result in a change in the way the data is organized).

Clustering: Finding and visually documenting groups of facts not previously known.

Forecasting: Discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics).
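The association parameter, for instance, can be approximated by counting how often items occur together in transactions. This is only a toy sketch with invented data, not a full association rule miner:

```python
from collections import Counter
from itertools import combinations

# Invented transactions: each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
]

# Count co-occurrences of every item pair across all transactions.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts[("bread", "milk")])   # -> 2
```

Pairs with high counts suggest that one purchase event is "connected to" the other, which is the intuition behind association mining.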

Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behavior.


We are in an age often referred to as the information age. In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., organizations have been collecting tremendous amounts of information. Initially, with the advent of computers and the means for mass digital storage, organizations started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos led to the creation of structured databases and Database Management Systems (DBMS).

Efficient Database Management Systems have been very important assets for the management of a large corpus of data, and especially for the effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of Database Management Systems has also contributed to the recent massive gathering of all sorts of information. Today, organizations have far more information than they can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making.

Confronted with huge collections of data, organizations now have new needs to help them make better managerial choices: automatic summarization of data, extraction of the essence of the information stored, and the discovery of patterns in raw data.


What kind of information is being collected?

Organizations have been collecting a myriad of data, from simple numerical measurements and text documents to more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a non-exclusive list of the variety of information collected in digital form in databases and in flat files.

Business Transactions: Every transaction in the business industry is (often) memorized for perpetuity. Such transactions are usually time-related and can be inter-business deals such as purchases, exchanges, banking, stock, etc., or intra-business operations such as the management of in-house wares and assets.

Large department stores, for example, thanks to the widespread use of bar codes, store millions of transactions daily, often representing terabytes of data. Storage space is not the major problem, as the price of hard disks is continuously dropping; the effective use of the data within a reasonable time frame for competitive decision-making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.

Scientific Data: Whether in a Swiss nuclear accelerator laboratory counting particles, in a Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about oceanic activity, or in an American university investigating human psychology, our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can capture and store new data faster than we can analyze the old data already accumulated.


Medical and Personal Data: From government censuses to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. Governments, companies and organizations such as hospitals are stockpiling very important quantities of personal data to help them manage human resources, better understand a market, or simply assist clientele. Regardless of the privacy issues this type of data often raises, this information is collected, used and even shared. When correlated with other data, this information can shed light on customer behavior and the like.

Surveillance Video and Pictures: With the amazing collapse of video camera prices, video cameras are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled, and thus the content is lost. However, there is a tendency today to store the tapes and even digitize them for future use and analysis.

Satellite Sensing: There is a countless number of satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface. NASA, which controls a large number of satellites, receives more data every second than all NASA researchers and engineers can cope with. Many satellite pictures and data are made public as soon as they are received, in the hope that other researchers can analyze them.

Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing laps, to swimming times, boxers' pushes and chess positions, all the data are stored. Commentators and journalists use this information for reporting, but trainers and athletes want to exploit this data to improve performance and better understand opponents.


Digital Media: The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the causes of the explosion in digital media repositories. In addition, many radio stations, television channels and film studios are digitizing their audio and video collections to improve the management of their multimedia assets.

CAD and Software Engineering Data: There is a multitude of Computer Assisted Design (CAD) systems for architects to design buildings or for engineers to conceive system components or circuits. These systems generate a tremendous amount of data. Moreover, software engineering is a source of considerable similar data, with code, function libraries, objects, etc., which need powerful tools for management and maintenance.

Virtual Worlds: There are many applications making use of three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, these virtual spaces are described in such a way that they can share objects and places. There is a remarkable number of virtual reality object and space repositories available. Management of these repositories, as well as content-based search and retrieval from them, are still research issues, while the size of the collections continues to grow.

Text Reports and Memos (E-mail Messages): Most of the communications within and between companies, research organizations, or even private people are based on reports and memos in textual form, often exchanged by e-mail. These messages are regularly stored in digital form for future use and reference, creating formidable digital libraries.


The World Wide Web Repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristics, and its frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference, because of the broad variety of topics covered and the infinite contributions of resources and publishers. Many believe that the World Wide Web will become the compilation of human knowledge.


    What are Data Mining and Knowledge Discovery?

With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and perhaps interpretation of such data, and for the extraction of interesting knowledge that could help in decision-making.

Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process.


The following Figure 2 shows data mining as a step in an iterative knowledge discovery process.

Figure 2: Data Mining is the core of the Knowledge Discovery Process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.


The iterative process consists of the following steps:

Data Cleaning: Also known as data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.

Data Integration: At this stage, multiple data sources, often heterogeneous, may be combined into a common source.

Data Selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.

Data Transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.

Data Mining: This is the crucial step in which clever techniques are applied to extract potentially useful patterns.

Pattern Evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.

Knowledge Representation: This is the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
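The steps above can be sketched as a toy pipeline; the functions and sample data are illustrative placeholders, not a real mining system:

```python
# Invented raw collection with noise (None, non-numeric strings).
raw = [" 5 ", "7", None, "bad", "7"]

def clean(data):
    """Data cleaning: drop noisy and irrelevant entries."""
    return [x for x in data if isinstance(x, str) and x.strip().isdigit()]

def transform(data):
    """Data transformation: consolidate into a form fit for mining."""
    return [int(x) for x in data]

def mine(data):
    """'Mining': extract a trivial pattern (the most frequent value)."""
    return max(set(data), key=data.count)

pattern = mine(transform(clean(raw)))
print(pattern)                           # -> 7
```

In a real KDD process, each stage would be iterated and refined as the evaluation of the discovered patterns suggests.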

It is common to combine some of these steps.

For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.


KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to pinpoint exactly where the values reside. The term is, however, a misnomer: since mining for gold in rocks is usually called gold mining and not rock mining, data mining should, by analogy, have been called knowledge mining instead. Nevertheless, data mining became the accepted customary term, and very rapidly overshadowed even more general terms such as knowledge discovery in databases (KDD) that describe a more complete process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.

    What kind of Data can be mined?

In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly.

Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases; data warehouses; transactional databases; unstructured and semi-structured repositories such as the World Wide Web; advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases; and even flat files. Here are some examples in more detail:

Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.

The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to count the videos grouped by category would be:

SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;

Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL can provide, such as predicting, comparing and detecting deviations.

Data Warehouses: A data warehouse, as the name suggests, is a repository of data collected from multiple data sources (often heterogeneous) that is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof.

Let us suppose that VideoStore becomes a franchise in New York. Many video stores belonging to the VideoStore company may have different databases and different structures. If the executives of the company want to access the data from all stores for strategic decision-making, future direction, marketing, etc., it would be more appropriate to store all the data in one site with a homogeneous structure that allows interactive analysis.

In other words, data from the different stores would be loaded, cleaned, transformed and integrated together. To facilitate decision-making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure. Figure 4 shows an example of a three-dimensional subset of a data cube structure used for the VideoStore data warehouse.

Figure 4: A multi-dimensional data cube structure commonly used in data warehousing

The figure shows summarized rentals grouped by film categories, then a cross table of summarized rentals by film categories and time (in quarters). The data cube gives the summarized rentals along three dimensions: category, time, and city. A cube contains cells that store values of some aggregate measures (in this case rental counts), and special cells that store summations along dimensions. Each dimension of the data cube contains a hierarchy of values for one attribute.
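The summations along dimensions can be illustrated with a small sketch. The dimension values and rental counts below are invented for the example, not taken from Figure 4.

```python
# Toy data cube cells keyed by (category, quarter, city); counts are invented.
cells = {
    ("drama", "Q1", "NY"): 10, ("drama", "Q2", "NY"): 15,
    ("action", "Q1", "NY"): 20, ("action", "Q1", "LA"): 5,
}

def summarize(cells, keep):
    """Sum the measure over the dimensions NOT listed in `keep`.

    `keep` holds dimension indices: 0 = category, 1 = quarter, 2 = city.
    This mimics the cube's special cells that store summations along
    dimensions.
    """
    totals = {}
    for key, count in cells.items():
        reduced = tuple(key[i] for i in keep)
        totals[reduced] = totals.get(reduced, 0) + count
    return totals

by_category = summarize(cells, keep=(0,))
print(by_category)  # {('drama',): 25, ('action',): 25}
```

Calling `summarize(cells, keep=(0, 1))` would produce the category-by-quarter cross table in the same way.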

Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files there could also be descriptive data for the items.

For example, in the case of the video store, the rentals table shown in Figure 6 represents the transaction database. Each record is a rental contract with a customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.).

Since relational databases do not allow nested tables (i.e. a set as an attribute value), transactions are usually stored in flat files or in two normalized transaction tables, one for the transactions and one for the transaction items. One typical data mining analysis on such data is the so-called market basket analysis, or association rules, in which associations between items occurring together or in sequence are studied.

Figure 6: Fragment of a transaction database for the rentals at VideoStore
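A minimal sketch of the pair counting at the heart of market basket analysis follows; the item names and the support threshold are assumptions for illustration, not data from Figure 6.

```python
from collections import Counter
from itertools import combinations

# Toy transaction database: each record is the set of items rented together.
transactions = [
    {"tape_A", "tape_B", "game_X"},
    {"tape_A", "tape_B"},
    {"tape_B", "game_X"},
]

# Count how often each pair of items occurs together across transactions.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Keep pairs occurring in at least two transactions (a minimum support of 2).
frequent = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent)  # {('game_X', 'tape_B'): 2, ('tape_A', 'tape_B'): 2}
```

Frequent pairs like these are the raw material from which association rules (e.g. "renters of tape_A also take tape_B") are derived.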

Multimedia Databases: Multimedia databases include video, image, audio and text media. Multimedia databases can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia data is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.

Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical information like maps, and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.

Figure 7: Visualization of spatial OLAP (from the GeoMiner system)

Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolutions of different variables, as well as the prediction of trends and movements of the variables in time. Figure 8 shows some examples of time-series data.

Figure 8: Examples of Time-Series Data (Source: Thompson Investors Group)
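A simple moving average is one elementary way to study a trend in such time-series data. The price series below is invented for illustration, not drawn from Figure 8.

```python
def moving_average(series, window):
    """Simple trend estimate: mean of each sliding window of the series."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Toy "stock price" series (values are assumptions, not real market data).
prices = [10, 12, 11, 13, 15, 14]
print(moving_average(prices, window=3))  # [11.0, 12.0, 13.0, 14.0]
```

The smoothed sequence rises steadily, revealing the upward trend hidden by the day-to-day fluctuations in the raw values.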

World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data on the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web comprises three major components: the Content of the Web, which encompasses the documents available; the Structure of the Web, which covers the hyperlinks and the relationships between documents; and the Usage of the Web, describing how and when the resources are accessed. A fourth dimension can be added, relating to the dynamic nature or evolution of the documents. Data mining on the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.

    What can be discovered?

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on inference from the available data.

The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

For example, one may want to characterize the VideoStore customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing a summarization of the data, simple OLAP operations fit the purpose of data characterization.

Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account is lower than 5. The techniques used for data discrimination are very similar to those used for data characterization, with the exception that data discrimination results include comparative measures.

Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.

For example, after starting a credit policy, the VideoStore managers could analyze the customers' behaviors vis-à-vis their credit, and accordingly label the customers who received credit with three possible labels: safe, risky and very risky. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
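A minimal sketch of this idea follows, using a nearest-neighbour rule as a stand-in for whatever classification algorithm the store might actually choose; the features, training examples and labels are all invented for illustration.

```python
def nearest_neighbour(train, query):
    """Classify `query` with the label of the closest training example.

    train: list of (features, label) pairs; features are numeric tuples.
    A stand-in for any classification algorithm learned from a training set.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda example: dist(example[0], query))[1]

# Toy training set: (late_returns, rentals_per_year) -> credit label.
training = [
    ((0, 40), "safe"),
    ((3, 20), "risky"),
    ((8, 5), "very risky"),
]
print(nearest_neighbour(training, (1, 35)))  # safe
```

A new credit request is accepted or rejected according to the label the model predicts for the applicant's attribute values.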

Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification: once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.

Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
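The intra-class/inter-class similarity principle can be sketched with a minimal one-dimensional k-means; the rental counts and the starting centers are assumptions for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Minimal 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(c - v))
            clusters[nearest].append(v)
        centers = [sum(vs) / len(vs) for vs in clusters.values() if vs]
    return sorted(centers)

# Toy yearly rental counts; no class labels are given in advance.
rentals = [2, 3, 4, 28, 30, 32]
print(kmeans_1d(rentals, centers=[0.0, 10.0]))  # [3.0, 30.0]
```

With no labels supplied, the algorithm still discovers the two natural groups (occasional and heavy renters) purely from similarity.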

Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, outliers are often very important to identify. While outliers can be considered noise and discarded in some applications, outlier analysis can reveal important knowledge in other domains, and thus outliers can be very significant and their analysis valuable.

Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kinds of patterns an organization can discover, or needs to discover, from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.

    Is all that is Discovered Interesting and Useful?

Data mining allows the discovery of knowledge that is potentially useful and unknown. Whether the knowledge discovered is new, useful or interesting is very subjective and depends upon the application and the user. It is certain that data mining can generate, or discover, a very large number of patterns or rules.

In some cases the number of rules can reach the millions. One can even think of a meta-mining phase to mine the oversized data mining results. To reduce the number of discovered patterns or rules that have a high probability of being non-interesting, one has to put a measurement on the patterns. However, this raises the problem of completeness. The user would want to discover all rules or patterns, but only those that are interesting. The measurement of how interesting a discovery is, often called interestingness, can be based on quantifiable objective elements, such as the validity of the patterns when tested on new data with some degree of certainty, or on subjective depictions such as the understandability, novelty or usefulness of the patterns.

Discovered patterns can also be found interesting if they confirm or validate a hypothesis sought to be confirmed, or unexpectedly contradict a common belief. This brings up the issue of describing what is interesting to discover, such as meta-rule-guided discovery, which describes forms of rules before the discovery process, and interestingness refinement languages, which interactively query the results for interesting patterns after the discovery phase. Typically, measurements for interestingness are based on thresholds set by the user. These thresholds define the completeness of the patterns discovered.

Identifying and measuring the interestingness of patterns and rules discovered, or to be discovered, is essential for the evaluation of the mined knowledge and the KDD process as a whole. While some concrete measurements exist, assessing the interestingness of discovered knowledge is still an important research issue.

    How do we Categorize Data Mining Systems?

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following:

Classification according to the type of data source mined:

This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, the World Wide Web, etc.

Classification according to the data model drawn on:

This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.

Classification according to the kind of knowledge discovered:

This classification categorizes data mining systems based on the kind of knowledge discovered, or the data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

    What are the Issues in Data Mining?

Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

    Security and Social Issues:

Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

    Mining Methodology Issues:

These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices.

For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.

More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This curse affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
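The exponential growth can be seen with a one-line calculation: covering each dimension with a fixed number of intervals multiplies the number of cells by that factor for every added dimension.

```python
def grid_cells(dimensions, bins_per_dim=10):
    """Cells needed to cover the domain at a fixed resolution per dimension.

    The count grows exponentially with the number of dimensions, which is
    the curse of dimensionality in its simplest form.
    """
    return bins_per_dim ** dimensions

for d in (1, 2, 3, 10):
    print(d, "dimensions ->", grid_cells(d), "cells")
# Ten dimensions already require 10,000,000,000 cells at only 10 bins each.
```

Any method that tries to cover or search the space cell by cell therefore becomes impractical after only a handful of dimensions.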

    Performance Issues:

Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available, without having to re-analyze the complete dataset.

Variables: Calculate additional (derived) fields. This is fairly easy: business analysts can multiply, subtract, divide and add numbers, but the result should have some business meaning.

    Find additional information, inside or outside the company.

    Find the best algorithm

It is tempting to state that for each problem there is probably one best algorithm, so all data miners have to do is try a handful of really different algorithms to find out which one is best for the problem. In practice, different data miners will use the same algorithm differently, according to their taste, experience, mood and preference.

So find out which algorithm works best for the data miner and the business problem at hand.

    Zoom in on the business targets

When data miners want to use a data mining model to select the customers who are most likely to buy the business's outstanding product XYZ, it is reasonable to use the business's past buyers of XYZ as the positive targets in the model. The data miner gets a model with an excellent lift and uses it for a mailing.

When the mailing campaign is over, the data miner has all the data the company needs to create a new, better model for product XYZ. The targets are now the past buyers of XYZ who responded to the business's mailing. With this new model, the data miner will take into account not only the customers' natural propensity to buy, but also their willingness to respond to a mailing.

If the databases contain far more observations than the data mining tool can handle, the only thing the data miner can do is use samples. Calculate the model, and it can be used. But the data miner can push it a bit further: use the model to score the entire customer base, and then zoom in on the customers with the best scores, say the top 10%. Use them to calculate a new, second model which will use the far subtler differences in customer information to find the really promising ones.

    Make it simple

Nevertheless, the data miner has to keep business data mining work as simple as possible, because the business that pays the bills wants the data miner to deliver good models, on time for its campaigns.

    Automate as much as possible

The data miner should not try out every possible algorithm in each data mining project. If problem A was best solved with algorithm X, then problem B, which is very similar to A, should probably also be tackled with algorithm X. There is no need to waste time checking out other algorithms.

    Introduction to Object-Oriented Database

In the modern computing world, the amount of data generated and stored in the databases of organizations is vast and continues to grow at a rapid pace. The data stored in these databases possess valuable hidden knowledge. The discovery of such knowledge can be very fruitful for taking effective decisions. Thus the need for developing methods for extracting knowledge from data is quite evident. Data mining, a promising approach to knowledge discovery, is the use of pattern recognition technologies with statistical and mathematical techniques for discovering meaningful new correlations, patterns and trends by analyzing large amounts of data stored in repositories. Data mining has made its impact on many applications such as marketing, customer relationship management, engineering, medicine, crime analysis, expert prediction, Web mining, and mobile computing, among others. In general, data mining tasks can be classified into two categories: descriptive mining and predictive mining.

Descriptive Mining is the process of extracting vital characteristics of data from databases. Some descriptive mining techniques are Clustering, Association Rule Mining and Sequential Mining.

Predictive Mining is the process of deriving hidden patterns and trends from data in order to make predictions. The predictive mining techniques consist of a series of tasks, namely Classification, Regression and Deviation Detection.

One of the important tasks of data mining is Data Classification, which is the process of finding a set of models that describe and distinguish data classes or concepts, in order to predict the class of objects whose class label is unknown.

Polymorphism is another important object-oriented programming concept. It is a general term which stands for "many forms". Polymorphism in brief can be defined as "One Interface, Many Implementations". It is the property of being able to assign a different meaning or usage to something in different contexts; in particular, to allow an entity such as a variable, a function, or an object to take more than one form. Polymorphism is distinct from method overloading or method overriding. In the literature, polymorphism can be classified into three different kinds, namely pure, static, and dynamic.

Pure Polymorphism refers to a function which can take parameters of several data types.

Static Polymorphism can be stated as function and operator overloading.

Dynamic Polymorphism is achieved by employing inheritance and virtual functions.

Dynamic binding, or runtime binding, allows one to substitute polymorphic objects for each other at run-time. Polymorphism has a number of advantages. Its chief advantage is that it simplifies the definition of clients, as it allows the client to substitute, at run-time, an instance of one class for an instance of another class that has the same polymorphic interface.
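A minimal sketch of dynamic polymorphism and dynamic binding follows; the Shape, Square and Circle classes are illustrative, not part of the proposed database design.

```python
class Shape:
    """One interface: every shape promises an area() method."""
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):  # overrides Shape.area
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    def area(self):  # same interface, different implementation
        return 3.14159 * self.radius ** 2

# Dynamic binding: which area() runs depends on the calling object,
# decided at run-time, so instances are freely substitutable.
for shape in (Square(2), Circle(1)):
    print(type(shape).__name__, shape.area())
```

The loop is the "client": it is written once against the common interface, and instances of either class can be substituted for each other at run-time.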

    Object-Oriented Database (OODB)

The chief advantage of an Object-Oriented Database (OODB) is its ability to represent real-world concepts as data models in an effective and presentable manner. An OODB is optimized to support object-oriented applications and different types of structures, including trees, composite objects and complex data relationships. The OODB system handles complex databases efficiently, and it allows users to define a database, with features for creating, altering, and dropping tables and establishing constraints. From the user's perception, an OODB is just a collection of objects and the inter-relationships among objects. Objects that resemble each other in properties and behavior are organized into classes. Every class is a container for a set of common attributes and methods shared by similar objects.

The Attributes or Instance Variables define the Properties of a Class.

The Methods describe the Behavior of the Objects associated with the Class.

A Class/Subclass Hierarchy is used to represent Complex Objects, where the Attributes of an Object may themselves contain Complex Objects.

New Approach to the Design of an Object-Oriented Database

The computer literature generally defines three approaches to building an Object-Oriented Database Management System (OODBMS): extending an Object-Oriented Programming Language (OOPL), extending a Relational Database Management System (RDBMS), and starting from scratch.

The First approach develops an OODBMS by extending an Object-Oriented Programming Language (OOPL) with persistent storage, in order to achieve multiple concurrent accesses with transaction support.

The Second is an extended relational approach: an OODBMS is built by extending an existing Relational Database Management System (RDBMS) with object-oriented features such as classes and inheritance, methods and encapsulation, polymorphism and complex objects.

The Third approach aims to revolutionize database technology in the sense that an OODBMS is designed from the ground up, as represented by UniSQL / UniOracle and OpenOODB (Open Object-Oriented Database).

In my design, I have employed the second approach, which extends relational databases by utilizing Object-Oriented Programming (OOP) concepts.

The proposed approach makes use of the Object-Oriented Programming (OOP) concepts of inheritance and polymorphism to design an Object-Oriented Database (OODB) and to perform classification in it. Normally, a database is a collection of tables, so any database is bound to contain a number of tables with common fields. In my approach, I have grouped such common fields together to form a single generalized table. The newly created table resembles the base class in an inheritance hierarchy; this ability to represent classes in a hierarchy is one of the eminent OOP concepts.

Next, I have employed another important object-oriented characteristic, dynamic polymorphism, where different classes have methods of the same name and structure that perform different operations based on the calling object. Polymorphism is specifically employed to achieve classification in a simple and effective manner. The use of these object-oriented concepts for the design of the Object-Oriented Database (OODB) ensures that even complex queries can be answered efficiently; in particular, the data mining task of classification can be achieved effectively.

Let T denote the set of all tables in a database D and let t be a subset of T, where t represents the set of tables that have some fields in common. I then create a generalized table composed of all those common fields from the table set t. To portray the efficiency of the proposed approach, I consider a traditional example: the database of a large business organization will have a number of tables, but to best illustrate the OOP concepts employed in my approach, I have concentrated on three tables, namely Employees, Suppliers and Customers. The tables are represented as Table 1, Table 2 and Table 3 respectively.
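The grouping step described above, collecting the fields common to all the tables into one generalized table, can be sketched as a set intersection. The field lists below are assumed for illustration (the actual tables carry more fields):

```python
# Field sets of the three entity tables (illustrative, not the full schema).
employees = {"Name", "Age", "Gender", "Address", "Title", "HireDate"}
suppliers = {"Name", "Age", "Gender", "Address", "CompanyName"}
customers = {"Name", "Age", "Gender", "Address", "CreditLimit"}

# The common fields form the generalized table (the "Person" of the hierarchy).
common = employees & suppliers & customers

# The table-specific fields stay behind in the specialized tables.
employee_specific = employees - common

print(sorted(common))
print(sorted(employee_specific))
```

The same intersection extends naturally to any table set t: intersect all the field sets, move the result into the generalized table, and leave each table with only its difference from that intersection.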

From the above class structure, it is understood that every table has a set of general or common fields (the highlighted ones) and table-specific fields. Considering the Employees table, it has general fields like Name, Age and Gender, and table-specific fields like Title and HireDate. These general fields occur repeatedly in most tables, which causes redundancy and thereby increases space complexity. Moreover, if a query is given to retrieve a set of records for the whole organization satisfying a particular rule, all the tables may need to be searched separately. This replication of general fields across tables leads to a poor design, which hampers effective data classification. To perform better classification, I have designed an Object-Oriented Database (OODB) by incorporating the inheritance concept of Object-Oriented Programming (OOP).

    Design of the Object-Oriented Database

First in my proposed approach, I have designed an Object-Oriented Database (OODB) by utilizing the inheritance concept of Object-Oriented Programming (OOP), which eliminates the problem of redundancy. I first located all the general or common fields from the table set t. All these common fields are then fetched and stored in a single table, which all the related tables can inherit. The generalized table thus resembles the base class of the OOP paradigm. In my approach, I have created a new table called Person, which contains all the common fields, and the other tables such as Employees and Customers inherit the Person table without redefining those fields.

Here, I have used two important mechanisms, namely generalization and composition. Generalization depicts an is-a relation and composition represents a has-a relation. Both relationships are best illustrated as follows: the generalized table Person contains all the common fields, and the tables Employees, Suppliers and Customers inheriting the table Person are said to have an is-a relationship with it, i.e., an Employee is a Person, a Supplier is a Person and a Customer is a Person. Similarly, to exemplify the composition relation, the table Person contains an object reference to the Places table as one of its fields; the table Person is then said to have a has-a relationship with the table Places, i.e., a Person has a Place and, similarly, a Place has a PostalCode. Figure 10 represents the inheritance class hierarchy of the proposed Object-Oriented Database (OODB) design. In the pictured design, the small triangle (△) represents the is-a relationship and the arrow (→) represents the has-a relationship.

The generalized table Person is considered as the base class Person, and its fields are considered as attributes of the base class. The base class Person, which contains all the common attributes, is therefore inherited by the other classes, namely Employees, Suppliers and Customers, which contain only their specialized attributes.

Moreover, inheritance allows me to define generalized methods in the base class and specialized methods in the subclasses. For example, if there is a need to get the contact numbers of all the people associated with the organization, one can define a method getContactNumbers() in the base class Person, and it is shared by its subclasses. In addition, the generalized class Person exhibits a composition relationship with two other classes, Places and PostalCodes: the class Person uses instance variables that are object references to the classes Places and PostalCodes. The tables in the proposed OODB design are shown in the tables below.
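The inheritance and composition relationships just described can be sketched as classes. This is an illustrative sketch rather than the actual implementation: the class names follow the text, while the constructor signatures and sample values are my assumptions.

```python
class PostalCode:
    def __init__(self, code):
        self.code = code

class Place:
    # Composition: a Place has-a PostalCode.
    def __init__(self, city, postal_code):
        self.city = city
        self.postal_code = postal_code

class Person:
    # Base class holding the common fields; a Person has-a Place.
    def __init__(self, name, age, gender, contact_number, place):
        self.name = name
        self.age = age
        self.gender = gender
        self.contact_number = contact_number
        self.place = place

    def get_contact_number(self):
        # Generalized method, inherited unchanged by every subclass.
        return self.contact_number

class Employee(Person):
    # is-a relationship: an Employee is a Person, adding only its
    # specialized attribute(s).
    def __init__(self, name, age, gender, contact_number, place, title):
        super().__init__(name, age, gender, contact_number, place)
        self.title = title

e = Employee("Asha", 30, "F", "98200-00000",
             Place("Mumbai", PostalCode("400103")), "Analyst")
print(e.get_contact_number())    # inherited from Person
print(e.place.postal_code.code)  # reached through composition
```

Suppliers and Customers would subclass Person in exactly the same way, each adding only its own fields.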

    Table 7: Example of Extended Customers Table

    Table 9: Example of Extended PostalCodes Table

Owing to the incorporation of the inheritance concept in the proposed design, the database designer can extend the database by effortlessly adding new tables, merely by inheriting the common fields from the generalized table.
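One way this extensibility can be expressed in SQL is sketched below, using SQLite purely for illustration (the project itself uses ORACLE; the column names and the person_id reference are my assumptions). The common fields live once in Persons, and each specialized table keeps only its own fields plus a reference to the base row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Persons (
    person_id INTEGER PRIMARY KEY,
    name TEXT, age INTEGER, gender TEXT
);
CREATE TABLE Employees (
    person_id INTEGER REFERENCES Persons(person_id),
    title TEXT, hire_date TEXT
);
""")
con.execute("INSERT INTO Persons VALUES (1, 'Asha', 30, 'F')")
con.execute("INSERT INTO Employees VALUES (1, 'Analyst', '2010-06-01')")

# Extending the database later needs only the entity-specific fields;
# the common fields are "inherited" through the reference to Persons.
con.execute("""CREATE TABLE Suppliers (
    person_id INTEGER REFERENCES Persons(person_id),
    company TEXT
)""")

row = con.execute("""SELECT p.name, e.title
                     FROM Persons p JOIN Employees e USING (person_id)
                  """).fetchone()
print(row)  # ('Asha', 'Analyst')
```

A join over person_id reassembles the full record on demand, so no common field is ever stored twice.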

Data Mining in the Designed Object-Oriented Database

Dynamic polymorphism, or late binding, allows the programmer to define methods with the same name in different classes; the method to be called is decided at runtime based on the calling object. This Object-Oriented Programming (OOP) concept, together with simple SQL/Oracle queries, can be used to perform classification in the designed Object-Oriented Database (OODB). Here, a single method can carry out the classification process for all the tables. The uniqueness of this concept is that the classification process can be performed using simple SQL/Oracle queries, while existing classification approaches for Object-Oriented Databases employ complex techniques such as decision trees, neural networks, nearest-neighbour methods and more. The database administrator can also invoke the method for individual entities, namely Employees, Suppliers and Customers. By integrating the polymorphism concept, the code is simpler to write and easier to manage. As a result of the designed OODB, the task of classification can be carried out effectively using simple SQL/Oracle queries. Thus, by incorporating the Object-Oriented Programming (OOP) concepts in the design of the Object-Oriented Database (OODB), my approach exploits the maximum advantages of OOP, and the task of classification is performed more effectively.
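The late-binding mechanism described above can be sketched as follows. The classify() method name and the classification rules are illustrative assumptions: each subclass overrides the same method, and the version that runs is chosen at runtime by the calling object, so one call site classifies every entity type.

```python
class Person:
    def __init__(self, name, age):
        self.name, self.age = name, age

    def classify(self):
        return "person"

class Employee(Person):
    def classify(self):
        # A simple age rule, as one SQL WHERE clause might express it.
        return "senior-staff" if self.age >= 40 else "staff"

class Customer(Person):
    def __init__(self, name, age, purchases):
        super().__init__(name, age)
        self.purchases = purchases

    def classify(self):
        return "premium" if self.purchases > 100 else "regular"

# One loop, one method name; the runtime picks the right version.
records = [Employee("Ravi", 45), Customer("Meena", 28, purchases=150)]
print([r.classify() for r in records])
```

In the database setting, each subclass's rule would correspond to a simple query over its own table, while the shared method name gives the administrator a single entry point.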

    Implementation and Results

In this section, I present the experimental results of my approach. The proposed approach for the design of the Object-Oriented Database (OODB) and for classification has been implemented with ORACLE as the database. I have considered only three tables for experimentation, but in general an organization may have a large number of tables to manage, each holding an enormous number of records. The incorporation of the Object-Oriented Programming (OOP) concepts into such databases greatly reduces the implementation overhead incurred. Moreover, the memory space occupied is reduced to a great extent as the size of the tables increases. These are some of the eminent benefits of the proposed approach. I have performed a comparative analysis, through a review of Computer Reseller News (CRN) magazines and COMPUTER monthly newspaper, of the space utilized before and after generalization of the tables, and thus computed the saved memory space. The comparison is performed with varying numbers of records in the tables, namely 1000, 2000, 3000, 4000 and 5000, and the results are stated below in Table 10, Table 11, Table 12, Table 13 and Table 14 respectively.

                  ------------- Normalized -------------   --------- Un-Normalized ---------
    Table         Fields  Records  Total     Memory        Fields  Total     Memory
                                   Records   (bytes)               Records   (bytes)
 1  Customers        4     1000      4000      40000         15     15000     150000
 2  Employees        5     1000      5000      50000         16     16000     160000
 3  Suppliers        5     1000      5000      50000         16     16000     160000
 4  Persons          8     3000     24000     240000
 5  Places           3      500      1500      15000
 6  Postalcodes      4      250      1000      10000
    Total                           40500     405000                47000     470000

Saved Memory (KB): 63.4766
Table 10: Saved Memory Table
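The totals above can be reproduced with a short calculation. The convention inferred from the table is that total records = fields × records and memory = total records × 10 bytes; the field and record counts come straight from the rows of the table:

```python
# Table 10 inputs (1000 records per entity table): table -> (fields, records)
normalized = {
    "Customers":   (4, 1000),
    "Employees":   (5, 1000),
    "Suppliers":   (5, 1000),
    "Persons":     (8, 3000),
    "Places":      (3, 500),
    "Postalcodes": (4, 250),
}
unnormalized = {
    "Customers": (15, 1000),
    "Employees": (16, 1000),
    "Suppliers": (16, 1000),
}

def memory_bytes(tables):
    # total records = fields * records; memory = total records * 10 bytes
    return sum(fields * records * 10 for fields, records in tables.values())

saved_kb = (memory_bytes(unnormalized) - memory_bytes(normalized)) / 1024
print(memory_bytes(normalized), memory_bytes(unnormalized), round(saved_kb, 4))
```

The same arithmetic reproduces the saved-memory figures quoted with the later tables (126.9531 KB, 190.4297 KB and 253.9063 KB for 2000, 3000 and 4000 records).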

                  ------------- Normalized -------------   --------- Un-Normalized ---------
    Table         Fields  Records  Total     Memory        Fields  Total     Memory
                                   Records   (bytes)               Records   (bytes)
 1  Customers        4     2000      8000      80000         15     30000     300000
 2  Employees        5     2000     10000     100000         16     32000     320000
 3  Suppliers        5     2000     10000     100000         16     32000     320000
 4  Persons          8     6000     48000     480000
 5  Places           3     1000      3000      30000
 6  Postal codes     4      500      2000      20000
    Total                           81000     810000                94000     940000

Saved Memory (KB): 126.9531
Table 11: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                  ------------- Normalized -------------   --------- Un-Normalized ---------
    Table         Fields  Records  Total     Memory        Fields  Total     Memory
                                   Records   (bytes)               Records   (bytes)
 1  Customers        4     3000     12000     120000         15     45000     450000
 2  Employees        5     3000     15000     150000         16     48000     480000
 3  Suppliers        5     3000     15000     150000         16     48000     480000
 4  Persons          8     9000     72000     720000
 5  Places           3     1500      4500      45000
 6  Postal codes     4      750      3000      30000
    Total                          121500    1215000               141000    1410000

Saved Memory (KB): 190.4297
Table 12: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                  ------------- Normalized -------------   --------- Un-Normalized ---------
    Table         Fields  Records  Total     Memory        Fields  Total     Memory
                                   Records   (bytes)               Records   (bytes)
 1  Customers        4     4000     16000     160000         15     60000     600000
 2  Employees        5     4000     20000     200000         16     64000     640000
 3  Suppliers        5     4000     20000     200000         16     64000     640000
 4  Persons          8    12000     96000     960000
 5  Places           3     2000      6000      60000
 6  Postal codes     4     1000      4000      40000
    Total                          162000    1620000               188000    1880000

Saved Memory (KB): 253.9063
Table 13: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

The results of the comparative analysis show that the saved memory space increases as the number of records in each table increases. The graphical representation of the results is illustrated in Figure 11, and this trend is clear from the graph.

Figure 11: Graph Demonstrating the above Evaluation Results

Moreover, in the proposed approach I have placed the common methods in the generalized class and the entity-specific methods in the subclasses. Because of this design, a considerable amount of memory space is saved.

Building Profitable Customer Relationships with Data Mining

Once an organization has built a customer information and marketing data warehouse, how can it make good use of the data it contains?

Customer Relationship Management (CRM) helps companies improve the profitability of their interactions with customers while at the same time making those interactions appear friendlier through individualization. To succeed with CRM, companies need to match products and campaigns to prospects and customers; in other words, to intelligently manage the customer life cycle. Until recently, most CRM software has focused on simplifying the organization and management of customer information. Such software, called operational CRM, has focused on creating a customer database that presents a consistent picture of the customer's relationship with the company, and on providing that information in the specific applications, such as sales force automation and customer service, in which the company touches the customer. However, the sheer volume of customer information and increasingly complex interactions with customers have propelled data mining to the forefront of making customer relationships profitable. Data mining is a process that uses a variety of data analysis and modeling techniques to discover patterns and relationships in data that may be used to make accurate predictions. It can help the organization select the right prospects on whom to focus, offer the right additional products to its existing customers, and identify good customers who may be about to leave. The result is improved revenue, because of a greatly improved ability to respond to each individual contact in the best way, and reduced costs, due to properly allocating business resources. CRM applications that use data mining are called analytic CRM.

This section of the project describes the various aspects of analytic CRM and shows how it is used to manage the customer life cycle more cost-effectively. The case histories of the fictional companies mentioned are composites of real-life data mining applications.

    Data Mining in Customer Relationship Management

The first and simplest analytical step in data mining is to "describe the data": for example, summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look at the distribution of values of the fields in the organization's data.

But data description alone cannot provide an action plan. An organization must "build a predictive model" based on patterns determined from known results, and then test that model on results outside the original sample. A good model should never be confused with reality (a businessman knows a road map isn't a perfect representation of the actual road), but it can be a useful guide to understanding the business.

Data mining can be used for both classification and regression problems.

In classification problems, the business analyst predicts what category something will fall into: for example, whether a person will be a good credit risk or not, or which of several offers someone is most likely to accept.

In regression problems, the business analyst predicts a number, such as the probability that a person will respond to an offer.

In CRM, data mining is frequently used to assign a score to a particular customer or prospect, indicating the likelihood that the individual will behave in the way the business wants. For example, a score could measure the propensity to respond to a particular offer or to switch to a competitor's product. Data mining is also frequently used to identify a set of characteristics (called a profile) that segments customers into groups with similar behaviors, such as buying a particular product.
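A propensity score of this kind can be sketched as a logistic function over a weighted sum of customer attributes. The weights, bias and features below are invented for illustration; they stand in for a fitted model, not an actual one:

```python
import math

def propensity(features, weights, bias):
    # Logistic function over a weighted sum: returns a score between 0 and 1.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# Hypothetical model: past responses raise the score, months since the
# last order lower it.
weights = [0.8, -0.3]
bias = -1.0
customer = [3, 2]   # responded 3 times; last order 2 months ago

score = propensity(customer, weights, bias)
print(round(score, 3))
```

Ranking customers by such a score is what lets a campaign target only the prospects most likely to respond.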

A special type of classification can recommend items based on similar interests held by groups of customers. This is sometimes called "collaborative filtering".
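Collaborative filtering can be sketched in miniature: find the customer whose purchase vector is most similar (here, by cosine similarity) to the target's, and recommend what that peer bought that the target has not. The purchase data is invented for illustration:

```python
import math

purchases = {
    "A": {"shirt": 1, "tie": 1, "belt": 1},
    "B": {"shirt": 1, "tie": 1, "shoes": 1},
    "C": {"hat": 1},
}

def cosine(u, v):
    # Cosine similarity between two sparse purchase vectors.
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm

target = "A"
# The most similar other customer...
peer = max((c for c in purchases if c != target),
           key=lambda c: cosine(purchases[target], purchases[c]))
# ...and what they bought that the target has not.
recommendations = set(purchases[peer]) - set(purchases[target])
print(peer, recommendations)
```

Production systems aggregate over many similar customers rather than a single peer, but the similarity-then-difference structure is the same.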

    The data mining technology used for solving Classification,Regression and Collaborative Filtering problems is brieflydescribed in the Appendix at the end of the project.

    Defining CRM

    "Customer Relationship Management" in its broadest sense simplymeans managing all customer interactions. In practice, this requiresusing information about the Business customers and prospects tomore effectively interact with Business customers in all stages ofBusiness relationship with them. I have refer to these stages as thecustomer life cycle.

    The customer life cycle has three stages:

    Acquiring customers

    Increasing the value of the customer

    Retaining good customers

    Data mining can improve Business profitability in each of thesestages through integration with operational CRM systems or asindependent applications.

    Applying Data Mining to CRM

In order to build good models for the business's CRM system, there are a number of steps that must be followed.

The Two Crows data mining process model described below is similar to other process models such as the CRISP-DM model, differing mostly in the emphasis it places on the different steps. Keep in mind that while the steps appear in a list, the data mining process is not linear: the CRM implementor will inevitably need to loop back to previous steps. For example, what the implementor learns in the explore-data step may require adding new data to the data mining database, and the initial models built may provide insights that lead to creating new variables.

    The basic steps of data mining for effective CRM are:

    Define business problem

    Build marketing database

    Explore data

    Prepare data for modeling

    Build model

    Evaluate model

    Deploy model and results

    Define the business problem.

Each CRM application will have one or more business objectives for which the business analyst will need to build the appropriate model. Depending on the specific goal, such as increasing the response rate or increasing the value of a response, the analyst will build a very different model. An effective statement of the problem will include a way of measuring the results of the CRM project.

    Build a Marketing Database.

Steps two through four constitute the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps, as the analyst learns something from the model that suggests modifying the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire data mining process!

The business analyst will need to build a marketing database because the operational databases and corporate data warehouse will often not contain the needed data in the required format. Furthermore, CRM applications may interfere with the speedy and effective execution of these operational systems.

The marketing database will then need to be cleaned up: good models require clean data. The data needed may reside in multiple databases, such as the customer database, product database and transaction databases, which means it must be integrated and consolidated into a single marketing database, reconciling differences in data values from the various sources. Improperly reconciled data is a major source of quality problems. There are often large differences in the way data is defined and used in different databases. Some inconsistencies are easy to uncover, such as different addresses for the same customer; what makes these problems more difficult to resolve is that they are often subtle. For example, the same customer may have different names or, worse, multiple customer identification numbers.
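Reconciling the same customer recorded differently across sources can be sketched as follows. The records, IDs and the simple normalize-and-match rule are illustrative assumptions; real record linkage uses far more robust matching:

```python
# Two sources record the same person under different IDs and spellings.
records = [
    {"id": "C-101", "name": "R. K. Sharma", "city": "Mumbai"},
    {"id": "X-9",   "name": "r k sharma",   "city": "mumbai"},
    {"id": "C-207", "name": "Anita Desai",  "city": "Pune"},
]

def match_key(rec):
    # Normalize the name to alphanumerics and lowercase the city, so
    # cosmetic differences collapse onto one key.
    name = "".join(ch for ch in rec["name"].lower() if ch.isalnum())
    return (name, rec["city"].lower())

merged = {}
for rec in records:
    merged.setdefault(match_key(rec), []).append(rec["id"])

duplicates = [ids for ids in merged.values() if len(ids) > 1]
print(duplicates)
```

Each group of IDs sharing a key is a candidate for consolidation into a single marketing-database record.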

    Explore the data.

Before good predictive models can be built, the business analyst must understand the data. Start by gathering a variety of numerical summaries (including descriptive statistics such as averages, standard deviations and so forth) and looking at the distribution of the data.

The analyst may want to produce cross-tabulations (pivot tables) for multi-dimensional data. Graphing and visualization tools are a vital aid in data preparation, and their importance to effective data analysis cannot be overemphasized; data visualization most often provides the lead that results in new insights and success. Some common and very useful graphical displays of data are histograms and box plots, which show distributions of values. The analyst may also want to look at scatter plots of different pairs of variables in two or three dimensions; the ability to add a third, overlay variable greatly increases the usefulness of some types of graphs.
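The describe-the-data pass can be sketched on a single numeric field (the values are invented): summary statistics plus a coarse bucketing that plays the role of a histogram:

```python
import statistics
from collections import Counter

ages = [23, 25, 31, 38, 38, 41, 44, 52, 58, 60]

print("mean:", statistics.mean(ages))
print("stdev:", round(statistics.stdev(ages), 2))

# Bucket into decades to see the distribution of values.
histogram = Counter((a // 10) * 10 for a in ages)
print(sorted(histogram.items()))
```

In practice the same summaries are run per field, and surprises here (outliers, empty buckets, impossible values) feed back into the data cleaning step.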

Prepare data for modeling.

This is the final data preparation step before building models, and the step where the most art comes in. There are four main parts to this step.

First, the business analyst selects the variables on which to build the model. Ideally, one would take all the available variables, feed them to the data mining tool and let it find the best predictors. In practice, this doesn't work very well. One reason is that the time it takes to build a model increases with the number of variables; another is that blindly including extraneous columns can lead to models with less, rather than more, predictive power.

The next step is to construct new predictors derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio, rather than just debt and income as separate predictor variables, may yield more accurate results that are also easier to understand.
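Constructing such a derived predictor is a one-line transformation; the figures and the 0.4 threshold below are invented for illustration:

```python
applicants = [
    {"name": "P1", "debt": 20000, "income": 100000},
    {"name": "P2", "debt": 45000, "income": 90000},
]

# Derived predictor: one ratio in place of two raw columns.
for a in applicants:
    a["debt_to_income"] = a["debt"] / a["income"]

# An illustrative risk rule on the derived variable.
risky = [a["name"] for a in applicants if a["debt_to_income"] > 0.4]
print(risky)
```

The derived column carries the relationship the model would otherwise have to discover on its own, which is why such features often improve both accuracy and interpretability.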

Next, the analyst may decide to build models on a subset or sample of the data. If there is a lot of data, using all of it may take too long or require buying a bigger computer than one would like. Working with a properly selected random sample usually results in no loss of information for most CRM problems. Given a choice between investigating a few models built on all the data and investigating more models built on a sample, the latter approach will usually yield a more accurate and robust model of the problem.
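Drawing the random sample itself is straightforward; a fixed seed keeps the run repeatable, and the population below is only a stand-in for customer row ids:

```python
import random

random.seed(42)                         # repeatable sampling
population = list(range(100000))        # stand-in for customer row ids
sample = random.sample(population, 1000)  # without replacement

print(len(sample), len(set(sample)))
```

Sampling without replacement guarantees each customer appears at most once, so the sample's class proportions remain unbiased estimates of the population's.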

Last, the analyst will need to transform variables in accordance with the requirements of the algorithm chosen for building the model.

    Data mining model building.

The most important thing to remember about model building is that it is an iterative process. The business analyst will need to explore alternative models to find the one that is most useful in solving the business pro