8/7/2019 winter project main
1/176
A REPORT ON
St. Francis Institute of Management and Research
Mount Poinsur, S.V.P Road, Borivali (West), Mumbai-400103
pg. 1
St. Francis Institute of Management and Research
Mount Poinsur, S.V.P Road, Borivali (West), Mumbai-400103.
Winter Project (Information technology studies)
Report
Title:
Prepared for Mumbai University in partial fulfillment
of the requirements for the award of the degree of
MASTER IN MANAGEMENT STUDIES
SUBMITTED BY
Patel Pramod Rameshchandra
Roll No: 38    Year: 2009-11
Under the Guidance of Prof. Manoj Mathew.
St. Francis Institute of Management and Research
Certificate Of Merit
This is to certify that the work entered in this project is the work of an individual,
Mr. Patel Pramod Rameshchandra, Roll No: 38, MMS-II,
who has worked during Semester IV of the year 2010-2011 in the college.
Date:
Acknowledgment
I would like to express my sincere gratitude toward the MBA department of the St. Francis Institute of Management and Research for encouraging me in the development of this project. I would like to thank our Director Dr. Thomas Mathew, my internal project guide Prof. Manoj Mathew and our faculty coordinator Prof. Vaishali Kulkarni for all their help and co-operation.
Above this, I would not like to miss this precious opportunity to thank Prof. Thomas Mathew, Prof. Sinimole, M.F Kumbar, Sherli Biju, Mohini Ozarkar & Steve Halge, our librarian, and my friends Mr. Subandu K. Maity, Mr. Durgesh Tanna, Miss Hiral Shah, Mr. Narinder Singh Kabo, Miss Radhika S. Appaswamy, Miss Payal P. Patel, Miss Bhagyalaxmi Subramaniam, Mrs. Soma L. Joshua, and my parents for helping, guiding and supporting me in all problems.
Table of Contents
Executive Summary
Data mining is a process that uses a variety of data analysis tools to discover knowledge, patterns and relationships in data that may be used to make valid predictions. With the popularity of object-oriented database systems in database applications, it is important to study data mining methods for object-oriented databases. Traditional Database Management Systems (DBMSs) have limitations when handling complex information and user-defined data types, which could be addressed by incorporating object-oriented programming concepts into existing databases. Classification is a well-established data mining task that has been extensively studied in statistics, decision theory and the machine learning literature. This study focuses on the design of an object-oriented database through the incorporation of object-oriented programming concepts into existing relational databases. In the design of the database, the object-oriented programming concepts of inheritance and polymorphism are employed. The object-oriented database is designed in such a way that the design itself aids efficient data mining. Our main objective is to reduce the implementation overhead and the memory space required for storage when compared to traditional databases.
Purpose of the study
The purpose of this study is to find an effective way of data mining using an object-oriented database and to improve CRM using data mining.
Significance of the study
This work will help provide additional information for the database administrator who is engaged in improving the way data is mined from a data warehouse, and in handling data mining effectively. This research is not intended to replace or duplicate existing work; rather, its outcome can help to complement the work of the business analyst.
Objective of the project
The general objective of this project is to investigate and recommend a suitable way for data mining. The data mining solution proposed in this study could help support a data mining process as well as contribute to building a smooth way of data handling within an organization. In this work, the data mining implementations of other companies were investigated through CRM magazine issues from the current and previous year.
In order to meet the general objective of this project the followingkey activities must be carried out:
To study and understand the basic concepts of database, data warehouse and data mining.
To study and understand the object-oriented database.
To design a simple object-oriented database.
To do effective data mining in the designed object-oriented database.
To hit upon an effective, memory-saving way of data mining using an object-oriented database.
To find an effective way of data mining to succeed in CRM.
To build profitable customer relationships with data mining.
Limitation of the project
This project does not focus on the whole database design; it focuses only on three tables, namely the Customers, Suppliers and Employees tables. In a real scenario there are not only three tables; a database has many more tables.
Need for study
Data mining's roots are traced back along three family lines. The longest of these three lines is classical statistics. Without statistics, there would be no data mining, as statistics are the foundation of most technologies on which data mining is built. Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the very building blocks with which more advanced statistical analyses are underpinned. Certainly, within the heart of today's data mining tools and techniques, classical statistical analysis plays a significant role.
Data mining's second longest family line is artificial intelligence, or AI. This discipline, which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Because this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. AI found a few applications in the very high-end scientific/government markets, but the required supercomputers of the era priced AI out of the reach of virtually everyone else. The notable exceptions were certain AI concepts adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).
The third family line of data mining is machine learning, which is more accurately described as the union of statistics and AI. While AI was not a commercial success, its techniques were largely co-opted by machine learning. Machine learning, able to take advantage of the ever-improving price/performance ratios offered by computers of the 80s and 90s, found more applications because the entry price was lower than that of AI. Machine learning can be considered an evolution of AI, because it blends AI heuristics with advanced statistical analysis. Machine learning attempts to let computer programs learn about the data that business analysts study, so that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms to achieve their goals.
Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. Data mining is best described as the union of historical and recent developments in statistics, AI, and machine learning. These techniques are then used together to study data and find previously hidden trends or patterns within it. Data mining is finding increasing acceptance in science and business areas which need to analyze large amounts of data to discover trends which business analysts could not otherwise find.
Methodology
This is primary research. In this project I used the exploratory research technique. This technique is closely related to tracking and is used in qualitative research projects. Exploratory research provides insights into and comprehension of an issue or situation. It should draw definitive conclusions only with extreme caution. The exploratory research technique is used because the problem has not been clearly defined. The secondary data was collected by reviewing magazines and articles.
The Internet was used as the source of most of the material relevant to the issues involved in the study.
Analysis
The following are the major activities of this project:
Task I: Literature / Computer Weekly Magazines / Articles Review
To study the significance of having a good object-oriented database design.
Review the literature, computer monthly newspapers, CRN magazines and articles.
Review other relevant ways of data mining with an object-oriented database.
Task II: Problem Analysis
This is the first and base stage of the project. At this stage, requirement elicitation is conducted. Potential problem areas in designing the database are identified. Technological, social, and educational elements are identified and examined. Alternatives are explored.
The information and data collected are analyzed.
An object-oriented database design criterion is developed.
An effective way of data mining using the object-oriented database is evaluated.
Task III: Proposed Effective Way of Data Mining Using an Object-Oriented Database
Propose an effective way of data mining in an object-oriented database.
Introduction to Database
The Database Management System
A Database Management System is a collection of software tools intended for the purpose of efficient storage and retrieval of data in a computer system. Some of the important concepts involved in the design and implementation of a Database Management System are discussed below.
The Database
A database is an integrated collection of automated data files related to one another in support of a common purpose.
A database is a collection of information that is organized so that it can easily be accessed, managed, and updated. In one view, databases can be classified according to types of content: bibliographic, full-text, numeric, and images.
Each file in a database is made up of data elements: numbers, dates, amounts, quantities, names, addresses and other identifiable items of data.
The smallest component of data in a computer is the bit, a binary element with the values 0 and 1. Bits are used to build bytes, which are used to build data elements. Data files contain records that are made up of data elements, and a database consists of files. Starting from the highest level, the hierarchy is as follows:
1. Database
2. File
3. Record
4. Data element
5. Character (byte)
6. Bit
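The hierarchy above can be sketched with a few plain data structures. This is an illustrative model only; the class names `Database`, `File` and `Record` are chosen here for the example and are not prescribed by any DBMS.

```python
# Illustrative sketch of the storage hierarchy: a database holds files,
# files hold records, and records hold named data elements.
from dataclasses import dataclass, field

@dataclass
class Record:
    elements: dict  # data element name -> data value

@dataclass
class File:
    name: str
    records: list = field(default_factory=list)

@dataclass
class Database:
    files: dict = field(default_factory=dict)  # file name -> File

db = Database()
customers = File("Customers")
customers.records.append(Record({"CustomerID": 1, "Name": "Asha"}))
db.files["Customers"] = customers

print(db.files["Customers"].records[0].elements["Name"])  # Asha
```

Each level of the hierarchy is reachable by walking down from the one above it, which is exactly the containment the numbered list describes.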
The Data Element
A data element is a place in a file used to store an item of information that is uniquely identifiable by its purpose and contents. A data value is the information stored in a data element. The data element has functional relevance to the application being supported by the database.
The Data Element Dictionary
A data element dictionary is a table of data elements including at least the names, data types and lengths of every data element in the subject database.
The data element dictionary is central to the application of the database management tools. It forms the basic database schema, or the meta-data, which is the description of the database. The DBMS constantly refers to this data element dictionary when interpreting the data stored in the database.
The Data Element Types
Relevant to the database management system, there is a variety of data types that are supported. Examples of common data element types are numeric, alphanumeric, character strings, date and time.
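As a concrete sketch, a table can be declared with typed data elements and the system's own catalogue, its data element dictionary, can then be queried. SQLite is used here purely for illustration; the `Customers` table and its columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Declare data elements with explicit types (numeric, string, real, date).
conn.execute("""
    CREATE TABLE Customers (
        CustomerID INTEGER,
        Name       TEXT,
        Balance    REAL,
        JoinedOn   DATE
    )
""")
# SQLite exposes its data element dictionary through PRAGMA table_info:
# each row lists, among other things, a column's name and declared type.
dictionary = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Customers)")]
print(dictionary)
# [('CustomerID', 'INTEGER'), ('Name', 'TEXT'), ('Balance', 'REAL'), ('JoinedOn', 'DATE')]
```

This is the same name/type information the text says the DBMS consults whenever it interprets stored data.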
Files
A database contains a set of files related to one another by a common purpose. A file is a collection of records. The records are alike in format but each record is unique in content; therefore the records in a file have the same data elements but different data element values.
A file is a set of records where the records have the same data elements in the same format.
The organization of the file provides functional storage of data related to the purpose of the system that the database supports. Interfile relationships are based on the functional relationships of their purposes.
Database Schemas
A schema is the expression of the database in terms of the files it stores, the data elements in each file, the key data elements used for record identification, and the relationships between files.
The translation of a schema into a database management software system usually involves using a language to describe the schema to the database management system.
The Key Data Elements
The primary key data element in a file is the data element used to uniquely identify and locate a desired record. The key can be a combination of more than one data element.
The definition of the file includes the specification of the data element or elements that are the key to the file. A file key logically points to the record that it indexes.
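A minimal sketch of declaring a key and locating a record through it. SQLite syntax is used for illustration; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# CustomerID is declared as the primary key: it uniquely identifies each record.
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")],
)
# The key logically points to exactly one record.
row = conn.execute("SELECT Name FROM Customers WHERE CustomerID = 2").fetchone()
print(row[0])  # Ravi
```

Because the key is unique, the lookup returns at most one record, which is exactly the "points to the record that it indexes" behavior described above.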
An Interfile Relationship
In a database, it is possible to relate one file to another in one of the following ways:
One to one
One to many
Many to one
Many to many
In such interfile relationships, the database management system may or may not enforce a form of data integrity called referential integrity.
Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets.
One to one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.
One to many: An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.
Many to one: An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.
Many to many: An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.
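A one-to-many relationship with enforced referential integrity can be sketched like this. SQLite is used for illustration, and the `Suppliers`/`Products` tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE Suppliers (SupplierID INTEGER PRIMARY KEY, Name TEXT)")
# Each product references exactly one supplier; a supplier can have many products.
conn.execute("""
    CREATE TABLE Products (
        ProductID  INTEGER PRIMARY KEY,
        SupplierID INTEGER REFERENCES Suppliers(SupplierID),
        Name       TEXT
    )
""")
conn.execute("INSERT INTO Suppliers VALUES (1, 'Acme')")
conn.execute("INSERT INTO Products VALUES (10, 1, 'Widget')")

# Referential integrity: a product pointing at a missing supplier is rejected.
try:
    conn.execute("INSERT INTO Products VALUES (11, 99, 'Orphan')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The rejected insert is the referential integrity enforcement the text mentions: the DBMS refuses a record whose foreign key points at nothing.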
The Data Models
The data in a database may be organized in 3 principal models:
Hierarchical Data Model: The relationships between the files form a hierarchy.
Network Data Model: This model is similar to the hierarchical model, except that a file can have multiple parents.
Relational Data Model: Here, the files have no parents and no children; files are unrelated. The relationships are explicitly defined by the user and maintained internally by the database.
The Data Definition Language
The format of the database and the format of the tables must be specified in a form that the computer can translate into the actual physical storage characteristics of the data. The Data Definition Language (DDL) is used for such a specification.
{CREATE, ALTER, DROP}
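A brief sketch of the three DDL statements in action. SQLite's dialect is used for illustration, and the `Employees` table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# CREATE defines a new table's structure.
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT)")
# ALTER changes an existing structure, here adding a new data element.
conn.execute("ALTER TABLE Employees ADD COLUMN Department TEXT")
# DROP removes the table definition (and its data) entirely.
conn.execute("DROP TABLE Employees")
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # []
```

Note that all three statements operate on the schema, not on individual records; record-level work belongs to the DML described next.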
The Data Manipulation Language
The Data Definition Language is used to describe the database to the DBMS; there is a need for a corresponding language for programs to use to communicate with the DBMS. Such a language is called the Data Manipulation Language (DML). The DDL describes the records to the application programs and the DML provides an interface to the DBMS. The first uses the record format and the second uses external function calls.
{SELECT, INSERT, UPDATE, DELETE}
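The four DML statements can be sketched together. SQLite is used for illustration; the `Customers` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, City TEXT)")

# INSERT adds records.
conn.executemany("INSERT INTO Customers VALUES (?, ?, ?)",
                 [(1, "Asha", "Mumbai"), (2, "Ravi", "Pune")])
# UPDATE changes data element values in existing records.
conn.execute("UPDATE Customers SET City = 'Mumbai' WHERE CustomerID = 2")
# DELETE removes records matching a condition.
conn.execute("DELETE FROM Customers WHERE Name = 'Asha'")
# SELECT retrieves the surviving records.
rows = conn.execute("SELECT Name, City FROM Customers").fetchall()
print(rows)  # [('Ravi', 'Mumbai')]
```

Unlike the DDL, none of these statements change the table's structure; they only manipulate the records stored in it.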
The Query Language
The Query Language is used primarily for the retrieval of data stored in a database. This data is retrieved by issuing query commands to the DBMS, which in turn interprets and appropriately processes them.
Figure 1: The Database System
Introduction to Data Warehouse and Data Mining
The Data Warehouse
A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. The term was coined by W. H. Inmon. IBM sometimes uses the term "information warehouse."
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
-Barry Devlin
Typically, a data warehouse is housed on an enterprise mainframe server. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized in the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point of view of the end user or knowledge worker who may need access to specialized, sometimes local, databases. The latter idea is known as the data mart.
Applications of data warehouses include data mining, Web Mining,and decision support systems (DSS).
The Data Mining
Data mining is sorting through data to identify patterns and establish relationships. It means looking for hidden patterns and trends in data that are not immediately apparent from summarizing the data.
Data mining parameters include:
Association: Looking for patterns where one event is connected to another event.
Sequence or path analysis: Looking for patterns where one event leads to another, later event.
Classification: Looking for new patterns (which may result in a change in the way the data is organized).
Clustering: Finding and visually documenting groups of facts not previously known.
Forecasting: Discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics).
Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behavior.
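As a small illustration of the clustering parameter above, here is a toy k-means pass over one-dimensional purchase amounts. The data and the choice of two clusters are invented for the example; real clustering would use a proper library and higher-dimensional data.

```python
# Toy k-means (k=2) on 1-D data: repeatedly assign each point to the nearest
# center, then move each center to the mean of its assigned points.
amounts = [12.0, 14.0, 13.0, 95.0, 99.0, 101.0]
centers = [amounts[0], amounts[-1]]  # crude initial guess

for _ in range(10):
    clusters = ([], [])
    for x in amounts:
        nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(round(c) for c in centers))  # [13, 98]
```

The two centers that emerge (low spenders around 13, high spenders around 98) are exactly the kind of "groups of facts not previously known" that the clustering parameter refers to.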
We are in an age often referred to as the information age. In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., organizations have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, organizations started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of structured databases and Database Management Systems (DBMS).
Efficient Database Management Systems have been very important assets for the management of a large corpus of data, and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of Database Management Systems has also contributed to the recent massive gathering of all sorts of information. Today, organizations have far more information than they can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making.
Confronted with huge collections of data, organizations have now developed new needs to help them make better managerial choices. These needs are the automatic summarization of data, the extraction of the essence of the information stored, and the discovery of patterns in raw data.
What kind of information is data mining collecting?
Organizations have been collecting a myriad of data, from simple numerical measurements and text documents to more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a non-exclusive list of the variety of information collected in digital form in databases and in flat files.
Business Transactions: Every transaction in the business industry is (often) memorized for perpetuity. Such transactions are usually time-related and can be inter-business deals, such as purchases, exchanges, banking and stock, or intra-business operations, such as management of in-house wares and assets. Large department stores, for example, thanks to the widespread use of bar codes, store millions of transactions daily, often representing terabytes of data. Storage space is not the major problem, as the price of hard disks is continuously dropping, but the effective use of the data in a reasonable time frame for competitive decision making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.
Scientific Data: Whether in a Swiss nuclear accelerator laboratory counting particles, in the Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about oceanic activity, or in an American university investigating human psychology, our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can capture and store new data faster than we can analyze the data already accumulated.
Medical and Personal Data: From government censuses to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. Governments, companies and organizations such as hospitals are stockpiling very important quantities of personal data to help them manage human resources, better understand a market, or simply assist clientele. Regardless of the privacy issues this type of data often raises, this information is collected, used and even shared. When correlated with other data, it can shed light on customer behavior and the like.
Surveillance Video and Pictures: With the amazing collapse of video camera prices, video cameras are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled, and thus their content is lost. However, there is a tendency today to store the tapes and even digitize them for future use and analysis.
Satellite Sensing: There is a countless number of satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface. NASA, which controls a large number of satellites, receives more data every second than all NASA researchers and engineers can cope with. Many satellite pictures and data are made public as soon as they are received, in the hope that other researchers can analyze them.
Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing laps, to swimming times, boxers' punches and chess positions, all the data are stored. Commentators and journalists use this information for reporting, but trainers and athletes would want to exploit this data to improve performance and better understand opponents.
Digital Media: The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the causes of the explosion in digital media repositories. In addition, many radio stations, television channels and film studios are digitizing their audio and video collections to improve the management of their multimedia assets.
CAD and Software Engineering Data: There is a multitude of Computer Assisted Design (CAD) systems for architects to design buildings or for engineers to conceive system components or circuits. These systems generate a tremendous amount of data. Moreover, software engineering is a source of considerable similar data, with code, function libraries, objects, etc., which need powerful tools for management and maintenance.
Virtual Worlds: There are many applications making use of three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, these virtual spaces are described in such a way that they can share objects and places. There is a remarkable amount of virtual reality object and space repositories available. Management of these repositories, as well as content-based search and retrieval from them, are still research issues, while the size of the collections continues to grow.
Text Reports and Memos (E-mail Messages): Most of the communications within and between companies or research organizations, and even private people, are based on reports and memos in textual form, often exchanged by e-mail. These messages are regularly stored in digital form for future use and reference, creating formidable digital libraries.
The World Wide Web Repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristics, and its very frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference, because of the broad variety of topics covered and the infinite contributions of resources and publishers. Many believe that the World Wide Web will become the compilation of human knowledge.
What are Data Mining and Knowledge Discovery?
With the enormous amount of data stored in files, databases, andother repositories, it is increasingly important, if not necessary, todevelop powerful means for analysis and perhaps interpretation ofsuch data and for the extraction of interesting knowledge that couldhelp in decision-making.
Data Mining, also popularly known as Knowledge Discovery inDatabases (KDD), refers to the nontrivial extraction of implicit,
previously unknown and potentially useful information from datain databases. While data mining and knowledge discovery indatabases (or KDD) are frequently treated as synonyms, datamining is actually part of the knowledge discovery process.
The following Figure 2 shows data mining as a step in an iterative knowledge discovery process.
Figure 2: Data Mining is the core of the Knowledge Discovery Process
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.
The iterative process consists of the following steps:
Data Cleaning: Also known as data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.
Data Integration: At this stage, multiple data sources, often heterogeneous, may be combined into a common source.
Data Selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.
Data Transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
Data Mining: This is the crucial step in which clever techniques are applied to extract potentially useful patterns.
Pattern Evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.
Knowledge Representation: This is the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
It is common to combine some of these steps together.
For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.
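The steps above can be sketched as a small pipeline. The functions and the toy customer records here are invented for illustration; they are not taken from any particular KDD tool.

```python
# A toy KDD pipeline: clean -> integrate -> select/transform -> mine.
source_a = [{"name": "Asha", "spend": "120"}, {"name": "", "spend": "??"}]
source_b = [{"name": "Ravi", "spend": "80"}]

def clean(records):
    # Data cleaning: drop noisy or irrelevant records.
    return [r for r in records if r["name"] and r["spend"].isdigit()]

def integrate(*sources):
    # Data integration: combine heterogeneous sources into one collection.
    return [r for s in sources for r in s]

def select_and_transform(records):
    # Selection + transformation: keep relevant elements, cast to numbers.
    return [(r["name"], int(r["spend"])) for r in records]

def mine(rows):
    # Stand-in for the mining step: flag high-spend customers as a pattern.
    return [name for name, spend in rows if spend >= 100]

data = select_and_transform(integrate(clean(source_a), clean(source_b)))
print(mine(data))  # ['Asha']
```

Note how cleaning and integration run together as a pre-processing phase, exactly the combination of steps the paragraph above describes.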
KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to pinpoint exactly where the value resides. It is, however, a misnomer, since mining for gold in rocks is usually called gold mining and not rock mining; by analogy, data mining should perhaps have been called knowledge mining instead. Nevertheless, data mining became the accepted customary term, and very rapidly became a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD), which describe a more complete process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.
What kind of Data can be mined?
In principle, data mining is not specific to one type of media ordata. Data mining should be applicable to any kind of informationrepository. However, algorithms and approaches may differ whenapplied to different types of data. Indeed, the challenges presentedby different types of data vary significantly.
Data mining is being put to use and studied for databases, including relational databases, object-relational databases and object-oriented databases; data warehouses; transactional databases; unstructured and semi-structured repositories such as the World Wide Web; advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases; and even flat files. Here are some examples in more detail:
Flat files: Flat files are actually the most common datasource for data mining algorithms, especially at the researchlevel. Flat files are simple data files in text or binary formatwith a structure known by the data mining algorithm to beapplied. The data in these files can be transactions, time-series data, scientific measurements, etc.
The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to count the videos grouped by category would be:

SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;
Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL can provide, such as predicting, comparing, and detecting deviations.
Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) that is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof.

Let us suppose that VideoStore becomes a franchise in New York. The many video stores belonging to the VideoStore company may have different databases with different structures. If an executive of the company wants to access the data from all stores for strategic decision-making, future direction, marketing, etc., it is more appropriate to store all the data in one site with a homogeneous structure that allows interactive analysis.

In other words, data from the different stores would be loaded, cleaned, transformed and integrated together. To facilitate decision making and multi-dimensional views, data warehouses are usually modeled with a multi-dimensional data structure. Figure 4 shows an example of a three-dimensional subset of a data cube structure used for the VideoStore data warehouse.
Figure 4: A multi-dimensional data cube structure commonly used in data warehousing

The figure shows summarized rentals grouped by film category, then a cross table of summarized rentals by film category and time (in quarters). The data cube gives the summarized rentals along three dimensions: category, time, and city. A cube contains cells that store values of some aggregate measure (in this case rental counts), and special cells that store summations along dimensions. Each dimension of the data cube contains a hierarchy of values for one attribute.
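The summations along dimensions described above can be made concrete with a tiny in-memory cube sketch; the rental figures and dimension values below are invented for illustration:

```python
from collections import defaultdict
from itertools import product

# Hypothetical rental records: (category, quarter, city, count).
rentals = [
    ("Comedy", "Q1", "New York", 120),
    ("Comedy", "Q2", "New York", 95),
    ("Drama",  "Q1", "New York", 80),
    ("Drama",  "Q1", "Boston",   60),
]

# Build a cube: every combination of a concrete value or the
# summation cell "*" along each of the three dimensions.
cube = defaultdict(int)
for category, quarter, city, count in rentals:
    for key in product((category, "*"), (quarter, "*"), (city, "*")):
        cube[key] += count

# Summarized rentals by category across all quarters and cities:
print(cube[("Comedy", "*", "*")])   # 215
# Grand total over all dimensions:
print(cube[("*", "*", "*")])        # 355
```

Each record contributes to eight cells (2 choices per dimension), which is exactly the pre-computed summation structure the text describes.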
Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files there may also be descriptive data for the items.

For example, in the case of the video store, the rentals table shown in Figure 6 represents the transaction database. Each record is a rental contract with a customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCRs, etc.).

Since relational databases do not allow nested tables (i.e. a set as an attribute value), transactions are usually stored in flat files or in two normalized transaction tables, one for the transactions and one for the transaction items. One typical data mining analysis on such data is the so-called market basket analysis, or association rules, in which associations between items occurring together or in sequence are studied.
Figure 6: Fragment of a transaction database for the rentals at VideoStore
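The market basket analysis mentioned above can be sketched by counting how often pairs of items occur in the same transaction; the transactions below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical rental transactions, each a set of items rented together.
transactions = [
    {"video", "game"},
    {"video", "game", "VCR"},
    {"video"},
    {"game", "VCR"},
]

# Count how often each pair of items appears in the same transaction.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Support of the pair: fraction of transactions containing both items.
support = pair_counts[("game", "video")] / len(transactions)
print(pair_counts[("game", "video")], support)  # 2 0.5
```

Real association-rule algorithms (e.g. Apriori) generalize this counting to item sets of any size and derive rules with support and confidence thresholds.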
Multimedia Databases: Multimedia databases include video, image, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia data is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.

Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical information like maps and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.
Figure 7: Visualization of spatial OLAP (from the GeoMiner system)
Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolutions of different variables, as well as the prediction of trends and movements of the variables in time. Figure 8 shows some examples of time-series data.
Figure 8: Examples of Time-Series Data
(Source: Thompson Investors Group)
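A classic first step in the trend analysis mentioned above is a simple moving average, which smooths a series to expose its direction; the price series below is invented for illustration:

```python
# Hypothetical daily closing prices.
prices = [10.0, 10.5, 11.0, 10.8, 11.4, 11.9, 12.1]

def moving_average(series, window):
    """Smooth a time series to expose the underlying trend."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

trend = [round(v, 2) for v in moving_average(prices, 3)]
print(trend)  # [10.5, 10.77, 11.07, 11.37, 11.8]
```

The smoothed values rise steadily, revealing the upward trend that the raw, noisier series partially hides.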
World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users access its resources daily. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web comprises three major components: the content of the Web, which encompasses the documents available; the structure of the Web, which covers the hyperlinks and the relationships between documents; and the usage of the Web, describing how and when the resources are accessed. A fourth dimension can be added relating to the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.
What can be discovered?
The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on inference from the available data.

The data mining functionalities and the varieties of knowledge they discover are briefly presented in the following list:
Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

For example, one may want to characterize the VideoStore customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing a summarization of the data, simple OLAP operations fit the purpose of data characterization.
Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental count is lower than 5. The techniques used for data discrimination are very similar to those used for data characterization, with the exception that data discrimination results include comparative measures.
Classification: Classification analysis is the organization of data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, and the model is used to classify new objects.

For example, after starting a credit policy, the VideoStore managers could analyze the customers' behavior vis-à-vis their credit, and accordingly label the customers who received credit with three possible labels: safe, risky and very risky. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
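As an illustrative sketch (one simple algorithm among many, not the method prescribed by the report), a nearest-neighbor classifier can learn such labels from a labeled training set; the attributes and values below are invented:

```python
# Hypothetical training set: (late_returns, rentals_per_year) -> credit label.
training = [
    ((0, 40), "safe"),
    ((1, 35), "safe"),
    ((4, 20), "risky"),
    ((9, 5),  "very risky"),
]

def classify(customer):
    """Label a new customer by its nearest neighbor in the training set."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda pair: dist(pair[0], customer))[1]

print(classify((0, 38)))  # safe
print(classify((8, 6)))   # very risky
```

A production classifier (decision tree, Bayesian, neural, etc.) would be trained the same way in principle: known labels in, a model out, new objects labeled by the model.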
Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen from the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
Clustering: Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
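As an illustrative sketch (one clustering approach among many), a tiny one-dimensional k-means can separate the yearly rental counts below into two groups without being given any labels; the data and initial centers are invented:

```python
# Hypothetical yearly rental counts; no class labels are given.
counts = [2, 3, 4, 30, 32, 35]

def kmeans_1d(data, centers, rounds=10):
    """Tiny k-means: assign each point to the nearest center, then
    move each center to the mean of its assigned points."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

low, high = kmeans_1d(counts, centers=[0.0, 40.0])
print(low, high)  # [2, 3, 4] [30, 32, 35]
```

The algorithm maximizes intra-class similarity (points near their own center) and minimizes inter-class similarity (centers pulled apart), exactly the principle stated above.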
Outlier analysis: Outliers are data elements that cannot be grouped into a given class or cluster. Also known as exceptions or surprises, outliers are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.
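A minimal, common way to flag such exceptions is to mark values that lie far from the mean; the daily rental totals below are invented for illustration:

```python
# Hypothetical daily rental totals; one day is suspiciously high.
totals = [48, 52, 50, 47, 51, 49, 200]

mean = sum(totals) / len(totals)
std = (sum((x - mean) ** 2 for x in totals) / len(totals)) ** 0.5

# Flag values more than two standard deviations from the mean.
outliers = [x for x in totals if abs(x - mean) > 2 * std]
print(outliers)  # [200]
```

Whether the flagged day is noise (a data entry error) or knowledge (a promotion day worth studying) depends on the domain, as the text notes.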
Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kinds of patterns their organization can discover, or needs to discover, from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.
Is all that is Discovered Interesting and Useful?
Data mining allows the discovery of knowledge that is potentially useful and previously unknown. Whether the knowledge discovered is new, useful or interesting is very subjective and depends upon the application and the user. It is certain that data mining can generate, or discover, a very large number of patterns or rules.

In some cases the number of rules can reach the millions. One can even think of a meta-mining phase to mine the oversized data mining results. To reduce the number of discovered patterns or rules that have a high probability of being non-interesting, one has to put a measurement on the patterns. However, this raises the problem of completeness. The user would want to discover all rules or patterns, but only those that are interesting. The measurement of how interesting a discovery is, often called interestingness, can be based on quantifiable objective elements, such as the validity of the patterns when tested on new data with some degree of certainty, or on subjective depictions such as the understandability, novelty or usefulness of the patterns.

Discovered patterns can also be found interesting if they confirm or validate a hypothesis the business analyst sought to confirm, or unexpectedly contradict a common belief. This raises the issue of describing what is interesting to discover, such as meta-rule-guided discovery that describes forms of rules before the discovery process, and interestingness refinement languages that interactively query the results for interesting patterns after the discovery phase. Typically, measurements for interestingness are based on thresholds set by the user. These thresholds define the completeness of the patterns discovered.

Identifying and measuring the interestingness of patterns and rules discovered, or to be discovered, is essential for the evaluation of the mined knowledge and the KDD process as a whole. While some concrete measurements exist, assessing the interestingness of discovered knowledge is still an important research issue.
How do we Categorize Data Mining Systems?
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following:

Classification according to the type of data source mined: This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, the World Wide Web, etc.

Classification according to the data model drawn on: This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.

Classification according to the kind of knowledge discovered: This classification categorizes data mining systems based on the kind of knowledge discovered, or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.
What are the Issues in Data Mining?
Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are discussed below. Note that these issues are not exclusive and are not ordered in any way.
Security and Social Issues:
Security is an important issue with any data collection that is shared and/or intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.
Mining Methodology Issues:
These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of the data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices.

For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.
Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.
More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space, and it usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This curse affects the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.
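The exponential growth can be made concrete with a quick count of cells in a discretized domain; the bin count of 10 per attribute is an arbitrary choice for illustration:

```python
# With b distinct values ("bins") per attribute, a domain with d
# attributes has b**d possible cells, so the search space grows
# exponentially with the number of dimensions.
def search_space_size(bins, dims):
    return bins ** dims

for d in (2, 5, 10):
    print(d, search_space_size(10, d))
# 2 100
# 5 100000
# 10 10000000000
```

At 10 dimensions the grid already has ten billion cells, far more than the records available to populate them, which is why high-dimensional data is so sparse and hard to mine.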
Performance Issues:
Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining deals with today. Terabyte sizes are common. This raises issues of the scalability and efficiency of data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset; however, concerns such as the completeness and choice of the samples may arise. Other topics under the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available, without having to re-analyze the complete dataset.
Calculate additional (derived) variables

This is fairly easy: business analysts can multiply, subtract, divide and add numbers, but the derived field should have some business meaning.
Find additional information, inside or outside the company.
Find the best algorithm
It is tempting to state that for each problem there is probably one best algorithm, so that all a data miner has to do is try a handful of really different algorithms and find out which one is best for the problem. But different data miners will use the same algorithm differently, according to their taste, experience, mood and preference. So find out which algorithm works best for the data miner and the business problem at hand.
Zoom in on the business targets
When data miners want to use a data mining model to select the customers who are most likely to buy the business's outstanding product XYZ, it is reasonable to use the past buyers of XYZ as the positive targets in the model. The data miner gets a model with an excellent lift and uses it for a mailing.

When the mailing campaign is over, the data miner has all the data the company needs to create a new, better model for product XYZ: the targets are now the past buyers of XYZ who responded to the mailing. With this new model, the data miner takes into account not only customers' natural propensity to buy, but also their willingness to respond to a mailing.

If the databases contain far more observations than the data mining tool can handle, the only thing the data miner can do is use samples: calculate the model on a sample, then use it. But the data miner can push it a bit further. Use the model to score the entire customer base, and then zoom in on the customers with the best scores, say the top 10%. Use them to calculate a new, second model which will use the far subtler differences in customer information to find the really promising ones.
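The "score everyone, then zoom in on the top decile" step can be sketched as follows; the customer IDs and scores are invented, and a real second model would of course be trained on the selected subset rather than merely printed:

```python
# Hypothetical scored customer base: (customer_id, model_score).
scores = [0.91, 0.15, 0.88, 0.40, 0.95, 0.30, 0.72, 0.10, 0.85, 0.55]
scored = [(f"C{i:03d}", s) for i, s in enumerate(scores)]

# Zoom in on the top-10% of scores; a second, finer-grained model
# would then be built on this promising subset only.
scored.sort(key=lambda pair: pair[1], reverse=True)
top_decile = scored[: max(1, len(scored) // 10)]
print(top_decile)  # [('C004', 0.95)]
```

The two-stage design lets the second model spend its capacity on the small differences among already-promising customers instead of on the easy bulk separation.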
Make it simple
Nevertheless, the data miner has to keep business data mining work as simple as possible, because the business that pays the bills wants the data miner to deliver good models, on time, for its campaigns.
Automate as much as possible
The data miner should not try out every possible algorithm in each data mining project. If problem A was best solved with algorithm X, then problem B, which is very similar to A, should probably also be tackled with algorithm X. There is no need to waste time checking out other algorithms.
Introduction to Object-Oriented Database
In the modern computing world, the amount of data generated and stored in the databases of organizations is vast and continues to grow at a rapid pace. The data stored in these databases possesses valuable hidden knowledge, and the discovery of such knowledge can be very fruitful for taking effective decisions. Thus the need for methods to extract knowledge from data is quite evident. Data mining, a promising approach to knowledge discovery, is the use of pattern recognition technologies with statistical and mathematical techniques for discovering meaningful new correlations, patterns and trends by analyzing large amounts of data stored in repositories. Data mining has made its impact on many applications such as marketing, customer relationship management, engineering, medicine, crime analysis, expert prediction, Web mining, and mobile computing, among others. In general, data mining tasks can be classified into two categories: descriptive mining and predictive mining.

Descriptive mining is the process of extracting the vital characteristics of data from databases. Descriptive mining techniques include clustering, association rule mining and sequential mining.

Predictive mining is the process of deriving hidden patterns and trends from data to make predictions. Predictive mining techniques consist of a series of tasks, namely classification, regression and deviation detection.

One of the important tasks of data mining is data classification, which is the process of finding a set of models that describe and distinguish data classes or concepts, in order to predict the class of objects whose class label is unknown.
Polymorphism is another important object-oriented programming concept. It is a general term which stands for "many forms". Polymorphism in brief can be defined as "one interface, many implementations". It is the property of being able to assign a different meaning or usage to something in different contexts; in particular, to allow an entity such as a variable, a function, or an object to take more than one form. Polymorphism is different from method overloading or method overriding alone. In the literature, polymorphism is classified into three different kinds, namely pure, static, and dynamic.

Pure polymorphism refers to a function which can take parameters of several data types.

Static polymorphism can be stated as function and operator overloading.

Dynamic polymorphism is achieved by employing inheritance and virtual functions.

Dynamic binding, or runtime binding, allows one to substitute polymorphic objects for each other at run time. Polymorphism has a number of advantages. Its chief advantage is that it simplifies the definition of clients, as it allows a client to substitute, at run time, an instance of one class for an instance of another class that has the same polymorphic interface.
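Dynamic polymorphism can be sketched in Python as follows; the Shape, Square and Circle classes are purely illustrative and not part of the proposed database design:

```python
# Dynamic polymorphism: which method body runs is decided at run time
# by the concrete class of the object, not by the variable's type.
class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    def area(self):
        return 3.14159 * self.radius ** 2

# One interface ("area"), many implementations.
shapes = [Square(3), Circle(1)]
print([round(s.area(), 2) for s in shapes])  # [9, 3.14]
```

The client loop never asks which class it is dealing with; each object supplies its own implementation, which is exactly the substitutability advantage described above.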
Object-Oriented Database (OODB)
The chief advantage of an Object-Oriented Database (OODB) is its ability to represent real-world concepts as data models in an effective and presentable manner. An OODB is optimized to support object-oriented applications and different types of structures, including trees, composite objects and complex data relationships. The OODB system handles complex databases efficiently, and it allows users to define a database with features for creating, altering, and dropping tables and establishing constraints. From the user's perception, an OODB is just a collection of objects and the inter-relationships among them. Objects that resemble each other in properties and behavior are organized into classes. Every class is a container for a set of common attributes and methods shared by similar objects.

The attributes, or instance variables, define the properties of a class.

The methods describe the behavior of the objects associated with the class.

A class/subclass hierarchy is used to represent complex objects, where the attributes of an object may themselves contain complex objects.
New Approach to the Design of an Object-Oriented Database

The computer literature generally defines three approaches to building an Object-Oriented Database Management System (OODBMS): extending an Object-Oriented Programming Language (OOPL), extending a Relational Database Management System (RDBMS), and starting from scratch.

The first approach develops an OODBMS by adding persistent storage to an Object-Oriented Programming Language (OOPL), to achieve multiple concurrent accesses with transaction support.

The second is an extended relational approach: an OODBMS is built by augmenting an existing Relational Database Management System (RDBMS) with object-oriented features such as classes and inheritance, methods and encapsulation, polymorphism and complex objects.

The third approach aims to revolutionize database technology in the sense that the OODBMS is designed from the ground up, as represented by UniSQL/UniOracle and OpenOODB (Open Object-Oriented Database).

In my design, I have employed the second approach, which extends relational databases by utilizing Object-Oriented Programming (OOP) concepts.
The proposed approach makes use of the Object-Oriented Programming (OOP) concepts of inheritance and polymorphism, to design an Object-Oriented Database (OODB) and to perform classification in it, respectively. Normally, a database is a collection of tables; hence, when I consider a database, it is bound to contain a number of tables with common fields. In my approach, I have grouped together such common sets of fields to form a single generalized table. The newly created table resembles the base class in an inheritance hierarchy; this ability to represent classes in a hierarchy is one of the eminent OOP concepts. Next, I have employed another important object-oriented characteristic, dynamic polymorphism, where different classes have methods of the same name and structure performing different operations based on the calling object. Polymorphism is specifically employed to achieve classification in a simple and effective manner. The use of these object-oriented concepts for the design of the OODB ensures that even complex queries can be answered more efficiently; in particular, the data mining task of classification can be achieved in an effective manner.

Let T denote the set of all tables in a database D, and let t be a subset of T, where t represents the set of tables in which some fields are in common. I then create a generalized table composed of all the common fields from the table set t. To portray the efficiency of my proposed approach, I consider a traditional database. A typical database for a large business organization will have a number of tables, but to best illustrate the OOP concepts employed in my approach, I have concentrated on three tables, namely Employees, Suppliers and Customers. The tables are represented as Table 1, Table 2 and Table 3 respectively.
From the above class structure, it is understood that every table has a set of general or common fields (the highlighted ones) and table-specific fields. Considering the Employee table, it has general fields like Name, Age, Gender, etc. and table-specific fields like Title, HireDate, etc. These general fields occur repeatedly in most tables. This causes redundancy and thereby increases space complexity. Moreover, if a query is given to retrieve a set of records for the whole organization satisfying a particular rule, there may be a need to search all the tables separately. This replication of general fields across tables thus leads to a poor design which hampers effective data classification. To perform better classification, I have designed an Object-Oriented Database (OODB) by incorporating the inheritance concept of Object-Oriented Programming (OOP).
Design of the Object-Oriented Database
In my proposed approach, I have designed an Object-Oriented Database (OODB) by utilizing the inheritance concept of Object-Oriented Programming (OOP), which eliminates the problem of redundancy. First, I located all the general or common fields from the table set t. Then, all these common fields are fetched and stored in a single table, which all the related tables can inherit. Thus the generalized table resembles the base class of the Object-Oriented Programming (OOP) paradigm. In my approach, I have created a new table called Person, which contains all those common fields, and the other tables like Employees and Customers inherit the Person table without redefining its fields.
Here, I have used two important mechanisms, namely generalization and composition. Generalization depicts an "is-a" relation and composition represents a "has-a" relation. Both relationships can be illustrated as follows: the generalized table Person contains all the common fields, and the tables Employees, Suppliers and Customers, which inherit the table Person, are said to have an "is-a" relationship with it, i.e., an Employee is a Person, a Supplier is a Person and a Customer is a Person. Similarly, to exemplify the composition relation, the table Person contains an object reference to the Places table as one of its fields. The table Person is then said to have a "has-a" relationship with the table Places, i.e., a Person has a Place, and similarly a Place has a Postal Code. Figure 10 represents the inheritance class hierarchy of the proposed Object-Oriented Database (OODB) design. In the pictured design, the small triangle represents the "is-a" relationship and the arrow represents the "has-a" relationship.
The generalized table Person is considered as the base class Person, and its fields are considered as the attributes of that base class. Therefore, the base class Person, which contains all the common attributes, is inherited by the other classes, namely Employees, Suppliers and Customers, which contain only the specialized attributes.
Moreover, inheritance allows me to define the generalized methods in the base class and the specialized methods in the subclasses.
For example, if there is a need to get the contact numbers of all the people associated with the organization, one can define a method getContactNumbers() in the base class Person, and it can be shared by its subclasses. In addition, the generalized class Person exhibits a composition relationship with another two classes, Places and PostalCodes: the class Person uses instance variables that are object references of the classes Places and PostalCodes. The tables in the proposed OODB design are shown in the following tables.
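The inheritance and composition design described above can be sketched as follows. The class names follow the report's tables; the concrete attribute names, constructor shapes and sample values are illustrative assumptions:

```python
# Sketch of the proposed design: Person is the generalized base class,
# Employee/Customer inherit it (is-a), and Person holds an object
# reference to Place (has-a). Names and sample data are illustrative.
class Place:                       # composed class: a Person *has a* Place
    def __init__(self, city, postal_code):
        self.city = city
        self.postal_code = postal_code   # a Place has a postal code

class Person:                      # generalized base class with the common fields
    def __init__(self, name, contact_number, place):
        self.name = name
        self.contact_number = contact_number
        self.place = place         # has-a relationship via object reference

    def get_contact_number(self):  # generalized method shared by all subclasses
        return self.contact_number

class Employee(Person):            # is-a relationship: an Employee is a Person
    def __init__(self, name, contact_number, place, title):
        super().__init__(name, contact_number, place)
        self.title = title         # specialized attribute only

class Customer(Person):            # a Customer is a Person
    def __init__(self, name, contact_number, place, company):
        super().__init__(name, contact_number, place)
        self.company = company     # specialized attribute only

e = Employee("Asha", "555-0101", Place("Mumbai", "400103"), "Analyst")
c = Customer("Ravi", "555-0102", Place("Mumbai", "400066"), "Acme")

# The generalized method is defined once and shared by every subclass.
print([p.get_contact_number() for p in (e, c)])
```

Because the common fields and methods live only in `Person`, adding a new entity requires defining only its specialized attributes.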
Table 7: Example of Extended Customers Table
Table 9: Example of Extended PostalCodes Table
Owing to the incorporation of the inheritance concept in the proposed design, the database designer can extend the database by effortlessly adding new tables, merely by inheriting the common fields from the generalized table.
Data Mining in the Designed Object-Oriented Database
Dynamic polymorphism, or late binding, allows the programmer to define methods with the same name in different classes; the method to be called is decided at runtime based on the calling object. This Object-Oriented Programming (OOP) concept, together with simple SQL/Oracle queries, can be used to perform classification in the designed Object-Oriented Database (OODB). Here, a single method can carry out the classification process for all the tables. The uniqueness of my concept is that the classification process can be performed using simple SQL/Oracle queries, while the existing classification approaches for Object-Oriented Databases (OODB) employ complex techniques such as decision trees, neural networks, nearest neighbor methods and more. The database administrator can also invoke the method for individual entities, namely Employees, Suppliers and Customers. By integrating the polymorphism concept, the code is simpler to write and easier to manage. As a result of the designed OODB, the task of classification can be carried out effectively using simple SQL/Oracle queries. Thus, in my approach, by incorporating the Object-Oriented Programming (OOP) concepts for designing the Object-Oriented Database (OODB), I have exploited the maximum advantages of OOP, and the task of classification is also performed more effectively.
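The role of dynamic polymorphism in the classification step can be sketched as below. Each entity class defines a method of the same name, and the version executed is bound at runtime by the calling object; the classification rules and thresholds here are invented for illustration:

```python
# Sketch: dynamic polymorphism for classification. Every entity class defines
# classify() with the same name; the method run is chosen at runtime based on
# the calling object. The rules and thresholds are illustrative assumptions.
class Employee:
    def __init__(self, years_of_service):
        self.years_of_service = years_of_service
    def classify(self):
        return "senior" if self.years_of_service >= 10 else "junior"

class Customer:
    def __init__(self, annual_purchases):
        self.annual_purchases = annual_purchases
    def classify(self):
        return "premium" if self.annual_purchases >= 50000 else "regular"

class Supplier:
    def __init__(self, on_time_rate):
        self.on_time_rate = on_time_rate
    def classify(self):
        return "reliable" if self.on_time_rate >= 0.95 else "review"

# A single loop classifies every entity type; no per-table code is needed.
records = [Employee(12), Customer(60000), Supplier(0.90)]
print([r.classify() for r in records])
```

This is why the code is simpler to manage: new entity tables only need to supply their own `classify()` method, and the existing classification loop works unchanged.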
Implementation and Results
In this section, I have presented the experimental results of my approach. The proposed approach for the design of the Object-Oriented Database (OODB) and for classification has been implemented with Oracle as the database. I have considered only three tables for experimentation, but in general an organization may have a large number of tables to manage, and the number of records in each table is typically enormous. The incorporation of the Object-Oriented Programming (OOP) concepts into such databases greatly reduces the implementation overhead incurred. Moreover, the memory space saved grows as the size of the tables increases. These are some of the eminent benefits of the proposed approach. I have performed a comparative analysis, drawing on Computer Reseller News (CRN) magazines and COMPUTER monthly newspaper, of the space utilized before and after generalization of the tables, and from this I have computed the saved memory space. The comparison is performed with varying numbers of records in the tables, namely 1000, 2000, 3000, 4000 and 5000, and the results are stated below in Table 10, Table 11, Table 12, Table 13 and Table 14 respectively.
                      Normalized                               Un-Normalized
   Tables          Fields  Records  Total     Memory Size   Fields  Total     Memory Size
                                    Records                         Records
 1 Customers       4       1000     4000      40000         15      15000     150000
 2 Employees       5       1000     5000      50000         16      16000     160000
 3 Suppliers       5       1000     5000      50000         16      16000     160000
 4 Persons         8       3000     24000     240000
 5 Places          3       500      1500      15000
 6 PostalCodes     4       250      1000      10000
   Total                            40500     405000                47000     470000

Table 10: Saved Memory Table
                      Normalized                               Un-Normalized
   Tables          Fields  Records  Total     Memory Size   Fields  Total     Memory Size
                                    Records                         Records
 1 Customers       4       2000     8000      80000         15      30000     300000
 2 Employees       5       2000     10000     100000        16      32000     320000
 3 Suppliers       5       2000     10000     100000        16      32000     320000
 4 Persons         8       6000     48000     480000
 5 Places          3       1000     3000      30000
 6 PostalCodes     4       500      2000      20000
   Total                            81000     810000                94000     940000

Saved Memory (KB): 126.9531
Table 11: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
                      Normalized                               Un-Normalized
   Tables          Fields  Records  Total     Memory Size   Fields  Total     Memory Size
                                    Records                         Records
 1 Customers       4       3000     12000     120000        15      45000     450000
 2 Employees       5       3000     15000     150000        16      48000     480000
 3 Suppliers       5       3000     15000     150000        16      48000     480000
 4 Persons         8       9000     72000     720000
 5 Places          3       1500     4500      45000
 6 PostalCodes     4       750      3000      30000
   Total                            121500    1215000               141000    1410000

Saved Memory (KB): 190.4297
Table 12: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
                      Normalized                               Un-Normalized
   Tables          Fields  Records  Total     Memory Size   Fields  Total     Memory Size
                                    Records                         Records
 1 Customers       4       4000     16000     160000        15      60000     600000
 2 Employees       5       4000     20000     200000        16      64000     640000
 3 Suppliers       5       4000     20000     200000        16      64000     640000
 4 Persons         8       12000    96000     960000
 5 Places          3       2000     6000      60000
 6 PostalCodes     4       1000     4000      40000
   Total                            162000    1620000               188000    1880000

Saved Memory (KB): 253.9063
Table 13: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
The graphical representation of the results is illustrated in Figure 11. From the graph, it is clear from the comparative analysis that the saved memory space increases as the number of records in each table increases.

Figure 11: Graph Demonstrating the above Evaluation Results

Moreover, in the proposed approach, I have placed the common methods in the generalized class and the entity-specific methods in the subclasses. Because of this design, a considerable amount of memory space is saved.
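The "Saved Memory (KB)" figures in the tables above follow directly from the memory-size totals, assuming the memory-size columns are in bytes and 1 KB = 1024 bytes:

```python
# Re-deriving the "Saved Memory (KB)" figures from the table totals above,
# assuming the memory-size columns are in bytes and 1 KB = 1024 bytes.
totals = {  # records per base table: (normalized bytes, un-normalized bytes)
    2000: (810000, 940000),    # Table 11
    3000: (1215000, 1410000),  # Table 12
    4000: (1620000, 1880000),  # Table 13
}
for n, (normalized, unnormalized) in sorted(totals.items()):
    saved_kb = (unnormalized - normalized) / 1024
    # agrees with the reported 126.9531, 190.4297 and 253.9063 KB
    print(n, saved_kb)
```

The saved space grows linearly with the record count, which matches the trend shown in Figure 11.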
Building Profitable Customer Relationships with
Data Mining
Once an organization has built its customer information and marketing data warehouse, how can it make good use of the data the warehouse contains?
Customer Relationship Management (CRM) helps companies improve the profitability of their interactions with customers while at the same time making those interactions appear friendlier through individualization. To succeed with CRM, companies need to match products and campaigns to prospects and customers; in other words, to intelligently manage the customer life cycle. Until recently, most CRM software focused on simplifying the organization and management of customer information. Such software, called operational CRM, concentrated on creating a customer database that presents a consistent picture of the customer's relationship with the company, and on providing that information in the specific applications, such as sales force automation and customer service, in which the company touches the customer. However, the sheer volume of customer information and increasingly complex interactions with customers have propelled data mining to the forefront of making customer relationships profitable. Data mining is a process that uses a variety of data analysis and modeling techniques to discover patterns and relationships in data that may be used to make accurate predictions. It can help the data miner select the right prospects on whom to focus, offer the right additional products to existing customers, and identify good customers who may be about to leave. The result is improved revenue, because of a greatly improved ability to respond to each individual contact in the best way, and reduced costs, due to properly allocating business resources. CRM applications that use data mining are called analytic CRM.
This section of the project describes the various aspects of analytic CRM and shows how it is used to manage the customer life cycle more cost-effectively. The case histories of the fictional companies mentioned are composites of real-life data mining applications.
Data Mining in Customer Relationship Management
The first and simplest analytical step in data mining is to describe the data: for example, summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look at the distribution of values of the fields in the organization's data.

But data description alone cannot provide an action plan. An organization must build a predictive model based on patterns determined from known results, and then test that model on results outside the original sample. A good model should never be confused with reality (business people know a road map isn't a perfect representation of the actual road), but it can be a useful guide to understanding the business.

Data mining can be used for both classification and regression problems.

In classification problems, the business analyst predicts which category something will fall into: for example, whether a person will be a good credit risk or not, or which of several offers someone is most likely to accept.

In regression problems, the business analyst predicts a number, such as the probability that a person will respond to an offer.
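The distinction between the two problem types can be sketched with two toy predictors. The rule, coefficients and cutoff below are invented for illustration, not fitted models:

```python
# Illustrative contrast between classification and regression.
# The 0.4 cutoff and the logistic coefficients are assumed, not real models.
import math

def classify_credit_risk(income, debt):
    # Classification: the output is a category.
    return "good" if debt / income < 0.4 else "poor"

def response_probability(recency_days):
    # Regression (here a logistic curve): the output is a number in [0, 1].
    # More recent customers (smaller recency) score a higher probability.
    return 1 / (1 + math.exp(0.05 * (recency_days - 60)))

print(classify_credit_risk(income=50000, debt=10000))  # a category
print(response_probability(recency_days=30))           # a probability
```

A classifier answers "which group?"; a regressor answers "how much?" or "how likely?".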
In CRM, data mining is frequently used to assign a score to a particular customer or prospect indicating the likelihood that the individual will behave in the desired way. For example, a score could measure the propensity to respond to a particular offer or to switch to a competitor's product. Data mining is also frequently used to identify a set of characteristics (called a profile) that segments customers into groups with similar behaviors, such as buying a particular product.

A special type of classification can recommend items based on similar interests held by groups of customers. This is sometimes called collaborative filtering.

The data mining technology used for solving classification, regression and collaborative filtering problems is briefly described in the Appendix at the end of the project.
Defining CRM

Customer Relationship Management in its broadest sense simply means managing all customer interactions. In practice, this requires using information about customers and prospects to interact with them more effectively in all stages of the business relationship. I refer to these stages as the customer life cycle.
The customer life cycle has three stages:
Acquiring customers
Increasing the value of the customer
Retaining good customers
Data mining can improve business profitability in each of these stages, either through integration with operational CRM systems or as independent applications.
Applying Data Mining to CRM
In order to build good models for the business CRM system, there are a number of steps the business must follow.

The Two Crows data mining process model described below is similar to other process models, such as the CRISP-DM model, differing mostly in the emphasis it places on the different steps. Keep in mind that while the steps appear in a list, the data mining process is not linear: the CRM implementor will inevitably need to loop back to previous steps. For example, what the implementor learns in the explore-data step may require adding new data to the data mining database, and the initial models the implementor builds may provide insights that lead to creating new variables.
The basic steps of data mining for effective CRM are:
Define business problem
Build marketing database
Explore data
Prepare data for modeling
Build model
Evaluate model
Deploy model and results
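The non-linearity of the process can be sketched as a loop over the step list, where one step may send the implementor back to an earlier one. The specific loop-back condition below is invented for illustration:

```python
# Toy sketch of the non-linear data mining process: exploring the data
# reveals (once, in this sketch) that new data must be added, so the
# implementor loops back to rebuilding the marketing database.
steps = ["define business problem", "build marketing database", "explore data",
         "prepare data for modeling", "build model", "evaluate model",
         "deploy model and results"]

def run_process():
    log, i, loop_backs = [], 0, 0
    while i < len(steps):
        log.append(steps[i])
        if steps[i] == "explore data" and loop_backs < 1:
            loop_backs += 1
            i = steps.index("build marketing database")  # loop back
        else:
            i += 1
    return log

print(run_process())
```

The recorded log shows "build marketing database" and "explore data" executed twice before the process proceeds to modeling and deployment.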
Define the business problem.
Each CRM application will have one or more business objectives for which the business analyst will need to build an appropriate model. Depending on the specific goal, such as increasing the response rate or increasing the value of a response, the business analyst will build a very different model. An effective statement of the problem will include a way of measuring the results of the CRM project.
Build a Marketing Database.
Steps two through four constitute the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as the business analyst learns something from the model that suggests modifying the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire data mining process!

The business analyst will need to build a marketing database because the operational databases and corporate data warehouse will often not contain the data in the format needed. Furthermore, CRM applications may interfere with the speedy and effective execution of these operational systems.

When building the marketing database, the data miner will need to clean it up: good models require clean data. The data needed may reside in multiple databases, such as the customer database, product database and transaction databases. This means the data from the various sources will need to be integrated and consolidated into a single marketing database, with differences in data values reconciled. Improperly reconciled data is a major source of quality problems. There are often large differences in the way data is defined and used in different databases. Some inconsistencies may be easy to uncover, such as different addresses for the same customer. What makes these problems more difficult to resolve is that they are often subtle. For example, the same customer may have different names or, worse, multiple customer identification numbers.
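One subtle reconciliation problem, the same customer appearing under different names and IDs, can be sketched with a simple name-normalization key. The normalization rule (lowercase, strip punctuation and spaces) and the sample records are illustrative assumptions:

```python
# Sketch: detecting the same customer recorded under different names/IDs.
# The normalization rule and sample records are illustrative assumptions.
import re

records = [
    {"id": "C-1001", "name": "Big Sam's Clothing"},
    {"id": "C-2047", "name": "BIG SAMS CLOTHING"},   # same customer, new ID
    {"id": "C-1002", "name": "Acme Traders"},
]

def key(name):
    # Collapse case, punctuation and spacing so near-duplicate names match.
    return re.sub(r"[^a-z0-9]", "", name.lower())

merged = {}
for r in records:
    merged.setdefault(key(r["name"]), []).append(r["id"])

# Any key with more than one ID is a candidate duplicate to reconcile.
duplicates = {k: ids for k, ids in merged.items() if len(ids) > 1}
print(duplicates)
```

In practice this kind of exact-key matching only catches the easy cases; fuzzier matching is usually needed for the subtle ones.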
Explore the data.
Before good predictive models can be built, the business analyst must understand the data. Start by gathering a variety of numerical summaries (including descriptive statistics such as averages, standard deviations and so forth) and looking at the distribution of the data.

The analyst may want to produce cross-tabulations (pivot tables) for multi-dimensional data. Graphing and visualization tools are a vital aid in data preparation, and their importance to effective data analysis cannot be overemphasized. Data visualization most often provides the leads to new insights and success. Some of the most common and useful graphical displays of data are histograms and box plots, which display distributions of values. The business analyst may also want to look at scatter plots of different pairs of variables in two or three dimensions. The ability to add a third, overlay variable greatly increases the usefulness of some types of graphs.
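The first exploration pass, gathering descriptive statistics for a field, can be sketched with the standard library. The sample values are illustrative:

```python
# Sketch: descriptive statistics for one field of the marketing database.
# The annual-spend sample values are illustrative assumptions.
import statistics

annual_spend = [1200, 950, 3100, 870, 2200, 1500, 980, 4100]

print("mean:", statistics.mean(annual_spend))
print("stdev:", statistics.stdev(annual_spend))
print("min/max:", min(annual_spend), max(annual_spend))
```

Even this small summary flags what a histogram would confirm: the field is skewed by a few high spenders, which matters when choosing and transforming model variables later.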
Prepare data for modeling.

This is the final data preparation step before building models, and the step where the most art comes in. There are four main parts to this step:

First, the business analyst selects the variables on which to build the model. Ideally, the analyst would take all available variables, feed them to the data mining tool and let it find the best predictors. In practice, this doesn't work very well. One reason is that the time it takes to build a model increases with the number of variables. Another is that blindly including extraneous columns can lead to models with less, rather than more, predictive power.
The next step is to construct new predictors derived from the rawdata.
For example, forecasting credit risk using a debt-to-income ratiorather than just debt and income as predictor variables may yieldmore accurate results that are also easier to understand.
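Constructing the derived predictor mentioned above is a one-line transformation over the raw columns; the sample applicant records are illustrative:

```python
# Sketch: constructing a derived predictor (debt-to-income ratio) from raw
# fields. The applicant records are illustrative assumptions.
applicants = [
    {"name": "A", "debt": 12000, "income": 48000},
    {"name": "B", "debt": 30000, "income": 50000},
]

for a in applicants:
    # One derived ratio often predicts better than the two raw columns alone,
    # and is easier to interpret ("debt is 60% of income").
    a["debt_to_income"] = a["debt"] / a["income"]

print([a["debt_to_income"] for a in applicants])
```

The same pattern applies to other derived predictors, such as recency, frequency and monetary-value ratios built from transaction data.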
Next, the business analyst may decide to select a subset or sample of the data on which to build models. With a lot of data, using all of it may take too long or require buying a bigger computer than the analyst would like. Working with a properly selected random sample usually results in no loss of information for most CRM problems. Given a choice between investigating a few models built on all the data or investigating more models built on a sample, the latter approach will usually help the analyst develop a more accurate and robust model of the problem.

Last, the business analyst will need to transform variables in accordance with the requirements of the algorithm chosen for building the model.
Data mining model building.
The most important thing to remember about model building is that it is an iterative process. The business analyst will need to explore alternative models to find the one that is most useful in solving the business pro