Data Mining and Data Visualization
Transcript of Data Mining and Data Visualization
-
8/3/2019 Data Mining and Data Visualization
1/28
Data Mining and Data Visualization
Prof. Rushen Chahal
-
A Picture is Worth a Thousand Words
Data mining is the set of activities used to find
new, hidden, or unexpected patterns in data.
These techniques are often called knowledge
discovery in databases (KDD), and include statistical
analysis, neural networks or fuzzy logic, intelligent
agents, and data visualization.
The KDD techniques not only discover useful
patterns in the data, but also can be used to
develop predictive models.
-
Verification Versus Discovery
In the past, decision support activities were
primarily based on the concept of verification.
This required a great deal of prior knowledge
on the decision-maker's part in order to verify
a suspected relationship.
With the advance of technology, the concept
of verification began to turn into discovery.
-
Data Mining's Growth in Popularity
One reason is that we keep getting more and
more data all the time and need tools to
understand it.
We are also aware that the human brain has
trouble processing multidimensional data.
A third reason is that machine learning
techniques are becoming more affordable
and more refined at the same time.
-
Making Accurate Predictions with Data Mining
Although the literature contains statements such as "data mining will
allow us to predict who will buy a particular product," that is against
human nature.
In situations where data mining is used to predict response to a marketing
campaign, only about 5% of the people selected as likely respondents actually
do respond.
-
Making Accurate Predictions with
Data Mining (cont.)
Although the accuracy of predicting
individual behavior is not high, it is
better than it seems, since direct
marketing efforts often have hit rates
of only about 1% without data mining.
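The comparison above is commonly expressed as lift: the response rate among people the model selected, divided by the baseline rate. The following is a minimal sketch using the approximate 5% and 1% figures from these slides. The function name and the mailing size of 10,000 are illustrative assumptions, not part of the original material.

```python
def lift(targeted_rate: float, baseline_rate: float) -> float:
    """Return how many times higher the targeted response rate is than the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return targeted_rate / baseline_rate


# Approximate response rates quoted in the slides.
with_mining = 0.05
without_mining = 0.01

# Hypothetical campaign size, chosen only to make the arithmetic concrete.
mailing_size = 10_000
extra_responses = round(mailing_size * (with_mining - without_mining))

print(f"lift: {lift(with_mining, without_mining):.1f}x")
print(f"extra responses per {mailing_size} mailings: {extra_responses}")
```

Under these assumptions the model is wrong about 95% of the individuals it selects, yet it still yields roughly five times as many responses as an untargeted mailing of the same size.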
-
Online Analytical Processing (OLAP)
Codd developed a set of 12 rules for the
development of multidimensional databases:
1. Multidimensional view
2. Transparent to user
3. Accessible
4. Consistent reporting
5. Client-server architecture
6. Generic dimensionality
7. Dynamic sparse matrix handling
8. Multiuser support
9. Cross-dimensional operations
10. Intuitive manipulation
11. Flexible reporting
12. Unlimited dimension and aggregation
-
OLAP as Implemented
To date, it does not appear that any
implementation exists that satisfies all 12
rules.
Some people argue it might not even be
possible to attain all of them.
More recently, the term OLAP has come to
represent the broad category of software
technology that enables multidimensional
analysis of enterprise data.
-
Multidimensional OLAP (MOLAP)
Data can be viewed across several dimensions. Here sales
are arrayed by region and product.
A fourth dimension could be added by using several
graphs, perhaps at different time points.
Most analyses have many more dimensions than
this. MOLAP handles data as an n-dimensional
hypercube.
[Figure: three-dimensional chart with axes Sales, Region, and Product]
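One way to picture the hypercube is as a set of cells, each addressed by one coordinate per dimension. The sketch below is an illustration only and does not describe any particular MOLAP product. The dimension names, the sales figures, and the helpers roll_up and slice_cube are all hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product", "quarter")

# Hypothetical sales cube: one cell per (region, product, quarter).
cube = {
    ("East", "Widgets", "Q1"): 4.0,
    ("East", "Gadgets", "Q1"): 3.0,
    ("West", "Widgets", "Q1"): 1.0,
    ("West", "Gadgets", "Q1"): 2.0,
    ("East", "Widgets", "Q2"): 5.0,
    ("West", "Gadgets", "Q2"): 2.5,
}


def roll_up(cells, keep):
    """Aggregate the cube, keeping only the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in idx)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Return the sub-cube where one dimension is fixed to a single member."""
    i = DIMENSIONS.index(dimension)
    return {c: v for c, v in cells.items() if c[i] == member}


# Total sales per region, summed over products and quarters.
by_region = roll_up(cube, ["region"])

# Sales by region and product for the first quarter only.
q1_by_region_product = roll_up(slice_cube(cube, "quarter", "Q1"), ["region", "product"])

print(by_region)
print(q1_by_region_product)
```

Adding a dimension only lengthens the coordinate tuple, which is why the same roll-up and slice operations work for any number of dimensions.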
-
Relational OLAP (ROLAP)
A large relational database server replaces
the multidimensional one.
The database contains both detailed and
summarized data, allowing drill-down
techniques to be applied.
SQL interfaces allow vendors to build tools that are
both portable and scalable.
This does require databases with many
relational tables, which may lead to
substantial processor overhead on complex
joins.
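The ROLAP idea can be sketched with Python's built-in sqlite3 module: a fact table joined to dimension tables, queried first at a summary level and then drilled down. The table names, column names, and figures here are hypothetical, and a production ROLAP server would be far larger than this in-memory example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales   (region_id INTEGER, product_id INTEGER, amount REAL);
    INSERT INTO region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO product VALUES (1, 'Widgets'), (2, 'Gadgets');
    INSERT INTO sales   VALUES (1, 1, 4.0), (1, 2, 3.0), (2, 1, 1.0),
                               (2, 2, 2.0), (1, 1, 5.0);
""")

# Summary level: total sales per region (one join).
summary = conn.execute("""
    SELECT r.name, SUM(s.amount)
    FROM sales s
    JOIN region r ON r.region_id = s.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()

# Drill down: break one region out by product (two joins).
drill_down = conn.execute("""
    SELECT p.name, SUM(s.amount)
    FROM sales s
    JOIN region r  ON r.region_id  = s.region_id
    JOIN product p ON p.product_id = s.product_id
    WHERE r.name = ?
    GROUP BY p.name
    ORDER BY p.name
""", ("East",)).fetchall()

print(summary)
print(drill_down)
```

Each additional dimension in a query adds another join, which is the source of the processor overhead mentioned above.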
-
A Typical Relational Schema
-
Data Mining Technologies
Statistics: the most mature of the data mining
technologies, but often not applicable
because they need clean data. In addition,
many statistical procedures assume linear
relationships, which limits their use.
Neural networks, genetic algorithms, fuzzy
logic: these technologies are able to work
with complicated and imprecise data. Their
broad applicability has made them popular in
the field.
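The point about linear assumptions can be illustrated with the Pearson correlation coefficient, which measures only linear association. This is a small sketch with made-up data. The pearson helper is written out by hand here and is not taken from the slides.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient, a measure of linear association."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]  # straight-line relationship
curved = [x * x for x in xs]      # exact, but nonlinear, relationship

r_linear = pearson(xs, linear)
r_curved = pearson(xs, curved)

print(f"linear: r = {r_linear:.2f}")
print(f"curved: r = {r_curved:.2f}")
```

The curved series is completely determined by x, yet its correlation with x is zero, so a procedure that assumes linearity would miss the pattern entirely.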
-
Data Mining Technologies (cont.)
Decision trees: these technologies are
conceptually simple and have gained in
popularity as better tree-growing
software was introduced. Because of
the way they are used, they are perhaps
better called classification trees.
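The core step in growing a classification tree is choosing the split that best separates the classes. The sketch below picks a single split by minimizing weighted Gini impurity, which is one common criterion and is an assumption here, since the slides do not name one. The customer data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (impurity, feature index, threshold) for the best single split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put records on both sides
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, f, threshold)
    return best


# Hypothetical customers: (age, prior purchases) and whether they responded.
rows = [(25, 0), (32, 1), (41, 4), (47, 5), (52, 0), (58, 6)]
labels = ["no", "no", "yes", "yes", "no", "yes"]

impurity, feature, threshold = best_split(rows, labels)
print(f"split on feature {feature} at <= {threshold} (impurity {impurity:.2f})")
```

A full tree repeats this step on each resulting subset until the subsets are pure or too small to split further.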
-
The Knowledge Discovery Search Process
Define the business problem and
obtain the data to study it.
Use data mining software to model
the problem.
Mine the data to search for patterns
of interest.
-
The Knowledge Discovery Search Process (cont.)
Review the mining results and refine
them by respecifying the model.
Once validated, make the model
available to other users of the data warehouse (DW).
-
New Applications for Data Mining
As the technology matures, new applications
emerge, especially in two new categories,
text mining and web mining. Some text
mining examples are:
Distilling the meaning of a text
Accurate summarization of a text
Explication of the text theme structure
Clustering of texts
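Clustering of texts, the last example above, usually starts from a measure of how similar two documents are. The sketch below uses cosine similarity over simple word counts, which is one common choice and an assumption here. The three sample documents and the helper names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents, for illustration only.
docs = {
    "a": "sales rose in the east region",
    "b": "east region sales rose sharply",
    "c": "the pipeline compressor failed",
}
vectors = {name: term_counts(text) for name, text in docs.items()}


def most_similar(name):
    """Return the other document whose word counts are closest to `name`."""
    others = (n for n in vectors if n != name)
    return max(others, key=lambda n: cosine(vectors[name], vectors[n]))


print(most_similar("a"))
```

A clustering routine would group documents whose pairwise similarity is high. Real text mining systems typically also weight terms and remove common words such as "the", which this sketch does not do.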
-
Web mining
Web mining is a special case of text mining
where the mining occurs over a website.
It enhances the website with intelligent
behavior, such as suggesting related links or
recommending new products.
It allows you to unobtrusively learn the
interests of the visitors and modify their user
profiles in real time.
It also allows you to match resources to the
interests of the visitor.
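Suggesting related links can be sketched by counting which pages are viewed together in the same visit. This is a simplified illustration and not a description of any specific web mining product. The session data, page paths, and the related_links function are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical click sessions: the pages each visitor viewed.
sessions = [
    ["/tents", "/sleeping-bags", "/stoves"],
    ["/tents", "/sleeping-bags"],
    ["/stoves", "/fuel"],
    ["/tents", "/stoves"],
]

# For every page, count how often each other page appears in the same session.
co_views = {}
for pages in sessions:
    for page, other in permutations(set(pages), 2):
        co_views.setdefault(page, Counter())[other] += 1


def related_links(page, k=2):
    """Return up to k pages most often viewed in the same session as `page`."""
    counts = co_views.get(page, Counter())
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked[:k]]


print(related_links("/tents"))
print(related_links("/fuel"))
```

Because the counts are built from page views alone, the visitor does not have to be asked anything, which is the unobtrusive learning described above.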
-
Current Limitations and
Challenges to Data Mining
Despite the potential power and value, data
mining is still a new field. Some things that
thus far have limited advancement are:
Identification of missing information: not
all knowledge gets stored in a database
Data noise and missing values: future
systems need better ways to handle this
Large databases and high dimensionality:
future applications need ways to partition
data into more manageable chunks
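Two of the challenges above, missing values and partitioning, have simple baseline treatments that are easy to show. The sketch below fills gaps with the mean of the observed values and splits a data set into fixed-size chunks. Both are deliberately basic stand-ins chosen for illustration, and the column of incomes is hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def chunks(records, size):
    """Yield successive fixed-size partitions of a sequence."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


incomes = [40, None, 60, 50, None]  # hypothetical column with gaps
filled = impute_mean(incomes)
parts = list(chunks(filled, 2))

print(filled)
print(parts)
```

Mean imputation hides how many values were missing and shrinks the variance of the column, which is one reason better handling remains an open need.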
-
3-6: Data Visualization:
Seeing the Data
-
Visual Presentation
For any kind of high dimensional data set,
displaying predictive relationships is a
challenge.
Shading is used to represent relative degrees
of thunderstorm activity, with the darkest
regions showing the heaviest activity.
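The shading technique amounts to mapping each value onto a small scale of tones. The sketch below does this with text characters in place of colors. The character scale, the grid of activity counts, and the shade function are all hypothetical.

```python
SHADES = " .:*#"  # lightest to darkest


def shade(value, lo, hi):
    """Map a value in [lo, hi] to one of the shade characters."""
    if hi == lo:
        return SHADES[0]
    # Integer bucket index; the maximum value is clamped into the last bucket.
    index = min((value - lo) * len(SHADES) // (hi - lo), len(SHADES) - 1)
    return SHADES[index]


# Hypothetical grid of storm-activity counts.
grid = [
    [0, 1, 2],
    [1, 4, 8],
    [2, 8, 10],
]
lo = min(min(row) for row in grid)
hi = max(max(row) for row in grid)
rendered = ["".join(shade(v, lo, hi) for v in row) for row in grid]

for line in rendered:
    print(line)
```

The darkest characters cluster in the lower right corner of the output, where the counts are highest, without the reader having to compare any numbers.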
-
A Bit of History
An early effort used sequences of two-
dimensional graphs to add depth.
Current virtual reality programs allow the user
to step through a data set. Try going to a
realtor's website and taking a tour of a house
up for sale.
-
Human Visual Perception and Data Visualization
Data visualization is so powerful because the
human visual cortex converts objects into
information so quickly.
The next three slides show (1) usage of
global private networks, (2) flow through
natural gas pipelines, and (3) a risk analysis
report that permits the user to draw an
interactive yield curve.
All three use height or shading to add
dimensions to the figure.
-
Global Private Network Activity
[Figure legend: High Activity, Low Activity]
-
Natural Gas Pipeline Analysis
Note: Height shows total flow through compressor stations.
-
An Enlivened Risk Analysis Report
-
Geographical Information Systems
A GIS is a special purpose database that
contains a spatial coordinate system. A
comprehensive GIS requires:
1. Data input from maps, aerial photos, etc.
2. Data storage, retrieval and query
3. Data transformation and modeling
4. Data reporting (maps, reports and plans)
-
The Special Capabilities of a GIS
In general, a GIS contains two types of data:
Spatial data: these elements correspond to a
uniquely defined location on earth. They
could be in point, line or polygon form.
Attribute data: these are the data that will
be portrayed at the geographic
references established by spatial data.
Example: Data from an opinion poll is
displayed for multiple regions in the United
States. Clicking on an area allows the user
to drill down to the results for smaller areas.
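The split between spatial and attribute data can be sketched as two lookups: polygons that define where each region is, and values attached to each region by name. The example below resolves a clicked point to a region with a standard ray-casting test and then returns that region's attribute. The coordinates, region names, and approval figures are hypothetical, and a real GIS would use a proper geographic coordinate system.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Spatial data: hypothetical regions as polygons in an arbitrary grid.
regions = {
    "North": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "South": [(0, 0), (10, 0), (10, 5), (0, 5)],
}

# Attribute data: hypothetical poll results tied to each region.
approval = {"North": 0.54, "South": 0.47}


def poll_result_at(x, y):
    """Return (region name, approval) for the region containing the point."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name, approval[name]
    return None


print(poll_result_at(3, 7))
print(poll_result_at(3, 2))
```

Drilling down would follow the same pattern, with each region holding its own set of smaller polygons and their attribute values.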
-
Telephone Polling Results
Note: On the live map, clicking on an area allows the user
to drill down and see results for smaller areas.