01 introduction to data mining

336331 (Data Mining). 1: Introduction to Data Mining

Transcript of 01 introduction to data mining

  • 1. 336331 (Data Mining). 1: Introduction to Data Mining
  • 2. Data Mining: file processing, data structures, sorting, indexing, searching
  • 3. Data Mining (cont.): 1960, 1970, 1980; database management systems: hierarchical database system, network database system, relational database system
  • 4. Data Mining (cont.): data modeling (Entity-Relationship Model), B+Tree indexing, SQL (Structured Query Language), query processing, query optimization, data recovery, concurrency control, On-Line Transaction Processing (OLTP)
  • 5. Data Mining (cont.): 1980; data warehouse; OLAP (Online Analytical Processing); data mining
  • 6. Data Mining (cont.): statistics, machine learning, information science, visualization; "NSA Data Mining", "CIA Wins Control of Terrorist Data Mining Program"; business intelligence, bioinformatics, hydroinformatics
  • 7. What is data mining? Data mining; Knowledge Discovery in Databases (KDD); patterns; associations; data mining software
  • 8. Data mining as a step in the process of knowledge discovery in databases (KDD), in order: Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation, Knowledge Presentation
  • 9.
      - 1960s and earlier: data collection (primitive file processing)
      - 1970s: database management systems (network and relational database management systems, data modeling tools, query languages)
      - 1980s to present: advanced database management systems (advanced data models, object-oriented database management systems, object-relational database management systems)
      - 1990s to present: decision support systems (data warehouses, data mining); XML-based database systems, web mining
  • 10.
  • 11. We are drowning in data, but starving for knowledge!
  • 12. Data explosion, data warehousing, machine learning, artificial intelligence
  • 13. Database systems, data warehouses, OLAP; machine learning; statistics and data analysis methods; information science
  • 14. Mathematical programming, high-performance computing, visualization
  • 15. Key terms: nontrivial, valid, novel (previously unknown), potentially useful, interesting, understandable
  • 16. What kinds of data can be mined? Relational databases; transactional databases; data warehouses; transaction data; advanced databases and information repositories; object-oriented and object-relational databases; spatial databases
  • 17. Time-series data and temporal data; text databases; multimedia databases; WWW
  • 18. Relational Database: database management system (DBMS); columns or fields; tuples
      customer
      Cust_ID  name   address          age  income  Credit_info
      C1       Smith  111, Chicago,..  21   $2700   1
      ..
      purchase
      Trans_ID  Cust_ID  Item_ID  Date      Time   Method_pay  amount
      001       C1       I3       31/05/10  10:00  Visa        $20000
  • 19. Relational Database: database management system (DBMS); relational; SQL
  • 20. Transactional databases: transactions; point-of-sale
  • 21. Transactional databases: market basket analysis (frequent itemsets)
      Item_sold
      Trans_ID  Item_ID  qty
      T100      Item3    1
      T100      Item8    2
      ..        ..       ..
  • 22. Data Warehouses: heterogeneous data sources, unified schema. Figure: Data Source 1, Data Source 2, Data Source 3 -> Clean, Transform, Integrate, Load -> Data Warehouse -> Query and analysis tools -> Clients
  • 23. Data Warehouses. Problems of multiple sources: schema differences, naming differences, data type differences, value differences, semantic differences, missing values
  • 24. Data Warehouses: QuickCar 3 3
  • 25. Data Warehouses. Figure: multiple sources (Khon Kaen, Chiang Mai, Songkla) -> Clean, Transform, Integrate, Load -> Data Warehouse -> Query and analysis tools -> Clients
  • 26. Data Warehouses: QuickCar
      - Schema differences. Branch A: Cars(serialNo, model, color, autoTrans, cdPlayer, ...). Branch B: QuickCar(serial, model, color), Options(serial, option)
      - Naming differences. Branch A: table name Cars. Branch B: table name QuickCar
      - Data type differences. Branch A: serialNo is an integer. Branch B: serial is a string
  • 27. Data Warehouses
      - Value differences. Branch A: color "black". Branch B: color "BL" (can be confused with blue)
      - Semantic differences. Branch A: QuickCar means cars. Branch B: QuickCar means cars and 4x4 W
      - Missing values. Branch A: model is Civic DX, LX or EX. Branch B: model is Civic
  • 28. Advanced databases and information repositories: object-oriented databases (text, multimedia data); object-relational databases
  • 29. Spatial databases
  • 30. Time-series and temporal databases
  • 31. Text databases (articles)
  • 32. Multimedia databases: voice mail
  • 33. World Wide Web: distributed; WWW, web pages, web access logs
  • 34.
      - Problem Understanding: determine objective, define success criteria, assess situation, determine data mining goals, produce a project plan
      - Data Understanding: collect initial data, define success criteria, describe data, explore data, verify data quality
      - Data Preparation: select data, clean data, transform data
      - Modeling: select modeling technique, generate test design, build a model, assess the model
      - Evaluation: evaluate results, review process, determine next steps
      - Deployment: plan the deployment, monitor and maintain, final report
  • 35. 1. 5%
  • 36. 2.
  • 37. 3.
  • 38. 4.
  • 39. 5. 6.
  • 40. Figure: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; filtering and data preprocessing (data cleaning, data integration); database, data warehouse; Knowledge Base
  • 41. Database and data warehouse; database/data warehouse server; knowledge base
  • 42. Data Mining Engine (modules); Pattern Evaluation Module; Graphical User Interface
  • 43. What kinds of patterns can be mined? DM strategies: predictive or supervised modeling (classification, prediction, estimation/regression); descriptive or unsupervised modeling (associations, clustering)
  • 44. 1. Predictive/supervised modeling (inference). 2. Descriptive/unsupervised modeling (association, clustering)
  • 45. Mining association rules: transactional data; correlation; causality; market basket analysis; rules of the form X => Y, measured by support and confidence
  • 46. Example: from the AllElectronics shop relational database, a data mining system may find a single-dimensional association rule: computer => software, or contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 50%]. If a transaction T contains computer, there is a 50% chance that it also contains software; 1% of all of the transactions contain both computer and software.
  • 47. Example: from the AllElectronics shop relational database, a data mining system may find a multidimensional association rule: age(X, "20..29") AND income(X, "20K..29K") => buys(X, "CD player") [support = 2%, confidence = 60%]. A support of 2% means that 2% of the customers are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player at the AllElectronics shop; a confidence of 60% means a 60% probability that a customer in this age and income group will purchase a CD player.
  • 48.
  • 49. Example: Classification: Decision Tree
      Business Info
      Age  Rent Period  Buy
      23   3            No
      36   1.5          No
      20   1.5          No
      27   2            Yes
      20   1            No
      50   2.5          Yes
      36   1            No
      36   2            Yes
      22   2.5          No
      Figure: decision tree with the tests "Customer renting property > 2 years?" and "Customer age > 25 years?", Yes/No branches, and the leaves Rent Property and Buy Property
  • 50. Example: Prediction: Neural Network. Figure: inputs "Customer renting property > 2 years?" and "Customer age > 25 years?"; weights 0.6, 0.4, 0.5, 0.3, 0.7, 0.4; output: class (rent or buy property)
  • 51. Maximize the intraclass similarity, minimize the interclass similarity (Class A, Class B)
  • 52. Monthly payment (baht)
      2009: month 1: 25,000.00; month 2: 30,000.00; month 3: 17,000.00; ..; month 12: 23,500.00
      2010: month 1: 10,000.00; month 2: 15,000.00; month 3: 1,500,000.00
      Outlier values can be detected by: location, type of purchase, purchase frequency
  • 53.
  • 54. Market basket analysis
  • 55. Which technologies are used? Statistics, machine learning, database systems and data warehouses, information retrieval
  • 56. Mining path traversal
  • 57. Data mining programs: Oracle Data Warehouse Building, SQL Analysis, Weka, RapidMiner, Knime, Keel
  • 58. LAB 1: Data Mining Tool Function
  • 59. 1: 1. Data Mining; 2. Knowledge Discovery in Databases; 3.; 4.; 5. Data Warehouse; 6.; 7.; 8.; 9.; 10. Data Mining
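The QuickCar schema, naming, and data type differences listed on slide 26 can be made concrete with a small sketch that maps one record from each branch into a single unified record. The slides give only the two branch schemas, so the `UnifiedCar` class, the `OPTION_COLUMNS` tuple, both function names, and the serial numbers in the sample rows are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical unified schema; the slides do not define one.
@dataclass
class UnifiedCar:
    serial: str   # kept as text so both branches' serials fit
    model: str
    color: str
    options: set = field(default_factory=set)

# Branch A stores one boolean column per option:
# Cars(serialNo, model, color, autoTrans, cdPlayer, ...)
OPTION_COLUMNS = ("autoTrans", "cdPlayer")

def from_branch_a(row: dict) -> UnifiedCar:
    """Branch A: integer serialNo, options as boolean columns."""
    options = {name for name in OPTION_COLUMNS if row.get(name)}
    return UnifiedCar(str(row["serialNo"]), row["model"], row["color"], options)

def from_branch_b(car: dict, option_rows: list) -> UnifiedCar:
    """Branch B: string serial, options in a separate Options(serial, option) table."""
    options = {o["option"] for o in option_rows if o["serial"] == car["serial"]}
    return UnifiedCar(car["serial"], car["model"], car["color"], options)

# Sample rows; the serial numbers are made up.
a = from_branch_a({"serialNo": 1001, "model": "Civic DX", "color": "black",
                   "autoTrans": True, "cdPlayer": False})
b = from_branch_b({"serial": "2002", "model": "Civic", "color": "BL"},
                  [{"serial": "2002", "option": "cdPlayer"}])
print(a)
print(b)
```

The sketch deliberately leaves the color codes and the coarser Branch B model names untouched: the value, semantic, and missing-value differences on slide 27 need domain decisions that the slides do not supply.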
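Slides 45 to 47 quote support and confidence values without showing how they are obtained. The following is a minimal sketch, assuming the usual definitions: the support of X => Y is the fraction of all transactions that contain both X and Y, and the confidence is the number of transactions containing both divided by the number containing X. The function name `support_confidence` is hypothetical, and the baskets are made up for illustration; they are not AllElectronics data.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain every item of both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up market baskets, one set of items per transaction.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

s, c = support_confidence(baskets, {"computer"}, {"software"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

In this toy data, 2 of the 5 baskets contain both items, so the support is 40%, and 2 of the 3 baskets with a computer also contain software, so the confidence is about 67%.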
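The decision tree on slide 49 can be read as two nested tests. The sketch below assumes the branch layout that the flattened slide text suggests: a rental period of at most 2 years leads to Rent Property, and otherwise the age test decides. That layout, the function name `classify`, and its argument names are assumptions, not something the transcript states outright.

```python
def classify(rent_years: float, age: int) -> str:
    """Assumed reading of the slide's tree; the branch layout is not certain."""
    if rent_years > 2:        # "Customer renting property > 2 years?"
        if age > 25:          # "Customer age > 25 years?"
            return "Buy Property"
        return "Rent Property"
    return "Rent Property"

# Three rows taken from the slide's table, assuming its flattened
# columns line up row by row.
print(classify(rent_years=2.5, age=50))
print(classify(rent_years=3, age=23))
print(classify(rent_years=1.5, age=36))
```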