Unit 1: KDD Concepts


  • 1 Knowledge Discovery from Data

    Dr. Francisco Javier Luna Rosas

    Research Professor in Artificial Intelligence and Distributed Systems.

    [email protected]

  • 2 Unit 1: Concepts.

    1. Introduction.

    1.1 Basic concepts

  • 3 1. Introduction (1/16).

    Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long.

  • 4 1. Introduction (2/16).

    "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material.

  • 5 1. Introduction (3/16).

    Thus, such a misnomer that carries both "data" and "mining" became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.

  • 6 1. Introduction (4/16).

    Knowledge discovery as a process is depicted in Figure 1.4 and consists of an iterative sequence of the following steps:

    1. Data cleaning (to remove noise and inconsistent data).

    2. Data integration (where multiple data sources may be combined).

  • 7 1. Background (1/12).

  • 8 1. Introduction (6/16).

    3. Data selection (where data relevant to the analysis task are retrieved from the database).

    4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance).

  • 9 1. Introduction (7/16).

    5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns).

    6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures; Section 1.5).

  • 10 1. Introduction (8/16).

    7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).
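    The seven steps above can be sketched as a chain of small functions. This is only an illustrative outline, not the method of any particular system: the record layout, the function names, and the use of item support as the interestingness measure are all assumptions made for the example. The real process is iterative, while this sketch runs a single pass.

```python
from collections import Counter

def clean(records):
    """Step 1: drop records that contain a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several data sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the attributes relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate purchases into one set of items per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: compute the fraction of baskets containing each item."""
    counts = Counter(item for items in baskets.values() for item in items)
    return {item: n / len(baskets) for item, n in counts.items()}

def evaluate(patterns, min_support):
    """Step 6: keep patterns that meet the (assumed) support threshold."""
    return {p: s for p, s in patterns.items() if s >= min_support}

def present(patterns):
    """Step 7: render the surviving patterns as readable text lines."""
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]

def run_kdd(sources, min_support):
    data = integrate(*(clean(s) for s in sources))
    data = select(data, ["customer", "item"])
    return present(evaluate(mine(transform(data)), min_support))

# Hypothetical records from two sources (illustrative data only).
store_a = [
    {"customer": "c1", "item": "bread", "price": 2.0},
    {"customer": "c1", "item": "milk", "price": None},  # missing value
    {"customer": "c2", "item": "bread", "price": 2.0},
]
store_b = [
    {"customer": "c2", "item": "milk", "price": 1.5},
    {"customer": "c3", "item": "bread", "price": 2.0},
    {"customer": "c3", "item": "milk", "price": 1.5},
]
report = run_kdd([store_a, store_b], min_support=0.7)
```

    Here steps 1 to 4 only prepare the data; the patterns themselves first appear in `mine`, and `evaluate` decides which of them reach the user.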

  • 11 1. Introduction (9/16).

    Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation.

  • 12 1. Introduction (10/16).

    We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term of knowledge discovery from data. Therefore, in this course, we choose to use the term data mining. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

  • 13 1. Introduction (11/16).

    Based on this view, the architecture of a typical data mining system may have the following major components (Figure 1.5):

    Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

    Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
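    As a rough illustration of the server's role, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and the `fetch_relevant` helper are hypothetical names chosen for the example; the "mining request" is reduced to a single minimum-amount condition.

```python
import sqlite3

def fetch_relevant(conn, min_amount):
    """Return only the rows relevant to the (assumed) mining request."""
    cursor = conn.execute(
        "SELECT customer, amount FROM sales "
        "WHERE amount >= ? ORDER BY customer",
        (min_amount,),
    )
    return cursor.fetchall()

# Hypothetical repository contents (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ana", 120.0), ("ben", 15.0), ("eva", 80.0)],
)

rows = fetch_relevant(conn, min_amount=50.0)
conn.close()
```

    The mining components then work on `rows` rather than on the whole repository, which is the point of delegating selection to the server.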

  • 14 1. Introduction (12/16).

  • 15 1. Introduction (13/16).

    Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
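    A concept hierarchy can be pictured as a mapping from each value to its parent at the next level of abstraction. The sketch below is a minimal illustration under that assumption; the city/country/continent hierarchy and the helper names `generalize` and `roll_up` are made up for the example.

```python
from collections import Counter

# Hypothetical concept hierarchy: each value points to its parent concept.
PARENT = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "Canada": "North America",
    "USA": "North America",
}

def generalize(value, levels=1):
    """Climb the hierarchy by the given number of levels (stop at the top)."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value

def roll_up(values, levels=1):
    """Count attribute values after generalizing them to a higher level."""
    return dict(Counter(generalize(v, levels) for v in values))

cities = ["Vancouver", "Toronto", "Chicago", "Toronto"]
by_country = roll_up(cities, levels=1)
by_continent = roll_up(cities, levels=2)
```

    Mining at the country level may reveal patterns that are too sparse to notice city by city, which is one way such domain knowledge can guide the search.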

  • 16 1. Introduction (14/16).

    Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
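    One simple way to picture "a set of functional modules" is a registry that maps a task name to the function implementing it. The sketch below is an assumption about how such an engine could be organized, not a description of a real system. It includes only two toy modules, and the outlier rule (more than `k` population standard deviations from the mean) is a deliberately simple stand-in.

```python
from statistics import mean, pstdev

def characterize(values):
    """Characterization: summarize the general features of the data."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

def find_outliers(values, k=2.0):
    """Outlier analysis: values farther than k standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if s > 0 and abs(v - m) > k * s]

# Hypothetical module registry: task name -> implementing function.
MODULES = {
    "characterization": characterize,
    "outlier analysis": find_outliers,
}

def run_task(task, values, **options):
    """Dispatch a mining request to the module registered for the task."""
    if task not in MODULES:
        raise ValueError(f"no module registered for task: {task!r}")
    return MODULES[task](values, **options)

readings = [10, 11, 9, 10, 12, 10, 50]  # illustrative data only
summary = run_task("characterization", readings)
outliers = run_task("outlier analysis", readings, k=2.0)
```

    New tasks such as classification or cluster analysis would be added as further entries in `MODULES` without changing `run_task`.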

  • 17 1. Introduction (15/16).

    Pattern evaluation module: This component typically employs interestingness measures (Section 1.5) and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
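    To make "pushing the evaluation deep into the mining process" concrete, the sketch below applies a minimum-support threshold inside a level-wise itemset search instead of filtering after the fact. The Apriori-style search and the choice of support as the interestingness measure are assumptions introduced for illustration; the slides do not prescribe a specific algorithm.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search where the threshold prunes candidates as it goes."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        # Evaluate interestingness now, not after the whole search.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        result.update(kept)
        # Build larger candidates only from itemsets that passed the
        # threshold, and only if all their smaller subsets passed too.
        next_level = set()
        for a, b in combinations(kept, 2):
            union = a | b
            if len(union) != len(a) + 1:
                continue
            if all(frozenset(sub) in kept for sub in combinations(union, len(a))):
                next_level.add(union)
        level = list(next_level)
    return result

# Illustrative transactions only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
patterns = frequent_itemsets(transactions, min_support=0.5)
```

    With these transactions the pair {milk, butter} falls below the threshold, so the three-item candidate is never counted against the data. A filter applied only at the end would have had to count it first.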

  • 18 1. Introduction (16/16).

    User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

  • 19 Bibliography.

    [Han 2006] Data Mining: Concepts and Techniques, Second Edition, Jiawei Han, University of Illinois at Urbana-Champaign. Elsevier, 2006.