Agenda 02/21/2013


Page 1: Agenda  02/21/2013

Agenda 02/21/2013

Discuss exercise
Answer questions in task #1
Put up your sample databases for tasks #2 and #3

Define ETL in more depth by the activities performed.
Discuss the “controversy” in ETL activities

Page 2: Agenda  02/21/2013

Discussed in prior classes...

Lots of data.

Traditional transaction processing systems
Non-traditional transaction processing

Call center; Click-stream; Loyalty card; Warranty cards/product registration information

External data from government and commercial entities.

Lots of poor quality data for lots of reasons that can be traced back to lots of people.

Page 3: Agenda  02/21/2013

Master Data Management: What does it mean and why is it so difficult to manage master data?

Page 4: Agenda  02/21/2013

Populating the data warehouse

Extract
Take data from source systems. May require middleware to gather all necessary data.

Transformation
Put data into consistent format and content.
Validate data – check for accuracy, consistency using pre-defined and agreed-upon business rules.
Convert data as necessary.

Load
Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how. Keep a detailed load log to audit updates to the data warehouse.
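The three activities above can be sketched end to end. This is a minimal illustration, not a production design: the sample rows, the VALID_STATES rule, the customer and load_log tables, and the source name crm_extract are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
from datetime import datetime, timezone

VALID_STATES = {"IL", "IN", "WI"}  # hypothetical agreed-upon business rule


def extract():
    """Stand-in for reading from a source system (hard-coded sample rows)."""
    return [
        {"name": " Beth Parker ", "state": "il"},
        {"name": "Sam Jones", "state": "ZZ"},  # fails validation below
    ]


def transform(rows):
    """Put rows into a consistent format and validate them against the rule."""
    clean, rejected = [], []
    for row in rows:
        name = row["name"].strip()
        state = row["state"].strip().upper()
        if state in VALID_STATES:
            clean.append((name, state))
        else:
            rejected.append(row)
    return clean, rejected


def load(conn, rows, source):
    """Bulk-insert rows and record what was loaded, where, and when."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer (name TEXT, state TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log "
        "(source TEXT, target TEXT, loaded_at TEXT, row_count INTEGER)"
    )
    with conn:  # data rows and the log entry commit together, or not at all
        conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
        conn.execute(
            "INSERT INTO load_log VALUES (?, ?, ?, ?)",
            (source, "customer", datetime.now(timezone.utc).isoformat(), len(rows)),
        )


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract())
load(conn, clean, source="crm_extract")
log_rows = conn.execute("SELECT source, target, row_count FROM load_log").fetchall()
```

Writing the log entry in the same transaction as the data rows means the audit trail cannot claim a load that was rolled back.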

Page 5: Agenda  02/21/2013

Data Cleansing

Source systems contain “dirty data” that must be cleansed.
ETL software contains rudimentary to very sophisticated data cleansing capabilities.
Industry-specific data cleansing software is often used. Important for performing name and address correction.
Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP, and Microsoft, and specialty vendors such as Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (Centric).

Page 6: Agenda  02/21/2013

Steps in data cleansing
· Parsing
· Correcting
· Standardizing
· Matching
· Consolidating

Page 7: Agenda  02/21/2013

Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.

Page 8: Agenda  02/21/2013

Parsing

Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
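A rough sketch of how the sample record could be parsed. It assumes the fixed five-line layout shown on this slide and a three-part name; the function name parse_contact is made up for illustration, and commercial parsers handle far more variation than this.

```python
def parse_contact(block):
    """Split a five-line contact block into named fields (fixed layout assumed)."""
    lines = [line.strip() for line in block.strip().splitlines()]
    name_line, firm, location, street_line, city_line = lines

    # "Beth Christine Parker, SLS MGR" -> name part and title part
    full_name, title = [part.strip() for part in name_line.split(",", 1)]
    first, middle, last = full_name.split()

    # "12800 Lake Calumet" -> street number and street name
    number, street = street_line.split(" ", 1)

    # "Hedgewisch, IL" -> city and state
    city, state = [part.strip() for part in city_line.split(",", 1)]

    return {
        "First Name": first,
        "Middle Name": middle,
        "Last Name": last,
        "Title": title,
        "Firm": firm,
        "Location": location,
        "Number": number,
        "Street": street,
        "City": city,
        "State": state,
    }


source_block = """
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL
"""
parsed = parse_contact(source_block)
```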

Page 9: Agenda  02/21/2013

Correcting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

Page 10: Agenda  02/21/2013

Correcting

Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
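One way to picture the "secondary data source" is a reference table keyed on the address as it was parsed. The sketch below assumes a tiny in-memory dictionary (ADDRESS_REFERENCE, a hypothetical stand-in for a postal reference file) holding only the one correction shown on this slide; real cleansing tools use much larger reference data and fuzzy lookup.

```python
# Hypothetical stand-in for a postal reference file: parsed address -> corrections.
ADDRESS_REFERENCE = {
    ("12800", "Lake Calumet", "Hedgewisch", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}


def correct_address(record, reference=ADDRESS_REFERENCE):
    """Return a copy of the record with address fields corrected when known."""
    key = (record["Number"], record["Street"], record["City"], record["State"])
    corrected = dict(record)
    corrected.update(reference.get(key, {}))  # unknown addresses pass through unchanged
    return corrected


parsed = {
    "First Name": "Beth",
    "Last Name": "Parker",
    "Number": "12800",
    "Street": "Lake Calumet",
    "City": "Hedgewisch",
    "State": "IL",
}
corrected = correct_address(parsed)
```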

Page 11: Agenda  02/21/2013

Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Page 12: Agenda  02/21/2013

Standardizing

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardized Data
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
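The conversion routines can be thought of as lookup rules applied field by field. The sketch below assumes three small hypothetical rule tables (TITLE_RULES, STREET_WORD_RULES, NAME_MATCH_STANDARDS) that contain only the conversions visible on this slide; it does not try to derive the pre-name.

```python
# Hypothetical rule tables holding only the conversions shown on the slide.
TITLE_RULES = {"SLS MGR": "Sales Mgr."}
STREET_WORD_RULES = {"South": "S.", "Drive": "Dr."}
NAME_MATCH_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record):
    """Return a copy of the record with title and street in the preferred format."""
    out = dict(record)
    out["Title"] = TITLE_RULES.get(record["Title"], record["Title"])
    # Abbreviate street words one at a time; unknown words are left as they are.
    out["Street"] = " ".join(
        STREET_WORD_RULES.get(word, word) for word in record["Street"].split()
    )
    # Attach the names this first name may be matched against later.
    out["1st Name Match Standards"] = NAME_MATCH_STANDARDS.get(record["First Name"], [])
    return out


corrected = {
    "First Name": "Beth",
    "Last Name": "Parker",
    "Title": "SLS MGR",
    "Street": "South Butler Drive",
}
standardized = standardize(corrected)
```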

Page 13: Agenda  02/21/2013

Matching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

Page 14: Agenda  02/21/2013

Matching

Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
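A possible matching rule for the two records above, written as a sketch. The rule itself is an assumption chosen for illustration: same firm, same street number and ZIP, first names that are equal or listed in each other's match standards, and last names that share at least one hyphen-separated part. Real matching software typically scores many more fields.

```python
def first_names_equivalent(a, b):
    """True when the first names are equal or listed as match standards."""
    return (
        a["First Name"] == b["First Name"]
        or a["First Name"] in b["1st Name Match Standards"]
        or b["First Name"] in a["1st Name Match Standards"]
    )


def last_names_compatible(a, b):
    """True when the last names share a part, e.g. Parker and Parker-Lewis."""
    return bool(set(a["Last Name"].split("-")) & set(b["Last Name"].split("-")))


def is_match(a, b):
    """Apply the illustrative business rule described above."""
    same_place = all(a[k] == b[k] for k in ("Firm", "Number", "Zip", "Zip+Four"))
    return same_place and first_names_equivalent(a, b) and last_names_compatible(a, b)


source_1 = {
    "First Name": "Beth",
    "1st Name Match Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker",
    "Firm": "Regional Port Authority",
    "Number": "12800",
    "Zip": "60633",
    "Zip+Four": "2398",
}
source_2 = {
    "First Name": "Elizabeth",
    "1st Name Match Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis",
    "Firm": "Regional Port Authority",
    "Number": "12800",
    "Zip": "60633",
    "Zip+Four": "2398",
}
matched = is_match(source_1, source_2)
```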

Page 15: Agenda  02/21/2013

Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

Page 16: Agenda  02/21/2013

Consolidating

Corrected Data (Data Source #1)

Corrected Data (Data Source #2)

Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
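The merge into one representation needs a survivorship rule that decides which value wins for each field. The sketch below assumes a deliberately simple rule, "keep the longer (more complete) value", plus a special case that keeps both first names; the function name consolidate and the rule are illustrative choices, not the method of any particular product.

```python
def consolidate(a, b):
    """Merge two matched records, keeping the more complete value per field."""
    merged = {}
    for key in list(a) + [k for k in b if k not in a]:
        value_a, value_b = a.get(key, ""), b.get(key, "")
        merged[key] = value_a if len(value_a) >= len(value_b) else value_b
    # Keep both first names when the sources disagree, as on the slide.
    if a["First Name"] != b["First Name"]:
        merged["First Name"] = f'{a["First Name"]} ({b["First Name"]})'
    return merged


source_1 = {
    "Pre-name": "Ms.",
    "First Name": "Beth",
    "Middle Name": "Christine",
    "Last Name": "Parker",
    "Title": "Sales Mgr.",
    "Street": "S. Butler Dr.",
}
source_2 = {
    "Pre-name": "Ms.",
    "First Name": "Elizabeth",
    "Middle Name": "Christine",
    "Last Name": "Parker-Lewis",
    "Title": "",
    "Street": "S. Butler Dr., Suite 2",
    "Phone": "708-555-1234",
    "Fax": "708-555-5678",
}
consolidated = consolidate(source_1, source_2)
```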

Page 17: Agenda  02/21/2013

Source system view – 3 clients

Policy No. ME309451-2

Account # 1238891

Transaction B498/97

Page 18: Agenda  02/21/2013

The reality – ONE client

Account # 1238891

Policy No. ME309451-2

Transaction B498/97

Page 19: Agenda  02/21/2013

Consolidating whole groups

William Parker

Beth Lewis

Karen Parker-Lewis

William Parker-Lewis, Jr.

Page 20: Agenda  02/21/2013

ETL Products
SQL Server 2012 Integration Services from Microsoft
PowerMart/PowerCenter from Informatica
Warehouse Builder from Oracle
Teradata Warehouse Builder from Teradata
DataMigrator from Information Builders
SAS System from SAS Institute
Connectivity Solutions from OpenText
Ab Initio

Page 21: Agenda  02/21/2013

What about unstructured data?
What is unstructured data?
What percentage of data in organizations is considered to be “unstructured”?
Examples
Why store it in a data warehouse?
Does it do any good in large text fields?
Special ETL for unstructured data

Page 22: Agenda  02/21/2013

Unstructured Data Example

Notes about post-service of a product:

The hub bent when the bicycle hit a large pothole.
The plane takes off sluggishly during high-altitude departures.
The product won’t allow entry of a 1098-T when the person is declared as a dependent.

“Text analytics” are used to transform the data.

Page 23: Agenda  02/21/2013

Text analytics

Parses text and extracts facts (complaints, problems, issues) about key entities (customers, products, locations).
Uses natural language processing (NLP).

NLP converts human language into more formal representations that are easier for a computer program to manipulate.
Combination of computational linguistics and artificial intelligence.
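To make "extracts facts about key entities" concrete, here is a deliberately crude sketch that uses keyword lookup rather than real NLP. The ENTITIES and PROBLEM_TERMS vocabularies are hypothetical and were chosen only to cover the service notes from the unstructured data example; an actual text analytics tool would rely on linguistic parsing instead of fixed word lists.

```python
import re

# Hypothetical vocabularies covering only the sample service notes.
ENTITIES = ["hub", "bicycle", "plane", "product"]
PROBLEM_TERMS = ["bent", "sluggishly", "won't allow"]


def extract_facts(note):
    """Turn a free-text note into a small structured record."""
    text = note.lower().replace("’", "'")  # normalize curly apostrophes
    words = set(re.findall(r"[a-z0-9'-]+", text))
    return {
        "entities": [e for e in ENTITIES if e in words],
        "problems": [p for p in PROBLEM_TERMS if p in text],
    }


notes = [
    "The hub bent when the bicycle hit a large pothole.",
    "The plane takes off sluggishly during high-altitude departures.",
    "The product won’t allow entry of a 1098-T when the person is declared as a dependent.",
]
facts = [extract_facts(note) for note in notes]
```

Each resulting record has fixed fields, so it can be stored in relational tables alongside the structured data.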

Page 24: Agenda  02/21/2013

Goal of ETL

Structured and unstructured data stored in a relational database.
Data is complete, accurate, consistent, and in conformance with the business rules of the organization.

Page 25: Agenda  02/21/2013

Controversy in ETL

Is it necessary?
Has the advent of big data changed our need for ETL?
ETL vs. ELT
Does the use of Hadoop eliminate the need for ETL software?