Innovations in Data and Information Mining

Innovation in Data and Information Mining

MBA Technology Conference, March 28, 2007

Linda C. Simmons, IBM Global Business Services


Unparalleled -- the largest private research institution in the world

Annual budget of almost $5B

Eight labs across the world on all continents

Over 3,000 researchers

5 Nobel Prize winners, 4 US National Medals of Technology, 3 National Medals of Science, 19 memberships in the National Academy of Sciences and more than 47 members of the National Academy of Engineering

Skills in mathematics, computer science, physics, operations research and many more

Over 30,000 US patents since 1993

Who is IBM Research?


• Developing effective tools and techniques for enabling a wide variety of Business Intelligence applications and solutions.

• Techniques for extracting actionable insights from structured (data) and unstructured (text) information.

• Enabling analytics for data and text within large-scale data and computing infrastructure environments.

• Work with clients to drive our research agenda for developing novel data mining solutions.

• To have data mining impact business and industry problem-solving in new and unique ways.

• Basic Research
  – Cost-Sensitive Learning, Active Learning, Reinforcement Learning, Regularization Methods.

• Systems Research: Developing highly scalable and fully automated predictive modeling capabilities
  – Data-parallel architectures for leveraging database systems
  – Compute-parallel architectures for leveraging grid computing

• Solutions and Services
  – Customer Insights
  – Business Forecasting
  – Risk Management
  – Etc.

Data Mining Research Goals

Current Activities


Customer Interaction

Software Development

Theory Advancement

Database Management; Knowledge Discovery and Data Mining; Knowledge Management; Natural Language Processing; Information Retrieval

Retail; Manufacturing; Banks and Insurance; Travel and Transport; Government; IBM

Service and maintenance; manufacturing, procurement, distribution; product design, forecasting, pricing, and fulfillment

Parallel Databases; OLAP Analytics; Parallel Data Mining; Unstructured Information Management; Text Analytics and Mining

Academic Community; Professional Societies

Industries; GBS - ODIS

IBM Businesses

Software Group; Open Source

Multi-faceted Approach to our Data Mining Research Agenda


Supply chain solutions – Optimize, plan, model and analyze supply chain and transportation processes.

Advanced call center automation – Design and help deploy natural language voice-recognition and voice-mining solutions

Advanced networking services – Apply cutting-edge models, algorithms, software and expertise to help design, monitor and optimize enterprise networks and networked applications, e.g. storage area networks and IP telephony.

Business optimization and analytics – Optimize, plan, model, analyze and transform businesses to on demand models.

Collaboration – Realize the value of collaboration through a skilled assessment of the current environment for collaboration, methodologies that document end-user requirements for collaboration, strategic design for visualizing future collaborative states, and tools that support human communication.

Security and privacy – Assess, design and implement enhanced security processes and tools.

Emerging Innovation from Research


Emerging Innovation from Research (2)

e-business systems and architecture – Design and help deploy applications, middleware and Web content.

Grid and autonomic solutions – Apply cutting-edge models, software, designs and expertise to help quickly and accurately evaluate, design, pilot and optimize grid and autonomic capability in client distributed-computing systems.

Information mining and management – Gain business insight from structured and unstructured data, text, voice, video and more.

Mobile enablement – Apply new wireless and pervasive technology to improve security, reliability and integration.

Product lifecycle management – Improve product development processes through better tools, methodologies and collaboration.

Technology-based learning – Deploy prototype learning technology that can help improve learning effectiveness, increase accountability and boost productivity.


Client Big City Coach, a high-end car service company, has a few hundred cars and drivers (more drivers than cars) and may service 1,000 rides/day in several big cities nationwide.

Challenge This ground transportation leader wanted to increase vehicle and driver utilization, push customer service to new levels, and lower operating costs. The mathematical optimization concept came from discussions with our Research Math team.

Solution Developed a Fleet Optimization System (FOS) which gathers off-line and real-time information from a variety of internal and external sources and produces a near-optimal staffing plan. FOS made possible real-time adjustment of schedules and resource allocation.

Benefits - Increased vehicle utilization through better visibility of scheduling info

- Increased efficiency – less downtime for drivers, more effective use of partner resources

- Improved customer service and satisfaction due to real-time reallocation of cars/drivers

- Better resource management, especially during peak traffic times, bad weather, and delays

Continual Optimization


Text Analytics for a Financial Communicator

Client A leading financial communications powerhouse which prides itself on providing an unequaled mix of electronic trading, data, analytics, calculation engines, and straight-through processing.

Challenge The company was interested in validating its hypotheses around text analytics, which enable computers to read documents and derive value from the output. The intent was to use text analytics to automate the data collection and analysis process.

Solution Strategy and Change Consulting, powered by computational linguists, performed in-depth analyses around the new technologies and solutions that the firm had been evaluating.

Benefits The firm now has validated and enhanced new product plays that it can leverage; in addition, it is realizing staff efficiencies that enable it to do more with the same number of people. Other benefits include data quality and time-to-market improvements. Overall, it can now better compete in the marketplace.


Client Famous Group – A subsidiary of A Big Finance Company

Challenge Automatic discovery of all credible and actionable risk groups in auto insurance policyholders to improve premium pricing, underwriting rules, and new business development.

Solution A data warehouse was put together that stored four years of 300 historical factors on 2 million policyholders, claims, and insured assets (autos). A new predictive modeling technology was developed that was optimized for discovering homogeneous risk groups from this data. The generated models were represented as if-then rules.

Benefits Of all the rules that were generated, 43 were statistically significant and not previously known. A marketing benefits analysis of 6 of these 43 discoveries suggested a $2 million profit enhancement over a base of 2 million policyholders.

Underwriting Profitability Analysis


Client A Big UK Grocery

Challenge Cross-Sell / Up-Sell services to consumers with handheld PDAs for anytime / anywhere shopping

Solution A solution was developed in which recommendations are generated by matching products to customers based on the expected appeal of the product and the previous spending of the customer. A combination of associations mining in the product domain and clustering in the customer domain is used for developing customer-specific recommendations.

Benefits In a pilot program with several hundred customers, a 1.8% boost in revenue was observed as a result of purchases made directly from the list of recommended products.

Customer Insight : Personalization of Product Recommendations
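
The combination described in the Solution above (associations mining over products, plus a customer grouping) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the baskets, the product names, the `segment_appeal` weights (a stand-in for the output of the clustering step) and the scoring rule are hypothetical and are not taken from the deployed system.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; product names are illustrative only.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"wine", "cheese"},
    {"wine", "cheese", "bread"},
]

def pair_confidence(baskets):
    """Confidence of the rule a -> b: share of baskets with a that also hold b."""
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        item_count.update(basket)
        for a, b in combinations(sorted(basket), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1
    return {(a, b): n / item_count[a] for (a, b), n in pair_count.items()}

def recommend(history, confidence, segment_appeal, top_n=2):
    """Rank products not yet bought by rule confidence times segment appeal."""
    scores = {}
    for (a, b), conf in confidence.items():
        if a in history and b not in history:
            weighted = conf * segment_appeal.get(b, 1.0)
            scores[b] = max(scores.get(b, 0.0), weighted)
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]

confidence = pair_confidence(baskets)
# Assumed appeal weights for one customer segment (stand-in for clustering output).
premium_segment = {"wine": 2.5, "cheese": 1.5, "jam": 0.8}
print(recommend({"bread"}, confidence, premium_segment))
```

With neutral weights the two strongest rules from "bread" would win; the segment weights move "wine" to the top, which is the kind of customer-specific adjustment the slide describes.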


Client A Fifth Avenue Retailer

Challenge Optimize cross-channel customer messaging to maximize customer lifetime value

Solution A reinforcement learning based methodology was developed to model enterprise-customer interactions. The methodology discovers customer responses on one channel that result from a contact on another channel. The technology is highly scalable, so it can handle the large volumes of data that are typically available in a cross-channel scenario.

Benefits The system was benchmarked against the retailer’s current methodology for customer relationship management in the direct mail and store channels. Initial results suggest a 7-8% increase in store revenues.

Customer Insight: Lifetime Value Management


Passenger-Based Airline No-show Prediction

Client Air Elsewhere

Challenge Using detailed information on each passenger, predict the number of passengers who will not show for a flight. Accurate no-show forecasts are an essential input to airline revenue-management systems.

Solution Two different predictive models were built using passenger-based features extracted from over 1M passenger records. The first model used a segmented Naïve Bayes approach (ProbE) to estimate each passenger’s probability of not showing. The second model predicted the no-show fraction directly using a novel aggregation method for an ensemble of probabilistic models.

Benefits Various evaluation metrics demonstrated that the passenger-based models are more accurate than conventional history-based statistical models. A simple revenue model suggested that use of these models could produce between 0.4% and 3.2% revenue gain over the conventional model.
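
The idea behind the first model, estimating a per-passenger no-show probability and summing those probabilities into a flight-level forecast, can be sketched as follows. This is a plain categorical Naive Bayes with Laplace smoothing, not the segmented ProbE variant named above; the feature names (`ticket`, `trip`) and the six training records are hypothetical.

```python
import math
from collections import defaultdict

def train_naive_bayes(records, labels, alpha=1.0):
    """Count class and feature frequencies for a categorical Naive Bayes model."""
    class_count = defaultdict(int)
    feature_count = defaultdict(int)   # (label, feature name, value) -> count
    feature_values = defaultdict(set)  # feature name -> values seen in training
    for record, label in zip(records, labels):
        class_count[label] += 1
        for name, value in record.items():
            feature_count[(label, name, value)] += 1
            feature_values[name].add(value)
    return class_count, feature_count, feature_values, alpha

def no_show_probability(model, record):
    """Posterior probability of the no-show class (label True) for one passenger."""
    class_count, feature_count, feature_values, alpha = model
    total = sum(class_count.values())
    log_score = {}
    for label, n in class_count.items():
        score = math.log(n / total)
        for name, value in record.items():
            num = feature_count[(label, name, value)] + alpha
            den = n + alpha * len(feature_values[name])
            score += math.log(num / den)
        log_score[label] = score
    peak = max(log_score.values())  # subtract the peak for numerical stability
    weights = {k: math.exp(v - peak) for k, v in log_score.items()}
    return weights[True] / sum(weights.values())

# Hypothetical training data: True means the passenger did not show.
records = [
    {"ticket": "refundable", "trip": "one_way"},
    {"ticket": "refundable", "trip": "round"},
    {"ticket": "nonrefundable", "trip": "round"},
    {"ticket": "nonrefundable", "trip": "round"},
    {"ticket": "refundable", "trip": "round"},
    {"ticket": "nonrefundable", "trip": "one_way"},
]
labels = [True, True, False, False, False, False]
model = train_naive_bayes(records, labels)

flight = [records[0], records[2], records[3]]
probabilities = [no_show_probability(model, p) for p in flight]
expected_no_shows = sum(probabilities)  # flight-level forecast
print(probabilities, expected_no_shows)
```

Summing per-passenger probabilities gives the expected number of no-shows, which is the quantity a revenue-management system would consume.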

[Figure: cumulative gains chart. X-axis: fraction of booked PNRs (sorted by no-show probability), 0 to 1. Y-axis: fraction of PNR no-shows, 0 to 1. Curves: Passenger-Level [ProbE], Passenger-Level [APMR], Passenger-Level [C4.5], Historical Model [Statistical], Random.]
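
A chart with these axes is commonly built by sorting bookings from highest to lowest predicted no-show probability and tracking the share of actual no-shows captured so far. The sketch below shows that computation on made-up scores and outcomes; the function name and the data are assumptions, not the evaluation code behind the chart.

```python
def cumulative_gains(probabilities, no_shows):
    """Return (fraction of bookings, fraction of no-shows captured) points."""
    order = sorted(range(len(probabilities)), key=lambda i: -probabilities[i])
    total = sum(no_shows)  # assumes at least one actual no-show
    points = [(0.0, 0.0)]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += no_shows[i]
        points.append((rank / len(order), captured / total))
    return points

# Hypothetical model scores and outcomes (1 = passenger did not show).
scores = [0.9, 0.1, 0.8, 0.3, 0.2]
outcomes = [1, 0, 1, 0, 0]
points = cumulative_gains(scores, outcomes)
print(points)
```

A model that ranks well rises quickly toward 1.0, while random ordering follows the diagonal.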


Call Center Text TAKMI Analysis

Analysts

Customers

The Business: information about customers’ experiences with products or services

Better products and services; increased customer satisfaction

The Business: what customers are asking about; what they need to know

Better self-service; lower costs

Actionable Information

Notes taken by CSRs

Call Center Text Mining

Vast amounts of textual data: internal reports, patents, customers’ messages, etc.

Knowledge acquisition: hidden regularities/facts, trends in contents, features of specific topics, relationships with other knowledge

IBM Research leads in Speech Recognition, Natural Language Understanding, Dialog Management, Language Generation and Speech Synthesis. Our approach combines the use of advanced statistical and machine learning techniques with sophisticated grammars, digital dictionaries and encyclopedias.


[Figure: example segmentation tree. Legend: interior node, leaf node. Split conditions shown: Rec < 6m, Spend < $150, Rec < 3m, #delinq < 2, #kids < 2, Ret < $20.]

Tree structure is obtained by a recursive procedure using the best univariate splits at each stage

The leaf nodes define a non-overlapping, exhaustive partition of the input space

Final model is a collection of segments with their associated segment model in each leaf node

Splitting condition is based on minimizing the negative log-likelihood using search algorithms

Final tree is determined by a stopping condition based on test set or cross-validation error

Out-of-memory row-scan based procedure

Data-partitioned parallelism

We have new approaches to Segmentation-based predictive modeling
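
The split search described above (best univariate split, chosen by minimizing negative log-likelihood) can be sketched for a single tree node. As a simplifying assumption, each segment model here is just a constant response rate; the slide's segments may carry richer models. The feature names and the six rows are hypothetical.

```python
import math

def bernoulli_nll(labels):
    """Negative log-likelihood of 0/1 labels under their own mean rate."""
    n, k = len(labels), sum(labels)
    if n == 0 or k == 0 or k == n:
        return 0.0  # a pure (or empty) segment is fit perfectly
    p = k / n
    return -(k * math.log(p) + (n - k) * math.log(1 - p))

def best_split(rows, labels, features):
    """Try every feature and threshold; keep the split with the lowest total NLL."""
    best = (bernoulli_nll(labels), None, None)  # (nll, feature, threshold)
    for feature in features:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] < threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
            if not left or not right:
                continue
            nll = bernoulli_nll(left) + bernoulli_nll(right)
            if nll < best[0]:
                best = (nll, feature, threshold)
    return best

# Hypothetical customers: recency in months, spend in dollars, 1 = responded.
rows = [
    {"recency_months": 1, "spend": 200},
    {"recency_months": 2, "spend": 90},
    {"recency_months": 4, "spend": 160},
    {"recency_months": 7, "spend": 120},
    {"recency_months": 9, "spend": 210},
    {"recency_months": 12, "spend": 80},
]
labels = [1, 1, 1, 0, 0, 0]
print(best_split(rows, labels, ["recency_months", "spend"]))
```

Applying `best_split` recursively to each resulting segment, and stopping when held-out error no longer improves, would give the tree-growing procedure the bullets outline.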


• Automated methods for embedding in solutions
• Integrating structured and unstructured data
• Absorbing new ideas from learning theory and computational statistics for addressing typical issues with business data
  – Missing values, Data sparseness, High Dimensionality
  – Support Vector Machines, Predictive Rule Induction, Regularization Techniques
• Streaming data mining: online and incremental mining of streaming data
• Outlier detection: detecting anomalies and abnormalities in data

• Mine historical data to train patterns/models that can predict future behavior
  – Behaviors: response to direct mail, product quality (defects), declining activity, credit risk, delinquency, likelihood to buy specific products, profitability, etc.
• Score with models to reflect likelihood to exhibit the modeled behavior
• Act to optimize business objectives based on these scores.

Traditional Predictive Mining Process

Current Predictive Mining Research
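
Two of the research themes above, incremental mining of streaming data and outlier detection, can be combined in one small sketch: a detector that keeps running mean and variance (Welford's update) and flags values far from the running mean. The class name, the z-score threshold and the warm-up length are assumptions chosen for illustration, not a description of IBM's methods.

```python
import math

class StreamingOutlierDetector:
    """Flags values far from the running mean, using one pass and O(1) memory."""

    def __init__(self, z_threshold=3.0, warmup=5):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.z_threshold = z_threshold
        self.warmup = warmup

    def update(self, x):
        """Score x against the statistics so far, then fold it into them."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.z_threshold:
                is_outlier = True
        # Welford's incremental update of mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

detector = StreamingOutlierDetector()
stream = [10, 11, 9, 10, 12, 10, 50, 11]  # hypothetical readings
flags = [detector.update(x) for x in stream]
print(flags)
```

Note that this sketch folds flagged values into the running statistics; a production detector might exclude or down-weight them.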


Security and Privacy Initiatives

• Secure Hardware Embedded Analytics
  – Leveraging cryptographic secure processing technology

• Sovereign Information Integration
  – Need-to-know information sharing

• Privacy Preserving Data Mining
  – Assumes no trusted third party.

Security and Privacy Initiatives: Financial Services


• Secure processor → Ultimate data security.

• Memory-light data mining → Sophisticated analytics can run inside processor.

• Memory-light DB2 → Secure data federation and query processing capabilities across multiple data sources.

[Figure: databases at Enterprise 1 … Enterprise N connect by encrypted data transfer to a secure processor running memory-light data mining and memory-light DB2. Data are only decrypted inside the processor, which enables data analysis inside the secure processor.]

Secure Federated Mining Architecture


Intra-bank Service Center Scenarios: Anti-Money Laundering, Credit Risk Rating, CRM, …

[Figure labels: Intra-Bank Data Centralizer; Secure Federated Mining; encrypted data from LOB 1 … LOB N.]

• Guarantees confidentiality
  1. Analyzing data from different LOBs together to know customers.
  2. Legislation limiting data sharing among LOBs.

• Guarantees that data will only be used for specialized purposes.
  – Customers are more likely to allow banks to share their data among LOBs with this condition.

• Data federation allows multiple LOBs to share data without having a central data warehouse.


EPAL – Enterprise Privacy Authorization Language

Implementing Privacy Management Using EPA

The Enterprise Privacy Authorization Language (EPAL) is a formal language to specify fine-grained enterprise privacy policies. It concentrates on the core privacy authorization while abstracting from all deployment details such as data model or user-authentication.
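
To make "fine-grained privacy authorization" concrete, the sketch below evaluates a request against a small list of rules keyed on user category, data category, purpose and action, with obligations attached to a ruling. It is a simplified illustration only: it does not use EPAL's actual XML syntax or the reference implementation, and the rule structure, category names and first-match semantics are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    ruling: str             # "allow" or "deny"
    user_category: str
    data_category: str
    purpose: str
    action: str
    obligations: tuple = ()  # duties triggered when the rule applies

def authorize(rules, user_category, data_category, purpose, action, default="deny"):
    """Return (ruling, obligations) from the first rule matching the request."""
    request = (user_category, data_category, purpose, action)
    for rule in rules:
        if (rule.user_category, rule.data_category, rule.purpose, rule.action) == request:
            return rule.ruling, rule.obligations
    return default, ()

# Hypothetical policy; none of these names come from the EPAL specification.
policy = [
    Rule("allow", "nurse", "medical_record", "treatment", "read", ("log_access",)),
    Rule("deny", "marketing", "medical_record", "promotion", "read"),
]

print(authorize(policy, "nurse", "medical_record", "treatment", "read"))
print(authorize(policy, "nurse", "medical_record", "research", "read"))
```

The request carries no table names or login details, which mirrors the slide's point that the policy abstracts from the data model and user authentication.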

[Figure: privacy management architecture. Components: Privacy Management Server; CPO Privacy Management Console; Audit Manager; Log Data; E-P3P Policy; Consent; Obligations Queue; Privacy Management Submission Monitor; Privacy Management Enforcement Monitors; Legacy Applications; Web Data; Legacy Data. Actors: Customer, Enterprise Employee.]

http://www.zurich.ibm.com/security/enterprise-privacy/epal

► EPAL specs published (07/2003)
► Java ref implementation of EPAL & XACML
  ■ On alphaWorks: http://www.alphaworks.ibm.com/tech/dpm
► P3P ↔ EPAL mapping
► WS Privacy specs and bindings: ongoing


[Figure: Hippocratic database architecture. Components: Privacy Policy; Privacy Metadata Creator; Privacy Metadata; Privacy Constraint Validator; Data Accuracy Analyzer; Data Collection; Data Collection Analyzer; Queries; Query Intrusion Detector; Attribute Access Control; Record Access Control; Store; Data Retention Manager; Encryption Support; Audit Info; Audit Trail; Other.]

#  Name     Age  Phone
1  Adams    10   111-1111
3  -        -    333-3333
4  Daniels  40   -

[Figure: query execution time (seconds, 0 to 300) versus application selectivity (0.01, 0.1, 0.2, 0.5, 1) for original queries and rewritten queries. Table size: 10 million, no index.]

• Vision: Database systems that take responsibility for the privacy and ownership of data they manage, while not impeding the flow of information.

• Architectural principles derived from the principles behind current legislation.

Hippocratic Database
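
The small result table on this slide (row 2 absent, some cells shown as "-") is the kind of output that cell-level privacy enforcement produces. The slide's chart indicates that the real system rewrites queries; the sketch below applies an equivalent filter in plain Python. The underlying values for rows 2 and 3, the `choices` structure and the function name are hypothetical.

```python
# Hypothetical base table; only the disclosed cells match the slide's example.
customers = [
    {"id": 1, "name": "Adams", "age": 10, "phone": "111-1111"},
    {"id": 2, "name": "Baker", "age": 20, "phone": "222-2222"},
    {"id": 3, "name": "Clark", "age": 30, "phone": "333-3333"},
    {"id": 4, "name": "Daniels", "age": 40, "phone": "444-4444"},
]

# Assumed per-customer opt-in choices: columns each person agreed to disclose.
choices = {
    1: {"name", "age", "phone"},
    2: set(),
    3: {"phone"},
    4: {"name", "age"},
}

def privacy_preserving_select(rows, choices, columns):
    """Mask cells the owner did not opt in to; drop rows with nothing to disclose."""
    result = []
    for row in rows:
        allowed = choices.get(row["id"], set())
        if not allowed.intersection(columns):
            continue  # nothing in this row may be shown
        disclosed = {"id": row["id"]}
        for column in columns:
            disclosed[column] = row[column] if column in allowed else None
        result.append(disclosed)
    return result

for row in privacy_preserving_select(customers, choices, ["name", "age", "phone"]):
    print(row)
```

In a database setting the same effect would come from rewriting the SQL so that disallowed cells return NULL, which keeps enforcement inside the system that manages the data.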


Thank you!

Contact Information: Linda C. Simmons, IBM Global Business Services; [email protected]; 904.491.0410; Mobile 904.610.3723
