Information Classification: The Cornerstone to Information Management

Sheila Childs, EMC

© 2007 Storage Networking Industry Association. All Rights Reserved.

SNIA Legal Notice

• The material contained in this tutorial is copyrighted by the SNIA.

• Member companies and individuals may use this material in presentations and literature under the following conditions:
  – Any slide or slides used must be reproduced without modification
  – The SNIA must be acknowledged as source of any material used in the body of any document containing material from these presentations.

• This presentation is a project of the SNIA Education Committee.

Abstract

Anyone who is trying to get a handle on information growth, compliance-related risk mitigation and information management costs realizes that these objectives are difficult to achieve without an understanding of the information under management. Fundamental to any ILM strategy is the practice of information classification.

Information classification requires that I.T. administrators work with Line-of-Business and knowledge workers to gain an understanding of the data to be managed. This session will explore the different types of classification methodologies, including file system metadata-based classification and content/context-based classification. It will compare manual and automated classification procedures, along with the pros and cons of implementing each approach. We will also discuss the difference between indexing and classification, and where each approach makes sense.

The session will include information on various standards under development and focus on the work of the SNIA DMF ILM initiative. It will culminate with a view of the benefits to be derived through classification of information – better risk mitigation, lowered management costs and a better understanding of corporate information.

Why Classification? Why Now?

• Growth of information hasn’t slowed down…
• Sarbanes-Oxley initiatives are evolving into broader enterprise risk management activities
  – Numerous regulations and corporate requirements switch focus from capital spend to operating expense (management)
• Storage landscape is changing
  – Is the DVD player more important or the movies you watch?
• Storage people are reacting to an unfamiliar and new set of guidelines

Classification Drivers

Storage TCO
– External disk storage purchase projected to grow at 52% annually
– Capacity is the #1 storage issue, driven by email and unstructured data
– Significant transition to disk-based archival storage
– Digital archive capacity will increase nearly tenfold between 2005 and 2010

* Sources: Merrill Lynch 2007-08 storage forecast & views from CIOs; Enterprise Strategy Group’s 2006 Digital Archive study

[Chart: Total Digital Archived Capacity, WW, 2005 to 2010, broken out by Email, Database and Unstructured; growth labels shown: 54% CAGR, 79% CAGR, 68% CAGR]
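
As a rough cross-check on the growth figures on this slide, the Python sketch below computes the compound annual growth rate implied by a tenfold increase over the five years from 2005 to 2010. Treating "nearly tenfold" as exactly 10x is an assumption made only for this illustration. The result is about 58% per year, which is in the same range as the CAGR labels on the chart.

```python
# Implied compound annual growth rate (CAGR) for a given growth multiple.
# Assumption: "nearly tenfold" is approximated as exactly 10x over 5 years.

def implied_cagr(multiple: float, years: int) -> float:
    """Return the annual growth rate that compounds to `multiple` over `years`."""
    return multiple ** (1.0 / years) - 1.0

cagr = implied_cagr(10.0, 2010 - 2005)
print(f"Implied CAGR: {cagr:.1%}")
```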

Classification Drivers

Risk management
– Compliance:
  • Payment Card Industry Data Security Standard (PCI), Health Insurance Portability and Accountability Act (HIPAA), new Federal Rules of Civil Procedure (FRCP)
  • EU Directive on Privacy and Electronic Communications (2002/58/EC)
– Information Security
  • Protecting Personally Identifiable Information (PII)

• Companies will spend as much as $80B on compliance by 2009

• Compliant records data growing 60% per year generating more than 2PB of new storage capacity requirements in 2007

• Single fastest growing application segment of the storage industry

* Source: Fred Moore, Horison, Storage Spectrum 2006

Classification Drivers

Improved productivity
• The average knowledge worker spends six hours per week searching for information
  – 50% of all searches fail to locate desired information
  – 15% of the average knowledge worker’s time is spent recreating existing information
• Need
  – Better organization of information
  – Accurate search
  – Consistent management of information
  – Shortened “time-to-information”

Source: IDC

Classification for Security and Risk Mitigation

• To date, over 54 million identities have been stolen
• An estimated 19,000 more identities are stolen each day
• Companies on average are spending over 1,500 hours per incident at a cost of $40,000 to $90,000 per victim

Top 10 Customer Data-Loss Incidents Prior to Jan 2006*

Company / Organization | Number of affected customers | Date of initial disclosure
CardSystems | 40 million | June 17, 2005
Citigroup | 3.9 million | June 6, 2005
DSW Shoe Warehouse | 1.4 million | March 8, 2005
Bank of America | 1.2 million | Feb 25, 2005
Wachovia, Bank of America, PNC Financial Services Group, Commerce Bancorp | 676,000 | April 28, 2005
Time Warner | 600,000 | May 2, 2005
Georgia Department of Motor Vehicles | 465,000 | April 2005
LexisNexis | 310,000 | July 19, 2005
University of Southern California | 270,000 | July 19, 2005
Marriott International | 206,000 | Dec 28, 2005

* Sources: Privacy Rights Clearinghouse, InformationWeek, Fred Moore, Storage Spectrum

Classification for Litigation Support

• eDiscovery and records management coming together
  – Driven by huge costs and risks
  – Changes to the Federal Rules of Civil Procedure
    • Electronically Stored Information (ESI) is subject to production (the way it is managed from cradle to grave will impact costs and risks of eDiscovery)
    • There will be an early “meet and confer”
    • Word “preserving” appears in the rules for the first time
    • There is a need to understand the “sources” of ESI
  – Average eDiscovery costs can run into the millions of dollars per event

Some Relevant Terms

• Data
  – Data is what I.T. manages: files, volumes, bits and bytes
• Information
  – Information is data with context
  – Data Lifecycle supports the Information Lifecycle
• Record
  – Recorded information, regardless of medium or characteristics, made or received by an organization that is evidence of its operations, and has value requiring its retention for a specific period of time (ARMA)
• See the SNIA Technical dictionary for additional definitions: http://www.snia.org/education/dictionary

Information Lifecycle Management and Classification

• Information Lifecycle Management
  – The policies, processes, practices and tools used to align the business value of information with the most appropriate and cost effective IT infrastructure from the time information is created through its final disposition
  – Information is aligned with business processes through management policies and service levels associated with applications, metadata, information and data
• Information Classification
  – The process of identifying and categorizing information associated with a business process, in order to produce requirements for the management of this information, within a defined scope

Who Needs to Classify

• Traditionally – Records Information Managers
• Now – IT and numerous other groups are involved
  – Line of Business (LOB information stakeholders)
    • Application performance, availability, recoverability…
    • Staff response time, asset reporting…
    • Cost
  – Corporate information stakeholders:
    • Security officer: Secret, confidential, proprietary…
    • Records Managers: corporate system of record
    • Compliance officers: authorization, retention
    • Litigation support: eDiscovery

Check out SNIA Tutorial: ILM and Tiers of Storage

Classification is Challenging

• Need is to classify “ESI” – Electronically Stored Information
  – Stakeholders are numerous
    • RM is a component of a bigger information management strategy
    • Legal provides litigation support requirements
    • IT provides ILM TCO requirements
  – Huge amounts of data – do we need to classify it all?
  – Current gaps between records managers, IT, application owners
  – How much risk are you willing to bear?
• Example: Major bank implementing a top-down information management strategy run by an ILM senior architect
  – “board room stuff”

[Diagram: Cross-disciplinary Information Management Executive Committee]

The Classification Process: Application Classification

Focused on Business Applications

Drivers for Application Classification
– Disaster recovery and business continuity
– Server consolidation
– Application performance

Application Classification is fairly “simple”
– Establishes a ranking of applications
– All information associated with the application is treated the same
– Works best when applications are segmented by server

Application Classification is often “good enough”

The Classification Process: File Attribute Classification

• FILE ATTRIBUTE classification is largely based on file attributes and access patterns
  • What is the file named?
  • What is the file type?
  • Who owns the data?
  • Where is it located?
  • When was it created?

File Attribute Classification

• File attributes offer limited input, therefore limited recognition
  – Still useful, but the class of solutions is limited
  – Generally useful in optimizing HSM or archiving strategies
  – Tends not to meet complex ILM needs (security, retention, etc.)
• Pros & Cons
  – Fast, lightweight, not invasive
  – May not address changing business value over time
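
The attribute questions on the File Attribute Classification slides (name, type, owner, location, age) map directly onto file-system metadata. The following Python sketch shows one way such attributes might be gathered and turned into a coarse class. The `TYPE_CLASSES` mapping and the class names are hypothetical, and the sketch reports the last-modified time because a true creation time is not available on every platform.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from file type to a coarse class; a real policy
# would come from the business, not from this sketch.
TYPE_CLASSES = {
    ".doc": "office-document",
    ".xls": "office-document",
    ".pst": "email-archive",
    ".log": "system-log",
}

def file_attributes(path):
    """Collect the file-system metadata that attribute-based classification relies on."""
    p = Path(path)
    st = p.stat()
    return {
        "name": p.name,                       # What is the file named?
        "type": p.suffix.lower(),             # What is the file type (by extension)?
        "owner_id": st.st_uid,                # Who owns the data? (numeric id)
        "location": str(p.resolve().parent),  # Where is it located?
        "modified": st.st_mtime,              # Last change; creation time is platform-dependent
        "size_bytes": st.st_size,
    }

def classify_by_type(attrs):
    """Assign a class from the extension alone; unknown types fall through."""
    return TYPE_CLASSES.get(attrs["type"], "unclassified")

# Demonstration on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".log") as tmp:
    attrs = file_attributes(tmp.name)
    print(attrs["name"], "->", classify_by_type(attrs))
```

Because only metadata is read, the scan is fast and non-invasive, but nothing in these attributes says what the file is actually about, which is the limitation the slide points out.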

Data Movement Across Tiers

Example: File Attribute Classification

[Diagram: tiers shown are Primary Storage, Secondary Disk-based Storage, and Secondary or Tertiary Non-disk Storage]
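
A tiering policy driven by file attributes can be as simple as a pair of age thresholds. The sketch below assumes that files move to lower tiers as the time since last access grows. The thresholds (90 and 365 days) and the sample file names are hypothetical; only the three tier names come from the slide.

```python
# Hypothetical age thresholds (days since last access); real values are a policy decision.
SECONDARY_AFTER_DAYS = 90
TERTIARY_AFTER_DAYS = 365

def assign_tier(days_since_access: float) -> str:
    """Pick a storage tier from the file's age alone."""
    if days_since_access >= TERTIARY_AFTER_DAYS:
        return "secondary or tertiary non-disk storage"
    if days_since_access >= SECONDARY_AFTER_DAYS:
        return "secondary disk-based storage"
    return "primary storage"

# Sample inventory: file name -> days since last access (made-up values).
files = {"q1_report.xls": 12, "2005_invoices.pdf": 200, "old_backup.tar": 900}
plan = {name: assign_tier(age) for name, age in files.items()}
for name, tier in plan.items():
    print(f"{name}: {tier}")
```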

The Classification Process: Content Classification

• Categories and taxonomies have been in use since ancient Greece
• Classification based on CONTENT makes use of indexes, lexicons and taxonomies
  • What keywords?
  • How is this data related to other data?
  • How should data be retained/disposed of for compliance or otherwise used by the business?

Content Classification: Some Additional Definitions

• Taxonomy
  – A hierarchical structure used for categorizing a body of information or knowledge, allowing an understanding of how that body of knowledge can be broken down into parts, and how its various parts relate to each other. Taxonomies are used to organize information in systems, thereby helping users to find it
    • Related terms: ontology, categories, evidence structures
• Lexicon
  – The vocabulary of a language, an individual speaker or group of speakers, or a subject
    • Example: A dictionary of over 200,000 medical, pharmaceutical, biomedical and healthcare acronyms and abbreviations is a medical lexicon
    • Related terms: thesaurus, vocabulary
• Indexing
  – The act of preparing data for search and/or classification, based on content and/or metadata
  – Index when keyword searches alone are sufficient
  – Index when looking to find information quickly within a particular document or file
• Search
  – The act of looking for something in a set of data
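
To make the Indexing and Search definitions concrete, here is a minimal Python sketch of an inverted index: indexing prepares the data by mapping each term to the documents that contain it, and search then looks terms up in that map. The sample documents and the simple tokenizer are assumptions for illustration, and real indexing engines do considerably more.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search: return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up sample documents.
docs = {
    "d1": "Washington state sightseeing and lodging",
    "d2": "George Washington biography",
    "d3": "Sightseeing tours in Oregon",
}
index = build_index(docs)
print(sorted(search(index, "washington")))              # ['d1', 'd2']
print(sorted(search(index, "washington sightseeing")))  # ['d1']
```

A single keyword returns everything that mentions it; adding terms narrows the result, which is the step from plain keyword search toward a classification rule.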

Why Content Classification?

• How does I.T. determine what type of services these files require?

• Are they important?

Manual Analysis of Files

• File names are not intuitive

• Content is difficult to decipher manually

Manual Analysis of Files

• Content analysis requires expertise

• Cost becomes a burden, leading to increased risk

Automated Content Classification

Automated Classification speeds time-to-information effectiveness

Automated Content Classification makes sense
– When multiple classification options result in confusion
– When there is an overwhelming volume of items to classify
– When some documents require time-consuming review by subject matter experts
– When there are a large number of non-business documents
– When you don’t want to have idiosyncratic results

“The highest quality and accuracy occurs when records management is as non-intrusive as possible to the desktop end users and does not interfere with the normal work routines of professional staff in the enterprise”**

** Timothy J. Sprehe and Charles R. McClure, “Lifting the Burden.” Information Management Journal, Vol. 39 Issue 4 (Jul/Aug 2005), 475

Content Classification Algorithms

• All content-based classification is based on “natural language”
• Two general types:

Rules-based content classification algorithms
– Conceptual and Semantic Analysis
– Keywords
– Term frequency
– Pattern matching
– Stemming
– Compound terms
– Latent semantic analysis (synonymy and polysemy)

“Learning”-based content classification algorithms
– Neural Networks
– Probabilistic modeling
– Bayesian Inference
– Shannon’s Information Theory
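
As one illustration of the "learning"-based family, the sketch below implements a small multinomial naive Bayes classifier, a common form of Bayesian inference for text. It is not taken from the tutorial: the training sentences and the two labels ("financial", "medical") are made up, and a production classifier would need far more training data and better tokenization.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per label
        self.term_counts = defaultdict(Counter)  # term frequencies per label
        self.vocabulary = set()

    def train(self, text, label):
        self.class_doc_counts[label] += 1
        for term in tokenize(text):
            self.term_counts[label][term] += 1
            self.vocabulary.add(term)

    def classify(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # log P(label) + sum over terms of log P(term | label)
            score = math.log(doc_count / total_docs)
            class_total = sum(self.term_counts[label].values())
            for term in tokenize(text):
                count = self.term_counts[label][term]
                score += math.log((count + 1) / (class_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training examples; the classifier learns term statistics from them.
clf = NaiveBayesClassifier()
clf.train("invoice payment due account balance", "financial")
clf.train("quarterly revenue invoice statement", "financial")
clf.train("patient diagnosis treatment record", "medical")
clf.train("clinical trial patient dosage", "medical")

print(clf.classify("overdue invoice for account"))  # financial
print(clf.classify("patient treatment plan"))       # medical
```

Unlike a rules-based approach, nobody writes the keyword list here. The categories emerge from labeled examples, so accuracy depends on how representative those examples are.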

Example of Rules-based Content Classification

• Personally identifiable/identifying information (PII)
  – Any piece of information which can potentially be used to uniquely identify, contact, locate, stalk or steal the identity of a single person.
• Payment Card Industry (PCI) Data Security Standard
  – Protects PII: Forbids retailers from storing credit and debit card data on point-of-sale systems
  – All retailers must ensure that their POS systems are purged of such information, which includes magnetic stripe, PIN and card verification value data
• Classification becomes the cornerstone for identification of PII
  – Need to go beyond formal security classifications that have existed in the U.S. for years
  – Example: Major bank
    • Formal security policies include identifying categories of information: “bank-confidential”, “bank customer confidential”, “other”
    • Patterns include customer account numbers according to PCI standards

Classification for Security and Privacy: Pattern Matching
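
A pattern-matching rule for payment card numbers might look like the Python sketch below. It finds runs of 13 to 16 digits and keeps those that pass the Luhn checksum used by payment cards. The regular expression, the function names and the mapping of a hit to the "bank customer confidential" category are assumptions for illustration, not the bank's actual rules. The sample number is a widely used test card number, not a real account.

```python
import re

# Candidate card numbers: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return digit strings that look like card numbers and pass the checksum."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits

def classify_document(text):
    """Hypothetical rule: any valid card number makes the document customer-confidential."""
    return "bank customer confidential" if find_card_numbers(text) else "other"

print(classify_document("Card on file: 4111 1111 1111 1111"))
print(classify_document("Meeting notes for Tuesday"))
```

The checksum step matters because a digit-count pattern alone would flag many ordinary numbers, such as order ids, and inflate the review workload.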

Example of Rules-based Content Classification

A search for “Washington” delivers 333,000,000 results

Classification for targeted search: keyword

Example of Rules-based Content Classification

Further refining my category rules to: “Washington state sightseeing” gets me what I need

[Screenshot: search result titled “VERY SPECIAL PLACES: Distinctive Washington Lodging, Tours & Unique Attractions”]

Classification for targeted search: keyword

Classification: What is “Good Enough”?

Challenges of classification – various types
• Some human intervention always required to review results of classification
  – Automated tools improve efficiency
• Documents with little text – how are these classified?
  – PowerPoint slides, email, etc.
  – Varying document types
  – Metadata classification might be better in this case
• Lack of consistency in naming, structure, format
  – Metadata classification may be best

Factors affecting accuracy
• Document consistency / naming consistency
• The strength of the taxonomy (content)
• Applicability of classification algorithms to specific content

What is a reasonable cost per document?
What is the cost of a document that is incorrectly classified?
Does value to the organization outweigh the cost?
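
One way to frame the cost questions above is as an expected cost per document: the cost of classifying it plus the error rate multiplied by the cost of a misclassification. The Python sketch below applies that formula to two approaches. Every number in it is hypothetical and chosen only to show the trade-off.

```python
def expected_cost_per_document(processing_cost, error_rate, misclassification_cost):
    """Processing cost plus the expected cost of getting the class wrong."""
    return processing_cost + error_rate * misclassification_cost

# All figures below are hypothetical and only illustrate the trade-off.
manual = expected_cost_per_document(
    processing_cost=2.00, error_rate=0.05, misclassification_cost=50.0
)
automated = expected_cost_per_document(
    processing_cost=0.10, error_rate=0.10, misclassification_cost=50.0
)
print(f"manual:    ${manual:.2f} per document")
print(f"automated: ${automated:.2f} per document")
```

With these made-up inputs the cheaper-to-run automated approach ends up slightly more expensive overall, because the cost of errors dominates. With a lower misclassification cost or error rate the comparison reverses, which is why "good enough" depends on what a wrong classification costs the organization.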

Classification: Getting Started

• Obtain buy-in from Execs
• Establish cross-disciplinary information management team
• Start small, identify business drivers for classification (risk, cost, information access?)
• Audit current state, determine desired outcome and assess gaps
• Begin building policies and procedures to deliver classification strategy
  – Develop taxonomies and associated rules, if required
  – Evaluate and select classification tools

Summary

Classification - Immediate Benefits
– Better understand and organize your information
– Mitigate risk associated with unmanaged information
– Better deployment and alignment of your I.T. resources (storage/server consolidation, “smart” purchases, etc.)
– Better compliance readiness and eDiscovery

Classification - Longer term benefits
– Service Level Management improves I.T. service delivery
– Information management automation
– Cost reduction

Continue Your SNIA Education Experience At SNW

• Attend Hands-On Labs in:
  – Data Classification: Key to Service Level Management
  – Data Security and Protection: Data Assurance Solutions to Meet Corporate Requirements
  – IP Storage: iSCSI, Your IP SAN
  – Storage Management: Manage Storage or Be Managed By It
  – Storage Virtualization: Increasing Productivity
  – Zero to SAN: Fibre Channel Connectivity in No Time

Sessions begin Monday afternoon, April 16 and continue through Wednesday, April 18. All sessions in Emma/Maggie/Annie, 3rd Floor of the Hyatt Manchester. Registration at the SNW Registration area.

Q&A / Feedback

• Please send any questions or comments on this presentation to SNIA: [email protected]

Many thanks to the following individuals for their contributions to this tutorial.

SNIA Education Committee

Edgar St.Pierre, Bob Rodgers, Rob Peglar, Jeff Porter

Check out SNIA Tutorial: ILM and Tiers of Storage