Best Practices for Managing Unstructured Data

of 13 Sponsored by


Contents

How to Manage Semi-Structured Data

Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration

As critical business information is increasingly found in unstructured and semi-structured data such as documents, images and emails as well as tweets and other social media data, enterprises must re-evaluate existing data-handling strategies. This E-Guide explores several best practices and approaches for managing and integrating these new types of data. Learn about the differences between using enterprise search versus text analysis for processing unstructured data.

How to Manage Semi-Structured Data

By: Mark Whitehorn, Contributor

Data can be classified as structured, semi-structured or unstructured – but

what bearing do these classifications have on a company’s data-handling

strategy? The short answer is that it is becoming more important in our

rapidly changing IT world to be aware of different data forms and how (or if)

you need to manage them.

Structured data is data that has been split into small, discrete units. Each

piece of data concerns one thing (to use a good Anglo-Saxon catchall word),

for example, the last name of a customer. Structured data is typically stored

in tables. Continuing with our example, one column of data would list the last

names of all customers, and each row would pertain to one customer. These

tables, in turn, are typically stored in a relational database.
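
To make the table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample customers are illustrative assumptions, not taken from any real system.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (customer_id, last_name) VALUES (?, ?)",
    [(1, "Smith"), (2, "Jones"), (3, "Patel")],
)

# One column holds every last name; each row pertains to one customer.
last_names = [
    row[0]
    for row in conn.execute("SELECT last_name FROM customers ORDER BY customer_id")
]
```

Because each value sits in its own discrete cell, retrieving "all last names" is a one-line query.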

In very many cases, we find that data in the real world is not structured quite

as neatly as this. But we impose the database structure upon it for the simple

reason that doing so makes the data easy to retrieve and query. In practice,

this works well for managing most business data; stock control, finance,

human resources and other corporate systems all submit fairly readily to an

imposed data structure.

The problem is that some data is not amenable to rigorous structuring – and

such data is becoming more and more prevalent. A great deal of data

relevant to the enterprise is turning up in documents, images and emails as

well as tweets and other social media data. All of these can be described as

semi-structured data.

The term unstructured data is also bandied around but is not, in my opinion,

a viable classification. Virtually all data has some kind of structure – only

random noise is truly unstructured, and it contains very little commercial

value.

Our options for managing semi-structured data are:

a) Ignore it (probably fatal in a competitive climate)

b) Force it into structured relational form

c) Adopt a different storage mechanism

Let’s consider those options one by one.

Ignore it. This is a good one to rule out: So much data is being created and

collected in semi-structured forms that most enterprises cannot afford to

disregard the outpouring of it. Doing so is viable only if there is no compelling

business advantage in being able to track and analyse such data.

Stay relational. Relational database engines have been significantly

modified over the years to handle what are characterised by the database

manufacturers as "complex data types." XML is one example: It is considered

by many to be an excellent way of holding classic semi-structured data. Most

common document formats are, or can be, rendered into XML, and almost all

relational engines now have an XML data type, which means that documents

often can be stored in a relational database. But the additional complexity of

handling semi-structured data means there will inevitably be a trade-off, and

in general that will equate to slower retrieval times. However, it does make it

very easy to find all tweets that refer to your product, all emails that mention

"politician" and so on.
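
As a rough sketch of the idea, the snippet below stores XML-rendered messages in a relational table and finds the tweets that mention a product. SQLite has no native XML data type, so the XML is held as text and parsed in Python, standing in for the XML functions a full relational engine would provide. The element names, table name and sample messages are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body_xml TEXT)")
conn.executemany(
    "INSERT INTO messages (id, body_xml) VALUES (?, ?)",
    [
        (1, "<tweet><user>ann</user><text>Loving my new Widget!</text></tweet>"),
        (2, "<tweet><user>bob</user><text>Rainy day again.</text></tweet>"),
        (3, "<email><from>cy@example.com</from><text>The widget arrived late.</text></email>"),
    ],
)


def tweets_mentioning(term):
    """Return ids of <tweet> documents whose <text> element mentions the term."""
    hits = []
    for msg_id, body_xml in conn.execute("SELECT id, body_xml FROM messages ORDER BY id"):
        root = ET.fromstring(body_xml)
        text = root.findtext("text", default="")
        if root.tag == "tweet" and term.lower() in text.lower():
            hits.append(msg_id)
    return hits


widget_tweets = tweets_mentioning("widget")
```

Note that every document has to be parsed before it can be filtered, which hints at the slower retrieval times mentioned above.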

Other examples of complex data types are those that can handle spatial and

image data.

Adopt a different approach. There is increasing interest in adopting

alternative data management and storage mechanisms. Imagine you store

patient X-rays as images. We store data so we can retrieve it later and also

so we can query it, but running a query against an X-ray image is a

somewhat bizarre concept because the X-ray is simply a collection of pixels.

What often happens in practice is that this and other semi-structured data

comes with some attached metadata and can also undergo some form of

analysis in order to generate further metadata. (In a nutshell, metadata is

data about data). In the case of an email, the attached metadata might

include length, sender, recipient, time/date and so on. Automatic semantic

analysis of the email could be performed and that might yield metadata about

the tone of the email (as in, angry, conciliatory, praising, etc.), its

grammatical construction (correct, lax, etc.) and so on.
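
The following sketch shows how such metadata might be pulled out of an email with Python's standard email package. The sample message is invented, and the "tone" check is a deliberately crude keyword lookup used as a stand-in for real semantic analysis.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: support@example.com
Date: Mon, 06 Jan 2025 09:30:00 +0000
Subject: Order delayed

I am very unhappy that my order has still not arrived.
"""

# Hypothetical word list; real tone analysis would need far more than this.
ANGRY_WORDS = {"unhappy", "angry", "furious", "unacceptable"}


def extract_metadata(raw):
    """Return attached metadata (headers, length) plus a derived 'tone' field."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    words = {w.strip(".,!?").lower() for w in body.split()}
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "sent_at": parsedate_to_datetime(msg["Date"]).isoformat(),
        "length": len(body),
        "tone": "angry" if words & ANGRY_WORDS else "neutral",
    }


metadata = extract_metadata(RAW_EMAIL)
```

The first four fields are attached metadata that comes with the email; the last is generated by analysis of the content.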

Metadata is typically highly structured and is therefore highly susceptible to

analysis. So you could then store the emails and the metadata in a relational

database and query the metadata to find, not just those emails that mention

your product, but more specifically those that are well-written and also

positive about the product.
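
A minimal sketch of that kind of query, again with sqlite3, might look like this. The table layout, the tone and grammar labels and the sample rows are assumptions chosen to mirror the example in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE email_metadata (
           email_id INTEGER PRIMARY KEY,
           mentions_product INTEGER,  -- 1 if the product is named
           tone TEXT,                 -- e.g. angry, conciliatory, praising
           grammar TEXT               -- e.g. correct, lax
       )"""
)
conn.executemany(
    "INSERT INTO email_metadata VALUES (?, ?, ?, ?)",
    [
        (1, 1, "praising", "correct"),
        (2, 1, "angry", "correct"),
        (3, 1, "praising", "lax"),
        (4, 0, "praising", "correct"),
    ],
)

# Emails that mention the product, are well written and are positive about it.
rows = conn.execute(
    """SELECT email_id FROM email_metadata
       WHERE mentions_product = 1 AND tone = 'praising' AND grammar = 'correct'
       ORDER BY email_id"""
).fetchall()
matching_ids = [r[0] for r in rows]
```

The query never touches the email text itself; it works entirely on the structured metadata.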

And now think again about X-rays, which are classic semi-structured data.

While you wouldn’t query against a raw X-ray image, you can query against

its metadata. The attached metadata might include patient ID, doctor ID,

extensive information about how and when the X-ray was taken and so on.

Automatic analysis of the image might yield metadata like diagnosis,

prognosis and so on. In this solution the semi-structured data might be stored

simply as image files in the file system and the structured metadata would be

stored in a relational database and linked to the image. A query could then

pull out all the X-rays for doctor ID 1234 that involved broken limbs and

display the images.
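
Under the same assumptions as before (sqlite3, invented table and column names, made-up file paths), the X-ray arrangement could be sketched as follows. The images stay in the file system; the database holds only the metadata and a path linking each row to its image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE xray_metadata (
           xray_id INTEGER PRIMARY KEY,
           patient_id INTEGER,
           doctor_id INTEGER,
           diagnosis TEXT,
           image_path TEXT  -- link to the image file; the pixels are not stored here
       )"""
)
conn.executemany(
    "INSERT INTO xray_metadata VALUES (?, ?, ?, ?, ?)",
    [
        (1, 501, 1234, "broken limb", "/xrays/501_arm.png"),
        (2, 502, 1234, "pneumonia", "/xrays/502_chest.png"),
        (3, 503, 9876, "broken limb", "/xrays/503_leg.png"),
        (4, 504, 1234, "broken limb", "/xrays/504_leg.png"),
    ],
)


def xray_paths(doctor_id, diagnosis):
    """Return the image paths for one doctor and one diagnosis."""
    rows = conn.execute(
        "SELECT image_path FROM xray_metadata "
        "WHERE doctor_id = ? AND diagnosis = ? ORDER BY xray_id",
        (doctor_id, diagnosis),
    )
    return [r[0] for r in rows]


paths = xray_paths(1234, "broken limb")
```

An application would then open each returned path to display the image.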

The bottom line is that semi-structured data is here to stay, and it offers the

potential for business advantage to any company that handles and analyses

it well.

About the author:

Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling

and business intelligence (BI). Based in the UK, Whitehorn works as a

consultant for a number of national and international companies and is a

mentor with Solid Quality Mentors. In addition, he is a well-recognized

commentator on the computer world, publishing articles, white papers and

books. Whitehorn is also a senior lecturer in the School of Computing at the

University of Dundee, where he teaches the masters course in BI. His

academic interests include the application of BI to scientific research.

Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration

By: Krish Krishnan

A great debate is raging in the industry, and it is being fanned by the

adoption of "big data." The simple question is: Do we create better search

techniques or do we go all the way to text analysis for integrating

unstructured data? A simple answer is to say "yes" to both questions, but

there are hidden layers of complexity in the answer, which this article will

attempt to explain.

Search vs. Analysis

At a fundamental level, both search and analysis engines operate on text

data. Here is where the similarity ends. With search, you typically look for

patterns and present the findings to the user in short order. There is no

further transformation to the text. Analysis deals with the discovery of the

pattern (akin to search); but, more importantly, transformations are applied to

the text to create a meaningful outcome. Analysis assumes that text must be

integrated and transformed before it can be analyzed. This advanced

treatment of text in terms of analysis is where complexities arise, and the

field – though decades rich in terms of algorithms, research and

development, and published theses – continues to be nascent and niche.
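
The distinction can be illustrated with a small sketch. The two documents and the record layout are invented, and the "analysis" shown is only a toy transformation; it is meant to show the difference in the shape of the output, not a real text analysis engine.

```python
import re

# Invented sample documents.
DOCUMENTS = {
    "doc1": "Contract renewal is due in March. The supplier was late twice.",
    "doc2": "Invoice paid on time. No issues with the supplier.",
}


def search(pattern):
    """Search: return ids of documents matching the pattern; the text is untouched."""
    return [
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if re.search(pattern, text, re.IGNORECASE)
    ]


def analyze(doc_id):
    """Analysis: transform the text into structured records, one per sentence."""
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", DOCUMENTS[doc_id])
        if s.strip()
    ]
    return [
        {
            "doc_id": doc_id,
            "sentence_no": i,
            "mentions_supplier": "supplier" in s.lower(),
            "word_count": len(s.split()),
        }
        for i, s in enumerate(sentences, start=1)
    ]


hits = search("supplier")
records = analyze("doc1")
```

Search hands back pointers to matching documents, while analysis produces new, structured data that can be stored and queried later.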

The fundamental characteristic of text is best captured by one adjective: "erose" (not to be confused with "verbose"). "Erose," a word of Latin origin, means "irregularly notched, toothed, or indented" (from dictionary.com) and is used mostly in botany to describe the leaves of a plant. The reason for this attribution is that text is long, complex and unpredictable. It is a combination of words and phrases that form contextual statements, which may contain repeatable patterns (this repeatability can also differ based on context within a single document or text). When discussing "unstructured" data, we use this lack of repeatability and the associated ambiguity to distinguish text data analysis and its outcomes from structured data, where the great repeatability of the data and a structured, formatted storage architecture lend themselves well to integration and analytics.

Applying Search for Unstructured Integration

With the available search infrastructure and algorithms, one can ask: To integrate any "unstructured data," why not just extend search outputs? Why do we need to create a separate text analysis platform? There have been attempts at doing that, but including integration and transformation as part of search is not a good approach.

Search engines or enterprise appliances will become slow when integration and transformation are added to the normal workload. For example, let us assume that 10,000 searches have to be done against a contract database on a content management platform for every user query. Every search transaction will create operational structures and return quick hits on a set of patterns as its output. Adding analysis-type transformation introduces great inefficiencies to this exercise. The critical reason is that analysis requires creating clarity and context around the unstructured information, and both of these operations are highly complex and processing-intensive. The additional work will slow search down immensely.

Search engines do a lot of pattern matching, metadata-based (taxonomy and ontology) indexing and large-scale distributed data processing. Metadata and patterns are definitely nimble and agile techniques for transforming the minimal data required for search processing, but the same techniques will not scale to support the complex nature of unstructured data analysis.

Searches are designed to process patterns for every user query and are inconsistent by design. No two users will search for the same pattern at a given time. Thus, the same algorithms are replayed over and over for multiple types of data patterns, in runs that are short-lived and efficient despite the processing inconsistencies.

While these are the key reasons why applying search to analyze unstructured data is not the best option, they are not the only reasons. Analysis of text requires a lot of additional processing, including spelling correction, alternate spellings, synonyms, user-defined rules and much deeper processing.
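
To give a flavor of that additional processing, here is a small sketch of token normalization with a synonym map and fuzzy spelling correction based on Python's difflib. The vocabulary, the synonym map and the 0.8 similarity cutoff are all assumptions picked for illustration; a production system would use much richer resources.

```python
import difflib

VOCABULARY = ["contract", "supplier", "payment", "delivery"]  # hypothetical known terms
SYNONYMS = {"vendor": "supplier", "agreement": "contract"}    # hypothetical synonym map


def normalize(token):
    """Lowercase a token, map synonyms, then correct near-miss spellings."""
    word = token.lower().strip(".,;:!?")
    word = SYNONYMS.get(word, word)
    # Spelling correction: snap to the closest known term if it is similar enough.
    close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else word


def normalize_text(text):
    """Normalize every whitespace-separated token in the text."""
    return [normalize(t) for t in text.split()]


tokens = normalize_text("The vendor missed a paymnt")
```

Each of these steps costs processing time on every token, which is why bolting them onto a search engine's query path slows it down.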

Text Analysis

Let’s look at how analysis will be different from search:

Text analysis advances the integration of unstructured data beyond the light indexing and pattern matching of search.

Analysis consists of multiple transformation steps, each of which needs to be run once per set of patterns, metadata terms or context.

Analysis creates multiple iterations of metadata output, as opposed to simple result sets of entire pages; this output forms a powerful set of indexes within the text and its context.

Analysis always processes data in a consistent manner, as opposed to search.

Consider a popular example found in Wikipedia under Natural Language Processing:

The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.

"I never said she stole my money" (stress on "I") – Someone else said it, but I didn't.

"I never said she stole my money" (stress on "never") – I simply didn't ever say it.

"I never said she stole my money" (stress on "said") – I might have implied it in some way, but I never explicitly said it.

"I never said she stole my money" (stress on "she") – I said someone took it; I didn't say it was she.

"I never said she stole my money" (stress on "stole") – I just said she probably borrowed it.

"I never said she stole my money" (stress on "my") – I said she stole someone else's money.

"I never said she stole my money" (stress on "money") – I said she stole something, but not my money.

Depending on which word the speaker stresses, you can see how this

sentence could have several different meanings.

If you search for this pattern, you will get all the statements, and you then have to work out the extended meaning and interpret each one yourself. If you process this through a text analysis platform, you can create a context-oriented result set that provides not only the result, but also the associated context, which is far more useful.

The need to transform data before it becomes useful for analytics and reporting is not a new thought. We have always designed the data warehouse to process data in this fashion, and we call it ETL (extract, transform, load). Extending this analysis to text creates a powerful concept: textual ETL.

This need for transformation and integration of text has some interesting

challenges. One challenge is the size of the data to be transformed. Let us

assume that you intend to take the Internet as your data set. Is it possible to

transform and analyze all the text found on the Internet? In a nutshell, it is not

practical or feasible. In such a situation, you primarily rely on search and can

use a subset of data from the result set for deeper analysis.

But there are other data sets such as enterprise data that are large in

volume, complex in formats and have multiple contexts, yet lend themselves

to the rigors of text analysis and processing. A simple example is the contracts

existing across the different business divisions such as purchasing, supply

chain, inventory management, logistics, transportation and human resources.

Each of these contracts has a different purpose, and there may be many

contracts of a type that can provide insights beyond just start and end dates.

Insights include legal terms and conditions with applied context, liabilities and

obligations and much more. After analysis, such text will create a powerful

and rich metadata output with context that can be simply integrated into a

decision-support system ecosystem.
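
As a toy illustration of textual ETL applied to contracts, the sketch below extracts dates and obligation clauses from contract text and loads them as key-value rows into a relational table. The contract wording, the regular expressions and the table layout are all assumptions; real contracts are far less regular, which is exactly why dedicated text analysis is needed.

```python
import re
import sqlite3

# Invented contract snippets, keyed by a hypothetical document id.
CONTRACTS = {
    "purchasing-001": "This agreement starts on 2024-01-01 and ends on 2024-12-31. "
                      "The supplier shall deliver goods within 30 days.",
    "logistics-007": "This agreement starts on 2023-06-01 and ends on 2025-05-31. "
                     "The carrier shall insure all shipments.",
}


def transform(doc_id, text):
    """Turn contract text into (doc_id, fact_key, fact_value) rows."""
    rows = []
    start = re.search(r"starts on (\d{4}-\d{2}-\d{2})", text)
    end = re.search(r"ends on (\d{4}-\d{2}-\d{2})", text)
    if start:
        rows.append((doc_id, "start_date", start.group(1)))
    if end:
        rows.append((doc_id, "end_date", end.group(1)))
    # Treat "The <party> shall <duty>." as an obligation (a simplifying assumption).
    for party, duty in re.findall(r"The (\w+) shall ([^.]+)\.", text):
        rows.append((doc_id, "obligation", f"{party}: {duty}"))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contract_facts (doc_id TEXT, fact_key TEXT, fact_value TEXT)")
for doc_id, text in CONTRACTS.items():  # extract
    # transform, then load
    conn.executemany("INSERT INTO contract_facts VALUES (?, ?, ?)", transform(doc_id, text))

obligations = conn.execute(
    "SELECT doc_id, fact_value FROM contract_facts "
    "WHERE fact_key = 'obligation' ORDER BY doc_id"
).fetchall()
```

Once loaded, the facts can be joined with other warehouse data like any structured table.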

Other challenges include the variety of formats, the volumes of text, the ambiguous nature of the data itself and the lack of formal documentation, to name a few. But once the challenges are addressed, the output from such an analysis is powerful enough to drive a visualization platform for looking into text and unstructured data within the enterprise. This is where you can leverage the data that has been stored on content management platforms for years to produce useful output on trends and behaviors.

The major differences between the result sets produced by a search system and a text analytics system are as follows:

Search

Search is oriented to process informational needs of a single user query

The search result set is proprietary to that user and cannot be shared

Result set is temporary (under normal circumstances)

Transformation rules are repeated with every query and are minimalistic

Result set cannot be integrated with a DBMS

Search processing cannot scale for large and complex operations – context-based search has always added significant overhead

Text Analysis

Can be defined by users for processing with business rules, like an ETL tool

Produces a result set that is a key-value column pair often stored in an RDBMS

Result set can be used for further analytical processing

Result sets can be stored as snapshots for repeated processing

Transformation of data and associated context is repeatable in multiple passes of processing cycles

Text of different languages for global organizations can be stored in the same result database based on metadata integrations and rules

Text analysis can scale easily based on the infrastructure capabilities

Based on the discussion here, you can discern that search is good for finding

things on an ad hoc basis in a large set of data. Analysis is good for creating

a platform that can be used repeatedly against a large but finite amount of

textual data as related to a corporation.

In order to perform text analysis and deep text mining, you need to process

the text rather than extend a search engine or appliance. A robust text

analysis system will provide for the following:

Categorization

Classification

Spelling correction

Synonyms, antonyms and homonyms

Integration with taxonomies

Metadata

Business rules integration

Reprocessing capabilities

Document fracturing and processing

Each of these steps allows you to process large text and create the result

database for processing. This database can be used with search to create

guided search and navigation, and can be extended to machine learning

using a search and analysis combination platform.
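
Two of the listed capabilities can be sketched together in a few lines. The taxonomy below is hypothetical, and "document fracturing" is taken loosely here to mean splitting a document into smaller fragments, which is an assumption because the article does not define the term.

```python
import re

# Hypothetical taxonomy: category -> terms that signal it.
TAXONOMY = {
    "finance": {"invoice", "payment", "penalty"},
    "logistics": {"shipment", "delivery", "carrier"},
}


def fracture(document):
    """Split a document into paragraph-sized fragments (assumed meaning of 'fracturing')."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def categorize(fragment):
    """Return the sorted categories whose terms appear in the fragment."""
    words = set(re.findall(r"[a-z]+", fragment.lower()))
    return sorted(cat for cat, terms in TAXONOMY.items() if words & terms)


# Invented sample document with three paragraphs.
DOCUMENT = """Late payment incurs a penalty of 2 percent.

The carrier is responsible for each shipment until delivery.

Both parties agree to meet quarterly."""

result = [(i, categorize(f)) for i, f in enumerate(fracture(DOCUMENT), start=1)]
```

The resulting (fragment, categories) pairs are the kind of rows that could populate the result database and later feed guided search.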

The major advantage of text analysis is the ability to track changes as they

occurred or occur within the text environment in a similar manner to tracking

changes in a dimension. This is the most powerful output, the one that makes analysis a much better proposition than search, and it is called document mid-point reprocessing. You can extend this concept to emails, Excel spreadsheets and other document types very easily.
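
One way to picture change tracking "in a similar manner to tracking changes in a dimension" is to version each extracted fact, closing the old version when a later processing pass finds a different value. The sketch below is an assumption about how that could look, borrowing the history-keeping idea from dimension tracking in data warehousing; it is not a description of any particular product's document mid-point reprocessing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FactVersion:
    doc_id: str
    key: str
    value: str
    valid_from: int                 # processing pass in which the value was first seen
    valid_to: Optional[int] = None  # None means this is the current version


history: List[FactVersion] = []


def record(doc_id, key, value, pass_no):
    """Close the current version if the value changed, then open a new one."""
    current = next(
        (f for f in history
         if f.doc_id == doc_id and f.key == key and f.valid_to is None),
        None,
    )
    if current is not None:
        if current.value == value:
            return  # unchanged, nothing to track
        current.valid_to = pass_no
    history.append(FactVersion(doc_id, key, value, pass_no))


record("contract-1", "end_date", "2024-12-31", pass_no=1)
record("contract-1", "end_date", "2024-12-31", pass_no=2)  # no change
record("contract-1", "end_date", "2025-06-30", pass_no=3)  # amended contract
```

Keeping both versions lets you ask what the text said at any earlier pass, not just what it says now.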

In conclusion, search and text analysis serve different purposes for processing unstructured data, and both can be effectively leveraged. Search can be used for early stage data discovery, and text analysis can be used for the detailed analysis and downstream analytical processing. But remember this: Do not treat search as a substitute for traditional text analytics.
