Best Practices for Managing Unstructured Data
Page 2 of 13 Sponsored by
Best Practices for Managing Unstructured Data
Contents
How to Manage Semi-Structured Data
Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration
As critical business information is increasingly found in unstructured and semi-structured data such as documents, images and emails as well as tweets and other social media data, enterprises must re-evaluate existing data-handling strategies. This E-Guide explores several best practices and approaches for managing and integrating these new types of data. Learn about the differences between using enterprise search versus text analysis for processing unstructured data.
How to Manage Semi-Structured Data
By: Mark Whitehorn, Contributor
Data can be classified as structured, semi-structured or unstructured – but
what bearing do these classifications have on a company’s data-handling
strategy? The short answer is that it is becoming more important in our
rapidly changing IT world to be aware of different data forms and how (or if)
you need to manage them.
Structured data is data that has been split into small, discrete units. Each
piece of data concerns one thing (to use a good Anglo-Saxon catchall word),
for example, the last name of a customer. Structured data is typically stored
in tables. Continuing with our example, one column of data would list the last
names of all customers, and each row would pertain to one customer. These
tables, in turn, are typically stored in a relational database.
In many cases, we find that data in the real world is not structured quite
as neatly as this. But we impose the database structure upon it for the simple
reason that doing so makes the data easy to retrieve and query. In practice,
this works well for managing most business data: stock control, finance,
human resources and other corporate systems all submit fairly readily to an
imposed data structure.
The problem is that some data is not amenable to rigorous structuring – and
such data is becoming more and more prevalent. A great deal of data
relevant to the enterprise is turning up in documents, images and emails as
well as tweets and other social media data. All of these can be described as
semi-structured data.
The term unstructured data is also bandied around but is not, in my opinion,
a viable classification. Virtually all data has some kind of structure – only
random noise is truly unstructured, and it contains very little commercial
value.
Our options for managing semi-structured data are:
a) Ignore it (probably fatal in a competitive climate)
b) Force it into structured relational form
c) Adopt a different storage mechanism
Let’s consider those options one by one.
Ignore it. This is a good one to rule out: So much data is being created and
collected in semi-structured forms that most enterprises cannot afford to
disregard the outpouring of it. Doing so is viable only if there is no compelling
business advantage in being able to track and analyse such data.
Stay relational. Relational database engines have been significantly
modified over the years to handle what are characterised by the database
manufacturers as "complex data types." XML is one example: it is considered
by many to be an excellent way of holding classic semi-structured data. Most
common document formats are, or can be, rendered into XML, and almost all
relational engines now have an XML data type, which means that documents
often can be stored in a relational database. But the additional complexity of
handling semi-structured data means there will inevitably be a trade-off, and
in general that will equate to slower retrieval times. However, storing the
data this way does make it very easy to find all tweets that refer to your
product, all emails that mention "politician" and so on.
Other examples of complex data types are those that can handle spatial and
image data.
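As a minimal sketch of the "stay relational" option, the example below stores tweets as XML in a database table and finds those that mention a product. It uses SQLite, which has no native XML type, so the XML sits in a TEXT column and is parsed in Python; an engine with a real XML data type could evaluate the query inside the database. The table name, element names and sample tweets are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample documents; element names are illustrative only.
TWEETS = [
    "<tweet><user>alice</user><text>Loving my new Widget!</text></tweet>",
    "<tweet><user>bob</user><text>Traffic is terrible today.</text></tweet>",
    "<tweet><user>carol</user><text>The widget broke after a week.</text></tweet>",
]

def load_tweets(conn, docs):
    """Store each XML document as one row in a relational table."""
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)")
    conn.executemany("INSERT INTO tweets (doc) VALUES (?)", [(d,) for d in docs])

def tweets_mentioning(conn, term):
    """Return (id, user) for every stored tweet whose text mentions term."""
    hits = []
    for row_id, doc in conn.execute("SELECT id, doc FROM tweets ORDER BY id"):
        root = ET.fromstring(doc)  # every row is parsed: the retrieval cost
        text = root.findtext("text", default="")
        if term.lower() in text.lower():
            hits.append((row_id, root.findtext("user")))
    return hits

conn = sqlite3.connect(":memory:")
load_tweets(conn, TWEETS)
print(tweets_mentioning(conn, "widget"))  # [(1, 'alice'), (3, 'carol')]
```

Parsing every document on every query is one concrete form of the slower retrieval mentioned above.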
Adopt a different approach. There is increasing interest in adopting
alternative data management and storage mechanisms. Imagine you store
patient X-rays as images. We store data so we can retrieve it later and also
so we can query it, but running a query against an X-ray image is a
somewhat bizarre concept because the X-ray is simply a collection of pixels.
What often happens in practice is that this and other semi-structured data
comes with some attached metadata and can also undergo some form of
analysis in order to generate further metadata. (In a nutshell, metadata is
data about data.) In the case of an email, the attached metadata might
include length, sender, recipient, time/date and so on. Automatic semantic
analysis of the email could be performed and that might yield metadata about
the tone of the email (as in, angry, conciliatory, praising, etc.), its
grammatical construction (correct, lax, etc.) and so on.
Metadata is typically highly structured and is therefore highly susceptible to
analysis. So you could then store the emails and the metadata in a relational
database and query the metadata to find, not just those emails that mention
your product, but more specifically those that are well-written and also
positive about the product.
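The email case can be sketched as follows. The attached metadata (sender, recipient, length) is read with Python's standard email parser, and a toy word list stands in for automatic semantic analysis; real tone detection would be far more involved. The word lists, table layout and sample emails are assumptions made for illustration.

```python
import sqlite3
from email import message_from_string

# Hypothetical sample emails.
RAW_EMAILS = [
    "From: ann@example.com\nTo: support@example.com\nSubject: Widget\n\n"
    "I love the Widget. It works great.",
    "From: raj@example.com\nTo: support@example.com\nSubject: Complaint\n\n"
    "The Widget is awful and I am angry.",
    "From: lee@example.com\nTo: sales@example.com\nSubject: Hello\n\n"
    "Great weather today.",
]

# Toy word lists standing in for real semantic analysis (an assumption).
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"awful", "angry", "terrible"}

def tone_of(body):
    """Generate a crude 'tone' metadata value from the email body."""
    words = {w.strip(".,!?").lower() for w in body.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def build_store(raw_emails):
    """Store each email alongside its attached and generated metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE email_meta (id INTEGER PRIMARY KEY, sender TEXT, "
        "recipient TEXT, length INTEGER, tone TEXT, body TEXT)"
    )
    for raw in raw_emails:
        msg = message_from_string(raw)
        body = msg.get_payload()
        conn.execute(
            "INSERT INTO email_meta (sender, recipient, length, tone, body) "
            "VALUES (?, ?, ?, ?, ?)",
            (msg["From"], msg["To"], len(body), tone_of(body), body),
        )
    return conn

conn = build_store(RAW_EMAILS)
# Emails that mention the product AND are positive about it.
positive_about_product = conn.execute(
    "SELECT sender FROM email_meta WHERE tone = 'positive' AND body LIKE ?",
    ("%widget%",),
).fetchall()
print(positive_about_product)  # [('ann@example.com',)]
```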
And now think again about X-rays, which are classic semi-structured data.
While you wouldn’t query against a raw X-ray image, you can query against
its metadata. The attached metadata might include patient ID, doctor ID,
extensive information about how and when the X-ray was taken and so on.
Automatic analysis of the image might yield metadata like diagnosis,
prognosis and so on. In this solution the semi-structured data might be stored
simply as image files in the file system and the structured metadata would be
stored in a relational database and linked to the image. A query could then
pull out all the X-rays for doctor ID 1234 that involved broken limbs and
display the images.
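A minimal sketch of that arrangement: the images stay in the file system, and a relational table holds the structured metadata plus a path that links each row to its image. The column names, file paths and the "finding" values are hypothetical.

```python
import sqlite3

# Hypothetical metadata rows; the image files themselves stay on disk.
XRAYS = [
    ("/xrays/0001.png", 501, 1234, "2012-03-01", "broken limb"),
    ("/xrays/0002.png", 502, 1234, "2012-03-02", "clear"),
    ("/xrays/0003.png", 503, 9876, "2012-03-02", "broken limb"),
    ("/xrays/0004.png", 504, 1234, "2012-03-05", "broken limb"),
]

def build_catalog(rows):
    """Create the metadata table; image_path links each row to its file."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE xray_meta (image_path TEXT PRIMARY KEY, patient_id INTEGER, "
        "doctor_id INTEGER, taken_on TEXT, finding TEXT)"
    )
    conn.executemany("INSERT INTO xray_meta VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def image_paths(conn, doctor_id, finding):
    """Query the metadata, never the pixels, and return the linked files."""
    cursor = conn.execute(
        "SELECT image_path FROM xray_meta "
        "WHERE doctor_id = ? AND finding = ? ORDER BY taken_on",
        (doctor_id, finding),
    )
    return [path for (path,) in cursor]

catalog = build_catalog(XRAYS)
print(image_paths(catalog, 1234, "broken limb"))
# ['/xrays/0001.png', '/xrays/0004.png']
```

An application would then open the returned files to display the images.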
The bottom line is that semi-structured data is here to stay, and it offers the
potential for business advantage to any company that handles and analyses
it well.
About the author:
Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling
and business intelligence (BI). Based in the UK, Whitehorn works as a
consultant for a number of national and international companies and is a
mentor with Solid Quality Mentors. In addition, he is a well-recognized
commentator on the computer world, publishing articles, white papers and
books. Whitehorn is also a senior lecturer in the School of Computing at the
University of Dundee, where he teaches the masters course in BI. His
academic interests include the application of BI to scientific research.
Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration
By: Krish Krishnan
A great debate is raging in the industry, and it is being fanned by the
adoption of "big data." The simple question is: Do we create better search
techniques or do we go all the way to text analysis for integrating
unstructured data? A simple answer is to say "yes" to both questions, but
there are hidden layers of complexity in the answer, which this article will
attempt to explain.
Search vs. Analysis
At a fundamental level, both search and analysis engines operate on text
data. Here is where the similarity ends. With search, you typically look for
patterns and present the findings to the user in short order. There is no
further transformation to the text. Analysis deals with the discovery of the
pattern (akin to search); but, more importantly, transformations are applied to
the text to create a meaningful outcome. Analysis assumes that text must be
integrated and transformed before it can be analyzed. This advanced
treatment of text in terms of analysis is where complexities arise, and the
field – though decades rich in terms of algorithms, research and
development, and published theses – continues to be nascent and niche.
The fundamental characteristic of text is best captured by one adjective:
"erose" (not to be confused with "verbose"). The word, which comes from Latin,
means "irregularly notched, toothed, or indented" (from dictionary.com), and is
used mostly in botany to describe the leaves of a plant. The underlying reason
for this attribution is that text is long, complex and unpredictable. It is a
combination of words and phrases that form contextual statements, which may
contain repeatable patterns (this repeatability can also differ based on
context within a single document or text). When discussing "unstructured"
data, we use this lack of repeatability and the associated ambiguity to
distinguish text data analysis and its outcomes from structured data, where
there is great repeatability of data and a structured, formatted storage
architecture that lends itself well to integration and analytics.
Applying Search for Unstructured Integration
With the available search infrastructure and algorithms, one can ask: to
integrate any "unstructured data," why not just extend search outputs? Why do
we need to create a separate text analysis platform? There have been attempts
at doing that, but including integration and transformation as part of search
is not a good approach.
Search engines or enterprise appliances will become lethargic and
slow when integration and transformation are added to the normal
workload. For example, let us assume that 10,000 searches have to
be done for a contract database on a content management platform
for every user query. Every search transaction will create operational
structures and return quick hits on a set of patterns as its output.
Adding analysis-type transformation introduces great inefficiencies
of operation to this exercise. The critical reason here is that analysis
requires creating clarity and context around the unstructured
information, and both of these operations are highly complex and
require processing. The additional operation will cause immense
slowdown of search.
Search engines do a lot of pattern matching, metadata-based (taxonomy
and ontology) indexing and large-scale distributed data
processing. Metadata and patterns are definitely nimble and agile
techniques for transforming the minimal data required for search
processing, but they will not scale to support the complex
nature of unstructured data analysis.
Searches are designed to process patterns for every user query and
are inconsistent by design. No two users will search for the same
pattern at a given time. Thus, the same algorithms are replayed over
and over for multiple types of data patterns, which are short-lived
and efficient despite processing inconsistencies.
While these are the key reasons why applying search to analyze
unstructured data is not the best option, they are not the only
reasons. Analysis of text requires a lot of additional processing, including
spelling correction, alternate spellings, synonyms, user-defined rules and
much deeper processing.
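To make the extra processing concrete, here is a minimal sketch of two of those steps: synonym mapping and spelling correction against a known vocabulary. It uses `difflib.get_close_matches` from the Python standard library as a stand-in for a real spelling corrector; the vocabulary, the synonym table and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical business vocabulary and user-defined synonym rules.
VOCABULARY = ["contract", "liability", "supplier", "payment"]
SYNONYMS = {"vendor": "supplier", "agreement": "contract"}

def normalize(token):
    """Map one raw token to its canonical term where a rule applies."""
    word = token.lower().strip(".,;:!?")
    word = SYNONYMS.get(word, word)          # user-defined synonym rule
    if word in VOCABULARY:
        return word
    # Spelling correction: closest vocabulary term above the cutoff, if any.
    match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return match[0] if match else word

def normalize_text(text):
    """Normalize every whitespace-separated token in a piece of text."""
    return [normalize(token) for token in text.split()]

print(normalize_text("The vendor signed the contrct."))
# ['the', 'supplier', 'signed', 'the', 'contract']
```

Each of these steps costs time on every token, which is why they fit poorly inside a search engine's per-query workload.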
Text Analysis
Let’s look at how analysis will be different from search:
Text analysis advances the integration of unstructured data beyond just
light indexing and pattern matching of search.
Analysis consists of multiple transformation steps, each of which needs
to be run once per set of patterns, metadata terms or context.
Analysis creates multiple iterations of metadata output as opposed to
simple result sets of entire pages, which create a powerful set of indexes
within the text and its context.
Analysis always processes data in a consistent manner, as opposed to
search.
Here is a popular example found in Wikipedia under Natural Language
Processing:
The sentence "I never said she stole my money" demonstrates the
importance stress can play in a sentence, and thus the inherent difficulty a
natural language processor can have in parsing it.
"*I* never said she stole my money" – Someone else said it, but I
didn't.
"I *never* said she stole my money" – I simply didn't ever say it.
"I never *said* she stole my money" – I might have implied it in some
way, but I never explicitly said it.
"I never said *she* stole my money" – I said someone took it; I didn't
say it was she.
"I never said she *stole* my money" – I just said she probably
borrowed it.
"I never said she stole *my* money" – I said she stole someone else's
money.
"I never said she stole my *money*" – I said she stole something, but
not my money.
Depending on which word the speaker stresses, you can see how this
sentence could have several different meanings.
If you search for this pattern, you will get all the statements, and you then
have to work out the intended meaning of each one yourself. If you process this
through a text analysis platform, you can create a context-oriented result set
that will provide you not only the result, but also the associated context,
which is far more useful.
The need for transforming data before it becomes useful for analytics and
reporting is not a new thought. We have always designed the data
warehouse to process data in this fashion, and call it ETL (extract,
transform, load). Extending this analysis to text creates a powerful
concept: textual ETL.
This need for transformation and integration of text has some interesting
challenges. One challenge is the size of the data to be transformed. Let us
assume that you intend to take the Internet as your data set. Is it possible to
transform and analyze all the text found on the Internet? In a nutshell, it is not
practical or feasible. In such a situation, you primarily rely on search and can
use a subset of data from the result set for deeper analysis.
But there are other data sets, such as enterprise data, that are large in
volume, complex in format and have multiple contexts, yet lend themselves
to the rigors of text analysis and processing. A simple example is the contracts
existing across the different business divisions such as purchasing, supply
chain, inventory management, logistics, transportation and human resources.
Each of these contracts has a different purpose, and there may be many
contracts of a type that can provide insights beyond just start and end dates.
Insights include legal terms and conditions with applied context, liabilities and
obligations and much more. After analysis, such text will create a powerful
and rich metadata output with context that can be simply integrated into a
decision-support system ecosystem.
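A textual ETL pass over contracts can be sketched in a few lines: extract the text, transform it with business rules into metadata, and load the result into a relational table. The rules here are simple regular expressions, which is an assumption made to keep the example short; the contract texts, rule names and table layout are hypothetical.

```python
import re
import sqlite3

# Extract: hypothetical contract texts keyed by document ID.
CONTRACTS = {
    "purchasing-001": "This agreement starts on 2012-01-01 and ends on 2013-12-31. "
                      "Liability is capped at $50,000.",
    "logistics-007": "This agreement starts on 2011-06-15 and ends on 2012-06-14. "
                     "The carrier shall insure all goods in transit.",
}

# Transform: business rules expressed as patterns (illustrative only).
RULES = {
    "start_date": re.compile(r"starts on (\d{4}-\d{2}-\d{2})"),
    "end_date": re.compile(r"ends on (\d{4}-\d{2}-\d{2})"),
    "liability_cap": re.compile(r"capped at \$([\d,]+)"),
    "obligation": re.compile(r"shall ([^.]+)\."),
}

def transform(doc_id, text):
    """Yield one (doc_id, attribute, value) row per rule match."""
    for attr, pattern in RULES.items():
        for match in pattern.finditer(text):
            yield (doc_id, attr, match.group(1))

def load(conn, rows):
    """Load the extracted metadata rows into the result table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contract_meta (doc_id TEXT, attr TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO contract_meta VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
for doc_id, text in CONTRACTS.items():
    load(conn, transform(doc_id, text))

capped = conn.execute(
    "SELECT doc_id, value FROM contract_meta WHERE attr = 'liability_cap'"
).fetchall()
print(capped)  # [('purchasing-001', '50,000')]
```

Once loaded, the metadata can be queried and joined like any other decision-support data.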
Other challenges include the variety of formats, the volumes of text, the
ambiguous nature of the data itself and lack of formal documentation, to
name a few. But once the challenges are addressed, the output from such an
analysis is powerful enough to support a large visualization platform for
looking into text and unstructured data within the enterprise. This is where
you can leverage the data that has been stored on content management
platforms for years to produce useful output on trends and behaviors.
The major differences between a result set produced by a search system and
one produced by a text analytics system are as follows:
Search
Search is oriented to process informational needs of a single user
query
The search result set is proprietary to that user and cannot be
shared
Result set is temporary (under normal circumstances)
Transformation rules are repeated with every query and are
minimalistic
Result set cannot be integrated with a DBMS
Search processing cannot scale for large and complex operations –
context-based search has always added significant overhead
Text Analysis
Can be defined by users for processing with business rules, like an
ETL tool
Produces a result set that is a key-value column pair often stored in
an RDBMS
Result set can be used for further analytical processing
Result sets can be stored as snapshots for repeated processing
Transformation of data and associated context is repeatable in
multiple passes of processing cycles
Text of different languages for global organizations can be stored in
the same result database based on metadata integrations and rules
Text analysis can scale easily based on the infrastructure capabilities
Based on the discussion here, you can discern that search is good for finding
things on an ad hoc basis in a large set of data. Analysis is good for creating
a platform that can be used repeatedly against a large but finite amount of
textual data as related to a corporation.
In order to perform text analysis and deep text mining, you need to process
the text rather than extend a search engine or appliance. A robust text
analysis system will provide for the following:
Categorization
Classification
Spelling correction
Synonyms, antonyms and homonyms
Integration with taxonomies
Metadata
Business rules integration
Reprocessing capabilities
Document fracturing and processing
Each of these steps allows you to process large text and create the result
database for processing. This database can be used with search to create
guided search and navigation, and can be extended to machine learning
using a search and analysis combination platform.
The major advantage of text analysis is the ability to track changes as they
occurred or occur within the text environment in a similar manner to tracking
changes in a dimension. This is the most powerful output, the one that makes
analysis a much better proposition than search, and it is called document
mid-point reprocessing. You can extend this concept to emails, Excel
spreadsheets and other document types very easily.
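One way to picture tracking changes "in a similar manner to tracking changes in a dimension" is a history table in which each extracted value carries a validity range, much like a Type 2 slowly changing dimension in a data warehouse. The sketch below is an assumption about what such tracking could look like, not a description of document mid-point reprocessing itself; the table, columns and sample values are hypothetical.

```python
import sqlite3

def open_history():
    """Create a history table; valid_to is NULL for the current value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_history (doc_id TEXT, attr TEXT, value TEXT, "
        "valid_from TEXT, valid_to TEXT)"
    )
    return conn

def record_value(conn, doc_id, attr, value, as_of):
    """Close the current row if the value changed, then open a new one.

    Returns True when a change was recorded, False when nothing changed.
    """
    current = conn.execute(
        "SELECT rowid, value FROM doc_history "
        "WHERE doc_id = ? AND attr = ? AND valid_to IS NULL",
        (doc_id, attr),
    ).fetchone()
    if current and current[1] == value:
        return False
    if current:
        conn.execute(
            "UPDATE doc_history SET valid_to = ? WHERE rowid = ?",
            (as_of, current[0]),
        )
    conn.execute(
        "INSERT INTO doc_history VALUES (?, ?, ?, ?, NULL)",
        (doc_id, attr, value, as_of),
    )
    return True

conn = open_history()
record_value(conn, "purchasing-001", "end_date", "2013-12-31", "2012-01-01")
record_value(conn, "purchasing-001", "end_date", "2013-12-31", "2012-06-01")
record_value(conn, "purchasing-001", "end_date", "2014-12-31", "2012-09-01")

history = conn.execute(
    "SELECT value, valid_from, valid_to FROM doc_history ORDER BY valid_from"
).fetchall()
print(history)
# [('2013-12-31', '2012-01-01', '2012-09-01'), ('2014-12-31', '2012-09-01', None)]
```

Reprocessing a document that has not changed adds no rows, so the table records only genuine changes over time.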
In conclusion, search and text analysis serve different purposes for
processing unstructured data, and both can be effectively leveraged. Search can
be used for early stage data discovery, and text analysis can be used for the
detailed analysis and downstream analytical processing. But remember this:
Do not use search as a substitute for traditional text analytics.