As You Seek – How Search Enables Big Data Analytics
Uploaded by: inside-analysis
Category: Technology
Twitter Tag: #briefr
The Briefing Room
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
June: DATABASE
July: CLOUD
August: HIGH PERFORMANCE ANALYTICS
September: ANALYTICS
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
MarkLogic
• MarkLogic is an enterprise-class NoSQL database company
• Key features of its database include ACID transactions, horizontal scaling, real-time indexing, high availability, disaster recovery, and government-grade security
• Its platform provides full-text query and search capabilities, application services, and big data analytics
David Gorbet
David Gorbet is Vice President of Engineering for MarkLogic, where he also runs the Support organization. Gorbet brings two decades of experience delivering some of the highest-volume applications and enterprise software in the world. Prior to MarkLogic, Gorbet helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. Gorbet holds a Bachelor of Applied Science degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.
Slide 2 Copyright © 2013 MarkLogic® Corporation. All rights reserved.
WE ARE THE NEW GENERATION DATABASE
Any Structure Era: “For all your data!”
• Schema-agnostic
• Massive scale
• Query and search
• Analytics
• Application services
• Faster time-to-results
Relational Era: “For all your structured data!”
• Normalized, tabular model
• Application-independent query
• User control
Hierarchical Era: “For your application data!”
• Application- and hardware-specific
Slide 3
Real Value From Big Data
Make The World More Secure
Provide Access To Valuable Information
Create New Revenue Streams
Gain Insights to Increase Market Share
Reduce Bottom Line Expense
Slide 4
The MarkLogic Advantage
Only Enterprise NoSQL Database
ACID compliant
Big data search
High availability
Replication
Point-in-time recovery
Government-grade security
Real-time your Hadoop
Proven customer success
Slide 5
How Does It Work?
Schema-agnostic design
Real-time indexing and query
Event processing and alerting
Scale-out shared-nothing cluster topology
Analytics and Visualization
High availability and disaster recovery
Slide 6
Hierarchical Data Model
MarkLogic Server is a document-centric database
Supports data of any structure via a hierarchical data model
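[Figure: two example document trees. The first is a Document with Title, Author (First, Last), Section (with nested Sections) and Metadata nodes. The second is a trade document whose node labels are Trade Cashflows, Party Identifier, Net Payment, Payment Date, Party Reference, Payer Party, tradeID, Payment Amount and Receiver Party.]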
Slide 7
MarkLogic is Schema Agnostic
JSON and XML are self-describing
<article>
  <title> MarkLogic Server:… </title>
  <author>
    <first-name> John </first-name>
    <last-name> Doe </last-name>
  </author>
  <abstract>
    . . . . <company> MarkLogic </company> . . . .
  </abstract>
  <body>
    <section>
      <section> . . . . </section>
    </section>
    <section> …index… </section>
  </body>
  <copyright> Copyright © … </copyright>
</article>
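As a rough illustration of what "self-describing" means here, the following Python sketch (not MarkLogic code) walks an XML document with the standard library and collects every element path it finds, without any schema declared up front. The sample document and the helper name `element_paths` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modeled on the slide's <article>.
SAMPLE = """
<article>
  <title>Example title</title>
  <author>
    <first-name>John</first-name>
    <last-name>Doe</last-name>
  </author>
  <body>
    <section><section>nested text</section></section>
  </body>
</article>
"""

def element_paths(xml_text):
    """Return the sorted set of element paths present in the document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(paths)

paths = element_paths(SAMPLE)
for p in paths:
    print(p)
```

The structure is discovered from the document itself, which is the property a schema-agnostic store relies on when it indexes element names and nesting at load time.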
Slide 8
MarkLogic is Schema Agnostic
JSON and XML are self-describing
[Figure: the same <article> document drawn as a tree of elements: title, author (first-name, last-name), abstract (company), body (nested sections, one containing “…index…”), copyright.]
Slide 9
Universal Index
MarkLogic indexes…
• Words
• Phrases
• Stemming
• Structure
• Values
• Collections
• Security Permissions

Term                            Term List (document references)
“brown”                         123, 125, 129, 152, 344, 491, …
“mice”                          123, 125, 126, 129, 130, 152, …
“brown mice”                    125, 152, 516, 522, 765, 890, …
STEM “mouse”                    123, 125, 126, 129, 130, 152, …
STEM “brown mouse”              125, 152, 516, 522, 765, 890, …
<article>                       …
<article>/<abstract>            …
<section>/<paragraph>           …
<animal>mouse</animal>          …
<year>1950</year>               …
Collection:Draft                …
Role:Editor + Action:Read       …
…                               …

Which draft articles contain the phrase brown mice?
Document references: 125, 516, 890, …
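The lookup on this slide can be sketched as a set intersection over term lists. This is a minimal Python illustration, not MarkLogic's implementation. The word and phrase lists are copied from the slide with the trailing "…" dropped; the `collection:Draft` list is elided on the slide, so the IDs used for it below are hypothetical values chosen to reproduce the slide's answer.

```python
# Toy term lists; keys are made-up labels for the slide's index terms.
TERM_LISTS = {
    "word:brown": {123, 125, 129, 152, 344, 491},
    "word:mice": {123, 125, 126, 129, 130, 152},
    "phrase:brown mice": {125, 152, 516, 522, 765, 890},
    "collection:Draft": {125, 130, 516, 890},  # assumed, elided on the slide
}

def resolve(*terms):
    """AND the query terms together by intersecting their term lists."""
    lists = [TERM_LISTS[t] for t in terms]
    return sorted(set.intersection(*lists))

drafts_with_phrase = resolve("phrase:brown mice", "collection:Draft")
print(drafts_with_phrase)  # [125, 516, 890]
```

Because every kind of term (word, phrase, structure, collection, permission) resolves to the same kind of list, a query mixing them is answered with the same intersection step.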
Slide 10
Scalar Queries
Term lists: same universal index entries as on Slide 9.
Document references: 125, 516, 890, …
Which draft articles that contain the phrase brown mice were written before 2010?
Slide 11
Range Indexes
Map document IDs to values, and vice versa, in a compact in-memory representation

Value   ID
2002    3
2003    10
2004    5
2004    11
2007    4
2007    17
2009    1
2011    8
…       …

ID      Value
1       2009
3       2002
4       2007
5       2004
8       2011
10      2003
11      2004
17      2007
…       …
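A minimal Python sketch of the two mappings, using the values shown on the slide. The function name `ids_with_value_below` is made up for illustration, and a binary search over a sorted list merely stands in for whatever in-memory structure the product actually uses.

```python
from bisect import bisect_left

# (value, document ID) pairs sorted by value, as in the slide's first table.
VALUE_TO_ID = [(2002, 3), (2003, 10), (2004, 5), (2004, 11),
               (2007, 4), (2007, 17), (2009, 1), (2011, 8)]
# The inverse mapping, as in the slide's second table.
ID_TO_VALUE = {doc_id: value for value, doc_id in VALUE_TO_ID}

def ids_with_value_below(limit):
    """Return IDs of documents whose value is strictly less than `limit`."""
    values = [value for value, _ in VALUE_TO_ID]
    cut = bisect_left(values, limit)  # first position with value >= limit
    return {doc_id for _, doc_id in VALUE_TO_ID[:cut]}

before_2010 = ids_with_value_below(2010)
print(sorted(before_2010))
print(ID_TO_VALUE[17])
```

A "written before 2010" condition then becomes one more set to intersect with the document references that came out of the term lists.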
Slide 12
Geospatial Index: A 2-Dimensional Range Index
Fully composable with all other indexes!
Built-in support for:
Point
Box
Circle
Polygon
Complex Polygon
Polygon Intersection
Polygon Containment
Slide 13
Reverse Indexes (Alerting)
1. Load serialized queries as query documents
2. For a given data document, find all queries that match
Can provide real-time alerts during loads
With no significant performance impact!
Can let documents store values as "ranges"
Documents about cities self-defining their geo boundaries
Person documents defining birthdays as ranges, sequences
Can power classifiers and "matchmaker" queries
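The two steps above can be sketched in Python as follows. This is a toy model under simple assumptions: a "query" is just a set of words that must all appear in a document, and the class and method names (`ReverseIndex`, `add_query`, `matches`) are hypothetical, not a MarkLogic API.

```python
from collections import defaultdict

class ReverseIndex:
    """Stores queries and finds the ones a new document satisfies."""

    def __init__(self):
        self.required = {}               # query name -> set of required words
        self.by_word = defaultdict(set)  # word -> names of queries using it

    def add_query(self, name, words):
        words = {w.lower() for w in words}
        self.required[name] = words
        for w in words:
            self.by_word[w].add(name)

    def matches(self, text):
        doc_words = set(text.lower().split())
        # Only consider queries that share at least one word with the document.
        candidates = set()
        for w in doc_words:
            candidates |= self.by_word.get(w, set())
        return sorted(n for n in candidates if self.required[n] <= doc_words)

alerts = ReverseIndex()
alerts.add_query("brown-mice", ["brown", "mice"])
alerts.add_query("cats", ["cat"])
hits = alerts.matches("Three brown mice ran away")
print(hits)
```

Indexing the queries by their terms means an incoming document is checked only against the queries it could possibly match, which is why alerting can run during loads.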
Slide 14 Copyright © 2013 MarkLogic® Corporation. All rights reserved.
Value ID
2002 3
2003 10
2004 5
2004 11
2007 4
2007 17
2009 1
2011 8
… …
… …
… …
ID Value
1 2009
3 2002
4 2007
5 2004
8 2011
10 2003
11 2004
17 2007
… …
… …
… …
Range Indexes
Map document IDs to
values, and vice-versa in
a compact in-memory
representation
Range Indexes work like a built-in in-memory column store
Slide 17
In-database Analytic Functions
Leverage ready-made analytic built-ins for commonly used numeric applications
Variance
Covariance
Correlation
Standard deviation
Linear model
Median
Mode
Percentile
Rank
Percent-rank
Benefits
Faster analytics-based application development
Supports more users & more data
Eliminates costs associated with writing custom code
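To make the list of built-ins concrete, here is a small Python sketch that computes a few of the same statistics over one column of values with the standard library. It only illustrates what the functions return; it is not how MarkLogic exposes them, and the sample column is hypothetical.

```python
import math
import statistics

column = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values from one indexed field

summary = {
    "median": statistics.median(column),
    "mode": statistics.mode(column),
    "variance": statistics.pvariance(column),  # population variance
    "stddev": statistics.pstdev(column),       # population standard deviation
}

def correlation(xs, ys):
    """Pearson correlation of two equally long value columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(summary)
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

Running such functions next to the indexed values avoids shipping the raw column to the application just to reduce it to a single number.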
Slide 18
User-defined Functions
// Assumes the MarkLogic UDF SDK header and namespace.
#include "MarkLogic.h"
using namespace marklogic;

class InfluenceRank : public AggregateUDF
{
public:
  // Running totals accumulated across map() and reduce() calls.
  struct Value {
    double sum, sum_sq, count;
    Value() : sum(0), sum_sq(0), count(0) {}
  } value;

  AggregateUDF* clone() const { return new InfluenceRank(*this); }
  void close() { delete this; }

  // Lifecycle hooks; all except start() are only declared on this slide.
  void start(Sequence&, Reporter&) {}
  void finish(OutputSequence& os, Reporter& reporter);
  void map(TupleIterator& values, Reporter& reporter);
  void reduce(const AggregateUDF* _o, Reporter& reporter);
  void encode(Encoder& e, Reporter& reporter);
  void decode(Decoder& d, Reporter& reporter);
};
Slide 19
In-database MapReduce
[Diagram: aggregate call flow across hosts. Step labels: start, encode; decode, map, reduce, encode; decode, reduce, finish.]
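The lifecycle named in the diagram can be imitated in plain Python. The sketch below is an analogy, not the MarkLogic UDF API: it keeps the same `sum`, `sum_sq` and `count` state as the C++ struct and assumes, purely for illustration, that the aggregate being computed is a population variance (the slides do not say what InfluenceRank computes). JSON stands in for the encoder and decoder.

```python
import json

class VarianceAggregate:
    """Python analogue of an aggregate with sum, sum_sq and count state."""

    def __init__(self):
        self.sum = 0.0
        self.sum_sq = 0.0
        self.count = 0

    def map(self, values):
        # Fold one partition's values into the running state.
        for v in values:
            self.sum += v
            self.sum_sq += v * v
            self.count += 1

    def reduce(self, other):
        # Merge another partial result into this one.
        self.sum += other.sum
        self.sum_sq += other.sum_sq
        self.count += other.count

    def encode(self):
        # Serialize the state so it can be shipped between hosts.
        return json.dumps([self.sum, self.sum_sq, self.count])

    @classmethod
    def decode(cls, payload):
        agg = cls()
        agg.sum, agg.sum_sq, agg.count = json.loads(payload)
        return agg

    def finish(self):
        # Population variance from the merged sums.
        mean = self.sum / self.count
        return self.sum_sq / self.count - mean * mean

def run(partitions):
    # Each "host" maps its own partition and encodes the partial result.
    payloads = []
    for part in partitions:
        local = VarianceAggregate()
        local.map(part)
        payloads.append(local.encode())
    # The coordinating host decodes, reduces and finishes.
    total = VarianceAggregate()
    for p in payloads:
        total.reduce(VarianceAggregate.decode(p))
    return total.finish()

result = run([[2, 4, 4], [4, 5, 5], [7, 9]])
print(result)
```

Keeping only mergeable partial state (sums and a count) is what lets the work be split across hosts and combined in any order.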
Slide 20
SQL and BI Tools
ODBC
SQL
Range Indexes
Slide 22
HA/DR Features of MarkLogic
Needs expansion:
• How local-disk/shared-disk failover works
• How Flexrep works
• How DBRep works
Slide 23
MarkLogic 6
Powerful: Everything you need to deliver business value
• Flexible Indexes
• Full Text Search
• Schema-Agnostic
• Scalable
• Analytic Functions
• Hadoop Distribution
• Alerting & Event Processing
• Geospatial Query
• In-database MapReduce
• Visualization Widgets
Trusted: Enterprise-ready for mission-critical apps
• Transactions
• Role-based Security
• Automated Failover
• Replication
• Journal Archiving
• Point-in-time Recovery
• Database Rollback
• Backup/Restore
• Distributed Transactions
• Super-clusters
Accessible: Leverage existing tools, knowledge, skills
• REST & Java APIs
• JSON Storage
• Application Builder
• Information Studio
• Hadoop Connector
• Content Pump
• BI Integration
• SQL Support
• Monitoring & Management
• OS Support
Slide 25
What is Semantics Technology?
Slide 26
Elasticity
Aligning infrastructure + demand, continually
• New tools to characterize and monitor the resource requirements of your applications and loads.
• A dynamic provisioning system that can add or remove resources on the fly to match the load.
• Distributed and virtualized environments, including VMware, Amazon AWS and Hadoop, are supported for scale-out.
• Make the cloud a first-class citizen: use Hadoop HDFS or Amazon S3 for backup.
Slide 27
Tiered storage
[Diagram: ML on top of storage tiers labeled SSD, local, HDFS and amzn s3.]
Benefits
• Keep data on tiers appropriate to access needs = lower costs
• Detach and reattach storage when needed. Fewer compute nodes required = lower costs
• Leverage Hadoop HDFS investment
• Choose infrastructure based on value of data stored.
• 100% online with different tiers at different SLAs/topologies
• On-line/near-line mix utilizing mount on-demand and dynamic node spin-up.
Tiered Storage New Constructs
• Range partitions by Date/Scalar manage group of forests by range (“Q1” or “1990-1995”)
• Super Databases federate queries across multiple databases
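The two constructs can be pictured with a short Python sketch. Everything in it is hypothetical: the partition boundaries, the tier assigned to each range, and the function names `partition_for` and `federated_search` are illustrations of the idea, not MarkLogic configuration or API.

```python
from datetime import date

# Hypothetical partitions: (name, first day, last day, storage tier).
PARTITIONS = [
    ("1990-1995", date(1990, 1, 1), date(1995, 12, 31), "amzn s3"),
    ("1996-2005", date(1996, 1, 1), date(2005, 12, 31), "HDFS"),
    ("2006-2013", date(2006, 1, 1), date(2013, 12, 31), "SSD"),
]

def partition_for(day):
    """Return (name, tier) of the range partition that covers `day`."""
    for name, start, end, tier in PARTITIONS:
        if start <= day <= end:
            return name, tier
    raise ValueError(f"no partition covers {day}")

def federated_search(databases, predicate):
    """Run one query against several databases and merge the hits."""
    hits = []
    for db_name, docs in databases.items():
        hits.extend((db_name, doc) for doc in docs if predicate(doc))
    return hits

placement = partition_for(date(2001, 5, 17))
found = federated_search(
    {"hot": ["brown mice", "grey cat"], "archive": ["brown bear"]},
    lambda doc: "brown" in doc,
)
print(placement)
print(found)
```

Routing by range decides where a document lives; the federated query hides that placement from whoever is searching.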
Slide 28
Tiered Storage
[Chart: Operational, Analytic and Compliance tiers compared by Total Size (TB), Total Cost ($000) and Effective Unit Cost ($/GB). Values shown on the chart: sizes 96, 504, 1,044; costs 592, 2,066, 2,080; unit costs $25, $4, $1.50.]
The Bloor Group
Database Innovation
Database used to be a “zero-innovation market.” Now it is the opposite.
• Traditional (relational) database is now seen (rightly) as inadequate in many respects
• Big Data is, mainly, new data posing new problems
• New products are emerging and some older products are being given a make-over (and gaining popularity)
• Hadoop has changed perceptions and thinking about database
NoSQL Confusion
As the graph indicates, NoSQL is a very confusing descriptor.
The important question is: WHAT CAN A GIVEN DATABASE ACTUALLY DO?
The Joys and Sorrows of SQL
SQL:
• Very good for set manipulation
• Works for OLTP and many query environments
• Not good for nested data structures (documents, web pages, etc.)
• Not good for ordered data sets
• Not good for data graphs (networks of values)
• In my view we have reached a situation where there will be multiple “data engines.” Is that MarkLogic’s view?
• Specifically, are there data structures or database contexts for which MarkLogic is inappropriate?
• What new features or capabilities are on the MarkLogic roadmap?
• In your view, is the “age of the data warehouse” over?
• Which sectors/businesses are currently in MarkLogic’s “sweet spot”?
• Data analytics involves much more than having analytical functions in the database. It is more than 50% data prep (merging, cleansing, joining, transformation, etc.). How does MarkLogic accommodate that?
• What is MarkLogic’s attitude to the cloud? Specifically, where would it recommend cloud deployment?
Upcoming Topics
July: CLOUD
August: HIGH PERFORMANCE ANALYTICS
September: ANALYTICS
www.insideanalysis.com