8/6/2019 A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene
A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene
Ed Bueche, EMC
[email protected], May 25, 2011
Agenda
  My Background
  Documentum xPlore Context and History
  Overview of Documentum xPlore
  Tips and Observations on IO and Host Virtualization
My Background
  Ed Bueche
  Information Intelligence Group within EMC
  EMC Distinguished Engineer & xPlore Architect
  Areas of expertise
    Content Management (especially performance & scalability)
    Database (SQL and XML) and Full text search
  Previous experience: Sybase and Bell Labs
  Part of the EMC Documentum xPlore development team
    Pleasanton (CA), Grenoble (France), Shanghai, and Rotterdam (Netherlands)
Documentum search 101
Documentum Content Server provides an object/relational data model and query language
  Object metadata called attributes (sample: title, subject, author)
  Sub-types can be created with customer-defined attributes
  Documentum Query Language (DQL)
Example:
    SELECT object_name FROM foo
    WHERE subject = 'bar' AND customer_id = 'ID1234'
DQL also supports full text extensions
Example:
    SELECT object_name FROM foo
    SEARCH DOCUMENT CONTAINS 'hello world'
    WHERE subject = 'bar' AND customer_id = 'ID1234'
Introducing Documentum xPlore
  Provides Integrated Search for Documentum but is built as a standalone search engine to replace FAST InStream
  Built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software
Documentum Search History-at-a-glance
Almost 15 years of Structured/Unstructured integrated search
Verity Integration, 1996 to 2005
  Basic full text search through DQL
  Basic attribute search
  1 day to 1 hour latency
  Embedded implementation
FAST Integration, 2005 to 2011
  Combined structured / unstructured search
  2 to 5 min latency
  Score ordered results
xPlore Integration, 2010 to ???
  Replaces FAST in DCTM
  Integrated security
  Deep facet computation
  HA/DR improvements
  Latency: typically seconds
  Improved Administration
  Virtualization Support
[Timeline: 1996, 2005, 2010]
Enhancing Documentum Deployments with Search
Without Full Text in a Documentum deployment, a DQL query will be directed to the RDBMS
  DQL is translated into SQL
However, relational querying has many limitations.
[Diagram: DCTM client --DQL--> Content Server --SQL--> RDBMS (search)]
Some Basic Design Concepts behind Documentum xPlore
Inverted Indexes are not optimized for all use-cases
  B+-tree indexes can be far more efficient for simple, low-latency/highly dynamic scenarios
De-normalization can't efficiently solve all problems
  Update propagation problem can be deadly
  Joins are a necessary part of most applications
Applications need fine control over not only search criteria, but also result sets
Design concepts (cont)
Applications need fluid, changing metadata schemas that can be efficiently queried
  Adding metadata through joins with side-tables can be inefficient to query
Users want the power of Information Retrieval on their structured queries
Data Management, HA, DR shouldn't be an after-thought
When possible, operate within standards
Lucene is not a database. Most Lucene applications deploy with databases.
Lessons Learned
[Diagram labels: Structured Query use-cases; Unstructured Query use-cases; Fit to use-case]
Indexes, DB, and IR
[Diagram labels:
  Structured Query use-cases
  Unstructured Query use-cases
  Relational DB technology
  Fit to use-case
  Scoring, Relevance, Entities
  Hierarchical data representations (XML)
  Full Text searches
  Constantly changing schemas]
Documentum xPlore
Bring best-of-breed XML Database with powerful Apache Lucene Fulltext Engine
Provides structured and unstructured search leveraging XML and XQuery standards
Designed with Enterprise readiness, scalability and ingestion
Advanced Data Management functionality necessary for large scale systems
Industry leading linguistic technology and comprehensive format filters
Metrics and Analytics
[Architecture diagram labels:
  xPlore API
  Search Services
  Indexing Services
  Content Processing Services
  Node & Data Management Services
  Admin Services
  Analytics
  xDB API
  xDB Query Processing & Optimization
  xDB Transaction, Index & Page Management]
Libraries / Collections & Indexes
[Diagram: libraries A, B and C, with B and C drawn beneath A. Annotation: scope of index covers all xml files in all sub-libraries.]
Legend:
  xDB segment
  xDB Library / xPlore collection
  xDB Index
  xDB xml file (dftxml, tracking xml, status, metrics, audit)
Lucene Integration
Transactional
  Non-committed index updates in separate (typically in memory) Lucene indexes
  Recently committed (but dirty) indexes backed by xDB log
  Query to index leverages Lucene multi-searcher with filter to apply update/delete blacklisting
Lucene indexes managed to fit into xDB's ARIES-based recovery mechanism
No changes to Lucene
  Goal: no obstacles to be as current as possible
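
The multi-searcher-with-blacklist idea above can be illustrated with a toy model. This is only a sketch in plain Python, not xPlore or Lucene code: ToyIndex, multi_search, and the sample documents are hypothetical names invented for illustration. It assumes an update is handled by blacklisting the old committed entry and indexing the new version in the pending (in-memory) index, and a delete by blacklisting alone.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Toy inverted index (hypothetical): term -> set of document ids."""
    postings: dict = field(default_factory=dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


def multi_search(term, committed, pending, blacklist):
    """Search both indexes; hide committed entries that were updated or deleted."""
    visible_committed = committed.search(term) - blacklist
    return visible_committed | pending.search(term)


# Committed index holds two documents.
committed = ToyIndex()
committed.add("doc1", "quarterly report")
committed.add("doc2", "annual report")

pending = ToyIndex()   # stands in for the separate in-memory index
blacklist = set()      # doc ids whose committed entries must be ignored

# Update doc1: blacklist the old version, index the new text as pending.
blacklist.add("doc1")
pending.add("doc1", "quarterly summary")

# Delete doc2: blacklist only.
blacklist.add("doc2")

print(multi_search("report", committed, pending, blacklist))   # set()
print(multi_search("summary", committed, pending, blacklist))  # {'doc1'}
```

The point of the design is that the committed index is never rewritten on the query path; stale entries are filtered out at search time.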
Lucene Integration (cont)
Both value and full text queries supported
  XML elements mapped to Lucene fields
  Tokenized and value-based fields available
Composite key queries supported
  Lucene much more flexible than traditional B-tree composite indexes
ACL and Facet information stored in Lucene field array
  Documentum's ACL security model is highly complex and potentially dynamic
  Enables secure facet computation
xPlore has Lucene search engine capabilities plus...
XQuery provides powerful query & data manipulation language
  A typical search engine can't even express a join
  Creation of arbitrary structure for result set
  Ability to call to language-based functions or Java-based methods
Ability to use B-tree based indexes when needed
  xDB optimizer decides this
Transactional update and recovery of data/index
Hierarchical data modeling capability
Tips and Observations on IO and Host Virtualization
Virtualization offers huge savings for companies through consolidation and automation
Both Disk and Host virtualization available
However, there are pitfalls to avoid
  One-size-fits-all
  Consolidation contention
  Availability of resources
Tip #1: Don't assume that one-size-fits-all
Most IT shops will create VM or SAN templates that have a fixed resource consumption
  Reduces admin costs
  Example: Two CPU VM with 2 GB of memory
  Deviations from this must be made in a special request
Recommendations:
  Size correctly, don't accept insufficient resources
  Test pre-production environments
EMC Symmetrix: Nondisruptive Mobility
Virtual LUN VP Mobility
  Fast, efficient mobility
  Maintains replication and quality of service during relocations
  Supports up to thousands of concurrent VP LUN migrations
Recommendation: work with storage technicians to ensure backend storage has sufficient I/O
[Diagram: a VLUN placed across Virtual Pools. Labels: Flash 400 GB RAID 5; Tier 2; Fibre Channel 600 GB 15K RAID 1; SATA 2 TB RAID 6]
Tip #2: Consolidation Contention
Virtualization provides benefit from consolidation
Consolidation provides resources to the active
  Your resources can be consumed by other VMs, other apps
  Physical resources can be over-stretched
Recommendations:
  Track actual capacity vs. planned
    VMware: track number of times your VM is denied CPU
    SANs: track % I/O utilization vs. number of I/Os
  For VMware, leverage guaranteed minimum resource allocations and/or allocate to non-overloaded HW
Some VMware statistics
Ready metric
  Generated by vCenter and represents the number of cycles (across all CPUs) in which VM was denied CPU
  Generated in milliseconds and real-time sample happens at best every 20 secs
  For interactive apps: As a percentage of offered capacity > 10% is considered worrisome
Pages-in, Pages-out
  Can indicate over subscription of memory
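
As a rough illustration of how a millisecond Ready sample can be read as a percentage, the sketch below divides the ready time by the length of the sample interval. The 20 second interval and the 10% threshold come from the slide; the function names and the sample values are hypothetical, and the conversion ignores the number of virtual CPUs, which is an assumption that should be checked against how the counter is aggregated in a given environment.

```python
def ready_percent(ready_ms, interval_s=20.0):
    """Express a Ready sample (milliseconds) as a percentage of the sample interval.

    Assumption: the sample covers exactly interval_s seconds and is not
    normalized per virtual CPU.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms / (interval_s * 1000.0) * 100.0


def is_worrisome(ready_ms, interval_s=20.0, threshold_pct=10.0):
    """Apply the slide's rule of thumb for interactive apps (> 10%)."""
    return ready_percent(ready_ms, interval_s) > threshold_pct


# Hypothetical samples, in milliseconds of denied CPU per 20 second interval.
samples_ms = [400, 1200, 2600, 3100]
for ms in samples_ms:
    flag = "worrisome" if is_worrisome(ms) else "ok"
    print(f"{ms} ms -> {ready_percent(ms):.1f}% ({flag})")
```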
Sample %Ready for a production VM with xPlore deployment for an entire week
[Chart: %Ready over one week, y-axis from 0% to 16%. Annotations: "official area that indicates pain"; "In this case avg resp time doubled and max resp time grew by 5x"]
Some Subtleties with Interactive CPU denial
The Ready metric represents denial upon demand
  Interactive workloads can be bursty
  If no demand, then Ready counter will be low
Poor user response encourages less usage
  Like walking on a broken leg
  Causing less Ready samples
[Diagram: a denial spike within a 20 sec interval]
Sharing I/O capacity
If multiple VMs (or servers) are sharing the same underlying physical volumes and the capacity is not managed properly
  then the available I/O capacity of the volume could be less than the theoretical capacity
This can be seen if the OS tools show that the disk is very busy (high utilization) while the number of I/Os is lower than expected
[Diagram: Volume for Lucene application; Volume for other application. Both volumes spread over the same set of drives and effectively sharing the I/O capacity]
Recommendations on diagnosing disk I/O related issues
On Linux/UNIX
  Have IT group install SAR and IOSTAT
  Also install a disk I/O testing tool (like Bonnie)
  Compare Bonnie output with SAR & IOSTAT data
    High disk utilization at much lower achieved rates could indicate contention from other applications
  Also, high SAR I/O wait time might be an indication of slow disks
On Windows
  Leverage the Windows Performance Monitor
  Objects: Processor, Physical Disk, Memory
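
The comparison described above (busy disk, but far fewer I/Os than a benchmark achieved) can be written down as a simple check. This is a minimal sketch: the function name and both thresholds (80% utilization, half of the baseline rate) are hypothetical choices for illustration, not values from the deck. The inputs would come from a tool such as iostat (utilization and observed I/Os per second) and from a benchmark baseline such as the Bonnie random I/O rate.

```python
def contention_suspected(util_pct, observed_iops, baseline_iops,
                         busy_pct=80.0, shortfall_ratio=0.5):
    """Return True when the disk looks busy but delivers far fewer I/Os than the baseline.

    busy_pct and shortfall_ratio are illustrative thresholds (assumptions).
    """
    if baseline_iops <= 0:
        raise ValueError("baseline_iops must be positive")
    busy = util_pct >= busy_pct
    underdelivering = observed_iops < shortfall_ratio * baseline_iops
    return busy and underdelivering


# Baseline of 735 random I/Os per second taken from the Bonnie sample in this deck;
# the utilization and observed rates below are made-up observations.
print(contention_suspected(util_pct=95, observed_iops=200, baseline_iops=735))
print(contention_suspected(util_pct=95, observed_iops=700, baseline_iops=735))
print(contention_suspected(util_pct=30, observed_iops=200, baseline_iops=735))
```

A positive result is only a hint to go and talk to the storage team; high utilization with low throughput can also come from slow disks, as noted above.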
Sample output from the Bonnie tool
Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install.
    bonnie -s 1024 -y -u -o_direct -v 10 -p 10
This will increase the size of the file to 2 Gb. Examine the output. Focus on the random I/O area:

              ------Sequential Output (sync)------  ---Sequential Input---  --Rnd Seek--
              -CharUnlk-  -DIOBlock-   -DRewrite-   -CharUnlk-  -DIOBlock-  --04k (10)--
Machine MB    K/sec %CPU  K/sec  %CPU  K/sec %CPU   K/sec %CPU  K/sec %CPU   /sec  %CPU
Mach2 10*2024 73928 97    104142 5.3   26246 2.9    8872  22.5  43794 1.9    735.7 15.2

  -s 1024 means that 2 GB files will be created
  -o_direct means that direct I/O (by-passing buffer cache) will be done
  -v 10 means that 10 different 2 GB files will be created
  -p 10 means that 10 different threads will query those files
This output means that the random read test saw 735 random I/Os per sec at 15% CPU busy
IO / caching test use-case
Unselective Term search
  100 sample queries
  Avg(hits per term) = 4,300+, max ~ 60,000
  Searching over 100s of DCTM object attributes + content
Medium result window
  Avg(results returned per query) = 350 (max: 800)
Stored Fields Utilized
  Some security & facet info
Goal:
  Pre-cache portions of the index to improve response time in scenarios
    Reboot, buffer cache contention, & vm memory contention
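
One generic way to pre-cache part of a file is simply to read that byte range sequentially so the operating system can keep the pages in its file buffer cache. The sketch below shows that idea only; it is not the mechanism xPlore uses, warm_range is a hypothetical helper, and locating the stored-fields portion inside a real index file is left out. The OS is free to evict the pages again later.

```python
import os
import tempfile


def warm_range(path, offset=0, length=None, chunk_size=1024 * 1024):
    """Sequentially read a byte range so the OS may keep it in its file cache.

    Returns the number of bytes actually read.
    """
    size = os.path.getsize(path)
    end = size if length is None else min(size, offset + length)
    total = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while offset + total < end:
            data = f.read(min(chunk_size, end - offset - total))
            if not data:
                break
            total += len(data)
    return total


# Demo on a throwaway file standing in for an index file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 5000)

print(warm_range(tmp.name, offset=1000, length=2000, chunk_size=512))  # 2000
os.remove(tmp.name)
```

Reading in large sequential chunks is the reason pre-caching can be cheap compared with the many random reads a cold query suite would otherwise issue.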
Some xPlore Structures for Search
[Diagram of structures:
  Dictionary of terms
  Posting list (doc-ids for term)
  Stored fields (facets and node-ids), from 1st doc to N-th doc
  Facet decompression map
  Security indexes (b-tree based)
  xDB XML store (contains text for summary)]
Frequency and position structures ignored for simplicity
IO model for search in xPlore
[Diagram labels, in order of appearance:
  Search Term: term1 term2
  Dictionary
  Posting list (doc-ids for term)
  Stored fields: xDB node-id plus facet / security info
  Security lookup (b-tree based)
  xDB XML store (contains text for summary)
  Result set
  Facet decompression map]
Separation of covering values in stored fields and summary
[Diagram labels:
  Security lookup
  Potentially thousands of hits
  Stored fields (random access)
  Potentially thousands of results
  Facet calc: final facet calc values over thousands of results
  Small structure
  Xdb docs with text for summary: Res-1 - sum, Res-2 - sum, Res-3 - sum, ..., Res-350 - sum
  Small number for result window]
xPlore Memory Pool areas at-a-glance
[Diagram of memory areas:
  xPlore Instance (fixed size) memory
    xDB Buffer Cache
    Lucene Caches & working memory
    xPlore caches
    Other vm working mem
  Operating System File Buffer cache (dynamically sized)
  Native code content extraction & linguistic processing memory]
Lucene data resides primarily in OS buffer cache
[Diagram of memory areas:
  xPlore Instance (fixed size) memory
    xDB Buffer Cache
    Lucene Caches & working memory
    xPlore caches
    Other vm working mem
  Operating System File Buffer cache (dynamically sized)
  Native code content extraction & linguistic processing memory
 Structures shown alongside:
  Dictionary of terms
  Posting list (doc-ids for term)
  Stored fields (facets and node-ids), from 1st doc to N-th doc
  xDB XML store (contains text for summary)]
Potential for many things to sweep Lucene from that cache
Test Env
32 GB memory
Direct attached storage (no SAN)
1.4 million documents
Lucene index size = 10 GB
Size of internal parts of Lucene CFS file
  Stored fields (fdt, fdx): 230 MB (2% of index)
  Term Dictionary (tis, tii): 537 MB (5% of index)
  Positions (prx): 8.78 GB (80% of index)
  Frequencies (frq): 1.4 GB (13% of index)
Text in xDB stored compressed separately
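
The percentages above can be reproduced from the listed sizes. The short sketch below does the arithmetic, assuming 1 GB is counted as 1000 MB (an assumption; the deck does not say which unit convention was used) and that the four listed parts make up the whole index.

```python
# Sizes from the Test Env list, in MB (1 GB taken as 1000 MB: an assumption).
components_mb = {
    "stored fields (fdt, fdx)": 230,
    "term dictionary (tis, tii)": 537,
    "positions (prx)": 8780,
    "frequencies (frq)": 1400,
}


def shares(parts_mb):
    """Return each part's share of the total, rounded to a whole percent."""
    total = sum(parts_mb.values())
    return {name: round(100.0 * mb / total) for name, mb in parts_mb.items()}


for name, pct in shares(components_mb).items():
    print(f"{name}: {pct}% of index")
```

The takeaway is that the stored fields, which the query suite touches for every result, are only about 2% of the index by size.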
Some results of the query suite

Test                   Avg resp to consume   MB pre-cached   I/O per result   Total MB loaded into
                       all results (sec)                                      memory (cached + test)
Nothing cached         1.89                  0               0.89             77
Stored fields cached   0.95                  241             0.38             272
Term dict cached       1.73                  537             0.79             604
Positions cached       1.58                  8,789           0.74             8,800
Frequencies cached     1.65                  1,406           0.63             1,436
Entire index cached    0.59                  10,970          < 0.05           10,970

Linux buffer cache cleared completely before each run
Resp as seen by final user in Documentum
Facets not computed in this example. Just a result set returned. With Facets response time difference more pronounced.
Mileage will vary depending on a series of factors that include query complexity, compositions of the index, and number of results consumed
Other Notes
Caching 2% of index yields a response time that is only 60% greater than if the entire index was cached.
  Caching cost only 9 secs on a mirrored drive pair
  Caching cost 6800 large sequential I/Os vs. potentially 58,000 random I/Os
Mileage will vary, factors include
  Phrase search
  Wildcard search
  Multi-term search
SANs can grow I/O capacity as search complexity increases
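
The "only 60% greater" figure can be checked against the average response times reported in the query suite results (1.89 s with nothing cached, 0.95 s with stored fields cached, 0.59 s with the entire index cached). The helper name below is hypothetical; the numbers are the ones from the results table.

```python
def pct_slower(resp_s, best_s):
    """How much slower resp_s is than best_s, as a percentage."""
    return (resp_s / best_s - 1.0) * 100.0


# Average response times (seconds) from the query suite results.
results_s = {
    "Nothing cached": 1.89,
    "Stored fields cached": 0.95,
    "Entire index cached": 0.59,
}

best = results_s["Entire index cached"]
for name, resp in results_s.items():
    print(f"{name}: {pct_slower(resp, best):.0f}% slower than fully cached")
```

With stored fields cached the suite is roughly 61% slower than the fully cached case, compared with roughly 220% slower when nothing is cached.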
Contact
Ed Bueche
[email protected]
http://community.emc.com/people/Ed_Bueche/blog
http://community.emc.com/docs/DOC-8945