ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!
So you think you can crawl? Stretching the Boundaries of SharePoint 2013!
Petter Skodvin-Hvammen
AD-Gruppen, Norway
Who am I?
Petter Skodvin-Hvammen
Oseberg ship - Discovered 1904 in Tønsberg, Norway. Buried by Vikings in 834 AD
• Solutions Architect
• SharePoint Consultant
• Search Enthusiast
• Community Lead
@pettersh - [email protected]
www.adgruppen.no
Enterprise Search
Index thousands of sources
Automate index management
Infrastructure sizing
Challenges and Solutions
Not Included: code/scripts, user experience, relevancy, governance
www.sharepointeurope.com
Enterprise Search using SharePoint Server 2013
• 30,000 users
• 85 locations in 30 countries
• 15,000 daily searches
• 100,000,000 documents(?)
• 60 core systems, 2,000 applications
The Mission…
What do we index?
• 100,000,000 documents
• 3,000 file shares
• 500 servers
Where is the data?
• Datacenters
• Time zones
• Bandwidth
How can we get it?
• Limit bandwidth usage for specific server locations
• Limit crawler impact within local business hours
• Grant read access to crawler per file share
• Avoid token bloat issues with more than 1,015* groups per account
* http://blogs.technet.com/b/shanecothran/archive/2010/07/16/maxtokensize-and-kerberos-token-bloat.aspx
How do we operate it?
• File shares are created, changed, and deleted every day using a custom self service solution
• File shares are moved between servers every day by automation rules
• Manage indexing and crawling of each file share with minimum manual effort
What can SharePoint do?
• Max 50 content sources per service application
  – Max 500 with October 2013 CU installed
• Max 100 start addresses per content source
  – Max 500 with October 2013 CU installed
• Max 20 concurrent crawls per service application
  – Limitation has been removed
http://technet.microsoft.com/en-us/library/cc262787(v=office.15).aspx#Search
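As a rough check, the limits above can be compared with the number of file shares to index. This is a minimal Python sketch that assumes one start address per file share; the function names and the "baseline" label are illustrative and not part of any SharePoint API.

```python
import math

# Limits per Search service application as quoted on the slide.
# "baseline" is an illustrative label for a farm without the October 2013 CU.
LIMITS = {
    "baseline": {"content_sources": 50, "start_addresses": 100},
    "oct_2013_cu": {"content_sources": 500, "start_addresses": 500},
}

def content_sources_needed(file_shares: int, level: str) -> int:
    """Minimum number of content sources if every share is its own start address."""
    return math.ceil(file_shares / LIMITS[level]["start_addresses"])

def fits(file_shares: int, level: str) -> bool:
    """True if the shares fit within the content source limit at this level."""
    return content_sources_needed(file_shares, level) <= LIMITS[level]["content_sources"]

print(content_sources_needed(3000, "baseline"), fits(3000, "baseline"))
```

Raw capacity is only part of the problem: the shares also have to be grouped by region, impact rule and crawl account, and they change every day.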
It’s complicated
• More data than we have space for
• It’s located all over the place
• Everything changes all of the time
• There are limitations in SharePoint
• Someone’s gotta maintain this
• It has to be secure and relevant
What did we do?
• Created logical groups of file shares
• Used symbolic linking
fewer content sources
Start address: \\file00\share
\\file00\share\sym01 -> \\file01\share01
\\file00\share\sym02 -> \\file02\share03
\\file00\share\sym03 -> \\file03\share03
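The symbolic linking idea can be sketched in Python as follows. The share names come from the slide; the `symNN` naming scheme and both function names are assumptions for illustration, and the real solution was a custom SharePoint timer job rather than this script.

```python
import os

def plan_links(root: str, shares: list) -> dict:
    """Map each link path under the single start address to the share it points at."""
    return {f"{root}\\sym{i:02d}": share for i, share in enumerate(shares, start=1)}

def create_links(plan: dict) -> None:
    """Create any directory symlinks that are missing (needs suitable privileges)."""
    for link, target in plan.items():
        if not os.path.lexists(link):
            os.symlink(target, link, target_is_directory=True)

plan = plan_links(r"\\file00\share", [
    r"\\file01\share01",
    r"\\file02\share03",
    r"\\file03\share03",
])
for link, target in plan.items():
    print(link, "->", target)
```

Only `plan_links` runs here; `create_links` would be called on the server that hosts the start address folder.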
What did we do?
• Grouped file shares based on region
• One content source per region
• Incremental crawls every night
crawling based on time zones
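A minimal sketch of the time zone reasoning, in Python: a crawl for a region should start only when that region is outside local business hours. The UTC offsets and the 08:00 to 17:00 window are assumptions, and weekends and daylight saving time are ignored.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed UTC offsets per region; real values depend on the sites.
REGION_UTC_OFFSET = {"europe": 1, "asia": 8}

def in_business_hours(region: str, now_utc: datetime, start: int = 8, end: int = 17) -> bool:
    """True if the region's local time falls inside the assumed business hours."""
    local = now_utc + timedelta(hours=REGION_UTC_OFFSET[region])
    return start <= local.hour < end

now = datetime(2014, 5, 5, 2, 0, tzinfo=timezone.utc)
for region in REGION_UTC_OFFSET:
    print(region, "busy" if in_business_hours(region, now) else "ok to crawl")
```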
What did we do?
• Created DNS alias per impact rule in etc/hosts on crawl servers
reduced crawler impact
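Crawler impact rules are defined per host name, so giving the same server several aliases presumably lets a different rule apply to each alias. The Python sketch below only generates the hosts file lines; the IP address is hypothetical, and the alias names `default` and `wait-60` are the ones used in the start addresses in this deck.

```python
def hosts_entries(server_ip: str, impact_rules: list) -> list:
    """Return one hosts-file line per impact rule alias, all pointing at one server."""
    return [f"{server_ip}\t{alias}" for alias in impact_rules]

# 10.0.0.10 is a placeholder address for the server hosting the symlink folders.
for line in hosts_entries("10.0.0.10", ["default", "wait-60"]):
    print(line)
```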
What did we do?
• Granted file share access to the account included in the fewest groups
• Monitored group memberships
• Grouped file shares by crawl account
• Crawl rules matched folder structure
managed pool of crawl accounts
Crawl rule              Type     Crawl account
file://.*/spcrwl01/.*   Include  SP\spcrwl01
file://.*/spcrwl02/.*   Include  SP\spcrwl02
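The account pool logic might look like the following Python sketch. It assumes that granting access to a share adds one group to the chosen account, and the group counts in the usage lines are invented; the crawl rule patterns and account names are the ones shown on the slide.

```python
import re
from typing import Optional

MAX_GROUPS = 1015  # token bloat threshold quoted earlier in the deck

# Crawl rule pattern -> crawl account, as on the slide.
CRAWL_RULES = {
    r"file://.*/spcrwl01/.*": r"SP\spcrwl01",
    r"file://.*/spcrwl02/.*": r"SP\spcrwl02",
}

def pick_account(group_counts: dict) -> str:
    """Return the account that is a member of the fewest groups."""
    account, count = min(group_counts.items(), key=lambda item: item[1])
    if count >= MAX_GROUPS:
        raise RuntimeError("every crawl account is at the group limit")
    return account

def account_for(url: str) -> Optional[str]:
    """Return the crawl account whose include rule matches the URL, if any."""
    for pattern, account in CRAWL_RULES.items():
        if re.fullmatch(pattern, url):
            return account
    return None

print(pick_account({r"SP\spcrwl01": 640, r"SP\spcrwl02": 212}))
print(account_for("file://default/europe/default/spcrwl01/e5dc12a41d"))
```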
The bigger picture
• Folder structure: <content source>/<crawler impact>/<crawl account>/<symbolic link>
• Start addresses: file://<crawler impact>/<content source>/<crawler impact>
Source  Start address                  Folder                   Crawl rule             Impact rule
Europe  file://default/europe/default  europe/default/spcrwl01  file://.*/spcrwl01/.*  Default
Europe  file://default/europe/default  europe/default/spcrwl02  file://.*/spcrwl02/.*  Default
Europe  file://wait-60/europe/wait-60  europe/wait-60/spcrwl01  file://.*/spcrwl01/.*  Wait-60
Europe  file://wait-60/europe/wait-60  europe/wait-60/spcrwl02  file://.*/spcrwl02/.*  Wait-60
Asia    file://default/asia/default    asia/default/spcrwl01    file://.*/spcrwl01/.*  Default
Asia    file://default/asia/default    asia/default/spcrwl02    file://.*/spcrwl02/.*  Default
Asia    file://wait-60/asia/wait-60    asia/wait-60/spcrwl01    file://.*/spcrwl01/.*  Wait-60
Asia    file://wait-60/asia/wait-60    asia/wait-60/spcrwl02    file://.*/spcrwl02/.*  Wait-60
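The naming convention above is regular enough to generate. A minimal Python sketch, with function names chosen for illustration, that derives the start address, folder and crawl rule for every combination of content source, impact rule and crawl account:

```python
from itertools import product

def start_address(source: str, impact: str) -> str:
    """file://<crawler impact>/<content source>/<crawler impact>"""
    return f"file://{impact}/{source}/{impact}"

def folder(source: str, impact: str, account: str) -> str:
    """<content source>/<crawler impact>/<crawl account>"""
    return f"{source}/{impact}/{account}"

def crawl_rule(account: str) -> str:
    """Include rule matching every path below the account's folder."""
    return f"file://.*/{account}/.*"

def layout(sources: list, impacts: list, accounts: list) -> list:
    """One (start address, folder, crawl rule, impact rule) row per combination."""
    return [
        (start_address(s, i), folder(s, i, a), crawl_rule(a), i)
        for s, i, a in product(sources, impacts, accounts)
    ]

for row in layout(["europe", "asia"], ["default", "wait-60"], ["spcrwl01", "spcrwl02"]):
    print(*row, sep="  ")
```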
How did we manage this?
AUTOMATION
• Self service portal for enabling indexing of file shares
• Custom web service integration in self service portal
• Custom solution for granting access to crawl accounts
• Custom timer job to get list of file shares to crawl from self service portal
• Custom timer job for creating and removing symbolic links
• Custom lists for mapping server to content source, schedule and impact, shares to crawl accounts and metadata, UNC to symlink
• Content enrichment service for replacing symlinks in paths with actual file paths
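The path replacement done by the content enrichment service can be sketched in Python as a prefix lookup. The mapping dictionary stands in for the custom "UNC to symlink" list and uses the example values from this deck; the function name and the sample document path are hypothetical, and the real service is a SharePoint content enrichment web service rather than this code.

```python
# Symlink path -> actual UNC path (stand-in for the custom mapping list).
SYMLINK_TO_UNC = {
    r"\\default\europe\default\spcrwl01\e5dc12a41d": r"\\file01\share01",
}

def resolve_path(path: str, mapping: dict) -> str:
    """Replace a leading symlink path with the UNC path it points to."""
    lowered = path.lower()
    for link, unc in mapping.items():
        prefix = link.lower()
        # Match the link itself or anything below it, case-insensitively.
        if lowered == prefix or lowered.startswith(prefix + "\\"):
            return unc + path[len(link):]
    return path  # not under a known symlink: leave unchanged

crawled = r"\\default\europe\default\spcrwl01\e5dc12a41d\docs\plan.docx"
print(resolve_path(crawled, SYMLINK_TO_UNC))
```

Matching on the full link plus a separator avoids rewriting paths that merely share a prefix with a link name.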
Example: Self Service Portal
Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: Assigned automatically
Crawl Account: Assigned automatically
Cancel / Save
Example: Custom Lists
Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: \\file01\share01
Crawl Account: SP\spcrwl01
Symlink: \\default\europe\default\spcrwl01\e5dc12a41d
Location: europe (server file01 is located in Oslo DC)
Bandwidth: 5 Mbps
[Diagram: search farm topology. Server boxes are labelled with the components Query, WFE, Index-0 to Index-3 (each shown twice), Doc Proc, Enrichment, Crawling, Analytics, Admin, Central Admin and Caching (shown twice). SQL Server (shown twice) hosts the Admin DB, Analytics DB, Crawl DB, Link DB and other SP DBs. Callouts: 40 million documents, 10 queries/second.]
Capacity testing
Purpose
• Crawling of symbolic links
• Scaling of virtual machines
• Sizing of disk space
• Verify Microsoft’s advice
Approach
• 4-server farm with 2 partitions
• 8 vCPU, 16 GB RAM, 850 GB
• Crawl 10 file shares (3.7M files)
• Replay top 300 queries
• Apache JMeter
Capacity testing – findings
• Crawl rate declined 1% per million items indexed
• Query latency increased exponentially from 12 million items indexed per partition
• Database latency was insignificant during crawling
• Successfully crawled file shares via symbolic directory links
• Disk space usage was significantly lower than expected
  – Reduced data volume from 850 GB to 450 GB
  – 40+ servers => huge cost savings
Infrastructure – VM sizing
Dedicated ESX Cluster
• 14 x VM for SharePoint 2013
  – 4 physical machines
  – 4 x 32 = 128 CPUs
  – 4 x 256 = 1024 GB memory
• HA max utilization = ¾
  – 3 x 32 = 96 CPUs
  – 3 x 256 = 768 GB memory
• CPU and memory can be overcommitted
• CPU overcommitted 1.34 (1.78 if one physical host fails)
• VMs must wait for physical CPUs: wait time for 8 CPUs = 2 x 4 CPUs
• Mitigation: a) Reduce allocated virtual CPU, or b) Increase physical CPU
• Memory factor 0.44 (0.59)
• Reserved and locked memory prevents HA failover
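The figures in parentheses follow from losing one of the four hosts: the same allocation is spread over three quarters of the capacity, so each ratio grows by a factor of 4/3. A small Python check, assuming 32 CPUs and 256 GB per host (the per-host values that the totals 128, 1024, 96 and 768 imply); the function name is illustrative.

```python
HOSTS = 4
CPUS_PER_HOST = 32
MEMORY_GB_PER_HOST = 256  # implied by 1024 GB total over 4 hosts

def ratio_after_host_failure(ratio: float, hosts: int = HOSTS, failed: int = 1) -> float:
    """Scale an allocation ratio to the capacity left after losing hosts."""
    return ratio * hosts / (hosts - failed)

print("CPUs:", HOSTS * CPUS_PER_HOST, "->", (HOSTS - 1) * CPUS_PER_HOST)
print("Memory GB:", HOSTS * MEMORY_GB_PER_HOST, "->", (HOSTS - 1) * MEMORY_GB_PER_HOST)
print("CPU overcommit after failure:", ratio_after_host_failure(1.34))
print("Memory factor after failure:", ratio_after_host_failure(0.44))
```

The CPU result lands just above the 1.78 on the slide, which suggests the slide value was computed from an unrounded ratio or truncated.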
Infrastructure – VM tuning
DC  Role                                             vCPU  Peak    Average  Calculated  Recommended  Change
A   Web, Query, Admin                                8     187.55  37.03    2           4            -4
B   Web, Query, Admin                                8     621.88  92.69    8           8            0
A   Crawl, Analytics, Content, CEWS, Central Admin   8     724.35  210.59   8           8            0
B   Crawl, Analytics, Content, CEWS, Symbolic Links  8     724.56  198.44   8           8            0
A   Index 0, Content, CEWS                           8     486.18  62.55    6           6            -2
B   Index 0, Content, CEWS                           8     520.63  63.98    6           6            -2
A   Index 1, Content, CEWS                           8     547.08  69.3     6           6            -2
B   Index 1, Content, CEWS                           8     546.44  91.74    6           6            -2
A   Index 2, Content, CEWS                           8     491.38  65.6     6           6            -2
B   Index 2, Content, CEWS                           8     532.01  77.83    6           6            -2
A   Index 3, Content, CEWS                           8     540.45  78.72    6           6            -2
B   Index 3, Content, CEWS                           8     621.88  92.69    8           8            0
A   Distributed Cache                                4     91.71   5.99     2           2            -2
B   Distributed Cache* (added later)                 -     -       -        -           -            -
    Total                                            100                    78          80           -20
Peak and average CPU usage is calculated over 30 days
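The slides do not say how the Calculated column was derived. The values are consistent with taking the peak CPU percentage, dividing by 100, rounding up to whole cores and then up to an even number; the Python sketch below tests that inferred rule against the peak values in the table and should be read as a guess, not as the author's method.

```python
import math

def calculated_vcpu(peak_percent: float) -> int:
    """Inferred rule: cores needed at peak, rounded up to an even count."""
    cores = math.ceil(peak_percent / 100)
    return cores + (cores % 2)

# Peak CPU values from the table, in row order (the empty last row is skipped).
PEAKS = [187.55, 621.88, 724.35, 724.56, 486.18, 520.63, 547.08,
         546.44, 491.38, 532.01, 540.45, 621.88, 91.71]

calculated = [calculated_vcpu(p) for p in PEAKS]
print(calculated, sum(calculated))
```

The Recommended column differs from this rule only in the first row, where 4 vCPUs are kept instead of 2.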
Summary
1. Indexing thousands of content sources
2. Automation for rapidly changing index requirements
3. Sizing the infrastructure for performance and HA
Questions?
[email protected] http://linkedin.com/in/petterskodvin @pettersh