ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

So you think you can crawl? Stretching the Boundaries of SharePoint 2013! Petter Skodvin-Hvammen AD-Gruppen, Norway

description

Presentation from the European SharePoint Conference 2014 in Barcelona. How did we build a solution for indexing 3000 file shares using self service solutions and automated crawl management.

Transcript of ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Page 1: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Petter Skodvin-Hvammen

AD-Gruppen, Norway

Page 2: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Who am I?

Petter Skodvin-Hvammen

Oseberg ship - Discovered 1904 in Tønsberg, Norway. Buried by Vikings in 834 AD

• Solutions Architect
• SharePoint Consultant
• Search Enthusiast
• Community Lead

@pettersh - [email protected]

www.adgruppen.no

Page 3: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Enterprise Search

Index thousands of sources

Automate index management

Infrastructure sizing

Challenges and Solutions

Not Included: code/scripts, user experience, relevancy, governance

www.sharepointeurope.com

Page 4: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Enterprise Search using SharePoint Server 2013

• 30,000 users
• 85 locations in 30 countries
• 15,000 daily searches
• 100,000,000 documents(?)
• 60 core systems, 2,000 applications

The Mission…

Page 5: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

What do we index?

100,000,000 documents

3,000 fileshares

500 servers

Page 6: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Where is the data?

• Datacenters
• Time zones
• Bandwidth

Page 7: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

How can we get it?

• Limit bandwidth usage for specific server locations
• Limit crawler impact within local business hours
• Grant read access to crawler per file share
• Avoid token bloat issues with more than 1,015* groups per account

* http://blogs.technet.com/b/shanecothran/archive/2010/07/16/maxtokensize-and-kerberos-token-bloat.aspx

Page 8: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

How do we operate it?

• File shares are created, changed, and deleted every day using a custom self service solution
• File shares are moved between servers every day by automation rules
• Manage indexing and crawling of each file share with minimum manual effort

Page 9: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

What can SharePoint do?

• Max 50 content sources per service application
  – Max 500 with October 2013 CU installed
• Max 100 start addresses per content source
  – Max 500 with October 2013 CU installed
• Max 20 concurrent crawls per service application
  – Limitation has been removed

http://technet.microsoft.com/en-us/library/cc262787(v=office.15).aspx#Search

Page 10: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

It’s complicated

• More data than we have space for
• It’s located all over the place
• Everything changes all of the time
• There are limitations in SharePoint
• Someone’s gotta maintain this
• It has to be secure and relevant

Page 11: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

What did we do?

• Created logical groups of file shares
• Used symbolic linking

Fewer content sources

File shares:
\\file01\share01
\\file02\share03
\\file03\share03

Symbolic links:
\\file00\share\sym01
\\file00\share\sym02
\\file00\share\sym03

Start address:
\\file00\share
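The talk leaves code out of scope, so the following is only a rough sketch of the symbolic-link idea: one directory link per file share is created under a single root folder, and that root is what the crawler would use as its start address. The function name `create_share_links` and the shape of the `shares` mapping are hypothetical, and on Windows creating links usually requires elevated rights.

```python
import os


def create_share_links(root, shares):
    """Create one directory symlink per share under root.

    shares maps a link name to the share path it should point at.
    Returns the list of link paths that were newly created.
    """
    os.makedirs(root, exist_ok=True)
    created = []
    for name, target in sorted(shares.items()):
        link = os.path.join(root, name)
        if os.path.islink(link):
            continue  # link already in place, leave it alone
        os.symlink(target, link, target_is_directory=True)
        created.append(link)
    return created
```

Calling it a second time with the same mapping creates nothing, which is the behaviour a recurring job would want.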

Page 12: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

What did we do?

• Grouped file shares based on region
• One content source per region
• Incremental crawls every night

Crawling based on time zones
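The regional grouping could be sketched as follows, assuming a lookup from server name to region is available (the deck later mentions custom lists that map servers to content sources). The function name, the lookup, and the "unassigned" fallback are illustrative assumptions, not part of the presented solution.

```python
from collections import defaultdict


def group_shares_by_region(shares, server_region):
    """Group UNC paths by the region of the server that hosts them."""
    groups = defaultdict(list)
    for unc in shares:
        # The server name is the first segment after the leading backslashes.
        server = unc.lstrip("\\").split("\\", 1)[0].lower()
        region = server_region.get(server, "unassigned")
        groups[region].append(unc)
    return dict(groups)
```

Each resulting group would then correspond to one content source with its own nightly schedule.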

Page 13: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

What did we do?

• Created DNS alias per impact rule in etc/hosts on crawl servers

Reduced crawler impact
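A minimal sketch of the alias idea: every alias resolves to the same server, but each gives the crawler a distinct host name that a crawler impact rule can be attached to. The IP address and the helper name `hosts_entries` are hypothetical; the alias names follow the "default" and "wait-60" examples used elsewhere in the deck.

```python
def hosts_entries(symlink_server_ip, impact_aliases):
    """Return hosts-file lines that point each alias at the same server."""
    return [f"{symlink_server_ip}\t{alias}" for alias in impact_aliases]


# Hypothetical address of the server that holds the symbolic links.
for line in hosts_entries("10.0.0.10", ["default", "wait-60"]):
    print(line)
```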

Page 14: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

What did we do?

• Granted file share access to the account that is a member of the fewest groups
• Monitored group memberships
• Grouped file shares by crawl account
• Crawl rules matched folder structure

Managed pool of crawl accounts

file://.*/spcrwl01/.*   Include   SP\spcrwl01
file://.*/spcrwl02/.*   Include   SP\spcrwl02
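Picking an account from the pool might look roughly like this. It assumes that granting access to a share adds the crawl account to one more group, which the slides imply but do not state; the function name and the `headroom` parameter are hypothetical.

```python
MAX_GROUPS = 1015  # limit quoted on the "How can we get it?" slide


def pick_crawl_account(group_counts, headroom=1):
    """Return the account with the fewest group memberships.

    group_counts maps account name to its current number of groups.
    Raises ValueError if even that account cannot take `headroom` more.
    """
    account = min(group_counts, key=group_counts.get)
    if group_counts[account] + headroom > MAX_GROUPS:
        raise ValueError("all crawl accounts are at the group limit")
    return account
```

When the error is raised, the pool would need another crawl account and a matching crawl rule.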


Page 15: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

The bigger picture

• Folder structure: <content source>/<crawler impact>/<crawl account>/<symbolic link>
• Start addresses: file://<crawler impact>/<content source>/<crawler impact>

Source  Start addresses                Folder                   Crawl rule             Impact rule
Europe  file://default/europe/default  europe/default/spcrwl01  file://.*/spcrwl01/.*  Default
Europe  file://default/europe/default  europe/default/spcrwl02  file://.*/spcrwl02/.*  Default
Europe  file://wait-60/europe/wait-60  europe/wait-60/spcrwl01  file://.*/spcrwl01/.*  Wait-60
Europe  file://wait-60/europe/wait-60  europe/wait-60/spcrwl02  file://.*/spcrwl02/.*  Wait-60
Asia    file://default/asia/default    asia/default/spcrwl01    file://.*/spcrwl01/.*  Default
Asia    file://default/asia/default    asia/default/spcrwl02    file://.*/spcrwl02/.*  Default
Asia    file://wait-60/asia/wait-60    asia/wait-60/spcrwl01    file://.*/spcrwl01/.*  Wait-60
Asia    file://wait-60/asia/wait-60    asia/wait-60/spcrwl02    file://.*/spcrwl02/.*  Wait-60
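The naming scheme above can be expressed as a few small helpers. This is a sketch of the convention only; the function names are hypothetical, and the UNC form of the symlink path is inferred from the example shown on the custom lists slide.

```python
def start_address(source, impact):
    """Start address: file://<crawler impact>/<content source>/<crawler impact>."""
    return f"file://{impact}/{source}/{impact}"


def folder(source, impact, account):
    """Folder below the share root that groups links by crawl account."""
    return f"{source}/{impact}/{account}"


def crawl_rule(account):
    """Include rule matching every path that runs through the account folder."""
    return f"file://.*/{account}/.*"


def symlink_unc(source, impact, account, link_id):
    """UNC path of one symbolic link, addressed through the impact alias."""
    return "\\\\" + "\\".join([impact, source, impact, account, link_id])
```

For the Europe source with the default impact rule and account spcrwl01, `symlink_unc` yields the same shape of path as the symlink shown in the custom lists example.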

Page 16: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

How did we manage this?

AUTOMATION

• Self service portal for enabling indexing of file shares
• Custom web service integration in self service portal
• Custom solution for granting access to crawl accounts
• Custom timer job to get list of file shares to crawl from self service portal
• Custom timer job for creating and removing symbolic links
• Custom lists for mapping server to content source, schedule and impact, shares to crawl accounts and metadata, UNC to symlink
• Content enrichment service for replacing symlinks in paths with actual file paths
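The path rewrite performed during content enrichment might be sketched like this. It shows only the string replacement, not the web service that SharePoint calls, and it assumes the UNC-to-symlink list is available as a dictionary; the function name and the case-insensitive matching are assumptions.

```python
def replace_symlink(path, symlink_to_unc):
    """Swap a leading symlink path for the real share path, if known."""
    lowered = path.lower()
    for symlink, unc in symlink_to_unc.items():
        prefix = symlink.lower()
        # Match the link itself or anything below it, never a partial name.
        if lowered == prefix or lowered.startswith(prefix + "\\"):
            return unc + path[len(symlink):]
    return path
```

Paths that do not start with a known symlink are returned unchanged, so content from other sources passes through untouched.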

Page 17: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Example: Self Service Portal

Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: Assigned automatically
Crawl Account: Assigned automatically
Cancel | Save

Example: Custom Lists

Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: \\file01\share01
Crawl Account: SP\spcrwl01
Symlink: \\default\europe\default\spcrwl01\e5dc12a41d
Location: europe (server file01 is located in Oslo DC)
Bandwidth: 5 Mbps

Page 18: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

[Diagram: search farm topology. Server roles shown: WFE, Query, Index-0 to Index-3 (each shown twice), Doc Proc, Crawling, Analytics, Admin, Central Admin, Enrichment, Caching. SQL Server (shown twice) hosts the Admin, Analytics, Crawl and Link databases and other SharePoint databases. Callouts: 40 million documents, 10 queries/second.]

Page 19: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Capacity testing

Purpose
• Crawling of symbolic links
• Scaling of virtual machines
• Sizing of disk space
• Verify Microsoft’s advice

Approach
• 4-server farm with 2 partitions
• 8 vCPU, 16 GB RAM, 850 GB
• Crawl 10 file shares (3.7M files)
• Replay top 300 queries
• Apache JMeter

Page 20: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Capacity testing – findings

• Crawl rate declined 1% per million items indexed
• Query latency increased exponentially from 12 million items indexed per partition
• Database latency was insignificant during crawling
• Successfully crawled file shares via symbolic directory links
• Disk space usage was significantly lower than expected
  – Reduced data volume from 850 GB to 450 GB
  – 40+ servers => huge cost savings

Page 21: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Infrastructure – VM sizing

Dedicated ESX Cluster
• 14 x VM for SharePoint 2013
  – 4 physical machines
  – 4 x 32 = 128 CPUs
  – 4 x 256 = 1024 GB memory
• HA max utilization = ¾
  – 3 x 32 = 96 CPUs
  – 3 x 256 = 768 GB memory

• CPU and memory can be over-committed
• CPU over-committed 1.34 (1.78 if one physical host fails)
• VMs must wait for physical CPU: wait time for 8 CPU = 2 x 4 CPU
• Mitigation: a) Reduce allocated virtual CPU, or b) Increase physical CPU
• Memory factor 0.44 (0.59)
• Reserved and locked memory prevents HA failover
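The figures in parentheses appear to follow from losing one of four equal hosts: the same allocation is then spread over three quarters of the capacity. A small sketch under that assumption (the function name is hypothetical, and the slides do not show the calculation):

```python
def factor_after_host_loss(factor, hosts, failed=1):
    """Scale an allocation factor when `failed` of `hosts` equal hosts are lost."""
    if failed >= hosts:
        raise ValueError("at least one host must remain")
    return factor * hosts / (hosts - failed)


print(round(factor_after_host_loss(1.34, 4), 2))  # CPU over-commit factor
print(round(factor_after_host_loss(0.44, 4), 2))  # memory factor
```

This gives roughly 1.79 for CPU and 0.59 for memory, close to the 1.78 and 0.59 on the slide; the small CPU difference presumably comes from rounding in the underlying figures.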


Page 22: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Infrastructure – VM tuning


DC  Role                                             vCPU  Peak    Average  Calculated  Recommended  Change
A   Web, Query, Admin                                8     187.55  37.03    2           4            -4
B   Web, Query, Admin                                8     621.88  92.69    8           8            0
A   Crawl, Analytics, Content, CEWS, Central Admin   8     724.35  210.59   8           8            0
B   Crawl, Analytics, Content, CEWS, Symbolic Links  8     724.56  198.44   8           8            0
A   Index 0, Content, CEWS                           8     486.18  62.55    6           6            -2
B   Index 0, Content, CEWS                           8     520.63  63.98    6           6            -2
A   Index 1, Content, CEWS                           8     547.08  69.3     6           6            -2
B   Index 1, Content, CEWS                           8     546.44  91.74    6           6            -2
A   Index 2, Content, CEWS                           8     491.38  65.6     6           6            -2
B   Index 2, Content, CEWS                           8     532.01  77.83    6           6            -2
A   Index 3, Content, CEWS                           8     540.45  78.72    6           6            -2
B   Index 3, Content, CEWS                           8     621.88  92.69    8           8            0
A   Distributed Cache                                4     91.71   5.99     2           2            -2
B   Distributed Cache* (added later)                 -     -       -        -           -            -
    Total                                            100                    78          80           -20

Peak and average CPU usage is calculated over 30 days
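The slides do not say how the Calculated column was derived. One rule that happens to reproduce it, assuming the peak is a percentage where 100 equals one fully used vCPU, is to round the peak up to whole cores and then up to an even number. The sketch below is that guess, not the presenter's documented method.

```python
import math


def calculated_vcpus(peak_percent):
    """Cores needed for the peak, rounded up to an even number."""
    cores = math.ceil(peak_percent / 100)
    return cores + (cores % 2)


# Peak values from the table, in row order (the later-added cache VM is omitted).
peaks = [187.55, 621.88, 724.35, 724.56, 486.18, 520.63, 547.08,
         546.44, 491.38, 532.01, 540.45, 621.88, 91.71]
print(sum(calculated_vcpus(p) for p in peaks))
```

Applied to the thirteen measured rows, the rule matches each Calculated value and sums to the table total of 78.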

Page 23: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Summary

1. Indexing thousands of content sources
2. Automation for rapidly changing index requirements
3. Sizing the infrastructure for performance and HA

Page 24: ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Questions?

[email protected] http://linkedin.com/in/petterskodvin @pettersh