ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

So you think you can crawl? Stretching the Boundaries of SharePoint 2013!

Petter Skodvin-Hvammen

AD-Gruppen, Norway

Who am I?

Petter Skodvin-Hvammen

Oseberg ship - Discovered 1904 in Tønsberg, Norway. Buried by Vikings in 834 AD

• Solutions Architect
• SharePoint Consultant
• Search Enthusiast
• Community Lead

@pettersh

www.adgruppen.no


Enterprise Search

Index thousands of sources

Automate index management

Infrastructure sizing

Challenges and Solutions

Not included: code/scripts, user experience, relevancy, governance

www.sharepointeurope.com

Enterprise Search using SharePoint Server 2013

• 30,000 users
• 85 locations in 30 countries
• 15,000 daily searches
• 100,000,000 documents (?)
• 60 core systems, 2,000 applications

The Mission…

What do we index?

100,000,000 documents

3,000 fileshares

500 servers

Where is the data?
• Datacenters
• Time zones
• Bandwidth


How can we get it?
• Limit bandwidth usage for specific server locations
• Limit crawler impact within local business hours
• Grant read access to crawler per file share
• Avoid token bloat issues with more than 1,015* groups per account

* http://blogs.technet.com/b/shanecothran/archive/2010/07/16/maxtokensize-and-kerberos-token-bloat.aspx



How do we operate it?
• File shares are created, changed, and deleted every day using a custom self service solution
• File shares are moved between servers every day by automation rules
• Manage indexing and crawling of each file share with minimum manual effort


What can SharePoint do?
• Max 50 content sources per service application
– Max 500 with October 2013 CU installed
• Max 100 start addresses per content source
– Max 500 with October 2013 CU installed
• Max 20 concurrent crawls per service application
– Limitation has been removed

http://technet.microsoft.com/en-us/library/cc262787(v=office.15).aspx#Search
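As a rough check of these limits against the 3,000 file shares mentioned earlier, the sketch below counts the content sources needed if every share were its own start address. The function and constant names are illustrative only; the limit values are the ones listed on the slide.

```python
import math

# (max content sources, max start addresses per content source), from the slide.
LIMITS_BASE = (50, 100)
LIMITS_OCT_2013_CU = (500, 500)


def content_sources_needed(start_addresses: int, limits: tuple[int, int]) -> tuple[int, bool]:
    """Return (content sources required, whether that count is within the limit)."""
    max_sources, max_per_source = limits
    needed = math.ceil(start_addresses / max_per_source)
    return needed, needed <= max_sources


print(content_sources_needed(3000, LIMITS_BASE))         # (30, True)
print(content_sources_needed(3000, LIMITS_OCT_2013_CU))  # (6, True)
```

The count fits either way, but one start address per share would still mean editing content sources every time a share is created, moved, or deleted.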



It’s complicated
• More data than we have space for
• It’s located all over the place
• Everything changes all of the time
• There are limitations in SharePoint
• Someone’s gotta maintain this
• It has to be secure and relevant


What did we do?

• Created logical groups of file shares
• Used symbolic linking

fewer content sources

File shares: \\file01\share01, \\file02\share03, \\file03\share03
Symbolic link: \\file00\share\sym01
Start address: \\file00\share
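A minimal sketch of the linking idea: one directory symlink per file share is created under a single root, so the crawler only needs that root as a start address. The `sym01`-style naming follows the slide; the function name and the temporary directories standing in for real shares are assumptions made for the demo.

```python
import os
import tempfile


def link_shares(root: str, targets: list[str]) -> dict[str, str]:
    """Create one directory symlink per target under root; return link -> target."""
    links = {}
    for number, target in enumerate(targets, start=1):
        link = os.path.join(root, f"sym{number:02d}")
        # target_is_directory only matters on Windows, where creating
        # symlinks also requires the appropriate privilege.
        os.symlink(target, link, target_is_directory=True)
        links[link] = target
    return links


# Demo: temporary directories stand in for the real file shares.
with tempfile.TemporaryDirectory() as base:
    shares = []
    for name in ("share01", "share02", "share03"):
        path = os.path.join(base, name)
        os.mkdir(path)
        shares.append(path)
    root = os.path.join(base, "root")
    os.mkdir(root)
    created = link_shares(root, shares)
    link_names = sorted(os.path.basename(link) for link in created)
    resolved_ok = all(
        os.path.realpath(link) == os.path.realpath(target)
        for link, target in created.items()
    )

print(link_names)
```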

What did we do?

• Grouped file shares based on region
• One content source per region
• Incremental crawls every night

crawling based on time zones
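To illustrate what "every night" per region means for a shared schedule, the sketch below converts an assumed local start time into UTC for each region. The region offsets and the 01:00 start are hypothetical values, and fixed offsets ignore daylight saving time.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical regions with fixed UTC offsets (no daylight saving handling).
REGION_UTC_OFFSET_HOURS = {"europe": 1, "asia": 8}
LOCAL_START = time(1, 0)  # assumed nightly crawl start, local time


def crawl_start_utc(region: str, day: date) -> datetime:
    """Return the UTC instant of the region's local nightly start on the given date."""
    tz = timezone(timedelta(hours=REGION_UTC_OFFSET_HOURS[region]))
    local_start = datetime.combine(day, LOCAL_START, tzinfo=tz)
    return local_start.astimezone(timezone.utc)


for region in REGION_UTC_OFFSET_HOURS:
    print(region, crawl_start_utc(region, date(2014, 5, 6)).isoformat())
```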

What did we do?

• Created a DNS alias per impact rule in etc/hosts on the crawl servers

reduced crawler impact
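Crawler impact rules apply per host name, so giving one server several aliases presumably lets each alias carry its own rule. The sketch below renders hosts-file lines for such aliases. The alias names `default` and `wait-60` are taken from the start addresses later in the deck; the IP address is a placeholder from the documentation range, and the function name is illustrative.

```python
# Placeholder address for the server that hosts the symbolic links.
LINK_SERVER_IP = "192.0.2.10"
# One alias per crawler impact rule (names as used in the start addresses).
IMPACT_ALIASES = ["default", "wait-60"]


def hosts_entries(ip: str, aliases: list[str]) -> list[str]:
    """Return one hosts-file line per alias, all pointing at the same server."""
    return [f"{ip}\t{alias}" for alias in aliases]


print("\n".join(hosts_entries(LINK_SERVER_IP, IMPACT_ALIASES)))
```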

What did we do?

• Granted file share access to the crawl account that is a member of the fewest groups
• Monitored group memberships
• Grouped file shares by crawl account
• Crawl rules matched folder structure

managed pool of crawl accounts

Crawl rule | Type | Crawl account
file://.*/spcrwl01/.* | Include | SP\spcrwl01
file://.*/spcrwl02/.* | Include | SP\spcrwl02
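The account selection described above could look roughly like this. It is a sketch under assumptions: granting access to a share is assumed to add the account to one more group, the group counts are invented, and the function name is hypothetical. Only the 1,015 limit and the account names come from the slides.

```python
MAX_GROUPS = 1015  # group limit per account quoted in the slides


def pick_crawl_account(group_counts: dict[str, int], headroom: int = 1) -> str:
    """Pick the account in the fewest groups that can still join `headroom` more."""
    eligible = {
        account: count
        for account, count in group_counts.items()
        if count + headroom <= MAX_GROUPS
    }
    if not eligible:
        raise RuntimeError("all crawl accounts are at the group limit; add an account")
    # Tie-break on the account name so the choice is deterministic.
    return min(eligible, key=lambda account: (eligible[account], account))


# Invented membership counts for the two accounts shown on the slide.
pool = {r"SP\spcrwl01": 1010, r"SP\spcrwl02": 412}
print(pick_crawl_account(pool))
```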


The bigger picture
• Folder structure: <content source>/<crawler impact>/<crawl account>/<symbolic link>
• Start addresses: file://<crawler impact>/<content source>/<crawler impact>

Source | Start address | Folder | Crawl rule | Impact rule
Europe | file://default/europe/default | europe/default/spcrwl01 | file://.*/spcrwl01/.* | Default
Europe | file://default/europe/default | europe/default/spcrwl02 | file://.*/spcrwl02/.* | Default
Europe | file://wait-60/europe/wait-60 | europe/wait-60/spcrwl01 | file://.*/spcrwl01/.* | Wait-60
Europe | file://wait-60/europe/wait-60 | europe/wait-60/spcrwl02 | file://.*/spcrwl02/.* | Wait-60
Asia | file://default/asia/default | asia/default/spcrwl01 | file://.*/spcrwl01/.* | Default
Asia | file://default/asia/default | asia/default/spcrwl02 | file://.*/spcrwl02/.* | Default
Asia | file://wait-60/asia/wait-60 | asia/wait-60/spcrwl01 | file://.*/spcrwl01/.* | Wait-60
Asia | file://wait-60/asia/wait-60 | asia/wait-60/spcrwl02 | file://.*/spcrwl02/.* | Wait-60

How did we manage this?


AUTOMATION
• Self service portal for enabling indexing of file shares
• Custom web service integration in self service portal
• Custom solution for granting access to crawl accounts
• Custom timer job to get list of file shares to crawl from self service portal
• Custom timer job for creating and removing symbolic links
• Custom lists for mapping server to content source, schedule and impact; shares to crawl accounts and metadata; UNC to symlink
• Content enrichment service for replacing symlinks in paths with actual file paths
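The timer job that creates and removes symbolic links is essentially a reconciliation between what the portal wants indexed and what exists on disk. The sketch below shows only that comparison step as a pure function; the function name, the link paths, and the server names are hypothetical, and the real job would run inside SharePoint rather than as a Python script.

```python
def plan_link_changes(
    wanted: dict[str, str], existing: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Compare link -> UNC target maps and return (links to create, links to remove).

    A link whose target changed (share moved to another server) appears in
    both results: it is removed and then re-created with the new target.
    """
    to_create = {
        link: target for link, target in wanted.items() if existing.get(link) != target
    }
    to_remove = sorted(
        link for link, target in existing.items() if wanted.get(link) != target
    )
    return to_create, to_remove


# Hypothetical state: share02 moved from file02 to file05, share03 was
# deleted, and share01 is newly enabled for indexing.
wanted = {"sym01": r"\\file01\share01", "sym02": r"\\file05\share02"}
existing = {"sym02": r"\\file02\share02", "sym03": r"\\file03\share03"}

to_create, to_remove = plan_link_changes(wanted, existing)
print(to_create)
print(to_remove)
```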


Example: Self Service Portal

Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: Assigned automatically
Crawl Account: Assigned automatically
[Cancel] [Save]

Example: Custom Lists

Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: \\file01\share01
Crawl Account: SP\spcrwl01
Symlink: \\default\europe\default\spcrwl01\e5dc12a41d
Location: europe (server file01 is located in Oslo DC)
Bandwidth: 5 Mbps
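With the UNC path and symlink stored side by side as in this list item, the replacement done by the content enrichment service reduces to a prefix swap. The sketch below shows only that string logic, not the web service itself; the function name and the document path are hypothetical, while the two mapped paths are the ones shown in the example.

```python
# Symlink path -> actual UNC path, as stored in the custom list example.
SYMLINK_TO_UNC = {
    r"\\default\europe\default\spcrwl01\e5dc12a41d": r"\\file01\share01",
}


def actual_path(crawled_path: str) -> str:
    """Replace a known symlink prefix with the share's real UNC path."""
    lowered = crawled_path.lower()  # Windows paths are case-insensitive
    for link, unc in SYMLINK_TO_UNC.items():
        prefix = link.lower()
        # Match the link itself or anything below it, never a longer sibling name.
        if lowered == prefix or lowered.startswith(prefix + "\\"):
            return unc + crawled_path[len(link):]
    return crawled_path  # not under a known symlink: leave unchanged


crawled = r"\\default\europe\default\spcrwl01\e5dc12a41d\Plans\agenda.docx"
print(actual_path(crawled))  # \\file01\share01\Plans\agenda.docx
```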

[Diagram: search farm topology. Roles shown: WFE, Query, Index (Index-0 to Index-3, each shown twice), Doc Proc, Enrichment, Crawling, Analytics, Admin, Central Admin, Caching, SQL Server. Callouts: 40 million documents, 10 queries/second.]

SQL Server
• Admin DB
• Analytics DB
• Crawl DB
• Link DB
• Other SP DBs

Capacity testing

Purpose
• Crawling of symbolic links
• Scaling of virtual machines
• Sizing of disk space
• Verify Microsoft’s advice

Approach
• 4 server farm with 2 partitions
• 8 vCPU, 16 GB RAM, 850 GB
• Crawl 10 file shares (3.7M files)
• Replay top 300 queries
• Apache JMeter


Capacity testing – findings
• Crawl rate declined 1% per million items indexed
• Query latency increased exponentially from 12 million items indexed per partition
• Database latency was insignificant during crawling
• Successfully crawled file shares via symbolic directory links
• Disk space usage was significantly lower than expected
– Reduced data volume from 850 GB to 450 GB
– 40+ servers => huge cost savings
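One way to use the latency finding is as a sizing rule: keep each index partition below the point where query latency started to climb. Treating 12 million items as a hard ceiling is an assumption made for this sketch, and the function name is illustrative.

```python
import math

# Items per partition at which query latency began to rise in the test.
MAX_ITEMS_PER_PARTITION = 12_000_000


def partitions_needed(total_items: int, max_items: int = MAX_ITEMS_PER_PARTITION) -> int:
    """Smallest number of index partitions that keeps each one at or below max_items."""
    return math.ceil(total_items / max_items)


print(partitions_needed(40_000_000))   # 4
print(partitions_needed(100_000_000))  # 9
```

The first result agrees with the four index partitions (Index-0 to Index-3) and the 40 million documents shown in the topology diagram.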


Infrastructure – VM sizing

Dedicated ESX Cluster
• 14 x VM for SharePoint 2013
– 4 physical machines
– 4 x 32 = 128 CPUs
– 4 x 256 = 1024 GB memory
• HA max utilization = ¾
– 3 x 32 = 96 CPUs
– 3 x 256 = 768 GB memory

• CPU and memory can be overcommitted
• CPU overcommitted 1.34 (1.78 if one physical host fails)
• VMs must wait for physical CPU
– Wait time for 8 CPU = 2 x 4 CPU
• Mitigation: a) Reduce allocated virtual CPU, or b) Increase physical CPU
• Memory factor 0.44 (0.59)
• Reserved and locked memory prevents HA failover
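The overcommit and memory factors are ratios of allocated to physical resources. The slide does not state the allocated totals, so the 171 vCPUs and 450 GB used below are back-calculated assumptions that reproduce the quoted factors; the per-host figures are 32 CPUs and 256 GB, which is what the 128 CPU and 1024 GB totals imply.

```python
HOSTS = 4
CPUS_PER_HOST = 32
MEM_GB_PER_HOST = 256

# Assumed allocations, back-calculated from the factors on the slide.
ALLOCATED_VCPU = 171
ALLOCATED_MEM_GB = 450


def commit_ratio(allocated: float, physical: float) -> float:
    """Allocated virtual resources divided by the physical resources behind them."""
    return allocated / physical


results = {}
for hosts in (HOSTS, HOSTS - 1):  # all hosts up, then one host failed
    cpu = round(commit_ratio(ALLOCATED_VCPU, hosts * CPUS_PER_HOST), 2)
    mem = round(commit_ratio(ALLOCATED_MEM_GB, hosts * MEM_GB_PER_HOST), 2)
    results[hosts] = (cpu, mem)
    print(f"{hosts} hosts: CPU {cpu}, memory {mem}")
```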


Infrastructure – VM tuning


DC | Role | vCPU | Peak | Average | Calculated | Recommended | Change
A | Web, Query, Admin | 8 | 187.55 | 37.03 | 2 | 4 | -4
B | Web, Query, Admin | 8 | 621.88 | 92.69 | 8 | 8 | 0
A | Crawl, Analytics, Content, CEWS, Central Admin | 8 | 724.35 | 210.59 | 8 | 8 | 0
B | Crawl, Analytics, Content, CEWS, Symbolic Links | 8 | 724.56 | 198.44 | 8 | 8 | 0
A | Index 0, Content, CEWS | 8 | 486.18 | 62.55 | 6 | 6 | -2
B | Index 0, Content, CEWS | 8 | 520.63 | 63.98 | 6 | 6 | -2
A | Index 1, Content, CEWS | 8 | 547.08 | 69.3 | 6 | 6 | -2
B | Index 1, Content, CEWS | 8 | 546.44 | 91.74 | 6 | 6 | -2
A | Index 2, Content, CEWS | 8 | 491.38 | 65.6 | 6 | 6 | -2
B | Index 2, Content, CEWS | 8 | 532.01 | 77.83 | 6 | 6 | -2
A | Index 3, Content, CEWS | 8 | 540.45 | 78.72 | 6 | 6 | -2
B | Index 3, Content, CEWS | 8 | 621.88 | 92.69 | 8 | 8 | 0
A | Distributed Cache | 4 | 91.71 | 5.99 | 2 | 2 | -2
B | Distributed Cache* (added later) | - | - | - | - | - | -
Total | | 100 | | | 78 | 80 | -20

Peak and average CPU usage are calculated over 30 days
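The slides do not say how the "Calculated" column was derived. One rule that happens to reproduce every row is to read the peak as a percentage of a single vCPU, round up to whole vCPUs, and then round up to an even number. The sketch below checks that inferred rule against the table; it is a guess that fits the data, not the author's documented method.

```python
import math


def calculated_vcpu(peak_percent: float) -> int:
    """Inferred rule: whole vCPUs needed at peak, rounded up to an even count."""
    cores = math.ceil(peak_percent / 100)
    return cores + cores % 2


# (peak, calculated) pairs from the table, in row order.
ROWS = [
    (187.55, 2), (621.88, 8), (724.35, 8), (724.56, 8),
    (486.18, 6), (520.63, 6), (547.08, 6), (546.44, 6),
    (491.38, 6), (532.01, 6), (540.45, 6), (621.88, 8),
    (91.71, 2),
]

all_match = all(calculated_vcpu(peak) == expected for peak, expected in ROWS)
total = sum(calculated_vcpu(peak) for peak, _ in ROWS)
print(all_match, total)  # True 78
```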

Summary

1. Indexing thousands of content sources
2. Automation for rapidly changing index requirements
3. Sizing the infrastructure for performance and HA


Questions?

http://linkedin.com/in/petterskodvin
@pettersh
