CERN IT Department

CH-1211 Geneva 23

Switzerland

www.cern.ch/it

OIS

Andreas Wagner – CERN IT/OIS

Eduardo Alvarez – CERN IT/OIS

Sergio Fernandez – CERN IT/OIS

Summary

• Introduction to search
• Inside CERN Search
• New Search Solution
  – Concepts, collections, pipelines, stages, architecture
  – Search features
• Demo
• Conclusions and future work

What is Search?

• Search is the art of balancing three factors:
  – Recall
    • Of all matching documents, how many were returned?
  – Precision
    • Of the returned documents, how many match the query?
  – Relevancy
    • How well does a document match the query?
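As an illustration of the first two factors, here is a minimal sketch of how precision and recall could be computed for one query. The function name and the document ids are made up for the example; relevancy is a ranking matter and is not covered here.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: ids of the documents the engine returned
    relevant: ids of the documents that actually match the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: share of the returned documents that really match.
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: share of all matching documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative ids only: 4 documents returned, 3 documents truly relevant.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Returning more documents tends to raise recall and lower precision, which is why the two have to be balanced.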

Enterprise Search

• Wide range of document sources:
  – Web Pages
  – File systems
  – Databases
  – Directories (People and Places)
  – Document repositories (CDS, EDMS, Indico, …)
  – Structured CMS Data
    • Sharepoint, Drupal, Twiki
• Variety of metadata
• Different Access Protection Schemes
• Different retrieval methods and frequencies

Enterprise Search

• Components of Enterprise Search:
  – Search Engine / Search Technology
  – Integration within existing infrastructure (authentication, authorization)
  – Document retrieval
    • Not only Web pages
    • Database/XML data (CDS, Indico, Phone data)
  – Protected documents
    • Access to the document data
    • In addition, information about ACLs is needed
  – Ranking of documents
• Enterprise Search is not only a question of the search technology used!

Collaboration with data owners

What about Google

• What makes Google Web search so good?
  – Huge Web space analysis capabilities
  – Huge amount of usage data used for “voting” on the results: the most popular results are promoted
  – Substantial resources to tune and correct results:
    • usage data analysis
    • taking popular events into account
    • hand-edited results for popular single-keyword searches
  – Personalized filtering of results
    • Based on: location, preferences, search history, …
• The above is valid for all public web search engines (Yahoo, Bing)
• At the same time, Web Search is not Enterprise Search!

Search at CERN?

• Why a Search Service? If…
  – Every system usually has its own search system
• Probably one of the best places for this service
• Quite a lot of different content sources
• High rate of new content
• Solutions are not always optimal
• Centralize the search of content

CERN Search

• A central search solution to provide
  – for users
    • Single entry point for searching information on several content sources at the same time
  – for service providers
    • Search backend service
      » TWiki, Drupal, Sharepoint, JACOW, Groups
• Start of project in February 2006
  – Based on commercial product from FAST (Microsoft subsidiary and market leader)
• CERN Search in production since 2007
• Present resources: 1 PJAS & some fraction of a staff

CERN Search: Latest Progress

• 2009: Migration to FAST ESP 5.3
• 2010: Reorganization of the Indexed Web Space (improved relevancy)
• 2010: TWiki protected pages indexed
  – Service used as default TWiki search
• 1Q 2011: Indico protected docs + material
• 1Q 2011: Index of the Sharepoint content
• 3Q 2011: Migration to FAST Search Server 2010 for Sharepoint

Overview of Indexed documents

Documents indexed by CERN Search

Source                                   2011       2010       2009       2008
CERN Websites                         1561670    1456510    1787805     829542
CDS                                   1116921    1048360    1040694     936018
TWiki Pages                             68055      60796        ---        ---
Indico                                1531908     311208     255365     432339
JACOW                                  169204     144388        ---        ---
Phonebook                               26426      29819      25629      23982
Sharepoint (eSpace/Groups Archive)    6721347        ---        ---        ---
Central SSO Websites                      ---        ---        ---        ---
Total                              11 million  3 million  3 million  2 million

Concepts I

• Document
• Pipeline
• Processing Stage
• Collection
• Crawler (Files, Web)

Diagram labels: Collection A, Collection B
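How these concepts fit together can be sketched in a few lines. This is an illustrative model only, not the FAST API: the class names, the two example stages and the document fields are assumptions made for the sketch. A crawler would be whatever produces the raw documents passed to feed().

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]               # a document is a bag of named fields
Stage = Callable[[Document], Document]  # a processing stage transforms one document


def normalize_title(doc: Document) -> Document:
    """Example stage: collapse repeated whitespace in the title."""
    return {**doc, "title": " ".join(doc.get("title", "").split())}


def default_language(doc: Document) -> Document:
    """Example stage: set a language when the source did not provide one."""
    return {**doc, "language": doc.get("language", "en")}


@dataclass
class Pipeline:
    """An ordered list of processing stages."""
    stages: List[Stage] = field(default_factory=list)

    def process(self, doc: Document) -> Document:
        for stage in self.stages:
            doc = stage(doc)
        return doc


@dataclass
class Collection:
    """A named group of documents that share one pipeline."""
    name: str
    pipeline: Pipeline
    index: List[Document] = field(default_factory=list)

    def feed(self, docs: Iterable[Document]) -> None:
        # Every document goes through the collection's pipeline before indexing.
        for doc in docs:
            self.index.append(self.pipeline.process(doc))


if __name__ == "__main__":
    collection_a = Collection("Collection A", Pipeline([normalize_title, default_language]))
    collection_a.feed([{"title": "  CERN   Search "}])
    print(collection_a.index)
```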

Indexing Protected Content I

• To allow indexing protected content we need to:
  – Retrieve the document
    • The search engine needs access to the document
  – Obtain the document ACLs
    • To be able to decide who is allowed to find a document
• Often not trivial, since most systems answer the question
  “Does a given user have the right to access a given document?”
  and not
  “Who has access to a given document?”

This is due to often complex permission models including inheritance, fine granularity of permissions and permissions changing during the document lifecycle …
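The difference between the two questions can be sketched with a toy repository. Everything here is hypothetical: the paths, the group names, and the inheritance rule (the nearest explicit ACL on the path wins). Real systems are usually more complicated, which is the point of the slide.

```python
from typing import Dict, Optional, Set

# Hypothetical repository tree: child path -> parent path (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "/site": None,
    "/site/public.html": "/site",
    "/site/folder": "/site",
    "/site/folder/report.pdf": "/site/folder",
}

# Only some nodes define an ACL of their own; the others inherit it.
EXPLICIT_ACL: Dict[str, Set[str]] = {
    "/site": {"all-cern-users"},
    "/site/folder": {"it-ois-members"},
}


def resolve_acl(path: str) -> Set[str]:
    """Answer "who has access to this document?" by walking up the tree."""
    node: Optional[str] = path
    while node is not None:
        if node in EXPLICIT_ACL:
            return set(EXPLICIT_ACL[node])
        node = PARENT[node]
    return set()  # no ACL found anywhere: nobody may find the document


def can_access(user_groups: Set[str], path: str) -> bool:
    """Answer "may this user access this document?", the question most systems expose."""
    return bool(user_groups & resolve_acl(path))


if __name__ == "__main__":
    print(resolve_acl("/site/folder/report.pdf"))
    print(can_access({"all-cern-users"}, "/site/folder/report.pdf"))
```

The indexer needs the output of resolve_acl for every document, while many source systems only offer something like can_access.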

Indexing Protected Content II

• Document Processing
  – Resolve ACLs to SIDs
  – Sent to Indexer with document
• FSA (FAST Security Authorization) Component
  – Active Directory integration, i.e. based on CERN accounts and e-groups

Diagram labels: CERN Search, Document Repository, Document Processing, Active Directory Users & Groups, Search Index, Document, ACL, Doc + ACL

Authentication / Authorisation

• Query Processing
  – Authentication by Front-End
  – FSA creates filter with expanded user credentials and groups

Diagram labels: CERN Search, Active Directory Users & Groups, Search Index, Search Front End, Query & Identity, Group Membership, Authentication (SSO) & Search
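The query-time filtering can be sketched as follows. This is not the FSA implementation; the SID strings, the group table and the tiny in-memory index are invented for the example. Each indexed document carries the SIDs resolved at indexing time, and each query is restricted to the SIDs of the authenticated user and their groups.

```python
from typing import Dict, List, Set

# Hypothetical directory data: principal -> SID, and user -> groups.
SID: Dict[str, str] = {
    "everyone": "S-EVERYONE",
    "alice": "S-ALICE",
    "bob": "S-BOB",
    "it-ois": "S-IT-OIS",
}
GROUP_MEMBERSHIP: Dict[str, Set[str]] = {"alice": {"it-ois"}, "bob": set()}

# Hypothetical index: the ACL was resolved to SIDs during document processing.
INDEX: List[dict] = [
    {"id": "doc1", "text": "Public budget summary", "acl": {"S-EVERYONE"}},
    {"id": "doc2", "text": "Internal budget details", "acl": {"S-IT-OIS"}},
    {"id": "doc3", "text": "Meeting minutes", "acl": {"S-BOB"}},
]


def expand_identity(user: str) -> Set[str]:
    """Expand an authenticated user into the SIDs of the user and all their groups."""
    principals = {user, "everyone"} | GROUP_MEMBERSHIP.get(user, set())
    return {SID[p] for p in principals if p in SID}


def search(query: str, user: str) -> List[str]:
    """Return ids of documents that match the query and that the user may see."""
    allowed = expand_identity(user)
    needle = query.lower()
    return [
        doc["id"]
        for doc in INDEX
        if needle in doc["text"].lower() and doc["acl"] & allowed
    ]


if __name__ == "__main__":
    print(search("budget", "alice"))
    print(search("budget", "bob"))
```

Filtering inside the query means a user never sees a hit, a count, or a snippet for a document they cannot open.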

FAST Search for Sharepoint: Cluster Architecture

Index Profile

• Final representation of each document
• Set of attributes to index (Managed Properties)
  – Title
  – Author
  – Last modified date
  – ACLs
• Defines which properties are queryable, refiners, sortable
• Defines FullTextIndex properties
• Defines mappings to FullTextIndex
• Flexible
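The idea of an index profile can be sketched as a small declarative structure. This is a conceptual model only and does not reproduce the FS4SP configuration format or cmdlets; the class, the flag names and the sample document are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ManagedProperty:
    """One attribute of the final document representation (hypothetical model)."""
    name: str
    queryable: bool = True
    refiner: bool = False
    sortable: bool = False
    full_text: bool = False  # whether the value is mapped into the full-text index


# Hypothetical profile covering the attributes listed on the slide.
INDEX_PROFILE: List[ManagedProperty] = [
    ManagedProperty("title", full_text=True),
    ManagedProperty("author", refiner=True, full_text=True),
    ManagedProperty("last_modified", refiner=True, sortable=True),
    ManagedProperty("acl"),
]


def to_index_entry(raw: Dict[str, object], profile=INDEX_PROFILE) -> Dict[str, object]:
    """Keep only the properties named in the profile and build the full-text field."""
    entry: Dict[str, object] = {p.name: raw[p.name] for p in profile if p.name in raw}
    entry["full_text"] = " ".join(
        str(raw[p.name]) for p in profile if p.full_text and p.name in raw
    )
    return entry


if __name__ == "__main__":
    raw = {
        "title": "LHC status",
        "author": "A. Author",
        "last_modified": "2011-09-01",
        "acl": ["S-IT-OIS"],
        "crawler_tmp": "dropped because it is not in the profile",
    }
    print(to_index_entry(raw))
```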

Result Ranking – Rank Profiles

Ranking Issues at CERN

• Flat Web space
  – Lack of metadata (copy-paste, poorly maintained HTML meta tags, ...)
  – Isolated sites (not many inter-links, only to the CERN main page)
• Good experience with well-structured content
  – Indico, CDS
• How to improve ranking?
  – Manual tuning of results: promote, demote
  – Modify rank profile
  – Custom processing stage for static rank points
• Not easy
  – Manpower intensive
  – Requires a better understanding of the indexed data
  – No magic solution: balance the rank profile for different collections
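The three tuning levers can be sketched together in one toy scoring function. The formula, the weights and the field names are assumptions made for illustration; the actual rank profile in FAST has more components and a different configuration.

```python
from typing import Dict, List

# Hypothetical per-collection weights (the "rank profile" lever).
COLLECTION_WEIGHT: Dict[str, float] = {"indico": 1.2, "cds": 1.2, "web": 1.0}

# Hypothetical manual promotions/demotions keyed by document id.
MANUAL_BOOST: Dict[str, float] = {"c": 20.0}


def score(hit: dict) -> float:
    """Combine query-dependent and query-independent parts into one score."""
    # Dynamic part: how well the text matches, scaled by the collection weight.
    dynamic = hit["text_match"] * COLLECTION_WEIGHT.get(hit["collection"], 1.0)
    # Static part: points assigned at processing time by a custom stage.
    static = hit.get("static_rank", 0.0)
    # Manual part: promote (positive) or demote (negative) specific documents.
    return dynamic + static + MANUAL_BOOST.get(hit["id"], 0.0)


def rank(hits: List[dict]) -> List[dict]:
    """Order hits from best to worst score."""
    return sorted(hits, key=score, reverse=True)


if __name__ == "__main__":
    hits = [
        {"id": "a", "collection": "web", "text_match": 10.0},
        {"id": "b", "collection": "indico", "text_match": 9.0},
        {"id": "c", "collection": "web", "text_match": 5.0},
    ]
    print([h["id"] for h in rank(hits)])
```

Changing one weight shifts the order for every query that touches that collection, which is why the balance across collections takes manual effort.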

Changes on FAST ESP products

• Before, only one product
  – FAST ESP 5.3 (standalone product)
• Now, several possibilities
  – FAST
    • FAST Search Server 2010 for Internal Applications (FSIA)
    • FAST Search Server 2010 for Internet Sites (FSIS)
  – Microsoft + FAST
    • FAST Search Server 2010 for Sharepoint (FS4SP)
      – Same core
      – Configuration and OTB (out-of-the-box) pipeline adapted for Sharepoint
      – Reduced set of tools; others migrated to Sharepoint or PowerShell cmdlets

FAST Search for Sharepoint: Architecture Overview

FAST Search for Sharepoint: Topology

Diagram labels: Sharepoint Crawler, Sharepoint Sites, Web Sites, File Shares, Exchange public folders, Lotus Notes, FAST Enterprise Crawler, Search Centre

Server Architecture

• Two systems (Production + Dev)
• Using Sharepoint Central Service
• Production
  – 1 admin node
  – 1 crawler + pre-processing node
  – 4-node index cluster
    • Both roles: Indexer and Search
    • 2 rows
      – Backup
      – Query performance
    • 2 columns
      – Easily handles more than 30 million documents
  – High reliability on critical components
    • Content Distributors, QueryServers, Document Processors
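The row/column layout can be sketched as a 2 × 2 grid. The sketch assumes the usual meaning of the terms: columns partition the index, and rows hold replicas of every column. The hash-based placement and the round-robin row choice are assumptions, not the product's actual distribution logic.

```python
import zlib
from typing import List, Tuple

COLUMNS = 2  # index partitions: add columns to hold more documents
ROWS = 2     # replicas of each partition: add rows for backup and query throughput

Node = Tuple[int, int]  # (row, column)


def column_for(doc_id: str) -> int:
    """Pick the partition for a document with a stable hash (assumed placement rule)."""
    return zlib.crc32(doc_id.encode("utf-8")) % COLUMNS


def nodes_for_document(doc_id: str) -> List[Node]:
    """A document lives in one column and is replicated on every row."""
    col = column_for(doc_id)
    return [(row, col) for row in range(ROWS)]


def nodes_for_query(query_number: int) -> List[Node]:
    """A query is answered by one row and must visit every column of that row."""
    row = query_number % ROWS  # simple round-robin over rows
    return [(row, col) for col in range(COLUMNS)]


if __name__ == "__main__":
    print("index cluster size:", ROWS * COLUMNS, "nodes")
    print("doc-42 stored on:", nodes_for_document("doc-42"))
    print("query 0 served by:", nodes_for_query(0))
    print("query 1 served by:", nodes_for_query(1))
```

With this layout, losing one node still leaves a complete copy of the index on the other row.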

FAST Search for Sharepoint: New features (I)

• New Query Suggestions model
  – Based on dictionary and common user queries
• Best Bets & Visual Best Bets
• Custom search experience (per user/role)
• New management system (Microsoft style)
  – SCOM, PowerShell, …
• Sharepoint integration
• Phonetic and nickname search
• Thumbnails and previews in results

FAST Search for Sharepoint: New features (II)

• Entity extraction
• Office Web Apps integration
• Relevance improvements with social behaviour
  – Click-through relevancy
• Enhanced Results Refinement
  – Deep results refinement
  – Based on any managed properties
  – Similar results
• Federated Search

Migration Process

• Migrate pipelines
• Adapt retrieval and pre-processing scripts
• Port custom processing stages
• Migrate feed process to use Sharepoint Crawlers (File Shares)
• Customize Search Centre to offer the same functionality as the old system
• Create general helper tools
  – Manage index profile
  – Manage keywords, best bets, …

Examples

• Best Bets & Visual Best Bets

Examples

• Visual Refiners

Examples

• Federated search examples (Google, Bing, Twitter)

Search Driven Application

Conclusions

• Successfully migrated all the content from the old system
  – Experience in the same technology
• Reduced tools and help for content other than Sharepoint
• But,
  – New interesting features, Sharepoint integration
  – Complete Search Centre
• More community behind it
• High cohesion between Sharepoint and Search Services

Next Steps

• Integration with Drupal
  – Customized pre-processing, processing, index and query
• Index SSO Centrally Managed Sites
  – Own SSO crawling, get ACLs, processing
• Continue evolving the new system
  – Take advantage of all FS4SP features
    • Office Web Apps, Visual Refiners, phonetic search, ...
  – Together with content providers, improve
    • Relevancy, Best Bets, ...

CERN Search @

• CERN Search: http://cern.ch/search
• and also via:
  – CERN Intranet & Public Pages
  – TWiki
  – IT, HR, PH Websites
  – JACOW

Questions?