CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
OIS
Andreas Wagner – CERN IT/OIS
Eduardo Alvarez – CERN IT/OIS
Sergio Fernandez – CERN IT/OIS

Summary
• Introduction to search
• Inside CERN Search
• New Search Solution
  – Concepts, collections, pipelines, stages, architecture
  – Search features
• Demo
• Conclusions and future work

What is Search?
• Search is the art of balancing three factors:
  – Recall
    • Of all matching documents, how many were returned?
  – Precision
    • Of the returned documents, how many match the query?
  – Relevancy
    • How well does a document match the query?
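
As a small illustration of the first two factors, the sketch below computes precision and recall for a single query. The function name and the document ids are made up for this example; it assumes the set of truly matching documents is known, which in practice is the hard part.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the documents the engine returned
    relevant: ids of the documents that really match the query
    """
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = returned_set & relevant_set
    # Precision: share of the returned documents that match.
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    # Recall: share of the matching documents that were returned.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents truly match.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Here two of the four returned documents match (precision 0.5), and two of the three matching documents were found (recall 2/3). Returning more documents tends to raise recall and lower precision, which is the balance the slide refers to.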

Enterprise Search
• Wide range of document sources:
  – Web Pages
  – File systems
  – Databases
  – Directories (People and Places)
  – Document repositories (CDS, EDMS, Indico, …)
  – Structured CMS Data
    • Sharepoint, Drupal, Twiki
• Variety of metadata
• Different Access Protection Schemes
• Different retrieval methods and frequencies

Enterprise Search
• Components of Enterprise Search:
  – Search Engine / Search Technology
  – Integration within existing infrastructure (authentication, authorization)
  – Document retrieval
    • Not only Web pages
    • Database/XML data (CDS, Indico, Phone data)
  – Protected documents
    • Access to document data
    • In addition, information about ACLs is needed
  – Ranking of documents
• Enterprise Search is not only a question about the search technology used!
• Collaboration with data owners

What about Google
• What makes Google Web search so good
  – Huge Web space analysis capabilities
  – Huge usage data used for “voting” the results: the most popular results are promoted
  – Substantial resources to tune and correct results:
    • usage data analysis
    • taking into account popular events
    • hand-edited results for popular single-keyword searches
  – Personalized filtering of results
    • Based on: location, preferences, search history, …
• The above is valid for all public web search engines (Yahoo, Bing)
• At the same time, Web Search is not Enterprise search!

Search at CERN?
• Why a Search Service? If…
  – Every system usually has its own search system
• Probably one of the best places for this service
• Quite a lot of different content sources
• High rate of new content
• Solutions are not always optimal
• Centralize the search of content

CERN Search
• A Central Search solution to provide
  • for users
    – Single entry point for searching information on several content sources at the same time
  • for service providers
    – Search backend service
      » TWiki, Drupal, Sharepoint, JACOW, Groups
• Start of project in February 2006
  • Based on commercial product from FAST (Microsoft subsidiary and market leader)
• CERN Search in production since 2007
• Present resources: 1 PJAS & some fraction of a staff

CERN Search
Latest Progress
• 2009 – Migration to FAST ESP 5.3
• 2010 – Reorganization of the Indexed Web Space (improved relevancy)
• 2010 – Twiki protected pages indexed
  – Service used as default Twiki search
• 1Q 2011 – Indico Protected Docs + Material
• 1Q 2011 – Index of the Sharepoint content
• 3Q 2011 – Migration to FAST Search Server 2010 for Sharepoint

Overview of Indexed documents
Documents indexed by CERN Search

Source                               2011        2010        2009        2008
CERN Websites                        1561670     1456510     1787805     829542
CDS                                  1116921     1048360     1040694     936018
TWiki Pages                          68055       60796       ---         ---
Indico                               1531908     311208      255365      432339
JACOW                                169204      144388      ---         ---
Phonebook                            26426       29819       25629       23982
Sharepoint (eSpace/Groups Archive)   6721347     ---         ---         ---
Central SSO Websites                 ---         ---         ---         ---
Total                                11 million  3 million   3 million   2 million

Concepts I
• Document
• Pipeline
• Processing Stage
• Collection
• Crawler (Files, Web)
[Figure labels: Collection A, Collection B]

Concepts II
[Figure: Document Content Flow. Labels: Connectors (Push & Pull), Document retrieval, Document processing, Document indexing, Content API, Query API, Filter API]
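
The content flow (retrieval, processing, indexing) can be pictured with a toy pipeline. This is only a sketch of the general idea and not the FAST API: the `Document` class, the two stages and the inverted index are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    content: str
    attributes: dict = field(default_factory=dict)


def normalize_whitespace(doc):
    """Processing stage: collapse runs of whitespace in the content."""
    doc.content = " ".join(doc.content.split())
    return doc


def tokenize(doc):
    """Processing stage: store lower-cased tokens as a document attribute."""
    doc.attributes["tokens"] = doc.content.lower().split()
    return doc


def run_pipeline(doc, stages):
    """A pipeline is an ordered list of stages applied one after another."""
    for stage in stages:
        doc = stage(doc)
    return doc


def index_collection(documents, stages):
    """Build a toy inverted index: token -> set of document ids."""
    index = {}
    for doc in documents:
        processed = run_pipeline(doc, stages)
        for token in processed.attributes["tokens"]:
            index.setdefault(token, set()).add(processed.doc_id)
    return index


# "Retrieval" is faked here with two in-memory documents of one collection.
collection_a = [
    Document("a1", "CERN   Search  service"),
    Document("a2", "Search pipeline stages"),
]
index = index_collection(collection_a, [normalize_whitespace, tokenize])
```

In this sketch each collection could be given its own list of stages, which is one way to read the relation between collections and pipelines.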

Indexing Protected Content I
• To allow indexing protected content we need to
  • Retrieve the document
    • The search engine needs access to the document
  • Obtain the document ACLs
    • To be able to decide who is allowed to find a document
    • Often not trivial, since most systems answer the question
      “Does a given user have the right to access a given document?”
      and not
      “Who has access to a given document?”
This is often due to complex permission models, including inheritance, fine granularity of permissions, and permissions that change during the document lifecycle …
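
The difference between the two questions can be made concrete with a toy permission model. Everything below (paths, users, the inheritance rule) is invented for illustration: the source system only offers a per-user check, so the ACL has to be derived by asking that question for every known user.

```python
# Toy permission model (hypothetical): a document inherits the readers of
# its parent folder unless it defines its own reader list.
FOLDER_READERS = {"/projects": {"alice", "bob"}, "/hr": {"carol"}}
DOCUMENT_READERS = {"/projects/plan.doc": {"alice"}}  # overrides the folder


def can_access(user, path):
    """The question most systems answer: may this user read this document?"""
    if path in DOCUMENT_READERS:
        return user in DOCUMENT_READERS[path]
    folder = path.rsplit("/", 1)[0]
    return user in FOLDER_READERS.get(folder, set())


def derive_acl(path, all_users):
    """The question the indexer needs answered: who may read this document?

    With only a per-user check available, the generic fallback is to ask
    once per known user, which costs one call per user per document.
    """
    return {user for user in all_users if can_access(user, path)}


users = ["alice", "bob", "carol"]
acl_plan = derive_acl("/projects/plan.doc", users)    # document-level override
acl_notes = derive_acl("/projects/notes.txt", users)  # inherited from folder
```

The brute-force loop also has to be repeated whenever permissions change, which is one reason a native "who has access" interface from the data owner is preferable.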

Indexing Protected Content II
• Document Processing
  • Resolve ACLs to SIDs
  • Sent to Indexer with document
• FSA (FAST Security Authorization) Component
  • Active Directory integration, i.e. based on CERN accounts and e-groups
[Figure labels: Document Repository, Document, ACL, Document Processing, Active Directory Users & Groups, Doc + ACL, Search Index, CERN Search]
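
A minimal sketch of the index-time step, assuming a simple name-to-SID lookup. The directory contents, the SID strings and both function names are made up; the real resolution goes through Active Directory and the FSA component, whose interfaces are not shown in the slides.

```python
# Hypothetical directory data: account or group name -> security identifier.
DIRECTORY_SIDS = {
    "alice": "S-1-5-21-1001",
    "bob": "S-1-5-21-1002",
    "search-team": "S-1-5-21-2001",
}


def resolve_acl_to_sids(acl_names):
    """Map each ACL entry to its SID; names unknown to the directory are skipped."""
    return sorted(DIRECTORY_SIDS[name] for name in acl_names if name in DIRECTORY_SIDS)


def prepare_for_indexing(doc_id, content, acl_names):
    """Bundle the document with its resolved ACL before it goes to the indexer."""
    return {
        "id": doc_id,
        "content": content,
        "acl_sids": resolve_acl_to_sids(acl_names),
    }


record = prepare_for_indexing(
    "doc-1", "Minutes of the meeting", ["alice", "search-team", "unknown"]
)
```

Skipping unknown names errs on the side of hiding a document rather than exposing it; whether that is the right policy is a choice for the service, not something the slides state.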

Authentication / Authorisation
• Query Processing
  • Authentication by Front-End
  • FSA creates filter with expanded user credentials and groups
[Figure labels: Search Front End, Authentication (SSO) & Search, Query & Identity, Group Membership, Active Directory Users & Groups, Search Index, CERN Search]
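
The query-time side can be sketched in the same spirit: the user's identity is expanded to the user's own SID plus the SIDs of all groups the user belongs to, and only documents whose ACL intersects that set are returned. The data and function names are hypothetical and stand in for the Active Directory lookup and the FSA filter.

```python
# Hypothetical data: user SIDs, group membership, and indexed documents
# that carry the SIDs allowed to see them.
USER_SID = {"alice": "S-1001", "bob": "S-1002"}
GROUP_MEMBERS = {"S-2001": {"alice"}, "S-2002": {"alice", "bob"}}
INDEX = [
    {"id": "doc-1", "text": "budget report", "acl_sids": {"S-2001"}},
    {"id": "doc-2", "text": "public report", "acl_sids": {"S-2002"}},
    {"id": "doc-3", "text": "private report", "acl_sids": {"S-1002"}},
]


def expand_identity(user):
    """Return the user's own SID plus the SIDs of all groups containing the user."""
    sids = {USER_SID[user]}
    sids.update(group for group, members in GROUP_MEMBERS.items() if user in members)
    return sids


def secure_search(user, term):
    """Return ids of documents that contain the term and are visible to the user."""
    allowed = expand_identity(user)
    return [
        doc["id"]
        for doc in INDEX
        if term in doc["text"].split() and doc["acl_sids"] & allowed
    ]


alice_hits = secure_search("alice", "report")
bob_hits = secure_search("bob", "report")
```

The same query returns different result lists for the two users, and neither learns that the documents filtered out exist.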

FAST Search for Sharepoint
Cluster Architecture

Index Profile
• Final representation of each document
• Set of attributes to index (Managed Properties)
  – Title
  – Author
  – Last modified date
  – ACLs
• Define which properties are queryable, refiners, or sortable
• Define FullTextIndex properties
• Define mappings to FullTextIndex
• Flexible
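
One way to picture an index profile is as a small schema: a list of managed properties, each with flags saying how it may be used. The class, field names and flag choices below are assumptions made for this sketch and do not reproduce the FAST configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedProperty:
    name: str
    queryable: bool = True
    refiner: bool = False
    sortable: bool = False
    in_full_text_index: bool = False


# Hypothetical profile built from the attributes named on the slide.
INDEX_PROFILE = [
    ManagedProperty("title", in_full_text_index=True),
    ManagedProperty("author", refiner=True),
    ManagedProperty("last_modified", refiner=True, sortable=True),
    ManagedProperty("acl"),
]


def to_index_record(raw_document):
    """Keep only the attributes that the profile declares."""
    names = {prop.name for prop in INDEX_PROFILE}
    return {key: value for key, value in raw_document.items() if key in names}


def full_text(record):
    """Concatenate the properties that are mapped to the full-text index."""
    return " ".join(
        str(record[prop.name])
        for prop in INDEX_PROFILE
        if prop.in_full_text_index and prop.name in record
    )


record = to_index_record({"title": "LHC status", "author": "A. Author", "scratch": 1})
refiners = [prop.name for prop in INDEX_PROFILE if prop.refiner]
```

Changing what can be searched, refined or sorted then means editing the profile rather than the documents, which is the flexibility the last bullet points at.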

Result Ranking – Rank Profiles

Ranking Issues at CERN
• Flat Web space
  – Lack of metadata (copy-paste, poorly maintained HTML meta tags, ...)
  – Isolated sites (not many inter-links, only to the CERN main page)
• Good experience with well structured content
  – Indico, CDS
• How to improve ranking?
  – Manual tuning of results: promote, demote
  – Modify rank profile
  – Custom processing stage for static rank points
• Not easy
  – Manpower intensive
  – Requires a better understanding of the indexed data
  – No magic solution: the rank profile must be balanced across different collections
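
The idea of a custom stage that assigns static rank points can be sketched as follows. The point values, the additive formula and the `static_weight` parameter are assumptions chosen for illustration; the actual rank profile combines more components than this.

```python
# Assumed static rank points per collection (illustrative values only).
STATIC_RANK_POINTS = {"indico": 20, "cds": 20, "web": 0}


def add_static_rank(doc):
    """Processing stage: attach static rank points based on the collection."""
    doc = dict(doc)  # work on a copy, leave the input untouched
    doc["static_rank"] = STATIC_RANK_POINTS.get(doc["collection"], 0)
    return doc


def final_score(doc, dynamic_score, static_weight=1.0):
    """Combine the query-dependent score with the query-independent points."""
    return dynamic_score + static_weight * doc["static_rank"]


def rank(results, static_weight=1.0):
    """results: list of (doc, dynamic_score) pairs; best final score first."""
    return sorted(
        results,
        key=lambda pair: final_score(pair[0], pair[1], static_weight),
        reverse=True,
    )


web_page = add_static_rank({"id": "w1", "collection": "web"})
agenda = add_static_rank({"id": "i1", "collection": "indico"})
boosted = [doc["id"] for doc, _ in rank([(web_page, 50), (agenda, 40)])]
unboosted = [doc["id"] for doc, _ in rank([(web_page, 50), (agenda, 40)], static_weight=0.0)]
```

With the boost the Indico entry overtakes the web page, and with `static_weight=0.0` the order reverts. Picking a weight that works for every collection at once is the balancing problem named on the slide.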

Changes on FAST ESP products
• Before, only one product
  – FAST ESP 5.3 (standalone product)
• Now, several possibilities
  – FAST
    • FAST Search Server 2010 for Internal Applications (FSIA)
    • FAST Search Server 2010 for Internet Sites (FSIS)
  – Microsoft + FAST
    • FAST Search Server 2010 for Sharepoint (FS4SP)
      – Same core
      – Configuration and OTB pipeline adapted for Sharepoint
      – Reduced set of tools, others migrated to Sharepoint or Powershell cmdlets

FAST Search for Sharepoint
Architecture Overview

FAST Search for Sharepoint
Topology
[Figure labels: Sharepoint Crawler, Sharepoint Sites, Web Sites, File Shares, Exchange public folders, Lotus Notes, FAST Enterprise Crawler, Search Centre]

Server Architecture
• Two systems (Production + Dev)
• Using Sharepoint Central Service
• Production
  – 1 admin node
  – 1 crawler + pre-processing node
  – 4-node index cluster
    • Both roles: Indexer and Search
    • 2 rows
      – Backup
      – Query performance
    • 2 columns
      – Easily handles more than 30 million documents
  – High reliability on critical components
    • Content Distributors, QueryServers, Document Processors
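
The rows-and-columns layout can be sketched as a small grid: columns split the index into partitions, and each row holds a full copy of all partitions. Assigning documents to columns with a CRC32 hash is an assumption of this example; the slides do not say how FAST distributes documents.

```python
import zlib

COLUMNS = 2  # index partitions: more columns, more documents
ROWS = 2     # replicas: more rows, more query capacity and a backup copy


def column_for(doc_id):
    """Pick a column with a stable hash (assumed scheme, for illustration)."""
    return zlib.crc32(doc_id.encode("utf-8")) % COLUMNS


def build_cluster(doc_ids):
    """Return cluster[row][column], the set of document ids held by each node."""
    cluster = [[set() for _ in range(COLUMNS)] for _ in range(ROWS)]
    for doc_id in doc_ids:
        column = column_for(doc_id)
        for row in range(ROWS):
            cluster[row][column].add(doc_id)  # every row stores the same partition
    return cluster


def search_row(cluster, row, predicate):
    """Answer a query from one row by asking every column in that row."""
    hits = set()
    for node in cluster[row]:
        hits |= {doc_id for doc_id in node if predicate(doc_id)}
    return hits


docs = [f"doc-{i}" for i in range(10)]
cluster = build_cluster(docs)
from_row_0 = search_row(cluster, 0, lambda doc_id: doc_id.endswith("3"))
from_row_1 = search_row(cluster, 1, lambda doc_id: doc_id.endswith("3"))
```

Either row can answer the whole query on its own, so one row can fail or be busy without losing results, while each node only stores about half of the documents.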

FAST Search for Sharepoint
New features (I)
• New Query Suggestions model
  – Based on dictionary and common user queries
• Best Bets & Visual Best Bets
• Custom search experience (per user/role)
• New management system (Microsoft style)
  – SCOM, Powershell, …
• Sharepoint integration
• Phonetic and nickname search
• Thumbnails and previews in results

FAST Search for Sharepoint
New features (II)
• Entity extraction
• Office Web Apps integration
• Relevance improvements with social behaviour
  – Click-through relevancy
• Enhanced Results Refinement
  – Deep results refinement
  – Based on any managed properties
  – Similar results
• Federation Search
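
Results refinement on a managed property can be illustrated with a simple count of property values over a result list. The result records and function names are invented for this sketch, and the counts here are computed over the whole list that is passed in.

```python
from collections import Counter


def refiner_counts(results, property_name):
    """Count how many results carry each value of one managed property."""
    return Counter(r[property_name] for r in results if property_name in r)


def refine(results, property_name, value):
    """Narrow the result list to the entries with the chosen refiner value."""
    return [r for r in results if r.get(property_name) == value]


# Hypothetical result list for one query.
results = [
    {"title": "LHC status", "collection": "Indico", "author": "A"},
    {"title": "LHC report", "collection": "CDS", "author": "B"},
    {"title": "LHC talk", "collection": "Indico", "author": "B"},
]
counts = refiner_counts(results, "collection")
indico_only = refine(results, "collection", "Indico")
```

Because the property name is a parameter, the same two functions work for any managed property present on the results, for example `author` instead of `collection`.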

Migration Process
• Migrate Pipelines
• Adapt Retrieval and Pre-processing scripts
• Port Custom processing stages
• Migrate feed process to use Sharepoint Crawlers (File Shares)
• Customize Search Centre to offer the same functionality as the old system
• Create general helper tools
  – Manage index profile
  – Manage keywords, best bets, …

Examples
• Best Bets & Visual Best Bets

Examples
• Visual Refiners

Examples
• Federation search examples (Google, Bing, Twitter)

Search Driven Application

Conclusions
• Successfully migrated all the content from the old system
  – Experience in the same technology
• Reduced tools and help for content other than Sharepoint
• But,
  – New interesting features, Sharepoint integration
  – Complete Search Centre
• More community behind
• High cohesion between Sharepoint and Search Services

Next Steps
• Integration with Drupal
  – Customized pre-processing, processing, index and query
• Index SSO Centrally Managed Sites
  – Own SSO crawling, get ACLs, processing
• Continue evolving the new system
  – Take advantage of all FS4SP features
    • Office WebApps, Visual Refiners, phonetic search, ...
  – Together with content providers, improve
    • Relevancy, Best Bets, ...

CERN Search @
• CERN Search: http://cern.ch/search
• and also via:
  – CERN Intranet & Public Pages
  – TWiki
  – IT, HR, PH Websites
  – JACOW