Hippo get together presentation solr integration
Transcript of Hippo get together presentation solr integration
Solr integration
April 20, 2012
Ard Schrijvers • [email protected] / [email protected]

About me: Ard Schrijvers
1. Working at Hippo since 2001
2. Email: [email protected] / [email protected]
3. Worked primarily on:
   1. HST
   2. Hippo Repository / Jackrabbit
   3. Lucene
   4. Cocoon
   5. Slide
4. Apache committer of Jackrabbit and Cocoon
Outline
1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to rescue
5. A very fast demo
6. Wrap up
7. Questions
Current search architecture
So: an HSTQuery is translated to an XPath query, which is delegated to the repository. The repository returns a JCR NodeIterator, which the HST binds back to HippoBeans.

That sounds doable and not too complex, is it?
Well, it is ... very complex.
Reasons:
1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
3. Support for XPath / SQL was needed. However, Lucene likes flattened data, while JCR with XPath / SQL is all about hierarchical data
4. JCR Nodes != Documents
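Reasons 3 and 4 can be illustrated with a small, library-free sketch. `SketchNode` and `NodeFlattener` are hypothetical names, not Jackrabbit or HST classes; the code only shows how one document made of several hierarchical nodes might be collapsed into the flat field/value form that Lucene prefers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a JCR node: a name, some properties, child nodes.
class SketchNode {
    final String name;
    final Map<String, String> properties = new LinkedHashMap<String, String>();
    final List<SketchNode> children = new ArrayList<SketchNode>();

    SketchNode(String name) { this.name = name; }

    SketchNode with(String key, String value) { properties.put(key, value); return this; }

    SketchNode child(SketchNode node) { children.add(node); return this; }
}

class NodeFlattener {
    // Counts the node itself plus all of its descendants.
    static int countNodes(SketchNode node) {
        int count = 1;
        for (SketchNode c : node.children) {
            count += countNodes(c);
        }
        return count;
    }

    // Collapses the whole subtree into one flat map keyed by relative path.
    static Map<String, String> flatten(SketchNode root) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        collect(root, "", fields);
        return fields;
    }

    private static void collect(SketchNode node, String prefix, Map<String, String> fields) {
        for (Map.Entry<String, String> p : node.properties.entrySet()) {
            fields.put(prefix + p.getKey(), p.getValue());
        }
        for (SketchNode c : node.children) {
            collect(c, prefix + c.name + "/", fields);
        }
    }
}
```

In this sketch a news document with a body node and an image node counts as three nodes but yields a single flat document. Same-named sibling nodes would overwrite each other's keys; a real indexer would need multi-valued fields for that case.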
Current problems / shortcomings / mismatches
1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
3. Very hard to customize, and only to a very limited extent
4. A single index for an entire workspace
5. Support for very complex XPath / SQL queries comes at a price in CPU, memory and complexity
6. Only JCR Nodes and properties are indexed: no 'derived' field indexes
7. To index external sources, the sources need to be stored in the repository
8. Range queries (and others) easily blow up
9. Getting the number of hits is complex
Extra problem: JCR Nodes != Documents
For example: a news document contains a link to an author document. The news document should be found through the author name.
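The news/author example can be sketched as a 'derived' field: the author's name is copied into the fields indexed for the news document. All class names below are hypothetical and the `matches` method is only a naive stand-in for a real search engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical documents; in the repository these would be separate JCR nodes.
class AuthorDoc {
    final String name;

    AuthorDoc(String name) { this.name = name; }
}

class NewsDoc {
    final String title;
    final AuthorDoc author; // a link to another document, may be null

    NewsDoc(String title, AuthorDoc author) {
        this.title = title;
        this.author = author;
    }
}

class DerivedFieldIndexer {
    // Builds the flat fields for one news document, copying in the linked author's name.
    static Map<String, String> toIndexFields(NewsDoc news) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("title", news.title);
        if (news.author != null) {
            fields.put("author", news.author.name); // derived from the linked document
        }
        return fields;
    }

    // Naive case-insensitive substring match over all field values.
    static boolean matches(Map<String, String> fields, String term) {
        String needle = term.toLowerCase();
        for (String value : fields.values()) {
            if (value.toLowerCase().contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

With the derived field in place, a search for the author's name hits the news document itself, which a node-per-node index cannot offer without extra joins.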
Objectives
1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
2. Easy to use and customize
3. Satisfied customers
4. Satisfied partners
5. Scalable searches: CPU, memory and large document numbers
6. Document oriented
7. Integration with HST ContentBeans (HippoBeans)
8. Index external sources
9. Control the SIZE of the index yourself
10. Don't invent but integrate (with out-of-the-box features supported by a large community)
Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides
Easy: Solr integration to rescue
Objective: Easy to use and customize
YOU will be in the driver's seat.
No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU.

You decide 'from where', 'what', 'how' and 'when' to index:
1. from where: which sources (JCR, web pages, database, NoSQL store, Nuxeo, Alfresco, anything)
2. what: which parts of a document (not JCR node) or external source
3. how:
   1. which analyzer
   2. index on document level, property level or both
   3. store the text
4. when: when do you want to index

But of course, out-of-the-box support and tooling are ready to be used by YOU:
1. Default Hippo repository indexer & observer
2. ContentBean (HippoBean) annotations for indexing
3. Binding search results to ContentBeans
4. Deployment support
5. Clustering support
Objective: Satisfied customers
HOW? EASY: most likely they will just be satisfied.

If they are not satisfied enough you can:
1. Easily customize it (aka tune it for as long as you like)
2. Hire anyone with Solr experience: all our partners have Solr experience

Still not satisfied?
Let them pay too much for a Google Search Appliance, Autonomy or any of the other 'useless to pay for' software.
Objective: Satisfied partners
Although on thin ice here, I strongly believe in this because:
1. Our partners frequently have good knowledge about Solr
2. Our partners depend less on the current search limitations
3. Our partners can pitch with their Solr knowledge
4. Our partners can sell more Hippo implementations
5. Our partners will earn more on Hippo and have happier developers
6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features
Objective: Scalable searches
1. Using Solr to do the searches
2. Not the complex JCR hierarchical searches
3. Document oriented instead of JCR Nodes (#docs << #nodes)
Objective: Document oriented
What do we want to search for?
Exactly, Documents!!
A Document == A HippoBean != JCR Node
So let's index HippoBeans (ContentBeans)
Objective: Integration with ContentBeans (HippoBeans)
As a developer ... how am I going to index my beans?
I know how to write HippoBeans, that's all I ever did in my life.
How do you expect me to index my beans?

Annotate your getters with
    @IndexField
or
    @IndexField(name="foo")

And account for them in the Solr schema.xml:
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="summary" type="text_general" indexed="true" stored="true" />
An example:
    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        // 'samenvatting' is Dutch for 'summary'; the Solr field gets this name
        @IndexField(name="samenvatting")
        public String getSummary() {
            return getProperty("demosite:summary");
        }
    }
Another example:
    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        @IndexField
        public String getAuthor() {
            return getLinkedBean("demosite:author", Author.class).getAuthor();
        }
    }
Another example:
    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        @ReIndexOnChange
        @IndexField
        public Author getAuthor() {
            return getLinkedBean("demosite:author", Author.class);
        }
    }
Another example: Setters
    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        private String title;
        private String summary;

        @IndexField
        public String getTitle() {
            return title == null ? getProperty("demosite:title") : title;
        }

        public void setTitle(String title) {
            this.title = title;
        }

        @IndexField
        public String getSummary() {
            return summary == null ? getProperty("demosite:summary") : summary;
        }

        public void setSummary(String summary) {
            this.summary = summary;
        }
    }

Bonus: What can we achieve with the Setters?
That's all you need to do. And the HST binds some extra indexing fields, like:
1. The path
2. The canonicalUUID
3. The name
4. The localized name
5. The depth
6. The class hierarchy (including interfaces)
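How annotated getters might be turned into index fields can be sketched with plain reflection. This is not the HST implementation: `SketchIndexField` is a hypothetical re-declaration of the annotation so the example compiles on its own, and deriving the default field name from the getter name ("getTitle" becomes "title") is an assumption based on the schema.xml fields shown earlier.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical annotation, declared here only to keep the sketch self-contained.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchIndexField {
    String name() default "";
}

// A plain bean standing in for a HippoBean; no JCR node is involved.
class SketchTextBean {
    @SketchIndexField
    public String getTitle() { return "Hello"; }

    @SketchIndexField(name = "samenvatting")
    public String getSummary() { return "Short text"; }

    public String getInternalNote() { return "not indexed"; }
}

class AnnotatedFieldCollector {
    // Returns field name -> value for every annotated, no-argument public method.
    static Map<String, Object> collect(Object bean) {
        Map<String, Object> fields = new TreeMap<String, Object>();
        for (Method m : bean.getClass().getMethods()) {
            SketchIndexField ann = m.getAnnotation(SketchIndexField.class);
            if (ann == null || m.getParameterTypes().length != 0) {
                continue;
            }
            String name = ann.name().isEmpty() ? defaultName(m.getName()) : ann.name();
            try {
                fields.put(name, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + m.getName(), e);
            }
        }
        return fields;
    }

    // "getTitle" -> "title"; any other method name is used unchanged.
    static String defaultName(String methodName) {
        if (methodName.startsWith("get") && methodName.length() > 3) {
            return Character.toLowerCase(methodName.charAt(3)) + methodName.substring(4);
        }
        return methodName;
    }
}
```

Getters without the annotation, such as `getInternalNote` here, are simply left out of the resulting field map.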
Objective: Index external sources
You can:
1. Push them directly to Solr
2. Push them to a HST JAX-RS resource that binds to a ContentBean and commits to Solr
3. Crawl from the HST and bind to ContentBeans and commit them to Solr
A ContentBean does *not* need a JCR Node!

ContentBean interface:
    public interface ContentBean {

        @IndexField(name="id")
        String getPath();

        void setPath(String path);
    }
An example: GoGreenProductBean in Testsuite
    public class GoGreenProductBean implements ContentBean {

        private String path;
        private String title;
        private String summary;
        private String description;

        public String getPath() { return path; }
        public void setPath(final String path) { this.path = path; }

        @IndexField
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }

        @IndexField
        public String getSummary() { return summary; }
        public void setSummary(String summary) { this.summary = summary; }

        @IndexField
        public String getDescription() { return description; }
        public void setDescription(String description) { this.description = description; }
    }
And add the GoGreenProductBean to Solr:
    {
        List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
        // FILL THE gogreenBeans LIST

        // NOW ADD TO INDEX
        HippoSolrManager solrManager = HstServices.getComponentManager().getComponent(
                HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
        try {
            solrManager.getSolrServer().addBeans(gogreenBeans);
            UpdateResponse commit = solrManager.getSolrServer().commit();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }
Objective: Control the SIZE of the index yourself
JCR / Jackrabbit / Hippo Repository has a generic one-fits-all index (or one-fits-none index), which easily grows very large and can hardly be customized.

However, search is domain specific. Thus: just index what is needed for the customer.
Objective: Don't invent but integrate
Use Solr
Use Solrj client
Expose the Solrj SolrQuery

For example:
    HippoSolrManager solrManager = ...
    String query = ...
    HippoQuery hippoQuery = solrManager.createQuery(query);
    hippoQuery.setLimit(pageSize);
    hippoQuery.setOffset((page - 1) * pageSize);

    // hippoQuery.getSolrQuery() is the SolrQuery object
    // include scoring
    hippoQuery.getSolrQuery().setIncludeScore(true);
    hippoQuery.getSolrQuery().setHighlight(true);
    hippoQuery.getSolrQuery().setHighlightFragsize(200);
    hippoQuery.getSolrQuery().addHighlightField("title");
    hippoQuery.getSolrQuery().addHighlightField("summary");
    hippoQuery.getSolrQuery().addHighlightField("htmlContent");

    HippoQueryResult result = hippoQuery.execute(true);
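The paging arithmetic in the example assumes a 1-based page number. A minimal sketch of that calculation, with a hypothetical `Paging` helper that is not part of the HST or Solrj, could look like this:

```java
// Hypothetical helper; not part of the HST or Solrj.
class Paging {
    // Offset of the first hit on a 1-based page: (page - 1) * pageSize.
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    // Number of pages needed to show totalHits results, rounded up.
    static int pageCount(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize < 1) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize >= 1");
        }
        return (int) ((totalHits + pageSize - 1) / pageSize);
    }
}
```

Rejecting a page number below 1 up front avoids passing a negative offset on to the query.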
Solr integration to rescue
No further comments :-)
A very fast demo
Setup: ~75,000 long Wikipedia docs in the repository
... doing the demo ...
That was: a very fast demo
Wrap up
I think that with the Solr integration:
1. Developers will be happier
2. Customers will be happier
3. Partners will be happier
4. Hippo will be happier
And finally, last and least:
5. Infra will be happier because the servers stop sweating
Questions?
Check out the example at: http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk