Hippo get together presentation solr integration

Solr integration

April 20, 2012
Ard Schrijvers • [email protected] / [email protected]

About me: Ard Schrijvers

1. Working at Hippo since 2001
2. Email: [email protected] / [email protected]
3. Worked primarily on:
   1. HST
   2. Hippo Repository / Jackrabbit
   3. Lucene
   4. Cocoon
   5. Slide
4. Apache committer of Jackrabbit and Cocoon

Outline

1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to rescue
5. A very fast demo
6. Wrap up
7. Questions

Current search architecture


So: an HSTQuery is translated to an XPath query, which is delegated to the repository, which returns a JCR NodeIterator, which the HST binds back to HippoBeans


That sounds doable and not too complex

Is it?

Well, it is ....... very complex


Reasons:

1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
3. Support for XPath / SQL was needed. However, Lucene likes flattened data, while JCR with XPath / SQL is all about hierarchical data
4. JCR Nodes != Documents
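To make reason 3 a bit more concrete: one common way to answer a hierarchical "everything below this path" constraint on a flat index is to store every ancestor path of an item as a separate value, so the constraint becomes a plain lookup. The sketch below only illustrates that idea with strings; the class and method names are made up for this example, and it is not claimed to be how Jackrabbit or the Hippo Repository implement it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jackrabbit, Lucene or the HST.
final class PathFlattening {

    /** Returns all ancestors of the path followed by the path itself, shortest first. */
    static List<String> selfAndAncestors(String path) {
        List<String> result = new ArrayList<>();
        // Start at index 1 so the leading slash does not produce an empty entry.
        int slash = path.indexOf('/', 1);
        while (slash != -1) {
            result.add(path.substring(0, slash));
            slash = path.indexOf('/', slash + 1);
        }
        result.add(path);
        return result;
    }

    /** With the flattened values, "is under scope" is a simple membership test. */
    static boolean isUnder(String path, String scope) {
        return selfAndAncestors(path).contains(scope);
    }

    public static void main(String[] args) {
        System.out.println(selfAndAncestors("/content/documents/news"));
        System.out.println(isUnder("/content/documents/news", "/content/documents"));
    }
}
```

The price of this kind of flattening is index size: every item carries one value per ancestor, which is part of why a generic index for hierarchical queries grows quickly.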

Outline

1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to rescue
5. A short HOWTO as developer
6. A very fast demo
7. Wrap up
8. Questions

Current problems / shortcomings / mismatches

1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
3. Very hard to customize, and only to a very limited extent
4. A single index for an entire workspace
5. Support for very complex XPath / SQL queries comes at a price of CPU, memory and complexity
6. Only JCR Nodes and properties are indexed: no 'derived' field indexes
7. To index external sources, the sources need to be stored in the repository
8. Range queries (and others) easily blow up
9. Getting the number of hits is complex


Extra problem

JCR Nodes != Documents

For example: a news document contains a link to an author document. The news document should be found through the author name.
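The news / author case can be sketched with plain maps: the author name is not stored on the news item itself, so a node-oriented index never sees it, while a document-oriented indexer can resolve the link and add a derived field. Everything below (the class, the field names "title", "authorId" and "author") is a made-up toy model, not the JCR or HST API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy model with plain maps; none of these names come from the HST or JCR APIs.
final class DerivedFieldSketch {

    /** Builds one flat index document for a news item, resolving its author link. */
    static Map<String, String> toIndexDocument(Map<String, String> news,
                                               Map<String, Map<String, String>> documentsById) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", news.get("title"));
        // Derived field: the value is stored in the linked author document.
        Map<String, String> author = documentsById.get(news.get("authorId"));
        if (author != null && author.get("name") != null) {
            fields.put("author", author.get("name"));
        }
        return fields;
    }

    /** Naive stand-in for a search: true if any field value contains the term. */
    static boolean matches(Map<String, String> indexDocument, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String value : indexDocument.values()) {
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

Searching the raw news map for the author name finds nothing, while the same search on the flattened index document matches, which is the behaviour the slide asks for.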



Objectives

1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
2. Easy to use and customize
3. Satisfied customers
4. Satisfied partners
5. Scalable searches: CPU, memory and large document numbers
6. Document oriented
7. Integration with HST ContentBeans (HippoBeans)
8. Index external sources
9. Control the SIZE of the index yourself
10. Don't invent but integrate (with out-of-the-box features supported by a large community)

Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides

Easy: Solr integration to rescue

Objective: Easy to use and customize


YOU will be in the driver's seat


No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU

You decide 'from where', 'what', 'how' and 'when' to index

1. from where: which sources (JCR, webpages, database, NoSQL store, Nuxeo, Alfresco, anything)
2. what: which parts of a document (not JCR node) or external source
3. how:
   1. which analyzer
   2. index on document level, property level or both
   3. store the text
4. when: when do you want to index


But of course, out-of-the-box support and tooling ready to be used by YOU

1. Default hippo repository indexer & observer
2. ContentBean (HippoBean) annotations for indexing
3. Binding search results to ContentBeans
4. Deployment support
5. Clustering support

Objective: Satisfied customers


HOW?


EASY


Most likely they will simply be satisfied


If they are not satisfied enough you can:

1. Easily customize it (aka tune it until the cows come home)
2. Hire anyone with Solr experience: all our partners have Solr experience


Still not satisfied?

Let them pay too much for a Google Search Appliance, Autonomy or any of the other 'useless to pay for' software

Objective: Satisfied partners


Although on thin ice here, I strongly believe in this because:


1. Our partners frequently have good knowledge about Solr
2. Our partners depend less on the current search limitations
3. Our partners can pitch with their Solr knowledge
4. Our partners can sell more Hippo implementations
5. Our partners will earn more on Hippo and have happier developers
6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features

Objective: Scalable searches


1. Using Solr to do the searches
2. Not the complex JCR hierarchical searches
3. Document oriented instead of JCR Nodes (#docs << #nodes)

Objective: Document oriented


What do we want to search for?


Exactly,

Documents!!


A Document == A HippoBean != JCR Node


So let's index HippoBeans (ContentBeans)

Objective: Integration with ContentBeans (HippoBeans)


As a developer ....

how am I going to index my beans?


I know how to write HippoBeans, that's all I ever did in my life


How do you expect me to index my beans?


Annotate your getters with @IndexField or @IndexField(name="foo")

And account for them in the Solr schema.xml:

    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="summary" type="text_general" indexed="true" stored="true"/>


An example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        // Indexed under the default field name
        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        // Indexed under the explicit field name "samenvatting"
        @IndexField(name="samenvatting")
        public String getSummary() {
            return getProperty("demosite:summary");
        }
    }


Another example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        // Derived field: the value comes from the linked author document
        @IndexField
        public String getAuthor() {
            return getLinkedBean("demosite:author", Author.class).getAuthor();
        }
    }


Another example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        @ReIndexOnChange
        @IndexField
        public Author getAuthor() {
            return getLinkedBean("demosite:author", Author.class);
        }
    }


Another example: Setters

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        private String title;
        private String summary;

        // Falls back to the JCR property when no value has been set
        @IndexField
        public String getTitle() {
            return title == null ? getProperty("demosite:title") : title;
        }

        public void setTitle(String title) {
            this.title = title;
        }

        @IndexField
        public String getSummary() {
            return summary == null ? getProperty("demosite:summary") : summary;
        }

        public void setSummary(String summary) {
            this.summary = summary;
        }
    }

Bonus : What can we achieve with the Setters?


That's all you need to do

And the HST binds some extra indexing fields like

1. The path
2. The canonicalUUID
3. The name
4. The localized name
5. The depth
6. The class hierarchy (including interfaces)
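The slides do not show how the annotated getters are picked up. As a rough illustration only, an indexer could walk the public getters with reflection and use either the explicit name or the bean property name as the field name. The @IndexField declared below is a local stand-in so the sketch compiles on its own, and SketchTextBean and AnnotationIndexer are hypothetical names; none of this is the actual HST implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the HST annotation so the sketch is self-contained.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface IndexField {
    String name() default "";
}

// Plain bean with getters like the TextBean slides, minus the JCR lookups.
class SketchTextBean {
    @IndexField
    public String getTitle() { return "Solr integration"; }

    @IndexField(name = "samenvatting")
    public String getSummary() { return "Searching documents instead of nodes"; }

    // Not annotated, so it is skipped.
    public String getInternalNote() { return "draft"; }
}

final class AnnotationIndexer {

    /** Collects field name -> value for every no-arg method annotated with @IndexField. */
    static Map<String, Object> toFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method method : bean.getClass().getMethods()) {
            IndexField annotation = method.getAnnotation(IndexField.class);
            if (annotation == null || method.getParameterCount() != 0) {
                continue;
            }
            String name = annotation.name().isEmpty()
                    ? defaultName(method.getName())
                    : annotation.name();
            try {
                fields.put(name, method.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + method.getName(), e);
            }
        }
        return fields;
    }

    /** Assumed naming rule: getTitle -> title, isActive -> active. */
    static String defaultName(String getter) {
        String base = getter.startsWith("get") ? getter.substring(3)
                : getter.startsWith("is") ? getter.substring(2) : getter;
        if (base.isEmpty()) {
            return getter;
        }
        return Character.toLowerCase(base.charAt(0)) + base.substring(1);
    }
}
```

Calling AnnotationIndexer.toFields(new SketchTextBean()) yields the two fields "title" and "samenvatting", and both names would have to exist in schema.xml for Solr to accept the document.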

Objective: Index external sources


You can

1. Push them directly to Solr
2. Push them to a HST JAX-RS resource that binds to a ContentBean and commits to Solr
3. Crawl from the HST, bind to ContentBeans and commit them to Solr


A ContentBean does *not* need a JCR Node!

ContentBean interface:

    public interface ContentBean {

        // The path doubles as the unique "id" field in the index
        @IndexField(name="id")
        String getPath();

        void setPath(String path);
    }


An example : GoGreenProductBean in Testsuite

    public class GoGreenProductBean implements ContentBean {

        private String path;
        private String title;
        private String summary;
        private String description;

        public String getPath() { return path; }

        public void setPath(final String path) { this.path = path; }

        @IndexField
        public String getTitle() { return title; }

        public void setTitle(String title) { this.title = title; }

        @IndexField
        public String getSummary() { return summary; }

        public void setSummary(String summary) { this.summary = summary; }

        @IndexField
        public String getDescription() { return description; }

        public void setDescription(String description) { this.description = description; }
    }


And add the GoGreenProductBean to Solr

    {
        List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
        // FILL THE gogreenBeans LIST

        // NOW ADD TO INDEX
        HippoSolrManager solrManager = HstServices.getComponentManager().getComponent(
                HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
        try {
            solrManager.getSolrServer().addBeans(gogreenBeans);
            UpdateResponse commit = solrManager.getSolrServer().commit();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }

Objective: Control the SIZE of the index yourself


JCR / Jackrabbit / Hippo-Repository has a generic one-fits-all-index (or one-fits-none-index), which grows very large easily and can hardly be customized


However, search is domain specific

Thus, just index what is needed for the customer

Objective: Don't invent but integrate


Use Solr

Use Solrj client

Expose the Solrj SolrQuery


For example:

    HippoSolrManager solrManager = ...
    String query = ...
    HippoQuery hippoQuery = solrManager.createQuery(query);
    hippoQuery.setLimit(pageSize);
    hippoQuery.setOffset((page - 1) * pageSize);

    // hippoQuery.getSolrQuery() is the SolrQuery object
    // include scoring
    hippoQuery.getSolrQuery().setIncludeScore(true);
    hippoQuery.getSolrQuery().setHighlight(true);
    hippoQuery.getSolrQuery().setHighlightFragsize(200);
    hippoQuery.getSolrQuery().addHighlightField("title");
    hippoQuery.getSolrQuery().addHighlightField("summary");
    hippoQuery.getSolrQuery().addHighlightField("htmlContent");

    HippoQueryResult result = hippoQuery.execute(true);
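The offset expression (page - 1) * pageSize in the example assumes page numbers that start at 1. A small stand-alone sketch of that arithmetic, with a made-up Paging helper that is not part of the HST or SolrJ:

```java
// Hypothetical helper for the paging arithmetic; not an HST or SolrJ class.
final class Paging {

    /** Index of the first hit on a 1-based page: (page - 1) * pageSize. */
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    /** Number of pages needed to show totalHits results, rounding up. */
    static int pageCount(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize < 1) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize >= 1");
        }
        return (int) ((totalHits + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        System.out.println(offset(3, 10));
        System.out.println(pageCount(21, 10));
    }
}
```

With a page size of 10, page 1 starts at offset 0 and page 3 at offset 20; 21 hits need 3 pages.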



Solr integration to rescue

No further comments :-)



A very fast demo

Setup: ~75,000 long Wikipedia docs in the repository

............... doing the demo .................

That was : a very fast demo



Wrap up

I think that with the Solr integration

1. Developers will be happier
2. Customers will be happier
3. Partners will be happier
4. Hippo will be happier

And finally, last and least

5. Infra will be happier because the servers stop sweating



Questions?

Check out the example at: http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk
