Needle in an enterprise haystack
Uploaded by andrew-mleczko
Transcript of Needle in an enterprise haystack
Needle in an enterprise haystack: search engine integrations
1
so why do you need an external search engine?
3
why do you need an external search engine...
• Plone's portal_catalog is slow with big sites (large number of indexed objects)
• You want to reduce Plone memory consumption (by removing heavy indexes like SearchableText)
• You want to query Plone's content from external applications
• You want to use advanced search features
4
there are several solutions
that you can use
5
Plone external indexing and searching
• Out-of-the-box:
• collective.gsa (Google Search Appliance)
• collective.solr (Apache Solr)
• Custom integrations:
• Solr
• Tsearch2
http://www.flickr.com/photos/jenny-pics/3527749814
6
Solr?
http://www.flickr.com/photos/st3f4n/2767217547
7
a search engine based on Lucene
8
Lucene?
9
Full-text search library, 100% in Java
10
Solr: XML/HTTP and JSON interfaces, open source
11
collective.solr: Python API and Plone integration
12
Document format
Solr:
<add><doc>
  <field name="id">123</field>
  <field name="title">The Trap</field>
  <field name="author">Agatha Christie</field>
  <field name="genre">thriller</field>
</doc></add>
collective.solr:
>>> from collective.solr.solr import SolrConnection
>>> conn = SolrConnection(host='127.0.0.1', ...)
>>> book = {'title': 'The Trap',
...         'author': 'Agatha Christie',
...         'genre': 'thriller'}
>>> conn.add(**book)
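To make the relation between the two forms concrete, here is a minimal sketch that turns a flat Python dict into the `<add><doc>` message shown on the slide. It uses only the standard library; `build_add_message` is a hypothetical helper written for illustration and is not part of collective.solr, which handles this step internally.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Serialize a flat dict of field values into a Solr <add><doc> message.

    Hypothetical helper for illustration only, not the collective.solr code.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # One <field name="..."> element per dict entry.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


book = {
    "id": "123",
    "title": "The Trap",
    "author": "Agatha Christie",
    "genre": "thriller",
}
message = build_add_message(book)
print(message)
```

The printed message is the slide's XML document on a single line, without the indentation.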
13
Response format
Solr:
<response><result numFound="2" start="0">
  <doc><str name="title">Coma</str>
    <str name="author">Robin Cook</str></doc>
  <doc><str name="title">The Trap</str>
    <str name="author">Agatha Christie</str></doc>
</result></response>
collective.solr:
>>> from collective.solr.parser import SolrResponse
>>> query = {'genre': 'thriller'}
>>> response = conn.search(q=query)
>>> results = SolrResponse(response).response
>>> results.numFound
2
>>> results[0].title
'Coma'
>>> results[0].author
'Robin Cook'
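For readers who want to see what the parsing step amounts to, the following sketch reads the XML response from the slide with the standard library alone. `parse_response` is a hypothetical stand-in and not the `SolrResponse` implementation. It assumes every field is a `<str>` element, as in the sample.

```python
import xml.etree.ElementTree as ET

# The response from the slide, with straight quotes.
SAMPLE_RESPONSE = """<response><result numFound="2" start="0">
<doc><str name="title">Coma</str>
<str name="author">Robin Cook</str></doc>
<doc><str name="title">The Trap</str>
<str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_response(xml_text):
    """Return (numFound, list of dicts) for a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    # Each <doc> becomes a dict keyed by the "name" attribute of its children.
    docs = [{field.get("name"): field.text for field in doc}
            for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


num_found, docs = parse_response(SAMPLE_RESPONSE)
print(num_found, docs[0]["title"], docs[0]["author"])
```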
14
Who uses Solr/Lucene?
15
"Biblioteca Virtuale Italiana di Testi in Formato Alternativo"
16
Architecture
[Diagram: sources (Z39.50, web site, CSV) -> one retriever per source -> Books -> populators (Solr, ...); search runs against Solr]
17
Retrievers
• they normalize every source to a single common format
• a source can be anything from a CSV file to a public site
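A retriever for a CSV source might look roughly like the sketch below. The column names, the `COLUMN_MAP` mapping and the `retrieve_csv` function are all assumptions made for illustration. The slides do not show the project's actual retriever code.

```python
import csv
import io

# Hypothetical mapping from one source's column names to common field names.
COLUMN_MAP = {
    "Titolo": "title",
    "Autore": "authors",
    "Editore": "publisher",
    "ISBN": "isbn",
}


def retrieve_csv(stream, column_map=COLUMN_MAP):
    """Yield one normalized dict per CSV row, skipping empty cells."""
    for row in csv.DictReader(stream):
        yield {target: row[source].strip()
               for source, target in column_map.items()
               if row.get(source)}


sample = io.StringIO("Titolo,Autore,Editore,ISBN\n"
                     "The Trap,Agatha Christie,,123\n")
books = list(retrieve_csv(sample))
print(books)
```

A second source would only need its own mapping, or its own retriever function, while the output format stays the same.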
18
Public sites
• makes a query
• grabs HTML results
• transforms the HTML results into Python data structures using a configurable XPath parser
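The idea of a configurable XPath parser can be sketched as follows. The configuration layout, the sample page and `parse_results` are hypothetical. The sketch also uses the standard library's ElementTree, which needs well-formed markup and supports only a subset of XPath, so a real scraper would more likely rely on a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: where a result row lives
# and where each field sits inside that row.
SITE_CONFIG = {
    "row": ".//div[@class='result']",
    "fields": {
        "title": "./h2",
        "authors": "./span[@class='author']",
    },
}


def parse_results(html, config):
    """Extract one dict per result row, driven entirely by the config."""
    root = ET.fromstring(html)
    records = []
    for row in root.findall(config["row"]):
        record = {}
        for name, path in config["fields"].items():
            node = row.find(path)
            if node is not None and node.text:
                record[name] = node.text.strip()
        records.append(record)
    return records


PAGE = """<html><body>
<div class="result"><h2>Coma</h2><span class="author">Robin Cook</span></div>
<div class="result"><h2>The Trap</h2><span class="author">Agatha Christie</span></div>
</body></html>"""

records = parse_results(PAGE, SITE_CONFIG)
print(records)
```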
19
Normalize it!
Every book needs to have minimal metadata:
• Title
• Description
• Authors
• Publisher
• Format
• ISBN
• ISSN
• Date
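One possible way to enforce a common record shape is sketched below. The field names are lowercase versions of the metadata list, with the last entry taken to be the publication date, which is an assumption. `normalize` and `missing_fields` are hypothetical helpers, and the slides do not say which fields are mandatory, so the sketch only reports the gaps.

```python
COMMON_FIELDS = ("title", "description", "authors", "publisher",
                 "format", "isbn", "issn", "date")


def normalize(record):
    """Return a dict with exactly the common fields.

    Keys are matched case-insensitively, unknown keys are dropped,
    and missing fields are set to None.
    """
    lowered = {key.lower(): value for key, value in record.items()}
    return {field: lowered.get(field) for field in COMMON_FIELDS}


def missing_fields(book):
    """List the common fields that are empty in a normalized book."""
    return [field for field in COMMON_FIELDS if not book[field]]


book = normalize({"Title": "Coma", "Authors": "Robin Cook", "shelf": "B2"})
print(book)
print(missing_fields(book))
```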
20
Populators
Today:
• only one Solr populator
In the future:
• populate other sites,
• populate RDBMS
• ...
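The split between retrievers and populators could be wired together roughly as in this sketch. `ListPopulator` and `run` are hypothetical names. A Solr populator would presumably call `conn.add(**book)` in its `populate` method, while the in-memory version here only keeps the example self-contained.

```python
class ListPopulator:
    """Stand-in populator that collects books in memory (hypothetical)."""

    def __init__(self):
        self.items = []

    def populate(self, book):
        self.items.append(book)


def run(retrievers, populators):
    """Send every book from every retriever to every populator.

    Each retriever is a callable returning an iterable of normalized books.
    Returns the number of books processed.
    """
    count = 0
    for retrieve in retrievers:
        for book in retrieve():
            for populator in populators:
                populator.populate(book)
            count += 1
    return count


retrievers = [lambda: [{"title": "Coma"}], lambda: [{"title": "The Trap"}]]
populators = [ListPopulator(), ListPopulator()]
total = run(retrievers, populators)
print(total, [len(p.items) for p in populators])
```

Because both sides agree only on the normalized book format, adding a new source or a new target does not touch the existing ones.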
21
Conclusions
• multiple retrievers, multiple populators
• we have used only the collective.solr SolrConnection API
• 120,000 books indexed so far in Solr; querying and indexing are extremely fast
22
tsearch2?
23
tsearch2? A search engine fully integrated in PostgreSQL 8.3.x
24
tsearch2 main features
• Flexible and rich linguistic support (dictionaries, stop words), thesaurus
• Full UTF-8 support
• Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd)
• Rich query language with query rewriting support
• Headline support (text fragments with highlighted search terms)
• It is mature (5 years of development)
25
first steps with tsearch2
1. PostgreSQL >= 8.4 (but 8.3 will work as well)
2. COLUMN
ALTER TABLE content ADD COLUMN search_vector tsvector;
3. INDEX
CREATE INDEX search_index ON content USING gin(search_vector);
26
first steps with tsearch2
4. TRIGGER
CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
begin
  new.search_vector :=
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.subject,'')), 'A') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.title,'')), 'B') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.description,'')), 'C');
  return new;
end
$$ LANGUAGE plpgsql;
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON content
  FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
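Once the column, index and trigger are in place, a ranked query against `search_vector` could be issued from Python along these lines. This is a sketch under assumptions: `search` is a hypothetical helper, and the cursor is expected to come from a DB-API driver that uses the pyformat parameter style, such as psycopg2. The term is passed as a bound parameter and never interpolated into the SQL string.

```python
# plainto_tsquery and ts_rank are PostgreSQL functions; the table and column
# names follow the ones used in the trigger definition.
SEARCH_SQL = """
SELECT title, ts_rank(search_vector, query) AS rank
FROM content, plainto_tsquery('pg_catalog.english', %(term)s) AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT %(limit)s
"""


def search(cursor, term, limit=10):
    """Run a ranked full-text search and return (title, rank) rows."""
    cursor.execute(SEARCH_SQL, {"term": term, "limit": limit})
    return cursor.fetchall()
```

With psycopg2 this would be called as `search(connection.cursor(), 'Ferrara')`, where `connection` is an open database connection.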
27
tsearch2: how to serialize Plone content to SQL?
28
ore.contentmirror: "it focuses and supports out of the box, content deployment to a relational database"
29
How to add tsearch2 to the ore.contentmirror DDL?
>>> from ore.contentmirror.schema import content
>>> def setup_search(event, schema_item, bind):
...     bind.execute("alter table content add "
...                  "column search_vector tsvector")
...
>>> content.append_ddl_listener('after-create', setup_search)
31
Geco - community portal for Italian youth
32
Geco
• Started in 2009 for Emilia-Romagna
• Multiple content types, including video, polls, articles and more
33
Geco
• 95 editors (Emilia-Romagna)
• 100,000 documents (Emilia-Romagna)
• This year: 2 more regions join
• Future: all 20 regions join the project
• Every region has its own server deployment
34
Objectives
✓ fast and efficient search engine that can integrate multiple different Plone sites
✓ search results should be ordered by rank
✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)
35
rt.tsearch2
• integrates tsearch2 in PostgreSQL
• extend sqlalchemy query with rank sorting
>>> from sqlalchemy import desc
>>> rank = '{0,0.05,0.05,0.9}'
>>> term = 'Ferrara'
>>> query = query.order_by(desc(
...     "ts_rank('%s', Content.search_vector, to_tsquery('%s'))"
...     % (rank, term)))
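The rank literal is easier to read once its order is known. PostgreSQL's `ts_rank` takes the weights array in the order {D, C, B, A}, so `'{0,0.05,0.05,0.9}'` gives 0.9 to label A and 0.05 each to labels B and C. In the trigger shown earlier these labels are assigned to the subject, title and description respectively. The small helper below, `rank_weights`, is hypothetical and only builds that literal from named weights. Its defaults mirror PostgreSQL's documented default weights.

```python
def rank_weights(a=1.0, b=0.4, c=0.2, d=0.1):
    """Build the ts_rank weights array literal, ordered {D, C, B, A}."""
    return "{" + ",".join(format(w, "g") for w in (d, c, b, a)) + "}"


# Same values as the literal used with ts_rank on the slide.
weights = rank_weights(a=0.9, b=0.05, c=0.05, d=0)
print(weights)
```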
36
Conclusions
http://www.flickr.com/photos/vramak/3499502280
37
Conclusions
✓ Integrating an external search engine in Plone is easy!
✓ You can find a solution that suits your needs!
38
Questions
Andrew Mleczko
RedTurtle Technology
[email protected]
39
Thank you.
40