Needle in an enterprise haystack


Transcript of Needle in an enterprise haystack

Page 1: Needle in an enterprise haystack

Needle in an enterprise haystack: search engine integrations

1

Page 2: Needle in an enterprise haystack

Who am I?

Andrew Mleczko, Plone Integrator, RedTurtle Technology (Ferrara, Italy)

2

Page 3: Needle in an enterprise haystack

so why do you need an external search engine?

3

Page 4: Needle in an enterprise haystack

why do you need an external search engine...

• Plone's portal_catalog is slow with big sites (large number of indexed objects)

• You want to reduce Plone memory consumption (by removing heavy indexes like SearchableText)

• You want to query Plone's content from external applications

• You want to use advanced search features

4

Page 5: Needle in an enterprise haystack

there are several solutions that you can use

5

Page 6: Needle in an enterprise haystack

Plone external indexing and searching

• Out-of-the-box:

• collective.gsa (Google Search Appliance)

• collective.solr (Apache Solr)

• Custom integrations:

• Solr

• Tsearch2

http://www.flickr.com/photos/jenny-pics/3527749814

6

Page 7: Needle in an enterprise haystack

Solr?

http://www.flickr.com/photos/st3f4n/2767217547

7

Page 8: Needle in an enterprise haystack

a search engine based on Lucene

8

Page 9: Needle in an enterprise haystack

Lucene?


9

Page 10: Needle in an enterprise haystack

Full-text search library, 100% in Java

10

Page 11: Needle in an enterprise haystack

Solr: XML/HTTP and JSON interfaces, open source
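Because Solr is reached over plain HTTP, any external application can query it without going through Plone. A minimal sketch of building such a request URL, assuming a Solr instance exposing the standard `select` handler; the base URL and the helper name `solr_select_url` are illustrative assumptions, not part of the talk:

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10, wt="json"):
    """Build a URL for Solr's select handler.

    base_url is an assumption (for example a local test instance);
    wt selects the response writer (XML or JSON).
    """
    params = urlencode({"q": query, "rows": rows, "wt": wt})
    return f"{base_url.rstrip('/')}/select?{params}"


# Hypothetical local instance; adjust host, port and path to your setup.
url = solr_select_url("http://localhost:8983/solr", "genre:thriller")
```

The resulting URL can be fetched with any HTTP client; only the URL construction is shown here so the sketch stays independent of a running server.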

11

Page 12: Needle in an enterprise haystack

collective.solr: Python API and Plone integration

12

Page 15: Needle in an enterprise haystack

Document format

solr:

<add><doc>
  <field name="id">123</field>
  <field name="title">The Trap</field>
  <field name="author">Agatha Christie</field>
  <field name="genre">thriller</field>
</doc></add>

collective.solr:

>>> conn = SolrConnection(host='127.0.0.1', ...)
>>> book = {'title': 'The Trap',
...         'author': 'Agatha Christie',
...         'genre': 'thriller'}
>>> conn.add(**book)
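To make the relation between the two forms concrete, here is a minimal sketch of how a set of keyword arguments could be turned into Solr's XML add message. It uses only the standard library and is not collective.solr's actual implementation; the helper name `to_solr_add_xml` is an assumption for illustration:

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(**fields):
    """Serialize keyword arguments as a Solr <add><doc> message.

    List or tuple values become repeated <field> elements, which is how
    Solr represents multi-valued fields.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


message = to_solr_add_xml(id=123, title="The Trap")
```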

13

Page 17: Needle in an enterprise haystack

Response format

solr:

<response><result numFound="2" start="0">
  <doc><str name="title">Coma</str>
    <str name="author">Robin Cook</str></doc>
  <doc><str name="title">The Trap</str>
    <str name="author">Agatha Christie</str></doc>
</result></response>

14

Page 18: Needle in an enterprise haystack

Response format

collective.solr:

>>> query = {'genre': 'thriller'}
>>> response = conn.search(q=query)
>>> results = SolrResponse(response).response
>>> results.numFound
2
>>> results[0].title
'Coma'
>>> results[0].author
'Robin Cook'
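For readers without collective.solr at hand, the XML response shown on the previous slide can also be read with the standard library. This is a rough sketch, assuming a response that contains only `<str>` fields as in the example; `parse_solr_response` is a hypothetical helper, not part of collective.solr:

```python
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """<response><result numFound="2" start="0">
  <doc><str name="title">Coma</str>
    <str name="author">Robin Cook</str></doc>
  <doc><str name="title">The Trap</str>
    <str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_solr_response(xml_text):
    """Return (numFound, list of dicts) from a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


num_found, docs = parse_solr_response(SAMPLE_RESPONSE)
```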

14

Page 19: Needle in an enterprise haystack

Who uses Solr/Lucene?

15

Page 21: Needle in an enterprise haystack

"Biblioteca Virtuale Italiana di Testi in Formato Alternativo"

16

Page 22: Needle in an enterprise haystack

Architecture

[Diagram: sources (Z39.50, web site, Books CSV) -> one retriever per source -> populators (solr, ...) -> solr -> search]

17

Page 23: Needle in an enterprise haystack

Retrievers

• they normalize every source into a single common format

• a source can be anything from a CSV file to a public web site

18

Page 24: Needle in an enterprise haystack

Public sites

• makes a query

• grabs HTML results

• transforms the HTML results into Python data using a configurable XPath parser
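A minimal sketch of such a configurable XPath parser, under loud assumptions: the markup is well-formed XHTML so the standard library's limited XPath support is enough (a real retriever would more likely use an HTML-tolerant parser), and the configuration layout, class names and sample markup are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: one XPath for a result row,
# plus one XPath per field, relative to that row.
SITE_CONFIG = {
    "row": ".//div[@class='result']",
    "fields": {
        "title": "./h3",
        "authors": "./span[@class='author']",
        "isbn": "./span[@class='isbn']",
    },
}

# Invented sample markup standing in for a site's result page.
SAMPLE_PAGE = """<html><body>
<div class="result"><h3>The Trap</h3><span class="author">Agatha Christie</span><span class="isbn">123</span></div>
<div class="result"><h3>Coma</h3><span class="author">Robin Cook</span></div>
</body></html>"""


def parse_results(markup, config):
    """Turn a result page into a list of dicts, one per result row."""
    root = ET.fromstring(markup)
    books = []
    for row in root.findall(config["row"]):
        book = {}
        for field, path in config["fields"].items():
            node = row.find(path)
            # Missing elements become None instead of raising.
            book[field] = node.text.strip() if node is not None and node.text else None
        books.append(book)
    return books


books = parse_results(SAMPLE_PAGE, SITE_CONFIG)
```

Supporting a new site then means adding a new configuration dictionary rather than new parsing code.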

19

Page 25: Needle in an enterprise haystack

Normalize it!

every Book needs to have minimal metadata:

• Title

• Description

• Authors

• Publisher

• Format

• ISBN

• ISSN

• Data
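The minimal metadata could be modelled as a small record type that every retriever has to produce. This is only a sketch: the field names follow the slide, while the types, defaults and the `normalize` helper are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    """Normalized book record; only the title is treated as mandatory here."""
    title: str
    description: str = ""
    authors: List[str] = field(default_factory=list)
    publisher: str = ""
    format: str = ""
    isbn: Optional[str] = None
    issn: Optional[str] = None
    data: str = ""


def normalize(raw):
    """Build a Book from a raw dict produced by a retriever."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: multiple authors in one string are separated by ';'.
        authors = [name.strip() for name in authors.split(";") if name.strip()]
    optional = ("description", "publisher", "format", "isbn", "issn", "data")
    known = {key: raw[key] for key in optional if raw.get(key)}
    return Book(title=raw["title"].strip(), authors=authors, **known)


book = normalize({"title": " The Trap ", "authors": "Agatha Christie"})
```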

20

Page 26: Needle in an enterprise haystack

Populators

Today:

• only one populator, for Solr

In the future:

• populate other sites

• populate an RDBMS

• ...
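The retriever/populator split can be sketched as two small interfaces and a loop that feeds every retrieved record to every populator. All class and function names below are hypothetical; the in-memory populator stands in for the Solr one used in the project:

```python
from typing import Dict, Iterable, List, Protocol

Record = Dict[str, object]


class Retriever(Protocol):
    def retrieve(self) -> Iterable[Record]: ...


class Populator(Protocol):
    def populate(self, records: Iterable[Record]) -> int: ...


class ListRetriever:
    """Toy retriever that returns records it was given up front."""

    def __init__(self, records: List[Record]):
        self._records = list(records)

    def retrieve(self) -> Iterable[Record]:
        return iter(self._records)


class InMemoryPopulator:
    """Toy populator; a real one would send the records to Solr or an RDBMS."""

    def __init__(self):
        self.stored: List[Record] = []

    def populate(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.stored.append(record)
            count += 1
        return count


def run_pipeline(retrievers, populators) -> int:
    """Collect records from all retrievers and hand them to each populator."""
    records = [rec for retriever in retrievers for rec in retriever.retrieve()]
    return sum(populator.populate(records) for populator in populators)


target = InMemoryPopulator()
total = run_pipeline(
    [ListRetriever([{"title": "Coma"}]), ListRetriever([{"title": "The Trap"}])],
    [target],
)
```

Adding a new target then only requires a new class with a `populate` method.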

21

Page 27: Needle in an enterprise haystack

Conclusions

• multiple retrievers, multiple populators

• we used only the SolrConnection API from collective.solr

• 120,000 books indexed in Solr so far; querying and indexing are extremely fast

22

Page 28: Needle in an enterprise haystack

tsearch2 ?


23

Page 29: Needle in an enterprise haystack

tsearch2: a search engine fully integrated in PostgreSQL 8.3.x

24

Page 30: Needle in an enterprise haystack

tsearch2 main features

• Flexible and rich linguistic support (dictionaries, stop words), thesaurus

• Full UTF-8 support

• Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd)

• Rich query language with query rewriting support

• Headline support (text fragments with highlighted search terms)

• It is mature (5 years of development)

25

Page 31: Needle in an enterprise haystack

first steps with tsearch2

1. PostgreSQL >= 8.4 (but 8.3 will work as well)

2. COLUMN

ALTER TABLE content ADD COLUMN search_vector tsvector;

3. INDEX

CREATE INDEX search_index ON content USING gin(search_vector);

26

Page 32: Needle in an enterprise haystack

first steps with tsearch2

4. TRIGGER

CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
begin
  new.search_vector :=
    setweight(to_tsvector('pg_catalog.english', coalesce(new.subject,'')), 'A') ||
    setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'B') ||
    setweight(to_tsvector('pg_catalog.english', coalesce(new.description,'')), 'C');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON content FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
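Once the column, index and trigger are in place, the `content` table can be searched with a ranked query. A minimal sketch of such a query built from Python: the `build_search_query` helper is hypothetical, and the `%(term)s` placeholder assumes a DB-API driver using the pyformat parameter style (psycopg2, for instance):

```python
def build_search_query(limit=10):
    """Return parameterized SQL for a ranked full-text search on content.

    The search term is passed separately as the 'term' parameter so the
    driver can quote it; only the integer limit is interpolated.
    """
    return (
        "SELECT title, "
        "ts_headline('pg_catalog.english', description, query) AS snippet, "
        "ts_rank(search_vector, query) AS rank "
        "FROM content, plainto_tsquery('pg_catalog.english', %(term)s) AS query "
        "WHERE search_vector @@ query "
        "ORDER BY rank DESC "
        f"LIMIT {int(limit)}"
    )


sql = build_search_query(limit=5)
# With an open DB-API cursor (assumption):
# cursor.execute(sql, {"term": "Ferrara"})
```

`ts_headline` produces the highlighted text fragments mentioned among the main features, and `@@` is the match operator that can use the GIN index.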

27

Page 33: Needle in an enterprise haystack

tsearch2: how to serialize Plone content to SQL?

28

Page 34: Needle in an enterprise haystack

ore.contentmirror: "it focuses and supports out of the box, content deployment to a relational database"

29

Page 36: Needle in an enterprise haystack

How to add tsearch2 to the ore.contentmirror DDL?

>>> from ore.contentmirror.schema import content
>>> def setup_search(event, schema_item, bind):
...     bind.execute("alter table content add "
...                  "column search_vector tsvector")
...
>>> content.append_ddl_listener('after-create', setup_search)

31

Page 37: Needle in an enterprise haystack

Geco - community portal for Italian youth

32

Page 38: Needle in an enterprise haystack

Geco

• Started in 2009 for Emilia-Romagna

• Multiple content types, including video, polls, articles and more

33

Page 39: Needle in an enterprise haystack

Geco

• 95 editors (Emilia-Romagna)

• 100,000 documents (Emilia-Romagna)

• This year: 2 more regions join

• Future: all 20 regions join the project

• Every region has its own server deployment

34

Page 40: Needle in an enterprise haystack

Objectives

✓ fast and efficient search engine that can integrate multiple different Plone sites

✓ search results should be ordered by rank

✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)

35

Page 42: Needle in an enterprise haystack

rt.tsearch2

• integrates tsearch2 in PostgreSQL

• extends SQLAlchemy queries with rank sorting

>>> rank = '{0,0.05,0.05,0.9}'
>>> term = 'Ferrara'
>>> query = query.order_by(desc(
...     "ts_rank('%s', Content.search_vector, "
...     "to_tsquery('%s'))" % (rank, term)))
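The weights array passed to `ts_rank` lists the weights in the order {D, C, B, A}, and PostgreSQL's defaults are {0.1, 0.2, 0.4, 1.0}. With the trigger shown earlier (subject = A, title = B, description = C), the value above therefore gives 0.9 to the subject and 0.05 each to title and description. A small sketch that builds the array literal from named weights; the helper name and the range check are assumptions for illustration:

```python
def rank_weights(a=1.0, b=0.4, c=0.2, d=0.1):
    """Format ts_rank weights as a PostgreSQL array literal.

    Arguments are given per label (A is the most important), but the
    literal must list them in the order D, C, B, A.
    """
    for weight in (a, b, c, d):
        # Assumption: keep weights in [0, 1] to stay within accepted values.
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weights must be between 0 and 1")
    return "{%g,%g,%g,%g}" % (d, c, b, a)


rank = rank_weights(a=0.9, b=0.05, c=0.05, d=0)
```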

36

Page 43: Needle in an enterprise haystack

Conclusions

http://www.flickr.com/photos/vramak/3499502280

37

Page 44: Needle in an enterprise haystack

Conclusions

✓ Integrating an external search engine in Plone is easy!

✓ You can find a solution that suits your needs!

38

Page 45: Needle in an enterprise haystack

Questions? Andrew Mleczko, RedTurtle Technology

39

Page 46: Needle in an enterprise haystack

Thank you.

40