Best practices in museum search


Slide from my workshop at MCN2011


Page 1: Best practices in museum search

"Best Practices" in Museum Search

.. in my (researched) opinion

Nate Solas
#mcn2011 #search
@homebrewer
walkerart.org
http://bit.ly/mcn2011search

Page 2: Best practices in museum search

Search is hard...

... shouldn't we just leave this to Google?

Page 3: Best practices in museum search

"Leave it to Google" IS a best practice!

For them, it's a solved problem. They have absolutely solved searching for content on websites, especially a finite domain like a museum website.

http://www.powerhousemuseum.com/search/index.php?cx=018242116655519399236%3A4srvv8yns7w&q=blue&sa=&cof=FORID%3A11&siteurl=www.powerhousemuseum.com%2Fvisit%2F

http://www.tate.org.uk/search/default.jsp?q=blue
http://www.brooklynmuseum.org/
http://www.amnh.org/
http://si.edu/ (GSA)

Page 4: Best practices in museum search

<title>We can do more to help</title>

<article>

Mark the content! Google indexes ALL the words, so all of our nav, advertising, footer... If we don't indicate what's the "content", it's all fair game (sort of. They're actually smarter than that.)

<sidebar>Meta tags (OG), RDFa, valid HTML5 markup, etc.</sidebar>

</article>
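
To make the "mark the content" point concrete, here is a minimal sketch of an indexer that keeps only the text inside <article> and ignores nav and footer. It is an illustration of the idea, not a description of how Google (or any real crawler) works; the sample page and the class name are made up.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect only the text that sits inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <article> tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def article_text(html):
    """Return the article text of a page as one space-joined string."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page: nav and footer are chrome, the article is the content.
page = (
    "<html><body><nav>Visit Shop Join</nav>"
    "<article><h1>Blue Period</h1><p>An essay on blue.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
print(article_text(page))  # Blue Period An essay on blue.
```

Without the <article> marker the indexer would have no cheap way to tell "Visit Shop Join" from the essay.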

Page 5: Best practices in museum search

Internal search: yes

• We (should) know the most about our content, so we know:
  o how to suggest things
  o how to interpret queries in context (run the search)
  o how to present things to make sense

It's no longer just a 'web page'!

• We (should) have the content as discrete pieces of metadata: title, date, body, author, etc.
  o We can therefore index just the content, none of the other chrome on the page.
  o Facets: we can use this metadata to drill down.

Page 6: Best practices in museum search

Phases of search:

... let's just look at three parts:

the query, results, & dead ends

Page 7: Best practices in museum search

Search box, top right. Done. (Powerhouse Museum has it bottom left, but they're in Australia so this makes sense. ;)

• If there's text in the box ("search"), clear it when they click in!

• Autocomplete / suggest isn't really common (yet), but seems very useful where it shows up.
  o Three strategies I see:
    – Suggest page ("live search taxonomy") (http://www.imamuseum.org/)
    – Suggest tag/title (http://www.vam.ac.uk/)
    – Suggest phrase from full corpus (http://beta.walkerart.org/ (beta))

The Query

Page 8: Best practices in museum search


Full text autocomplete is sort of the holy grail, IMO, but we can't be as smart as Google.

IMA does "live search" (auto-suggest) instead of autocomplete, very useful but it doesn't help me spell Lichtenstein.

The real point is to eliminate dead ends.

Suggest / Autocomplete

Page 9: Best practices in museum search

Results

Questions your result page should answer immediately:

1. What are these things?
2. Why did they match (and why in that order)?
3. Was I understood / can I try again easily?

Finally:
• What's next?
  o try some results or
  o narrow (refine) search or
  o broaden search

Page 10: Best practices in museum search

WHAT are these things?

Mixed results ("All")

MOMA gets it:
• http://www.moma.org/search?query=blue
  o Full breadcrumb, excerpt, title, media if they have it.

This is confusing at first:
• http://www.metmuseum.org/search-results?ft=blue

Separate results

V&A splits into sections
• http://www.vam.ac.uk/contentapi/search/?q=blue&search-submit=Go
  o .. but some of the "articles" aren't articles.

MFA sections and staggers
• http://www.mfa.org/search/mfa/blue

Careful. This sort of assumes people know what they're looking for.

Page 11: Best practices in museum search

...um

Page 12: Best practices in museum search

Why did they match (in this order)?

• Highlight the match, if possible

• Sort by relevance
  o (But see section on "boosting"...)

• If you're splitting up content, it's hard to explain.
  o ...best result could be at the bottom of the page
  o ... so ... don't. Let user do this.

Page 13: Best practices in museum search

Was I understood / Can I try again?

MFA site: http://www.mfa.org/search/mfa/blue
• Without the URL hint, can you even tell what was searched for?
  o And what if you want to add a single word? (WAC site is guilty of this. Blame the designer. ;-)

A few "not like this" examples:

• "blue phase"
  o http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go
  o http://www.imamuseum.org/search/ima/%22blue%20phase%22

(People are going to use quotes!)

Page 14: Best practices in museum search

Was I really understood?

We know what you want: "Hours"
• http://www.britishmuseum.org/search_results.aspx?searchText=hours
• http://www.moma.org/search?query=hours&page=1
• http://beta.walkerart.org/search/?q=hours

"We have a special “live search” taxonomy for explicitly boosting content pages we know people are searching for. E.g. “jobs” on our employment page; “love” is our Love sculpture, not the hundreds of other works, “wedding” is for facility rentals, not our hundred wedding dresses in the collection."

-- Charlie Moad, IMA
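
A "live search" taxonomy like the one described in the quote can be as simple as a curated lookup that runs before the full-text engine. The sketch below is one assumed way to do it, not IMA's implementation; the terms come from the quote, but the target paths and function names are invented for illustration.

```python
# Hypothetical curated mapping: query term -> the page staff want on top.
LIVE_SEARCH = {
    "jobs": "/about/employment",
    "love": "/collection/love-sculpture",
    "wedding": "/rentals",
}


def curated_hit(query):
    """Return the hand-picked page for a query, or None if nothing is curated."""
    return LIVE_SEARCH.get(query.strip().lower())


def search(query, fulltext_results):
    """Put the curated page first, followed by the ordinary full-text results."""
    hit = curated_hit(query)
    rest = [r for r in fulltext_results if r != hit]
    return ([hit] if hit else []) + rest


print(search("Love", ["/art/1", "/collection/love-sculpture"]))
```

The curated page is promoted and de-duplicated rather than replacing the normal results, so a wrong guess about intent still leaves the user somewhere to go.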

Do me a favor:
• http://beta.walkerart.org/search/?q=articel
• http://www.vam.ac.uk/contentapi/search/?suggest=article&q=articel
  o (again, a bit confusing but right)

Page 15: Best practices in museum search

Narrow results with facets

Awesome:
• si.edu collections
  o http://collections.si.edu/search/results.jsp?q=blue

Good:
• IMA
  o http://www.imamuseum.org/search/ima/blue
• WAC (I'm biased)
  o http://beta.walkerart.org/magazine/type/articles/genre/film

Less awesome:
• British Museum
  o http://www.britishmuseum.org/search_results.aspx?searchText=blue&searchPrevious=blue&itemsPerPage=10

Page 16: Best practices in museum search

Broaden results

• Similar searches / More Like This
  o http://beta.walkerart.org/search/?q=absent+landlord
  o http://www.powerhousemuseum.com/collection/database/search_tags.php?tag=blue
  o http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go
    Sort of weird, though.

• More Like This
  o We're trying it on detail pages:
    http://beta.walkerart.org/calendar/2011/merce-cunningham-dance-company

Page 17: Best practices in museum search

Dead ends / spell check

"Did you mean?"
• http://beta.walkerart.org/search/?q=absent+landlord
• http://www.vam.ac.uk/contentapi/search/?q=blu&search-submit=Go

This is really just spellcheck. But it's apparently really hard, since nobody's doing it.

Page 18: Best practices in museum search

Final thoughts

Can we just spider our own pages like Google?
• Sure. Lots of tools to do this, and it looks like that's how MOMA does it.
  o However... http://www.moma.org/search?query=%22ad+reinhardt%22+%22sum+of+days%22&page=1
  o http://www.moma.org/search?query=blu&page=1 (look at the mp4!)

Boosting
• what kind of boosting makes sense?
  o weight towards recent content
    – push down past events, maybe
  o "we know what you want"
    – look at logs to see what people are searching for
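
Looking at the logs can start very small. A rough sketch, assuming a log with one raw query per line (the real log format will differ, and the sample data here is invented):

```python
from collections import Counter


def top_queries(log_lines, n=3):
    """Count normalised queries, one query per log line, most frequent first."""
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return counts.most_common(n)


# Invented sample; a real log would be read from a file.
log = ["hours", "Jobs", "hours ", "chuck close", "jobs", "hours"]
print(top_queries(log))  # [('hours', 3), ('jobs', 2), ('chuck close', 1)]
```

The top of that list is the candidate set for "we know what you want" boosts.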

Page 19: Best practices in museum search

So... "best practices"

• Unified search across all content
  o full-text search with stemming, phrases, etc.
• Coherent, user-centric divisions of content for faceting
• Prevent dead ends
  o show #s for facets
  o autocomplete query
• Help the user
  o "Did you mean?"
    – Or just give it to them, don't ask

Page 20: Best practices in museum search

Let's build that!

"Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable."
-- http://en.wikipedia.org/wiki/Solr

Page 21: Best practices in museum search

There's a tool for you...

http://wiki.apache.org/solr/IntegratingSolr

• ColdFusion - ColdFusion 9 now includes Apache Solr
• Django - Haystack
• Drupal - A Drupal module that integrates Apache Solr in Drupal.
• eZ Find - eZ Find, a solid solr integration to the open source CMS eZ Publish
• Forrest/Cocoon - SolrForrest
• Foswiki - A Foswiki plugin that integrates Apache Solr in Foswiki.
• Plone - collective.solr
• SVN - reposearch
• TYPO3
• Various Library Catalog Applications - Solr4Lib
• Woltlab Community Framework - A WCF package working with the burning board, the blog and all other WCF components.
• WordPress - solr-for-wordpress A WordPress plugin that replaces the default WordPress search with Solr.
• ZooKeeperIntegration
• OpenCms - opencms-solr

Page 22: Best practices in museum search

Hurry, hurry!

1. introducing Solr
2. build fulltext search & introduce dismax
3. facets
4. build autocomplete
5. did you mean?

Page 23: Best practices in museum search

Installation, fast test

user:~solr$ ls
solr-nightly.zip
user:~solr$ unzip -q solr-nightly.zip
user:~solr$ cd solr-nightly/example/
user:~/solr/example$ java -jar start.jar

That's it! You can actually do local development against that sort of setup and it works fine.

Page 24: Best practices in museum search

Installation, f'realz (Ubuntu)

apt-get install build-essential jetty \
    libjetty-extra openjdk-6-jdk
cp dist/apache-solr-3.4.0.war \
    /usr/share/jetty/webapps/solr.war
cp -r example/solr /usr/share/jetty/

edit /usr/share/jetty/solr/conf/schema.xml and solrconfig.xml

edit /etc/default/jetty: turn off no-start, make it bind to all ips, and set the java opts:
JAVA_OPTIONS="-Dsolr.solr.home=/usr/share/jetty/solr -Dsolr.data.dir=/usr/share/jetty/solr/data $JAVA_OPTIONS"

/etc/init.d/jetty start

Page 25: Best practices in museum search

For today:

http://172.16.0.67/

Page 26: Best practices in museum search

Explore the fieldtypes: core0

Get the sample text on your clipboard.
In core0, click Admin, then Analysis

Field Names:
id (string)
text_ws
text_general
text_en
phonetic
text_general_rev
alphaonlysort

Page 27: Best practices in museum search

core1: fulltext search engine

Click search on core1. Try it out. (dataset is Walker Art Center events)

Click "edit" on core1. Discuss.

Page 28: Best practices in museum search

core1: dismax query parser

DisMax is an abbreviation of Disjunction Max, and is a popular query mode with Solr.

Disjunction refers to the fact that your search is executed across multiple fields, e.g. title, body and keywords, with different relevance weights.

Max means that if your word "foo" matches both title and body, the max score of these two (probably title match) is added to the score, not the sum of the two as a simple OR query would do. This gives more control over your ranking.
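
The "max, not sum" idea is easier to see with numbers. This is a toy model of the scoring rule, not Solr's code: the per-field scores are invented, and `tie` stands in for dismax's tie parameter, which lets the non-winning fields contribute a fraction of their score.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score, plus `tie` times the scores of the other matching fields."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


def sum_score(field_scores):
    """Roughly what a plain OR across the fields does: add everything up."""
    return sum(field_scores.values())


# Invented per-field scores for one document matching "foo".
scores = {"title": 2.0, "body": 0.5}
print(dismax_score(scores))            # title wins outright
print(dismax_score(scores, tie=0.1))   # body adds a small tie-breaker
print(sum_score(scores))               # simple OR: both fields pile up
```

With tie at 0 a document that mentions the word in five weak fields cannot outrank one with a strong title match; raising tie moves the behaviour back towards a sum.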

Page 29: Best practices in museum search

core1: dismax in practice

The DisMaxQParserPlugin is designed to process simple user entered phrases (without heavy syntax) and search for the individual words across several fields using different weighting (boosts) based on the significance of each field.

In English: it does a really good job helping you figure out what the user meant to look for.
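
In request terms, dismax mostly comes down to a handful of parameters. A hedged sketch of building such a request: defType, qf and pf are standard dismax parameters, but the base path, the field names and the boost values are assumptions for illustration and are not taken from the workshop's core1 config.

```python
from urllib.parse import urlencode


def dismax_url(user_query, base="/solr/core1/select"):
    """Build a select URL that searches several weighted fields at once."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # Hypothetical fields: a title match counts 5x, keywords 2x, body 1x.
        "qf": "title^5 keywords^2 body",
        # Extra boost when the whole phrase appears together in the title.
        "pf": "title^10",
    }
    return base + "?" + urlencode(params)


print(dismax_url("chuck close"))
```

The user's text goes into q untouched; all the weighting lives in qf and pf, which is why it copes well with plain typed phrases.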

Page 30: Best practices in museum search

Try some quotes

chuck close

vs.

chuck "close"

Debug: what's going on?

Page 31: Best practices in museum search

core2: facets

99% chance your Solr library will abstract this for you, but it's good to know what's under the hood.

... we won't do it today, but you can facet by queries, not just field names.

So you can do things like this in one call:
• Give me all events matching the query
• Show how many by type (like we're doing)
• Show how many are happening today
• Show how many are happening "this weekend"
• ... etc.
• http://beta.walkerart.org/calendar/type/free-events
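
Under the hood, a request like that is one select call with several facet parameters. A sketch under assumptions: facet, facet.field, facet.query and facet.mincount are standard Solr parameters, while the base path and the field names `type` and `date` are placeholders, and a three-day window stands in for "this weekend".

```python
from urllib.parse import urlencode


def facet_url(user_query, base="/solr/core2/select"):
    """One request: matching events plus counts by type and by date window."""
    params = [
        ("q", user_query),
        ("facet", "true"),
        ("facet.mincount", "1"),   # hide empty buckets to avoid dead ends
        ("facet.field", "type"),   # hypothetical field name
        # facet.query may be repeated; each one returns a single count.
        ("facet.query", "date:[NOW/DAY TO NOW/DAY+1DAY]"),   # today
        ("facet.query", "date:[NOW/DAY TO NOW/DAY+3DAYS]"),  # next few days
    ]
    return base + "?" + urlencode(params)


print(facet_url("dance"))
```

A list of pairs is used instead of a dict because facet.query has to appear more than once.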

Page 32: Best practices in museum search

core3: Autocomplete (a)

Read this later:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

This is a very popular and decent solution. It really only works the way he suggests, though, by seeding with popular queries (since it starts at character 0). If you have this data, go for it, but our top queries actually aren't very interesting: "jobs", "staff", "hours", etc.

We want something that can complete any phrase that occurs in our corpus (a), ideally in the middle of the phrase (b).

Page 33: Best practices in museum search

Key technologies

ShingleFilterFactory
Make tokens out of phrases.

TermsComponent
"return terms and document frequency of those terms"

Post-processing for stopwords
Index them in phrases, but remove from suggestions in certain scenarios
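
The slide doesn't say which scenarios call for removal, so the sketch below assumes one plausible rule: drop a suggestion when it starts or ends on a stopword, but keep stopwords in the middle of a phrase. The stopword list and function name are illustrative only.

```python
STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list


def clean_suggestions(phrases):
    """Keep phrases that neither start nor end on a stopword."""
    kept = []
    for phrase in phrases:
        words = phrase.split()
        if words and words[0] not in STOPWORDS and words[-1] not in STOPWORDS:
            kept.append(phrase)
    return kept


print(clean_suggestions(["museum of", "museum of modern art", "the museum", "of"]))
# ['museum of modern art']
```

Because the stopwords stay in the index, "museum of modern art" can still be completed as a whole phrase; only the dangling fragments are hidden.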

Page 34: Best practices in museum search

ShingleFilterFactory

    <fieldType name="shingle_text" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <charFilter class="solr.HTMLStripCharFilterFactory" />
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />-->
            <filter class="solr.ShingleFilterFactory" maxShingleSize="5" />
        </analyzer>
        <analyzer type="query">
            <charFilter class="solr.HTMLStripCharFilterFactory" />
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
    </fieldType>
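
To see what the shingle filter contributes, here is a plain-Python approximation of its output: every run of adjacent tokens up to a maximum length, plus the single tokens. It mimics the idea, not Lucene's implementation, and the example uses a maximum of 3 rather than the 5 in the config just to keep the output short.

```python
def shingles(tokens, max_size=5, min_size=2, unigrams=True):
    """Emit every run of min_size..max_size adjacent tokens, plus single tokens."""
    out = []
    for start in range(len(tokens)):
        if unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break
            out.append(" ".join(tokens[start:start + size]))
    return out


for term in shingles("merce cunningham dance company".split(), max_size=3):
    print(term)
```

Each printed line becomes one indexed term, which is what lets a prefix lookup on "cunningham d" return "cunningham dance company".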

Page 35: Best practices in museum search

TermsComponent

<!-- in solrconfig.xml -->
        <arr name="last-components">
            <str>terms</str>
        </arr>

# strict "starts with"
/select?terms=true&terms.fl=auto_text&terms.prefix=term

OR

# attempt at "infix" (sloooow on big corpus)
/select?terms=true&terms.fl=auto_text&terms.regex=(^|.* +)term.*
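
On the client side, the two variants differ only in which parameter carries the typed text. A small sketch of building either request: terms, terms.fl, terms.prefix, terms.regex and terms.limit are TermsComponent parameters, while the function name and the limit of 10 are assumptions. It has only been thought through for simple single-word input.

```python
import re
from urllib.parse import urlencode


def suggest_url(typed, infix=False, field="auto_text", base="/select"):
    """Build a TermsComponent request for what the user has typed so far."""
    typed = typed.strip().lower()  # the index side is lowercased too
    params = {"terms": "true", "terms.fl": field, "terms.limit": "10"}
    if infix:
        # Match at the start of the term or after any space inside it.
        params["terms.regex"] = "(^|.* +)" + re.escape(typed) + ".*"
    else:
        params["terms.prefix"] = typed
    return base + "?" + urlencode(params)


print(suggest_url("Lich"))
print(suggest_url("Lich", infix=True))
```

urlencode takes care of escaping the regex characters, which are easy to get wrong when the URL is assembled by hand.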

Page 36: Best practices in museum search

core4: Autocomplete (b)

Infix. Big challenges, decent hacks.

Smaller shingles.

Fewer words (only title & subtitle).

Still... kinda slow in our beta site. Probably have to move to prefix. :(

Page 37: Best practices in museum search

core5: spellcheck

Similar to the setup for autocomplete

Just remember to call a url with spellcheck.build=true to get things started.

For better results, use spellcheck.q and escape spaces. This makes it a phrase instead of spellchecking individual words and correcting them to dead ends.

select?q=chuc+closee&spellcheck.q=chuc\+closee
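
Building that request in code avoids hand-escaping mistakes. A sketch, assuming the spellcheck component is already wired into the /select handler; spellcheck, spellcheck.q and spellcheck.collate are standard parameters, and the rest is illustrative. The URL encoder writes the backslash as %5C, which is equivalent to the literal form shown above.

```python
from urllib.parse import urlencode


def spellcheck_url(user_query, base="/select"):
    """Ask for results and a phrase-level spelling suggestion in one request."""
    params = {
        "q": user_query,
        "spellcheck": "true",
        # Backslash-escape spaces so the checker sees one phrase, not two words.
        "spellcheck.q": user_query.replace(" ", "\\ "),
        "spellcheck.collate": "true",
    }
    return base + "?" + urlencode(params)


print(spellcheck_url("chuc closee"))
```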

Page 38: Best practices in museum search

Search is hard...

Our content team (and I know the MET too with their new site) constantly struggle to understand why certain results come up over others. They always ask us to make tweaks which inevitably hurt other results. It’s a constant battle for perfection and I have to do a lot of educating.

· Retail results come up over artworks because they actually write good descriptions! We even set our boost on retail to 0.5.
· Why does “after van Gogh” show up before the real “van Gogh”?
· Why does last year’s event show up before this year’s?

While there are answers to all these, it’s inevitably a slippery slope. My final answer is to usually use the live search taxonomy. It is in place to tell the search engine what users are looking for specific to your institution. People just need to understand that it is a content task just as much as creating a page.

-- Charlie Moad, IMA

Page 39: Best practices in museum search

If we're bored

ASCII / UTF8
http://beta.walkerart.org/search/?q=jerome+bel
http://beta.walkerart.org/search/?q=J%C3%A9r%C3%B4me+Bel
<!-- remove diacritics BEFORE stemming to match cases without diacritics -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
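
The effect of folding can be approximated in a few lines of Python. This is a simplified stand-in, not the filter itself: it only strips combining accents after Unicode decomposition, whereas ASCIIFoldingFilterFactory also handles characters that do not decompose this way.

```python
import unicodedata


def fold(text):
    """Lowercase and drop combining accents so 'Jérôme' matches 'jerome'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(fold("Jérôme Bel"))  # jerome bel
```

Applying the same folding at index time and at query time is what makes both URLs above return the same results.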

boost in general, elevate.xml

bq=(instances:{20110927 TO *})^1000 OR (display_type:Walker\ Shop)^20 OR (display_type:Events)^1
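
The date in that boost query is hard-coded. Assuming it is meant to track the current day (the slide doesn't say), the string could be generated per request; the field names and boost values are copied from the line above, and the function name is made up.

```python
from datetime import date


def boost_query(today):
    """Boost upcoming instances heavily, shop items a little, events least."""
    stamp = today.strftime("%Y%m%d")
    return (
        "(instances:{%s TO *})^1000"
        " OR (display_type:Walker\\ Shop)^20"
        " OR (display_type:Events)^1" % stamp
    )


print(boost_query(date(2011, 9, 27)))
```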

http://wiki.apache.org/solr/QueryElevationComponent - "sponsored search"

Index non-data resources (pdf, docs, etc.): Apache Tika