Search Analytics Business Value & NoSQL Backend

Post on 27-Jan-2015

Search Analytics

Business Value &

NoSQL Backend

Otis Gospodnetić – Sematext International
@otisg ◦ @sematext ◦ sematext.com

sematext.com/search-analytics

Copyright 2011 Sematext Int'l. All rights reserved.

About Otis Gospodnetić

• ASF Member: Lucene, Solr, Nutch, Mahout

• Author: Lucene in Action 1 & 2

• Entrepreneur: Sematext, Simpy

Sematext Metrics

● 100% organic: no GMO, no VC
● 4 years old
● < 10 people
● 7 countries
● 3 timezones
● 2 continents
● > 100 customers

About Sematext

Products & Services

Consulting, Development, Tech Support:

● Search (Lucene, Solr, ElasticSearch...)
● Big Data (Hadoop, HBase, Voldemort...)
● Web Crawling (Nutch, Droids)
● Machine Learning (Mahout)

Agenda

● What is Search Analytics and why it matters
● Example reports and their value
● What we built, why, and how

Communication

● twitter.com/sematext
● twitter.com/otisg
● hash tags: #stsa or #stanalytics
● http://sematext.com/search-analytics/index.html
● Raise your hand!
● otis@sematext.com

The Compass

Search logs are your Map
Search Analytics is your Compass

High Level Why

search users

search providers

search experience

High Level Why

search providers

search experience

This search sucks! It takes 17 tries to find anything here!

F!?@#$%^&?!?

search users

Cool, the latest search tweaks made our site really sticky!

Awesome!

Don't Be Like This Dude

Got Clue?

Search Analytics

Performance Monitoring

Quality Assurance

Tuning UI

More Concrete Why

● Measure and monitor everything. Introspection.
● Supports (re)design, navigation choices
● Helps with content acquisition & enhancement
● Improve search experience
● Mula

The Moment of Truth

Question for the audience #1

What do you use for Search Analytics?

a) Home grown stuff
b) Google Analytics
c) Omniture
d) Webtrends
e) Other
f) Nothing

Search Analytics Outline

● Collect: queries & clicks & interactions & ...
● Analyze: actions / xactions / conversions
● Output: reports – over time
● Output++: feedback loop

● The means, not the goal
● Ongoing, not one-off

remember this

Search vs. Web Analytics

● User intent and information needs vs. inferring
● Hand in hand
● Ideally you can relate data from both or even unify it

Example Core Reports

● Rate & Volume, Latency (mean, avg, 90%)
● Click Through Rate, Mean Reciprocal Rank
● Top Queries by count, clicks, 0 hits...
● Query Trending
● Top Seen Docs, Top Clicked Docs (msft)
● Page & Click Depth
● Facet & Sort Usage
● ...
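
Two of the reports above, Click Through Rate and Mean Reciprocal Rank, plus the 0-hits query list, can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the `SearchEvent` record shape and all function names are hypothetical, and a search with no click is assumed to contribute 0 to the reciprocal-rank average.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    """One search and its outcome (hypothetical record shape)."""
    query: str
    hits: int                            # number of results returned
    clicked_rank: Optional[int] = None   # 1-based rank of first click, None if no click


def click_through_rate(events):
    """Fraction of searches that led to at least one click."""
    if not events:
        return 0.0
    clicked = sum(1 for e in events if e.clicked_rank is not None)
    return clicked / len(events)


def mean_reciprocal_rank(events):
    """Average of 1/rank of the first click; searches without a click count as 0."""
    if not events:
        return 0.0
    total = sum(1.0 / e.clicked_rank for e in events if e.clicked_rank)
    return total / len(events)


def zero_hit_queries(events):
    """Distinct queries that returned no results, sorted alphabetically."""
    return sorted({e.query for e in events if e.hits == 0})


events = [
    SearchEvent("hbase scan", hits=12, clicked_rank=1),
    SearchEvent("flume sink", hits=8, clicked_rank=4),
    SearchEvent("solr facet", hits=5),   # results shown, nothing clicked
    SearchEvent("lucen", hits=0),        # misspelling, no results
]

print(click_through_rate(events))     # 2 of 4 searches were clicked
print(mean_reciprocal_rank(events))   # (1/1 + 1/4 + 0 + 0) / 4
print(zero_hit_queries(events))
```

A low MRR with a healthy CTR suggests users do find something, but not near the top of the results.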

More Reports in More Detail

● See "Search Analytics What? Why? How?"

http://blog.sematext.com/tag/analytics/

Part Dos

Switching gears... Juno digs NoSQL

What We've Built

● Search Analytics SaaS
● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
● Trending over time
● Comparisons of time periods
● Top N reports
● Filter, slice and dice

Who Needs a Compass?

● We need it
  ● search-hadoop.com & search-lucene.com
● Our customers need it!
● You?

Sematext Search Analytics

Big Dreams

● SaaS
● Multitenant
● Large Scale – Massive Data
● Cloud

Storage Choices

● RDBMS: MySQL, PostgreSQL
● HDFS
● Hive
● HBase
● Cassandra

SaaS vs. In-House

Question for the audience #2

SaaS vs in-house Search Analytics?

a) SaaS
b) in-house

Sematext Search Analytics

Data Flow

● See Search Analytics with Flume and HBase

http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/

Data Collection

● See Search Analytics with Flume and HBase

http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/

Core Tech

● JavaScript Beacons
● Metric Capture Web App aka Receiver
● Flume Agents, Collectors, Sinks
● HBase
● MapReduce Aggregations
● Search Analytics Reporting Web App
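
The "MapReduce Aggregations" step can be pictured as a map phase that turns each raw event into a keyed count and a reduce phase that sums counts per key. The sketch below runs in plain Python rather than on Hadoop. The event fields (`tenant`, `ts`, `query`) and the (tenant, day, query) key are assumptions made for illustration, not the schema used in the product.

```python
from collections import defaultdict
from datetime import datetime, timezone


def map_event(event):
    """Map step: emit ((tenant, day, normalized query), 1) for one raw event."""
    day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
    yield (event["tenant"], day, event["query"].strip().lower()), 1


def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical raw events as a receiver might have stored them.
raw_events = [
    {"tenant": "acme", "ts": 1300000000, "query": "HBase"},
    {"tenant": "acme", "ts": 1300000100, "query": "hbase "},
    {"tenant": "acme", "ts": 1300090000, "query": "flume"},
]

daily_counts = reduce_counts(
    pair for event in raw_events for pair in map_event(event)
)

for key, count in sorted(daily_counts.items()):
    print(key, count)
```

Including the tenant in the key keeps customers' aggregates apart, which matters for a multitenant service.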

What is Flume

● Distributed data/log collection service
● Scalable, configurable, extensible
● Centrally manageable, open source

● Agents get data from app, Collectors save it
● Abstractions: Source → Decorator(s) → Sink
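
The Source → Decorator(s) → Sink abstraction can be illustrated with a chain of Python generators. This is a conceptual sketch only: none of these names come from Flume, and Flume itself is configured rather than coded this way.

```python
def log_source(lines):
    """Source: produces raw events (here, from an in-memory list)."""
    for line in lines:
        yield {"body": line}


def drop_empty(events):
    """Decorator: filters out events with an empty body."""
    for event in events:
        if event["body"].strip():
            yield event


def add_host(events, host):
    """Decorator: annotates each event with the originating host."""
    for event in events:
        yield {**event, "host": host}


class ListSink:
    """Sink: the end of the pipeline; stores events in a list for the demo."""

    def __init__(self):
        self.stored = []

    def write_all(self, events):
        for event in events:
            self.stored.append(event)


sink = ListSink()
pipeline = add_host(drop_empty(log_source(["q=hbase", "", "q=flume"])), host="web-1")
sink.write_all(pipeline)

print(sink.stored)
```

Each stage only sees an iterator of events, so decorators can be added, removed or reordered without touching the source or the sink.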

What is HBase

● Scalable, reliable, distributed, column-oriented DB
● On top of HDFS
● MapReducable

Data Flow, Detailed

Why Flume

● Reliable delivery
  ● e.g. queue msgs locally if destination unreachable
● Easy, centralized management via Web UI or console
● Good community, good progress, now @ASF
● But: more complex, more moving parts
● On Flume: slideshare.net/cloudera/inside-flume
● Alternatives: Kafka, Scribe...
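
The "queue msgs locally if destination unreachable" idea can be sketched as a sender that buffers messages and retries them in order. This is a generic illustration, not Flume code: `BufferingSender` and `send` are hypothetical names, and the in-memory queue used here would be lost on a crash, so a real agent would need durable storage.

```python
from collections import deque


class BufferingSender:
    """Keeps undelivered messages in a local queue and retries them in order."""

    def __init__(self, send):
        self._send = send          # callable; raises ConnectionError on failure
        self._pending = deque()    # local queue of messages not yet delivered

    def submit(self, message):
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Deliver queued messages in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False       # leave the message queued for a later retry
            self._pending.popleft()
        return True

    @property
    def pending(self):
        return len(self._pending)


# Demo: a fake destination that is down at first, then comes back.
state = {"up": False}
delivered = []


def send(message):
    if not state["up"]:
        raise ConnectionError("destination unreachable")
    delivered.append(message)


sender = BufferingSender(send)
sender.submit("event-1")   # destination down: queued locally
sender.submit("event-2")   # still down: queued behind event-1
state["up"] = True
sender.flush()             # destination back: both delivered, in order

print(delivered, sender.pending)
```

A message is removed from the queue only after a successful send, which gives at-least-once delivery in this sketch.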

Why HBase

● Scalable raw & aggregate data storage
● MapReduce data input
● Fast scans for time ranges, fast key lookups
● Easy storage and compute power expansion
● Good looking roadmap, community, progress
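
Fast time-range scans and key lookups follow from HBase keeping rows sorted by row key: if the key ends in a timestamp, a time range becomes one contiguous key range. The toy store below mimics that behavior with a sorted list. It does not use the HBase client API, and the `tenant|timestamp` key layout is an assumption for illustration, not the schema used in the product.

```python
import bisect


def row_key(tenant, ts):
    """Hypothetical key layout: tenant, then zero-padded epoch seconds."""
    return f"{tenant}|{ts:010d}"


class SortedStore:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Point lookup by exact key."""
        return self._values.get(key)

    def scan(self, start_key, stop_key):
        """Yield (key, value) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedStore()
for ts, query in [(1300000000, "hbase"), (1300003600, "flume"), (1300090000, "solr")]:
    store.put(row_key("acme", ts), query)
store.put(row_key("other", 1300000500), "nutch")

# All of tenant "acme" for the 24 hours starting at ts=1300000000.
first_day = list(
    store.scan(row_key("acme", 1300000000), row_key("acme", 1300000000 + 86400))
)
print(first_day)
```

Zero-padding the timestamp matters: keys compare as strings, so numeric order and key order must agree.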

Open Sourcing

● 2 open-source projects:
  github.com/sematext/HBaseWD
  github.com/sematext/HBaseHUT
● See sematext.com/open-source/index.html
● Patches for Flume and HBase
  blog.sematext.com/tag/flume/
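
HBaseWD addresses write hotspots caused by sequential row keys (such as timestamps) by spreading them across buckets with a key prefix. The sketch below shows only that general prefixing idea in Python. It is not HBaseWD's Java API: the bucket count, the CRC-based bucket choice and all names are assumptions for illustration.

```python
import zlib

BUCKETS = 4  # assumed bucket count, for illustration only


def salted_key(key, buckets=BUCKETS):
    """Prefix the key with a bucket id derived from the key itself."""
    bucket = zlib.crc32(key.encode("utf-8")) % buckets
    return f"{bucket:02d}|{key}"


def strip_salt(salted):
    """Recover the original key from a salted key."""
    return salted.split("|", 1)[1]


def scan_all_buckets(rows, start, stop, buckets=BUCKETS):
    """Run one range scan per bucket, then merge back into original key order."""
    found = []
    for bucket in range(buckets):
        lo = f"{bucket:02d}|{start}"
        hi = f"{bucket:02d}|{stop}"
        found.extend(k for k in rows if lo <= k < hi)
    return sorted(strip_salt(k) for k in found)


# Sequential keys, as produced by timestamp-ordered writes.
sequential = [f"{ts:010d}" for ts in range(1300000000, 1300000008)]
rows = sorted(salted_key(k) for k in sequential)

print(rows)
print(scan_all_buckets(rows, "1300000002", "1300000005"))
```

The trade-off is visible in `scan_all_buckets`: writes spread out, but one logical range scan becomes one scan per bucket plus a merge.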

Challenges

● Data size. Solutions:
  ● Compression (4-5x smaller with lzo)
  ● Data pruning (variable levels)
● Query string distribution: very long-tail
● Lots of data to process, update, aggregate
● Young tools: Flume, HBase
● Poor IO on EC2
● Hadoop distributions
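
Why compression helps so much with this kind of data can be seen on a small synthetic log: field names and common values repeat on every line. The sketch uses `zlib` from the Python standard library as a stand-in, because LZO is not in the standard library. The log format is invented, and the ratio it prints says nothing about the 4-5x figure above.

```python
import zlib

# Synthetic, highly repetitive search-log lines (invented format).
lines = [
    f"ts={1300000000 + i} tenant=acme query=hbase+scan hits={i % 20} latency_ms={30 + i % 7}"
    for i in range(5000)
]
raw = "\n".join(lines).encode("utf-8")

packed = zlib.compress(raw)
ratio = len(raw) / len(packed)

# Round-trip check: decompression must return the original bytes.
restored = zlib.decompress(packed)

print(f"{len(raw)} bytes -> {len(packed)} bytes ({ratio:.1f}x smaller)")
```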

Output++

● AutoComplete - $MM improvement
● Better DYM Spellchecker
● Related Searches
● Recommendations
● Relevance Feedback
● ...
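
As an example of the feedback loop, logged queries can feed AutoComplete directly: suggest the most frequent past queries that start with what the user has typed. This is a minimal sketch with hypothetical names (`build_suggester`, `suggest`). A production version would likely also weigh clicks, recency and zero-hit queries.

```python
from collections import Counter


def build_suggester(logged_queries):
    """Return a suggest(prefix, limit) function backed by query frequencies."""
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix, limit=3):
        prefix = prefix.strip().lower()
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most frequent first; ties broken alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


suggest = build_suggester(
    ["hbase scan", "hbase", "HBase", "hbase shell", "hadoop", "hbase scan", "hbase"]
)

print(suggest("hb"))
print(suggest("ha"))
```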

Closing the Loop

search users

search providers

search experience

Resource

http://rosenfeldmedia.com/books/searchanalytics/

Search Analytics for Your Site
Louis Rosenfeld

We're Hiring

Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open-source?
We're hiring world-wide!
http://sematext.com/about/jobs.html

Contact

sematext.com blog.sematext.com @sematext @otisg otis@sematext.com

Want SA? Grab me or go to: sematext.com/search-analytics

Hash tags: #stsa or #stanalytics