Solr -

Hao Chen, 2012.04


Page 1: Solr -

Solr
Hao Chen, 2012.04

Page 2: Solr -

What is Solr?

Solr is the popular open source enterprise search platform from the Apache Lucene project.

Solr powers the search and navigation features of many of the world's largest internet sites.

Page 3: Solr -

Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Page 4: Solr -

Lucene vs. Solr

Lucene is a search library built in Java. Solr is a web application built on top of Lucene. In short, Solr = Lucene + added features. A common question is when to choose Solr and when to choose Lucene.

To get more control, use Lucene. For faster development and an easier learning curve, choose Solr.

http://www.findbestopensource.com/article-detail/lucene-vs-solr

Page 5: Solr -

Why do we need Solr?

Full-text Search
– MySQL “like %keyword%”

Too slow! And weak!

Page 6: Solr -

Major Features of Solr

• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Scalability - Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture

http://lucene.apache.org/solr/

Page 7: Solr -

Typical Application Architecture

[Diagram components: http request, Web Server, Database (MySQL), Cache (memcached, Redis, etc.), Solr / Lucene, DIH]

All the components could be distributed, to make the architecture scalable.

Page 8: Solr -

Lucene/Solr Architecture

[Architecture diagram; the labels recoverable from it are:
• Request Handlers: /select, /spell, /admin, Data Import Handler (SQL/RSS), Extracting Request Handler (PDF/WORD, with Apache Tika)
• Update Handlers: XML, CSV, binary
• Response Writers: XML, Binary, JSON
• Search Components: Query, Spelling, Faceting, Highlighting, Debug, Statistics, More like this, Distributed Search, Clustering
• Update Processors: Signature, Logging, Indexing
• Other Solr layers: Schema, Config, Query Parsing, Analysis, Caching, Faceting, Filtering, Search, High-lighting, Index Replication
• Apache Lucene: Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis]

Page 9: Solr -

Demo – A live website powered by Solr

I’ll be showing you more later!

Page 10: Solr -

Demo – The backend of the website

Page 11: Solr -

Demo – Standard directory layout

Page 12: Solr -

Demo – Multiple cores

Page 13: Solr -

Demo – Run Solr!

java -jar start.jar

Production environment:

– java -Xms200m -Xmx1400m -jar start.jar >>/home/web_logs/solr/solr$date.log 2>&1 &

– tailf /home/web_logs/solr/solr20120423.log
2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983

Page 14: Solr -

Demo – Web Admin Interface
http://localhost:8983/solr/admin

Page 15: Solr -

Demo – Web Admin Interface
http://localhost:8983/solr/admin

• SCHEMA: This downloads the schema configuration file (XML) directly to the browser.

• CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.

• ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later.

• SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.

• STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.

Page 16: Solr -

Demo – Web Admin Interface
http://localhost:8983/solr/admin

• INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful.

• DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations.

• PING: Ignore this, although it can be used for a health-check in distributed mode.

• LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.

Page 17: Solr -

Query
Indexing

Page 18: Solr -

Query

INFO: [core1] webapp=/solr path=/admin/ping params={} status=0 QTime=2
Apr 23, 2012 5:42:46 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/solr path=/select params={wt=json&rows=100&json.nl=map&start=0&q=searchKeyword:ipad2} hits=48 status=0 QTime=0

Page 19: Solr -

Query

INFO: [] webapp=/solr path=/select params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41

q=CId:50011744+AND+price:[100+TO+*]
sort=volume+desc
start=0
rows=20

hits=1547 status=0 QTime=41
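Each request log line above packs the core name, the handler path, the request parameters and the hits/status/QTime counters into a single string. A minimal Python sketch for pulling them apart follows. It assumes the layout seen in the sample lines on these slides rather than any formal log specification, and the names `LOG_LINE` and `parse_request_log` are illustrative, not part of Solr.

```python
import re
from urllib.parse import parse_qs

# Layout inferred from the sample log lines on the slides (an assumption):
# [core] webapp=... path=... params={...} [hits=N] status=N QTime=N
LOG_LINE = re.compile(
    r"\[(?P<core>[^\]]*)\] webapp=(?P<webapp>\S+) path=(?P<path>\S+) "
    r"params=\{(?P<params>.*)\}"
    r"(?: hits=(?P<hits>\d+))? status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)


def parse_request_log(line):
    """Return the fields of one request log line, or None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    # parse_qs decodes '+' as a space, so "volume+desc" becomes "volume desc".
    params = {key: values[0] for key, values in parse_qs(match.group("params")).items()}
    hits = match.group("hits")
    return {
        "core": match.group("core"),
        "path": match.group("path"),
        "params": params,
        # Lines such as /admin/ping carry no hits counter.
        "hits": int(hits) if hits is not None else None,
        "status": int(match.group("status")),
        # QTime is reported in milliseconds.
        "qtime_ms": int(match.group("qtime")),
    }
```

Feeding the log through a helper like this makes it easy to spot slow queries, for instance by filtering on `qtime_ms`.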

Page 20: Solr -

Query

q - the query string; required.
fl - the fields to return; separate multiple fields with commas or spaces.
start - the offset of the first returned record within the full result set, starting at 0; typically used for paging.
rows - the maximum number of records to return; combine with start to implement paging.
sort - the sort order, in the form sort=<field name>+<desc|asc>[,<field name>+<desc|asc>]… For example, (inStock desc, price asc) sorts by "inStock" descending, then by "price" ascending. The default is relevance, descending.
wt - (writer type) the output format: xml, json, php, phps. The later ones were added in Solr 1.3; let us know if you need them, because they are not enabled by default.
fq - (filter query) restricts the results matched by q to those that also match fq. For example, q=mm&fq=date_time:[20081001 TO 20091031] finds the keyword mm where date_time is between 20081001 and 20091031.

More: http://wiki.apache.org/solr/CommonQueryParameters
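The parameters above are combined into one query string sent to the /select handler. The following is a minimal Python sketch of how a client might assemble such a URL with paging. The function name `build_select_url` and its `page`/`page_size` arguments are assumptions made for illustration; only the parameter names q, fl, fq, sort, start, rows and wt come from the slide.

```python
from urllib.parse import urlencode


def build_select_url(base_url, q, fl=None, fq=None, sort=None,
                     page=0, page_size=20, wt="json"):
    """Build a /select URL; page is zero-based and mapped onto start/rows."""
    params = [
        ("q", q),
        ("start", page * page_size),  # offset of the first record to return
        ("rows", page_size),          # maximum number of records to return
        ("wt", wt),
    ]
    if fl:
        params.append(("fl", ",".join(fl)))  # comma-separated field list
    if fq:
        params.append(("fq", fq))
    if sort:
        params.append(("sort", sort))        # e.g. "volume desc"
    # urlencode escapes spaces as '+' and percent-encodes ':', '[', ']' and '*'.
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr",
    q="CId:50011744 AND price:[100 TO *]",
    sort="volume desc",
)
```

Letting `urlencode` do the escaping avoids hand-writing `+` for spaces as seen in the log lines.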

Page 21: Solr -

Demo – PHP Solr Client

Page 22: Solr -

Query - Demo

Page 23: Solr -

Indexing Data

Page 24: Solr -

Indexing Data - Communicating with Solr

– Direct HTTP or a convenient client API
– Data streamed remotely or from Solr's filesystem

Page 25: Solr -

Indexing Data - Data formats/sources

– Solr-XML:

– Solr-binary: This is only supported by the SolrJ client API.

– CSV: CSV is a character separated value format (often a comma).

– Rich documents like PDF, XLS, DOC, PPT

– Solr's DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example, web services). It supports configurable relational and schema mapping options and custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.
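To make the Solr-XML format in the list above concrete, here is a minimal Python sketch that builds an `<add>` message from plain dictionaries. The helper name `to_solr_add_xml` and the field names used in the example (id, title, price) are hypothetical; real field names must match those defined in schema.xml.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts into a Solr-XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes several <field> elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names are assumptions.
message = to_solr_add_xml([{"id": "1", "title": "iPad 2", "price": 3688}])
```

A message like this is typically sent with an HTTP POST to the /update handler, and the documents become searchable after a commit.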

Page 26: Solr -

Lucene/Solr Indexing

[Indexing diagram; the labels recoverable from it are:
• HTTP POST to /update (XML Update Handler), /update/csv (CSV Update Handler), /update/xml (XML Update with custom processor chain) and /update/extract (Extracting RequestHandler for PDF, Word, …)
• Data Import Handler: database pull, RSS pull and simple transforms, pulling from a SQL DB or an RSS feed
• Update Processor Chain (per handler): Remove Duplicates processor, Logging processor, Custom Transform processor, Index processor
• Lucene: Text Index Analyzers, Lucene Index]

Page 27: Solr -

Indexing Data - Schema

schema.xml

Page 28: Solr -

Advanced

• Chinese Word Segmentation (中文分词)
• DIH (Data Import Handler)
• Sharding
• Replication
• Performance Tuning

Page 29: Solr -

Chinese Word Segmentation (中文分词)

Page 30: Solr -

Chinese Word Segmentation (中文分词)

Page 31: Solr -

Chinese Word Segmentation (中文分词)

IKAnalyzer3.2.8.jar

Page 32: Solr -

Chinese Word Segmentation (中文分词)

For the underlying principles, see 《解密搜索引擎技术实战》.

Page 33: Solr -

DIH (Data Import Handler)

MySQL → jdbc/DIH → Solr

• full-import

• delta-import

Most applications store data in relational databases or XML files and searching over such data is a common use-case.

The DataImportHandler is a Solr contrib that provides a configuration driven way to import this data into Solr in both "full builds" and using incremental delta imports.

Page 34: Solr -

DIH (Data Import Handler)

1. Imports data from databases through JDBC (Java Database Connectivity)

2. Imports XML data from a URL (HTTP GET) or a file

3. Can combine data from different tables or sources in various ways

4. Extraction/Transformation of the data

5. Import of updated (delta) data from a database, assuming a last-updated date

6. A diagnostic/development web page

7. Extensible to support alternative data sources and transformation steps

Page 35: Solr -

DIH (Data Import Handler)

• curl http://localhost:8983/solr/dataimport to verify the configuration.

• curl http://localhost:8983/solr/dataimport?command=full-import

• curl http://localhost:8983/solr/dataimport?command=delta-import
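The curl calls above can also be issued from a script, for example from a scheduled job that triggers delta imports. The following is a minimal Python sketch under that assumption. The helper names `dih_url` and `run_dih` are illustrative, and the handler path /dataimport is taken from the slide.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dih_url(solr_url, command=None, **options):
    """Build a DataImportHandler URL; without a command it is the plain handler URL."""
    url = solr_url.rstrip("/") + "/dataimport"
    params = dict(options)
    if command is not None:
        # Put the command first so the URL reads like the curl examples.
        params = {"command": command, **options}
    return url + ("?" + urlencode(params) if params else "")


def run_dih(solr_url, command=None, **options):
    """Call the handler and return the response body as text."""
    with urlopen(dih_url(solr_url, command, **options)) as response:
        return response.read().decode("utf-8")


full_import = dih_url("http://localhost:8983/solr", "full-import")
delta_import = dih_url("http://localhost:8983/solr", "delta-import")
```

Extra request options can be passed as keyword arguments, for example `clean="false"` on a full-import to keep the existing index contents; check the DataImportHandler documentation for the options your Solr version supports.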

Page 36: Solr -

DIH (Data Import Handler) - Full Import Example (full indexing)

data-config.xml

Page 37: Solr -

DIH (Data Import Handler) - Delta Import Example (incremental indexing)

data-config.xml

Page 38: Solr -

DIH (Data Import Handler) - Demo

2 million rows imported in about 20 minutes (roughly 1,700 rows per second).

Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

cpu cores : 1

MemTotal: 2058400 kB

Page 39: Solr -

Sharding

Sharding is the process of breaking a single logical index in a horizontal fashion across records versus breaking it up vertically by entities.

S1 S2 S3 S4

Page 40: Solr -

Sharding-Indexing

SHARDS = ['http://server1:8983/solr/', 'http://server2:8983/solr/']

unique_id = document[:id]
if unique_id.hash % SHARDS.size == local_thread_id
  # index to shard
end
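The snippet above assigns each document to a shard by hashing its unique id modulo the number of shards. A minimal Python sketch of the same idea follows. It differs from the slide in two labeled ways: it selects the target shard directly instead of comparing against a `local_thread_id`, and it uses `zlib.crc32` as the hash, because Python's built-in `hash()` for strings varies between processes. The function names are illustrative.

```python
import zlib

SHARDS = ["http://server1:8983/solr/", "http://server2:8983/solr/"]


def shard_for(unique_id, shards=SHARDS):
    """Pick the shard for a document id with a hash that is stable across runs."""
    digest = zlib.crc32(str(unique_id).encode("utf-8"))
    return shards[digest % len(shards)]


def partition(documents, shards=SHARDS):
    """Group documents (dicts with an 'id' key) by the shard they belong to."""
    batches = {shard: [] for shard in shards}
    for doc in documents:
        batches[shard_for(doc["id"], shards)].append(doc)
    return batches


# Hypothetical documents; each batch would then be indexed on its own shard.
batches = partition([{"id": i} for i in range(10)])
```

Because the mapping depends on the number of shards, adding or removing a shard changes where existing ids land, so a change in shard count generally means reindexing.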

Page 41: Solr -

Sharding-Query

The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it.
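A distributed query is an ordinary /select request that additionally lists the shards to search in the `shards` parameter, given as comma-separated host:port/path entries without the http:// prefix. The following is a minimal Python sketch of building such a request. The function name `build_distributed_url` and the server names are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_distributed_url(entry_point, shard_hosts, q, rows=10):
    """Build a /select URL that asks one Solr server to search several shards."""
    params = {
        "q": q,
        "rows": rows,
        # Shard addresses are listed without the URL scheme.
        "shards": ",".join(shard_hosts),
    }
    return entry_point.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical hosts; any shard can serve as the entry point for the query.
distributed_url = build_distributed_url(
    "http://server1:8983/solr",
    ["server1:8983/solr", "server2:8983/solr"],
    q="searchKeyword:ipad2",
)
```

The server that receives the request forwards the query to every listed shard and merges the results before responding.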

Page 42: Solr -

Replication

Master

Slaves

Page 43: Solr -

Combining replication and sharding

[Diagram: sharding masters M1, M2 and M3 replicate to slave shards S1, S2 and S3 in Slave Pool 1 and Slave Pool 2. Queries are sent to the pools of slave shards.]

Page 44: Solr -

Combining replication and sharding

http://wiki.apache.org/solr/SolrCloud
http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html

Page 45: Solr -

Performance Tuning

• JVM
• http cache
• Solr Cache
• Better schema
• Better indexing strategy

Page 46: Solr -

Solr Caching

Caching is a key part of what makes Solr fast and scalable.

There are a number of different caches configured in solrconfig.xml:
– filterCache
– queryResultCache
– documentCache

Page 47: Solr -

More Info

• 《Solr 1.4 Enterprise Search Server》
• http://wiki.apache.org/solr/
• http://solr.pl/en/
• 《解密搜索引擎技术实战》

Page 48: Solr -

Thank you!