Handy Installation Tool "Anuenue" for Solr Cluster

Solr Cluster Installation Tool "Anuenue" and "Did You Mean?" for Japanese

Takahiko Ito, mixi, Inc.

mixi?
• One of the largest social networking services in Japan.
• Many services to promote communication among users.
  • Blog, news, game platform, etc.
  • Most of the services come with search.
• 15M monthly active users.

Our current (urgent) project …
Replace our in-house search engines with an up-to-date search platform!

We have
• selected Apache Solr as the search platform!
• created a simple OSS package (Anuenue) which wraps Solr.

Project URL: http://code.google.com/p/anuenue-wrapper/

Why we made Anuenue
Deployment and daily operation of a Solr search cluster are somewhat difficult for ordinary engineers.
• We need to edit the configuration files of every Solr instance individually.
• Commands for the whole cluster are not provided.
  • We need to write client commands ourselves.
  • Hadoop, by contrast, provides utility commands for clusters, e.g., start-all.sh (start processes), fsck (check all disks), balancer (rebalance the data blocks).

What does Anuenue provide?
• Handy configuration of search clusters
• Commands for clusters
  • Simple commands (post, delete, update, commit, etc.)
  • Start and stop commands for the processes in a cluster
• Japanese support
  • Implementation of Japanese Did-You-Mean facilities
  • Japanese tokenizers (Sen and Kuromoji)

Today’s Topics
• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementations
  • Mining a Japanese Did-You-Mean dictionary from query log data

Cluster configuration with Anuenue
• Cluster setup is done with a special configuration file.
• Anuenue assigns one or more roles to each instance.
  • Roles are the functions an instance performs in the cluster.
  • Anuenue supports three roles (Master, Slave, Merger).

Role: master
• Indexes the input data.
NOTE: Anuenue provides a command that distributes the input data to the master instances (building the Solr shard indexes).

[Diagram: the input data is distributed to Master-1, Master-2 and Master-3, which build the shard indexes.]

Role: slave
A slave has three functions:
• Copy (replicate) the index from a master.
• Accept queries from mergers and search its own index.
• Return the results to the merger instance.

[Diagram: Master-1 and Master-2 index the input data; Slave-1 and Slave-2 replicate the indexes; Merger-1 submits queries to the slaves.]

Role: merger
• Forwards queries from clients to the slaves.
  • Note: clients do not need to know the slave instances (the merger adds the ‘shards’ parameter listing the slave instances).
• Merges the results from all the slave instances and returns the merged results.

[Diagram: Client-1 and Client-2 submit queries to the merger, which forwards them to Slave-1 and Slave-2.]
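
The forwarding step can be sketched in a few lines of Python. This is only an illustration of Solr's distributed-search `shards` parameter, not Anuenue's actual code: the function name `add_shards_param`, the `/solr` path, and port 8983 for the slaves dd and ee are assumptions made for the example.

```python
from urllib.parse import urlencode


def add_shards_param(client_params, slaves, core_path="/solr"):
    """Return a copy of the client's parameters with a Solr 'shards' entry added.

    slaves is a list of (host, port) tuples; core_path is an assumed URL path.
    """
    shards = ",".join(f"{host}:{port}{core_path}" for host, port in slaves)
    forwarded = dict(client_params)  # do not modify the client's own parameters
    forwarded["shards"] = shards
    return forwarded


# Assumed slave instances (hosts taken from the example cluster, ports hypothetical).
SLAVES = [("dd", 8983), ("ee", 8983)]

params = add_shards_param({"q": "python"}, SLAVES)
query_string = urlencode(params)  # query string the merger would send on
print(params["shards"])
```

The client only sends `q`; the list of slaves stays a detail of the merger.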

Example: Anuenue cluster
The cluster consists of five machines.
• Each machine runs one Anuenue instance.

Instances
• Merger: aa
• Master: bb, cc
• Slave: dd, ee

[Diagram: Client-1 and Client-2 send queries to the merger, which forwards them to the slaves; the masters index the input data and the slaves replicate the indexes.]

How to assign roles to instances?
Edit the cluster configuration file, anuenue-nodes.xml.
• Add three elements (mergers, slaves and masters).
• In each element, add the information (machine name and port number) for one or more instances.

Configuration example
Case: there is one merger instance on machine aa (port 7000)

<mergers>
  <merger>
    <host>aa</host>
    <port>7000</port>
  </merger>
</mergers>

Specify the index to replicate

<masters>
  <master iname="master1">
    <host>aaaa</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    ...
    <replicate>master1</replicate>
  </slave>
</slaves>

• Name the master instance with the iname attribute.
• Specify the master instance whose index is copied by adding a replicate element to the slave.

Example: simple cluster settings

<mergers>
  <merger>
    <host>aa</host>
    <port>8983</port>
  </merger>
</mergers>
<masters>
  <master iname="master1">
    <host>bb</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>cc</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>

[Diagram: Client-1 and Client-2 send queries to the merger (aa), which forwards them to the slave (cc); the master (bb) indexes the input data and the slave replicates its index.]
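
To make the role assignment concrete, the following Python sketch reads settings like the ones above and checks that every replicate element names an existing master. It is not part of Anuenue: the enclosing `<nodes>` root element and the function names `load_roles` and `unknown_replication_targets` are assumptions made so that the fragment parses as one XML document.

```python
import xml.etree.ElementTree as ET

# The three elements from the slide, wrapped in an assumed <nodes> root element.
NODES_XML = """
<nodes>
  <mergers><merger><host>aa</host><port>8983</port></merger></mergers>
  <masters><master iname="master1"><host>bb</host><port>8983</port></master></masters>
  <slaves>
    <slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
</nodes>
"""


def load_roles(xml_text):
    """Map each role group to a list of instance descriptions."""
    root = ET.fromstring(xml_text.strip())
    roles = {}
    for group, tag in (("mergers", "merger"), ("masters", "master"), ("slaves", "slave")):
        roles[group] = [
            {
                "host": node.findtext("host"),
                "port": int(node.findtext("port")),
                "iname": node.get("iname"),              # set on masters only
                "replicate": node.findtext("replicate"),  # set on slaves only
            }
            for node in root.findall(f"{group}/{tag}")
        ]
    return roles


def unknown_replication_targets(roles):
    """Return replicate values that do not match the iname of any master."""
    master_names = {m["iname"] for m in roles["masters"]}
    return [s["replicate"] for s in roles["slaves"] if s["replicate"] not in master_names]


roles = load_roles(NODES_XML)
print(roles["slaves"][0]["replicate"], unknown_replication_targets(roles))
```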

Cluster setup with Anuenue
• Flexible; supports various types of search clusters.
• For example…

Assign multiple roles
[Diagram: a single instance holds multiple roles; it indexes the input data and answers the queries submitted by Client1 and Client2.]

Large clusters to handle huge data with high QPS
[Diagram: Client1 … ClientN submit queries to Merger1, Merger2 and Merger3; Slave1 to Slave6 serve the queries; Master1 to Master6 index the input data.]

After setting up the cluster
We can make use of the commands for clusters. Anuenue provides
• start / stop commands
• commands to manipulate the index

Start and stop clusters
Users can start or stop a cluster with one command (anuenue-distdaemon.sh).
Usage: $ sh bin/anuenue-distdaemon.sh [start|stop]

Simple commands for clusters
Anuenue also provides basic commands (‘post’, ‘delete’, ‘commit’, ‘optimize’ and ‘update’) for search clusters.
• The commands are implemented with multiple threads.
E.g., $ sh bin/anuenue-distcommands.sh post -arg inputDir

Today’s Topics
• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementations
  • Mining a Japanese Did-You-Mean dictionary from query log data

What is a Did-You-Mean service?
• Suggests the correct spelling when users submit queries with mistakes.
• Increases the usability of the search service.

Example: Did-You-Mean service
[Screenshot: a Did-You-Mean suggestion for a Japanese query (English: Ugly Betty).]

Common implementation
Many search engines (including Solr) apply distance measures such as edit distance [Levenshtein, 1965].

Edit distance: a measure of the distance between two sequences, defined as the minimum number of single-character insertions, deletions and substitutions needed to turn one sequence into the other. Simply speaking, the more similar two sequences are, the smaller the distance.

E.g.,
like ↔ likes (small distance)
like ↔ foobar (large distance)

Common procedure: Did-You-Mean
When a user submits a query,
1. The Did-You-Mean service computes the edit distance between the input query and the words in the index.
2. If there is a word whose distance is small,
   → the Did-You-Mean handler suggests that word.

E.g., when a user submits the query “pthon”, the Did-You-Mean service suggests “python”, a word in the index with a small distance.
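
The procedure above can be sketched in Python as follows. This is a minimal illustration, not Solr's spell checker: the function names, the tiny word list and the threshold `max_distance=2` are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def suggest(query, index_words, max_distance=2):
    """Return the closest index word within max_distance, or None."""
    if query in index_words:
        return None  # the query is already a known word
    best_word, best_distance = None, max_distance + 1
    for word in index_words:
        distance = edit_distance(query, word)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


INDEX_WORDS = ["python", "java", "tomcat"]  # hypothetical index vocabulary
print(suggest("pthon", INDEX_WORDS))
```

A real index holds far more words than this list, so comparing the query against every word in turn, as this loop does, would be too slow in practice.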

Problem: Japanese queries
Simple application of edit distance does not work for Japanese.
→ A misspelled query is sometimes totally different from the correct one (large distance). E.g.,
  • 墨ともふどうさん (correct: 住友不動産)
  • 米事案セット (correct: ベイジアンセット)
→ These cases stem from the Japanese input method.

Typing a Japanese query
We input Japanese (query) words in two steps.
1. Type the reading of the Japanese word in the Latin alphabet.
2. Select the desired word from the list of candidates.
   → This step can cause a spelling mistake whose distance to the correct spelling is too large.

Example: Typing a Japanese query
Assume a user wants to submit the query オバマ (Obama).
1. Type the reading in the Latin alphabet.
   Reading: obama
2. Select the correct spelling.
   Possible candidates: オバマ (correct), おばま, 小浜, etc.

Japanese Did-You-Mean dictionary
• Because of the large-distance problem, simple distance measures (edit distance) do not work.
• To handle this problem, Anuenue supports a special dictionary for the Japanese Did-You-Mean service.

Dictionary for the Japanese Did-You-Mean service
The dictionary has two columns:
1. Query with mistakes
2. Correct query

Query with mistakes    Correct query
墨ともふどうさん       住友不動産
歌だ光る               宇多田ヒカル
米事案セット           ベイジアンセット

Implementing a Did-You-Mean service with the dictionary
When a user submits a query that is listed in the dictionary as a query with mistakes,
→ the Did-You-Mean service suggests the corresponding correct query.

NOTE: Anuenue provides handlers for the dictionary format.
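
In code, this amounts to an exact-match lookup. The sketch below holds the three dictionary rows in a Python dict; it does not reproduce Anuenue's handler or its on-disk dictionary format, and the function name `did_you_mean` is hypothetical.

```python
# Dictionary rows: query with mistakes -> correct query.
DID_YOU_MEAN = {
    "墨ともふどうさん": "住友不動産",
    "歌だ光る": "宇多田ヒカル",
    "米事案セット": "ベイジアンセット",
}


def did_you_mean(query, dictionary=DID_YOU_MEAN):
    """Return the correct query for a known misspelling, or None if there is no entry."""
    return dictionary.get(query.strip())


print(did_you_mean("歌だ光る"))
```

No distance computation happens at query time, which is why pairs with a large edit distance cause no difficulty here.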

Problem…
How can we create the dictionary?
→ We can use Oluolu, a query log mining tool.

Oluolu
• Creates a spelling correction dictionary from a query log.
• Extracts pairs of queries (query with spelling mistakes, query with correct spelling).
  • Supports Japanese spelling mistakes (from version 0.2).
• Runs on the Hadoop framework.

Project URL: http://code.google.com/p/oluolu/

Input to Oluolu: query log
Three columns:
1. User ID
2. Query string
3. Time of query submission

User ID   Query         Time
438904    Pthon         2009-11-21 11:16:12
34443     Java          2009-11-21 12:16:13
438904    Python        2009-11-21 12:16:20
8975      Java Tomcat   2009-11-21 12:16:25

Procedure: creating a Japanese Did-You-Mean dictionary with Oluolu
Oluolu extracts the entries of the Japanese Did-You-Mean dictionary in two steps.
1. Extract all the query pairs in the same session.
2. Validate the query pairs.

Step 1: extract query pairs
• Oluolu extracts pairs of queries in the same session. E.g., Oluolu extracts the pair (Pthon, Python).
• Queries in the same session: a set of queries submitted by the same user within a short time range.
• Each extracted pair is a candidate (misspelled query, correct query) pair.

User ID   Query    Time
438904    Pthon    2009-11-21 12:16:12
34443     Java     2009-11-21 12:16:13
438904    Python   2009-11-21 12:16:20
8975      Tomcat   2009-11-21 12:16:25
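
Step 1 can be sketched on a single machine as follows. Oluolu itself runs on Hadoop, so this is not its implementation; the 60-second session window and the function name `extract_query_pairs` are assumptions, since the slides only say "small time range".

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(seconds=60)  # assumed length of a session

QUERY_LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]


def extract_query_pairs(log_rows, window=SESSION_WINDOW):
    """Pair each query with later, different queries by the same user inside the window."""
    by_user = defaultdict(list)
    for user_id, query, timestamp in log_rows:
        submitted = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        by_user[user_id].append((submitted, query))

    pairs = []
    for entries in by_user.values():
        entries.sort()  # chronological order per user
        for i, (first_time, first_query) in enumerate(entries):
            for later_time, later_query in entries[i + 1:]:
                if later_time - first_time > window:
                    break  # later entries are even further away in time
                if later_query != first_query:
                    pairs.append((first_query, later_query))
    return pairs


pairs = extract_query_pairs(QUERY_LOG)
print(pairs)
```

Only user 438904 submits two queries inside the window, so the log yields the single candidate pair (Pthon, Python).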

Step 2: validate candidate pairs
• Oluolu validates all the query pairs extracted in step 1.
• In the validation phase (step 2), Oluolu makes use of query readings.

Reading of Japanese words
• Japanese words can be converted into their readings in the Latin alphabet.
  • こんにちは (reading: konnichiha)
  • 伊藤 (reading: itou)

FACT: even when a Japanese query with spelling mistakes is totally different from the correct query,
→ the readings are the same or the distance between them is small!

Validate candidate pairs with readings
Given a query pair, Oluolu validates it in two steps.
1. Convert the queries into readings in the Latin alphabet.
2. Compute the edit distance between the two readings.
   → When the distance is small, the two queries are extracted as an entry of the Did-You-Mean dictionary.

Example: step 2
Given a pair of queries: (墨ともふどうさん, 住友不動産)
1. Convert them into readings.
   → The readings are the same: “sumitomofudousan”.
2. Compute the distance between the readings.
   → The distance is zero.
   → The pair is extracted as an entry of the Did-You-Mean dictionary.
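
The validation step can be sketched as follows. Converting Japanese text to its reading normally requires a morphological analyzer; to keep the example self-contained, a hypothetical lookup table `READINGS` stands in for that conversion, and the threshold `max_distance=1` is likewise an assumption rather than Oluolu's setting.

```python
from functools import lru_cache

# Hypothetical stand-in for a reading converter; both readings come from the example above.
READINGS = {
    "墨ともふどうさん": "sumitomofudousan",
    "住友不動産": "sumitomofudousan",
}


def to_reading(query):
    """Look up the reading; fall back to the lowercased query for Latin-alphabet input."""
    return READINGS.get(query, query.lower())


def edit_distance(a, b):
    """Levenshtein distance computed with a memoized recursion."""
    @lru_cache(maxsize=None)
    def distance(i, j):
        if i == 0 or j == 0:
            return i + j  # one prefix is empty: insert or delete the rest
        return min(distance(i - 1, j) + 1,
                   distance(i, j - 1) + 1,
                   distance(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return distance(len(a), len(b))


def is_dictionary_entry(misspelled, correct, max_distance=1):
    """Accept a candidate pair when the readings of the two queries are close."""
    if misspelled == correct:
        return False  # identical queries carry no correction
    return edit_distance(to_reading(misspelled), to_reading(correct)) <= max_distance


print(is_dictionary_entry("墨ともふどうさん", "住友不動産"))
```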

Creating a Japanese Did-You-Mean dictionary with Oluolu
• Installation requirements
  • Java 1.6.0 or greater
  • Hadoop 0.20.0 or greater
  • Oluolu 0.2.0 or greater
• Copy the input query log into HDFS.
• Run the spellcheck task of Oluolu:
  $ bin/oluolu spellcheck -input testInput.txt -output output -inputLanguage ja

Preliminary experiments
• Experimental settings
  • Input data: log file from a mixi service (community search).
    • 5 GB of data
• Extracted dictionary
  • The number of entries is over 100,000.
  • Succeeded in extracting query pairs with a large edit distance.
    • (議Ν, ギニュー)
    • (不動有利, 不動裕理)

Current status
• Finished functional tests and stress tests.
• Now replacing the in-house search engine of a small search service with Anuenue.
• In the next phase, we will apply Anuenue to a search service with large data and high QPS.

Future work
• Integrate SolrCloud and ZooKeeper
  • Support failover and rebalancing of the index
• Kuromoji, a new OSS Japanese tokenizer

Summary
• Introduced Anuenue.
• Described a Did-You-Mean facility for Japanese queries.

Thank you for your attention!
