Elasticsearch in hatena bookmark

Elasticsearch in Hatena Bookmark Shunsuke KOZAWA

Transcript of Elasticsearch in hatena bookmark

Page 1: Elasticsearch in hatena bookmark

Elasticsearch in Hatena Bookmark

Shunsuke KOZAWA

Page 2: Elasticsearch in hatena bookmark

About Me

● Shunsuke KOZAWA
  ○ Hatena id: skozawa
  ○ Twitter: @5kozawa

● 2007 - 2012
  ○ Research: Natural Language Processing
  ○ Ph.D. in Information Science

● 2012 -
  ○ Hatena Inc.
    ■ Hatena Bookmark
    ■ Ad-tech

Page 3: Elasticsearch in hatena bookmark

Hatena Bookmark

Social Bookmark Service

Page 4: Elasticsearch in hatena bookmark

Search Engine History in Hatena Bookmark

2005 - 2007: MySQL LIKE

2008 - 2012: Sedue (by Preferred Infrastructure)

2012 - 2014/06: Solr

2014/06 - : Elasticsearch

ref. http://bookmark.hatenastaff.com/entry/2014/06/27/180000

Page 5: Elasticsearch in hatena bookmark

System Architecture

Page 6: Elasticsearch in hatena bookmark

Mapping (partial) of Hatena Bookmark

{ "entry": {
  "properties": {
    "url": { "type": "string" },
    "title": { "type": "string" },
    "content": { "type": "string" },
    "count": { "type": "integer" },
    "created": { "type": "date" },
    "bookmark": {
      "type": "nested",
      "properties": {
        "user": { "type": "string" },
        "tag": { "type": "string" },
        "comment": { "type": "string" },
        "created": { "type": "date" }
      }
    }
  }
} }
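Because "bookmark" is mapped as a nested type, conditions on "user" and "tag" can be required to hold within the same bookmark rather than anywhere in the entry. The following is a minimal sketch of such a request body built in Python. The helper name `bookmarks_by_user_and_tag` and the choice of `match` clauses are assumptions for illustration, not something shown in the slides.

```python
def bookmarks_by_user_and_tag(user, tag):
    """Build a search body that matches entries where ONE bookmark
    has both the given user and the given tag (hypothetical helper)."""
    return {
        "query": {
            "nested": {
                # "path" points at the nested field from the mapping.
                "path": "bookmark",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"bookmark.user": user}},
                            {"match": {"bookmark.tag": tag}},
                        ]
                    }
                },
            }
        }
    }


body = bookmarks_by_user_and_tag("skozawa", "elasticsearch")
```

Without the nested type, an entry bookmarked by one user with tag A and by another user with tag B could match a query for the first user and tag B.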

Page 7: Elasticsearch in hatena bookmark

Features powered by Elasticsearch

● Entry Search
  ○ Tag Search
  ○ Title Search
  ○ Content Search
  ○ URL Search

● Related Entry
● Issue
● Topic
● Bookmark Counter

Page 8: Elasticsearch in hatena bookmark

Tag/Title Search

Page 9: Elasticsearch in hatena bookmark

Tag/Title Search

Search by “Elasticsearch”

Page 10: Elasticsearch in hatena bookmark

Tag/Title Search

Sorting

Filter by the number of bookmarks

Filter by timestamp

Page 11: Elasticsearch in hatena bookmark

Tag/Title Search

{
  "sort": { "created": "desc" },
  "query": {
    "filtered": {
      "query": {
        "bool": { "must": [
          { "match_phrase": { "title": "elasticsearch" } }
        ] }
      },
      "filter": { "bool": { "must": [
        { "range": { "count": { "gte": 3 } } },
        { "range": { "created": {
          "from": "2015-05-01T00:00:00",
          "to": "2015-07-15T00:00:00"
        } } }
      ] } }
    }
  }
}
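The sorting and filtering options from the search page map naturally onto parameters of a small builder function. The sketch below assumes the `filtered` query form, with the scored phrase match under "query" and the range conditions under "filter". The function name `title_search_body` and its parameters are hypothetical.

```python
def title_search_body(phrase, min_count=None, created_from=None, created_to=None):
    """Build a title search body, newest first, with optional filters."""
    filters = []
    if min_count is not None:
        # Filter by the number of bookmarks.
        filters.append({"range": {"count": {"gte": min_count}}})
    created = {}
    if created_from is not None:
        created["from"] = created_from
    if created_to is not None:
        created["to"] = created_to
    if created:
        # Filter by timestamp.
        filters.append({"range": {"created": created}})

    filtered = {
        "query": {"bool": {"must": [{"match_phrase": {"title": phrase}}]}}
    }
    if filters:
        filtered["filter"] = {"bool": {"must": filters}}
    return {"sort": {"created": "desc"}, "query": {"filtered": filtered}}


body = title_search_body(
    "elasticsearch",
    min_count=3,
    created_from="2015-05-01T00:00:00",
    created_to="2015-07-15T00:00:00",
)
```

Keeping the range conditions in the filter part means they restrict the result set without affecting relevance scores.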

Page 12: Elasticsearch in hatena bookmark

Content Search

Page 13: Elasticsearch in hatena bookmark

Concept Search

● Simple Content Search
  ○ High recall, but low precision
  ○ Precision is important in Hatena Bookmark

● Concept Search
  ○ Query Expansion
    ■ Use search results retrieved by tag search
    ■ Expand queries using TF-IDF, IDF, and RIDF
      ● Term Vector API
  ○ Retrieve using expanded queries
    ■ eg. 「京都」 (Kyoto) -> 「祇園、寺、神社、桜、京、...」 (Gion, temple, shrine, cherry blossoms, Kyo, ...)

ref. "Improving the Precision of Full-Text Search in Hatena Bookmark" (in Japanese) https://speakerdeck.com/takuyaa/hatenabutukumakuquan-wen-jian-suo-falsejing-du-gai-shan
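The query expansion step can be illustrated with a small, self-contained sketch. It takes tokenized documents (standing in for the results of a tag search), scores each term by term frequency times IDF, and drops terms whose RIDF (residual IDF: observed IDF minus the IDF expected under a Poisson model) is low. How the slides actually combine TF-IDF, IDF, and RIDF is not stated, so the scoring rule, the `min_ridf` threshold, and all names here are assumptions. The document and collection frequencies are passed in as plain numbers; in practice they might come from term statistics returned by the Term Vector API.

```python
import math
from collections import Counter


def idf(df, n_docs):
    """Inverse document frequency, base 2."""
    return math.log2(n_docs / df)


def ridf(df, cf, n_docs):
    """Residual IDF: observed IDF minus the IDF a Poisson model predicts
    from the collection frequency cf."""
    expected_idf = -math.log2(1.0 - math.exp(-cf / n_docs))
    return idf(df, n_docs) - expected_idf


def expand_query(seed_docs, corpus_stats, n_docs, top_k=5, min_ridf=0.5):
    """Pick expansion terms from seed_docs (lists of tokens).

    corpus_stats maps term -> (document frequency, collection frequency).
    """
    tf = Counter(term for doc in seed_docs for term in doc)
    scores = {}
    for term, freq in tf.items():
        df, cf = corpus_stats[term]
        if ridf(df, cf, n_docs) < min_ridf:
            continue  # term is spread too evenly to be topical
        scores[term] = freq * idf(df, n_docs)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


seed = [["gion", "temple", "the"], ["gion", "the", "the"], ["temple", "gion", "the"]]
stats = {"gion": (10, 30), "temple": (50, 120), "the": (900, 5000)}
expanded = expand_query(seed, stats, n_docs=1000, top_k=3)
```

In this toy data the frequent but uninformative term "the" is filtered out by the RIDF threshold, while "gion" and "temple" remain as expansion terms.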

Page 14: Elasticsearch in hatena bookmark

URL Search

http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F

http://www.elastic.co/

Page 15: Elasticsearch in hatena bookmark

URL Search

http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F

{ "query": {
  "filtered": { "filter": {
    "bool": { "should": [
      { "prefix": { "url": "http://www.elastic.co/" } }
    ] }
  } }
} }

http://www.elastic.co/
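The step from the entrylist URL to the prefix filter can be sketched as follows: decode the `url` query parameter, then use it as the prefix value. The function name `url_search_body` is hypothetical, and the sketch assumes the parameter is always present.

```python
from urllib.parse import urlparse, parse_qs


def url_search_body(entrylist_url):
    """Turn an entrylist URL into a prefix-filter search body."""
    params = parse_qs(urlparse(entrylist_url).query)
    # parse_qs percent-decodes the value, e.g. http%3A%2F%2F -> http://
    target = params["url"][0]
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {"should": [{"prefix": {"url": target}}]}
                }
            }
        }
    }


body = url_search_body(
    "http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F"
)
```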

Page 16: Elasticsearch in hatena bookmark

URL Subdomain Search

hatenablog.com

*.hatenablog.com

Page 17: Elasticsearch in hatena bookmark

Related Entry

ref. "Development of Related-Article Recommendation Based on Hatena Bookmark" (in Japanese) http://www.slideshare.net/shunsukekozawa5/hatena-engineer-seminar-5

Page 18: Elasticsearch in hatena bookmark

Issue

Made by editors in Hatena

Entries in special features

Page 19: Elasticsearch in hatena bookmark

Issue

Writing Query DSL is hard for non-engineers

Made by editors in Hatena

Entries in special features

Page 20: Elasticsearch in hatena bookmark

Edit page for Issue

Page 21: Elasticsearch in hatena bookmark

Edit page for Issue

Friendly for non-engineers

Page 22: Elasticsearch in hatena bookmark

Edit page for Issue

Friendly for non-engineers

{
  "query": {
    "bool": {
      "must": [
        { "range": { "count": { "gte": 5 } } }
      ],
      "should": [ (tags, keywords, urls) ],
      "must_not": [ (tags, keywords, urls) ],
      "minimum_should_match": 1
    }
  },
  "sort": { "created": "desc" }
}

(The edit form is translated into this Query DSL.)
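A translation layer of this kind might look like the sketch below. Editors supply include and exclude conditions as (kind, value) pairs, and the function emits the bool query. How each kind maps to a clause is not shown in the slides: the mapping of tags to a nested match on "bookmark.tag", keywords to a title phrase match, and URLs to a prefix query is an assumption, as are the names `clause` and `issue_query`.

```python
def clause(kind, value):
    """Map one editor condition to a query clause (assumed mapping)."""
    if kind == "tag":
        return {
            "nested": {
                "path": "bookmark",
                "query": {"match": {"bookmark.tag": value}},
            }
        }
    if kind == "keyword":
        return {"match_phrase": {"title": value}}
    if kind == "url":
        return {"prefix": {"url": value}}
    raise ValueError("unknown condition kind: %r" % (kind,))


def issue_query(include, exclude=(), min_count=5):
    """Build the search body for an Issue from editor input."""
    return {
        "query": {
            "bool": {
                "must": [{"range": {"count": {"gte": min_count}}}],
                "should": [clause(k, v) for k, v in include],
                "must_not": [clause(k, v) for k, v in exclude],
                "minimum_should_match": 1,
            }
        },
        "sort": {"created": "desc"},
    }


body = issue_query(
    include=[("tag", "elasticsearch"), ("url", "http://www.elastic.co/")],
    exclude=[("keyword", "solr")],
)
```

With "minimum_should_match" set to 1, an entry must satisfy at least one of the include conditions in addition to the bookmark-count threshold.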

Page 23: Elasticsearch in hatena bookmark

Topic

Estimate topics from entries in Hatena Bookmark

Page 24: Elasticsearch in hatena bookmark

Topic Page

Entries related to the topic

Page 25: Elasticsearch in hatena bookmark

Topic by Elasticsearch

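● Acquire topic keywords
  ○ Two-layered Significant Terms Aggregation

● Acquire entries related to the topic
  ○ Function Score Query
  ○ Retrieve using topic keywords and their scores

Prime Minister's Office, Prime Minister, drone, fall, camera

● Drone falls on Prime Minister's Office, no injuries: Nikkei

● Drone falls on roof of Prime Minister's Office, trace radiation detected | Reuters

ref. "How to Build Hatena Bookmark's Topic Pages" (in Japanese) http://codezine.jp/article/detail/8767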
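Both steps can be sketched as request bodies. "Two-layered" is read here as one significant_terms aggregation nested inside another, and the entry retrieval weights each keyword by its score through function_score filters. The field choice ("title"), the aggregation names, the score and boost modes, and both function names are assumptions for illustration. The referenced article describes the actual design.

```python
def topic_keywords_aggs(field="title", size=5):
    """Outer significant_terms proposes keywords; the inner one proposes
    terms that are significant within each keyword's bucket."""
    return {
        "size": 0,  # only aggregation results are needed
        "aggs": {
            "keywords": {
                "significant_terms": {"field": field, "size": size},
                "aggs": {
                    "related": {
                        "significant_terms": {"field": field, "size": size}
                    }
                },
            }
        },
    }


def topic_entries_query(keyword_scores, field="title"):
    """Rank entries by the summed scores of the topic keywords they contain."""
    keywords = list(keyword_scores)
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "should": [{"match": {field: kw}} for kw in keywords],
                        "minimum_should_match": 1,
                    }
                },
                "functions": [
                    {"filter": {"term": {field: kw}}, "weight": keyword_scores[kw]}
                    for kw in keywords
                ],
                "score_mode": "sum",       # add up weights of matching keywords
                "boost_mode": "replace",   # ignore the text relevance score
            }
        }
    }


aggs_body = topic_keywords_aggs()
entries_body = topic_entries_query({"drone": 3.0, "camera": 1.5})
```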

Page 26: Elasticsearch in hatena bookmark

Bookmark Counter

● Count the number of bookmarks in a web site
  ○ Count by Sum Aggregation
  ○ eg. http://d.hatena.ne.jp/

{
  "query": {
    "prefix": { "url": "http://d.hatena.ne.jp/" }
  },
  "aggs": { "total_count": {
    "sum": { "field": "count" }
  } }
}
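The counter can be sketched end to end as a body builder plus a reader for the aggregation result. The response shape used here ("aggregations" → aggregation name → "value") is the standard one for a sum aggregation. The function names and the sample response are made up for illustration.

```python
def bookmark_count_body(site_prefix):
    """Sum the per-entry bookmark counts for all URLs under site_prefix."""
    return {
        "query": {"prefix": {"url": site_prefix}},
        "aggs": {"total_count": {"sum": {"field": "count"}}},
    }


def total_from_response(response):
    """Read the summed count out of a search response (sum returns a float)."""
    return int(response["aggregations"]["total_count"]["value"])


body = bookmark_count_body("http://d.hatena.ne.jp/")

# Made-up response fragment, only to show where the value lives.
sample_response = {"aggregations": {"total_count": {"value": 1234.0}}}
total = total_from_response(sample_response)
```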

Page 27: Elasticsearch in hatena bookmark

Conclusion

● Elasticsearch in Hatena Bookmark

● Features powered by Elasticsearch
  ○ Tag / Title / Content / URL Search
  ○ Related entry
  ○ Issue
  ○ Topic
  ○ Bookmark Counter