Elasticsearch in hatena bookmark
Elasticsearch in Hatena Bookmark
Shunsuke KOZAWA
About Me
● Shunsuke KOZAWA
  ○ Hatena id: skozawa
  ○ Twitter: @5kozawa
● 2007 - 2012
  ○ Research: Natural Language Processing
  ○ Ph.D. in Information Science
● 2012 -
  ○ Hatena Inc.
    ■ Hatena Bookmark
    ■ Ad-tech
Hatena Bookmark
Social Bookmark Service
Search Engine History in Hatena Bookmark
2005 - 2007: MySQL LIKE
2008 - 2012: Sedue (by Preferred Infrastructure)
2012 - 2014/06: Solr
2014/06 - : Elasticsearch
ref. http://bookmark.hatenastaff.com/entry/2014/06/27/180000
System Architecture
Mapping (partial) of Hatena Bookmark
{ "entry": {
    "properties": {
      "url": { "type": "string" },
      "title": { "type": "string" },
      "content": { "type": "string" },
      "count": { "type": "integer" },
      "created": { "type": "date" },
      "bookmark": {
        …
      }
    }
} }

"bookmark": {
  "type": "nested",
  "properties": {
    "user": { "type": "string" },
    "tag": { "type": "string" },
    "comment": { "type": "string" },
    "created": { "type": "date" }
  }
}
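Because `bookmark` is mapped as `nested`, each bookmark is indexed as its own hidden document, and a tag lookup has to go through a nested query. The slides do not show the actual tag-search query, so the following is only a minimal sketch; the helper name `tag_query` and the choice of a `match` clause are assumptions.

```python
import json


def tag_query(tag):
    """Build a nested query for entries that have a bookmark tagged `tag`.

    Hypothetical helper: the real tag-search query is not shown in the slides.
    """
    return {
        "query": {
            "nested": {
                # "path" names the nested object defined in the mapping
                "path": "bookmark",
                "query": {"match": {"bookmark.tag": tag}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(tag_query("elasticsearch"), indent=2))
```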
Features powered by Elasticsearch
● Entry Search
  ○ Tag Search
  ○ Title Search
  ○ Content Search
  ○ URL Search
● Related Entry
● Issue
● Topic
● Bookmark Counter
Tag/Title Search
Search by “Elasticsearch”
Sorting
Filter by the number of bookmarks
Filter by timestamp
{
  "sort": { "created": "desc" },
  "query": {
    "filtered": {
      "query": { "bool": { "must": [
        { "match_phrase": { "title": "elasticsearch" } }
      ] } },
      "filter": { "bool": { "must": [
        { "range": { "count": { "gte": 3 } } },
        { "range": { "created": {
          "from": "2015-05-01T00:00:00",
          "to": "2015-07-15T00:00:00"
        } } }
      ] } }
    }
  }
}
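The sort order and the two filters map directly onto the inputs of the search form. The sketch below assumes the Elasticsearch 1.x `filtered` query, with the scoring clause under `query` and the non-scoring conditions under `filter`. The function name and its parameters are hypothetical, and `gte`/`lte` are used for the date bounds.

```python
def title_search(phrase, min_count, created_from, created_to):
    """Build a title search sorted by newest first (hypothetical helper)."""
    return {
        "sort": {"created": "desc"},
        "query": {
            "filtered": {
                # Scoring part: phrase match against the entry title
                "query": {"match_phrase": {"title": phrase}},
                # Non-scoring part: bookmark count and timestamp window
                "filter": {"bool": {"must": [
                    {"range": {"count": {"gte": min_count}}},
                    {"range": {"created": {"gte": created_from,
                                           "lte": created_to}}},
                ]}},
            }
        },
    }


if __name__ == "__main__":
    body = title_search("elasticsearch", 3,
                        "2015-05-01T00:00:00", "2015-07-15T00:00:00")
    print(body["query"]["filtered"]["filter"])
```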
Content Search
Concept Search
● Simple Content Search
  ○ High recall, but low precision
  ○ Precision is important in Hatena Bookmark
● Concept Search
  ○ Query Expansion
    ■ Use search results retrieved by tag search
    ■ Expand queries using TF-IDF, IDF, and RIDF
      ● Term Vector API
  ○ Retrieve using expanded queries
    ■ e.g. 「京都」 (Kyoto) -> 「祇園、寺、神社、桜、京、...」 (Gion, temple, shrine, cherry blossoms, capital, ...)
ref. はてなブックマークの全文検索の精度改善 (Improving full-text search precision in Hatena Bookmark)
https://speakerdeck.com/takuyaa/hatenabutukumakuquan-wen-jian-suo-falsejing-du-gai-shan
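The query-expansion step can be illustrated with a small ranking function. The slides do not say how TF-IDF, IDF, and RIDF are combined, so this sketch simply ranks candidate terms by TF-IDF and reports RIDF (observed IDF minus the IDF a Poisson model would predict) next to each one. It assumes the document frequency and collection frequency of each term are already available, for example from term statistics such as those the Term Vector API can return. All names are hypothetical.

```python
import math
from collections import Counter


def expansion_terms(tagged_docs, doc_freq, coll_freq, n_docs, top_k=5):
    """Rank candidate expansion terms found in tag-search results.

    tagged_docs: list of token lists (entries retrieved by tag search)
    doc_freq:    term -> number of indexed documents containing the term
    coll_freq:   term -> total occurrences of the term in the index
    n_docs:      number of documents in the index
    Returns (term, tf_idf, ridf) tuples, best TF-IDF first.
    """
    tf = Counter(t for doc in tagged_docs for t in doc)
    scored = []
    for term, freq in tf.items():
        idf = -math.log2(doc_freq[term] / n_docs)
        # IDF expected if occurrences were spread by a Poisson process
        expected_idf = -math.log2(1.0 - math.exp(-coll_freq[term] / n_docs))
        scored.append((term, freq * idf, idf - expected_idf))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    docs = [["kyoto", "temple", "temple"], ["temple", "shrine"]]
    df = {"kyoto": 50, "temple": 10, "shrine": 5}
    cf = {"kyoto": 100, "temple": 20, "shrine": 5}
    for term, tfidf, ridf in expansion_terms(docs, df, cf, n_docs=1000):
        print(term, round(tfidf, 2), round(ridf, 2))
```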
URL Search
http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F
http://www.elastic.co/
{ "query": {
    "filtered": { "filter": {
      "bool": { "should": [
        { "prefix": {
          "url": "http://www.elastic.co/"
        } }
      ] }
    } }
} }
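The step from the `entrylist` URL to the prefix filter can be sketched as follows. This assumes the `url` field is indexed in a way that lets a prefix filter match the full URL (for example `not_analyzed`), which the partial mapping does not show. The function name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def url_search_query(entrylist_url):
    """Turn an entrylist URL into a prefix-filter query on the `url` field."""
    params = parse_qs(urlparse(entrylist_url).query)
    target = params["url"][0]  # parse_qs already percent-decodes the value
    return {"query": {"filtered": {"filter": {"bool": {"should": [
        {"prefix": {"url": target}}
    ]}}}}}


if __name__ == "__main__":
    print(url_search_query(
        "http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F"))
```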
URL Subdomain Search
hatenablog.com
*.hatenablog.com
Related Entry
ref. はてなブックマークに基づく関連記事レコメンドの開発 (Developing related-entry recommendation based on Hatena Bookmark)
http://www.slideshare.net/shunsukekozawa5/hatena-engineer-seminar-5
Issue
Made by editors in Hatena
Entries in special features
Writing Query DSL by hand is hard for non-engineers
Edit page for Issue
Friendly for non-engineers
{
  "query": {
    "bool": {
      "must": [
        { "range": { "count": { "gte": 5 } } }
      ],
      "should": [ (tags, keywords, urls) ],
      "must_not": [ (tags, keywords, urls) ],
      "minimum_should_match": 1
    }
  },
  "sort": { "created": "desc" }
}
The input on the edit page is translated into the query above.
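A translation step of this kind could look like the sketch below. The slides only show the `(tags, keywords, urls)` placeholders, so the clause chosen for each input type (a nested `match` for tags, `match_phrase` on the title for keywords, `prefix` for URLs) is an assumption, as are the function and parameter names.

```python
def issue_query(tags=(), keywords=(), urls=(),
                exclude_tags=(), exclude_keywords=(), exclude_urls=(),
                min_count=5):
    """Translate editor input into a bool query (hypothetical helper)."""

    def clauses(tag_list, keyword_list, url_list):
        # One clause per input value; the clause types are assumptions.
        out = [{"nested": {"path": "bookmark",
                           "query": {"match": {"bookmark.tag": t}}}}
               for t in tag_list]
        out += [{"match_phrase": {"title": k}} for k in keyword_list]
        out += [{"prefix": {"url": u}} for u in url_list]
        return out

    return {
        "query": {"bool": {
            "must": [{"range": {"count": {"gte": min_count}}}],
            "should": clauses(tags, keywords, urls),
            "must_not": clauses(exclude_tags, exclude_keywords, exclude_urls),
            "minimum_should_match": 1,
        }},
        "sort": {"created": "desc"},
    }


if __name__ == "__main__":
    print(issue_query(tags=["elasticsearch"],
                      urls=["http://www.elastic.co/"],
                      exclude_keywords=["solr"]))
```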
Topic
Estimate topics from entries in Hatena Bookmark
Topic Page
Entries related to the topic
Topic by Elasticsearch
● Acquire topic keywords
  ○ Two-layered Significant Terms Aggregation
● Acquire entries related to the topic
  ○ Function Score Query
  ○ Retrieve using topic keywords and their scores
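Topic keywords: 官邸、首相、ドローン、落下、カメラ (Prime Minister's Office, Prime Minister, drone, fall, camera)
● 首相官邸にドローン落下 けが人はなし :日本経済新聞 (Drone falls on the Prime Minister's Office, no injuries: Nikkei)
● 首相官邸の屋上にドローン落下、微量の放射線を検出 | Reuters (Drone falls on the roof of the Prime Minister's Office, trace radiation detected | Reuters)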
ref. はてなブックマークのトピックページの作り方 (How the Hatena Bookmark topic pages are built)
http://codezine.jp/article/detail/8767
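The two steps can be sketched as request bodies. The slides name the techniques but not the fields or parameters, so everything concrete here is an assumption: the field names passed in, the aggregation names `topics` and `keywords`, the use of `weight` with `score_mode: sum`, and a `term` filter that presumes each keyword is indexed as a single term.

```python
def topic_keywords_aggs(outer_field, inner_field, size=10):
    """Two-layered significant_terms aggregation (hypothetical field names)."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {"topics": {
            "significant_terms": {"field": outer_field, "size": size},
            # Second layer: terms significant within each first-layer bucket
            "aggs": {"keywords": {
                "significant_terms": {"field": inner_field, "size": size}
            }},
        }},
    }


def topic_entries_query(keyword_scores, field="title"):
    """function_score query that weights entries by matched topic keywords."""
    return {"query": {"function_score": {
        "query": {"bool": {
            "should": [{"match": {field: kw}} for kw in keyword_scores],
            "minimum_should_match": 1,
        }},
        "functions": [
            {"filter": {"term": {field: kw}}, "weight": score}
            for kw, score in keyword_scores.items()
        ],
        "score_mode": "sum",       # add up the weights of matched keywords
        "boost_mode": "replace",   # use the summed weight as the final score
    }}}


if __name__ == "__main__":
    print(topic_keywords_aggs("keyword", "keyword"))
    print(topic_entries_query({"ドローン": 2.0, "官邸": 1.5}))
```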
Bookmark Counter
● Count the number of bookmarks in a web site
  ○ Count by Sum Aggregation
  ○ e.g. http://d.hatena.ne.jp/
{
  "query": {
    "prefix": { "url": "http://d.hatena.ne.jp/" }
  },
  "aggs": { "total_count": {
    "sum": { "field": "count" }
  } }
}
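Reading the counter back out of the response might look like the following sketch. The helper names are hypothetical, `"size": 0` is added here only to skip the hits and does not appear in the slides, and the sample response contains made-up numbers in the shape a sum aggregation returns.

```python
def bookmark_count_request(site_prefix):
    """Request body that sums `count` over entries under a site prefix."""
    return {
        "size": 0,  # assumption: hits are not needed, only the aggregation
        "query": {"prefix": {"url": site_prefix}},
        "aggs": {"total_count": {"sum": {"field": "count"}}},
    }


def total_bookmarks(response):
    """Extract the summed bookmark count from a search response."""
    return int(response["aggregations"]["total_count"]["value"])


if __name__ == "__main__":
    print(bookmark_count_request("http://d.hatena.ne.jp/"))
    # Illustrative response fragment with made-up numbers
    sample = {"aggregations": {"total_count": {"value": 1234.0}}}
    print(total_bookmarks(sample))
```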
Conclusion
● Elasticsearch in Hatena Bookmark
● Features powered by Elasticsearch
  ○ Tag / Title / Content / URL Search
  ○ Related entry
  ○ Issue
  ○ Topic
  ○ Bookmark Counter