Ruby meetup evolution of bof search
Uploaded by miha-mencin
BoF?
- www.businessoffashion.com
- London-based fashion media startup
- one of the biggest DLabs products
Infrastructure
WordPress, Symfony2, MySQL, MongoDB, Redis, Elasticsearch, Ruby, HHVM, Go, AWS…
You name it....
The Challenge
● Make search work in such a way that relevant results are found. It should be a bit fuzzy, but not too much.
● It should degrade results from certain categories (daily digest), but not at the expense of accuracy.
● It should degrade old articles, but not always.
● Items that are shared more should rank higher, but it is not clear how much higher.
The challenge - example search queries:
● Tom Adeyoola - mentioned once in the body of an old article -> should be the 1st result.
● 'Tweets and tribes' - title of an old article -> expected to be the 1st result.
● Dolce Gabbana - the 1st result should be an article from 2011, the 2nd an article from 2014 (News and analysis).
Elasticsearch toolset
Search types:
● Fuzzy
● Match
● Phrase
They can all be used together in a 'bool' query.
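The combination can be sketched as a single bool query. The field name, query text, and boosts here are illustrative, not the real BoF configuration (that comes later in the talk):

```ruby
# One bool query combining the three search types as "should" clauses.
# A document matching any clause is returned; matching more clauses
# (or higher-boosted ones) scores higher.
query = {
  query: {
    bool: {
      should: [
        { match:        { title: { query: 'dolce gabbana', boost: 8 } } },
        { match_phrase: { title: { query: 'dolce gabbana', boost: 8 } } },
        { fuzzy:        { title: { value: 'dolce gabbana', boost: 1 } } }
      ],
      minimum_should_match: 1
    }
  }
}
```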
Boosting options
● Boost on field (title, content, category…)
● Boost on search type
● Boost mode (multiply, sum, avg, first, max, min)
Scoring Function
● Script score
"script_score": {
  "script": "_score * (category == 1 ? 0.2 : 1)"
}
● Decay functions (e.g. every 100 days the score should decrease by a factor of 0.2)
● Score mode (multiply, sum, avg, max, min)
Decay?
"DECAY_FUNCTION": {
"FIELD_NAME": {
"origin": "TODAY",
"scale": "10d",
"offset": "5d",
"decay" : 0.5
}
}
[Plot: score (y-axis) decaying over time (x-axis, ticks at 1y and 2y)]
Decay function can be Gauss, linear or exponential
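The gauss variant can be sketched from the formula Elasticsearch documents for function_score (a re-implementation to show what the parameters mean, not the actual ES code): sigma is chosen so that a value exactly `scale` past the offset scores exactly `decay`.

```ruby
# Gauss decay: full score within `offset` of the origin, then a bell-curve
# falloff calibrated so that distance == offset + scale scores `decay`.
def gauss_decay(value, origin:, scale:, offset: 0.0, decay: 0.5)
  distance = [0.0, (value - origin).abs - offset].max
  sigma_sq = -(scale.to_f**2) / (2.0 * Math.log(decay))
  Math.exp(-(distance**2) / (2.0 * sigma_sq))
end

gauss_decay(0,  origin: 0, scale: 10, offset: 5)  # inside the offset: 1.0
gauss_decay(15, origin: 0, scale: 10, offset: 5)  # offset + scale away: 0.5
```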
How do the parameters look?
boost_mode: 'avg',
score_mode: 'sum',
split_query: false,
min_score: 0.5,
use_fuzzy: true,
use_query_string: true,
use_match: true,
use_pharse: true,
use_decay_on_time: true,
use_views_weights: true,
use_shares_weights: true,
use_downboost_for_categories: true,
time_decay_scale: '30d',
time_decay_decay: 0.001,
view_weight_divisor: 100000,
shares_weight_divisor: 100000,
boost_factor_fuzzy: 1,
boost_factor_match: 8,
boost_factor_phrase: 8,
boost_title: 8,
boost_summary: 4,
boost_keywords: 2,
boost_content: 8,
boost_slug: 1,
boost_author: 1,
downboost_cat_2687: 0.5,
downboost_cat_4: 0.25,
downboost_cat_77: 0.125
I ended up fixing cases by trial and error. And often, when one case was fixed, another one broke.
Straight search vs. decay on + scoring:

Tom Adeyoola
  Straight search:    1. Tom Adeyoola  2. Tom Ford  3. Tom Ford  …
  Decay on + scoring: 1. Tom Ford  2. Tom Ford  …  77. Tom Adeyoola :(

Dolce Gabbana
  Straight search:    1. some 'irrelevant' result  2. other 'irrelevant' result
  Decay on + scoring: 1. 'relevant' result  2. 'relevant' result
But… I'm a programmer. Shouldn't the computer be doing these boring trial-and-error tasks for me?
Damn right! Let's just try all possible combinations of parameters. There are roughly 1.28e+36 of them, so brute force could take years. So what then?
Evolution FTW.
Darwin's theory of evolution was a concept
of such stunning simplicity, but it gave rise,
naturally, to all of the infinite and baffling
complexity of life. The awe it inspired in me
made the awe that people talk about in
respect of religious experience seem,
frankly, silly beside it. I'd take the awe of
understanding over the awe of ignorance
any day.
Douglas Adams
Evolutionary algorithm
Well, a genetic algorithm, to be precise. The loop:
1. Create a random population of search configurations.
2. Run the fitness function for each subject.
3. Choose the best subjects.
4. Create a new population from the best, and repeat.
Genome: the full parameter hash shown earlier (boost_mode, score_mode, the use_* flags, decay settings, per-field boosts, and category downboosts).
(Vocabulary)
Genome: a set of rules that determines how each subject will behave -> set of search parameters
Population: A set of all subjects -> set of all search objects
(Vocabulary)
Subject: a single member of the population -> a single search object, created from a genome: ElstSearch.new(huge_settings_hash)
Fitness function: a function that determines how good a subject is.
Step 1: Create a random population
boost_mode: ['sum', 'avg', 'max', 'min'].sample,
score_mode: ['sum', 'avg', 'max', 'min'].sample,
split_query: [true, false].sample,
min_score: rand,
…
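Put together, step 1 can be sketched like this: each gene has its own sampler, and a genome is one sample of every gene. `GENE_POOL` and the gene choices are illustrative, covering only a few of the ~30 real parameters:

```ruby
# Each gene maps to a lambda that samples a random legal value for it.
GENE_POOL = {
  boost_mode:  -> { %w[sum avg max min].sample },
  score_mode:  -> { %w[sum avg max min].sample },
  split_query: -> { [true, false].sample },
  min_score:   -> { rand },
  boost_title: -> { rand(0..100) }
}.freeze

# A genome is one random sample of every gene.
def random_genome
  GENE_POOL.transform_values(&:call)
end

population = Array.new(100) { random_genome }
```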
Step 2: Run the fitness function for each subject
The fitness function scores each subject: it measures how well an individual subject solves our example queries, and therefore how suitable it is for further breeding.
def fitness_tweets_and_tribes
  rs = search('tweets and tribes')
  place = rs.string_place(:title, 'tweets and tribes')
  increase_score(1 / place.to_f) if place
end

def fitness_chris_morton
  rs = search('chris morton')
  place1 = rs.string_place(:title, 'One Cart to Rule Them All')
  place2 = rs.string_place(:title, 'Net Native')
  increase_score(1 / place1.to_f)
  increase_score(1 / place2.to_f)
  increase_score(1) if place1 < place2
end
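The rank-based scoring idea can be made runnable with a toy result set. `SearchResult` and the article titles here are made up; the real `search`/`string_place` hit the actual Elasticsearch index:

```ruby
# A fake result set that supports the string_place lookup used above.
class SearchResult
  def initialize(titles)
    @titles = titles
  end

  # 1-based rank of the first result whose title contains the string, or nil.
  def string_place(_field, str)
    idx = @titles.index { |t| t.downcase.include?(str.downcase) }
    idx && idx + 1
  end
end

rs = SearchResult.new(['Tom Ford in Milan', 'Tweets and Tribes', 'Archive'])
score = 0.0
place = rs.string_place(:title, 'tweets and tribes')
score += 1.0 / place if place
# The target article ranks 2nd, so this fitness case contributes 0.5.
```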
Step 3: Choose the best subjects
Sort by score, take top 10.
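Selection is essentially one line. `Subject` is an illustrative Struct, and the random scores stand in for real fitness results:

```ruby
# A subject pairs a genome with the fitness score it earned.
Subject = Struct.new(:genome, :score)

population = Array.new(100) { Subject.new({}, rand) }

# Keep the 10 fittest subjects as parents for the next generation.
parents = population.max_by(10, &:score)
```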
Step 4: Create a new population from the best
Mating two DNAs: take each gene from a random parent.

dna1  = ['avg', 'min', true, true, false, 14, …]
dna2  = ['max', 'max', false, true, true, 18, …]
child = ['avg', 'max', true, true, false, 18, …]
We create 100 children from the 10 best subjects of the current generation.
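This gene-by-gene mating is uniform crossover, and it fits in a couple of lines:

```ruby
# Uniform crossover: each gene of the child is copied from a randomly
# chosen parent.
def crossover(dna1, dna2)
  dna1.zip(dna2).map(&:sample)
end

dna1  = ['avg', 'min', true, true, false, 14]
dna2  = ['max', 'max', false, true, true, 18]
child = crossover(dna1, dna2)
```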
The problem I: inbreds
After a while, the top subjects all give the same score. The score doesn't increase even after 1000s of generations.
Solution: mutations
dna1  = ['avg', 'min', true, true, false, 14, …]
dna2  = ['max', 'max', false, true, true, 18, …]
Take each gene from a random parent + mutate:
child = ['avg', 'avg', true, true, false, 4, …]
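Crossover with mutation can be sketched like this: each gene comes from a random parent, but with a small probability it is replaced by a fresh random value instead. The samplers and the 10% rate are illustrative:

```ruby
# One sampler per gene position (same idea as the gene pool in step 1).
SAMPLERS = [
  -> { %w[sum avg max min].sample },  # boost_mode
  -> { %w[sum avg max min].sample },  # score_mode
  -> { [true, false].sample },        # split_query
  -> { [true, false].sample },        # use_fuzzy
  -> { [true, false].sample },        # use_match
  -> { rand(0..100) }                 # a numeric boost
].freeze

# Uniform crossover, but each gene mutates to a fresh sample with
# probability `rate`.
def mutate_crossover(dna1, dna2, rate: 0.1)
  dna1.zip(dna2).each_with_index.map do |(g1, g2), i|
    rand < rate ? SAMPLERS[i].call : [g1, g2].sample
  end
end

child = mutate_crossover(['avg', 'min', true, true, false, 14],
                         ['max', 'max', false, true, true, 18])
```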
Solution: increase the mutation rate
● If too many generations have passed since the last best-score improvement, increase the mutation rate.
● If that does not help, push the last best genome back into the breeding pairs.
● If that does not help, create a new random generation.
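The escalation above can be sketched as a stagnation-driven mutation rate. The generation thresholds and rates here are illustrative; the talk gives the strategy, not the numbers:

```ruby
# Raise the mutation rate the longer the best score has stagnated.
def mutation_rate(generations_since_improvement)
  case generations_since_improvement
  when 0...50   then 0.01  # business as usual
  when 50...200 then 0.25  # stagnating: mutate much more aggressively
  else               1.00  # still stuck: effectively a fresh random generation
  end
end
```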
Result
boost_mode: 'sum',
score_mode: 'sum',
split_query: false,
min_score: 0.27697059014144987,
use_fuzzy: true,
use_query_string: false,
use_match: false,
use_pharse: true,
use_decay_on_time: true,
use_views_weights: true,
use_shares_weights: false,
use_downboost_for_categories: true,
time_decay_scale: '2491d',
time_decay_decay: 0.3306393289954982,
view_weight_divisor: 504537,
shares_weight_divisor: 703657,
boost_factor_fuzzy: 85,
boost_factor_match: 0,
boost_factor_phrase: 68,
boost_title: 86,
boost_summary: 30,
boost_keywords: 32,
boost_content: 75,
boost_slug: 45,
boost_author: 27,
downboost_cat_2687: 0.73255136431042,
downboost_cat_4: 0.1384377696262037,
downboost_cat_1: 0.7109314266874576