Predictive Parallelization: Taming Tail Latencies in Web Search


Transcript of Predictive Parallelization: Taming Tail Latencies in Web Search

Predictive Parallelization: Taming Tail Latencies in Web Search

Myeongjae Jeon, Saehoon Kim, Seung-won Hwang, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner
Microsoft Research, POSTECH, Rice University

Performance of Web Search

1) Query response time – answer users quickly (e.g., within 300 ms)

2) Response quality (relevance) – provide highly relevant web pages; quality improves with the resources and time consumed

Focus: improving response time without compromising quality.

Background: Query Processing Stages

Query → 1) Doc. index search → 2) 2nd-phase ranking → 3) Snippet generator → Response
– Doc. index search: 100s – 1000s of good matching docs
– 2nd-phase ranking: 10s of the best matching docs
– Snippet generator: a few sentences for each doc

Latency SLA for the whole pipeline: for example, 300 ms.

Focus: Stage 1 (doc. index search).

Goal

Speed up index search (stage 1) without compromising result quality:
– Improve user experience
– Serve a larger index
– Allow a more sophisticated 2nd phase

[Pipeline diagram as above: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response, with a 300 ms latency SLA.]

How Index Search Works

• Partition all web pages across index servers (massively parallel)
• Distribute query processing (embarrassingly parallel)
• Aggregate top-k relevant pages

[Diagram: the aggregator fans the query out to index servers, one per partition of all web pages; each server returns its local top-k pages, and the aggregator merges them into the overall top-k.]

Problem: a slow server makes the entire cluster slow.
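To make the scatter-gather pattern concrete, here is a minimal Python sketch of the fan-out and top-k merge. The search_partition and score helpers are hypothetical stand-ins, not the engine's actual API:

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    def score(query, doc):
        # Placeholder relevance score; a real engine uses a learned ranker.
        return sum(doc.count(term) for term in query.split())

    def search_partition(partition, query, k):
        # Each index server scores only its own partition of the web index
        # and returns its local top-k (score, doc_id) pairs.
        scored = ((score(query, doc), doc_id) for doc_id, doc in partition.items())
        return heapq.nlargest(k, scored)

    def aggregate_search(partitions, query, k):
        # Fan the query out to all index servers (embarrassingly parallel:
        # the partitions share nothing), then merge the local top-k lists.
        with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
            local_tops = pool.map(lambda p: search_partition(p, query, k), partitions)
        return heapq.nlargest(k, (hit for top in local_tops for hit in top))

In code terms, aggregate_search returns only after its slowest search_partition call completes, which is exactly why one slow server drags down the whole cluster.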

Observation

• Query processing runs on every server, so response time is determined by the slowest one.
• We need to reduce its tail latencies.

[Diagram: aggregator and index servers; a single slow index server turns a fast response into a slow one.]

Examples

• Terminate a long query (outlier) in the middle of processing → fast response, but a drop in quality.

Parallelism for Tail Reduction

Opportunity:
• Available idle cores
• CPU-intensive workloads

Challenge:
• Tails are few
• Tails are very long

Latency distribution:

  Percentile   Latency      Scale
  50%tile      7.83 ms      x1
  75%tile      12.51 ms     x1.6
  95%tile      57.15 ms     x7.3
  99%tile      204.06 ms    x26.1

Latency breakdown for the 99%tile:

  Component    Latency
  Network      4.26 ms
  Queueing     0.15 ms
  I/O          4.70 ms
  CPU          194.95 ms

CPU time dominates the tail: 194.95 ms of the 204.06 ms total, about 95%, so parallelizing a query's CPU work can shorten it.

Predictive Parallelism for Tail Reduction

• Short queries: many, but almost no speedup from parallelism
• Long queries: few, and good speedup

[Figure: execution time and speedup vs. parallelism degree (1–6), for two query classes. Short queries (< 30 ms): 5.2 ms sequential vs. 4.5 ms parallel, little gain. Long queries (> 80 ms): 169 ms sequential vs. 41 ms parallel, roughly a 4x speedup.]

Predictive Parallelization Workflow

Query → Execution-time predictor → Index server

Predict the (sequential) execution time of the query with high accuracy.

Predictive Parallelization Workflow

Query → Execution-time predictor → Resource manager → Index server

Using the predicted time, selectively parallelize long queries: predicted-long queries run in parallel, predicted-short queries run sequentially, as sketched below.
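A minimal sketch of this workflow; the predictor.predict_ms and index_server.search interfaces are assumptions for illustration, not the system's real API:

    LONG_QUERY_THRESHOLD_MS = 80   # long queries in this workload run > 80 ms
    MAX_DEGREE = 6                 # parallelism degrees 1-6, as in the figures above

    def process_query(query, predictor, index_server):
        # 1) Predict the sequential execution time of the query
        #    (cheap: about 0.55 ms in this system).
        predicted_ms = predictor.predict_ms(query)
        # 2) Selectively parallelize: short queries gain almost nothing from
        #    extra cores, so only predicted-long queries receive them.
        degree = MAX_DEGREE if predicted_ms > LONG_QUERY_THRESHOLD_MS else 1
        return index_server.search(query, degree=degree)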

Predictive Parallelization

• Focus of today's talk:
1. Prediction: identify long queries through machine learning
2. Parallelization: execute long queries in parallel with high efficiency

Brief Overview of Predictor

Accuracy requirement: high recall, to guarantee the 99%tile reduction. In our workload, 4% of queries run longer than 80 ms, so at least 3% of queries must be identified as long (75% recall).

Cost requirement: low prediction overhead and misprediction cost: a prediction overhead of 0.75 ms or less, and high precision.

Existing approaches have lower accuracy and higher cost.

Accuracy: Predicting Early Termination

• Only a limited portion of the index contributes to the top-k relevant results.
• How large that portion is depends on the keyword (more precisely, on its score distribution).

[Diagram: inverted index for "SIGIR" over web documents sorted by static rank, from highest (Doc 1) to lowest (Doc N). Only a prefix of the list is processed; the rest is not evaluated.]
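For illustration, a minimal sketch of early termination over a posting list sorted by static rank. The stopping rule here (give up after the top-k stops improving) is a simplified stand-in for the engine's actual criterion:

    import heapq

    def search_posting_list(postings, k, patience=1000):
        # `postings` is sorted by static rank, best documents first, so
        # top-k candidates concentrate near the front of the list.
        heap = []                       # min-heap of the k best (score, doc_id)
        since_improvement = 0
        for doc_id, score in postings:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
                since_improvement = 0
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
                since_improvement = 0
            else:
                since_improvement += 1
                # Simplified stopping rule: if the top-k has not improved for
                # a while, skip the remaining (lower static rank) tail.
                if since_improvement >= patience:
                    break
        return sorted(heap, reverse=True)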

Space of Features

• Term features [Macdonald et al., SIGIR 12]
– IDF, NumPostings
– Score (arithmetic, geometric, and harmonic means, max, variance, gradient)
• Query features
– NumTerms (before and after rewriting)
– Relaxed
– Language

New Features: Query

• Rich clues from queries in modern search engines.

Fields related to the query execution plan: rank=BM25F, enablefresh=1, partialmatch=1, language=en, location=us, ...
Fields related to search keywords: SIGIR (Queensland or QLD)


Space of Features

  Category             Features
  Term features (14)   AMeanScore, GMeanScore, HMeanScore, MaxScore, EMaxScore, VarScore, NumPostings, GAvgMaxima, MaxNumPostings, In5%MaxNum, ThresProK, IDF
  Query features (6)   English, NumAugTerm, Complexity, RelaxCount, NumBefore, NumAfter

• All features are cached in memory to ensure responsiveness (avoiding disk access).
• Term features require a 4.47 GB memory footprint (for 100M terms).
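A minimal sketch of how such a cache might be built offline; the feature names come from the table above, but the exact computations shown are assumptions:

    import math
    from statistics import mean, variance

    def build_term_feature_cache(index, total_docs):
        # Precompute per-term features once, so query-time prediction is a
        # memory lookup instead of a walk over posting lists on disk.
        cache = {}
        for term, postings in index.items():   # postings: [(doc_id, score), ...]
            scores = [s for _, s in postings]
            n = len(scores)
            cache[term] = {
                "IDF": math.log(total_docs / n),
                "NumPostings": n,
                "AMeanScore": mean(scores),
                # Geometric mean of scores; assumes positive scores.
                "GMeanScore": math.exp(mean(math.log(s) for s in scores)),
                "MaxScore": max(scores),
                "VarScore": variance(scores) if n > 1 else 0.0,
            }
        return cache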

Feature Analysis and Selection

• Recall gain per feature from the boosted regression tree suggests a cheaper subset of features is enough.

[Figure: recall vs. number of features (sorted by importance), comparing all features against sorted subsets; recall rises from about 0.60 and saturates around 0.85 well before all features are used.]

Prediction Performance

• Query features are important.
• Using cheap features is advantageous:
– IDF from the keyword features, plus the query features
– Much smaller overhead (90+% less)
– Accuracy similar to using all features

  80 ms threshold     Precision (|A∩P|/|P|)   Recall (|A∩P|/|A|)   Cost
  Keyword features    0.76                     0.64                 High
  All features        0.89                     0.84                 High
  Cheap features      0.86                     0.80                 Low

  (A = actual long queries, P = predicted long queries)

Algorithms

• Classification vs. regression:
– Comparable accuracy
– Regression offers more flexibility (e.g., the long-query threshold can change without retraining the model)
• Algorithms evaluated:
– Linear regression
– Gaussian process regression
– Boosted regression tree
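For illustration, the boosted regression tree could be trained on logged (features, measured sequential time) pairs. This sketch uses scikit-learn's GradientBoostingRegressor; the library choice and hyperparameters are assumptions, not what the authors used:

    from sklearn.ensemble import GradientBoostingRegressor

    def train_time_predictor(X_train, y_train_ms):
        # Regress execution time directly rather than classifying long vs.
        # short, so one model can serve any long-query threshold.
        model = GradientBoostingRegressor(
            n_estimators=200, max_depth=4, learning_rate=0.1)
        model.fit(X_train, y_train_ms)
        return model

    # Usage: treat a query as long if its predicted time exceeds 80 ms.
    # model = train_time_predictor(X_train, y_train_ms)
    # is_long = model.predict([query_features])[0] > 80.0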

Accuracy of Algorithms

• Summary:
– 80% of long queries (> 80 ms) identified
– 0.6% of short queries mispredicted
– 0.55 ms prediction time, with low memory overhead

• Key idea– Parallelize only long queries

• Use a threshold on predicted execution time

• Evaluation– Compare Predictive to other baselines

• Sequential• Fixed• Adaptive

Predictive Parallelism
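To make the comparison concrete, the four policies can be seen as degree-selection rules. This is a hedged sketch: the predictive rule follows the description above, while the adaptive rule is only a rough paraphrase of "system load only":

    def degree_sequential(predicted_ms, load):
        return 1                                  # never parallelize

    def degree_fixed(predicted_ms, load, d=3):
        return d                                  # parallelize every query

    def degree_adaptive(predicted_ms, load, max_degree=6):
        # Parallelize every query, backing off as system load rises.
        return max(1, round(max_degree * (1.0 - load)))

    def degree_predictive(predicted_ms, load, threshold_ms=80, max_degree=6):
        # Spend extra cores only on predicted-long queries.
        return max_degree if predicted_ms > threshold_ms else 1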

99%tile Response Time

• Predictive outperforms "parallelize all" approaches, sustaining a 50% throughput increase.

[Figure: 99%tile response time (50–200 ms) vs. query arrival rate (QPS) for Sequential, Fixed (degree=3), Adaptive, and Predictive.]

Related Work

• Search query parallelism:
– Fixed parallelization [Frachtenberg, WWWJ 09]
– Adaptive parallelization using system load only [Raman et al., PLDI 11]
→ High overhead due to parallelizing all queries
• Execution time prediction:
– Keyword-specific features only [Macdonald et al., SIGIR 12]
→ Lower accuracy and high memory overhead for our target problem

Your query to Bing is now parallelized if it is predicted to be long.

Thank You!
