
Concatenation, Embedding and Sharding: Do HTTP/1 Performance Best Practices Make Sense in HTTP/2?

Robin Marx, Peter Quax, Axel Faes and Wim Lamotte
UHasselt-tUL-imec, EDM, Hasselt, Belgium

{first.last}@uhasselt.be

Keywords: Web Performance, Best Practices, HTTP, Server Push, SpeedIndex, Networking, Measurements.

Abstract: Web page performance is becoming increasingly important for end users but also more difficult to provide by web developers, in part because of the limitations of the legacy HTTP/1 protocol. The new HTTP/2 protocol was designed with performance in mind, but existing work comparing its improvements to HTTP/1 often shows contradictory results. It is unclear for developers how to profit from HTTP/2 and whether current HTTP/1 best practices such as resource concatenation, resource embedding, and hostname sharding should still be used. In this work we introduce the Speeder framework, which uses established tools and software to easily and reproducibly test various setup permutations. We compare and discuss results over many parameters (e.g., network conditions, browsers, metrics), both from synthetic and realistic test cases. We find that in most non-extreme cases HTTP/2 is on a par with HTTP/1 and that most HTTP/1 best practices are applicable to HTTP/2. We show that situations in which HTTP/2 currently underperforms are mainly caused by inefficiencies in implementations, not due to shortcomings in the protocol itself.

1 INTRODUCTION

For a long time, version 1 of the HyperText Transfer Protocol (HTTP/1, h1) has been the only way to request and transfer websites from servers to clients. Under the pressure of ever more complex websites and the need for better website load performance, h1 and its practical usage have evolved significantly over the last two decades. Most of the changes help to combat the fundamental limitation of h1 that only a single resource can be requested per underlying TCP connection at the same time (where slow resources can delay others, called “head-of-line (HOL) blocking”), by better (re-)use of these connections or by making use of parallel connections. The protocol specification now includes techniques such as request pipelining and persistent connections. In practice, further workarounds and tricks have become commonplace to help speed up h1: modern browsers will open several parallel connections (up to six per hostname and 30 in total in most browser implementations) and web developers apply techniques like resource concatenation, resource embedding and hostname sharding as best practices (Grigorik, 2013). However, these heavily parallelized setups induce large additional overheads while not providing extensive performance benefits in the face of ever more complex websites.

A redesign of the protocol was needed and, starting with SPDY, Google led a movement back towards using just a single TCP connection, this time allowing multiple resources to be multiplexed at the same time, alleviating the HOL blocking of h1 and making better use of the TCP protocol. This effort culminated in the standardization of HTTP/2 (h2) (Belshe et al., 2015). Advanced prioritization schemes are provided to help make optimal use of this multiplexing and the novel h2 Push mechanism allows servers to send resources to a client without receiving a request for them first.

In theory, h2 should solve most of the problems of h1 and as such make its best practices and optimizations obsolete or replaceable (Grigorik, 2013). However, in practice the gains from h2 are limited by other factors and implementation details. Firstly, h2 eliminates application-layer HOL blocking but it could be much more sensitive to transport-layer HOL blocking (induced by TCP’s guarantee of in-order delivery combined with retransmits when packet loss is present) due to the single underlying connection (Goel et al., 2016). Secondly, the multiplexing is highly dependent on correct resource prioritization and might introduce its own implementation overhead as chunks need to be aggregated. Finally, complex inter-dependencies between resources and late resource discovery might also lessen the gains from h2 (Netravali et al., 2016). That h2 is not a simple drop-in replacement with consistently better performance than h1 is also clear from previous studies, which often find cases where h2 is significantly slower than h1. Many of these studies also show contradictory results, potentially because they use different experimental setups. We discuss related work in Section 2.

This leaves web developers and hosting companies with an unclear path forwards: they cannot just enable h2 because it could deteriorate performance, but they also do not know which concrete changes their websites need in order to make optimal use of the new protocol.

In this paper, we look at the most common h1 optimizations: resource concatenation, resource embedding and hostname sharding. We evaluate their performance over three versions of the HTTP protocol: the secure HTTPS/2 (h2s) and HTTPS/1.1 (h1s), and also the unencrypted HTTP/1.1 (h1c), because many websites still use this “cleartext” version. We do not include h2c, as modern browsers choose to only support h2s for security reasons. Note additionally that switching from h1c to a secure setup (either h1s or h2s) could have its own performance impact, as TLS connections typically require additional network round-trips to set up. The next sections use h2 to refer to h2s and h1 to refer to both h1s and h1c at the same time. Our main contributions are as follows:

• We introduce the Speeder framework for web performance measurement, which combines a large number of off-the-shelf software packages to provide various test setup permutations, leading to a broad basis for comparison and interpretation of experimental results.

• We show that resource concatenation and hostname sharding are still best practices when using h2 and that HTTP/2 Push is a good replacement for resource embedding.

• We compare h2 to both h1s and h1c to find that h2 rarely significantly improves performance over h1 (both secure and cleartext) and can severely slow down sites currently on h1c. However, in most cases bad network conditions do not seem to impact h2 much more than they impact h1.

• We use the SpeedIndex metric (Meenan, 2012) to show that h2 is often later than h1 to start rendering the web page, and we discover various browser quirks. Additionally, we show that implementation details can have a significant impact on observed performance.

We will first look at hand-crafted experiments on synthetic data (Section 4). These test cases are intended to make the underlying behavior of the protocols and their implementations clearer, and so are often not entirely realistic or involve extreme circumstances. Secondly, we look at more realistic data based on existing websites (Section 5). We expect that, compared to the experiments on synthetic pages, these test cases will show similar but more nuanced results and trends.

2 RELATED WORK

Various authors have published works comparing the performance of h2 and its predecessor SPDY to h1. Some also discuss h1 best practices and h2 Push. We highlight some of their findings and methods.

In “How Speedy is SPDY?” (Wang et al., 2014) the authors employ isolated test cases to better assess the impact of various parameters (latency, bandwidth, loss rate, initial TCP window, number of objects and object sizes). They observe that “object size and packet loss rate are the most important factors in predicting SPDY performance”; in other words, SPDY hurts when packet loss is high (mainly due to the single underlying TCP connection) but helps for many small objects. They also find it “helps for many large objects when the network is fast”. For real pages, they find that SPDY helps up to 80% of pages under low bandwidths, but only 55% of pages under high bandwidth.

“Towards a SPDY’ier Mobile Web?” (Erman et al., 2013) performs an analysis of SPDY over a variety of real networks and finds that underlying cellular protocols can have a profound impact on its performance. For 3G, SPDY performed on a par with h1, with LTE showing only some improvements over h1. A faster 802.11g network did yield improvements of 4% to 56%. They further conclude that using multiple TCP connections (∼sharding) does not help SPDY.

“Is The Web HTTP/2 Yet?” (Varvello et al., 2016) periodically measures page load performance by loading real websites over real networks from their own origin servers. They observe that most websites using h2 also still use h1 best practices like embedding and sharding. They find that “these practices make h2 more resilient to packet loss and jitter”, that 80% of the observed pages perform better over h2 than over h1, and that h2’s advantage grows in mobile networks. On the other hand they also find that the remaining 20% of the pages suffer a loss of performance.

“Rethinking Sharding for Faster HTTP/2” (Goel et al., 2016) introduces a novel network emulation technique based on measurements from real cellular networks. They use this technique to assess the performance of the hostname sharding optimization over h2. They find that h2 performs well for pages with large amounts of small and medium sized objects, but suffers from higher packet loss and larger file sizes. They demonstrate that h2 performance can be improved by sharding, though it will not always reach parity with h1.

“HTTP/1.1 pipelining vs HTTP2 in-the-clear: performance comparison” (Corbel et al., 2016) compares h1c to h2c (a mode of the protocol not currently supported by any of the main browsers). They disregard browser computational overhead while running experiments over a network emulated by TC NETEM, and find that on average h2c is 15% faster than h1c and that “h2c is more resilient to packet loss than h1c”.

Additional academic work found that for a packet loss of 2%, “h2s is completely defeated by h1s” and that even naive server push schemes can yield up to 26% improvements (Liu et al., 2016). Others conclude that h2 is mostly interesting for complex websites, showing up to a 48% decrease in page load time, 10% when using h2 Push, and that h2 is resilient to higher latencies but not to packet loss (de Saxce et al., 2015). Further experiments indicate that h2 Push seems to improve page load times under almost all circumstances (de Oliveira et al., 2016). Finally, Carlucci et al. find that packet loss has a very high impact on SPDY, up to a 120% increase of page load time on a high bandwidth network (Carlucci et al., 2015).

Our review of related work clearly shows that the current state of the art is often contradictory in its conclusions regarding h2 performance. It is not clear whether sharding still provides significant benefits, whether h2 is resilient to poor network conditions, and what degree of improvement developers might expect when moving to h2.

We believe that one of the reasons for these contradictions is the way in which researchers construct, execute and measure their experiments. Additionally, most authors only use a single software tool for most parts of the process, e.g., just a single server package, one browser or client, one network setup and one metric. Without comparing different tools and metrics, some results could be attributed to the tested protocol itself, while they are actually caused by implementation details in the tool being used. Moreover, developers often use alternative tools that provide more complex metrics to assess their site performance (e.g., Google Chrome devtools, webpagetest.org and “Real User Monitoring” tools) (Gooding and Garza, 2016).

Finally, in this fast moving field, even the protocols evolve very rapidly. Experiments run two years or even a couple of months ago might yield different results on newer versions of the used software.

All this means that results and conclusions are often difficult to compare and reproduce. We aspire to help resolve this problem by introducing the Speeder framework for web performance measurements.

3 THE SPEEDER FRAMEWORK

Recognizing the fact that different implementations can influence experiment results and conclusions and that these are often difficult to compare, we introduce the Speeder framework for web performance measurement. Speeder allows users to run and compare experiments on a variety of test setups using various software packages.

Table 1: Speeder software and metrics.

Protocols      HTTP/1.1 (cleartext), HTTPS/1.1, HTTPS/2
Browsers       Chrome (51 - 54), Firefox (45 - 49)
Test drivers   Sitespeed.io (3), Webpagetest (2.19)
Servers        Apache (2.4.20), NGINX (1.10), NodeJS (6.2.1)
Network        DUMMYNET (cable and cellular) (Webpagetest); fixed TC NETEM (cable and cellular); dynamic TC NETEM (cellular) (Goel et al., 2016)
Metrics        All Navigation Timing values (Wang, 2012), SpeedIndex, firstPaint, visualComplete, other Webpagetest metrics (Meenan, 2016)

It automates setting up and driving the various tools to make the overhead of maintaining multiple setups manageable. Users simply need to select the desired setup permutations and the framework collects and aggregates a multitude of key metrics. Users can then utilize various visualization tools to compare the results.
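To make the permutation idea concrete, the sketch below enumerates a test matrix from the components listed in Table 1. It is an illustrative outline only: the component lists mirror Table 1, but the function and dictionary names are assumptions, not Speeder's actual interface.

# Hypothetical sketch of enumerating a Speeder-style test matrix.
from itertools import product

protocols = ["h1c", "h1s", "h2s"]
browsers  = ["chrome", "firefox"]
drivers   = ["sitespeed.io", "webpagetest"]
servers   = ["apache", "nginx", "nodejs"]
networks  = ["fixed_cable", "fixed_cellular", "dynamic_cellular"]

def run_experiment(setup):
    """Placeholder: deploy the containers for `setup`, load the page N times,
    and store the aggregated metrics (loadEventEnd, SpeedIndex, ...)."""
    print("would run:", setup)

for proto, browser, driver, server, network in product(
        protocols, browsers, drivers, servers, networks):
    run_experiment({"protocol": proto, "browser": browser,
                    "driver": driver, "server": server, "network": network})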

Speeder mainly aims to accomplish two different goals in the following ways:

• Make it easier to differentiate between protocol- and implementation-related behavior: Speeder includes a number of different software packages for each setup component (see Table 1). Through mutual combination, these provide a large number of permutations for possible experiment setups, which leads to a large comparison basis for the results.

• Make it easier to create reproducible results for both researchers and developers: Speeder uses freely available off-the-shelf and open source tools and software. We deploy our setups using Docker containers, which can easily be updated with new software, and older configurations can be re-used if needed. Docker images can also be distributed and shared with others.

Speeder focuses on the concept of a “full emulated setup” (both client and server), though it can conceivably also measure real web pages over real networks (only client) or serve experiments for testing with external tools (only server). This approach should apply to researchers and also to developers during their development process.

Unless indicated otherwise, the results in this work are from an experimental setup using NGINX on the server, Google Chrome driven by Webpagetest and the dynamic network model (Goel et al., 2016).

Readers are encouraged to review our full dataset (including results not presented in this paper for Apache, sitespeed.io and the fixed network model), setup details and source code via https://speeder.edm.uhasselt.be.


Figure 1: Conceptual schematic of the Speeder Framework.

Setup: As displayed in Figure 1, Speeder is deployed on eight backend Linux Virtual Machines (VMs), eight sitespeed.io agent Linux VMs, five Webpagetest agent Windows 10 VMs and one Linux command-and-control (C&C) VM that includes an InfluxDB database to store aggregated test results. These are distributed across two physical Dell PowerEdge R420 machines with at least two dedicated logical cores and 2 GB RAM per VM. They are connected through 1 Gbit ethernet.

All Linux software is deployed inside Docker containers, with at most one container running in each VM at the same time to prevent test results being influenced by other tests. As such, no more than 8 individual tests can be running at the same time. Webpagetest is deployed without Docker, as there are VM-based distributions readily available and it is more difficult to Dockerize its Windows-only setup. Network emulation using TC NETEM is done in the server container.

While this type of highly virtualized setup can influence the measured load times, we believe it still allows accurate comparisons since all tests are equally influenced. Similar setups are used in related work (Goel et al., 2016; Corbel et al., 2016).

Metrics: Most related work uses loadEventEnd from the Navigation Timing API (Wang, 2012) as the main metric for page performance. However, this metric only indicates the total page load time and not how the visual rendering progresses over time: a page that stays empty for 5 s and only renders content in the last 0.6 s (page A) will have a better observed performance than a page that only completely finishes loading at 7.5 s but that had its main content drawn by 2.5 s (page B), while the latter arguably provides the better end-user experience.

Google’s SpeedIndex (Meenan, 2012) aims to provide a better indication of how progressively a page renders by recording a video during the page load and using image analysis to determine how much of a page was loaded at any given time. As such, it is more a metric of how early and often a page updates during load and not purely of how fast a page load completes. Perhaps counter-intuitively, this means page B would have a lower SpeedIndex than page A, even though it finishes loading later. SpeedIndex is expressed in milliseconds and lower values mean better performance (similar to loadEventEnd).
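As a small illustration of the metric, the sketch below computes a SpeedIndex-style score as the area above the visual-completeness curve, i.e., the integral of (1 - visual completeness) over time. The two timelines loosely follow the page A and page B example above; the exact sample points are assumed for illustration, not measured data.

# Minimal sketch of the SpeedIndex idea (Meenan, 2012): earlier rendering lowers the score.
def speed_index(samples):
    """samples: list of (time_ms, visual_completeness 0..1), sorted by time,
    interpreted as a step function that holds each value until the next sample."""
    si, prev_t, prev_vc = 0.0, 0.0, 0.0
    for t, vc in samples:
        si += (1.0 - prev_vc) * (t - prev_t)   # area still "unrendered"
        prev_t, prev_vc = t, vc
    return si

page_a = [(5000, 0.0), (5600, 1.0)]   # blank for 5 s, all content in the last 0.6 s
page_b = [(2500, 0.8), (7500, 1.0)]   # main content by 2.5 s, finishes at 7.5 s

print(speed_index(page_a))  # ~5600 ms
print(speed_index(page_b))  # ~3500 ms: lower (better) despite the later load end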

We find that the two metrics often show similar behavior (especially for synthetic test cases) and so, for easier comparison with related work, we present mainly loadEventEnd while comparing it to SpeedIndex in the case of significant divergence.

Network emulation: Because it is practically difficult to continuously test a large number of parameter permutations over real-life (cellular) networks, we aim to provide a number of different network emulation models so we can perform our tests in a locally controlled network. We have implemented two models so far: fixed and dynamic.

Table 2: Fixed network model.

preset      throughput (kbit/s)   latency / jitter (ms)   loss (%)
wifi        29000                 75 / 5                  0
wifi loss   29000                 75 / 5                  0.2
4g          4200                  95 / 10                 0
4g loss     4200                  95 / 10                 0.6
3g loss     750                   170 / 20                1
2g loss     500                   220 / 20                1.2

The fixed traffic model (see Table 2) was inspired by the Google Chrome devtools throttling settings1, with an added 75 ms round-trip latency to simulate transatlantic traffic2 and added loss based on Tyler Treat’s setup3. This model simply sets and keeps the parameters for TC NETEM constant for the duration of the test. This means there is a constant random packet loss.
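As an illustration of what the fixed model boils down to, the sketch below applies one preset from Table 2 (the "4g loss" row) with a single TC NETEM invocation. It assumes root privileges, an interface named eth0 and a netem build that supports the rate option; Speeder's actual scripts may differ.

# Rough sketch: apply the "4g loss" preset from Table 2 with TC NETEM.
import subprocess

def apply_fixed_preset(dev="eth0", rate_kbit=4200, delay_ms=95, jitter_ms=10, loss_pct=0.6):
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root", "netem",
                    "rate", f"{rate_kbit}kbit",
                    "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
                    "loss", f"{loss_pct}%"], check=True)

if __name__ == "__main__":
    apply_fixed_preset()   # the parameters then stay constant for the whole test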

The dynamic model uses previous work (Goel et al., 2016) which introduced a model based on real-life cellular network observations. The model has six levels of “user experience (UX)”: NoLoss, Good, Fair, Passable, Poor and VeryPoor. Each UX level contains a long list of values for bandwidth, latency and loss. The model changes these parameters in intervals of 70 ms by picking the next value from the list to simulate a real network. This means the packet loss is more bursty than with the fixed model. For details, please see their paper or the original source code4.
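A simplified sketch of the dynamic behavior described above: every 70 ms the next (bandwidth, latency, loss) tuple from a trace is re-applied with TC NETEM. The three example tuples are placeholders; the real per-UX-level traces come from the cell-emulation-util repository in footnote 4.

# Sketch of the dynamic model: cycle through trace values every 70 ms (assumed values).
import itertools, subprocess, time

good_trace = [(6000, 60, 0.0), (4500, 80, 0.5), (5200, 70, 0.2)]  # (kbit, ms, %loss), made up

def emulate(dev="eth0", trace=good_trace, interval_s=0.07, duration_s=30):
    end = time.time() + duration_s
    for rate, delay, loss in itertools.cycle(trace):
        if time.time() >= end:
            break
        subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root", "netem",
                        "rate", f"{rate}kbit", "delay", f"{delay}ms", "loss", f"{loss}%"],
                       check=True)
        time.sleep(interval_s)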

We mainly report results using the dynamic model because, despite the fact that it only emulates cellular networks, we feel it is most representative of a realistic network and makes it easier for others to compare our results to related work.

1 https://goo.gl/IBMtmV
2 http://www.verizonenterprise.com/about/network/latency/
3 https://github.com/tylertreat/comcast
4 https://github.com/akamai/cell-emulation-util


4 HTTPS/2 IMPACT ON HTTP/1 BEST PRACTICES

In this section we study the impact of applying h2 on three well-known h1 performance best practices: resource concatenation, resource embedding and hostname sharding. We create a number of test cases for each best practice to assess its performance in isolation of other factors. To be able to compare our results using the SpeedIndex metric, we make sure our loaded resources have a strong visual impact on the visible “above the fold” part of the website. All images are included as <img> and all CSS and JavaScript (JS) resources use code to style a <div> element. Inconsistencies between loadEventEnd and SpeedIndex results can indicate that a resource was fast to load but slow to have visual impact.

Most of our graphs show loadEventEnd on the y-axis. Individual data points are aggregates of 10 to 100 page loads. Each experiment was run at least five times. Unexpected datapoints and anomalies across all runs were further confirmed by manually checking the collected output of individual page loads (e.g., screenshots/videos, .har files, waterfall charts). The line plots show the median values under Good network conditions. The box plots show the median as a horizontal bar and the average as a black square dot, along with the 25th and 75th percentiles and min and max values as the whiskers. Some box plots use a logarithmic scale on the y-axis to allow for large values.

4.1 Concatenation

h1 is slow at loading a large number of files at the same time because it can only fetch a single resource at a time per connection and the browser often limits h1 to six parallel connections per hostname. This means requests can be stalled, waiting for the completion of previously issued requests. The best practice of concatenation helps by merging multiple sub-resources together into a single, larger file. This lowers the number of individual requests but at the expense of reduced cacheability: if a single sub-resource is updated, the whole concatenated file needs to be re-fetched as well. Note that image concatenation requires additional CSS code as well, to display only one specific area of the bigger image at every smaller image location. Because h2 can multiplex many requests per connection, this optimization should not be needed, and we can request many small files and keep fine-grained cacheability.
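For clarity, concatenation is essentially a build step; the toy sketch below merges a directory of small CSS files into one bundle. File names are placeholders, and real build tools typically also minify and fingerprint the result.

# Toy build step for the concatenation best practice: many small files, one request.
from pathlib import Path

def concatenate(sources, bundle):
    parts = [Path(p).read_text() for p in sources]
    Path(bundle).write_text("\n".join(parts))
    # Trade-off: editing any single source now invalidates the whole bundle in the
    # browser cache, which is exactly the cost h2 multiplexing aims to avoid.

concatenate(sorted(Path("css").glob("*.css")), "bundle.css")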

4.1.1 Concatenation for Images (Spriting)

We observe three experiments: (a) a large number (i.e., 380) of small files, (b) a medium number (i.e., 42) of medium sized files and (c) a medium number (i.e., 30) of large files. In Figure 2 we compare the concatenated versions of the images (left) to loading individual files (right).

Figure 2: Synthetic test cases for image concatenation. h2 performs well for many small files but deteriorates for fewer or larger files. (Panels: (a) concatenated single big sprite, total 779 KB, vs. 380 small icons, total 502 KB; (b) concatenated single big sprite, total 1.1 MB, vs. 42 medium images, total 1 MB; (c) 30 large images, total 10.4 MB, where the concatenated single 10 MB image often crashed the tested browser or led to loadEventEnd values above 35 k ms. Axes: network condition (NoLoss, Good, Fair, Poor) vs. loadEventEnd (ms); series: HTTP/1, HTTPS/1, HTTPS/2.)

For Figure 2(a) we observe that h2 significantly outperforms h1 when there is no concatenation, but that a single spritesheet largely reduces h2’s benefit and brings it somewhat on a par with h1. This is expected, as h1 is limited to requesting six images in parallel and keeps the others waiting, while the single h2 connection can efficiently multiplex the many small files. It is of note that the concatenated version is two to five times faster, even though (in a rare compression fluke) its file size is much higher than the sum of the individual file sizes.

Figure 2(b) shows relatively little difference and no clear, consistent winners between the concatenated and separate files. This is expected for h2, as in both cases it sends the same amount of data over the same connection, but not for h1. We would expect the six parallel connections to have more impact, but it seems they can actually hinder performance under good network conditions. This is probably because of the limited bandwidth in our emulated network, where the six connections contend with each other, while a single connection can consume the full bandwidth by itself. Comparing this to (a), we see that here h2 does not get significantly faster for the concatenated version. This indicates that the higher measurements in (a)(right) are in large part due to the overhead of handling the many individual requests.

Lastly, in Figure 2(c) we see that h2 struggles to keep up with h1 for the larger files and performs significantly worse under bad network conditions (note the y-axis’ log scale). Due to the much larger amount of data, the six parallel connections do help here, while packet loss impacts the single h2 connection more.

The SpeedIndex measurements (not included here) show very similar trends.

4.1.2 Concatenation for CSS and JavaScript

We observe two experiments: 500 <div> elements are styled using simple CSS files (a single CSS rule per <div>) (left) and complex JS files (multiple statements per <div>) (right). We split the code over multiple files, from one file (full concatenation) to 500 files (no concatenation). Figure 3 shows full results in (a) and shows more detail for one to 30 files in (b).

Figure 3: Synthetic test cases for CSS/JS concatenation. h2 performs well for many files but there is no clear winner for the more concatenated cases. (Panels: (a) 0 to 500 files, (b) zoomed in on one to 30 files; left: simple CSS files, total 16.4 KB; right: complex JS files, total 395 KB. Axes: number of files vs. loadEventEnd (ms); series: HTTP/1, HTTPS/1, HTTPS/2, HTTPS/2 (Firefox).)

The big-picture trends in (a) look very similar to Figure 2(a): h2 again clearly outperforms h1 as the number of files rises and shows a much better progression towards larger file quantities than the quasi-linear growth of h1. The performance of Firefox is also interesting: while its h1 results (not shown in Figure 3) look almost identical to Chrome’s, the h2 values are much lower, indicating that it has a more efficient implementation that scales better to numerous files.

Looking at the zoomed-in data in (b), we do see somewhat different patterns. For the simple CSS files the trends are relatively stable, with h1c outperforming h2 and h2 beating h1s. This changes at about 30-40 files, where h2 finally takes the lead. For the more complex JS files (right), this tipping point comes much later, around 100 files. The measurements for one to ten JS files are also much more irregular when compared to CSS. Because h1 shows the same incongruous data as h2, we can assume this is due to how the browser handles the computation of the larger incoming files. A multithreaded or otherwise optimized handling of multiple files can depend on how many files are being handled at the same time. This would also explain the very high measurements for a single JS file in Firefox (consistent over multiple runs of the experiment). In additional tests, smaller JS files and larger CSS files also showed much more stable trends, indicating that especially large JS files incur a large computational overhead. Note as well that the timings for a smaller number of JS files are sometimes higher than those for the larger numbers, indicating that concatenation might not always be optimal here (for any of the protocols).

Poor network conditions (not shown here) indicate similar trends to Good networks, but the h2 tipping points come later: 40-50 files for simple CSS, 150 for complex JS.

4.1.3 Concatenation for CSS and JavaScript with SpeedIndex

Figure 4: Synthetic test cases for CSS/JS concatenation (SpeedIndex). The h2 SpeedIndex for JavaScript indicates that it is much slower to start rendering than h1. (Left: simple CSS files, total 16.4 KB; right: complex JS files, total 395 KB. Axes: number of files (0 to 500) vs. SpeedIndex (ms); series: HTTP/1, HTTPS/1, HTTPS/2.)

For the tests in the previous section, the SpeedIndex results were significantly different and merit separate discussion. Figure 4 shows the same experiment but depicts SpeedIndex for Google Chrome. We notice that the data for the simple CSS files (left) looks very similar to Figure 3, but the results for the complex JS files (right) do not. Since the SpeedIndex metric gives an indication of how progressively a page renders (Section 3) and because we know from Figure 3 that h1 takes much longer than h2 to load large amounts of files, we can only conclude that under h2 the JS files take much longer to have an effect on the page rendering, skewing the SpeedIndex in this way.


Table 3: Observed browser render-blocking behavior. Chrome blocks on too much, while Firefox doesn’t seem to block on anything.

             Expected <head>   Chrome <head>   Firefox <head>
CSS          blocking          blocking        progressive
JavaScript   blocking          progressive     progressive

             Expected bottom   Chrome bottom   Firefox bottom
CSS          progressive       blocking        progressive
JavaScript   progressive       progressive     progressive


We manually checked this assumption using screenshots and found that for h1 the JS was indeed progressively executed as soon as a file was downloaded, but with h2 the JS took effect in “chunks”: in larger groups of 50 to 300 files at a time and mostly towards the end of the page load. We first assumed this was because of erroneous multiplexing: if all the files are given the same priority and weight, their data will be interleaved, delaying all files. Captures of h2 frame data however showed each file i was requested as dependent on file i-1, ensuring that files were fully delivered in request order. We can once more only conclude that the browser implementation somehow delays the processing of the files, either because of their JS complexity or because the handling of many concurrent h2 streams is not optimized yet. This argument is supported by the SpeedIndex results for Firefox (not shown here), as its h2 values are much lower than those of h1, indicating that Firefox has a more efficient implementation.

This does not explain why the h1 SpeedIndex results for the simple CSS files are so different compared to those of JavaScript: if they were also applied individually as soon as they were downloaded, the h1 SpeedIndex values would be much lower. Manual investigation revealed that the browser waits for all the CSS files to be downloaded before applying them all at once, for both h1 and h2. While there does exist the concept of “render-blocking” CSS (and JS), where this behavior is actually wanted for all files in the <head> of a page, we purposefully hoped to avoid this issue by loading the CSS and JS files at the bottom of the HTML document. It seems Chrome always fully blocks rendering on all CSS files included anywhere in the page, making SpeedIndex indeed almost completely identical to loadEventEnd. Comparison with Firefox shows different conventions: here even CSS files that should be render-blocking (i.e., in the <head>) were just applied progressively instead.

Table 3 shows the observed behaviors from our synthetic tests and highlights where we would expect different implementations. We leave a deeper investigation into the reasons for this behavior and how well it generalizes to other cases for future work. Note that progressive for h2 still means the files are applied in larger chunks instead of individually.

4.2 Embedding and HTTP/2 Push

When loading a web page, the browser first requests the HTML file and only discovers it needs additional CSS/JS files when parsing the HTML. It then waits for any render-blocking files (see Section 4.1.3) to be fully downloaded before starting to render the HTML. This means a minimum of two round-trip times (RTTs) between the initial HTML request and “first paint”. It is however also possible to embed (or “inline”) CSS and JS code directly in the HTML document (e.g., through <style> and <script> tags). That way, the code is received after one RTT and rendering can start sooner. The main downside is that the embedded code cannot be cached separately.

The HTTP/2 specification also recognizes the need to eliminate this second RTT and includes a mechanism called h2 (Server) Push (Belshe et al., 2015). The server can send along additional files with the first HTML response (or indeed, any response) without having to embed their contents in the HTML, allowing the files to be cached. Unlike embedding, Push can also properly handle binary resources (e.g., images, fonts). As such, Push should be a drop-in replacement for embedding.
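As a concrete illustration, one common way for an application to request a push is to emit a Link response header with rel=preload, which h2-capable servers such as Apache's mod_http2 (used in Section 4.2.2) can translate into PUSH_PROMISE frames. The minimal WSGI sketch below is illustrative only and assumes such a server sits in front of it; it is not Speeder's configuration.

# Sketch: announce a push candidate via a Link preload header from a WSGI app.
def app(environ, start_response):
    headers = [
        ("Content-Type", "text/html"),
        ("Link", "</css/critical.css>; rel=preload; as=style"),  # push hint for the server
    ]
    start_response("200 OK", headers)
    return [b"<html><head><link rel=stylesheet href=/css/critical.css></head>"
            b"<body>...</body></html>"]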

However, while this all works well theoretically, in practice these approaches are hindered by TCP’s “slow start” congestion control algorithm. TCP sends only a small amount of data at the start of the connection and then exponentially ramps up its speed if no packet loss or delays are present. In practice, the TCP data window at the start is about 14 KB for modern Linux kernels (used in our experiments) (Marx, 2016). Thus if the HTML file (including the embedded CSS/JS) is larger than 14 KB, the remainder has to wait for the second “send burst”, arriving only after two RTTs. The best practice is thus to embed only a limited amount of so-called “critical CSS” that renders a good first impression of the page (e.g., general layout, building blocks, background colors) and load the other CSS using <link> references. Push has a similar problem: if the HTML fills up the initial 14 KB, push will have no benefit (Bergan et al., 2016).
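A back-of-the-envelope sketch of this limit: with an initial congestion window of 10 segments of roughly 1,460 bytes each (the usual default on the Linux kernels mentioned above, hence the ~14 KB figure) and idealized doubling per round trip, the helper below estimates how many RTTs a response of a given size needs. The 9 KB and 42 KB inputs anticipate the eff.org HTML sizes used in Section 4.2.2.

# Idealized slow-start arithmetic: how many RTTs before a response is fully delivered?
def rtts_to_deliver(size_bytes, init_cwnd_segments=10, mss=1460):
    window, sent, rtts = init_cwnd_segments * mss, 0, 0
    while sent < size_bytes:
        sent += window
        window *= 2          # ideal slow start: no loss, no delayed ACKs
        rtts += 1
    return rtts

print(rtts_to_deliver(9 * 1024))    # 1 RTT: fits in the initial ~14 KB window
print(rtts_to_deliver(42 * 1024))   # 2 RTTs: the remainder waits for the next burst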

Rather than create our own test case, we use the existing Push demo by Bradley Falzon5. This case uses the www.eff.org frontpage as the reference point, which we manually adjust to include the discussed optimizations.

5 https://github.com/bradleyfalzon/h2push-demo


This page has a single “critical CSS” file and a good mix of additional CSS, JS and image files without being too complex. We verify our observations using other synthetic test cases. We show the results from tests using the Apache webserver because NGINX does not yet support h2 Push. We show the SpeedIndex results because these should be most affected.

4.2.1 Embedding

In Figure 5 we observe a single experiment: we embed the “critical CSS” file in the <head> and move the other CSS <link> and JS <script> URLs to the bottom of the page, so they should not be render-blocking. We compare this optimization (left) to the original page (right).

Figure 5: Realistic test case for embedding. The technique has a low impact and is similar across protocols. (Left: single embedded CSS file; right: original. Axes: network condition (NoLoss, Good, Fair, Poor) vs. SpeedIndex (ms); series: HTTP/1, HTTPS/1, HTTPS/2.)

As we can see, embedding seems to lead to only very limited SpeedIndex gains. This is unexpected, since manual confirmation of the results clearly shows that the embedded CSS does take effect early in the page and that it is not limited by the render-blocking behavior of Chrome (see Section 4.1.3). This reduced impact on SpeedIndex has two reasons:

Firstly, the embedded CSS content is larger than the 14 KB TCP first transmission limit. Consequently, the bottom portion of the HTML containing the remaining CSS/JS files is now only received after the second RTT. Since the browser needs to parse the HTML to discover these additional dependencies and issue requests for them, the fetching of the other files is delayed, which in turn postpones the total page load and render. This explains why we often see higher loadEventEnd values for pages that use embedding, even if the SpeedIndex drops.

Secondly, this optimization will primarily work well when the network is the bottleneck and not the CPU speed of the device. As parsing CSS and rendering both cost CPU resources, fast devices are more likely to have more CPU idle time available for partial rendering in between receiving the CSS data, while slow devices will have no such budget and spend all their time parsing CSS; the observed end result is similar to the normal render-blocking behavior. We can somewhat see the higher impact on slower networks in Figure 5 and a little more clearly in Figure 6.

We can conclude that embedding is more of a micro-optimization and that it requires careful fine-tuning to lead to good results.

4.2.2 HTTP/2 Push

With h2 Push, the HTML is always fully sent before the pushed data. As such, we need to make sure the HTML code is smaller than 14 KB so the window can also include the pushed CSS (Bergan et al., 2016) (note that this also means Push does not suffer from the “late resource discovery” problem seen when embedding). We manually remove some metadata from the eff.org HTML and enable gzip compression, reducing the on-network HTML size from 42 KB to 9 KB.

Figure 6: Realistic test case for HTTP/2 Push. Pushing is similar to embedding. Pushing the wrong assets or in the wrong order can deteriorate performance. (Test cases: embed, 1 CSS, CSS/JS, all, images, orig, shown under Good and Poor network conditions; y-axis: SpeedIndex (ms); HTTPS/2 only.)

In Figure 6 we observe six different experiments based on the adjusted eff.org test case: (1) embed a single CSS file (no push, similar to Section 4.2.1), (2) push the single CSS file, (3) push all the CSS/JS files (10 files), (4) push all (images + CSS/JS + one font), (5) push all images (18 files) and (6) the reference measurement (original, similar to Section 4.2.1). We see that pushing the “critical CSS” file is indeed very similar to embedding it. It is unexpected however that pushing all CSS/JS performs a little better than just the “critical CSS” in the Good network condition. We found that in practice the initial data window is often a bit larger than 14 KB, so it can accommodate more than just the single CSS file. However, this is not always the case and other runs of the same experiment show less optimal results (which can also be seen in the Poor network condition in Figure 6 (right)). Given a larger data window, the “Push all” test case should perform similarly, but it is consistently a bit higher. It turned out that we pushed the single font file at the very end, after all the images. The font data should have been given a higher h2 priority than the images, but due to a recently discovered Apache bug6 this was not the case and the font data had to wait, delaying the final page load and render. This also explains why pushing just the images performs worse than the reference: the much more important CSS and JS is delayed behind the image data.

6 https://icing.github.io/mod_h2/nimble.html



These experiments clearly show that, similarly to embedding, Push is hard to fine-tune and limited in the benefit it can deliver. We had to make several manual changes to the original website to get any measurable benefit at all, at least in the context of the initial page load and pushing along with the first HTML request.

However, the true power of Push might be that it can also send data along with non-HTML files. In a separate synthetic experiment we used JavaScript to fetch a .json file that contains a list of image URLs and then add them to the page as <img> elements. This emulates how a REST API can be used in typical modern “Single Page App” frameworks. We make sure the initial HTML is large to “warm up” the TCP connection (Bergan et al., 2016). Along with the .json response, the server then also pushes the images that are referenced within. This did considerably speed up the rendering process, especially on bad networks. This illustrates that Push can have many other applications than just as a replacement for embedding during the initial page load.

4.3 Sharding

Browsers set a maximum of six parallel connections per hostname, but this is not enough for loading the large numbers of files that make up modern, complex websites over h1 (see Section 4.1). Browsers however also observe a higher total limit of 17 to 60 open connections7 and so it has become common practice to distribute files over multiple hostnames/servers (called sharding), since each can have its own group of six connections. This can for example be done using Content Delivery Networks (CDNs) to host “static” content. The main downside of sharding is the increased overhead: the setup becomes more complex and additional connections require extra computational resources and hardware.
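As an illustration of the technique, a sharded site typically rewrites asset URLs so each asset maps deterministically to one of a few hostnames; the sketch below uses a stable hash so an asset always lands on the same shard (hostnames and paths are placeholders, not from our test cases).

# Toy hostname sharding: spread assets over four hosts, six h1 connections each.
import zlib

SHARDS = ["static1.example.com", "static2.example.com",
          "static3.example.com", "static4.example.com"]

def shard_url(path):
    # A stable hash keeps each asset on the same shard across page views,
    # so the rewrite does not defeat browser caching.
    host = SHARDS[zlib.crc32(path.encode()) % len(SHARDS)]
    return f"https://{host}{path}"

print(shard_url("/img/icon-001.png"))
print(shard_url("/img/icon-002.png"))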

On the other hand, h2 wants to make optimal use of just a single connection and actively discourages parallel channels. The h2 specification even includes a mechanism for coalescing requested HTTP connections to separate hostnames onto a single TCP connection if the hosts use the same HTTPS certificate and resolve to the same IP address (Belshe et al., 2015). h2 Push can also only be used for resources on the same domain.

4.3.1 Sharding for Images

We observe three experiments, identical to the unconcatenated setup in Section 4.1.1: (a) 380 small files, (b) 42 medium sized files and (c) 30 large files. Figure 7 compares a sharded version over four hostnames (left) with the unsharded version (right). In practice, over h1 the browser will open the maximum amount of connections (24, six per hostname) and a single connection per hostname for h2 (four in our case). The observed trends are similar for CSS/JS files.

7 http://www.browserscope.org/

Figure 7: Synthetic test cases for image sharding. h2 benefits for larger files but deteriorates for many smaller files. h1 always improves. (Panels: (a) 380 small icons, 502 KB; (b) 42 medium images, 1 MB; (c) 30 images, 10.4 MB; each over four hosts (left) vs. a single host (right). Axes: network condition (NoLoss, Good, Fair, Poor) vs. loadEventEnd (ms); series: HTTP/1, HTTPS/1, HTTPS/2.)


Firstly, in Figure 7 (a) we observe that sharding deteriorates h2 performance, while only marginally benefiting h1. Because the files are so small, h2’s multiplexing was at its best in the single host case and maximized the single connection’s throughput, while it now has less data to multiplex per connection. Conversely, h1 now has more connections but still suffers from HOL blocking on the small files.

In contrast, (b) shows inconsistent behavior when sharding: sometimes it helps and sometimes it hurts h2; it shows good benefits for h1c but less impressive improvements for h1s. We posit that the additional overhead of setting up secured HTTPS connections (both for h1s and h2s) limits the effectiveness of the higher parallel throughput.


Lastly, (c) clearly shows why sharding is considered an h1 best practice: it can improve performance by over 50%, especially in bad network conditions. h2 also profits significantly from the multi-host setup. For these larger files, multiple connections provide the much-needed higher throughput. They also help mitigate HOL blocking for h1 and lessen the impact of loss when compared to a single h2 connection.

Overall we observe similar results to those of Goel et al. (Goel et al., 2016): for many, smaller objects, sharding can hurt h2 performance, but not considerably. For medium-sized websites, sharding usually helps but not always substantially. The optimization has the most impact on websites with large objects, for both protocols. We also performed experiments with two and three hosts and found that if sharding helps for h2, sharding over more hosts helps more, but there are diminishing returns with each increase in the number of hosts.

Additionally, h2 has another important feature that is impacted heavily by sharding: HPACK header compression (Belshe et al., 2015). HTTP headers often contain duplicate information and can be efficiently compressed. Because HPACK partly uses a shared compression dictionary per connection that is built up at runtime, we see its effectiveness decrease in the case of sharding, as the algorithm has less data to learn from and re-use. Table 4 shows how the total bytes sent by Google Chrome (composed mainly of HTTP headers) differ with and without sharding. h2s clearly outperforms h1s, but sharding decreases its effectiveness by 25% to 50%. Note that h1 does not employ any header compression.

Table 4: Total bytes sent by Google Chrome (∼ HTTP headers) and ratio to total page size. For many small files, the HTTP header overhead is significant.

File count   Protocol   Bytes out (1 host)   Bytes out (4 hosts)   Total page size   4 hosts, % of total page size
42 files     h2s        649                  1362                 1075000           0.1%
             h1s        2786                 3346                 1075000           0.3%
400 files    h2s        29580                38680                610000            6%
             h1s        165300               177600               610000            19%
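The effect on HPACK can be sketched with the third-party Python hpack package (assuming it is installed): encoding 100 image requests on a single connection reuses the dynamic table, while spreading the same requests over four shard connections forces each encoder to re-learn the bulky user-agent and cookie headers. The header values below are made up, not taken from our measurements.

# Sketch: one shared HPACK dynamic table vs. four independent ones (sharding).
from hpack import Encoder

def request_headers(path):
    return [(":method", "GET"), (":scheme", "https"), (":authority", "example.com"),
            (":path", path),
            ("user-agent", "Mozilla/5.0 (X11; Linux x86_64) Chrome/54.0 ..."),
            ("cookie", "session=0123456789abcdef; prefs=compact-layout")]

single = Encoder()                         # one connection, shared dynamic table
one_conn = sum(len(single.encode(request_headers(f"/img/{i}.png"))) for i in range(100))

sharded = [Encoder() for _ in range(4)]    # four shards = four independent tables
four_conn = sum(len(sharded[i % 4].encode(request_headers(f"/img/{i}.png")))
                for i in range(100))

print(one_conn, four_conn)                 # the sharded total is noticeably larger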

5 INTERPRETING RESULTS FROM REALISTIC WEBSITES

Synthetic test cases are useful to assess individual techniques in isolation but they are often not very representative of real websites. We now look at some more realistic test cases. Our goal here is not to come to definitive conclusions about h2 or h1 performance but rather to show the difficulties involved in interpreting measurements.

We present results for nine different websites (1: barco.com, 2: bbc.com, 3: demo.cards, 4: deredactie.be, 5: hbvl.be, 6: healthcare.gov, 7: standaard.be, 8: uhasselt.be, 9: yappa.be). To make them easier to compare to the clearest trends in the synthetic experiments (Section 4.1.1), we select image-heavy websites in two categories: (A) media/news sites with typically large HTML content and many smaller images (nr. 2, 4, 5 and 7) and (B) company/product landing pages with “hero image(s)” (large images taking up most of the “above the fold” space) (nr. 1, 3, 6, 8 and 9). These websites present a good mix: we have both optimized (using concatenation and embedding) and unoptimized pages, complex and relatively simple sites. For more details we refer to our website.

We simulate what would happen if a developer were to switch their h1 site to h2 by naively moving all their own assets over to a single server (disabling sharding) but still downloading some external assets from third party servers (e.g., Google analytics, some JS libraries). This approach is similar to (Wang et al., 2014). We download the websites to our local webserver using a custom wget tool that rewrites the main assets to a single hostname, and serve them using Speeder.
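A hypothetical post-processing step in the spirit of that wget tool is sketched below: it rewrites references to the site's own asset hostnames to a single local hostname while leaving third-party URLs untouched. The hostname lists and the regular expression are assumptions for illustration, not the actual tool.

# Sketch: rewrite first-party asset hostnames in a downloaded page to one local host.
import re
from pathlib import Path

OWN_HOSTS = ["static.example.com", "img.example.com", "cdn.example.com"]
LOCAL_HOST = "test.speeder.local"

def rewrite(html):
    for host in OWN_HOSTS:
        html = re.sub(rf"https?://{re.escape(host)}", f"https://{LOCAL_HOST}", html)
    return html   # third-party URLs (analytics, external JS) are left as-is

page = Path("index.html")
page.write_text(rewrite(page.read_text()))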

5.1 Results

In Figure 8 we show the median loadEventEnd and SpeedIndex measurements for the nine websites over Good and Poor networks, loaded in Chrome and Firefox. Globally, we can state that loadEventEnd and SpeedIndex are often similar for the Good network, indicating that our sites are mostly network-dependent, as the rendering waits for assets to come in. Poor network conditions can have a very large impact. In various cases h2’s SpeedIndex is far above that of h1 even if their loadEventEnd values are similar, indicating that h2 is slower to start rendering, consistent with our observations in Section 4.1.3. h1c is faster than h2 in almost all of the cases and h2 is almost never much faster than h1s. Note that this is against our expectations, as h1 has to make do without the benefits of sharding.

We now highlight how important it is to compare different metrics and setups by looking more closely at a few of the individual sites. For example, if we were developing site 8 and only looked at Figure 8 (a) and (b), we might conclude that h2 can give a huge speedup on Good networks, which is then possibly reversed on Poor networks. However, comparing this to Figure 8 (c) and (d) shows that the SpeedIndex values do not exhibit these anomalies. We indeed found that one of the externally loaded (non render-blocking) JavaScript files sometimes took a very long time to download, and we replaced it with a locally hosted version.


Figure 8: Nine realistic websites on dynamic Good and dynamic Poor network models. There is very similar performance under Good network conditions, but h2 clearly suffers from Poor conditions. (Panels: (a) Chrome loadEventEnd, (b) Firefox loadEventEnd, (c) Chrome SpeedIndex, (d) Firefox SpeedIndex; each shows websites 1-9 under Good and Poor network conditions, values in ms; series: HTTP/1, HTTPS/1, HTTPS/2.)

This completely removed the anomalies for the Good network, but did not significantly affect the trends for the Poor network. As it turned out, the bottleneck was in downloading the site’s many large assets. As the network experienced more loss and less bandwidth, the h2 connection suffered and the limited number of six parallel h1 connections also could no longer keep up. Not comparing loadEventEnd to SpeedIndex, or Poor to Good networks, might have led us to miss this issue and possibly draw the wrong conclusions for this datapoint.

As developers of site 6 we would be quite happy, since this is one of the faster websites, with very little difference between h1 and h2 for the Good network and inconsistent but similar results on the Poor network. However, looking more closely we noticed that the very large hero image was taking up more than half of the total load time. The image URL was included halfway down the HTML page, behind CSS/JS files and other images. By the time the browser discovered this resource, the six parallel h1 connections were already taken, and because browsers currently assign the same h2 priority to all images, the hero image had to wait behind the other images. Simply concatenating some of the JS files ensured that there was an h1 connection available by the time the hero image was discovered, and the SpeedIndex value was cut in half for h1. However, h2 was still just naively prioritizing the hero image behind the other assets. Moving the hero image URL to before the other images was the solution. Sharding the hero image to a different hostname would likely have had a similar impact. This problem was not evident from the presented graphs, only from looking at the waterfall charts. Even when comparing many different test setups, some problems will still remain hidden.

6 DISCUSSION

Conceptually, the ideal h2 setup will use a single TCP connection to multiplex a large amount of small and individually cacheable site resources. This alleviates the h1 application-layer HOL blocking and helps to reduce the overhead of many parallel connections, while also maximizing the efficiency of the underlying TCP protocol. Together with advanced resource priorities and h2 Push, this can lead to (much) faster load times than are possible today over h1, with much less overhead.

However, as our experiments have shown, this ideal setup is not yet viable. While h2 is indeed faster than h1 when loading many files (see Figures 2 and 3), it is still often slower than loading concatenated versions of those files (see Section 4.1). Looking at the SpeedIndex values (see Figures 4 and 8) also shows that h2 is frequently later to start rendering the page than h1. h2 also struggles when downloading large files (see Figures 2 and 7) and can quickly deteriorate when used in bad network conditions. In our observations, h2 is in most cases currently either a little slower than or on a par with h1, and shows both the most improvement and worst deterioration in more extreme circumstances.

The good news is that almost all of the encountered problems limiting h2's performance are due to inefficient implementations in the used server and browser software. Firstly, while loading many smaller files incurs its own considerable overhead, comparing Chrome and Firefox in Figure 3 shows that this overhead can be reduced substantially, lessening the need for concatenation (though possibly never completely removing it). Secondly, the fact that h2 is later to start rendering than h1 is also due to ineffective processing of the h2 data, since we have confirmed that resources are received well on time to enable faster first paints (see Section 4.1.3). It is also interesting to note that Firefox seems to have especially optimized its pipeline for large numbers of files, since it outshines Chrome for synthetic pages in Figure 3 but falls behind for more realistic content in Figure 8. Thirdly, cases in which h2 underperformed while using h2 Push in Section 4.2.2 or loading realistic sites in Section 5.1 could be attributed to the server implementation not correctly (re-)prioritizing individual assets. As these implementations mature, we can expect many of these issues to be resolved.

However, h2 still retains some core limitations, mostly due to its single underlying TCP connection, which seems to simultaneously be its greatest strength and weakness. TCP's congestion control algorithms can lead it to suffer significantly from packet loss on poor networks (most obvious when downloading multiple large files, see Figures 2 and 7) and can heavily impact the effectiveness of h2 and resource embedding on newly established connections (see Section 4.2). We have to nuance these statements however, as in practice h2 actually performs quite admirably and usually does not suffer more from bad networks than h1, despite using fewer connections.
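A rough intuition for why a single connection is, in principle, more loss-sensitive than six parallel ones (a standard back-of-envelope approximation, not a result of our measurements) is the classic TCP throughput bound for Reno-style congestion control:

\[ \text{throughput} \;\lesssim\; \frac{MSS}{RTT} \cdot \frac{C}{\sqrt{p}}, \qquad C \approx 1.22, \]

where MSS is the segment size, RTT the round-trip time and p the packet loss rate. Six parallel h1 connections can together draw on roughly six times this budget, while a single h2 connection must fit all multiplexed resources within one such bound, which is consistent with the deterioration we observe on the Poor network model.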

Recognizing that these underlying h2 performance problems stem primarily from the use of TCP, Google and its partners implement their own application-layer reliability and congestion control logic on top of UDP in the new QUIC protocol (Carlucci et al., 2015). They remove the transport-layer HOL blocking by allowing out-of-order delivery of packets, handle retransmits differently in the case of loss, reduce the number of round trips needed to establish a new connection and allow larger initial data transmissions. Running h2 on top of QUIC can greatly benefit h2's multiplexing setup.

Until the h2 implementations mature and/or the QUIC protocol is finalized, however, fully switching to the ideal h2 approach could severely reduce the observed performance, especially when looking at the SpeedIndex measurements for poor networks in Section 5.1. Luckily, the discussed h1 best practices (resource concatenation, resource embedding and hostname sharding) have a similar impact on both h1 and h2 in non-extreme cases (see Section 4). This means that most sites that are currently optimized for h1 can quickly and safely transition to h2 (or offer users both protocols in tandem) with a minimum of changes and without suffering large performance penalties. More work may however be needed for sites whose users are mostly on bad networks or sites that are using the unencrypted h1c, which is almost always faster than h2. The new h2 Push mechanism looks especially promising on "warm" TCP connections but seems difficult to fine-tune when pushing files along with the initial HTML file (see Section 4.2.2).

Finally, using the Speeder framework to run our experiments on a large number of emulated test setup permutations has proven to be a useful approach. Many implementation inefficiencies were discovered by comparing different metrics, browsers and network conditions and looking for inconsistencies. Analyzing and comparing both synthetic and realistic test data was also effective, with the more realistic results indeed somewhat dampening the more extreme synthetic results but largely showing similar trends. Furthermore, the two concrete website examples in Section 5.1 show that individual datapoints can have subtle underlying reasons for their observed values, some having little to do with the underlying protocol or implementations but originating in the HTML structure itself. They also show that even a deep comparison of different setups is not always enough to find important problems and that tools should also support manual confirmation of results through a variety of visualizations. We can conclude that the Speeder framework is a versatile platform, usable not only by researchers but also by developers to assess both global trends for comparisons and local optimizations for individual websites.

7 CONCLUSION

In this work we have used the Speeder framework to run experiments for both synthetic and realistic test cases over a variety of different setups, allowing us to compare h2s to h1s and h1c, Chrome to Firefox, loadEventEnd to SpeedIndex and good to bad network conditions.

Our results have shown that most of the cases in which h2 currently underperforms are caused by unoptimized implementations and that most of the real, inherent h2 problems are caused by the use of a single underlying TCP connection. This can lead h2 performance to deteriorate on poor networks with high packet loss, but in practice we have seen that the difference with h1 under these conditions is usually not that large. Globally speaking, we find that h2 is often on a par with h1s and a little slower than h1c.

We have found that all three discussed h1 best practices of resource concatenation, resource embedding and hostname sharding perform similarly over h1 and h2 in non-extreme circumstances. Especially concatenation and sharding can have a large impact on performance, while embedding and its h2 Push alternative are observed to be micro-optimizations that are difficult to fine-tune (at least for the initial page load). These findings lead us to recommend to developers that they often do not have to significantly alter their current setups when deciding to migrate from h1 to h2 or to support both versions in parallel.

We argue that, in time, the envisioned ideal h2 model of using many small, individually cacheable files over a single, well-filled TCP connection will become viable with improving browser/server implementations and the QUIC protocol.

We will keep expanding and sharing the Speeder framework (e.g., with the h2o server, Google Lighthouse, sitespeed.io v4), using it to follow browser implementation changes, as well as to more deeply investigate the possibilities of h2 Push and h2 priorities in various settings.

ACKNOWLEDGEMENTS

This work is part of the imec ICON PRO-FLOW project. The project partners are, among others, Nokia Bell Labs, Androme, Barco and VRT. Robin Marx is an SB PhD fellow at FWO, Research Foundation - Flanders, project number 1S02717N. Thanks to Messrs. Goel, Wijnants, Michiels, Robyns, Menten, Bonne and our anonymous reviewers for their help.

REFERENCES

Belshe, M., Peon, R., and Thomson, M. (2015). Hypertext transfer protocol version 2. https://tools.ietf.org/html/rfc7540. Last visited: 2017-03-01.

Bergan, T., Pelchat, S., and Buettner, M. (2016). Rules of thumb for http/2 push. https://docs.google.com/document/d/1K0NykTXBbbbTlv60t5MyJvXjqKGsCVNYHyLEXIxYMv0. Last visited: 2017-03-01.

Carlucci, G., De Cicco, L., and Mascolo, S. (2015). Http over udp: an experimental investigation of quic. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, pages 609-614. ACM.

Corbel, R., Stephan, E., and Omnes, N. (2016). Http/1.1 pipelining vs http2 in-the-clear: Performance comparison. In 2016 13th International Conference on New Technologies for Distributed Systems (NOTERE), pages 1-6.

de Oliveira, I. N., Endo, P. T., Melo, W., Sadok, D., and Kelner, J. (2016). Should i wait or should i push? a performance analysis of push feature in http/2 connections. In Proceedings of the 2016 Workshop on Fostering Latin-American Research in Data Communication Networks, pages 7-9. ACM.

de Saxce, H., Oprescu, I., and Chen, Y. (2015). Is http/2 really faster than http/1.1? In Computer Communications Workshops (INFOCOM WKSHPS), 2015 IEEE Conference on, pages 293-299. IEEE.

Erman, J., Gopalakrishnan, V., Jana, R., and Ramakrishnan, K. K. (2013). Towards a spdy'ier mobile web? In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT '13, pages 303-314, New York, NY, USA. ACM.

Goel, U., Steiner, M., Wittie, M. P., Flack, M., and Ludin, S. (2016). Http/2 performance in cellular networks: Poster. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, MobiCom '16, pages 433-434, New York, NY, USA. ACM.

Gooding, M. and Garza, J. (2016). Real world experiences with http/2. https://www.slideshare.net/JavierGarza18/real-world-experiences-with-http2-michael-gooding-javier-garza-from-akamai. Last visited: 2017-03-01.

Grigorik, I. (2013). High Performance Browser Networking. O'Reilly Media, Inc.

Liu, Y., Ma, Y., Liu, X., and Huang, G. (2016). Can http/2 really help web performance on smartphones? In Services Computing (SCC), 2016 IEEE International Conference on, pages 219-226. IEEE.

Marx, R. (2016). Http/2 push: the details. http://calendar.perfplanet.com/2016/http2-push-the-details/. Last visited: 2017-03-01.

Meenan, P. (2012). Speed index. https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index. Last visited: 2017-03-01.

Meenan, P. (2016). Webpagetest. https://webpagetest.org. Last visited: 2017-03-01.

Netravali, R., Goyal, A., Mickens, J., and Balakrishnan, H. (2016). Polaris: faster page loads using fine-grained dependency tracking. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16).

Varvello, M., Schomp, K., Naylor, D., Blackburn, J., Finamore, A., and Papagiannaki, K. (2016). Is the web http/2 yet? In International Conference on Passive and Active Network Measurement, pages 218-232. Springer.

Wang, X. S., Balasubramanian, A., Krishnamurthy, A., and Wetherall, D. (2014). How speedy is spdy? In NSDI, pages 387-399.

Wang, Z. (2012). Navigation timing api. https://www.w3.org/TR/navigation-timing. Last visited: 2017-03-01.