
Empirical Quantification of Opportunities for Content Adaptation

in Web Servers

Michael Gopshtein and Dror Feitelson
School of Engineering and Computer Science

The Hebrew University of Jerusalem

Supported by a grant from the Israel Internet Association

Capacity Planning

[Figure: daily cycle of activity, capacity vs. time; the area under the load curve is utilized capacity, the rest of the provisioned capacity is wasted]

Capacity Planning

[Figure: flash crowd, capacity vs. time; a sudden traffic spike far exceeds the provisioned capacity]

Capacity Planning

• The problem:
– Required capacity for flash crowds cannot be anticipated in advance
– Even provisioning for daily fluctuations is highly wasteful

• Academic solution: use admission control

• Business practice: rejecting any clients is unacceptable
– Especially during a surge in traffic

Content Adaptation

• Trade off quality for throughput
– Installed capacity matches normal load
– Handle abnormal load by reducing quality
– But still manage to provide meaningful service to all clients

• Assumes the usual optimizations have already been made
– Compress or combine images, promote caching, …
– Empirically, this is usually not the case

Content Adaptation

[Figure: smiley-face illustration; under low load, every client receives the full-quality content]

Content Adaptation

[Figure: smiley-face illustration; under high load, the same content is served at reduced size and quality so that all clients are still served]

Content Adaptation

• Maintain the invariant (see the sketch below):

    rate of requests × cost per request ≤ capacity

• Need to change quality (and cost!) of content
– Prepare multiple versions in advance
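A minimal sketch of how a server might enforce this invariant, assuming content versions were prepared in advance; the version names and per-request cost figures below are illustrative, not from the paper:

    # Pick the highest-quality content version whose cost keeps the
    # invariant: request_rate * cost_per_request <= capacity.
    # Version names and cost figures are illustrative assumptions.
    VERSIONS = [("full", 50.0), ("reduced", 20.0), ("minimal", 5.0)]

    def pick_version(request_rate: float, capacity: float) -> str:
        for name, cost_per_request in VERSIONS:  # ordered best quality first
            if request_rate * cost_per_request <= capacity:
                return name
        return VERSIONS[-1][0]  # overloaded: fall back to the cheapest version

For example, with these numbers pick_version(100, 3000) returns "reduced", since 100 × 50 exceeds the capacity but 100 × 20 does not.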

The Questions

• What are the main costs in web service?
– Is the bottleneck the CPU, the network, or the disk?
– What do we gain by eliminating HTTP requests?
– What do we gain by reducing file sizes?

• What can realistically be done?
– What is the structure of a “random” site?
– How much can we reduce quality?

Assumption: static web pages only

Costs of Serving Web Pages

Measuring Random Web Sites

• http://en.wikipedia.org/wiki/Special:Random

• Use title of page as input to Google search

• Extract domain of first link to get home page

• Retrieve it using IE

• Collect statistical data by intercepting system calls to send and receive
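A minimal sketch of this sampling pipeline in Python, using the third-party requests library (an assumption; the original measurements used IE with system-call interception). The Google-search step is represented by a hypothetical first_result_domain() helper, since scraping result pages is fragile and rate-limited:

    # Sketch: sample a "random" web site starting from a random Wikipedia page.
    import requests
    from urllib.parse import unquote

    def random_wikipedia_title() -> str:
        # Special:Random redirects to a random article; the final URL holds the title.
        r = requests.get("https://en.wikipedia.org/wiki/Special:Random", timeout=10)
        return unquote(r.url.rsplit("/", 1)[-1]).replace("_", " ")

    def first_result_domain(query: str) -> str:
        # Hypothetical helper standing in for the Google-search step;
        # real result scraping is elided here.
        raise NotImplementedError

    if __name__ == "__main__":
        title = random_wikipedia_title()
        domain = first_result_domain(title)                   # e.g. "example.org"
        home = requests.get(f"http://{domain}/", timeout=10)  # fetch the home page
        print(title, domain, len(home.content))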

Retrieved Component Sizes

[Figure: distribution of retrieved component sizes. Annotations: the largest components shown are only 0.02% of all components; about ¼ of the total data comes from components larger than 200 KB]

Download Times

Download time (and bandwidth requirements) roughly proportional to image size

Network Bandwidth

• Typical Ethernet packets are 1526 bytes
– Ethernet and TCP/IP headers require 54 bytes
– HTTP response headers require 280-325 bytes

• Most components fit into a few packets
– 43% fit into a single packet
– 24% more fit into 2 packets

Save bandwidth by reducing the number of small components or the size of large components
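A back-of-the-envelope sketch of the packet arithmetic on this slide; the 1526-, 54-, and ~300-byte figures come from the bullets above:

    # How many Ethernet packets does a component of a given size need?
    FRAME = 1526     # typical Ethernet packet, per the slide
    LINK = 54        # Ethernet + TCP/IP headers
    HTTP = 300       # midpoint of the 280-325 byte HTTP response headers

    def packets_needed(body_bytes: int) -> int:
        payload = FRAME - LINK      # usable bytes per packet (1472)
        first = payload - HTTP      # the first packet also carries HTTP headers
        if body_bytes <= first:
            return 1
        return 1 + -(-(body_bytes - first) // payload)  # ceiling division

    print(packets_needed(1000))   # -> 1
    print(packets_needed(4000))   # -> 3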

Locality and Caching

• Flash crowds typically involve a very small number of pages (possibly the home page)

• Servers allocate gigabytes of memory for caching

• This is enough for thousands of files (rough arithmetic below)
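A quick sanity check of the “thousands of files” claim, with an assumed 100 KB average component size (the actual mean is not stated on this slide):

    # 1 GB of cache / 100 KB per component ~= 10,000 cached files
    CACHE_BYTES = 1 * 2**30          # 1 GB cache (assumed)
    AVG_COMPONENT = 100 * 2**10      # 100 KB average component (assumed)
    print(CACHE_BYTES // AVG_COMPONENT)   # -> 10485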

Disk is not expected to be a bottleneck

CPU Overhead

• CPU usage reflects several activities
– Opening the TCP connection
– Processing the request
– Sending the data

• Measure using combinatorial microbenchmarks (see the sketch below)
– Open connection only
– One extremely large file
– Many small files
– Many requests for a non-existent file
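A rough sketch of the “open connection only” vs. “full request” microbenchmarks, assuming a test web server on localhost:8080 serving a small file /tiny.gif (both hypothetical):

    import socket, time

    HOST, PORT, N = "localhost", 8080, 1000   # hypothetical test server

    def bench(send_request: bool) -> float:
        start = time.perf_counter()
        for _ in range(N):
            s = socket.create_connection((HOST, PORT))
            if send_request:
                s.sendall(b"GET /tiny.gif HTTP/1.0\r\nHost: x\r\n\r\n")
                while s.recv(4096):       # drain until the server closes
                    pass
            s.close()
        return (time.perf_counter() - start) / N

    connect_only = bench(False)           # isolates connection cost
    full_request = bench(True)            # connection + processing + transfer
    print(f"{connect_only*1e6:.0f} us vs {full_request*1e6:.0f} us per request")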

CPU Overhead

Example: single 10 KB file
– Establishing connection: 25%
– Processing request: 72%
– Data transfer: 3%

• Processing and transfer costs become equal only at 240 KB
– Only 0.3% of files are that big

If the CPU is the bottleneck, we need to reduce the number of requests

Optimizations

Guidelines

• Either the CPU or the network is the bottleneck

• Network bandwidth saved by reducing large components

• CPU saved by eliminating small components

• Maintaining “acceptable” quality is subjective

Eliminating Images

• Images have many functions:
– Story (main illustrative item)
– Preview (for another page)
– Commercial
– Logo
– Decoration (bullets, background)
– Navigation (buttons, menus)
– Text (special formatting)

• Some can be eliminated or replaced

Distribution of Types

• Manually classified 959 images from 30 random sites

• 50% decoration
• 18% preview
• 11% commercial
• 6% logo
• 6% text

Automatic Identification

• Decorations are candidates for elimination

• Identified by a combination of attributes (see the sketch below):
– Use the GIF format
– Appear in HTML tags other than <IMG>
– Appear multiple times in the same page
– Small original size
– Displayed size much bigger than original
– Large change in aspect ratio when displayed
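A sketch of how these attributes might be combined into a classifier; the thresholds and the simple voting scheme are assumptions, not the paper's actual rule:

    from dataclasses import dataclass

    @dataclass
    class ImageUse:
        fmt: str              # file format, e.g. "gif"
        tag: str              # HTML tag it appears in, e.g. "img", "body"
        repeats: int          # occurrences in the same page
        file_bytes: int       # original file size
        scale: float          # displayed area / original area
        aspect_change: float  # displayed aspect ratio / original aspect ratio

    def looks_like_decoration(im: ImageUse) -> bool:
        votes = [
            im.fmt == "gif",
            im.tag != "img",               # e.g. CSS background, table cell
            im.repeats > 1,                # reused bullets, borders, ...
            im.file_bytes < 2048,          # "small original size" (assumed cutoff)
            im.scale > 4.0,                # displayed much bigger than original
            not (0.5 < im.aspect_change < 2.0),  # large aspect-ratio change
        ]
        return sum(votes) >= 3             # assumed majority-style threshold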

Image Sizes Distribution

[Figure: size distributions of decoration, preview, and commercial images]

Auxiliary Files

• JavaScript
– May be crucial for page function
– Impossible to understand automatically

• CSS (style sheets)
– May be crucial for page structure
– May be possible to identify the parts that are actually used

Auxiliary Files

• Cannot be eliminated

• Common wisdom: use separate files
– Allows caching at the client
– Saves retransmission with each page

• Alternative: embed in HTML (see the sketch below)
– Reduces the number of requests
– May be better for flash crowds that do not request multiple pages
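A minimal sketch of the embedding alternative: inline external CSS and JavaScript references directly into the HTML. The regex-based matching is a simplification that ignores quoting and attribute-order corner cases:

    import re
    from pathlib import Path

    def inline_assets(html: str, root: Path) -> str:
        # Replace <link ... href="x.css"> with an inline <style> block.
        def css(m):
            return "<style>" + (root / m.group(1)).read_text() + "</style>"
        # Replace <script src="x.js"></script> with an inline <script> block.
        def js(m):
            return "<script>" + (root / m.group(1)).read_text() + "</script>"
        html = re.sub(r'<link[^>]+href="([^"]+\.css)"[^>]*>', css, html)
        html = re.sub(r'<script[^>]+src="([^"]+\.js)"[^>]*>\s*</script>', js, html)
        return html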

Text and HTML

• Some areas may be eliminated under extreme conditions
– Commercials
– Some previews and navigation options

• Often encapsulated in <DIV> tags

• Sometimes identified by ID or class names, e.g. “sidebanner” (see the sketch below)
– Especially when using modular design
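A sketch of hiding such blocks by name; the keyword list is illustrative, and the naive regex does not handle nested <DIV> tags, for which a real implementation would need an HTML parser:

    import re

    # Drop <div> blocks whose id/class hints at dispensable content.
    DISPENSABLE = r'(?:banner|sidebar|advert|promo)'       # assumed keywords

    PATTERN = re.compile(
        r'<div[^>]*(?:id|class)="[^"]*' + DISPENSABLE + r'[^"]*"[^>]*>'
        r'.*?</div>',                      # naive: breaks on nested <div>s
        re.IGNORECASE | re.DOTALL)

    def hide_blocks(html: str) -> str:
        return PATTERN.sub("", html)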

Summary

Content Adaptation

• Degraded content usually better than exclusion

• Only way to handle flash crowds that overwhelm installed capacity

• Empirical results identify the main options:
– Identify and eliminate decorations
– Compress large images (story, commercial)
– Embed JavaScript and CSS
– Hide unnecessary blocks

Next Paper Preview

• Implementation in Apache

• Monitor CPU utilization and idle threads to switch between modes

• Use mod_rewrite to redirect URLs to adapted content

• Achieves up to a 10× increase in throughput under extreme adaptation
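A hedged sketch of the mod_rewrite redirection mentioned above, as an Apache configuration fragment; the ADAPT environment flag and the /adapted/ path are illustrative assumptions about how the mode switch might be wired up:

    # Assume a monitor sets the ADAPT variable when CPU utilization is high.
    RewriteEngine On
    RewriteCond %{ENV:ADAPT} =1
    # Serve pre-generated low-quality variants instead of the originals.
    RewriteRule ^/images/(.*)$ /adapted/images/$1 [L]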