Empirical Quantification of Opportunities for Content Adaptation
in Web Servers
Michael Gopshtein and Dror Feitelson
School of Engineering and Computer Science
The Hebrew University of Jerusalem
Supported by a grant from the Israel Internet Association
Capacity Planning
• The problem:
  – Required capacity for flash crowds cannot be anticipated in advance
  – Even capacity for daily fluctuations is highly wasteful
• Academic solution: use admission control
• Business practice: unacceptable to reject any clients
  – Especially in cases of a surge in traffic
Content Adaptation
• Trade off quality for throughput
  – Installed capacity matches the normal load
  – Handle abnormal load by reducing quality
  – But still manage to provide meaningful service to all clients
• Assumes normal optimizations have already been made
  – Compress or combine images, promote caching, …
  – Empirically this is usually not the case
Content Adaptation
• Maintain the invariant:

    rate of requests × cost per request ≤ capacity

• Need to change quality (and cost!) of content
  – Prepare multiple versions in advance
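The invariant can be stated as a one-line check; a minimal sketch with hypothetical load and capacity numbers (the 500/2000 req/s figures and cost units are illustrative, not from the study):

```python
# Capacity invariant from the slide:
#   rate of requests * cost per request <= capacity
# All numbers below are hypothetical, for illustration only.

def within_capacity(rate_rps: float, cost_per_request: float, capacity: float) -> bool:
    """Return True if the server can sustain the offered load."""
    return rate_rps * cost_per_request <= capacity

# Normal load: 500 req/s at cost 1.0 each, capacity 1000 units/s.
assert within_capacity(500, 1.0, 1000)

# Flash crowd: 2000 req/s exceeds capacity at full quality...
assert not within_capacity(2000, 1.0, 1000)

# ...but fits if content adaptation cuts the per-request cost to 0.4.
assert within_capacity(2000, 0.4, 1000)
```

Since capacity is fixed once installed, the only controllable term during a flash crowd is the per-request cost, which is exactly what content adaptation changes.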
The Questions
• What are the main costs in web service?
  – Is the bottleneck the CPU / network / disk?
  – What do we gain by eliminating HTTP requests?
  – What do we gain by reducing file sizes?
• What can realistically be done?
  – What is the structure of a “random” site?
  – How much can we reduce quality?
Assumption: static web pages only
Measuring Random Web Sites
• http://en.wikipedia.org/wiki/Special:Random
• Use title of page as input to Google search
• Extract domain of first link to get home page
• Retrieve it using IE
• Collect statistical data by intercepting system calls to send and receive
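The first steps of this sampling pipeline can be sketched in code; a minimal sketch assuming Python's urllib in place of IE, and reading the redirect target of Special:Random to obtain a title (the helper names `random_article_title` and `home_page` are invented for illustration, and system-call interception is not reproduced here):

```python
import urllib.request
from urllib.parse import urlparse

RANDOM_PAGE = "http://en.wikipedia.org/wiki/Special:Random"

def random_article_title() -> str:
    """Follow Special:Random and recover the article title from the final URL."""
    with urllib.request.urlopen(RANDOM_PAGE) as resp:
        # resp.url is e.g. https://en.wikipedia.org/wiki/Some_Article
        return resp.url.rsplit("/", 1)[-1].replace("_", " ")

def home_page(url: str) -> str:
    """Reduce an arbitrary link (e.g. the first search hit) to the site's home page."""
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}/"

# A deep link found via search collapses to the site's home page:
assert home_page("http://www.example.com/articles/2009/sample.html") == "http://www.example.com/"
```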
Retrieved Component Sizes
[Figure: distribution of retrieved component sizes; the tail shown is only 0.02% of the components]
• A ¼ of the total data comes from components larger than 200 KB
Network Bandwidth
• Typical Ethernet packets are 1526 bytes
  – Ethernet and TCP/IP headers require 54 bytes
  – HTTP response headers require 280–325 bytes
• Most components fit into a few packets
  – 43% fit into a single packet
  – 24% more fit into 2 packets
Save bandwidth by reducing the number of small components or the size of large components
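The packet arithmetic above can be made concrete; a small sketch using this slide's numbers, assuming ~300 bytes as a midpoint of the 280–325 byte HTTP header range and that the HTTP headers share the first packet with the data:

```python
# Per-component packet count from the figures on this slide.
FRAME = 1526          # typical Ethernet packet, bytes
LINK_HEADERS = 54     # Ethernet + TCP/IP headers, bytes
HTTP_HEADERS = 300    # assumed midpoint of the 280-325 byte range
PAYLOAD = FRAME - LINK_HEADERS  # usable bytes per packet (1472)

def packets_needed(component_bytes: int) -> int:
    """Packets to carry one HTTP response for a component of the given size."""
    total = component_bytes + HTTP_HEADERS
    return -(-total // PAYLOAD)  # ceiling division

# A ~1 KB icon fits in a single packet; a 200 KB image needs ~140 packets.
assert packets_needed(1000) == 1
assert packets_needed(200 * 1024) == 140
```

This shows why both ends of the size distribution matter: a small component still costs a whole packet plus headers, while a large component dominates raw bandwidth.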
Locality and Caching
• Flash crowds typically involve a very small number of pages (possibly the home page)
• Servers allocate GB of memory for cache
• This is enough for thousands of files
Disk is not expected to be a bottleneck
CPU Overhead
• CPU usage reflects several activities
  – Opening the TCP connection
  – Processing the request
  – Sending the data
• Measure using combinatorial microbenchmarks
  – Open connection only
  – One extremely large file
  – Many small files
  – Many requests for a non-existent file
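One of these microbenchmark variants (connection-only) might look like the following; a sketch using Python's built-in http.server as a stand-in for the measured web server and time.process_time() as a rough CPU proxy, whereas the study itself intercepted the send/receive system calls:

```python
import http.server
import socket
import threading
import time

def start_test_server() -> int:
    """Start a throwaway local HTTP server on a free port; return the port."""
    httpd = http.server.HTTPServer(
        ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd.server_address[1]

def connect_only(port: int, n: int) -> float:
    """Microbenchmark variant 1: open (and immediately close) n TCP connections."""
    start = time.process_time()
    for _ in range(n):
        socket.create_connection(("127.0.0.1", port)).close()
    return time.process_time() - start
```

The other variants differ only in what is done after connecting (fetch one huge file, many small files, or repeated requests for a missing file), which isolates the connection, processing, and transfer costs from each other.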
CPU Overhead
Example: a single 10 KB file

  Establishing connection   25%
  Processing request        72%
  Data transfer              3%

• Processing and transfer costs become equal at 240 KB
  – Only 0.3% of files are that big
If CPU is the bottleneck, need to reduce the number of requests
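The 240 KB figure follows directly from the 10 KB breakdown above: the fixed processing cost (72%) matches the size-proportional transfer cost (3% per 10 KB) once the file is 72/3 times larger.

```python
# Break-even size derived from the 10 KB example on this slide.
base_size_kb = 10
processing_share = 72.0   # percent of total cost for a 10 KB file
transfer_share = 3.0      # percent of total cost for a 10 KB file

# Transfer cost scales with size; it equals the processing cost when
# the file is (72 / 3) times larger than 10 KB.
break_even_kb = base_size_kb * processing_share / transfer_share
assert break_even_kb == 240.0
```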
Guidelines
• Either the CPU or the network is the bottleneck
• Network bandwidth saved by reducing large components
• CPU saved by eliminating small components
• Maintaining “acceptable” quality is subjective
Eliminating Images
• Images have many functions
  – Story (main illustrative item)
  – Preview (for another page)
  – Commercial
  – Logo
  – Decoration (bullets, background)
  – Navigation (buttons, menus)
  – Text (special formatting)
• Some can be eliminated or replaced
Distribution of Types
• Manually classified 959 images from 30 random sites
  – 50% decoration
  – 18% preview
  – 11% commercial
  – 6% logo
  – 6% text
Automatic Identification
• Decorations are candidates for elimination
• Identified by a combination of attributes:
  – Use the GIF format
  – Appear in HTML tags other than <IMG>
  – Appear multiple times in the same page
  – Small original size
  – Displayed size much bigger than the original
  – Large change in aspect ratio when displayed
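A heuristic combining these attributes might be sketched as follows; the attribute names, thresholds, and the "3 of 5 signals" rule below are illustrative assumptions, not the study's exact classifier:

```python
def looks_like_decoration(img: dict) -> bool:
    """Flag likely decorations by combining the attributes from the slide."""
    signals = [
        img.get("format") == "gif",
        img.get("tag", "img").lower() != "img",          # e.g. a CSS background
        img.get("occurrences", 1) > 1,                   # repeated on the same page
        img.get("bytes", 0) < 2048,                      # small original (assumed threshold)
        img.get("displayed_area", 0) > 4 * img.get("natural_area", 1),  # stretched
    ]
    return sum(signals) >= 3  # require several signals to agree (assumed rule)

# A tiny, repeated GIF bullet in a list tag is flagged as decoration;
# a large one-off JPEG story photo is not.
bullet = {"format": "gif", "tag": "li", "occurrences": 12, "bytes": 120,
          "displayed_area": 100, "natural_area": 100}
photo = {"format": "jpeg", "tag": "img", "occurrences": 1, "bytes": 80000,
         "displayed_area": 90000, "natural_area": 90000}
assert looks_like_decoration(bullet)
assert not looks_like_decoration(photo)
```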
Auxiliary Files
• JavaScript
  – May be crucial for page function
  – Impossible to understand automatically
• CSS (style sheets)
  – May be crucial for page structure
  – May be possible to identify those parts that are used
Auxiliary Files
• Cannot be eliminated
• Common wisdom: use separate files
  – Allow caching at the client
  – Save retransmission with each page
• Alternative: embed in the HTML
  – Reduce the number of requests
  – May be better for flash crowds that do not request multiple pages
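The embedding alternative can be sketched in a few lines; this sketch assumes the auxiliary files are available locally, handles only simple `<script src="...">` tags with a naive regex, and ignores the CSS `<link>` case, which a real implementation would also cover:

```python
import re

def inline_scripts(html: str, files: dict) -> str:
    """Replace <script src="X"></script> with the contents of file X,
    so a flash-crowd visitor issues one request instead of many."""
    def repl(m):
        body = files.get(m.group(1))
        return f"<script>{body}</script>" if body is not None else m.group(0)
    return re.sub(r'<script src="([^"]+)"></script>', repl, html)

page = '<html><head><script src="menu.js"></script></head></html>'
out = inline_scripts(page, {"menu.js": "var x = 1;"})
assert out == "<html><head><script>var x = 1;</script></head></html>"
```

The trade-off matches the slide: the embedded page is larger and cannot cache the script separately, but for a one-page flash crowd that cache would never have been reused anyway.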
Text and HTML
• Some areas may be eliminated under extreme conditions
  – Commercials
  – Some previews and navigation options
• Often encapsulated in <DIV> tags
• Sometimes identified by ID or class names, e.g. “sidebanner”
  – Especially when using modular design
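Hiding such a block by its class name can be sketched as follows, using the slide's "sidebanner" example; the regex is a naive sketch that assumes the target `<div>` contains no nested `<div>`s, and a production version would use a real HTML parser:

```python
import re

def hide_block(html: str, class_name: str) -> str:
    """Remove a <div> with the given class (assumes no nested <div>s inside it)."""
    pattern = rf'<div[^>]*class="{class_name}"[^>]*>(?:(?!</?div).)*</div>'
    return re.sub(pattern, "", html, flags=re.DOTALL)

page = '<body><div class="sidebanner"><img src="ad.gif"></div><p>Story</p></body>'
assert hide_block(page, "sidebanner") == "<body><p>Story</p></body>"
```

This is why modular designs with meaningful ID/class names adapt well: the server can drop whole blocks without understanding their content.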
Content Adaptation
• Degraded content is usually better than exclusion
• The only way to handle flash crowds that overwhelm the installed capacity
• Empirical results identify the main options:
  – Identify and eliminate decorations
  – Compress large images (story, commercial)
  – Embed JavaScript and CSS
  – Hide unnecessary blocks