Under the Covers with the Web

Under the Covers with the Web by Trevor Lohrbeer (@LabEscape, labescape.com; @FastFedora, fastfedora.com)

description

Walks through the basics of the HTTP protocol, URLs, cookies and caching, with tricks and tips that can be used by web developers. From a Geek.class I did on Oct 6, 2011 for Meet the Geeks.

Transcript of Under the Covers with the Web

Page 1: Under the Covers with the Web

Under the Covers with the Web

by Trevor Lohrbeer

@LabEscape

labescape.com

@FastFedora

fastfedora.com

Page 2: Under the Covers with the Web

Simple Request Walkthrough

1. Enter URL in browser

2. Browser sends request to server

3. Server sends response to browser

4. Browser renders page

Page 3: Under the Covers with the Web

Making the Request

1. Parse URL

2. Resolve domain to IP address

3. Open TCP/IP connection to server

4. Use HTTP to send a request

Page 4: Under the Covers with the Web

URL: Uniform Resource Locators

About URLs

Not web-specific; an Internet standard

Defined by Requests for Comments (RFCs)

Current standard is RFC 3986

Unlike other URIs, URLs also provide an access mechanism

Page 5: Under the Covers with the Web

URL Format

Format

<protocol>://<user>:<password>@<host>:<port>/<url-path>

http://myserver.com/some/search.php?query=help#ch5

http://admin:[email protected]:8000/some/path.php/info#ch5

http://68.15.13.12/

http://[2001:4860:0:2001::68]/

Parsing – Regular Expression

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

Where:

scheme = $2

authority = $4

path = $5

query = $7

fragment = $9

Details

Valid characters: a-z A-Z 0-9 - . _ ~

Special characters for HTTP: / ? & = , + #

Recommend %-encoding other characters (eg: %20)

Avoid URLs longer than 2,000 characters
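The parsing expression on this slide can be tried directly. The sketch below is in Python, which the talk itself does not use, and the helper name split_uri is made up for illustration. It applies the regular expression to the first example URL and percent-encodes a value containing a space.

```python
import re
from urllib.parse import quote

# The parsing regular expression quoted on the slide.
URI_PATTERN = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI into five components using capture groups 2, 4, 5, 7 and 9."""
    match = URI_PATTERN.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


parts = split_uri("http://myserver.com/some/search.php?query=help#ch5")
print(parts)

# Percent-encode anything outside the unreserved set (a space becomes %20).
encoded = quote("My Book", safe="")
print(encoded)
```

Components that are absent come back as None, so a missing query can be told apart from an empty one.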

Page 6: Under the Covers with the Web

Parsing an HTTP URL

<protocol>://<host>:<port>/<path>?<query>#<fragment>

Protocol

Either http or https

Host & Port (Authority)

Location of resource (default port 80 for http, 443 for https)

Path

Hierarchical string to resource

Query

Parameters to pass to resource

Fragment

Identifies subset of resource
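As a rough illustration of the breakdown above, Python's standard urllib.parse.urlsplit returns the same pieces. The function describe_http_url and the DEFAULT_PORTS table are hypothetical names for this sketch; the default ports are the ones listed on the slide.

```python
from urllib.parse import urlsplit

# Default ports from the slide: 80 for http, 443 for https.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_http_url(url):
    """Break an HTTP(S) URL into protocol, host, port, path, query and fragment."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"not an HTTP URL: {url!r}")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        # urlsplit reports None when no port is written in the URL
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(describe_http_url("https://www.fastfedora.com/some/path.php?id=5#ch5"))
```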

Page 7: Under the Covers with the Web

Protocols

HTTP

Original web protocol. Currently version 1.1.

HTTPS

HTTP tunneled within an SSL/TLS connection on port 443

S-HTTP

Older secure protocol – used port 80

SPDY

Faster version of HTTP used by Google Chrome

Page 8: Under the Covers with the Web

Hosts

Hosts can be:

DNS Names: localhost, www.fastfedora.com

WINS (NetBIOS) Names: WEBSERVER

IPv4 Addresses: 192.168.1.200

IPv6 Addresses: [2001:0db8:85a3::8a2e:0370:7334]

Other “Registered” Name

Tricks: Hardcode name to IP address in hosts / lmhosts

Windows: C:\WINDOWS\system32\drivers\etc

Linux: /etc/hosts

On Windows, use “nbtstat -R” to purge and reload the NetBIOS name cache
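To tell the host forms above apart in code, one option is Python's standard ipaddress module. This is only a sketch: classify_host is a hypothetical helper, and anything that is not a valid IP literal is simply treated as a name to hand to the resolver.

```python
import ipaddress


def classify_host(host):
    """Return 'IPv4', 'IPv6' or 'name' for the host part of a URL."""
    # IPv6 literals are written inside square brackets in URLs.
    literal = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        return f"IPv{ipaddress.ip_address(literal).version}"
    except ValueError:
        # Not an IP literal: a DNS, NetBIOS or other registered name.
        return "name"


for example in ("localhost", "192.168.1.200", "[2001:0db8:85a3::8a2e:0370:7334]"):
    print(example, "->", classify_host(example))
```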

Page 9: Under the Covers with the Web

Paths Hierarchical, separating path components by /

Can be empty (eg: http://example.com?id=5)

Start with first /

End with ?, # or end of URL

Absolute paths start with a leading /

Relative paths omit the leading /; the components . and .. refer to the current location and the parent location, respectively

For an application server, the path does not have to end at the application end-point; it can carry “path info” that extends past the resource, eg: /book/search.php/My%20Book/Paperback
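How relative and absolute paths resolve against the current location can be checked with Python's standard urljoin. The base URL and file names below are invented for the example.

```python
from urllib.parse import urljoin

# Hypothetical page the browser is currently showing.
base = "http://example.com/book/search.php"

same_dir = urljoin(base, "./list.php")            # . is the current location
parent = urljoin(base, "../images/logo.png")      # .. is the parent location
absolute = urljoin(base, "/about")                # leading / restarts at the root
bare = urljoin(base, "details.php")               # no prefix: relative to current location

for resolved in (same_dir, parent, absolute, bare):
    print(resolved)
```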

Page 10: Under the Covers with the Web

Queries Can be any text, eg: ?myQuery

Or uses name=value pairs separated by &, eg: ?query=myQuery

Names do not have to be unique, eg: ?stock=MSFT&stock=SUN is valid

Multiple values can be comma-separated for some application engines, eg: ?stock=MSFT,SUN

When using ampersands:

Use &amp; when using the URL in HTML / XML

Use & when using the URL elsewhere

Search engines used to skip URLs with query strings; they index them now, but still not well

Always encode =, & and # in any names or values
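The query rules above can be sketched with Python's standard library: parse_qs collects repeated names, urlencode escapes =, & and # inside values, and html.escape turns & into &amp; for use in HTML. The parameter names and path are illustrative only.

```python
from html import escape
from urllib.parse import parse_qs, urlencode

# Repeated names are valid; each name maps to a list of values.
repeated = parse_qs("stock=MSFT&stock=SUN")
print(repeated)

# =, & and # inside a value are percent-encoded so they cannot be
# mistaken for query or fragment delimiters.
query = urlencode({"query": "a=b&c#d", "stock": ["MSFT", "SUN"]}, doseq=True)
print(query)

# Inside HTML or XML the separating ampersands become &amp;.
href = escape("/search.php?" + query)
print(href)
```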

Page 11: Under the Covers with the Web

Fragments Refers to a subset or view of web page

Not indexed by search engines

Used to reference anchors in web pages, eg: #chapter2 links to <a name="chapter2"></a>

Used by AJAX to store state using JavaScript without refreshing the browser page – helps support bookmarks & browser history

• Obsoleted by pushState in HTML5

Used to increase SEO by canonicalizing URLs

Page 12: Under the Covers with the Web

Resolve Domain to IP address

DNS

Resolves names to IP addresses & vice versa

Not a 1-to-1 mapping between IPs & names

Results cached at many levels:

• Application (web browser, e-mail)

• Local OS resolver

• DNS server

• Authoritative nameserver

When moving hosting:

1. Reduce time-to-live (TTL) on DNS record to 1 day

2. Wait until old TTL expires (usually 1 week)

3. Move hosting to new IP address

4. Increase TTL back to 1 week
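To make the application-level cache concrete, here is a minimal sketch of a resolver that remembers answers for a fixed TTL. The class CachingResolver, its parameters and the 60-second default are assumptions for illustration; real browsers and OS resolvers are more involved and honor the TTL carried in each DNS record.

```python
import socket
import time


class CachingResolver:
    """Hypothetical application-level DNS cache with one fixed TTL for every name."""

    def __init__(self, ttl_seconds=60.0, lookup=None, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.lookup = lookup or self._system_lookup
        self.clock = clock
        self._cache = {}  # host -> (expiry time, list of addresses)

    @staticmethod
    def _system_lookup(host):
        # Ask the local OS resolver; one name may map to several addresses.
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def resolve(self, host):
        now = self.clock()
        cached = self._cache.get(host)
        if cached and cached[0] > now:
            return cached[1]  # still within the TTL: no lookup is made
        addresses = self.lookup(host)
        self._cache[host] = (now + self.ttl, addresses)
        return addresses
```

Until the TTL runs out, the cache keeps returning the old address even if the record has changed, which is why the TTL is lowered ahead of a hosting move.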

Page 13: Under the Covers with the Web

Open TCP/IP Connection

About TCP

Sits atop Internet Protocol (IP)

Provides reliable connection

Optimized for accuracy, not timeliness

Takes time to establish

Uses resources on the server & client to maintain

Can use a telnet client to establish

Connection

Requires a three-way handshake

• Client SYN, Server SYN-ACK, Client ACK

Uses an IP address + port for each endpoint (client & server)

Server port generally 80, client port generally > 1024
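The endpoint pair described above can be observed from code. In this sketch (Python, with a hypothetical helper describe_connection), the operating system performs the three-way handshake inside create_connection, and the socket then reports the IP address and port at each end.

```python
import socket


def describe_connection(host, port, timeout=5.0):
    """Open a TCP connection and report the (IP, port) pair of each endpoint."""
    # create_connection returns once the SYN / SYN-ACK / ACK handshake completes.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        client = conn.getsockname()[:2]  # local address and OS-chosen client port
        server = conn.getpeername()[:2]  # remote address and server port
    return {"client": tuple(client), "server": tuple(server)}
```

Calling describe_connection("example.com", 80) on a machine with network access should show port 80 on the server side and a high-numbered port picked by the OS on the client side.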

Page 14: Under the Covers with the Web

The HTTP Protocol

About HTTP

Plain-text format for sending & retrieving web content from a server

Defined by RFC 2616 (v1.1) and RFC 1945 (v1.0)

Request Format

1. Request Line

2. Headers

3. Blank Line

4. Optional Data

Response Format

1. Status Line

2. Headers

3. Blank Line

4. Response Body
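The two formats above are simple enough to assemble and take apart by hand. The following Python sketch uses made-up helper names (build_request, parse_response) and deliberately ignores details such as repeated headers and chunked bodies.

```python
def build_request(method, target, headers, body=b""):
    """Request line, headers, blank line, optional data."""
    lines = [f"{method} {target} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"  # the blank line ends the headers
    return head.encode("ascii") + body


def parse_response(raw):
    """Status line, headers, blank line, response body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    # Assumes the status line carries a reason phrase, eg: "HTTP/1.1 200 OK".
    version, status, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # A repeated header (eg: Set-Cookie) would overwrite the earlier one here.
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body


request = build_request("GET", "/path/myPage.php?id=15", {"Host": "www.mysite.com"})
print(request.decode("ascii"))
```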

Page 15: Under the Covers with the Web

Debugging HTTP

Browser Plugins

Firefox: LiveHeaders

Chrome: Go to chrome://net-internals/#events & filter by URL_REQUEST

Internet Explorer: IEWatch, ieHTTPHeaders

Web Debugging Proxies

Fiddler

Charles

Network Packet Analyzer

Wireshark

Telnet Client

Custom Code

Page 16: Under the Covers with the Web

HTTP Request Methods

Read-Only Methods

GET, HEAD, OPTIONS, TRACE

Idempotent Methods

PUT, DELETE

Write Methods

POST

Page 17: Under the Covers with the Web

Common Request Headers

Host

Name of the host the request is being sent to

Cache-Control / Pragma

Determines caching behavior

Accept / Accept-Charset / Accept-Encoding / Accept-Language

Which MIME types, character sets, encodings (compression techniques) and languages to use in the response

If-Modified-Since / If-None-Match

Only send the full response if the resource has been modified since this date, or no longer matches this entity tag

Referer

The URL that pointed to this URL

User-Agent

The identifier for the browser or program accessing the URL

Cookie

Sends any cookies previously set to server

Page 18: Under the Covers with the Web

Basic GET Request

GET /path/myPage.php?id=15 HTTP/1.1
Host: www.mysite.com

Page 19: Under the Covers with the Web

Typical GET Request

GET / HTTP/1.1
Host: www.yahoo.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: B=5uh7jf9466q3e&b=4&d=zrFdfgYnwtpYEIRQwQyo3Yh8W.6mk7zDmSWeQ--&s=fs&i=Yq8FH2OqgDl7KB7VK8; YLS==1&p=0&n=0; s_vsn_yahoogroupsygprod_1=687374583019; FPS=dl; F=a=e92100wTFmxh9hOe34539Z6Yp2uybfZ.8pX0oke5pYRVFcz4dYgR74..vuqgLRABfa5K8IfW3tJLSP3LyikNIAO2234sAesM7spi4di.8BaE&b=vZaW; ALP=bTowJmw6ZW5fVVMm; F=a=e92100wTF; Qur=RAB234fa5K8IfW3tJLSP3L; yikNI23AOwVsAesM7spi4di.8BaE&b=vZaW; ALP=bTowJmw2346ZW5fVVMm;

Page 20: Under the Covers with the Web

Basic Request Notes

Determine locale using Accept-Language

• PHP: Locale::acceptFromHttp()

• ASP.Net: Request.UserLanguages

• J2EE: ServletRequest.getLocales()

Use separate domain for static content to avoid overhead of sending cookies

• For instance: static.example.com

• Or use separate domain entirely: example-images.com

• When setting cookies, use a specific domain, eg: www.example.com not example.com

Turn on gzip compression

• Always for uncompressed static files

• Never for compressed static files (images, PDFs)

• Sometimes for dynamic responses
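The framework calls listed for Accept-Language all do roughly the same job: split the header, read each q value, and order the languages by preference. A hand-rolled Python sketch of that idea follows; parse_accept_language is a hypothetical helper, not an API of any of those frameworks.

```python
def parse_accept_language(header):
    """Return (language, quality) pairs, most preferred first."""
    preferences = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        if not tag:
            continue
        quality = 1.0  # a language with no q parameter counts as q=1
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unreadable q value as "not wanted"
        preferences.append((tag.strip().lower(), quality, position))
    # Highest quality first; ties keep the order the client sent.
    preferences.sort(key=lambda pref: (-pref[1], pref[2]))
    return [(tag, quality) for tag, quality, _ in preferences if quality > 0]


print(parse_accept_language("en-us,en;q=0.5"))
```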

Page 21: Under the Covers with the Web

Typical POST Request

POST /data-generator/DataGenerator/Projects/Edit.aspx?ID=1 HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://localhost/data-generator/DataGenerator/Projects/Edit.aspx?ID=1
Content-Type: application/x-www-form-urlencoded
Content-Length: 411

__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKMTE3MDA1NzgzNWRkD7Ixn245wh8y%2BUk486haBrWD82I%3D&__EVENTVALIDATION=%2FwEWBgKe67rCCwLCqe1mAubp%2BvAPAtn1w6cBArrGt4UFAsjVgtcCuHxmC3pWm2jBT5ZlGoNznlygJIk%3D&ctl00%24MainContent%24ID=1&ctl00%24MainContent%24Name=Project+Portfolio+Management&ctl00%24MainContent%24Description=A+general+project+portfolio+management+data+set.&ctl00%24MainContent%24SaveButton=Save

Page 22: Under the Covers with the Web

XML POST Request

POST /data-generator/DataGenerator/Projects/Edit.aspx?ID=1 HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Content-Type: text/xml
Content-Length: 60

<xmlRoot>
  <sample message="This is a test" />
</xmlRoot>

Page 23: Under the Covers with the Web

More Request Notes

POST requests are rarely, if ever, cached

• Some clients refuse to cache POST requests even if headers are sent

GET requests are better for caching

• Some proxies won’t cache GET requests if they contain cookies though

For AJAX, don’t send content, just the URL with query parameters (eg: GET or POST with no data)

Any content can be sent in GET or POST requests, just specify a content-type
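The last note can be sketched as follows: the same POST skeleton carries either form data or XML, and only the Content-Type and Content-Length change. This is Python with a hypothetical build_post helper; the field names are borrowed from the form example in this deck, and the whitespace inside the XML body is an assumption.

```python
from urllib.parse import urlencode


def build_post(path, host, content_type, body):
    """Assemble a raw POST request; Content-Length is the body size in bytes."""
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


PATH = "/data-generator/DataGenerator/Projects/Edit.aspx?ID=1"

# Form fields are sent as name=value pairs, percent-encoded like a query string.
form_body = urlencode({
    "ctl00$MainContent$ID": "1",
    "ctl00$MainContent$Name": "Project Portfolio Management",
}).encode("ascii")
form_request = build_post(PATH, "localhost", "application/x-www-form-urlencoded", form_body)

# Any other payload works the same way once the content type says what it is.
xml_body = b'<xmlRoot>\r\n  <sample message="This is a test" />\r\n</xmlRoot>'
xml_request = build_post(PATH, "localhost", "text/xml", xml_body)

print(form_request.decode("ascii"))
```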

Page 24: Under the Covers with the Web

Server Processes Request

1. Parses input stream

2. Creates request & response objects

3. Sends request through pipeline

4. Maps to content handler

• Requests handled internally or through CGI, ISAPI or other module

• IIS 7 uses integrated pipeline for all requests

5. Receives content from handler (stream or fixed)

6. Sends response to client

7. Logs request
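Steps 4 through 7 can be mimicked in a few lines. The sketch below uses Python's WSGI interface as a stand-in for CGI or ISAPI, which is a substitution and not something the talk covers; the routes, handler names and in-memory log are all invented for illustration.

```python
def hello_handler(environ):
    """A content handler: returns status, headers and body for one request."""
    return "200 OK", [("Content-Type", "text/plain; charset=utf-8")], b"hello"


def not_found_handler(environ):
    return "404 Not Found", [("Content-Type", "text/plain; charset=utf-8")], b"not found"


ROUTES = {"/hello": hello_handler}  # hypothetical path-to-handler mapping
ACCESS_LOG = []                     # stands in for the server's log file


def application(environ, start_response):
    # Steps 1-3 (parsing, request object, pipeline) are done by the WSGI
    # server, which hands the parsed request over as `environ`.
    path = environ.get("PATH_INFO", "/")
    handler = ROUTES.get(path, not_found_handler)    # 4. map to content handler
    status, headers, body = handler(environ)         # 5. receive content
    headers.append(("Content-Length", str(len(body))))
    start_response(status, headers)                  # 6. send response
    ACCESS_LOG.append(f"{environ['REQUEST_METHOD']} {path} -> {status}")  # 7. log
    return [body]
```

To try it against a real socket, the standard wsgiref.simple_server.make_server("", 8000, application).serve_forever() call will serve it locally.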

Page 25: Under the Covers with the Web

Create Request & Response Objects

PHP

$_SERVER

$_REQUEST

$_GET

$_POST

$_COOKIE

$_FILES

$_SESSION

$_ENV

$HTTP_RAW_POST_DATA

ASP.Net

Request

.Params

.ServerVariables

.QueryString

.Form

.Cookies

.Files

.Headers

.BinaryRead()

Application

Session

Response

J2EE

HttpServletRequest

.getParameterMap()

.getQueryString()

.getCookies()

.getSession()

.getHeaderNames()

.getInputStream()

HttpServletResponse

Page 26: Under the Covers with the Web

Common Response Headers

Server

The identifier for the server sending the response

Date

The date of this response as represented by the server

Cache-Control

How this response should be cached

Last-Modified

The date the resource was last modified. Used for caching

ETag

The entity tag. Uniquely identifies a version of the resource

Expires

The date when this response expires, or 0 to expire immediately

Content-Type

The MIME type of the content being sent. May include the character set

Content-Length

The length of the content being sent. Optional & may not be sent

Location

Redirect the client to this new URL. Can be a permanent or temporary redirect

Set-Cookie

Sets a cookie on the client

Page 27: Under the Covers with the Web

Sample Response

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
Date: Thu, 06 Oct 2011 15:43:53 GMT
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 8313

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
</html>

Page 28: Under the Covers with the Web

Cookies

About Cookies

Defined by RFC 6265

Used to provide state to HTTP

Often used to track session IDs

Multiple Set-Cookie headers can be sent in a response

Can be sent with any response code, though can be ignored when sending 1xx response codes

Not supposed to affect caching, but may

Sending a Cookie

Set-Cookie: <name>=<value>; <attribute>; <attribute>
Set-Cookie: <name2>=<value2>; <attribute>

Receiving a Cookie

Cookie: <name>=<value>; <name2>=<value2>

Page 29: Under the Covers with the Web

Cookie Attributes

Expires

The date upon which the cookie expires. Cookies may be expunged before this date by the client

Max-Age

The number of seconds until the cookie expires. Max-Age overrides the Expires attribute

Domain

The hosts for which the cookie will be sent. Must be second-level or greater & include the originating server in definition

Path

The path for which the cookie will be sent. Defaults to path of requested page

Secure

Only transmit the cookie for a secure protocol, eg: https

HttpOnly

Prevent the cookie from being accessible to scripts, applets & plugins
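For a feel of how these attributes end up on the wire, Python's standard http.cookies module can build a Set-Cookie value and read a Cookie header back. The cookie name "session", the domain and the one-hour lifetime are arbitrary choices for this sketch.

```python
from http.cookies import SimpleCookie


def make_session_cookie(session_id):
    """Build the value of a Set-Cookie header for a hypothetical session cookie."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["domain"] = "www.example.com"  # specific host, not example.com
    cookie["session"]["path"] = "/"
    cookie["session"]["max-age"] = 3600              # seconds until the cookie expires
    cookie["session"]["secure"] = True               # only send over https
    cookie["session"]["httponly"] = True             # hide from scripts and plugins
    return cookie["session"].OutputString()


def read_cookies(cookie_header):
    """Parse the Cookie request header into a name -> value dictionary."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


print("Set-Cookie:", make_session_cookie("abc123"))
print(read_cookies("name=value; name2=value2"))
```

Note that the browser sends back only names and values; the attributes travel in one direction, from server to client.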

Page 30: Under the Covers with the Web

Cookie Issues

Cookies get sent in clear text. Use HTTPS with the Secure flag to make them secure

If not using HttpOnly, cookies can be read and sent by JavaScript

Cookies can be cached by proxy servers, allowing users to access others’ sessions

Separation of authentication from destination allows an attacker to redirect a user to a web site and let the browser send the right cookie

When using wildcard domains, cookies can be overwritten by applications using another sub-domain

Page 31: Under the Covers with the Web

Disable HttpOnly for Session Cookies

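ASP.Net

Add to Global.asax:

// Needs: using System; using System.Web.Security;
protected void Application_EndRequest(object s, EventArgs e)
{
    if (Response.Cookies.Count > 0)
    {
        foreach (string name in Response.Cookies.AllKeys)
        {
            // Match the forms-authentication cookie and the session cookie
            if (name == FormsAuthentication.FormsCookieName ||
                name.ToLower() == "asp.net_sessionid")
            {
                // false lets scripts and plugins read the cookie
                Response.Cookies[name].HttpOnly = false;
            }
        }
    }
}

J2EE

Add to web.xml:

<session-config>
  <cookie-config>
    <!-- false disables HttpOnly; true enables it -->
    <http-only>false</http-only>
  </cookie-config>
</session-config>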

Page 32: Under the Covers with the Web

Caching

Who caches?

Browsers

Plug-ins

Proxies

Gateways

Reverse Proxies

Why cache?

Reduce latency

Reduce network traffic

Reduce server load

Page 33: Under the Covers with the Web

Routing: A Straight Connection

Page 34: Under the Covers with the Web

Routing: Web Proxy

Page 35: Under the Covers with the Web

Routing: Web Proxy Revalidates

Page 36: Under the Covers with the Web

Routing: Reverse Proxy

Page 37: Under the Covers with the Web

Basics of Caching

Key Questions

Should I cache this?

Where should I cache this?

How long should I cache it for?

Can I check to see if it’s still valid?

When should I get rid of it?

Concepts

Directives

Freshness

Validation

Invalidation

Page 38: Under the Covers with the Web

Should I Cache?

Caching enabled by default

• Status codes 200, 203, 206, 300, 301, 410

• Not enabled if Authorization header present

Cache-Control controls caching in HTTP 1.1

• public, private, no-cache

• no-store, no-transform

• max-age, min-fresh, max-stale, s-maxage

• only-if-cached

• must-revalidate, proxy-revalidate

Pragma controls caching in HTTP 1.0

• no-cache

HTML META element generally only works for browsers
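The "should I cache?" decision can be written down as a small function. The sketch below is a simplification under stated assumptions: may_store and parse_cache_control are hypothetical helpers, header names are expected in their canonical spelling, and several rules from the HTTP specification are left out.

```python
# Status codes the slide lists as cacheable by default.
CACHEABLE_BY_DEFAULT = {200, 203, 206, 300, 301, 410}


def parse_cache_control(value):
    """Turn 'public, max-age=600' into {'public': None, 'max-age': '600'}."""
    directives = {}
    for part in value.split(","):
        name, _, argument = part.strip().partition("=")
        if name:
            directives[name.lower()] = argument.strip('"') or None
    return directives


def may_store(status, request_headers, response_headers, shared_cache=True):
    """Decide whether a cache may keep a copy of this response."""
    directives = parse_cache_control(response_headers.get("Cache-Control", ""))
    if "no-store" in directives:
        return False
    if shared_cache and "private" in directives:
        return False  # only the user's own browser may keep it
    if shared_cache and "Authorization" in request_headers and "public" not in directives:
        return False
    # no-cache does not forbid storing; it forces revalidation before reuse.
    explicit = (
        "max-age" in directives
        or "s-maxage" in directives
        or "public" in directives
        or "Expires" in response_headers
    )
    return status in CACHEABLE_BY_DEFAULT or explicit
```

Passing shared_cache=False models a browser cache, which is allowed to keep responses marked private.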

Page 39: Under the Covers with the Web

How long to cache for?

Last-Modified

Indicates the date a resource was last modified

Expires

Indicates the date when the response expires. Implies the response is cacheable, unless overridden by a Cache-Control header. The max-age Cache-Control directive overrides this value when determining how long to cache

Age header

Indicates how long a response has been in the cache. Sent by a web cache when sending a response
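How max-age, Expires and Age combine can be shown with a little arithmetic. This Python sketch uses invented helper names (freshness_lifetime, is_fresh) and skips heuristics such as guessing a lifetime from Last-Modified.

```python
import re
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Seconds a response stays fresh, or None when no lifetime is stated."""
    cache_control = headers.get("Cache-Control", "")
    match = re.search(r"(?:^|,)\s*max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    if match:
        return int(match.group(1))  # max-age overrides Expires
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
        except (TypeError, ValueError):
            return 0  # an unparseable date such as "0" means already expired
        return max(0, int((expires - date).total_seconds()))
    return None


def is_fresh(headers):
    """True while the Age reported by the cache is below the freshness lifetime."""
    lifetime = freshness_lifetime(headers)
    if lifetime is None:
        return False
    return int(headers.get("Age", "0")) < lifetime
```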

Page 40: Under the Covers with the Web

Revalidating Caches

ETag

If-Match

If-None-Match

If-Modified-Since

If-Unmodified-Since
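On the server side, revalidation comes down to comparing the validators the client sends with the current state of the resource. A minimal Python sketch, with a hypothetical revalidate function that treats entity tags as plain strings and ignores weak tags:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def revalidate(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, otherwise 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # The entity tag takes precedence over the date when both are sent.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if "*" in candidates or etag in candidates else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            unchanged = last_modified <= parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unreadable date: send the full response
        return 304 if unchanged else 200
    return 200


modified = datetime(2011, 10, 6, 15, 0, 0, tzinfo=timezone.utc)
print(revalidate({"If-None-Match": '"v1"'}, '"v1"', modified))
```

A 304 response carries no body, so the client reuses its cached copy and only the headers cross the network.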

Page 41: Under the Covers with the Web

Other Tricks

Content-Disposition

Force a Save As dialog to appear

Content-Transfer-Encoding: binary
Content-Disposition: attachment; filename=data.csv

Server Push

Use multipart/x-mixed-replace MIME type when pipelining

Not supported by Internet Explorer

Page 42: Under the Covers with the Web

Thanks for attending!

Slides: blog.fastfedora.com

Trevor Lohrbeer

@LabEscape

labescape.com

@FastFedora

fastfedora.com