Under the Covers with the Web
by Trevor Lohrbeer
@LabEscape
labescape.com
@FastFedora
fastfedora.com
Simple Request Walkthrough
1. Enter URL in browser
2. Browser sends request to server
3. Server sends response to browser
4. Browser renders page
Making the Request
1. Parse URL
2. Resolve domain to IP address
3. Open TCP/IP connection to server
4. Use HTTP to send a request
URL: Uniform Resource Locators
About URLs
Not web-specific
Internet standard
Defined by Request For Comments (RFCs)
Current standard is RFC 3986
Unlike URIs, URLs provide an access mechanism
URL Format
Format
<protocol>://<user>:<password>@<host>:<port>/<url-path>
http://myserver.com/some/search.php?query=help#ch5
http://admin:[email protected]:8000/some/path.php/info#ch5
http://68.15.13.12/
http://[2001:4860:0:2001::68]/
Parsing – Regular Expression
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
Where:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
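As a rough illustration of the regular expression above, here is a minimal Python sketch that applies it and reads out the numbered groups. The helper name split_url is made up for this example; the pattern itself is the one from RFC 3986, Appendix B.

```python
import re

# The generic URI pattern from RFC 3986, Appendix B (same as on the slide).
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_url(url):
    """Split a URL into its five generic components (hypothetical helper)."""
    # Every part of the pattern is optional, so match() always succeeds.
    match = URI_PATTERN.match(url)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


parts = split_url("http://myserver.com/some/search.php?query=help#ch5")
for name, value in parts.items():
    print(name, "=", value)
```

Components that are absent come back as None, except the path, which is an empty string when missing.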
Details
Valid characters: a-z A-Z 0-9 - . _ ~
Special characters for HTTP: / ? & = , + #
Recommend % encoding other characters (eg: %20)
Avoid URLs longer than 2,000 characters
Parsing an HTTP URL
<protocol>://<host>:<port>/<path>?<query>#<fragment>
Protocol
Either http or https
Host & Port (Authority)
Location of resource (default port 80 for http, 443 for https)
Path
Hierarchical string to resource
Query
Parameters to pass to resource
Fragment
Identifies subset of resource
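For comparison, a small sketch of the same breakdown using Python's standard urllib.parse module, filling in the default port when none is given. The function name describe_http_url and the DEFAULT_PORTS table are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Default ports named on the slide above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_http_url(url):
    """Return the parts of an HTTP URL as a dict (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,  # brackets around IPv6 literals are removed
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(describe_http_url("https://www.fastfedora.com/some/path.php?query=help#ch5"))
print(describe_http_url("http://68.15.13.12:8000/"))
```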
Protocols
HTTP
Original web protocol. Currently version 1.1.
HTTPS
HTTP tunneled within an SSL/TLS connection on port 443
S-HTTP
Older secure protocol – used port 80
SPDY
Faster version of HTTP used by Google Chrome
Hosts
Hosts can be:
DNS Names: localhost, www.fastfedora.com
WINS (NetBIOS) Names: WEBSERVER
IPv4 Addresses: 192.168.1.200
IPv6 Addresses: [2001:0db8:85a3::8a2e:0370:7334]
Other “Registered” Names
Tricks: Hardcode name to IP address in hosts / lmhosts
Windows: C:\WINDOWS\system32\drivers\etc
Linux: /etc/hosts
On Windows, use “nbtstat -R” to purge and reload the NetBIOS name cache
Paths
Hierarchical, with path components separated by /
Can be empty (eg: http://example.com?id=5)
Start with first /
End with ?, # or end of URL
Absolute paths start with leading /
Relative paths can start with . or .. to refer to current location or parent location, respectively
On an application server, the path does not have to end at the application end-point; it can carry “path info” that extends past the resource, eg: /book/search.php/My%20Book/Paperback
Queries
Can be any text, eg: ?myQuery
Or use name=value pairs separated by &, eg: ?query=myQuery
Names do not have to be unique, eg: ?stock=MSFT&stock=SUN is valid
Multiple values can be comma-separated for some application engines, eg: ?stock=MSFT,SUN
When using ampersands:
• Use &amp; when using the URL in HTML / XML
• Use & when using the URL elsewhere
Search engines used to not index URLs with query strings, and still do not index them well
Always encode =, & and # in any names or values
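The query rules above can be tried out with Python's standard library. This is only a sketch; the sample names and values are invented for illustration.

```python
from html import escape
from urllib.parse import parse_qs, quote, urlencode

# Repeated names are kept: each name maps to a list of values.
print(parse_qs("stock=MSFT&stock=SUN"))  # {'stock': ['MSFT', 'SUN']}

# urlencode percent-encodes =, & and # inside names and values.
query = urlencode({"query": "a=b&c#d"})
print(query)  # query=a%3Db%26c%23d

# A list of pairs produces repeated names.
repeated = urlencode([("stock", "MSFT"), ("stock", "SUN")])
print(repeated)  # stock=MSFT&stock=SUN

# quote() is the path-style encoder: a space becomes %20.
print(quote("My Book"))  # My%20Book

# When the URL is embedded in HTML or XML, the ampersand itself is escaped.
print(escape("/search?" + repeated))  # /search?stock=MSFT&amp;stock=SUN
```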
Fragments
Refers to a subset or view of web page
Not indexed by search engines
Used to reference anchors in web pages, eg: #chapter2 links to <a name="chapter2"></a>
Used by AJAX to store state using JavaScript without refreshing the browser page – helps support bookmarks & browser history
Obsoleted by pushState in HTML 5
Used to increase SEO by canonicalizing URLs
Resolve Domain to IP address
DNS
Resolves names to IP addresses & vice versa
Not a 1-to-1 mapping between IPs & names
Results cached at many levels:
• Application (web browser, e-mail)
• Local OS resolver
• DNS server
• Authoritative nameserver
When moving hosting:
1. Reduce time-to-live (TTL) on DNS record to 1 day
2. Wait until old TTL expires (usually 1 week)
3. Move hosting to new IP address
4. Increase TTL back to 1 week
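A minimal sketch of name resolution from application code, using the local OS resolver through Python's socket module. The helper name resolve is an assumption for this example, and an IP literal is used so the snippet runs without network access; a DNS name such as www.fastfedora.com goes through the same call.

```python
import socket


def resolve(host, port=80):
    """Return the distinct IP addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for family, socktype, proto, canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:  # one name may map to several addresses
            addresses.append(ip)
    return addresses


# An IP literal resolves to itself without a DNS lookup.
print(resolve("127.0.0.1"))
```

Results returned this way may come from any of the cache levels listed above, so a changed DNS record is not necessarily visible right away.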
Open TCP/IP Connection
About TCP
Sits atop Internet Protocol (IP)
Provides reliable connection
Optimized for accuracy, not timeliness
Takes time to establish
Uses resources on the server & client to maintain
Can use a telnet client to establish
Connection
Requires a three-way handshake
• Client SYN, Server SYN-ACK, Client ACK
Uses an IP address + port for each endpoint (client & server)
Server port generally 80, client port generally > 1024
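To make the endpoint idea concrete, here is a small self-contained sketch that opens a TCP connection over the loopback interface and prints both endpoints. It uses a throwaway local listener on an OS-chosen port rather than a real web server on port 80; that setup is an assumption made so the example runs anywhere.

```python
import socket

# Listen on the loopback interface; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_addr = server.getsockname()

# create_connection() returns once the three-way handshake has completed.
client = socket.create_connection(server_addr, timeout=5)
conn, peer_addr = server.accept()
client_addr = client.getsockname()

# Each endpoint is an (IP address, port) pair.
print("server endpoint:", server_addr)
print("client endpoint:", client_addr)

# The connection is a reliable byte stream in both directions.
client.sendall(b"ping")
received = conn.recv(4)
print("server received:", received)

for sock in (conn, client, server):
    sock.close()
```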
The HTTP Protocol
About HTTP
Plain-text format for sending & retrieving web content from a server
Defined by RFC 2616 (v1.1) and RFC 1945 (v1.0)
Response Format
Status Line
Headers
Blank Line
Response Body
Request Format
Request Line
Headers
Blank Line
Optional Data
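The two message formats can be sketched in a few lines of Python. The helpers build_request and parse_response are hypothetical names for this example and handle only the simple case (no chunked encoding, no repeated or folded headers).

```python
def build_request(method, target, headers, body=b""):
    """Assemble request line, headers, blank line and optional data."""
    lines = [f"{method} {target} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"  # the blank line ends the headers
    return head.encode("ascii") + body


def parse_response(raw):
    """Split a raw response into status, reason, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, status, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body


request = build_request("GET", "/path/myPage.php?id=15", {"Host": "www.mysite.com"})
print(request.decode("ascii"))

raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
print(parse_response(raw_response))
```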
Debugging HTTP
Browser Plugins
Firefox: LiveHeaders
Chrome: Go to chrome://net-internals/#events & filter by URL_REQUEST
Internet Explorer: IEWatch, ieHTTPHeaders
Web Debugging Proxies
Fiddler
Charles
Network Packet Analyzer
Wireshark
Telnet Client
Custom Code
HTTP Request Methods
Read-Only Methods: GET, HEAD, OPTIONS, TRACE
Idempotent Methods: PUT, DELETE
Write Methods: POST
Common Request Headers
Host
Name of the host the request is being sent to
Cache-Control / Pragma
Determines caching behavior
Accept / Accept-Charset / Accept-Encoding / Accept-Language
Which MIME types, character sets, encodings (compression techniques) and languages to use in the response
If-Modified-Since / If-Match
Only send full response if data has been modified since this date (If-Modified-Since); only process the request if the entity tag matches (If-Match)
Referer
The URL that pointed to this URL
User-Agent
The identifier for the browser or program accessing the URL
Cookie
Sends any cookies previously set to server
Basic GET Request
GET /path/myPage.php?id=15 HTTP/1.1
Host: www.mysite.com
Typical GET Request
GET / HTTP/1.1
Host: www.yahoo.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: B=5uh7jf9466q3e&b=4&d=zrFdfgYnwtpYEIRQwQyo3Yh8W.6mk7zDmSWeQ--&s=fs&i=Yq8FH2OqgDl7KB7VK8; YLS==1&p=0&n=0; s_vsn_yahoogroupsygprod_1=687374583019; FPS=dl; F=a=e92100wTFmxh9hOe34539Z6Yp2uybfZ.8pX0oke5pYRVFcz4dYgR74..vuqgLRABfa5K8IfW3tJLSP3LyikNIAO2234sAesM7spi4di.8BaE&b=vZaW; ALP=bTowJmw6ZW5fVVMm; F=a=e92100wTF; Qur=RAB234fa5K8IfW3tJLSP3L; yikNI23AOwVsAesM7spi4di.8BaE&b=vZaW; ALP=bTowJmw2346ZW5fVVMm;
Basic Request Notes
Determine locale using Accept-Language
• PHP: Locale::acceptFromHttp()
• ASP.Net: Request.UserLanguages
• J2EE: ServletRequest.getLocales()
Use separate domain for static content to avoid overhead of sending cookies
• For instance: static.example.com
• Or use separate domain entirely: example-images.com
• When setting cookies, use a specific domain, eg: www.example.com not example.com
Turn on gzip compression
• Always for uncompressed static files
• Never for compressed static files (images, PDFs)
• Sometimes for dynamic responses
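The platform calls listed above all do roughly the same thing: rank the language tags in Accept-Language by their q value. A simplified Python sketch of that ranking follows; preferred_languages is a made-up helper name, and the parsing ignores less common header forms.

```python
def preferred_languages(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        params = params.strip()
        quality = 1.0  # a tag without q= counts as q=1
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable weight: treat as not acceptable
        if tag and quality > 0:
            weighted.append((quality, tag.strip()))
    # sorted() is stable, so tags with equal weight keep their header order.
    ranked = sorted(weighted, key=lambda pair: -pair[0])
    return [tag for quality, tag in ranked]


print(preferred_languages("en-us,en;q=0.5"))  # ['en-us', 'en']
```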
Typical POST Request
POST /data-generator/DataGenerator/Projects/Edit.aspx?ID=1 HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://localhost/data-generator/DataGenerator/Projects/Edit.aspx?ID=1
Content-Type: application/x-www-form-urlencoded
Content-Length: 411

__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKMTE3MDA1NzgzNWRkD7Ixn245wh8y%2BUk486haBrWD82I%3D&__EVENTVALIDATION=%2FwEWBgKe67rCCwLCqe1mAubp%2BvAPAtn1w6cBArrGt4UFAsjVgtcCuHxmC3pWm2jBT5ZlGoNznlygJIk%3D&ctl00%24MainContent%24ID=1&ctl00%24MainContent%24Name=Project+Portfolio+Management&ctl00%24MainContent%24Description=A+general+project+portfolio+management+data+set.&ctl00%24MainContent%24SaveButton=Save
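A short sketch of how a form body like the one above is produced: the fields are URL-encoded, and Content-Length is the byte length of the result. Only three of the fields from the sample request are used here, to keep the example small.

```python
from urllib.parse import urlencode

# A subset of the form fields from the sample POST request.
fields = {
    "ctl00$MainContent$ID": "1",
    "ctl00$MainContent$Name": "Project Portfolio Management",
    "ctl00$MainContent$SaveButton": "Save",
}

# $ becomes %24 and spaces become +, as in the request body above.
body = urlencode(fields).encode("ascii")

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body)),  # length of the body in bytes
}

print(headers)
print(body.decode("ascii"))
```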
XML POST Request
POST /data-generator/DataGenerator/Projects/Edit.aspx?ID=1 HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Content-Type: text/xml
Content-Length: 60

<xmlRoot>
  <sample message="This is a test" />
</xmlRoot>
More Request Notes
POST requests are rarely, if ever, cached
• Some clients refuse to cache POST requests even if caching headers are sent
GET requests are better for caching
• Some proxies won’t cache GET requests if they contain cookies though
For AJAX, don’t send content, just the URL with query parameters (eg: GET or POST with no data)
Any content can be sent in GET or POST requests, just specify a content-type
Server Processes Request
1. Parses input stream
2. Creates request & response objects
3. Sends request through pipeline
4. Maps to content handler
• Requests handled internally or through CGI, ISAPI or other module
• IIS 7 uses integrated pipeline for all requests
5. Receives content from handler (stream or fixed)
6. Sends response to client
7. Logs request
Create Request & Response Objects
PHP
$_SERVER
$_REQUEST
$_GET
$_POST
$_COOKIE
$_FILES
$_SESSION
$_ENV
$HTTP_RAW_POST_DATA
ASP.Net
Request
.Params
.ServerVariables
.QueryString
.Form
.Cookies
.Files
.Headers
.BinaryRead()
Application
Session
Response
J2EE
HttpServletRequest
.getParameterMap()
.getQueryString()
.getCookies()
.getSession()
.getHeaderNames()
.getInputStream()
HttpServletResponse
Common Response Headers
Server
The identifier for the server sending the response
Date
The date of this response as represented by the server
Cache-Control
How this response should be cached
Last-Modified
The date the resource was last modified. Used for caching
ETag
The entity tag. Uniquely identifies a version of the resource
Expires
The date when this response expires, or 0 to expire immediately
Content-Type
The MIME type of the content being sent. May include the character set
Content-Length
The length of the content being sent. Optional & may not be sent
Location
Redirect the client to this new URL. Can be a permanent or temporary redirect
Set-Cookie
Sets a cookie on the client
Sample Response
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
Date: Thu, 06 Oct 2011 15:43:53 GMT
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 8313
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
…
</html>
Cookies
About Cookies
Defined by RFC 6265
Used to provide state to HTTP
Often used to track session IDs
Multiple Set-Cookie headers can be sent in a response
Can be sent with any response code, though can be ignored when sending 1xx response codes
Not supposed to affect caching, but may
Sending a Cookie
Set-Cookie: <name>=<value>; <attribute>; <attribute>
Set-Cookie: <name2>=<value2>; <attribute>
Receiving a Cookie
Cookie: <name>=<value>; <name2>=<value2>
Cookie Attributes
Expires
The date upon which the cookie expires. Cookies may be expunged before this date by the client
Max-Age
The number of seconds until the cookie expires. Max-Age overrides the Expires attribute
Domain
The hosts for which the cookie will be sent. Must be second-level or greater & include the originating server in definition
Path
The path for which the cookie will be sent. Defaults to path of requested page
Secure
Only transmit the cookie for a secure protocol, eg: https
HttpOnly
Prevent the cookie from being accessible to scripts, applets & plugins
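As an illustration of these attributes, here is a sketch using Python's standard http.cookies module to build a Set-Cookie header and to read a Cookie request header. The cookie names and values (session, theme) are invented for the example.

```python
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header with attributes.
jar = SimpleCookie()
jar["session"] = "abc123"
jar["session"]["domain"] = "www.example.com"  # specific host, not example.com
jar["session"]["path"] = "/"
jar["session"]["max-age"] = 3600              # seconds until expiry
jar["session"]["secure"] = True               # only sent over https
jar["session"]["httponly"] = True             # hidden from scripts
set_cookie_header = jar.output()
print(set_cookie_header)

# Reading side: parse the Cookie header a client sends back.
incoming = SimpleCookie()
incoming.load("session=abc123; theme=dark")
values = {name: morsel.value for name, morsel in incoming.items()}
print(values)
```

Note that the Cookie request header carries only names and values; the attributes are never sent back to the server.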
Cookie Issues
Cookies get sent in clear text. Use HTTPS with Secure flag to make them secure
If not using HttpOnly, cookies can be read and sent by JavaScript
Cookies can be cached by proxy servers, allowing users to access others’ sessions
Separation of authentication from destination allows an attacker to redirect a user to a web site and let the browser send the right cookie
When using wildcard domains, cookies can be overwritten by applications using another sub-domain
Disable HttpOnly for Session Cookies
ASP.Net
Add to Global.asax:
protected void Application_EndRequest(object s, EventArgs e) {
    if (Response.Cookies.Count > 0) {
        foreach (string name in Response.Cookies.AllKeys) {
            // Only touch the forms-authentication and session cookies
            if (name == FormsAuthentication.FormsCookieName ||
                name.ToLower() == "asp.net_sessionid") {
                Response.Cookies[name].HttpOnly = false;
            }
        }
    }
}
J2EE
Add to web.xml:
<session-config>
  <cookie-config>
    <http-only>false</http-only>
  </cookie-config>
</session-config>
Caching
Who caches?
Browsers
Plug-ins
Proxies
Gateways
Reverse Proxies
Why cache?
Reduce latency
Reduce network traffic
Reduce server load
Routing: A Straight Connection
Routing: Web Proxy
Routing: Web Proxy Revalidates
Routing: Reverse Proxy
Basics of Caching
Key Questions
Should I cache this?
Where should I cache this?
How long should I cache it for?
Can I check to see if it’s still valid?
When should I get rid of it?
Concepts
Directives
Freshness
Validation
Invalidation
Should I Cache?
Caching enabled by default
• Status codes 200, 203, 206, 300, 301, 410
• Not enabled if Authorization header present
Cache-Control controls caching in HTTP 1.1
• public, private, no-cache
• no-store, no-transform
• max-age, min-fresh, max-stale, s-maxage
• only-if-cached
• must-revalidate, proxy-revalidate
Pragma controls caching in HTTP 1.0
• no-cache
HTML META element generally only works for browsers
How long to cache for?
Last-Modified
Indicates the date a resource was last modified
Expires
Indicates the date when the response expires. Implies the response is cacheable, unless overridden by a Cache-Control header. The max-age Cache-Control directive overrides this value when determining how long to cache
Age header
Indicates how long a response has been in the cache. Sent by a web cache when sending a response
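How these headers combine can be sketched as a small freshness check. This is a simplified model, not a complete cache: the function names are assumptions, and directives such as no-cache, no-store and s-maxage are ignored.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return how many seconds a response stays fresh, or None if unstated."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)  # max-age overrides Expires
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
        except (TypeError, ValueError):
            return 0  # an unparseable date (eg: Expires: 0) means expired
        return max(0, int((expires - date).total_seconds()))
    return None


def is_fresh(headers):
    """Compare the Age header against the freshness lifetime."""
    lifetime = freshness_lifetime(headers)
    age = int(headers.get("Age", "0"))
    return lifetime is not None and age < lifetime


sample = {
    "Date": "Thu, 06 Oct 2011 15:43:53 GMT",
    "Expires": "Thu, 06 Oct 2011 16:43:53 GMT",
}
print(freshness_lifetime(sample))  # 3600
```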
Revalidating Caches
ETag
If-Match
If-None-Match
If-Modified-Since
If-Unmodified-Since
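A minimal sketch of the server side of revalidation with ETag and If-None-Match. The function name revalidate is hypothetical, and the comparison is simplified: it does not distinguish weak from strong entity tags.

```python
def revalidate(request_headers, current_etag):
    """Return (status, send_body) for a conditional GET."""
    candidates = request_headers.get("If-None-Match")
    if candidates is not None:
        tags = [tag.strip() for tag in candidates.split(",")]
        if "*" in tags or current_etag in tags:
            return 304, False  # Not Modified: the cached copy is still valid
    return 200, True  # send the full response with the current ETag


print(revalidate({"If-None-Match": '"v1"'}, '"v1"'))  # (304, False)
print(revalidate({"If-None-Match": '"v1"'}, '"v2"'))  # (200, True)
```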
Other Tricks
Content-Disposition
• Force a Save As dialog to appear
• Content-Transfer-Encoding: binary
• Content-Disposition: attachment; filename=data.csv
Server Push
• Use multipart/x-mixed-replace MIME type when pipelining
• Not supported by Internet Explorer
Thanks for attending!
Slides: blog.fastfedora.com