1 URLs – Uniform Resource Locators Since web pages may contain pointers to other pages, we will...


Page 1:

URLs – Uniform Resource Locators

Since web pages may contain pointers to other pages, we will see how those pointers are implemented.

When the web was first created, it was apparent that having one page point to another required mechanisms for naming and locating pages. In particular, there were 3 questions that had to be answered before a selected page could be displayed:

• What is the page called?
• Where is the page located?
• How can the page be accessed?

Page 2:

URLs

The solution chosen identifies pages in a way that solves all 3 problems at once.

Each page is assigned a URL (Uniform Resource Locator) that effectively serves as the page’s worldwide name.

Page 3:

URLs

URLs have 3 parts:

• The protocol (also called a scheme)
• The DNS name of the machine on which the page is located, and
• A local name uniquely indicating the specific page (usually just a file name on the machine where it resides)

For example, the URL for the author’s department is http://www.cs.vu.nl/welcome.html. This URL consists of 3 parts: the protocol (http), the DNS name of the host (www.cs.vu.nl), and the file name (welcome.html), with certain punctuation separating the pieces.
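The three-part split can be checked with Python's standard urllib.parse module. This is only a small sketch using the example URL from the slide. The variable names are illustrative, and the parser returns the local name with its leading "/".

```python
from urllib.parse import urlsplit

# The example URL from the slide above.
url = "http://www.cs.vu.nl/welcome.html"

parts = urlsplit(url)
protocol = parts.scheme      # the protocol (scheme)
dns_name = parts.hostname    # the DNS name of the host
file_name = parts.path       # the local name, including its leading "/"

print(protocol, dns_name, file_name)
```

The "://" and "/" separators are the punctuation the slide refers to. The parser consumes them and returns only the pieces.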

Page 4:

URLs

Many sites have certain shortcuts for file names built in. For example, ~user/ might be mapped onto user’s WWW directory, with the convention that a reference to the directory itself implies a certain file, say, index.html.

Thus the author’s home page can be reached at http://www.cs.vu.nl/~ast/ even though the actual file name is different.

At many sites a null file name defaults to the organization’s home page.
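A server-side mapping of this kind might look roughly like the sketch below. The directory layout (DOC_ROOT, USER_WWW) and the function name are assumptions made for illustration. Real servers make these conventions configurable, and the sketch does not guard against paths containing "..".

```python
import posixpath

DOC_ROOT = "/srv/www"            # hypothetical site-wide document root
USER_WWW = "/home/{user}/www"    # hypothetical per-user WWW directory
DEFAULT_FILE = "index.html"      # file implied by a directory reference


def local_file_for(url_path: str) -> str:
    """Map the path part of a URL onto a file name on the server."""
    if url_path.startswith("/~"):
        # "~user/..." shortcut: switch to that user's WWW directory.
        user, _, rest = url_path[2:].partition("/")
        base = USER_WWW.format(user=user)
    else:
        base, rest = DOC_ROOT, url_path.lstrip("/")
    if rest == "" or rest.endswith("/"):
        # A null file name or a directory reference implies the default file.
        rest += DEFAULT_FILE
    return posixpath.join(base, rest)


print(local_file_for("/~ast/"))
print(local_file_for(""))
```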

Page 5:

URLs – mechanism

To make a piece of text clickable, the page writer must provide 2 items of information:

• The clickable text to be displayed, and
• The URL of the page to go to if the text is selected

When the text is selected, the browser looks up the host name using DNS. Now armed with the host’s IP address, the browser establishes a TCP connection to the host. Over that connection it sends the file name using the specified protocol. Next, back comes the page.
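The sequence of steps can be sketched with Python's socket module. This is a simplified illustration rather than how any particular browser works. The function names are made up here, and the use of HTTP/1.0 with a Host header is an assumption chosen so that the server closes the connection after replying.

```python
import socket
from urllib.parse import urlsplit


def request_for(url):
    """Split a URL into the host, the port, and the request bytes to send."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # default HTTP port
    path = parts.path or "/"         # a null file name means the root page
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    return host, port, request


def fetch(url):
    """Follow the steps from the slide: DNS lookup, TCP connect, send, read."""
    host, port, request = request_for(url)
    # Step 1: look up the host name (the system resolver performs the DNS query).
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    # Step 2: establish a TCP connection to the resolved address.
    with socket.socket(family, socktype, proto) as conn:
        conn.connect(sockaddr)
        # Step 3: send the file name using the specified protocol.
        conn.sendall(request)
        # Step 4: back comes the page; read until the server closes.
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("http://www.cs.vu.nl/welcome.html") would carry out all four steps over the network. request_for() can be inspected on its own without any connection.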

Page 6:

URLs - protocols

The URL scheme is open-ended, in the sense that it is straightforward to have protocols other than HTTP. In fact, URLs for various other protocols have been defined, and many browsers understand them.

The next table illustrates slightly simplified forms of the more common ones:

Page 7:

URLs - Protocols

Name     Used for                 Example
http     Hypertext                http://www.cs.vu.nl/~ast/
ftp      File Transfer Protocol   ftp://ftp.cs.vu.nl/pub
file     Local file               file:///usr/Suzanne/prog.c
news     News group               news:comp.os.minix
news     News article             news:[email protected]
gopher   Gopher                   gopher://gopher.tc.umn.edu/11/Libraries
mailto   Sending email            mailto:[email protected]
telnet   Remote login             telnet://www.w3.org:80
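Because every scheme shares the same overall shape, one generic parser can split most of the table's examples. The sketch below uses urllib.parse on the rows whose example URL is fully legible in the table. The describe() helper is illustrative only.

```python
from urllib.parse import urlsplit

# Example URLs copied from the table above (rows with complete URLs only).
EXAMPLES = [
    "http://www.cs.vu.nl/~ast/",
    "ftp://ftp.cs.vu.nl/pub",
    "file:///usr/Suzanne/prog.c",
    "news:comp.os.minix",
    "gopher://gopher.tc.umn.edu/11/Libraries",
    "telnet://www.w3.org:80",
]


def describe(url):
    """Return (scheme, host, port, path); host and port may be None."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port, parts.path


for url in EXAMPLES:
    print(describe(url))
```

Schemes such as news: name no host at all, which is why the host and port come back as None for them.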

Page 8:

HTTP – HyperText Transfer Protocol

The standard Web transfer protocol is HTTP (HyperText Transfer Protocol).

The HTTP protocol consists of two fairly distinct items:

• the set of requests from browsers to servers, and
• the set of responses going back the other way

Page 9:

HTTP

HTTP is an ASCII protocol (each interaction consists of an ASCII request, followed by one MIME-like response).

MIME (Multipurpose Internet Mail Extensions) – in the early days of the ARPANET, email consisted exclusively of text messages written in English and expressed in ASCII. Nowadays on the Internet this approach is no longer adequate, as the following need to be addressed:

• Messages in languages with accents (French, German)
• Messages in non-Latin alphabets (e.g. Hebrew, Russian)
• Messages in languages without alphabets (e.g. Chinese, Japanese)
• Messages not containing text at all (e.g. audio, video)

Page 10:

MIME

The basic idea of MIME is to define encoding rules for non-ASCII messages. MIME defines 5 message headers:

Header                      Meaning
MIME-Version                Identifies the MIME version
Content-Description         Human-readable string telling what is in the message
Content-ID                  Unique identifier
Content-Transfer-Encoding   How the body is wrapped for transmission
Content-Type                Nature of the message
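To see the five headers on a concrete message, the sketch below builds one with Python's standard email package. The body text, the description, and the Content-ID value are invented for illustration. The transfer encoding printed depends on what the library chooses for the given text.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Non-ASCII body text: set_content() picks a charset and a transfer
# encoding, and also adds the MIME-Version header.
msg.set_content("Café ouvert à midi")
# The two descriptive headers are optional, so they are set by hand here.
msg["Content-Description"] = "Short note with accented characters"
msg["Content-ID"] = "<note-1@example.invalid>"  # hypothetical identifier

MIME_HEADERS = [
    "MIME-Version",
    "Content-Description",
    "Content-ID",
    "Content-Transfer-Encoding",
    "Content-Type",
]
for name in MIME_HEADERS:
    print(f"{name}: {msg[name]}")
```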

Page 11:

MIME – Content Type

Type          Subtype         Meaning
Text          Plain           Unformatted text
              Richtext        Text including simple formatting
Image         Gif             Still picture in GIF format
              Jpeg            Still picture in JPEG format
Audio         Basic           Audible sound
Video         Mpeg            Movie in MPEG format
Application   Octet-stream    An uninterpreted byte sequence
              Postscript      A printable document in PostScript
Message       Rfc822          A MIME RFC 822 message
              Partial         Message has been split for transmission
              External-body   Message must be fetched over the net
Multipart     Mixed           Independent parts
              Alternative     Same message in different formats
              Parallel        Parts must be viewed simultaneously
              Digest          Each part is a complete RFC 822 message
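A Content-Type value is written as type/subtype. As a small illustration, the sketch below uses Python's email package to build a Multipart/Alternative message that carries the same text as Text/Plain and as HTML. The message text is made up, and the library reports types in lower case.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg.set_content("Hello, world")                              # text/plain part
msg.add_alternative("<B>Hello, world</B>", subtype="html")   # text/html part

# The container is now multipart/alternative: same message, two formats.
print(msg.get_content_type())
for part in msg.iter_parts():
    print(part.get_content_maintype(), part.get_content_subtype())
```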

Page 12:

HTTP - request

Although HTTP was designed for use in the Web, it has been intentionally made more general than necessary with an eye to future object-oriented applications. For this reason the first word of a request line is simply the name of the method (command) to be executed on the Web page (or general object).

The built-in methods are as follows:

Method   Description
GET      Request to read a Web page
HEAD     Request to read a Web page’s header
PUT      Request to store a Web page
POST     Append to a named resource (web page)
DELETE   Remove the Web page
LINK     Connects two existing resources
UNLINK   Breaks an existing connection between resources
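Since the method is just the first word of the request line, composing a request line takes very little code. The helper below is a hypothetical sketch. Its name and its validation against the table's method list are choices made here, not part of any library.

```python
# The built-in methods listed in the table above.
METHODS = {"GET", "HEAD", "PUT", "POST", "DELETE", "LINK", "UNLINK"}


def request_line(method: str, path: str, version: str = "HTTP/1.1") -> str:
    """Build 'METHOD path version', terminated by CRLF as HTTP requires."""
    method = method.upper()
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return f"{method} {path} {version}\r\n"


print(request_line("GET", "/hypertext/WWW/TheProject.html"), end="")
print(request_line("HEAD", "/hypertext/WWW/TheProject.html"), end="")
```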

Page 13:

HTTP request / response

A request is just a GET line, naming the page desired and the HTTP protocol version:

GET /hypertext/WWW/TheProject.html HTTP/1.1

The response is just the raw page, headers, and MIME information.

For example, because HTTP is an ASCII protocol, it is easy for a person at a terminal (as opposed to a browser) to talk directly to Web servers. All that is needed is a TCP connection to port 80 on the server. The simplest way to get such a connection is the Telnet program:

Page 14:

HTTP - example

Client: Telnet www.w3.org 80
Trying 18.23.0.23
Connected to www.w3.org
Client: GET /hypertext/WWW/TheProject.html HTTP/1.1
Server: HTTP/1.1 200 Document follows
Server: MIME-Version: 1.0
Server: Server: CERN/3.0
Server: Content-Type: text/html
Server: Content-Length: 8247
Server: <HEAD><TITLE>The World Wide Web Consortium (W3C) </TITLE> </HEAD>
Server: <BODY> …
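The server's reply has a regular shape: a status line, header lines, a blank line, then the page. The sketch below takes apart a reply assembled from the lines of the transcript. The CRLF line endings and the blank separator line are how HTTP frames a response, and the body is cut short here just as it is on the slide.

```python
# Reply text assembled from the transcript above (body truncated).
RESPONSE = (
    "HTTP/1.1 200 Document follows\r\n"
    "MIME-Version: 1.0\r\n"
    "Server: CERN/3.0\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 8247\r\n"
    "\r\n"
    "<HEAD><TITLE>The World Wide Web Consortium (W3C) </TITLE> </HEAD>"
)


def parse_response(text):
    """Split a response into version, status code, reason, headers, body."""
    head, _, body = text.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RESPONSE)
print(code, reason, headers["content-type"])
```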

Page 15:

HTTP Example

Alternatively, a command-line browser (such as WFetch) could be used to review the same information.


Page 17:

HTML – HyperText Markup Language

HTML is a markup language, a language for describing how documents are to be formatted. The term “markup” comes from the old days when copyeditors actually marked up documents to tell the printer (in those days a human being) which fonts to use, and so on.

Markup languages thus contain explicit commands for formatting. For example, in HTML, <B> means start boldface mode, and </B> means leave boldface mode.

Page 18:

HTML

The advantage of a markup language over one with no explicit markup is that writing a browser for it is straightforward: the browser simply has to understand the markup commands.

By embedding the markup commands within each HTML file and standardizing them, it becomes possible for any Web browser to read and reformat any Web page.

Page 19:

HTML

HTTP and HTML are constantly evolving. When Mosaic was the only browser, the language it interpreted, HTML 1.0, was the de facto standard.

When new browsers came along, there was a need for a formal Internet standard, so the HTML 2.0 standard was produced. Next, HTML 3.0 was created as a research effort to add many new features to HTML 2.0, including tables, toolbars, mathematical formulas, advanced style sheets (for defining page layout and the meaning of symbols), etc.

Page 20:

HTML – brief introduction

A proper Web page consists of a head and body enclosed by <HTML> and </HTML> tags (formatting commands), although most browsers do not complain if these tags are missing.

The head is bracketed by <HEAD> and </HEAD> tags, and the body is bracketed by <BODY> and </BODY> tags.

The commands inside the tags are called directives. Most HTML tags have this format, that is, <SOMETHING> to mark the beginning of something and </SOMETHING> to mark its end.
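The begin/end pairing of tags can be observed with Python's standard html.parser module. The page below is a made-up minimal example, not one from the slides, and the TagLogger class is an illustrative name. The parser reports tag names in lower case.

```python
from html.parser import HTMLParser

# A hypothetical minimal page with a head, a body, and one bold word.
PAGE = (
    "<HTML><HEAD><TITLE>Demo</TITLE></HEAD>"
    "<BODY>Plain and <B>bold</B> text</BODY></HTML>"
)


class TagLogger(HTMLParser):
    """Record every start tag, end tag, and non-blank run of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
logger.feed(PAGE)
logger.close()
for event in logger.events:
    print(event)
```

Each ("start", x) event is eventually matched by an ("end", x) event, mirroring the <SOMETHING> ... </SOMETHING> format.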

Page 21:

HTML – brief introduction

Numerous other examples of HTML are easily available. Most browsers have a menu item VIEW SOURCE or something similar. Selecting this item for an HTML page displays the current HTML source instead of the formatted output.

Page 22:

DNS – Domain Name System

Programs rarely refer to hosts, mailboxes, and other resources by their binary network addresses. Instead, they use ASCII strings, such as [email protected]

Nevertheless, the network itself only understands binary addresses, so some mechanism is required to convert the ASCII strings to network addresses.

Page 23:

DNS

Way back in the ARPANET, there was simply a file, hosts.txt, that listed all the hosts and their IP addresses. Every night, all the hosts would fetch it from the site at which it was maintained. For a network of a few hundred large time-sharing machines, this approach worked reasonably well.

However, when thousands of workstations were connected to the net, everyone realized that this approach could not continue to work forever.

Page 24:

DNS

For one thing, the size of the file would become too large. However, even more important, host name conflicts would occur constantly unless names were centrally managed, something unthinkable in a huge international network.

To solve these problems, DNS (the Domain Name System) was invented.

Page 25:

DNS

The essence of DNS is the invention of a hierarchical, domain-based naming scheme and a distributed database system for implementing this naming scheme.

It is primarily used for mapping host names and email destinations to IP addresses.

Page 26:

DNS – how it is used

To map a name onto an IP address, an application program calls a library procedure called the resolver, passing it the name as a parameter. The resolver sends a UDP packet to a local DNS server, which then looks up the name and returns the IP address to the resolver, which then returns it to the caller.

Armed with the IP address, the program can then establish a TCP connection with the destination, or send it UDP packets.
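In Python, the resolver is reachable through socket.getaddrinfo, which hands the name to the operating system's resolver library. The sketch below wraps it in a hypothetical resolve() helper. Whether a UDP query actually goes out depends on the system's configuration and caches, and a name that is already a numeric address is returned without any lookup.

```python
import socket


def resolve(name, port=80):
    """Return the distinct IP addresses the system resolver reports for name."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    addresses = []
    for family, socktype, proto, canonname, sockaddr in infos:
        ip = sockaddr[0]             # sockaddr starts with the IP address
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# A numeric address needs no DNS query, so this call works offline.
print(resolve("127.0.0.1"))
```

With an address in hand, a program could go on to call socket.create_connection((ip, port)) to open the TCP connection mentioned on the slide.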