The Internet and the WWW have become the backbone of our modern society and economy. The WWW is a distributed system of interlinked hypermedia resources on the Internet. It is based on a set of related technologies. In this section, we briefly look at the most important protocols and formats, including Uniform Resource Identifiers (URIs), HTTP and HTML.
URIs, currently specified in RFC 3986 [BL05], are strings that are used to reference resources. In terms of distributed systems, a URI has three distinct roles: naming, addressing, and identifying resources. We will focus on URIs that identify resources in the WWW, although they can be used for other abstract or physical entities as well. According to the specification, a URI consists of five parts: scheme, authority, path, query and fragment. Only scheme and path are mandatory; the other parts are optional. The scheme declares the type of the URI and thus determines the meaning of its other parts. If used, the authority identifies the party responsible for the referenced resource. For the http scheme, this part is mandatory and contains the host of the web server hosting the resource. It can optionally contain a port number (80 is implied for http) and authentication data (deprecated). The mandatory path is used to address the resource within the scope of the scheme (and authority). It is often structured hierarchically. The optional query part provides non-hierarchical data as part of the resource identifier. A fragment can be used to point to a certain part within the resource. The following example is adapted from RFC 3986 [BL05] and makes use of all five parts:
    http://example.com:8080/over/there?search=test#first
    \__/   \______________/\_________/ \_________/ \___/
     |            |             |           |        |
  scheme      authority       path        query  fragment
It identifies a web resource (the scheme is http) that is hosted on example.com, port 8080. The resource path is /over/there, and the query component contains the key/value pair search=test. Furthermore, the fragment named first within the resource is referenced.
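The decomposition above can be reproduced programmatically. The following is a minimal sketch using `urlsplit` and `parse_qs` from Python's standard library; the helper name `split_uri` is hypothetical and only serves this illustration.

```python
from urllib.parse import parse_qs, urlsplit


def split_uri(uri: str) -> dict:
    """Split a URI into the five RFC 3986 components (illustrative helper)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,   # host plus optional port
        "host": parts.hostname,
        "port": parts.port,          # None if no explicit port is given
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = split_uri("http://example.com:8080/over/there?search=test#first")
# The query string can be decoded further into key/value pairs.
query_pairs = parse_qs(components["query"])
```

Note that the standard library calls the authority component `netloc` and exposes host and port separately.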
The Hypertext Transfer Protocol (HTTP) is an application-level protocol that forms the foundation of communication for the WWW on top of TCP/IP. HTTP, as defined in RFC 2616 [Fie99], is a stateless protocol that follows a client/server architecture and a request/response communication model. Servers host resources that are identified by URIs and can be accessed by clients. The client issues an HTTP request to the server, which in return provides an HTTP response. The communication model limits the possible message patterns to single request/response cycles that are always initiated by the client. Apart from clients and servers, HTTP also describes optional intermediaries, so-called proxies. These components provide additional features such as caching or filtering. Proxies combine features of a client and a server and are thus often transparent to clients and servers in terms of communication.
HTTP requests and responses share a common structure. A request starts with a request line, a response with a status line. The next part contains a set of header lines that carry information about the request or response and about the entity.
The entity is an optional body of an HTTP message that carries the payload, such as a representation of the resource. While the first two parts of an HTTP message are text-based, the entity can be any sequence of bytes. HTTP request lines contain a request URI and a method. There are different HTTP methods that provide different semantics when applied to a resource, as shown in table 2.1.
|Method||Usage||Safe||Idempotent||Cacheable|
|GET||This is the most common method of the WWW. It is used for fetching resource representations.||✓||✓||(✓)|
|HEAD||Essentially, this method is the same as GET; however, the entity is omitted in the response.||✓||✓||(✓)|
|PUT||This method is used for creating new resources or updating existing ones with new representations.||-||✓||-|
|DELETE||Existing resources can be removed with DELETE.||-||✓||-|
|POST||POST is used to create new resources. Due to its lack of idempotence and safety, it is also often used to trigger arbitrary actions.||-||-||-|
|OPTIONS||This method provides metadata about a resource and its available representations.||✓||✓||-|
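One practical consequence of these properties is that an idempotent request can, in principle, be repeated after a connection failure without changing the outcome. The sketch below encodes the safety and idempotence properties of the methods as a lookup table. The names `METHOD_PROPERTIES` and `may_retry_automatically` are hypothetical, and the retry rule is a simplifying assumption rather than a complete policy.

```python
# (safe, idempotent) per method, following the HTTP 1.1 method semantics.
METHOD_PROPERTIES = {
    "GET": (True, True),
    "HEAD": (True, True),
    "PUT": (False, True),
    "DELETE": (False, True),
    "POST": (False, False),
    "OPTIONS": (True, True),
}


def is_safe(method: str) -> bool:
    """Safe methods are not expected to cause side effects on the resource."""
    return METHOD_PROPERTIES[method.upper()][0]


def is_idempotent(method: str) -> bool:
    """Repeating an idempotent request has the same effect as sending it once."""
    return METHOD_PROPERTIES[method.upper()][1]


def may_retry_automatically(method: str) -> bool:
    """Assumed policy: only idempotent requests are repeated without asking."""
    return is_idempotent(method)
```

Every safe method in this table is also idempotent, but not the other way round: PUT and DELETE change state, yet repeating them yields the same result.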
In the subsequent HTTP response, the server informs the client about the outcome of a request by using predefined status codes. The classes of status codes can be seen in table 2.2.
|Range||Status Type||Usage||Example Code|
|1xx||informational||Preliminary response codes||100 Continue|
|2xx||success||The request has been successfully processed.||200 OK|
|3xx||redirection||The client must dispatch additional requests to complete the request.||303 See Other|
|4xx||client error||Result of an erroneous request caused by the client.||404 Not Found|
|5xx||server error||A server-side error occurred (not caused by this request being invalid).||503 Service Unavailable|
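Because the class of a status code is determined by its first digit, a client can dispatch on the class even when it does not know the specific code. The following is a minimal sketch of this idea; the function name `status_class` is hypothetical, while `HTTPStatus` is part of Python's standard library.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a three-digit status code to its class via the leading digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


# The standard library also knows the reason phrases of registered codes.
example_phrase = HTTPStatus(404).phrase
```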
A simple request/response exchange can be seen in the listing below as an example:
    GET /html/rfc1945 HTTP/1.1
    Host: tools.ietf.org
    User-Agent: Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
    Accept-Encoding: gzip, deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Connection: keep-alive
    If-Modified-Since: Sun, 13 Nov 2011 21:13:51 GMT
    If-None-Match: "182a7f0-2aac0-4b1a43b17a1c0;4b6bc4bba3192"
    Cache-Control: max-age=0

    HTTP/1.1 304 Not Modified
    Date: Tue, 17 Jan 2012 17:02:44 GMT
    Server: Apache/2.2.21 (Debian)
    Connection: Keep-Alive
    Keep-Alive: timeout=5, max=99
    Etag: "182a7f0-2aac0-4b1a43b17a1c0;4b6bc4bba3192"
    Content-Location: rfc1945.html
    Vary: negotiate,Accept-Encoding
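To illustrate the message structure described earlier (request line, header lines, optional entity), the following sketch splits a raw request into its parts. It is a simplified illustration, not a complete HTTP parser: the function name `parse_request` is hypothetical, and the sample message is a shortened version of the request in the listing.

```python
def parse_request(raw: bytes):
    """Split a raw HTTP request into request line, headers and entity.

    Simplified sketch: assumes a well-formed message whose header section
    ends with an empty line (CRLF CRLF).
    """
    head, _, entity = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Request line: method, request URI and protocol version.
    method, request_uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return method, request_uri, version, headers, entity


sample = (
    b"GET /html/rfc1945 HTTP/1.1\r\n"
    b"Host: tools.ietf.org\r\n"
    b"If-Modified-Since: Sun, 13 Nov 2011 21:13:51 GMT\r\n"
    b"\r\n"
)
method, request_uri, version, headers, entity = parse_request(sample)
```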
We will now have a closer look at two advanced features of HTTP that are interesting for our later considerations, namely connection handling and chunked encoding.
As already mentioned, HTTP uses TCP/IP as its underlying transport protocol. We will now examine how TCP sockets are used for HTTP requests. Earlier versions of HTTP suggested a separate socket connection for each request/response cycle. Paying the overhead of establishing a new TCP connection for each request, instead of reusing an existing one, leads to poor performance. The non-standard Connection: Keep-Alive header was a temporary workaround, but the current HTTP 1.1 specification addresses this issue in detail. HTTP 1.1 introduced persistent connections as the default. That is, the underlying TCP connection of an HTTP request is reused for subsequent HTTP requests. Request pipelining further improves the throughput of persistent connections by allowing a client to dispatch multiple requests without waiting for the responses to prior requests. The server then responds to all incoming requests in the same sequential order. Both mechanisms have improved the performance and reduced the latency of web applications. However, the management of multiple open connections and the processing of pipelined requests have created new challenges for web servers, as we will see in chapter 4.
An HTTP message must contain the length of its entity, if any. In HTTP 1.1, this is necessary for determining the overall length of a message and for detecting the start of the next message on a persistent connection. Sometimes, the exact length of an entity cannot be determined a priori. This is especially the case for content that is generated dynamically or for entities that are compressed on the fly. Therefore, HTTP provides alternative transfer encodings. When chunked transfer encoding is used, the client or server streams chunks of the entity sequentially. The length of the next chunk is prepended to the chunk itself. A chunk length of 0 denotes the end of the entity. This mechanism allows the transfer of generated entities of arbitrary length.
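The reuse of one TCP connection for several requests can be observed with a small experiment. The sketch below starts a throwaway HTTP 1.1 server on the loopback interface using Python's standard library and sends several requests over a single `HTTPConnection`. The handler `EchoPathHandler` and the function `fetch_over_one_connection` are hypothetical names for this illustration. It demonstrates persistent connections only; pipelining is not supported by `http.client`.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoPathHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # makes the server keep connections open

    def do_GET(self):
        body = self.path.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # The entity length lets the client find the end of this response.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_over_one_connection(paths):
    """Request all paths over one connection; return (status, body, client port)."""
    server = HTTPServer(("127.0.0.1", 0), EchoPathHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    results = []
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    try:
        for path in paths:
            conn.request("GET", path)
            # The local port identifies the client side of the TCP connection.
            client_port = conn.sock.getsockname()[1]
            response = conn.getresponse()
            body = response.read().decode("ascii")
            results.append((response.status, body, client_port))
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return results
```

If the client port is identical for all entries of the result, every request has travelled over the same TCP connection.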
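The framing used by chunked transfer encoding can be sketched in a few lines. On the wire, each chunk is preceded by its length in hexadecimal notation, and both the length line and the chunk data are terminated by CRLF. The function names below are hypothetical, and the decoder is a simplified illustration that ignores chunk extensions and trailers.

```python
def encode_chunked(chunks) -> bytes:
    """Frame an iterable of byte strings as a chunked entity."""
    out = bytearray()
    for chunk in chunks:
        if not chunk:
            continue  # an empty chunk would prematurely signal the end
        out += f"{len(chunk):x}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # a chunk length of 0 terminates the entity
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the entity from a complete chunked byte sequence."""
    entity = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size_field = data[pos:line_end].split(b";", 1)[0]  # drop chunk extensions
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            return bytes(entity)  # trailers, if any, are ignored here
        entity += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF


wire = encode_chunked([b"Hello, ", b"world"])
```

Since the sender only needs to know the length of the current chunk, it can start transmitting before the complete entity has been generated.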
HTTP does not restrict the document formats to be used for entities. However, the core idea of the WWW is based on hypermedia, so most of the formats are hypermedia-enabled. The single most important format is HTML, along with its descendants.
The fifth revision of the HTML standard [Hya09] introduces several markup improvements (e.g. semantic tags), better support for multimedia content and, most notably, a rich set of new APIs. These APIs address various features including client-side storage, offline support, device sensors for context awareness and improved client-side performance. The Web Sockets API [Hic09a] complements the traditional HTTP request/response communication pattern with a low-latency, bidirectional, full-duplex socket based on the WebSocket protocol [Fet11]. This is especially interesting for real-time web applications.
Besides proprietary and customized formats, web services often make use of generic, structured formats such as XML and JSON [Cro06]. XML is a comprehensive markup language with a rich family of related technologies for tasks such as validation, transformation and querying. JSON is one of the lightweight alternatives and focuses solely on the succinct representation of structured data. While there is an increasing interest in lightweight formats for web services and messages, XML still provides the most extensive tool set and support.
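The difference in verbosity between the two formats can be illustrated by serializing the same record in both. The following is a minimal sketch using Python's standard library; the record, the element name `exchange` and the helper names are made up for this example, and the flat element-per-field mapping is only one of many possible XML representations.

```python
import json
import xml.etree.ElementTree as ET

record = {"method": "GET", "status": 200}


def to_json(data: dict) -> str:
    """Serialize the record as compact JSON."""
    return json.dumps(data, separators=(",", ":"))


def to_xml(data: dict, root_name: str = "exchange") -> str:
    """Serialize the record as XML, using one child element per field."""
    root = ET.Element(root_name)
    for key, value in data.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


as_json = to_json(record)
as_xml = to_xml(record)
```

Note that JSON preserves the distinction between the number 200 and a string, whereas this plain XML mapping turns every value into text and would need a schema to recover the types.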