2.1 The World Wide Web

The Internet and the WWW have become the backbone of our modern society and economy. The WWW is a distributed system of interlinked hypermedia resources on the Internet. It is based on a set of different, related technologies. In this section, we will have a brief look at the most important protocols and formats, including Uniform Resource Identifiers (URIs), HTTP and HTML.

Uniform Resource Identifiers

URIs, currently specified in RFC 3986 [BL05], are strings that are used to reference resources. In terms of distributed systems, a URI has three distinct roles: naming, addressing, and identifying resources. We will focus on URIs identifying resources in the WWW, although they can be used for other abstract or physical entities as well.

According to the specification, a URI consists of five parts: scheme, authority, path, query and fragment. However, only scheme and path are mandatory; the other parts are optional. The scheme declares the type of the URI and thus determines the meaning of the other parts of the URI. If used, the authority points to the responsible authority of the referenced resource. In the case of http as scheme, this part becomes mandatory and contains the host of the web server hosting the resource. It can optionally contain a port number (80 is implied for http) and authentication data (deprecated). The mandatory path section is used to address the resource within the scope of the scheme (and authority). It is often structured hierarchically. The optional query part provides non-hierarchical data as part of the resource identifier. A fragment can be used to point to a certain part within the resource. The following example is adapted from RFC 3986 [BL05] and makes use of all five parts:

`http://example.com:8080/over/there?search=test#first`

It identifies a web resource (the scheme is http) that is hosted on example.com (on port 8080). The resource path is /over/there and the query component contains the key/value pair search=test. Furthermore, the fragment first of the resource is referenced.
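To make the decomposition concrete, the following sketch splits a URI of the shape described above into its five parts. It uses `urlsplit` and `parse_qs` from Python's standard library; the variable names are our own choice for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Example URI with all five parts: scheme, authority, path, query, fragment.
uri = "http://example.com:8080/over/there?search=test#first"

parts = urlsplit(uri)

print("scheme:   ", parts.scheme)    # type of the URI
print("authority:", parts.netloc)    # host and optional port
print("host:     ", parts.hostname)
print("port:     ", parts.port)
print("path:     ", parts.path)      # hierarchical address of the resource
print("query:    ", parts.query)     # non-hierarchical data
print("fragment: ", parts.fragment)  # part within the resource

# The query component can be decoded further into key/value pairs.
query_params = parse_qs(parts.query)
print("query parameters:", query_params)
```

Note that the fragment is handled by the client only; it is not part of the request that is sent to the server.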

The Hypertext Transfer Protocol

HTTP is an application-level protocol that forms the foundation of communication for the WWW on top of TCP/IP. HTTP, as defined in RFC 2616 [Fie99], is a stateless protocol that follows a client/server architecture and a request/response communication model. Servers host resources that are identified by URIs and can be accessed by clients. The client issues an HTTP request to the server, which in return provides an HTTP response. The communication model limits the possible message patterns to single request/response cycles that are always initiated by the client. Apart from clients and servers, HTTP also describes optional intermediaries, so-called proxies. These components provide additional features such as caching or filtering. Proxies combine features of a client and a server and are thus often transparent for the clients and servers in terms of communication.

HTTP requests and responses have a common structure. A request starts with a request line, a response with a status line. The next part contains a set of header lines that include information about the request or response, respectively, and about the entity.

The entity is an optional body of an HTTP message that contains payload such as a representation of the resource. While the first two parts of an HTTP message are text-based, the entity can be any set of bytes. HTTP request lines contain a request URI and a method. There are different HTTP methods that provide different semantics when applied to a resource, as shown in table 2.1.

Method	Usage	Safe	Idempotent	Cacheable
`GET`	This is the most common method of the WWW. It is used for fetching resource representations.	✓	✓	(✓)
`HEAD`	Essentially the same as `GET`, except that the entity is omitted in the response.	✓	✓	(✓)
`PUT`	This method is used for creating new resources or updating existing ones with new representations.		✓	
`DELETE`	Existing resources can be removed with `DELETE`.		✓	
`POST`	`POST` is used to create new resources. Due to its lack of idempotence and safety, it is also often used to trigger arbitrary actions.			
`OPTIONS`	This method provides metadata about a resource and its available representations.	✓	✓	

Table 2.1: The official HTTP 1.1 methods. According to RFC 2616, a method is safe when a request using this method does not change any state on the server. If multiple dispatches of a request result in the same side effects as a single dispatch, the request semantics are called idempotent. If a request method provides cacheability, clients may store responses according to the HTTP caching semantics.
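The difference between idempotent and non-idempotent methods can be illustrated with a small sketch. The class `ResourceStore` below is a hypothetical in-memory stand-in for server-side state and not part of any HTTP library; it merely mimics the semantics of the methods described above.

```python
import itertools


class ResourceStore:
    """Hypothetical in-memory model of server-side resources."""

    def __init__(self):
        self.resources = {}
        self._ids = itertools.count(1)

    def get(self, path):
        # Safe: reading never modifies server state.
        return self.resources.get(path)

    def put(self, path, representation):
        # Idempotent: repeating the call leaves the same state.
        self.resources[path] = representation

    def delete(self, path):
        # Idempotent: deleting an already removed resource changes nothing.
        self.resources.pop(path, None)

    def post(self, collection, representation):
        # Neither safe nor idempotent: each call creates a new resource.
        path = f"{collection}/{next(self._ids)}"
        self.resources[path] = representation
        return path


store = ResourceStore()

store.put("/notes/a", "draft")
state_after_one_put = dict(store.resources)
store.put("/notes/a", "draft")  # dispatched a second time
put_is_idempotent = store.resources == state_after_one_put

first_path = store.post("/notes", "hello")
second_path = store.post("/notes", "hello")  # dispatched a second time
post_created_two = first_path != second_path

print("PUT idempotent:", put_is_idempotent)
print("POST created two resources:", post_created_two)
```

In this model, a client that is unsure whether a `PUT` or `DELETE` request has arrived can simply repeat it, whereas repeating a `POST` request may create a duplicate.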

In the subsequent HTTP response, the server informs the client about the outcome of a request by using predefined status codes. The classes of status codes can be seen in table 2.2.

Range	Status Type	Usage	Example Code
`1xx`	informational	Preliminary response codes	`100 Continue`
`2xx`	success	The request has been successfully processed.	`200 OK`
`3xx`	redirection	The client must dispatch additional requests to complete the request.	`303 See Other`
`4xx`	client error	Result of an erroneous request caused by the client.	`404 Not Found`
`5xx`	server error	A server-side error occurred (not caused by this request being invalid).	`503 Service Unavailable`

Table 2.2: A table of the code ranges of HTTP response codes. The first digit determines the status type, the last two digits the exact response code.
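Since the first digit determines the status type, a client can classify any response code with an integer division. The following sketch assumes a helper `status_class` of our own naming; the reason phrases are taken from `http.HTTPStatus` in Python's standard library.

```python
from http import HTTPStatus

# Status types keyed by the first digit of the response code.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the status type for a three-digit HTTP response code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


for code in (100, 200, 303, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```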

A simple request/response exchange can be seen in the listing below as an example:
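As a minimal sketch of such an exchange, the following code assembles a `GET` request and takes apart a matching response. The host, path, header values and the response body are assumptions chosen for illustration, and the helpers `build_request` and `parse_message` are our own; they cover only the basic message structure described above, not the full grammar of RFC 2616.

```python
def build_request(method: str, path: str, host: str) -> bytes:
    """Assemble request line, header lines and the terminating empty line."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line: method, request URI, version
        f"Host: {host}",
        "Accept: text/html",
        "",                           # empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_message(raw: bytes):
    """Split a message into start line, header dictionary and entity bytes."""
    head, _, body = raw.partition(b"\r\n\r\n")
    start_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# Hypothetical request for an HTML page.
request = build_request("GET", "/index.html", "example.com")

# Hypothetical response a server might return for this request.
entity = b"<html><body>Hi</body></html>\n"
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    + f"Content-Length: {len(entity)}\r\n".encode("ascii")
    + b"\r\n"
    + entity
)

print(request.decode("ascii"))
status_line, response_headers, response_body = parse_message(response)
print(status_line)
print(response_headers)
print(response_body)
```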

We will now have a closer look at two advanced features of HTTP that are interesting for our later considerations, namely connection handling and chunked encoding.

HTTP Connection Handling

As already mentioned, HTTP uses TCP/IP as its underlying transport protocol. We will now examine the exact usage of TCP sockets for HTTP requests. Earlier versions of HTTP suggested a separate socket connection for each request/response cycle. The overhead of establishing a new TCP connection for each request leads to poor performance, because existing connections are never reused. The non-standard Connection: Keep-Alive header was a temporary workaround, but the current HTTP 1.1 specification has addressed this issue in detail. HTTP 1.1 introduced persistent connections as the default. That is, the underlying TCP connection of an HTTP request is reused for subsequent HTTP requests. Request pipelining further improves the throughput of persistent connections by allowing clients to dispatch multiple requests without waiting for the responses to prior requests. The server then responds to all incoming requests in the same sequential order. Both mechanisms have improved the performance and decreased the latency of web applications. However, the management of multiple open connections and the processing of pipelined requests have revealed new challenges for web servers, as we will see in chapter 4.
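Connection reuse can be observed with a small local experiment. The sketch below starts a throwaway server from Python's standard library on the loopback interface and sends two requests over one `http.client.HTTPConnection`. The handler class, the paths and the list `seen_client_ports` are assumptions for illustration. Pipelining is not shown, since `http.client` does not support it.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_client_ports = []  # client-side TCP port observed for each request


class Handler(BaseHTTPRequestHandler):
    # Switches http.server to HTTP/1.1, where connections persist by default.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # The length tells the client where this response ends.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
for path in ("/first", "/second"):
    conn.request("GET", path)
    response = conn.getresponse()
    response.read()  # the entity must be consumed before the next request
conn.close()

server.shutdown()
server.server_close()

# The same client port for both requests means one TCP connection was reused.
print("client ports seen by the server:", seen_client_ports)
```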

HTTP Chunked Transfer Encoding

An HTTP message must contain the length of its entity, if any. In HTTP 1.1, this is necessary for determining the overall length of a message and for detecting the start of the next message on a persistent connection. Sometimes, the exact length of an entity cannot be determined a priori. This is particularly the case for content that is generated dynamically or for entities that are compressed on-the-fly. Therefore, HTTP provides alternative transfer encodings. When chunked transfer encoding is used, the client or server streams chunks of the entity sequentially. The length of the next chunk to expect is prepended to the actual chunk. A chunk length of 0 denotes the end of the entity. This mechanism allows the transfer of generated entities of arbitrary length.
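The framing can be sketched in a few lines. In RFC 2616, each chunk length is written as a hexadecimal number on its own line, and both the length line and the chunk data are terminated by CRLF. The functions `encode_chunked` and `decode_chunked` are our own simplified helpers: they ignore optional trailer headers and assume that the complete message is already available in memory.

```python
def encode_chunked(chunks) -> bytes:
    """Frame an iterable of byte strings using chunked transfer encoding."""
    out = bytearray()
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would end the entity prematurely
        out += f"{len(chunk):x}\r\n".encode("ascii")  # chunk length in hex
        out += chunk + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk of length 0, no trailer headers
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the entity from a chunked message body (trailers ignored)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # Drop optional chunk extensions after ';' before parsing the length.
        size_field = data[pos:line_end].split(b";")[0].decode("ascii")
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            return bytes(body)
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF


pieces = [b"Hello, ", b"chunked ", b"world!"]
wire = encode_chunked(pieces)
print(wire)
print(decode_chunked(wire))
```

Because every chunk announces its own length, the sender never needs to know the total entity size in advance, and the receiver can still detect where the message ends.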

Web Formats

HTTP does not restrict the document formats to be used for entities. However, the core idea of the WWW is based on hypermedia, and thus most of the formats are hypermedia-enabled. The single most important format is HTML, together with its descendants.

Hypertext Markup Language

HTML [Jac99] is a markup language derived from SGML [ISO86] and influenced by XML [Bra08]. HTML provides a set of elements, properties and rules for describing web pages textually. A browser parses the HTML document and uses its structural semantics to render a visual representation for humans. HTML supports hypermedia through hyperlinks and interactive forms. Also, media objects such as images can be used in an HTML document. The appearance and style of an HTML document can be customized by using CSS [Mey01]. For more dynamic user interfaces and interactive behavior, HTML documents can be enriched with embedded code of scripting languages, such as JavaScript [ECM99]. For instance, such code can be used to programmatically load new content in the background, without a complete reload of the page. This technique, also known as AJAX, has been one of the keys to more responsive user interfaces. It thus enables web applications to resemble the interfaces of traditional desktop applications.

HTML5

The fifth revision of the HTML standard [Hya09] introduces several markup improvements (e.g. semantic tags), better support for multimedia content and, most notably, a rich set of new APIs. These APIs address various features including client-side storage, offline support, device sensors for context awareness and improved client-side performance. The Web Sockets API [Hic09a] complements the traditional HTTP request/response communication pattern with a low-latency, bidirectional, full-duplex socket based on the WebSocket protocol [Fet11]. This is especially interesting for real-time web applications.

Generic Formats

Besides proprietary and customized formats, web services often make use of generic, structured formats such as XML and JSON [Cro06]. XML is a comprehensive markup language providing a rich family of related technologies, such as validation, transformation or querying. JSON is one of the lightweight alternatives that focuses solely on the succinct representation of structured data. While there is an increasing interest in lightweight formats for web services and messages, XML still provides the most extensive tool set and support.
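To give an impression of both formats, the following sketch serializes the same small record once as JSON and once as XML, using only Python's standard library. The record and the chosen XML element names are hypothetical; other XML mappings of the same data are equally valid.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used for both representations.
record = {"id": "42", "title": "Hello", "tags": ["web", "http"]}

# JSON maps directly onto dictionaries, lists and strings.
json_text = json.dumps(record)

# XML requires a decision on element and attribute names.
root = ET.Element("note", id=record["id"])
ET.SubElement(root, "title").text = record["title"]
tags_element = ET.SubElement(root, "tags")
for tag in record["tags"]:
    ET.SubElement(tags_element, "tag").text = tag
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(xml_text)

# ElementTree supports a limited subset of XPath for querying the document.
parsed = ET.fromstring(xml_text)
queried_tags = [element.text for element in parsed.findall("./tags/tag")]
print(queried_tags)
```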