HTTP log formats – Part 2: a more useful HTTP log format

Preamble

This is the second part, of a two part blog series about HTTP log formats. The first part covers the shortcomings of the Common Log Format. The second part proposes a more useful log format.

Proposed Log Format

The log format discussed in this article is written in the Apache HTTP log syntax – the format however should be easily translatable to other webserver implementations (including web-caching proxies, and application servers).

The log format that I use and recommend is as follows:

[%{%F %T %z}t] %{Referer}i %{X-Forwarded-For}i %a "%{User-Agent}i" %{sessionid}C %m %{Host}i%U%q %>s %I %O %D

Describing each variable:

%t - time the request was received (in the format YYYY/MM/dd HH:mm:ss (plus timezone offset)
%{Referer}i - contents of "Referer" [sic] HTTP header
%{X-Forwarded-For}i - contents of the "X-Forwarded-For" HTTP header
%a - remote IP address
%{User-Agent}i - contents of the "User-Agent" HTTP header
%{sessionid}C - the session identifier (in this instance the value of the cookie named "sessionid")
%{Host}i - Host (i.e. domain name)
%U - URL path
%q - query string
%>s - last status code
%I - bytes received
%O - bytes sent (including headers)
%D - time taken to serve the request (in microseconds)

Discussion

The Forward-For header (%{X-Forwarded-For}i), and remote IP address (%a)

In today’s web, an end user’s HTTP request may pass through several intermediary (HTTP) servers before it reaches your application. A modern web stack can consist of a web-caching proxy, a webserver (for serving static files), and an application server (for executing your application’s code). In addition, each of these components may live on a separate server. It is therefore not enough to just log the remote IP (%a) – it could end up being an IP to a server you own! The logged IP information can be used in a myriad of ways, including aiding in debugging, and security.

The session identifier (e.g. %{sessionid}C)

Logging session identifiers allows you to easily grep the requests coming from a particular user. This information is especially useful when several users may be coming from a single IP address. This information makes it easier to follow a user’s “path” through your web application, which can make it easier to debug certain types of issues.

Note: The session identifier (a cookie named “sessionid” in this instance) will vary between web applications and frameworks. e.g. The Django web framework by default uses a cookie named “sessionid” to store the session identifier, while PHP by default uses a cookie named “PHPSESSID.

Bytes received (%I) and Bytes sent (%O)

Logging bytes received and sent (both including headers) allows you to more accurately answer questions about website bandwidth use. For example this can be useful when attempting to identify malicious behaviour (e.g. users sending large requests), or which specific resources (URLs) are consuming the most bandwidth each month (e.g. for performance and cost analysis). In the common log format only the size of the response [excluding headers] (%b) is logged.

Note: To make use of the %I and %O variables, you need to have enabled Apache’s mod_io.

 Time taken to serve the request [in microseconds] (%D)

Recording time taken to serve the request will allow you to answer questions which can be useful when attempting to optimise the speed of your website. It’ll allow you to answer questions along the lines of:

  • which single resource takes the longest time to serve
  • which single resource accumulatively consumes the most time being served

This information can help determine which resources need to be optimised for computational speed, size — or cached if possible.