HTTP Log Formats – Part 1: The Shortcomings of the Common Log Format

Preamble

This is the first part of a two-part blog series about HTTP log formats. This part covers the shortcomings of the Common Log Format; the second proposes a more useful log format.

Introduction

The usefulness of HTTP access logs cannot be overstated. The standard HTTP log formats, the NCSA Common Log Format (CLF) and the NCSA Combined Log Format, are easy to configure and are used by default in many HTTP server implementations.

Common Log Format

The Common Log Format is from another era. My initial research into the Common Log Format shows it being referenced as early as 1995.

The Common Log Format (CLF) is:

"%h %l %u %t \"%r\" %>s %b"

In a more readable form, the following information is logged:

%h - remote hostname (or IP address)
%l - remote identifier (provided by identd on the client's machine)
%u - remote user (e.g. the username in HTTP Basic Authentication)
%t - time the request was received
%r - first line of the request (i.e. HTTP Method, URL Path, HTTP version)
%>s - final HTTP status code
%b - size of response in bytes (excludes HTTP headers)
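
To make the field list above concrete, here is a minimal Python sketch that splits a single CLF line into those seven fields. The regular expression, the function name parse_clf_line, the dictionary keys and the sample line are all my own illustrative choices, not part of any standard. The sketch also assumes the request line contains no escaped double quotes, which real-world logs do not guarantee.

```python
import re
from datetime import datetime

# Regex mirroring "%h %l %u %t \"%r\" %>s %b".
# The group names are illustrative labels, not official field names.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # %t looks like 10/Oct/2023:13:55:36 +0000 (month names assume an English locale).
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # %b is logged as "-" when no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up sample entry, for illustration only.
SAMPLE_LINE = (
    '192.0.2.10 - alice [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)

if __name__ == "__main__":
    print(parse_clf_line(SAMPLE_LINE))
```

Note that the %l field in the sample is a bare "-", which is what most real log entries contain.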

Web technologies have changed substantially over the past 25 years, and the Common Log Format could certainly be improved upon.

The shortcomings

The Common and Combined Log Formats log data that is of little use in today’s technological landscape. They also fail to log a lot of useful information, which I’ll discuss in the second part of this blog series.

Remote identifier (%l)

The remote identifier (%l) is of almost no use. The vast majority of computers (including phones/tablets, PCs, and servers) do not have an identd server installed or configured, and therefore will never provide this information.

Remote user (%u)

The remote user (%u) variable is slightly more useful, given that web servers such as Apache can authenticate users using a third-party service (e.g. LDAP or Kerberos). However, the overwhelming majority of user-facing websites handle user authentication and authorisation at the web application level, so it makes more sense for the web application to log user identifier information.

Size of response excluding headers (%b)

The size of response excluding headers (%b) variable can be a major cause of confusion and frustration. A user might want to know “which resources (URLs) consumed the most bandwidth this month?” The %b variable records only the length of the HTTP response body; it does not include the size of the headers sent. In addition, the Common Log Format does not provide any information on the size of received requests. This means that bandwidth utilisation reports generated by low-level network tools (i.e. those monitoring port 80/443) can differ significantly from reports generated by analysing HTTP logs.
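
As a rough illustration of the problem, the Python sketch below totals %b per URL path, which is about as close as the Common Log Format gets to answering the bandwidth question. The function name body_bytes_per_path, the regular expression and the sample lines are hypothetical and exist only for this example. The totals count response bodies only: the 304 entry contributes nothing, even though its response headers (and every request) still crossed the network.

```python
import re
from collections import Counter

# Captures only the request line and the trailing %b field of a CLF entry.
REQUEST_AND_SIZE = re.compile(r'"(?P<request>[^"]*)" \d{3} (?P<size>\d+|-)$')


def body_bytes_per_path(lines):
    """Sum the %b values (response body bytes only) for each URL path."""
    totals = Counter()
    for line in lines:
        match = REQUEST_AND_SIZE.search(line.strip())
        if match is None:
            continue  # skip lines that are not in CLF
        parts = match.group("request").split()
        if len(parts) < 2:
            continue  # malformed request line, no path to attribute bytes to
        size = match.group("size")
        # "-" means no body was sent; headers are never counted either way.
        totals[parts[1]] += 0 if size == "-" else int(size)
    return totals


# Made-up sample entries, for illustration only.
SAMPLE_LINES = [
    '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.2 - - [01/Jan/2024:00:00:02 +0000] "GET /logo.png HTTP/1.1" 200 20480',
    '192.0.2.1 - - [01/Jan/2024:00:00:03 +0000] "GET /index.html HTTP/1.1" 304 -',
]

if __name__ == "__main__":
    for path, total in body_bytes_per_path(SAMPLE_LINES).most_common():
        print(path, total)
```

Any report built this way will undercount relative to a network-level measurement, and the gap grows with the share of small or body-less responses.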

A more useful Log Format

In the second part of this two-part series about HTTP log formats, I’ll show you the log format that I use and recommend, and explain why I think it’s an improvement over the log formats discussed in this post.