Imperfect by Design: A Log Viewer

Guillaume Godet-Bar

Scanning local log files to check the correct behaviour of a system or find a bug is a staple of software engineering. Surprisingly enough, it is a practice that has seen almost no evolution since the dawn of the computer age, as serious investigations still rely on text searches (using regexes for more experienced engineers) and command-line tools such as grep, cat and less.

It is not that difficult to explain, though, as there is no de facto standard for formatting logs. Every language and tool has its own idiosyncrasies in terms of:

timestamp formats – displaying all the date formats available out there, separating components with colons, slashes, reverse-style dates, using time components with various resolutions, with or without locale elements;
use of brackets, parens, curly braces, dashes etc. to separate log tags;
log tag types – displaying severity, thread id, session id, current method, etc., any combination of which you can think of;
component order – timestamp, then severity, then thread id, etc.;

... and the list goes on. There is obviously no "right way" to produce logs, as any element of a log line may be relevant in some usage context.

This all puts a considerable cognitive burden on the engineer scanning log files. Searching and grepping through the text help, but on any given read through a log file, a large part of it is just noise. It's just never the same part from one day to the next.

Does that mean we're stuck with our primitive tools? Absolutely not. Most of that burden is solvable with well-chosen heuristics and good UX.

Terms, Assumptions, and Constraints

Before moving forward let's clarify a few terms I'll be using in this essay.

Header (or footer): The part of the log line that contains contextual data, typically a timestamp, a severity and other ancillary data. While the general practice is to put that data at the beginning of the line (hence header), some systems dump it at the end of the line (hence footer). While most tags (see below) are generated automatically by the logging system (timestamps in particular), it is quite common for developers to extend a header by adding tags of their own, such as module/class names.
Payload: In contrast to the line header, the payload is the specific content of the log line.
Tag: A part of the header, it identifies a consistent piece of data, for instance a timestamp (with all its date and time components), a fully-qualified name, etc.
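To make these terms concrete, here is a minimal Python sketch of how a parsed line might be modelled. The LogLine class, its field names and the sample line are all invented for illustration; this is not splog's actual data structure.

```python
from dataclasses import dataclass, field


@dataclass
class LogLine:
    """One log line, split along the terms defined above (hypothetical model)."""
    raw: str      # the complete, unmodified line
    header: str   # contextual data: timestamp, severity, other tags
    payload: str  # the specific content of the line
    tags: list[str] = field(default_factory=list)  # consistent pieces of the header


# Invented sample line, split by hand for illustration.
raw = "2024-05-01 12:00:00 INFO [worker-1] net.Client: connection established"
line = LogLine(
    raw=raw,
    header="2024-05-01 12:00:00 INFO [worker-1] net.Client:",
    payload="connection established",
    tags=["2024-05-01 12:00:00", "INFO", "worker-1", "net.Client"],
)
```

Note that the timestamp counts as a single tag here, with all its date and time components, even though it spans two whitespace-separated tokens.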

With these definitions out of the way, let's now establish our fundamental assumption about local log file analysis: there is value in partitioning log lines by recurring tags, especially method/entity names, thread ids, etc.

These partitions could be presented to the user as filtered views of the main log file. This would enable them to focus on the lines emitted by a single method, thread or entity. In this context, timestamps, severities, purely numeric values and low-occurrence tags should not be considered for the partitioning.
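As a rough illustration of that last rule, the sketch below decides which tags are worth a filtered view, given how often each one occurs. The severity list, the loose timestamp pattern and the occurrence threshold are arbitrary assumptions chosen for the example, not values taken from any real tool.

```python
import re
from collections import Counter

# Assumed severity vocabulary; real logs may use other labels.
SEVERITIES = {"TRACE", "DEBUG", "INFO", "WARN", "WARNING", "ERROR", "FATAL"}
# Loose timestamp shape: groups of digits joined by typical date/time separators.
TIMESTAMP_LIKE = re.compile(r"^\d{1,4}([-/:.,T ]\d{1,6})+Z?$")
NUMERIC = re.compile(r"^\d+$")
MIN_OCCURRENCES = 3  # arbitrary threshold for this sketch


def partition_keys(tag_counts: Counter) -> list[str]:
    """Return the tags that deserve their own filtered view."""
    keys = []
    for tag, count in tag_counts.items():
        if count < MIN_OCCURRENCES:
            continue  # low-occurrence tag
        if tag.upper() in SEVERITIES:
            continue  # severity
        if NUMERIC.match(tag) or TIMESTAMP_LIKE.match(tag):
            continue  # purely numeric value or timestamp
        keys.append(tag)
    return sorted(keys)


counts = Counter({
    "INFO": 50,
    "net.Client": 12,
    "worker-1": 7,
    "4242": 9,
    "2024-05-01 12:00:00": 3,
    "db.Pool": 1,
})
keys = partition_keys(counts)
```

With these invented counts, only net.Client and worker-1 survive: the severity, the number, the timestamp and the rare db.Pool tag are all discarded.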

For this to bring actual value to engineers used to grepping through log files, however, the partitioning (or categorization) process should be faster or more efficient in the majority of cases than manual operations with command-line tools.

Riding the Pareto curve

From a design perspective, what does this imply? Mainly that partitioning should be very fast, automatic, and unobtrusive.

However, does it have to be precise? As much as possible, of course, but my initial batch of tests on a collection of various log formats showed that a practical solution cannot be 100% accurate without a tremendous amount of (imperfect) special case handling. This is a typical Pareto situation: roughly 20% of the effort will get us 80% of the way, and getting close to 100% would require considerably more work.

In the context of log file analysis, is reaching 100% partitioning accuracy an actual imperative? I believe not: as long as it is trivially easy to switch from a partition back to the complete log file, imperfect accuracy is perfectly fine. That said, dealing with inaccuracies – ignoring or hiding false positives, and creating partitions of one's own – should be straightforward.

Also, do we want to tackle every log format under the sun? Once again I believe this would be pointless and fundamentally doomed to fail. As an initial stab at the problem, it seems reasonable to focus only on log formats that are plain UTF-8 text, with log lines consisting of a header followed by a payload (i.e., we won't be tackling cases where the payload is followed by a footer).

Technicalities of Categorization

From a technical stack perspective, automatically categorizing log lines looks like a perfect use case for AI. However, our system needs microsecond-level processing times for each log line, which is far beyond what current LLMs can deliver.

Fortunately, log files are not as unstructured and unpredictable as prose. As we mentioned earlier, some data patterns (e.g., timestamps) are fully expected to appear in the line. What we actually need is a system for identifying loose matches in partially structured data. This can be done by running our log lines through a 2-step process:

An initial structural pass to identify the header bounds in the log line [1], and identify tag candidates within those bounds, using regular expressions;
A bookkeeping pass to count the occurrences of each candidate, promoting candidates that either occur consistently or show up in bursts. Well-chosen window sizes naturally filter out most of the false positives.
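The two passes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not splog's implementation: the rule that a header ends at the first ": " or " - " separator, the tag pattern, the Bookkeeper class and all its thresholds are invented for the example. Excluding severities and similar tags, as discussed earlier, is left out for brevity.

```python
import re
from collections import Counter, deque

# Assumption: the header ends at the first ": " or " - " separator in the line.
HEADER_END = re.compile(r":\s| - ")
# Tag candidates: identifiers or dotted names that do not start mid-token.
TAG = re.compile(r"(?<![\w.])[A-Za-z_][\w.$-]*")


def structural_pass(line):
    """Return (header, payload, candidates) for one log line."""
    match = HEADER_END.search(line)
    if match is None:
        return "", line, []  # no recognisable header: everything is payload
    header, payload = line[:match.start()], line[match.end():]
    return header, payload, TAG.findall(header)


class Bookkeeper:
    """Counts candidates and promotes those that are steady or bursty."""

    def __init__(self, window=50, burst=10, steady_ratio=0.05, warmup=100):
        self.window = deque(maxlen=window)  # candidates of the most recent lines
        self.recent = Counter()             # occurrences inside the window
        self.total = Counter()              # occurrences over the whole file
        self.lines = 0
        self.burst = burst
        self.steady_ratio = steady_ratio
        self.warmup = warmup
        self.promoted = set()

    def feed(self, candidates):
        self.lines += 1
        if len(self.window) == self.window.maxlen:
            self.recent.subtract(self.window[0])  # oldest line leaves the window
        self.window.append(candidates)
        self.recent.update(candidates)
        self.total.update(candidates)
        for tag in candidates:
            bursty = self.recent[tag] >= self.burst
            steady = (self.lines >= self.warmup
                      and self.total[tag] / self.lines >= self.steady_ratio)
            if bursty or steady:
                self.promoted.add(tag)


# Invented sample lines, loosely shaped like the Spark format.
sample = ["17/06/09 20:10:40 INFO net.Link: System A is [Connected]"] * 12
sample.append("17/06/09 20:10:41 WARN io.Rare: one-off event")

keeper = Bookkeeper()
for text in sample:
    header, payload, candidates = structural_pass(text)
    keeper.feed(candidates)
```

Because candidates are only extracted from the header, the bracketed Connected in the payload never reaches the bookkeeping pass, while net.Link is promoted once it has appeared ten times within the window. The one-off io.Rare tag stays below both thresholds.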

Following this process, we can present to the user log filters that have a strong probability of making sense for their scanning task. How? With a TUI, obviously!

Terminal Considerations

I'm not sure that, as an industry, we fully grasp yet the quiet revolution that commoditized terminal user interfaces (TUIs) are driving in software engineering – with brilliant libraries like Ratatui (for Rust), Bubble Tea (for Go) or Textual (for Python). For the last 20+ years, the simplest way to present data, enable simple interactions, and guarantee a manageable level of portability has been to build web interfaces. Here, that would involve deploying a server just to access our local data, which is overkill in this context, and a showstopper when working in a cloud environment, where we'll want to access the data directly, without any fuss.

TUIs, on the other hand, are far simpler to build and trivial to run. They enable our users to stay at the terminal and file descriptor level. Also, while display performance may require quite a bit of work in web UIs, it is far more manageable when displaying glyphs in a terminal.

To the point: a TUI is the perfect vehicle to present our log file with the filtered lines dispatched into a separate pane for each identified tag. Each pane should be handled as an easily accessible tab, and it should be very easy to add a filter/tab manually and hide unwanted tabs with just a keystroke.
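Stripped of any rendering, the interaction model described here boils down to a small piece of state. The sketch below is a hypothetical model in plain Python, with no TUI library involved: the Tab and Tabs classes, their method names and the substring-based filtering are assumptions made for illustration. In a real interface, each method would be bound to a single keystroke.

```python
from dataclasses import dataclass, field


@dataclass
class Tab:
    title: str
    needle: str          # substring a line must contain to show up in this tab
    hidden: bool = False


@dataclass
class Tabs:
    lines: list[str]
    tabs: list[Tab] = field(default_factory=list)
    active: int = -1     # -1 means "complete log file"

    def add_filter(self, needle: str) -> None:
        """Manually add a filter; it becomes the active tab."""
        self.tabs.append(Tab(title=needle, needle=needle))
        self.active = len(self.tabs) - 1

    def hide_active(self) -> None:
        """Hide the active tab and fall back to the complete log."""
        if self.active >= 0:
            self.tabs[self.active].hidden = True
            self.active = -1

    def visible_lines(self) -> list[str]:
        """Lines to display for the active tab, or every line for the full log."""
        if self.active < 0:
            return self.lines
        needle = self.tabs[self.active].needle
        return [line for line in self.lines if needle in line]


# Invented sample lines.
ui = Tabs(lines=["INFO net.Link: up", "INFO db.Pool: ready", "WARN net.Link: down"])
ui.add_filter("net.Link")
filtered = ui.visible_lines()
ui.hide_active()
```

Hiding a tab only flags it and returns to the complete log file, so a false positive can be dismissed without losing any data.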

Introducing `splog`

A few days ago I released splog. It is the tool that came out of the design process outlined in this essay.

The following demo highlights how the tool automatically picks up relevant tags (e.g., spark.SecurityManager, rdd.HadoopRDD) from a Spark log file (found here), and displays them in their own easily-accessible panes. The demo also shows how the contents of a category pane can be searched, and how the search results can be directly promoted to yet another pane to narrow down the scope of an investigation further.

The tool is live on GitHub, and can be installed via `cargo install splog` for the Rust-inclined, and otherwise by downloading precompiled binaries from the releases page. In the time-honored fashion of the Rust ecosystem, although the crate is tagged with version 0.1.0, it should be considered mostly feature-complete at this point.

[1] Identifying the header and then extracting tags from within its bounds avoids the issue of picking up recurring, tag-like strings from the payload (e.g., System A is [Connected], which would eventually pick Connected as a category).