Hash Consing

Jul 04, 2017 — Tags: The editor

Some operations from our clef take a full history as one of their arguments. In this article we’ll show how this can be efficiently implemented using a technique similar to Hash Consing.

When introducing the editor we introduced a set of operations that represent changes (we called this a clef). Two of the operations presented so far take a full history as one of their parameters:

Insert index history
Replace index history

We might wonder how to represent a serialization of such an operation, with the purpose of storing it in memory on disk or sending it over a network. Such a serialization must include a serialization of the parameter history.

In this article we’ll show how this is presently implemented in the editor.

The problem: history-size

In a naive implementation we might serialize the full history and include it into the serialization of the Replace operation. The problem with this approach is that histories tend to grow over time (by definition).

To see to which degree this would be an unworkable approach, consider the following. The raison d’être of the Replace operation as defined above is to allow the histories of arbitrary hierarchical structures to be composed of the histories of their substructures.

As an example of a practical application of that principle would be to keep track of the history of a single computer’s Operating System and applications. In the composed approach, such a system-level history is composed of the histories of each application, which are composed of the histories of their modules, functions, etc. down to the histories of the smallest expressions of which the applications are composed.

The smallest change to such a system would be to make a single change to one such of the smallest such sub-expressions, and then to bubble up the change through recursive application of the Replace operation. This would then result in the following Replace operations:

The history of the system as a whole is extended with a Replace operation which takes as a parameter…
The history of some software application as it was previously, but extended with a Replace operation which takes as a parameter…
The history of some module as it was previously, but extended with a Replace operation which takes as a parameter…

etc.

If we model the serialization of such Replace operations naively, by including the full history as a parameter, the top-level Replace from the above example would need to contain the full history of some application, which in turn contains full histories of modules, etc etc. This leads to polynomial space-complexity in the length of the history (with the depth of the tree being the exponent).

The challenge is therefore: being able to reference arbitrarily large histories at a constant cost.

Hashes to the rescue

The present prototype of the editor solves this problem as such:

First, we recognise that the history is a list of changes; the oldest change being the first element, the newest change being the last element in the list.

Any list of serializable elements may be modeled as a collection of records, where each record consists of 2 fields:

The element, as a serialized string of bytes.
Either [a] a hash of the previous record [b] a special value to denote no such record exists.

We can compute such a hash if the layout of the record is well-defined; the simplest such defintion (which is the one used in the editor) is to concatenate the serialized values of the 2 fields.

If we keep a single hash table in which we can look up records given their hash, and this hash table contains (at least) all records from a given list, we can always reconstruct lists from a given the hash of the last element of the list.

A similar technique in Lisp implementations is called Hash Consing.¹ More recent analogues are git’s commit object structure (though it implements a directed acyclic graph rather than a list because multiple ‘previous’ records may be referenced) and bitcoin’s blockchain.

Recognizing that any list can be written in the Hash Consing style, and that histories are lists, we’ve implemented histories as such.

It is precisely the hash of such a record that we can refer to from when serializing the operations Insert and Replace, which takes constant space. This means that the space-complexity of our implementation of history is bounded by the product of the depth of the tree and the length of the history.

Considerations

Finally, we’ll look at some considerations.

It is not strictly necessary to use a hash-based scheme to achieve the above-mentioned advantages in space-complexity. In principle, any scheme that deduplicates shared histories would suffice.

For exampe, we might implement the histories as backlinked lists in-memory, and implement Replace to contain a pointer to the relevant item from the list.

However, using a hash-based scheme brings several advantages over such an approach:

There is no tie-in to the memory addressing scheme of a particular process; the in-memory format may thus be used as a serialization format for storage on disk or sharing over the network.
Deduplication of any shared history is automatic, there is no need to keep track of shared histories programatically.

It would be tempting but incorrect to use the term “hash list” to describe such a construct: that term is used to denote a structure where a list of hashses is used as an input to a single root hash. In other words: the individual elements of a hash list are not linking to each other, but are linked together at a single point. The term “hash chain” seems to be reserved for repeated application a hash function to a single input. ↩