Hash tables, mutability, and identity: How to implement a bi-directional hash table in Java or .NET
Data structures like hash tables and hash maps are essential for storing arbitrary objects and efficiently retrieving them. The object, or value, is stored in the table and associated with a key. Using the key, one can later retrieve the associated object. However, there are situations where, in addition to mapping keys to objects, one might also want to retrieve the key associated with a given object. We’ve encountered this problem, and the solution is much trickier.
For example, in JNBridgePro, we need to efficiently keep track of the underlying remote objects corresponding to proxies. We do that by maintaining a table mapping unique keys to objects. This is straightforward to implement, using hash tables and maps available in many libraries. We also need to be able to obtain an object’s unique key, given the object – provided it has one.
(Yes, some libraries now offer bimaps, which purport to implement this functionality, but these didn’t exist when we first needed this, so we needed to implement it on our own. In addition, these bimaps don’t provide everything we need. We’ll discuss these issues later in the post.)
We’d like to share some of our experience in implementing these forward/reverse map pairs. Most of this information has appeared piecemeal in various places, but as far as we know this is the first place where these issues have been discussed for both Java and .NET.
Hashing Algorithm Requirements
Before we start, let’s review the requirements that must be fulfilled by whatever hashing algorithm we use, and how it relates to equality. While a hashing method can be implemented by the developer, it is expected to obey a contract. While the contract in general cannot be verified by the underlying platform, if it is not obeyed, data structures that rely on the hashing method may not behave properly.
In Java, for example, the contract of the hashCode()
method is:
- For any given object, the
hashCode()
method must return the same value throughout execution of the application (assuming that information used by the object’s equality test also remains unchanged). - If two objects are equal according to the objects’ equality test, then they must both have the same hash code.
Connected with the hashCode()
method is an equality test, which should implement an equivalence relation; that is, it must be reflexive, symmetric, and transitive. It should also be consistent; that is, it should yield the same result throughout execution of the application (again, assuming the information used by the equality test doesn’t change).
In .NET, the GetHashCode()
method and equality mechanism must obey a similar contract.
In addition to these contracts, both Java and .NET provide guidelines for the use and implementation of hash codes, although these are not binding and may just be advisory. For example, one such set of guidelines for .NET suggests that hash codes for mutable objects should only be based on aspects of the objects that are immutable, so that the hash codes never change; however, they loosen this to say that an object’s hash code must never change while it’s contained in a data structure that depends on the hash code not changing. As we’ll see, this is a crucial consideration when we implement our reverse maps.
In Java culture, the guidelines for implementing hashCode()
are less strict, and it is quite likely that an object’s hash code can change; for example, the hash code of a hash table object can change as items are added and removed. However, one informal guideline suggests that, as with .NET, one should be careful using an object as a key in a hash-based collection when its hash code can change.
If an object’s hash code changes while it’s stored inside a data structure that depends on the hash code (for example, if it’s used as a key in a hash table), then the object may never be retrieved, as it will be placed in one hash bucket as it’s added to the hash table, but looked for later in another hash bucket when the hash code has changed.
Implementation Assumptions
So, what does this mean when we need to implement forward/reverse map pairs? Let’s start with some assumptions:
- The forward map maps from keys to values. The keys are a particular type that we choose in advance, and values can be any object. The value objects can be implemented by anyone, and their hash codes and equality tests may not conform to implementation guidelines; they may not even obey the required contract.
- The reverse map maps from the user-provided values back to keys.
- For simplicity, the user-defined object cannot be null (although, if necessary, this can be accommodated, too, through some additional tests).
- Keys and user-defined objects should be unique; that is, they should not be used in the tables more than once.
Since the key is under our control, we can use objects of any immutable class (for example, Integer or Long), and avoid the possibility that the key’s hash code changes. For the forward map, we can use a simple hash table or hash map.
The reverse map is more of a problem. As discussed above, we can’t rely on the hash code method associated with the user-provided objects that we’re using as keys. Even in .NET, where Microsoft seems to have adhered to their guidance that hashing algorithms should be based on immutable attributes of data structures (.NET-based Hashtable objects, for example, don’t change their hash codes as objects are added and removed), there’s no guarantee that Microsoft has adhered to this guidance with all data structures, or that they will continue to do so in the future. Even if they do, it’s unsafe to trust that user-defined classes will obey these rules, or that the programmers that defined them were even aware of these guidelines.
Find an Immutable Attribute
What we need to do is to come up with an attribute that every object has, that is guaranteed to never change, even when the contents of the object do change. The attribute’s value should also be well distributed, and therefore suitable for use in hashing. One such immutable attribute is the object’s identity. When a hash table, for example, has items added to it, it’s still the same hash table, even though the computed hash code might change. Fortunately, both Java and .NET provide hashing methods based on object identity. In Java, there’s a method System.identityHashCode()
, which is guaranteed to return the same value on a given object, even if its state has changed, since it is guaranteed to use java.lang.Object’s hashCode()
method, which is identity-based, even if the target class overrides that method. Similarly, .NET provides a method System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode()
, which works the same way.
Java actually provides a class java.util.IdentityHashMap, which uses identity hashing. Unfortunately, .NET doesn’t provide a counterpart class. On the Java side, we could have used IdentityHashMap for our reverse hash table, except that, unlike the Java Hashtable, IdentityHashMap is not synchronized, and we need hash tables that are thread safe.
Creating Identity-Based Hashing
In order to write our own identity-based hash tables, we need to make sure that, no matter what object is used as the key, we always use identity-based hashing. Unfortunately, we can’t override the hash methods for these classes, since they’re out of our control. Instead, we can make sure that the key objects are wrapped in classes that are in our control, and where we can control the way the hash values are computed. In these wrappers, the hash method simply returns the identity-based hash code of the wrapped object. In addition, since we’re using identity-based hashing, we need to also use reference-based equality, so that two objects are equal if and only if they’re the same object, rather than simply equivalent objects. In Java, we need to use the ‘==’ operator, which is guaranteed to be reference equality, rather than equals()
, which can be, and often is, redefined by the developer. Similarly, in .NET, we need to use Object.ReferenceEquals()
rather than Equals()
or the ‘==’ operator, which by default is reference equality (except when used by strings), but which can also be overridden by the developer.
In Java, our identity-based wrappers look like this:
final class IdentityWrapper { private Object theObject; public IdentityWrapper(Object wrappedObject) { theObject = wrappedObject; } public boolean equals(Object obj2) { if (obj2 == null) return false; if (!(obj2 instanceof IdentityWrapper)) return false; return (theObject == ((IdentityWrapper)obj2).theObject); } public int hashCode() { return System.identityHashCode(theObject); } }
In C#, the wrapper can be defined as follows:
sealed class IdentityWrapper { private object theObject; public IdentityWrapper(object o) { theObject = o; } public override bool Equals(object obj) { if (obj == null) return false; if (!(obj is IdentityWrapper)) return false; object otherObject = ((IdentityWrapper)obj).theObject; return Object.ReferenceEquals(theObject, otherObject); } public override int GetHashCode() { return System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode(theObject); } }
Once we’ve defined these wrappers, we have everything we need for a reverse hash table that works correctly. In Java:
Hashtable ht = new Hashtable(); … ht.put(new IdentityWrapper(mutableUserDefinedObject), value); … mutableUserDefinedObject.modify(); … Value v = (Value) ht.get(new IdentityWrapper(mutableUserDefinedObject)); // retrieved v is the same as the value that was initially added
and similarly in .NET.
If the IdentityWrapper classes are not used, the ht.get() operation is not guaranteed to retrieve the proper value.
At this point, we have everything we need to implement bi-directional hash tables.
The Trouble With Other Libraries
What about other existing libraries? In particular, what about Google’s Guava library, which implements a HashBiMap class, as well as other classes implementing a BiMap interface? Why not use that, and avoid reinventing the wheel? Unfortunately, while HashBiMap implements a forward/reverse hashmap pair, and makes sure that the two are always in sync, it does not use identity hashing, and will not work properly if one of the key or values is a mutable object. This can be seen by examining the HashBiMap source code. So, while HashBiMap solves part of the problem of implementing forward/reverse HashMap pairs, it does not address another, arguably more difficult part: the problem of storing mutable objects. The approach we describe here solves that issue.
In Conclusion
This post discussed an important, but unfortunately somewhat obscure, issue in the implementation of hash tables: the way in which mutability can affect the behavior of hash tables, and the way in which misunderstanding of the issue can cause objects stored in a hash table to become irretrievable, and essentially disappear. When these issues are understood, it becomes possible to implement hash tables where any object, no matter how complex or mutable, can be used as a key, and where bi-directional hash tables can be easily created.