View Javadoc
1   /*
2    * Copyright (C) 2011 The Guava Authors
3    *
4    * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
5    * in compliance with the License. You may obtain a copy of the License at
6    *
7    * http://www.apache.org/licenses/LICENSE-2.0
8    *
9    * Unless required by applicable law or agreed to in writing, software distributed under the License
10   * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
11   * or implied. See the License for the specific language governing permissions and limitations under
12   * the License.
13   */
14  
15  package com.google.common.hash;
16  
17  import com.google.common.annotations.Beta;
18  import com.google.common.primitives.Ints;
19  import java.nio.ByteBuffer;
20  import java.nio.charset.Charset;
21  
22  /**
23   * A hash function is a collision-averse pure function that maps an arbitrary block of data to a
24   * number called a <i>hash code</i>.
25   *
26   * <h3>Definition</h3>
27   *
28   * <p>Unpacking this definition:
29   *
30   * <ul>
31   * <li><b>block of data:</b> the input for a hash function is always, in concept, an ordered byte
32   *     array. This hashing API accepts an arbitrary sequence of byte and multibyte values (via
33   *     {@link Hasher}), but this is merely a convenience; these are always translated into raw byte
34   *     sequences under the covers.
35   *
36   * <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit length
37   *     (given by {@link #bits}). For example, {@link Hashing#sha1} produces a 160-bit number, while
38   *     {@link Hashing#murmur3_32()} yields only 32 bits. Because a {@code long} value is clearly
39   *     insufficient to hold all hash code values, this API represents a hash code as an instance of
40   *     {@link HashCode}.
41   *
42   * <li><b>pure function:</b> the value produced must depend only on the input bytes, in the order
43   *     they appear. Input data is never modified. {@link HashFunction} instances should always be
44   *     stateless, and therefore thread-safe.
45   *
46   * <li><b>collision-averse:</b> while it can't be helped that a hash function will sometimes produce
47   *     the same hash code for distinct inputs (a "collision"), every hash function strives to
48   *     <i>some</i> degree to make this unlikely. (Without this condition, a function that always
49   *     returns zero could be called a hash function. It is not.)
50   * </ul>
51   *
52   * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield unequal
53   * <i>often</i>." This is the most important characteristic of all hash functions.
54   *
55   * <h3>Desirable properties</h3>
56   *
57   * <p>A high-quality hash function strives for some subset of the following virtues:
58   *
59   * <ul>
60   * <li><b>collision-resistant:</b> while the definition above requires making at least <i>some</i>
61   *     token attempt, one measure of the quality of a hash function is <i>how well</i> it succeeds
62   *     at this goal. Important note: it may be easy to achieve the theoretical minimum collision
63   *     rate when using completely <i>random</i> sample input. The true test of a hash function is
64   *     how it performs on representative real-world data, which tends to contain many hidden
65   *     patterns and clumps. The goal of a good hash function is to stamp these patterns out as
66   *     thoroughly as possible.
67   *
68   * <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should yield only
69   *     the expected <i>twofold</i> increase to all collision rates. Informally, the "information" in
70   *     the hash code should be as evenly "spread out" through the hash code's bits as possible. The
71   *     result is that, for example, when choosing a bucket in a hash table of size 2^8, <i>any</i>
72   *     eight bits could be consistently used.
73   *
74   * <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are designed to
75   *     make it as infeasible as possible to reverse-engineer the input that produced a given hash
76   *     code, or even to discover <i>any</i> two distinct inputs that yield the same result. These
77   *     are called <i>cryptographic hash functions</i>. But, whenever it is learned that either of
78   *     these feats has become computationally feasible, the function is deemed "broken" and should
79   *     no longer be used for secure purposes. (This is the likely eventual fate of <i>all</i>
80   *     cryptographic hashes.)
81   *
82   * <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration. We have
83   *     published <a href="#noWeHaventYet">microbenchmark results</a> for many common hash functions.
84   * </ul>
85   *
86   * <h3>Providing input to a hash function</h3>
87   *
88   * <p>The primary way to provide the data that your hash function should act on is via a
89   * {@link Hasher}. Obtain a new hasher from the hash function using {@link #newHasher}, "push" the
90   * relevant data into it using methods like {@link Hasher#putBytes(byte[])}, and finally ask for the
91   * {@code HashCode} when finished using {@link Hasher#hash}. (See an {@linkplain #newHasher example}
92   * of this.)
93   *
94   * <p>If all you want to hash is a single byte array, string or {@code long} value, there are
95   * convenient shortcut methods defined directly on {@link HashFunction} to make this easier.
96   *
97   * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code
98   * T} provided that you implement a {@link Funnel}{@code <T>} to specify how to "feed" data from
99   * that object into the function. (See {@linkplain Hasher#putObject an example} of this.)
100  *
101  * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always interpreted in
102  * <i>little-endian</i> order. That is, hashing the byte array {@code {0x01, 0x02, 0x03, 0x04}} is
103  * equivalent to hashing the {@code int} value {@code 0x04030201}. If this isn't what you need,
104  * methods such as {@link Integer#reverseBytes} and {@link Ints#toByteArray} will help.
105  *
106  * <h3>Relationship to {@link Object#hashCode}</h3>
107  *
108  * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no separation
109  * between hash algorithms and the data they act on, so alternate hash algorithms can't be easily
110  * substituted. Also, implementations of {@code hashCode} tend to be poor-quality, in part because
111  * they end up depending on <i>other</i> existing poor-quality {@code hashCode} implementations,
112  * including those in many JDK classes.
113  *
114  * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak collision
115  * prevention and <i>no</i> expectation of bit dispersion. This leaves them perfectly suitable for
116  * use in hash tables, because extra collisions cause only a slight performance hit, while poor bit
117  * dispersion is easily corrected using a secondary hash function (which all reasonable hash table
118  * implementations in Java use). For the many uses of hash functions beyond data structures,
119  * however, {@code Object.hashCode} almost always falls short -- hence this library.
120  *
121  * @author Kevin Bourrillion
122  * @since 11.0
123  */
124 @Beta
125 public interface HashFunction {
126   /**
127    * Begins a new hash code computation by returning an initialized, stateful {@code
128    * Hasher} instance that is ready to receive data. Example: <pre>   {@code
129    *
130    *   HashFunction hf = Hashing.md5();
131    *   HashCode hc = hf.newHasher()
132    *       .putLong(id)
133    *       .putBoolean(isActive)
134    *       .hash();}</pre>
135    */
136   Hasher newHasher();
137 
138   /**
139    * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the expected
140    * size of the input (in bytes). This is only important for non-streaming hash functions (hash
141    * functions that need to buffer their whole input before processing any of it).
142    */
143   Hasher newHasher(int expectedInputSize);
144 
145   /**
146    * Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given
147    * {@code int} value, interpreted in little-endian byte order. The implementation <i>might</i>
148    * perform better than its longhand equivalent, but should not perform worse.
149    *
150    * @since 12.0
151    */
152   HashCode hashInt(int input);
153 
154   /**
155    * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the given
156    * {@code long} value, interpreted in little-endian byte order. The implementation <i>might</i>
157    * perform better than its longhand equivalent, but should not perform worse.
158    */
159   HashCode hashLong(long input);
160 
161   /**
162    * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation <i>might</i>
163    * perform better than its longhand equivalent, but should not perform worse.
164    */
165   HashCode hashBytes(byte[] input);
166 
167   /**
168    * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation
169    * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
170    *
171    * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length} or
172    *     {@code len < 0}
173    */
174   HashCode hashBytes(byte[] input, int off, int len);
175 
176   /**
177    * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation <i>might</i>
178    * perform better than its longhand equivalent, but should not perform worse.
179    *
180    * @since 23.0
181    */
182   HashCode hashBytes(ByteBuffer input);
183 
184   /**
185    * Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation
186    * <i>might</i> perform better than its longhand equivalent, but should not perform worse. Note
187    * that no character encoding is performed; the low byte and high byte of each {@code char} are
188    * hashed directly (in that order).
189    *
190    * <p><b>Warning:</b> This method will produce different output than most other languages do when
191    * running the same hash function on the equivalent input. For cross-language compatibility, use
192    * {@link #hashString}, usually with a charset of UTF-8. For other use cases, use {@code
193    * hashUnencodedChars}.
194    *
195    * @since 15.0 (since 11.0 as hashString(CharSequence)).
196    */
197   HashCode hashUnencodedChars(CharSequence input);
198 
199   /**
200    * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded using
201    * the given {@link Charset}. The implementation <i>might</i> perform better than its longhand
202    * equivalent, but should not perform worse.
203    *
204    * <p><b>Warning:</b> This method, which reencodes the input before hashing it, is useful only for
205    * cross-language compatibility. For other use cases, prefer {@link #hashUnencodedChars}, which is
206    * faster, produces the same output across Java releases, and hashes every {@code char} in the
207    * input, even if some are invalid.
208    */
209   HashCode hashString(CharSequence input, Charset charset);
210 
211   /**
212    * Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation
213    * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
214    *
215    * @since 14.0
216    */
217   <T> HashCode hashObject(T instance, Funnel<? super T> funnel);
218 
219   /**
220    * Returns the number of bits (a multiple of 32) that each hash code produced by this hash
221    * function has.
222    */
223   int bits();
224 }