Java中的String.hashCode()方法可能有问题？_Java_Andy

AI实践哪家强？来 AICon，解锁技术前沿，探寻产业新机！了解详情 



 写点什么

过去几天，我一直在浏览 Reddit 上的一篇文章。这篇文章看得我要抓狂了。文章指出，Java 中的 String.hashCode() 方法（将任意长度的字符串对象映射成 32 位 int 值）生成的哈希值存在冲突。文章作者似乎对这个问题感到很惊讶，并声称 String.hashCode() 的算法是有问题的。用作者自己的话说：

不管使用哪一种哈希策略，冲突都是不可避免的，但其中总有相对较好的哈希也有较差的哈希。你可以认为 String 中的哈希是比较差的那种。

作者的措辞带有相当强烈的意味，并且已经证明了很多奇怪的短字符串在生成哈希时会产生冲突。（文章中提供了很多示例，例如!~ 和"_）。众所周知，32 位字符串哈希函数确实会发生很多冲突，但从经验来看，在真实的场景中，String.hashCode() 能够很好地管理哈希冲突。

那么“差”的哈希是什么样子的呢？而“好”的哈希又是什么样子的？

一点理论

32 位哈希只能占用 2^32 = 4,294,967,296 个唯一值。因为字符串中可以包含任意数量的字符，所以可能的字符串显然要比这个数字多得多。因此，根据鸽子原则，必然会存在冲突。

但冲突的可能性有多大？

著名的生日问题表明，对于 365 个可能的“哈希值”，在哈希冲突可能性达到 50％之前，必须计算出 23 个唯一哈希值。如果有 2^32 个可能的哈希值，那么在达到 50％的哈希冲突可能性之前，必须计算出大约 77,164 个唯一哈希值。根据这个近似公式：

复制代码

 from math import exp
def prob(x):
    return 1.0 -exp(-0.5 * x * (x-1) / 2**32)
prob(77163) # 0.4999978150170551
prob(77164) # 0.500006797931095

那么对于给定数量的独立哈希，预计会发生多少次冲突？所运的是，维基百科为此提供了一个封闭式方程式：

复制代码

 def count(d, n):
    return n - d + d * ((d - 1) / d)**n

这种封闭式的解决方案可用于在实际的哈希函数中加入理论拟合。

一点实践

那么 String.hashCode() 符合标准吗？试着运行这段代码：

复制代码

 import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
 
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
 
import java.nio.charset.StandardCharsets;
 
public class HashTest {
    private static  Map<Integer,Set> collisions(Collection values) {
        Map<Integer,Set> result=new HashMap<>();
        for(T value : values) {
            Integer hc=Integer.valueOf(value.hashCode());
    
            Set bucket=result.get(hc);
            if(bucket == null)
                result.put(hc, bucket = new TreeSet<>());
    
            bucket.add(value);
        }
        return result;
    }
 
    public static void main(String[] args) throws IOException {
        System.err.println("Loading lines from stdin...");
        Set lines=new HashSet<>();
        try (BufferedReader r=new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            for(String line=r.readLine();line!=null;line=r.readLine())
                lines.add(line);
        }
 
        // Warm up, if you please
        System.err.print("Warming up");
        for(int i=0;i<10;i++) {
            System.err.print(".");
            collisions(lines);
        }
        System.err.println();
 
        System.err.println("Computing collisions...");
        long start=System.nanoTime();
        Map<Integer,Set> collisions=collisions(lines);
        long finish=System.nanoTime();
        long elapsed=finish-start;
 
        int maxhc=0, maxsize=0;
        for(Map.Entry<Integer,Set> e : collisions.entrySet()) {
            Integer hc=e.getKey();
            Set bucket=e.getValue();
            if(bucket.size() > maxsize) {
                maxhc = hc.intValue();
                maxsize = bucket.size();
            }
        }
 
        System.out.println("Elapsed time: "+elapsed+"ns");
        System.out.println("Total unique lines: "+lines.size());
        System.out.println("Time per hashcode: "+String.format("%.4f", 1.0*elapsed/lines.size())+"ns");
        System.out.println("Total unique hashcodes: "+collisions.size());
        System.out.println("Total collisions: "+(lines.size()-collisions.size()));
        System.out.println("Collision rate: "+String.format("%.8f", 1.0*(lines.size()-collisions.size())/lines.size()));
        if(maxsize != 0)
            System.out.println("Max collisions: "+maxsize+" "+collisions.get(maxhc));
    }
}

我们使用短字符串（words.txt，链接见文末）作为输入：

复制代码

 $ cat words.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 49117411ns
Total unique lines: 466544
Time per hashcode: 105.2793ns
Total unique hashcodes: 466188
Total collisions: 356
Collision rate: 0.00076306
Max collisions: 3 [Jr, KS, L4]

在这些英文短字符串中，总共有 466,544 个哈希，出现 356 次冲突。从理论上讲，“公平”的哈希函数应该只会产生 25.33 次冲突。因此，String.hashCode() 产生的冲突是公平哈希函数的 14.05 倍： 356.0 / 25.33 ≈ 14.05

不过，每 10,000 个哈希出现 8 次冲突的概率仍然是个不错的成绩。

那么长字符串值的结果怎样？使用莎士比亚全集中的句子（链接见文末）会产生以下输出：

复制代码

 $ cat shakespeare.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 24106163ns
Total unique lines: 111385
Time per hashcode: 216.4220ns
Total unique hashcodes: 111384
Total collisions: 1
Collision rate: 0.00000897
                0.00076306
Max collisions: 2 [    There's half a dozen sweets.,   PISANIO. He hath been search'd among the dead and living,]

在这些较长的英语字符串中，总共有 111,385 个哈希，出现 1 次冲突。“公平”哈希函数将在这些数据上产生 1.44 次冲突，因此 String.hashCode() 优于公平哈希函数，冲突可能性是公平哈希函数的 69.4％： 1 / 1.44 ≈ 0.694

也就是说，每 100,000 个哈希产生不到 1 个冲突，这个成绩是极好的。

一点解释

显然，String.hashCode() 不具备唯一性，它也不可能具备唯一性。对于短字符串，它与理论平均值差得比较远，但其实做得还算不错。对于长字符串，它可以轻松打败平均理论值。

总得来看，它对于预期字符串而言是具备唯一性的，可以将字符串很好地分布在哈希表中。

最后，我还是认为 String.hashCode() 是具备唯一性的，至少它足够“好”。

延伸阅读

如果你对这个问题感兴趣，我强烈建议你看一看 Stack Overflow 上的答案（ https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed#answer-145633 ），它深入探讨了哈希函数冲突的问题。

重要链接：

Reddit 文章： https://www.reddit.com/r/coding/comments/967hci/stringhashcode_is_not_even_a_little_unique/

生日问题： https://en.wikipedia.org/wiki/Birthday_problem

words.txt： http://sigpwned.com/wp-content/uploads/2018/08/words.txt

莎士比亚长句： http://sigpwned.com/wp-content/uploads/2018/08/shakespeare.txt

查看英文原文： http://sigpwned.com/2018/08/10/string-hashcode-is-plenty-unique/

发布

暂无评论

创作场景

Java 中的 String.hashCode() 方法可能有问题？

一点理论

一点实践

一点解释

延伸阅读

重要链接：

评论

netty案例，netty4.1高级应用篇二，手写RPC框架第二章《netty通信》

netty案例，netty4.1高级应用篇三，手写RPC框架第三章《RPC中间件》

大数据技术思想入门(二)：分布式存储集群特点

阿里内部流传的Mybatis笔记终于流传出来了，赶紧收藏

netty案例，netty4.1中级拓展篇八《Netty心跳服务与断线重连》

netty案例，netty4.1高级应用篇一，手写RPC框架第一章《自定义配置xml》

netty案例，netty4.1源码分析篇三《Netty服务端初始化过程以及反射工厂的作用》

书摘之《堂吉诃德》—— 谁不曾想过仗剑走天涯？

netty案例，netty4.1中级拓展篇九《Netty集群部署实现跨服务端通信的落地方案》

Week10--课后作业

Spring的Controller是单例还是多例？怎么保证并发的安全

netty案例，netty4.1源码分析篇六《Netty异步架构监听类Promise源码分析》

netty案例，netty4.1源码分析篇二《ServerBootstrap配置与绑定启动》

netty案例，netty4.1源码分析篇四《ByteBuf的数据结构在使用方式中的剖析》

netty案例，netty4.1中级拓展篇十二《Netty流量整形数据流速率控制分析与实战》

Week10---课后总结

8锁问题

netty案例，netty4.1中级拓展篇十一《Netty基于ChunkedStream数据流切块传输》

netty案例，netty4.1源码分析篇一《NioEventLoopGroup源码分析》

做职场里的“超级英雄”，需要怎样的盔甲与工具？

一个实用的开源项目，可以快速将 Elasticsearch 数据导出到 csv

大龄程序员的自我介绍 v 0.1

netty案例，netty4.1中级拓展篇十《Netty接收发送多种协议消息类型的通信处理方案》

netty案例，netty4.1中级拓展篇十三《Netty基于SSL实现信息传输过程中双向加密验证》

netty案例，netty4.1源码分析篇五《一行简单的writeAndFlush都做了哪些事》

数字化背景下的经济社会发展的新特征新趋势

世界正在重塑加密货币将扮演什么角色

区块链的共识机制有哪些好处优势？

JDK8 日期 API 使用

程序开发中的持续集成、持续交付、持续部署

spring事务的这10种坑，你稍不注意可能就会踩中

	from math import exp
	def prob(x):
	return 1.0 -exp(-0.5 * x * (x-1) / 2**32)
	prob(77163) # 0.4999978150170551
	prob(77164) # 0.500006797931095

	import java.io.BufferedReader;
	import java.io.InputStreamReader;
	import java.io.IOException;

	import java.util.Collection;
	import java.util.HashMap;
	import java.util.HashSet;
	import java.util.Map;
	import java.util.Set;
	import java.util.TreeSet;

	import java.nio.charset.StandardCharsets;

	public class HashTest {
	private static Map<Integer,Set> collisions(Collection values) {
	Map<Integer,Set> result=new HashMap<>();
	for(T value : values) {
	Integer hc=Integer.valueOf(value.hashCode());

	Set bucket=result.get(hc);
	if(bucket == null)
	result.put(hc, bucket = new TreeSet<>());

	bucket.add(value);
	}
	return result;
	}

	public static void main(String[] args) throws IOException {
	System.err.println("Loading lines from stdin...");
	Set lines=new HashSet<>();
	try (BufferedReader r=new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
	for(String line=r.readLine();line!=null;line=r.readLine())
	lines.add(line);
	}

	// Warm up, if you please
	System.err.print("Warming up");
	for(int i=0;i<10;i++) {
	System.err.print(".");
	collisions(lines);
	}
	System.err.println();

	System.err.println("Computing collisions...");
	long start=System.nanoTime();
	Map<Integer,Set> collisions=collisions(lines);
	long finish=System.nanoTime();
	long elapsed=finish-start;

	int maxhc=0, maxsize=0;
	for(Map.Entry<Integer,Set> e : collisions.entrySet()) {
	Integer hc=e.getKey();
	Set bucket=e.getValue();
	if(bucket.size() > maxsize) {
	maxhc = hc.intValue();
	maxsize = bucket.size();
	}
	}

	System.out.println("Elapsed time: "+elapsed+"ns");
	System.out.println("Total unique lines: "+lines.size());
	System.out.println("Time per hashcode: "+String.format("%.4f", 1.0*elapsed/lines.size())+"ns");
	System.out.println("Total unique hashcodes: "+collisions.size());
	System.out.println("Total collisions: "+(lines.size()-collisions.size()));
	System.out.println("Collision rate: "+String.format("%.8f", 1.0*(lines.size()-collisions.size())/lines.size()));
	if(maxsize != 0)
	System.out.println("Max collisions: "+maxsize+" "+collisions.get(maxhc));
	}
	}

	$ cat words.txt \| java HashTest
	Loading lines from stdin...
	Warming up..........
	Computing collisions...
	Elapsed time: 49117411ns
	Total unique lines: 466544
	Time per hashcode: 105.2793ns
	Total unique hashcodes: 466188
	Total collisions: 356
	Collision rate: 0.00076306
	Max collisions: 3 [Jr, KS, L4]

	$ cat shakespeare.txt \| java HashTest
	Loading lines from stdin...
	Warming up..........
	Computing collisions...
	Elapsed time: 24106163ns
	Total unique lines: 111385
	Time per hashcode: 216.4220ns
	Total unique hashcodes: 111384
	Total collisions: 1
	Collision rate: 0.00000897
	0.00076306
	Max collisions: 2 [ There's half a dozen sweets., PISANIO. He hath been search'd among the dead and living,]

创作场景

Java 中的 String.hashCode() 方法可能有问题？

一点理论

一点实践

一点解释

延伸阅读

重要链接：

评论

更多内容推荐

推荐阅读

电子书

大厂实战PPT下载