Before anyone questions the fact of using string.intern()
at all, let me say that I need it in my particular application for memory and performance reasons. [1]
So, until now I used String.intern()
and assumed it was the most efficient way to do it. However, I noticed since ages it is a bottleneck in the software. [2]
Then, just recently, I tried to replace the String.intern()
by a huge map where I put/get the strings in order to obtain each time a unique instance. I expected this would be slower... but it was exactly the opposite! It was tremendously faster! Replacing the intern()
by pushing/polling a map (which achieves exactly the same) resulted in more than one order of magnitude faster.
The question is: why is intern()
so slow?!? Why isn't it then simply backed up by a map (or actually, just a customized set) and would be tremendously faster? I'm puzzled.
[1]: For the unconvinced ones: It is in natural language processing and has to process gigabytes of text, therefore needs to avoid many instances of a same string to avoid blowing up the memory and referential string comparison to be fast enough.
[2]: without it (normal strings) it is impossible, with it, this particular step remains the most computation intensive one
EDIT:
Due to the surprising interest in this post, here is some code to test it out:
http://pastebin.com/4CD8ac69
And the results of interning a bit more than 1 million strings:
HashMap
: 4 seconds
String.intern()
: 54 seconds
Due to avoid some warm-up / OS IO caching and stuff like this, the experiment was repeated by inverting the order of both benchmarks:
String.intern()
: 69 seconds
HashMap
: 3 seconds
As you see, the difference is very noticeable, more than tenfolds. (Using OpenJDK 1.6.0_22 64bits ...but using the sun one resulted in similar results I think)
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…