java - Why lambda IntStream.anyMatch() is 10 slower than naive implementation?

Question

Welcome To Ask or Share your Answers For Others

java - Why lambda IntStream.anyMatch() is 10 slower than naive implementation?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

java - Why lambda IntStream.anyMatch() is 10 slower than naive implementation?

I was recently profiling my code and found one interesting bottleneck in it. Here is benchmark :

@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class Contains {

    private int[] ar = new int[] {1,2,3,4,5,6,7};

    private int val = 5;

    @Benchmark
    public boolean naive() {
        return contains(ar, val);
    }

    @Benchmark
    public boolean lambdaArrayStreamContains() {
        return Arrays.stream(ar).anyMatch(i -> i == val);
    }

    @Benchmark
    public boolean lambdaIntStreamContains() {
        return IntStream.of(ar).anyMatch(i -> i == val);
    }

    private static boolean contains(int[] ar, int value) {
        for (int arVal : ar) {
            if (arVal == value) {
                return true;
            }
        }
        return false;
    }

}

Result :

Benchmark                            Mode  Cnt       Score      Error  Units
Contains.lambdaArrayStreamContains  thrpt   10   22867.962 ± 1049.649  ops/s
Contains.lambdaIntStreamContains    thrpt   10   22983.800 ±  593.580  ops/s
Contains.naive                      thrpt   10  228002.406 ± 8591.186  ops/s

If shows that Array contains operation via lambda is 10 times slower than naive implementation with simple loop. I knew that lambdas should be a bit slower. But 10 times? Am I doing wrong lambda or this is some issue with java?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

2.1m questions

2.1m answers

60 comments

56.8k users

Most popular tags

javascript python c# java How android c++ php ios html sql r c node.js .net iphone asp.net css reactjs jquery ruby What Android objective mysql linux Is git Python windows Why regex angular swift amazon excel algorithm macos Java visual how bash Can multithreading PHP Using scala angularjs typescript apache spring performance postgresql database flutter json rust arrays C# dart vba django wpf xml vue.js In go Get google jQuery xcode jsf http Google mongodb string shell oop powershell SQL C++ security assembly docker Javascript Android: Does haskell Convert azure debugging delphi vb.net Spring datetime pandas oracle math Django

联盟问答网站-Union QA website

Xstack问答社区

生活宝问答社区

OverStack问答社区

Ostack问答社区

在这了问答社区

在哪了问答社区

Xstack问答社区

无极谷问答社区

TouSu问答社区

SQlite问答社区

Qi-U问答社区

MLink问答社区

Jonic问答社区

Jike问答社区

16892问答社区

Vigges问答社区

55276问答社区

OGeek问答社区

深圳家问答社区

深圳家问答社区

深圳家问答社区

Vigges问答社区

Vigges问答社区

在这了问答社区

DevDocs API Documentations

Xstack问答社区

生活宝问答社区

OverStack问答社区

Ostack问答社区

在这了问答社区

在哪了问答社区

Xstack问答社区

无极谷问答社区

TouSu问答社区

SQlite问答社区

Qi-U问答社区

MLink问答社区

Jonic问答社区

Jike问答社区

16892问答社区

Vigges问答社区

55276问答社区

OGeek问答社区

深圳家问答社区

深圳家问答社区

深圳家问答社区

Vigges问答社区

Vigges问答社区

在这了问答社区

在这了问答社区

DevDocs API Documentations

Xstack问答社区

生活宝问答社区

OverStack问答社区

Ostack问答社区

在这了问答社区

在哪了问答社区

Xstack问答社区

无极谷问答社区

TouSu问答社区

SQlite问答社区

Qi-U问答社区

MLink问答社区

Jonic问答社区

Jike问答社区

16892问答社区

Vigges问答社区

55276问答社区

OGeek问答社区

深圳家问答社区

深圳家问答社区

深圳家问答社区

Vigges问答社区

Vigges问答社区

在这了问答社区

DevDocs API Documentations

广告位招租

深蓝 · Answer 1 · 2021-10-23T21:39:50+0000

Your benchmark does not actually measure anyMatch performance, but rather a stream overhead. This overhead can appear significant when comparing to a very simple operation like a five-element array lookup.

The slowdown will not look so terrible if we swtich from relative to absolute numbers. Let's measure latency instead of throughput for a clearer picture. I've omitted lambdaIntStream benchmark since it works exactly the same way as lambdaArrayStream.

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  53,242 ± 2,034  ns/op
Contains.naive              avgt    5   5,876 ± 0,404  ns/op

5.8 ns is roughly 14 cycles of a 2.4 GHz CPU. The workload is so small that any extra cycle will be noticeable. So what is the overhead of stream operations?

Object allocation

Now rerun the benchmark with -prof gc profiler. It will show the amount of heap allocations:

Benchmark                                       Mode  Cnt     Score     Error   Units
Contains.lambdaArrayStream:·gc.alloc.rate.norm  avgt    5   152,000 ±   0,001    B/op
Contains.naive:·gc.alloc.rate.norm              avgt    5    ≈ 10??              B/op

lambdaArrayStream allocates 152 bytes per iteration while naive benchmark allocates nothing. Of course, allocation is not free: there are at least 5 objects constructed to support anyMatch, and each takes several nanoseconds:

Lambda i -> i == val
IntPipeline.Head
Spliterators.IntArraySpliterator
MatchOps.MatchOp
MatchOps.MatchSink

Call stack

java.util.stream implementation is a bit complicated since it must support all combinations of stream sources, intermediate and terminal operations. If you look at the call stack of anyMatch in your benchmark, you'll see something like that:

    at bench.Contains.lambda$lambdaArrayStream$0(Contains.java:24)
    at java.util.stream.MatchOps$2MatchSink.accept(MatchOps.java:119)
    at java.util.Spliterators$IntArraySpliterator.tryAdvance(Spliterators.java:1041)
    at java.util.stream.IntPipeline.forEachWithCancel(IntPipeline.java:162)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.IntPipeline.anyMatch(IntPipeline.java:477)
    at bench.Contains.lambdaArrayStream(Contains.java:23)

Not all of these method calls can be inlined. Furthermore, JVM limits inlining to 9 levels, but here we see the deeper call stack. If we override the limit with -XX:MaxInlineLevel=20 the score will become a bit better:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,294 ± 0,367  ns/op  (was 53,242)
Contains.naive              avgt    5   5,822 ± 0,207  ns/op

Loop optimizations

for iteration over an array is a trivial counted loop. JVM can apply a wide range of loop optimizatons here: loop peeling, loop unrolling etc. This does not work for a while-kind loop in forEachWithCancel method, which is used to traverse an IntStream. The effect of loop optimizations can be measured with -XX:LoopUnrollLimit=0 -XX:-UseLoopPredicate:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,153 ± 0,559  ns/op
Contains.naive              avgt    5   9,853 ± 0,150  ns/op  (was 5,876)

Conclusions

There is some overhead to construct and to traverse a stream, but this is completely understood and cannot be considered a bug. I would not say the overhead is big (even 50 ns/op is not that much); however, in this particular example the overhead is dominant because of an extremely tiny workload.

Categories

java - Why lambda IntStream.anyMatch() is 10 slower than naive implementation?

java - Why lambda IntStream.anyMatch() is 10 slower than naive implementation?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Object allocation

Call stack

Loop optimizations

Conclusions

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags