Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.4k views
in Technique[技术] by (71.8m points)

c# - Code sample that shows casting to uint is more efficient than range check

So I am looking at this question and the general consensus is that uint cast version is more efficient than range check with 0. Since the code is also in MS's implementation of List I assume it is a real optimization. However I have failed to produce a code sample that results in better performance for the uint version. I have tried different tests and there is something missing or some other part of my code is dwarfing the time for the checks. My last attempt looks like this:

class TestType
{
    public TestType(int size)
    {
        MaxSize = size;
        Random rand = new Random(100);
        for (int i = 0; i < MaxIterations; i++)
        {
            indexes[i] = rand.Next(0, MaxSize);
        }
    }

    public const int MaxIterations = 10000000;
    private int MaxSize;
    private int[] indexes = new int[MaxIterations];

    public void Test()
    {
        var timer = new Stopwatch();

        int inRange = 0;
        int outOfRange = 0;

        timer.Start();
        for (int i = 0; i < MaxIterations; i++)
        {
            int x = indexes[i];
            if (x < 0 || x > MaxSize)
            {
                throw new Exception();

            }

            inRange += indexes[x];
        }
        timer.Stop();

        Console.WriteLine("Comparision 1: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");

        inRange = 0;
        outOfRange = 0;

        timer.Reset();
        timer.Start();

        for (int i = 0; i < MaxIterations; i++)
        {
            int x = indexes[i];
            if ((uint)x > (uint)MaxSize)
            {
                throw new Exception();
            }

            inRange += indexes[x];
        }

        timer.Stop();

        Console.WriteLine("Comparision 2: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");

    }
}

class Program
{
    static void Main()
    {
        TestType t = new TestType(TestType.MaxIterations);
        t.Test();
        TestType t2 = new TestType(TestType.MaxIterations);
        t2.Test();
        TestType t3 = new TestType(TestType.MaxIterations);
        t3.Test();
    }
}

The code is a bit of a mess because I tried many things to make uint check perform faster like moving the compared variable into a field of a class, generating random index access and so on but in every case the result seems to be the same for both versions. So is this change applicable on modern x86 processors and can someone demonstrate it somehow?

Note that I am not asking for someone to fix my sample or explain what is wrong with it. I just want to see the case where the optimization does work.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
       if (x < 0 || x > MaxSize)

The comparison is performed by the CMP processor instruction (Compare). You'll want to take a look at Agner Fog's instruction tables document (PDF), it list the cost of instructions. Find your processor back in the list, then locate the CMP instruction.

For mine, Haswell, CMP takes 1 cycle of latency and 0.25 cycles of throughput.

A fractional cost like that could use an explanation, Haswell has 4 integer execution units that can execute instructions at the same time. When a program contains enough integer operations, like CMP, without an interdependency then they can all execute at the same time. In effect making the program 4 times faster. You don't always manage to keep all 4 of them busy at the same time with your code, it is actually pretty rare. But you do keep 2 of them busy in this case. Or in other words, two comparisons take just as long as single one, 1 cycle.

There are other factors at play that make the execution time identical. One thing helps is that the processor can predict the branch very well, it can speculatively execute x > MaxSize in spite of the short-circuit evaluation. And it will in fact end up using the result since the branch is never taken.

And the true bottleneck in this code is the array indexing, accessing memory is one of the slowest thing the processor can do. So the "fast" version of the code isn't faster even though it provides more opportunity to allow the processor to concurrently execute instructions. It isn't much of an opportunity today anyway, a processor has too many execution units to keep busy. Otherwise the feature that makes HyperThreading work. In both cases the processor bogs down at the same rate.

On my machine, I have to write code that occupies more than 4 engines to make it slower. Silly code like this:

     if (x < 0 || x > MaxSize || x > 10000000 || x > 20000000 || x > 3000000) {
         outOfRange++; 
     }
     else {
         inRange++;
     }

Using 5 compares, now I can a difference, 61 vs 47 msec. Or in other words, this is a way to count the number of integer engines in the processor. Hehe :)

So this is a micro-optimization that probably used to pay off a decade ago. It doesn't anymore. Scratch it off your list of things to worry about :)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...