Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
357 views
in Technique[技术] by (71.8m points)

caching - Is my understanding of AoS vs SoA advantages/disadvantages correct?

I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume greater understanding of processor functionality than I possess. That said, what I do understand about the former topic in particular leads to some questions that I think I should be able to understand the answers to.

Firstly, to make sure I am not basing my understanding off of a false premise, my understanding of the functionality and pros and cons of AoS vs SoA, as applied to a collection of 'Person' records with 'Name' and 'Age' fields associated with them:

Structure of Arrays

  • Stores data as a single structure consisting of multiple arrays, for example as a People object with fields Names as an array of strings and Ages as an array of integers.
  • Information for, say, the third person in the list would be given by something like People.Names[2] and People.Ages[2]
  • Pros:
    • When working with only some of the data from many 'Person' records, only that data needs to be loaded from memory.
    • Said data is stored in a homogeneous way, allowing the cache to be put to better use by SIMD instructions in most such situations.
  • Cons: - When multiple fields need to be accessed at once, the above advantages go away. - Accessing all the data for one or a few of the objects becomes less efficient. - Most programming languages require much more verbose and hard to read/write code, since there is no explicit 'Person' structure.

Array of Structures

  • Stores data as multiple structures, each of which has a full set of fields, themselves stored in an array of all such structures, for example a People array of Person objects, which have Name as a string field and Age as an integer field.
  • Information for the third person would be given by something like People[2].Name and People[2].Age
  • Pros:
    • Code is structured around a simpler mental model, with the indirection being abstracted away.
    • Single records are easy to access and work with.
    • The presence of a Person structure makes writing code in most programming languages much more straightforward.
  • Cons:
    • When working with just some of the data from a large number of records, the entire set of structures needs to be loaded into memory including the irrelevant data.
    • The array of structures is not homogeneous, which in such situations limits the advantage that can be provided by SIMD instructions.

The long and short of it seems to be that, assuming for the sake of argument that your bottleneck for performance is data access and ease of coding is irrelevant, if you almost exclusively need to access a single field at a time on a large amount of data SoA is likely to be more performant while if you often need to access multiple fields from the same object or deal with single objects rather than many at once, AoS will be more performant.

That said, some of what I've been reading seems to muddy the picture. Firstly, multiple sources have stated that SoA requires indexed addressing which is claimed to be inefficient. I cannot make sense of this, and have been unable to find any explanations. It seems to me that AoS and SoA require exactly the same operations to access any particular piece of data, though in differing orders, except that SoA requires an additional pointer (possibly more than one, depending on the kind of structure used). Oversimplifying a bit, to get the age of the fifth person in my above example under AoS, you would first get the pointer to the array, add 4 to it, get the structure pointer at that element of the array, add the size of a string pointer to it since the age is the second field, then access the integer at that pointer. Under SoA, you would get the pointer to the structure and add the size of a string array pointer to it to get to the list of ages, then get the pointer to the list of integers stored there and add 4 to it, then get the integer stored there.

Secondly, it's not clear to me the degree to which the benefits from SoA are reliant on particular CPU architectures. On the one hand, what I understand of the benefits as described above does not rely on any particular architecture except that SIMD instructions can provide additional benefits not available under AoS in some cases. On the other, I've seen claims that SoA's benefits can be limited depending on the number of lanes available in a particular SIMD architecture. Again, that would seem to affect only the additional benefit that the SIMD instructions can provide over the more general cache benefit.

Finally, I've seen the claim that SoA can require more cache ways when traversing data. I'm not completely sure what cache ways are or what, if anything, specifically is meant by 'traversing' data. My best guess is that 'cache ways' either refers to or correlates with the number of potential collisions in an associative cache, and that it relates to the second Con I mentioned above.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

"traversing" just means looping over the data.

And yes, you're right about cache ways and collisions. 64B (cache line size) blocks of memory that are offset from each other by a large power of 2 map to the same set, and thus compete with each other for ways in that set, instead of being cached in different sets. (e.g. Intel's L1 data caches are 32kiB, 8-way associative, with 64B lines. There are 32kiB / 64 B/line = 512 lines grouped into 512 lines / 8 ways/set = 64 sets.

Loading 9 items offset from each other by 4kiB (64B/line * 64 sets, not coincidentally the page size) will evict the first one.

L2 and L3 caches are more highly associative, like 16 or 24 way, but still susceptible to "aliasing" like this, just like a hash table, where there's lots of demand for some sets (buckets) and no demand for other sets (buckets). For CPU caches, the "hash function" is almost always to use some of the address bits as an index, and to ignore the other bits. (The high bits of an address are used as the tag, to determine if any way in the set is actually caching the requested block, and the low bits are used to select bytes within the cache line.)


I think the SoA benefit is mostly from SIMD (auto-vectorization or manual), but also if you tend to loop over your data looking at only one or two fields from most structs, and only accessing the rest in rare cases where you find an interesting one based on one member.

A hybrid approach with separate arrays for each thing (or group of things) you look at together could make sense, with the rest of the data for each object in an array of structs. I'm imagining a linear search loop where most objects are rejected based on looking at one int field, but for the few objects that pass that test, you look at all the fields.

Grouping together the fields that are mostly accessed together gives you the benefit of spatial locality for these accesses, while still letting search loops that check the key field loop over contiguous memory (rather than with a big stride).


I'm currently experimenting with a layout that interleaves in SIMD vector sized groups. Most of the code that traverses the data needs all fields from every object, and doing it this way means the loop only needs one pointer, and all the memory is allocated as a single block.

This is for collision-detection masks (in a 2D space game (Endless Sky) where it's all collision between a line segment and a ship outline (traced automatically from the sprite), not between two polygons). Here's the original which looped over a vector of double x,y pairs (and used some (non-inline!) functions to operate on them as a 16B SIMD vector, often with slow SSE3 horizontal-add instructions and stuff like that :( ).

SSE2/SSE3 on XY pairs is probably better than nothing if you can't change the data layout, but changing the layout removes all the shuffling for doing 4 cross products in parallel. See the slides from this SIMD (SSE) intro at Insomniac Games (GDC 2015). It starts off with very basic stuff for people who haven't done anything with SIMD before, and explains exactly how structs of arrays are helpful. By the end, it gets to intermediate/advanced SSE techniques, so it's worth flipping through even if you already know some SIMD stuff. See also the tag wiki for some other links.


Anyway, this is the interleave data structure I came up with:

class Mask {
...

struct xy_interleave {
    static constexpr unsigned vecSize = 4;
    static constexpr unsigned alignMask = vecSize-1;
    alignas(64) float x[vecSize];
    float y[vecSize];
    // TODO: reduce cache footprint by calculating this on the fly, maybe with an unaligned load?
    float dx[vecSize]; // next - current;   next.x = x+dx
    float dy[vecSize];
};
std::vector<xy_interleave> outline_simd;

}

Then I can loop over it with stuff like (real code here: this is my work-in-progress not-cleaned-up code that's not ready to be sent upstream)

__m128 minus_point_ps = _mm_cvtpd_ps(-point);    // + is commutative, which helps the compiler with AVX
const __m128 minus_px = _mm_set1_ps(minus_point_ps[0]);
const __m128 minus_py = _mm_set1_ps(minus_point_ps[1]);
const __m128 range2 = _mm_set1_ps(float(range*range));

for(const xy_interleave &curr : outline_simd)
{
    __m128 dx = _mm_load_ps(curr.x) + minus_px;
    __m128 dy = _mm_load_ps(curr.y) + minus_py;
    // this is using GNU Vector Extensions for + and *, instead of _mm_add_ps and _mm_mul_ps, since GNU C++ defines __m128 in terms of __v4sf
    __m128 cmp = _mm_cmplt_ps(dx*dx - range2, dy*dy);  // transform the inequality for more ILP
    // load the x and y fields from this group of 4 objects, all of which come from the same cache line.

    if(_mm_movemask_ps(cmp))
        return true;
}

This compiles to really nice-looking asm loops, with only one pointer looping over the std::vector, and vector loads from constant offsets relative to that loop pointer.

However, scalar fallback loops over the same data are less pretty. (And actually I use loops like this (with j+=4) in the manually-vectorized parts, too, so I can change the interleave without breaking the code. It compiles away entirely, or turns into an unroll).

// TODO: write an iterator or something to make this suck less
for(const xy_interleave &curr : outline_simd)
    for (unsigned j = 0; j < curr.vecSize; ++j)
    {
        float dx = curr.x[j] - px;
        float dy = curr.y[j] - py;
        if(dx*dx + dy*dy < range2)
            return true;
    }

Unfortunately I've had no luck getting gcc or clang to auto-vectorize this, even for easy cases with no conditionals (e.g. just finding the min range from a query x,y to any point in the collision-mask, instead of checking if a point is within range).


I might discard this idea and go with separate x and y arrays. (Maybe packed head to tail in the same std::vector<float> (with an aligned allocator) to keep it part of one allocation, but that would still mean that loops would need separate x and y pointers because the offset between the x and y for a given vertex would be a runtime variable, not a compile-time constant.)

Having all the xs contiguous would be a big help if I want to stop storing the x[i+1]-x[i], and compute it on the fly. With my layout, I would need to shuffle between vectors, instead of just doing an unaligned offset by 1 float.

It would hopefully also allow the compiler to auto-vectorize some of the functions (e.g. for ARM, or for AVX/AVX2 with wider vectors).

Of course, manual vectorization is going to win here, since I'm doing stuff like XORing floats together because I only care about their sign bit as a truth value, instead of doing a compare and then XORing the compare result. (My testing so far has shown that treating negative 0 as negative still gives correct results for Mask::Intersect, but any way to express it in C is going to follow the IEEE rules where x >= 0 is true for x=-0.).


if you almost exclusively need to access a single field at a time on a large amount of data AoS is likely to be more performant while if you often need to access multiple fields from the same object or deal with single objects rather than many at once, SoA will be more performant.

You have this exactly backwards. Was this a typo? Grouping all the foo[i].key fields into a foo.key[i] array means they're all packed together in the cache, so accessing just that one field in many objects means you're using all 64 bytes of every cache line that you touch.

You got it correct earlier when you wrote

When working with only some of the data from many 'Person' records, only that data needs to be loaded into memory.

(except I think you mean "from" memory (into cache), unless you're talking about a memory-mapped file and faulting pages from disk into memory.)


Indexed addressing modes:

In a situation where you're looking at two or three fields in each object, an SoA layout is going to tie up more registers holding separate base addresses for each separate array you're looping over.

With multiple pointers, you're going to want to either use addressing modes like [reg1 + 4*reg2] on x86, or you're going to need to separately increment a bunch of different pointers inside your loop. Indexed addressing modes are potentially slightly slower on Intel SnB-family, because they can't stay micro-fused with ALU uops in the out-of-order core (only in the decoders and uop cache). Skylake can keep them micro-fused, but further testing is needed to find out when Intel made this change. Perhaps with Broadwell when three-input instructions beyond FMA (like CMOV and ADC) decoded to a single uop, but that's a pure guess. Testing on Haswell and Broadwell is needed.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...