language lawyer - Is pointer tagging in C undefined according to the standard?

Question

Welcome To Ask or Share your Answers For Others

language lawyer - Is pointer tagging in C undefined according to the standard?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

language lawyer - Is pointer tagging in C undefined according to the standard?

Some dynamically-typed languages use pointer tagging as a quick way to identify or narrow down the runtime type of the value being represented. A classic way to do this is to convert pointers to a suitably sized integer, and add a tag value over the least significant bits which are assumed to be zero for aligned objects. When the object needs to be accessed, the tag bits are masked away, the integer is converted to a pointer, and the pointer is dereferenced as normal.

This by itself is all in order, except it all hinges on one colossal assumption: that the aligned pointer will convert to an integer guaranteed to have zero bits in the right places.

Is it possible to guarantee this according to the letter of the standard?

Although standard section 6.3.2.3 (references are to the C11 draft) says that the result of a conversion from pointer to integer is implementation-defined, what I'm wondering is whether the pointer arithmetic rules in 6.5.2.1 and 6.5.6 effectively constrain the result of pointer->integer conversion to follow the same predictable arithmetic rules that many programs already assume. (6.3.2.3 note 67 seemingly suggests that this is the intended spirit of the standard anyway, not that that means much.)

I'm specifically thinking of the case where one might allocate a large array to act as a heap for the dynamic language, and therefore the pointers we're talking about are to elements of this array. I'm assuming that the start of the C-allocated array itself can be placed at an aligned position by some secondary means (by all means discuss this too though). Say we have an array of eight-byte "cons cells"; can we guarantee that the pointer to any given cell will convert to an integer with the lowest three bits free for a tag?

For instance:

typedef Cell ...; // such that sizeof(Cell) == 8
Cell heap[1024];  // such that ((uintptr_t)&heap[0]) & 7 == 0

((char *)&heap[11]) - ((char *)&heap[10]); // == 8
(Cell *)(((char *)&heap[10]) + 8);         // == &heap[11]
&(&heap[10])[0];                           // == &heap[10]
0[heap];                                   // == heap[0]

// So...
&((char *)0)[(uintptr_t)&heap[10]];        // == &heap[10] ?
&((char *)0)[(uintptr_t)&heap[10] + 8];    // == &heap[11] ?

// ...implies?
(Cell *)((uintptr_t)&heap[10] + 8);        // == &heap[11] ?

(If I understand correctly, if an implementation provides uintptr_t then the undefined behaviour hinted at in 6.3.2.3 paragraph 6 is irrelevant, right?)

If all of these hold, then I would assume that it means that you can in fact rely on the low bits of any converted pointer to an element of an aligned Cell array to be free for tagging. Do they && does it?

(As far as I'm aware this question is hypothetical since the normal assumption holds for common platforms anyway, and if you found one where it didn't, you probably wouldn't want to look to the C standard for guidance rather than the platform docs; but that's beside the point.)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:32:18+0000

This by itself is all in order, except it all hinges on one colossal assumption: that the aligned pointer will convert to an integer guaranteed to have zero bits in the right places.

Is it possible to guarantee this according to the letter of the standard?

It's possible for an implementation to guarantee this. The result of converting a pointer to an integer is implementation-defined, and an implementation can define it any way it likes, as long as it meets the standard's requirements.

The standard absolutely does not guarantee this in general.

A concrete example: I've worked on a Cray T90 system, which had a C compiler running under a UNIX-like operating system. In the hardware, an address is a 64-bit word containing the address of a 64-bit word; there were no hardware byte addresses. Byte pointers (void*, char*) were implemented in software by storing a 3-bit offset in the otherwise unused high-order 3 bits of a 64-bit word pointer.

All pointer-to-pointer, pointer-to-integer, and integer-to-pointer conversions simply copied the representation.

Which means that a pointer to an 8-byte aligned object, when converted to an integer, could have any bit pattern in its low-order 3 bits.

Nothing in the standard forbids this.

The bottom line: A scheme like the one you describe, that plays games with pointer representations, can work if you make certain assumptions about how the current system represents pointers -- as long as those assumptions happen to be valid for the current system.

But no such assumptions can be 100% reliable, because the standard says nothing about how pointers are represented (other than that they're of a fixed size for each pointer type, and that the representation can be viewed as an array of unsigned char).

(The standard doesn't even guarantee that all pointers are the same size.)

Categories

language lawyer - Is pointer tagging in C undefined according to the standard?

language lawyer - Is pointer tagging in C undefined according to the standard?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags