A function is allowed to assume that its return-value object (pointed to by a hidden pointer) is not the same object as anything else, i.e. that its output pointer (passed as a hidden first arg) doesn't alias anything.
You could think of this as the hidden first-arg output pointer having an implicit restrict on it. (Because in the C abstract machine, the return value is a separate object, and the x86-64 System V ABI specifies that the caller provides space. x86-64 SysV doesn't give the caller license to introduce aliasing.)
Using an otherwise-private local as the destination (instead of separate dedicated space and then copying to a real local) is fine, but pointers that may point to something reachable another way must not be used. This requires escape analysis to make sure that a pointer to such a local hasn't been passed outside of the function.
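A minimal sketch of the case that escape analysis has to rule out. The names escaped, rotate_via_escaped and caller_with_escaped_local are hypothetical, made up for this illustration. Once the callee can reach v some other way, the caller can't simply hand over &v as the hidden output pointer: the result still has to be as if the return value were a separate object.

```c
struct Vec3 { double x, y, z; };

/* Hypothetical global through which a local's address escapes. */
struct Vec3 *escaped;

/* Reads the escaped object while building its return value. */
struct Vec3 rotate_via_escaped(void)
{
    return (struct Vec3){ escaped->y, escaped->z, escaped->x };
}

struct Vec3 caller_with_escaped_local(void)
{
    struct Vec3 v = { 1.0, 2.0, 3.0 };
    escaped = &v;               /* v is now reachable from the callee */
    v = rotate_via_escaped();   /* must act as if the retval were a separate object */
    escaped = 0;                /* don't leave a dangling pointer behind */
    return v;                   /* {2.0, 3.0, 1.0} in the C abstract machine */
}
```

If v's address had never been stored anywhere the callee could see, using v itself as the destination would be unobservable, which is why that case is fine.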
I think the x86-64 SysV calling convention models the C abstract machine here by having the caller provide a real return-value object, rather than forcing the callee to invent a temporary whenever that would be needed to make sure all the writes to the retval happen after any other writes. Putting that burden on the callee is not what "the caller provides space for the return value" means, IMO.
That's definitely how GCC and other compilers interpret it in practice, which is a big part of what matters in a calling convention that's been around this long (since a year or two before the first AMD64 silicon, so very early 2000s).
Here's a case where your optimization would break if it were done:
struct Vec3{
    double x, y, z;
};
struct Vec3 glob3;
__attribute__((noinline))
struct Vec3 do_something(void) {  // copy glob3 to retval in some order
    return (struct Vec3){glob3.y, glob3.z, glob3.x};
}
__attribute__((noinline))
void use(struct Vec3 * out){  // copy do_something() result to *out
    *out = do_something();
}
void caller(void) {
    use(&glob3);
}
With the optimization you're suggesting, do_something's output object would be glob3. But it also reads glob3.
A valid implementation for do_something would be to copy elements from glob3 to (%rdi) in source order, which would do glob3.x = glob3.y before reading glob3.x as the 3rd element of the return value.
That is in fact exactly what gcc -O1 does (Godbolt compiler explorer):
do_something:
movq %rdi, %rax # tmp90, .result_ptr
movsd glob3+8(%rip), %xmm0 # glob3.y, glob3.y
movsd %xmm0, (%rdi) # glob3.y, <retval>.x
movsd glob3+16(%rip), %xmm0 # glob3.z, _2
movsd %xmm0, 8(%rdi) # _2, <retval>.y
movsd glob3(%rip), %xmm0 # glob3.x, _3
movsd %xmm0, 16(%rdi) # _3, <retval>.z
ret
Notice the glob3.y, <retval>.x store before the load of glob3.x.
So without restrict anywhere in the source, GCC already emits asm for do_something that assumes no aliasing between the retval and glob3.
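To make the hazard concrete, here's a rough sketch in portable C that turns the hidden return-value pointer into an explicit parameter. do_something_explicit and rotated are hypothetical names, not anything GCC generates; the first one just mirrors the load/store order of the asm above.

```c
struct Vec3 { double x, y, z; };

/* Hand-written stand-in for the asm: each store follows its load,
   in source order, through an explicit output pointer. */
void do_something_explicit(struct Vec3 *out, const struct Vec3 *src)
{
    out->x = src->y;
    out->y = src->z;
    out->z = src->x;   /* if out == src, this reads the already-overwritten x */
}

/* Reference semantics: the return value is a separate object. */
struct Vec3 rotated(struct Vec3 v)
{
    return (struct Vec3){ v.y, v.z, v.x };
}
```

Starting from {1, 2, 3}, rotated() gives {2, 3, 1}, but do_something_explicit(&a, &a) leaves {2, 3, 2} in a. That second result is what use(&glob3) would produce if the caller passed its out pointer straight through as the hidden arg.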
I don't think using struct Vec3 *restrict out would help at all: that only tells the compiler that inside use() you won't access the *out object through any other name. Since use() doesn't reference glob3, it's not UB to pass &glob3 as an arg to a restrict version of use.
I may be wrong here; @M.M argues in comments that *restrict out might make this optimization safe because the execution of do_something() happens during use(). (Compilers still don't actually do it, but maybe they would be allowed to for restrict pointers.)
Update: Richard Biener said in the GCC missed-optimization bug-report that M.M is correct, and if the compiler can prove that the function returns normally (not exception or longjmp), the optimization is legal in theory (but still not something GCC is likely to look for):
If so, restrict would make this optimization safe if we can prove that
do_something is "noexcept" and doesn't longjmp.
Yes.
There's a noexcept declaration, but there isn't (AFAIK) a nolongjmp declaration you can put on a prototype.
So that means it's only possible (even in theory) as an inter-procedural optimization when we can see the other function's body. Unless noexcept also means no longjmp.
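My understanding of why normal return matters, as a hedged sketch: if the callee were handed the real destination as its output pointer and then left via longjmp after a partial write, the destination would be visibly modified even though the assignment in the caller never happened. The names bail_out, partial_then_longjmp and partial_write_visible are hypothetical, and the explicit out parameter stands in for the hidden pointer.

```c
#include <setjmp.h>

struct Vec3 { double x, y, z; };

struct Vec3 glob3 = { 1.0, 2.0, 3.0 };
jmp_buf bail_out;   /* hypothetical jump target for this sketch */

/* Models a callee that was given &glob3 as its output pointer:
   it stores one member, then leaves via longjmp instead of returning. */
void partial_then_longjmp(struct Vec3 *out)
{
    out->x = 99.0;
    longjmp(bail_out, 1);
}

/* Returns 1 if glob3 was modified even though the callee never returned. */
int partial_write_visible(void)
{
    if (setjmp(bail_out) == 0)
        partial_then_longjmp(&glob3);
    return glob3.x == 99.0;
}
```

In the abstract machine, *out = do_something() with a do_something that longjmps away would leave *out untouched, so a compiler would have to rule this out before writing the return value in place.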