Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
153 views
in Technique[技术] by (71.8m points)

c++ - Why don't memory allocators actively return freed memory to the OS?

Yes, this might be the third time you see this code, because I asked two other questions about it (this and this).. The code is fairly simple:

#include <vector>
int main() {
    std::vector<int> v;
}

Then I build and run it with Valgrind on Linux:

g++ test.cc && valgrind ./a.out
==8511== Memcheck, a memory error detector
...
==8511== HEAP SUMMARY:
==8511==     in use at exit: 72,704 bytes in 1 blocks
==8511==   total heap usage: 1 allocs, 0 frees, 72,704 bytes allocated
==8511==
==8511== LEAK SUMMARY:
==8511==    definitely lost: 0 bytes in 0 blocks
==8511==    indirectly lost: 0 bytes in 0 blocks
==8511==      possibly lost: 0 bytes in 0 blocks
==8511==    still reachable: 72,704 bytes in 1 blocks
==8511==         suppressed: 0 bytes in 0 blocks
...
==8511== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Here, Valgrind reports there is no memory leak, even though there is 1 alloc and 0 free.

The answer here points out the allocator used by C++ standard library don't necessarily return the memory back to the OS - it might keep them in an internal cache.

Question is:

1) Why keep them in an internal cache? If it is for speed, how is it faster? Yes, the OS needs to maintain a data structure to keep track of memory allocation, but this the maintainer of this cache also needs to do so.

2) How is this implemented? Because my program a.out terminates already, there is no other process that is maintaining this memory cache - or, is there one?

Edit: for question (2) - Some answers I've seen suggest "C++ runtime", what does it mean? If "C++ runtime" is the C++ library, but the library is just a bunch of machine code sitting on the disk, it is not a running process - the machine code is either linked to my a.out (static library, .a) or is invoked during runtime (shared objects, .so) in the process of a.out.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Clarification

First, some clarification. You asked: ... my program a.out terminates already, there is no other process that is maintaining this memory cache - or, is there one?

Everything we are talking about is within the lifetime of a single process: the process always returns all allocated memory when it exits. There is no cache that outlives the process1. The memory is returned even without any help from the runtime allocator: the OS simply "takes it back" when the process is terminated. So there is no system-wide leak possible from terminated applications with normal allocations.

Now what Valgrind is reporting is memory that is in use at the moment the process terminated, but before the OS cleans everything up. It works at the runtime library level, and not at the OS level. So it's saying "Hey, when the program finished, there were 72,000 bytes that hadn't been returned to the runtime" but an unstated implication is that "these allocations will be cleaned up shortly by the OS".

The Underlying Questions

The code and Valgrind output shown doesn't really correlate well with the titular question, so let's break them apart. First we'll just try to answer the questions you asked about allocators: why they exist and why they don't generally don't immediately return freed memory to the OS, ignoring the example.

You asked:

1) Why keep them in an internal cache? If it is for speed, how is it faster? Yes, the OS needs to maintain a data structure to keep track of memory allocation, but this the maintainer of this cache also needs to do so.

This is sort of two questions in one: one is why bother having a userland runtime allocator at all, and then the other one is (perhaps?) why don't these allocators immediately return memory to the OS when it is freed. They are related, but let's tackle them one at a time.

Why Runtime Allocators Exist

Why not just rely on the OS memory allocation routines?

  • Many operating systems, including most Linux and other Unix-like operating systems, simply don't have an OS system call to allocate and free arbitrary blocks of memory. Unix-alikes offer brk which only grows or shrinks one contiguous block of memory - you have no way to "free" arbitrary earlier allocations. They also offer mmap which allows you to independently allocate and free chunks of memory, but these allocate on a PAGE_SIZE granularity, which on Linux is 4096 bytes. So if you want request 32 bytes, you'll have to waste 4096 - 32 == 4064 bytes if you don't have your own allocator. On these operating systems you practically need a separate memory allocation runtime which turns these coarse-grained tools into something capable of efficiently allocating small blocks.

    Windows is a bit different. It has the HeapAlloc call, which is part of the "OS" and does offer malloc-like capabilities of allocating and freeing arbitrarily sized chunks of memory. With some compilers then, malloc is just implemented as a thin wrapper around HeapAlloc (the performance of this call has improved greatly in recent Windows versions, making this feasible). Still, while HeapAlloc is part of the OS it isn't implemented in the kernel - it is also mostly implemented in a user-mode library, managing a list of free and used blocks, with occasional kernel calls to get chunks of memory from the kernel. So it is mostly malloc in another disguise and any memory it is holding on to is also not available to any other processes.

  • Performance! Even if there were appropriate kernel-level calls to allocate arbitrary blocks of memory, the simple overhead roundtrip to the kernel is usually hundreds of nanoseconds or more. A well-tuned malloc allocation or free, on other hand, is often only a dozen instructions and may complete in 10 ns or less. On top of that, system calls can't "trust their input" and so must carefully validate parameters passed from user-space. In the case of free this means that it much check that the user passed a pointer which is valid! Most runtime free implements simply crash or silently corrupt memory since there is no responsibility to protect a process from itself.
  • Closer link to the rest of the language runtime. The functions you use to allocate memory in C++, namely new, malloc and friends, are part of an defined by the language. It is then entirely natural to implement them as part of the runtime that implements the rest of the language, rather than the OS which is for the most part language-agnostic. For example, the language may have specific alignment requirements for various objects, which can best be handled by language aware allocators. Changes to the language or compiler might also imply necessary changes to the allocation routines, and it would be a tough call to hope for the kernel to be updated to accommodate your language features!

Why Not Return Memory to the OS

Your example doesn't show it, but you asked and if you wrote a different test you would probably find that after allocating and then freeing a bunch of memory, your processes resident set size and/or virtual size as reported by the OS might not decrease after the free. That is, it seems like the process holds on to the memory even though you have freed it. This is in fact true of many malloc implementations. First, note that this is not a leak per se - the unreturned memory is still available to the process that allocated it, even if not to other processes.

Why do they do that? Here are some reasons:

  1. The kernel API makes it hard. For the old-school brk and sbrk system calls, it simply isn't feasible to return freed memory unless it happens to be at the end of very last block allocated from brk or sbrk. That's because the abstraction offered by these calls is a single large contiguous region that you can only extend from one end. You can't hand back memory from the middle of it. Rather than trying to support the unusual case where all the freed memory happens to be at the end of brk region, most allocators don't even bother.

    The mmap call is more flexible (and this discussion generally applies also to Windows where VirtualAlloc is the mmap equivalent), allowing you to at least return memory at a page granularity - but even that is hard! You can't return a page until all allocations that are part of that page are freed. Depending on the size and allocation/free pattern of the application that may be common or uncommon. A case where it works well is for large allocations - greater than a page. Here you're guaranteed to be able to free most of the allocation if it was done via mmap and indeed some modern allocators satisfy large allocations directly from mmap and free them back to the OS with munmap. For glibc (and by extension the C++ allocation operators), you can even control this threshold:

    M_MMAP_THRESHOLD
      For allocations greater than or equal to the limit specified
      (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
      the free list, the memory-allocation functions employ mmap(2)
      instead of increasing the program break using sbrk(2).
    
      Allocating memory using mmap(2) has the significant advantage
      that the allocated memory blocks can always be independently
      released back to the system.  (By contrast, the heap can be
      trimmed only if memory is freed at the top end.)  On the other
      hand, there are some disadvantages to the use of mmap(2):
      deallocated space is not placed on the free list for reuse by
      later allocations; memory may be wasted because mmap(2)
      allocations must be page-aligned; and the kernel must perform
      the expensive task of zeroing out memory allocated via
      mmap(2).  Balancing these factors leads to a default setting
      of 128*1024 for the M_MMAP_THRESHOLD parameter.
    

    So by default allocations of 128K or more will be allocated by the runtime directly from the OS and freed back to the OS on free. So sometimes you will see the behavior you might have expected is always the case.

  2. Performance! Every kernel call is expensive, as described in the other list above. Memory that is freed by a process will be needed shortly later to satisfy another allocation. Rather than trying to return it to the OS, a relatively heavyweight operation, why not just keep it around on a free list to satisfy future allocations? As pointed out in the man page entry, this also avoids the overhead of zeroing out all the memory returned by the kernel. It also gives the best chance of good cache behavior since the process is continually re-using the same region of the address space. Finally, it avoids TLB flushes which would be imposed by munmap (and possibly by shrinking via brk).
  3. The "problem" of not returning memory is the worst for long-lived processes that allocate a bunch of memory at some point, free it and then never allocate that much again. I.e., processes whose allocation high-water mark is larger than their long term typical allocation amount. Most processes just don't follow that pattern, however. Processes often free a lot of memory, but allocate at a rate such that their overall memory use is constant or perhaps increasing. Applications that do have the "big then small" live size pattern could perhaps force the issue with malloc_trim.
  4. Virtual memory helps mitigate the issue. So far I've been throwing around terms like "allocated memory" without really defining what it means. If a program allocates and then frees 2 GB of memory and then sits around doing nothing, is it wasting 2 GB of actual DRAM plugged into your motherboard somewhere? Probably not. It is using 2 GB of virtual address space in your process, sure, but virtual address space is per-process, so that doesn't directly take anything away from other processes. If the process actually wrote to the memory at some point, it would be allocated physical memory (yes, DRAM) - after freeing it, you are - by definition - no longer using it. At this point the OS may reclaim those physical pages by use for someone else.

    Now this still requires you have swap to absorb the dirty not-used pages, but some allocators are smart: they can issue a madvise(..., MADV_DONTNEED) call which tells the OS "this range doesn't have anything useful, you don't have to preserve its contents in swap". It still leaves the virtual address space mapped in the process and usable later (zero filled) and so it's more efficient than munmap and a subsequent mmap, but it avoid pointlessly swapping freed memory regions to swap.2

The Demonstrated Code

As pointed out in this answer your test with vector<int&g


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...