The OS can set whatever "copy on write" policy it wishes, but generally, they all do the same thing (i.e. what makes the most sense).
Loosely, for a POSIX-like system (Linux, BSD, macOS), there are four areas (what you were calling segments) of interest: data (where int x = 1; goes), bss (where an uninitialized int y; goes), sbrk (this is heap/malloc), and stack.
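To make the four areas concrete, here is a minimal sketch that prints one address from each. The function name show_areas is my own invention, and the actual addresses (and their relative order) depend on the OS, the compiler, and address space randomization.

```c
#include <stdio.h>
#include <stdlib.h>

int x = 1;   /* initialized global: lives in data */
int y;       /* uninitialized global: lives in bss, reads as 0 */

/* Prints one address from each area.
 * Returns 0 on success, -1 if malloc fails. */
int show_areas(void)
{
    int local = 0;                      /* automatic variable: stack */
    int *heap = malloc(sizeof *heap);   /* dynamic allocation: heap */

    if (heap == NULL)
        return -1;

    printf("data  %p\n", (void *)&x);
    printf("bss   %p\n", (void *)&y);
    printf("heap  %p\n", (void *)heap);
    printf("stack %p\n", (void *)&local);

    free(heap);
    return 0;
}
```

Calling show_areas() from main a few times across runs typically shows data and bss sitting close together, with the heap and stack far away from both.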
When a fork is done, the OS sets up a new page map for the child that shares all the pages of the parent. Then, in the page maps of the parent and the child, all the pages are marked read-only.
Each page also has a reference count that indicates how many processes are sharing it. Before the fork, the refcount will be 1 and, after, it will be 2.
Now, when either process tries to write to a R/O page, it will get a page fault. The OS will see that this is for "copy on write", will create a private page for the process, copy in the data from the shared page, mark the new page as writable for that process, and resume it.
It will also bump down the refcount. If the refcount is now [again] 1, the OS will mark the page in the other process as writable and non-shared [this eliminates a second page fault in the other process--a speedup only because at this point the OS knows that the other process should be free to write unmolested again]. This speedup could be OS dependent.
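The page faults themselves are invisible to a program, but their effect is easy to observe. Here is a small sketch (the names fork_write_demo and value are mine) in which the child writes to a variable after the fork and the parent checks that its own copy did not change.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;   /* on a page shared by parent and child after fork */

/* Returns the parent's view of `value` after the child has written to it,
 * or -1 on error. With copy on write this stays 1. */
int fork_write_demo(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        value = 42;                      /* write fault: child gets a private page */
        _exit(value == 42 ? 0 : 1);      /* report that the child saw its own write */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;                        /* parent's page was never modified */
}
```

This only shows the semantics (each process ends up with its own copy). Whether the kernel copies lazily, and when, is the policy part that a program cannot see directly.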
Actually, the bss section gets even more special treatment. In the initial page mapping for it, all pages are mapped to a single page that contains all zeroes (aka the "zero page"). The mapping is marked R/O. So, the bss area could be gigabytes in size and it will only occupy a single physical page. This single, special, zero page is shared amongst all bss sections of all processes, regardless of whether they have any relationship to one another at all.
Thus, a process can read from any page in the area and gets what it expects: zero. Only when the process tries to write to such a page does the same copy on write mechanism kick in: the process gets a private page, the mapping is adjusted, and the process is resumed. It is now free to write to the page as it sees fit.
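As a rough illustration, the sketch below declares a 64 MiB array in bss, reads one byte from every page, and then writes to a single page. The names big, count_nonzero_pages, and touch_one_page are hypothetical, and the 4096 fallback page size is an assumption. The executable on disk does not grow by 64 MiB, and on a kernel that uses a zero page the read-only scan needs very little physical memory.

```c
#include <stddef.h>
#include <unistd.h>

#define BIG (64UL * 1024 * 1024)

static unsigned char big[BIG];   /* uninitialized: lives in bss */

/* Reads one byte per page and counts the pages whose first byte is non-zero. */
size_t count_nonzero_pages(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t off;
    size_t n = 0;

    if (page <= 0)
        page = 4096;             /* assumed fallback if sysconf fails */

    for (off = 0; off < BIG; off += (size_t)page)
        if (big[off] != 0)
            n++;
    return n;
}

/* Writes to the first page only; this is the write that needs a private page. */
void touch_one_page(void)
{
    big[0] = 1;
}
```

Before touch_one_page() the count is zero; after it, exactly one page differs. Watching the process's resident size around the two calls (e.g. with ps or top) is one way to see the difference between reading and writing.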
Once again, an OS can choose its policy. For example, after the fork, it might be more efficient to share most of the stack pages, but start off with private copies of the "current" page, as determined by the value of the stack pointer register.
When an exec syscall is done [on the child], the kernel has to undo much of the mapping done during the fork: bumping down refcounts, releasing the child's mappings, and restoring the parent's original page protections (i.e. the parent will no longer be sharing its pages unless it does another fork).
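For reference, the usual fork-then-exec pattern looks something like the sketch below. The helper name run_program is mine, and the exit code 127 for a failed exec is just the shell's convention, not something the kernel requires.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, replaces the child's image with `prog` (looked up via PATH),
 * and returns the child's exit status, or -1 on error. */
int run_program(const char *prog)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* On success exec never returns: the child's shared mappings
         * are torn down and replaced by the new program's mappings. */
        execlp(prog, prog, (char *)NULL);
        _exit(127);              /* only reached if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child usually calls exec almost immediately, very few pages ever get copied, which is a large part of why copy on write pays off for this pattern.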
Although not part of your original question, there are related activities that may be of interest, such as on demand loading [of pages] and on demand linking [of symbols] after an exec syscall.
When a process does an exec, the kernel does the cleanup above, and reads a small portion of the executable file to determine its object format. The dominant format on Linux and the BSDs is ELF, but any format that a kernel understands can be used (e.g. macOS uses Mach-O rather than ELF). For ELF, the executable has a special section (.interp, which the kernel finds through the PT_INTERP program header) that gives a full FS path to what's known as the "ELF interpreter", which is itself a shared object (the dynamic linker), and is usually /lib64/ld-linux-x86-64.so.2 on x86-64 Linux.
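If you want to see the interpreter path for yourself, here is a sketch that walks the program headers of a 64-bit ELF file and copies out the PT_INTERP string. It assumes Linux's <elf.h> and a file with the same byte order as the host; the function name elf_interp is hypothetical. Statically linked executables have no PT_INTERP, so the function reports failure for them.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Copies the interpreter path of the 64-bit ELF file `path` into buf.
 * Returns 0 on success, -1 if the file is unreadable, not 64-bit ELF,
 * or has no PT_INTERP entry. */
int elf_interp(const char *path, char *buf, size_t buflen)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    unsigned i;
    int rc = -1;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;

    if (fread(&eh, sizeof eh, 1, f) != 1)
        goto out;
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        goto out;

    for (i = 0; i < eh.e_phnum; i++) {
        long off = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);

        if (fseek(f, off, SEEK_SET) != 0 || fread(&ph, sizeof ph, 1, f) != 1)
            goto out;
        if (ph.p_type != PT_INTERP)
            continue;

        /* p_filesz includes the terminating NUL of the path string */
        if (ph.p_filesz == 0 || ph.p_filesz > buflen)
            goto out;
        if (fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
            fread(buf, 1, ph.p_filesz, f) != ph.p_filesz)
            goto out;
        buf[ph.p_filesz - 1] = '\0';
        rc = 0;
        break;
    }
out:
    fclose(f);
    return rc;
}
```

On Linux, passing "/proc/self/exe" as the path inspects the running program itself. The same information is available from readelf -l, which prints the requested program interpreter.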
The kernel, using an internal form of mmap, will map this into the application space, and set up a mapping for the executable file itself. Most things are marked as R/O pages and "not present".
Before we go further, we need to talk about the "backing store" for a page. That is, if a page fault occurs and we need to load the page from disk, where does it come from? For heap/malloc, this is generally the swap disk [aka paging disk].
Under Linux, it's generally the partition of type "Linux swap" that was added when the system was installed. When a page that has been written to has to be flushed to disk to free up some physical memory, it gets written there. Note that the page sharing algorithm in the first section still applies.
Anyway, when an executable is first mapped into memory, its backing store is the executable file in the filesystem.
So, the kernel sets the app's program counter to point to the starting location of the ELF interpreter, and transfers control to it.
The ELF interpreter goes about its business. Every time it tries to execute a portion of itself [a "code" page] that is mapped but not loaded, a page fault occurs and the kernel loads that page from the backing store (e.g. the ELF interpreter's file) and changes the mapping to R/O but present.
This occurs for the ELF interpreter, shared libraries, and the executable itself.
The ELF interpreter will now use mmap to map libc into the app space [again, subject to the demand loading]. If the ELF interpreter has to modify a code page to relocate a symbol [or tries to write to any page that has the file as the backing store, like a data page], a protection fault occurs, the kernel changes the backing store for the page from the on-disk file to a page on the swap disk, adjusts the protections, and resumes the app.
The kernel must also handle the case where the ELF interpreter (e.g.) is trying to write to [say] a data page that has never been loaded (i.e. it has to load it first and then change the backing store to the swap disk).
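User code can get the same behavior with a private file mapping. The sketch below (the name private_write_demo and the scratch file name are made up for illustration) maps a small temporary file with MAP_PRIVATE, writes through the mapping, and then rereads the file to confirm that the write never reached it.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through a MAP_PRIVATE mapping left the file
 * untouched, 0 if the file changed, -1 on error. */
int private_write_demo(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical scratch file */
    char on_disk = 0;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                /* the file goes away once fd is closed */

    if (write(fd, "hello", 5) == 5) {
        char *map = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

        if (map != MAP_FAILED) {
            map[0] = 'J';        /* fault: the page is loaded, then made private */
            if (pread(fd, &on_disk, 1, 0) == 1 && map[0] == 'J')
                result = (on_disk == 'h');
            munmap(map, 5);
        }
    }
    close(fd);
    return result;
}
```

After the write, the modified page no longer has the file as its backing store, which is why the file still starts with 'h' while the mapping reads back 'J'.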
The ELF interpreter then uses portions of libc to help it complete initial linking activities. It relocates the minimum necessary to allow it to do its job.
However, the ELF interpreter does not relocate anywhere near all the symbols for most other shared libraries. It will look through the executable and, again using mmap, create a mapping for the shared libraries the executable needs (i.e. what you see when you do ldd executable).
These mappings of shared libraries and executables can be thought of as "segments".
There is a symbol jump table that points back to the interpreter in each shared library. But, the ELF interpreter makes minimal changes.
[Note: this is a loose explanation] Only when the application tries to call a given function's jump entry [this is that GOT et al. stuff you may have seen] does a relocation occur. The jump entry transfers control to the interpreter, which locates the real address of the symbol, adjusts the GOT so that it now points directly to the final address for the symbol, and redoes the call, which will now call the real function. On a subsequent call to the same function, it now goes direct.
This is called "on demand linking".
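The real mechanism lives in the dynamic linker and in linker-generated stubs, but the idea can be mimicked in plain C with a function pointer that starts out aimed at a resolver. Everything in this sketch (square_slot, lazy_stub, real_square, and the counters) is an analogy I made up, not the actual GOT layout.

```c
typedef int (*fn_t)(int);

static int real_square(int v) { return v * v; }   /* the "real" function */

static int resolver_calls;       /* how many times the slow path ran */
static int lazy_stub(int v);     /* forward declaration for the slot below */

/* Plays the role of a GOT slot: initially points at the resolver stub. */
static fn_t square_slot = lazy_stub;

/* Plays the role of the interpreter's resolver: find the real address,
 * patch the slot, then redo the call. */
static int lazy_stub(int v)
{
    resolver_calls++;
    square_slot = real_square;   /* later calls bypass the stub */
    return square_slot(v);
}

/* What the application calls: always goes through the slot. */
int call_square(int v)
{
    return square_slot(v);
}

int resolver_count(void)
{
    return resolver_calls;
}
```

However many times call_square is invoked, resolver_count() stays at 1 after the first call, which is the whole point: the lookup cost is paid once per symbol, and only for symbols that actually get called.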
A by-product of all this mmap activity is that the classical sbrk syscall is of limited use: the break is a single contiguous region, and it can only grow until it collides with one of the other memory mappings.
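So, modern libc implementations lean on mmap. When malloc needs more memory from the OS, it can request it from an anonymous mmap and keep track of which allocations belong to which mmap mapping (i.e. if enough memory got freed to comprise an entire mapping, free could do an munmap). How much is still taken from the break varies by allocator: glibc's malloc, for example, still uses brk for small requests in its main arena and switches to mmap for large ones.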
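The anonymous mapping an allocator asks for looks roughly like this. The wrapper names chunk_alloc and chunk_free are hypothetical, and a real malloc layers a great deal of bookkeeping on top.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Asks the kernel for `len` bytes of zero-filled memory that is not
 * backed by any file. Returns NULL on failure. */
void *chunk_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Hands the whole mapping back to the kernel. Returns 0 on success. */
int chunk_free(void *p, size_t len)
{
    return munmap(p, len);
}
```

Note that the fresh mapping reads as zeroes and costs almost nothing until it is written to, so the same zero page and copy on write machinery described earlier applies here too.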
So, to sum up, we have "copy on write", "on demand loading", and "on demand linking" all going on at the same time. It seems complex, but it makes fork and exec go quickly and smoothly, because the extra overhead is incurred only when needed ("on demand").
Thus, instead of a large lurch/delay at the beginning launch of a program, the overhead activity gets spread out over the lifetime of the program, as needed.