I tried to achieve the same thing (in fact, it's slightly simpler, as I only need to take snapshots of a live region and do not need to take copies of the copies). I did not find a good solution for this.
Direct kernel support (or the lack thereof): By modifying or adding a kernel module it should be possible to achieve this. However, there is no simple way to set up a new COW region from an existing one. The code used by fork (copy_page_range) copies a vm_area_struct from one process/virtual address space to another (new) one, but assumes that the address of the new mapping is the same as the address of the old one. If one wants to implement a "remap" feature, the function must be modified or duplicated in order to copy a vm_area_struct with address translation.
BTRFS: I thought of using COW on btrfs for this. I wrote a simple program that maps two reflink-ed files and compares the mappings. However, looking at the page information in /proc/self/pagemap shows that the two instances of the file do not share the same page cache pages (at least unless my test is wrong). So you will not gain much by doing this: the physical pages holding the same data are not shared among the different instances.
#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the whole file read-only as a shared mapping. */
void* map_file(const char* file) {
    struct stat file_stat;
    int fd = open(file, O_RDWR);
    assert(fd>=0);
    int temp = fstat(fd, &file_stat);
    assert(temp==0);
    void* res = mmap(NULL, file_stat.st_size, PROT_READ, MAP_SHARED, fd, 0);
    assert(res!=MAP_FAILED);
    close(fd);
    return res;
}

static int pagemap_fd = -1;

/* Return the raw /proc/self/pagemap entry for the page containing p. */
uint64_t pagemap_info(void* p) {
    if(pagemap_fd<0) {
        pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
        if(pagemap_fd<0) {
            perror("open pagemap");
            exit(1);
        }
    }
    size_t page = ((uintptr_t) p) / getpagesize();
    off_t pos = lseek(pagemap_fd, page*sizeof(uint64_t), SEEK_SET);
    if(pos==(off_t) -1) {
        perror("lseek");
        exit(1);
    }
    uint64_t value;
    ssize_t count = read(pagemap_fd, &value, sizeof(uint64_t));
    if(count<0) {
        perror("read");
        exit(1);
    }
    if((size_t) count!=sizeof(uint64_t)) {
        fprintf(stderr, "short read on pagemap\n");
        exit(1);
    }
    return value;
}

int main(int argc, char** argv) {
    if(argc!=3) {
        fprintf(stderr, "usage: %s FILE1 FILE2\n", argv[0]);
        return 1;
    }
    char* a = (char*) map_file(argv[1]);
    char* b = (char*) map_file(argv[2]);
    /* Touch each page so that it is faulted in before querying pagemap. */
    volatile char x = a[0];
    uint64_t info1 = pagemap_info(a);
    volatile char y = b[0];
    uint64_t info2 = pagemap_info(b);
    (void) x;
    (void) y;
    fprintf(stderr, "%" PRIx64 " %" PRIx64 "\n", info1, info2);
    /* Fails when the two mappings are backed by different physical pages. */
    assert(info1==info2);
    return 0;
}
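The test above compares whole pagemap entries, flag bits included. As a rough sketch of how an entry could be decoded instead, assuming the layout described in the kernel's pagemap documentation (PFN in bits 0-54 when the page is present, bit 61 for file-mapped or shared-anonymous pages, bit 62 for swapped, bit 63 for present): the names pagemap_entry, pagemap_offset, decode_pagemap and same_physical_page are made up for illustration. Note also that, as far as I know, since Linux 4.2 the PFN field reads as zero for processes without CAP_SYS_ADMIN, so this kind of test probably has to run as root to be meaningful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the fields of one /proc/<pid>/pagemap entry. */
struct pagemap_entry {
    uint64_t pfn;         /* bits 0-54, only meaningful when present */
    bool file_or_shared;  /* bit 61 */
    bool swapped;         /* bit 62 */
    bool present;         /* bit 63 */
};

/* Byte offset in the pagemap file of the entry describing address addr. */
uint64_t pagemap_offset(uintptr_t addr, uint64_t page_size) {
    assert(page_size > 0);
    return (addr / page_size) * sizeof(uint64_t);
}

/* Split a raw 64-bit entry into its fields. */
struct pagemap_entry decode_pagemap(uint64_t raw) {
    struct pagemap_entry e;
    e.present = (raw >> 63) & 1;
    e.swapped = (raw >> 62) & 1;
    e.file_or_shared = (raw >> 61) & 1;
    e.pfn = e.present ? (raw & ((UINT64_C(1) << 55) - 1)) : 0;
    return e;
}

/* Two entries name the same physical page only if both are present and
 * carry the same non-zero PFN (a zero PFN usually means "hidden"). */
bool same_physical_page(uint64_t raw_a, uint64_t raw_b) {
    struct pagemap_entry a = decode_pagemap(raw_a);
    struct pagemap_entry b = decode_pagemap(raw_b);
    return a.present && b.present && a.pfn != 0 && a.pfn == b.pfn;
}
```

With something like this, the final assert in the test could compare PFNs only, so that differing flag bits do not produce a false negative.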
mprotect + mmap anonymous pages: It does not work in your case, but one solution is to use a MAP_SHARED file for the main memory region. On a snapshot, the file is mapped somewhere else and both instances are mprotect-ed. On a write, an anonymous page is mapped in the snapshot, the data is copied into this new page and the original page is unprotected. However, this solution does not work in your case, as you will not be able to repeat the process in the snapshot (because it is not a plain MAP_SHARED area but a MAP_SHARED area with some MAP_ANONYMOUS pages). Moreover, it does not scale with the number of copies: if I have many COW copies, I will have to repeat the same process for each copy, and this page will not be deduplicated across the copies. And I can't map the anonymous page in the original area, as it would then not be possible to map the anonymous pages in the copies. This solution does not work in any case.
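To make the "map an anonymous page in the snapshot and copy the data" step concrete, here is a minimal sketch. It leaves out the SIGSEGV handler and the mprotect calls entirely and only shows the page replacement itself. The function names detach_page and demo_detach, and the temporary file under /tmp, are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Replace one page of the snapshot mapping with a private anonymous page
 * holding the current contents of the live page. Returns 0 or -1. */
int detach_page(char *snapshot_page, const char *live_page, size_t page_size) {
    assert((uintptr_t) snapshot_page % page_size == 0);
    void *p = mmap(snapshot_page, page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(snapshot_page, live_page, page_size);
    return 0;
}

/* Returns 0 if the snapshot keeps the old byte after the live region is
 * modified, -1 on any failure. */
int demo_detach(void) {
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page_size = (size_t) ps;

    char path[] = "/tmp/cow-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t) page_size) != 0) {
        close(fd);
        return -1;
    }

    /* Live region and snapshot both map the same file page. */
    char *live = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *snap = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (live == MAP_FAILED || snap == MAP_FAILED)
        return -1;

    live[0] = 'A';
    if (snap[0] != 'A')   /* still shared at this point */
        return -1;

    /* What the fault handler would do just before letting the write through. */
    if (detach_page(snap, live, page_size) != 0)
        return -1;
    live[0] = 'B';

    int ok = (snap[0] == 'A' && live[0] == 'B');
    munmap(live, page_size);
    munmap(snap, page_size);
    return ok ? 0 : -1;
}
```

After detach_page, the snapshot address range is a mix of file-backed and anonymous pages, which is exactly why the process cannot simply be repeated on the snapshot.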
mprotect + remap_file_pages: This looks like the only way to do this without touching the Linux kernel. The downside is that, in general, you will probably have to make a remap_file_pages syscall for each page when doing a copy: it might not be that efficient to make a lot of syscalls. When deduplicating a shared page, you need at least to: remap_file_pages a new/free page for the newly written-to page, and un-mprotect the new page. It is necessary to reference count each page.
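For reference, a single remap_file_pages call looks roughly like the sketch below, which makes the first page of a two-page shared mapping show the second page of the file. The function name demo_remap and the temporary file under /tmp are assumptions for the example. As far as I know from the man page, the syscall has been deprecated since Linux 3.16 and is emulated by the kernel since 4.0, so treat this as an illustration of the call rather than a recommendation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the remapping worked, 0 if remap_file_pages itself failed
 * (for example because it is not available), -1 on any other problem. */
int demo_remap(void) {
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page_size = (size_t) ps;

    char path[] = "/tmp/remap-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t) (2 * page_size)) != 0) {
        close(fd);
        return -1;
    }

    char *m = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (m == MAP_FAILED)
        return -1;

    m[0] = 'x';          /* file page 0 */
    m[page_size] = 'y';  /* file page 1 */

    /* Make the first page of the mapping refer to file page 1.
     * The prot argument must be 0; pgoff is given in pages. */
    if (remap_file_pages(m, page_size, 0, 1, 0) != 0) {
        munmap(m, 2 * page_size);
        return 0;
    }

    int ok = (m[0] == 'y' && m[page_size] == 'y');
    munmap(m, 2 * page_size);
    return ok ? 1 : -1;
}
```

One such call is needed for every page whose backing changes, which is where the syscall-per-page cost comes from.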
I do not think that the mprotect()-based approaches would scale very well (if you handle a lot of memory like this). On Linux, mprotect() does not work at memory page granularity but at vm_area_struct granularity (the entries you find in /proc/<pid>/maps). Doing an mprotect() at memory page granularity will cause the kernel to constantly split and merge vm_area_structs:
- you will end up with a very large mm_struct;
- looking up a vm_area_struct (which is used for a lot of virtual memory related operations) is O(log #vm_area_struct), but it might still have a negative performance impact;
- memory consumption for those structures.
For this kind of reason, the remap_file_pages() syscall was created [http://lwn.net/Articles/24468/] in order to do non-linear memory mapping of a file. Doing this with mmap requires a lot of vm_area_structs. I do not even think that it was designed for page-granularity mapping: remap_file_pages() is not very optimised for this use case, as it would need a syscall per page.
I think the only viable solution is to let the kernel do it. It is possible to do it in userspace with remap_file_pages, but it will probably be quite inefficient, as a snapshot will in general need a number of syscalls proportional to the number of pages. A variant of remap_file_pages might do the trick.
This approach, however, duplicates the page logic of the kernel. I tend to think we should let the kernel do this. All in all, an implementation in the kernel seems to be the better solution. For someone who knows this part of the kernel, it should be quite easy to do.
KSM (Kernel Samepage Merging): There is one thing that the kernel can do: it can try to deduplicate the pages. You will still have to copy the data, but the kernel should be able to merge the copies. You need to mmap a new anonymous area for your copy, copy the data manually with memcpy and madvise(start, length, MADV_MERGEABLE) the areas. You need to enable KSM (as root):
echo 1 > /sys/kernel/mm/ksm/run
echo 10000 > /sys/kernel/mm/ksm/pages_to_scan
It works, but not so well with my workload, probably because the pages are not shared a lot in the end. The downside is that you still have to do the copy (you cannot have an efficient COW), and then the kernel will un-merge the page. It will generate page faults and cache misses when doing the copies, the KSM daemon thread will consume a lot of CPU (I have a CPU running at 100% for the whole simulation) and will probably consume a lot of cache. So you will not gain time when doing the copy, but you might gain some memory. If your main motivation is to use less memory in the long run and you do not care that much about avoiding the copies, this solution might work for you.
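The copy-then-advise step described above can be sketched as follows. The function name ksm_snapshot is made up for the example, and it assumes that the source region is page-aligned and that its length is a multiple of the page size. The madvise calls are treated as best effort, because they fail on kernels built without KSM support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy length bytes from src into a fresh anonymous mapping and mark both
 * regions as candidates for KSM merging. Returns the copy, or NULL if the
 * mapping could not be created. */
void *ksm_snapshot(void *src, size_t length) {
    assert(src != NULL && length > 0);
    void *copy = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (copy == MAP_FAILED)
        return NULL;

    /* The full copy is still paid for up front: KSM only merges later. */
    memcpy(copy, src, length);

    /* Best effort: ignored if the kernel has no KSM support. */
    (void) madvise(src, length, MADV_MERGEABLE);
    (void) madvise(copy, length, MADV_MERGEABLE);
    return copy;
}
```

Nothing is shared right after the call returns. Identical pages are only merged once the KSM daemon has scanned them, and a later write to a merged page makes the kernel split it again.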