I have a rather large amount of data to analyse - each file is about 5 GB. Each file has the following format:
xxxxx yyyyy
Both the key and the value can repeat, but the keys are sorted in increasing order. I'm trying to use a memory-mapped file to read the data, locate the required keys, and work with them. This is what I've written:
if (data_file != "")
{
clock_start = clock();
data_file_mapped.open(data_file);
data_multimap = (std::multimap<double, unsigned int> *)data_file_mapped.data();
if (data_multimap != NULL)
{
std::multimap<double, unsigned int>::iterator it = data_multimap->find(keys_to_find[4]);
if (it != data_multimap->end())
{
std::cout << "Element found.";
for (std::multimap<double, unsigned int>::iterator it = data_multimap->lower_bound(keys_to_find[4]); it != data_multimap->upper_bound(keys_to_find[5]); ++it)
{
std::cout << it->second;
}
std::cout << "
";
clock_end = clock();
std::cout << "Time taken to read in the file: " << (clock_end - clock_start)/CLOCKS_PER_SEC << "
";
}
else
std::cerr << "Element not found at all" << "
";
}
else
std::cerr << "Nope - no data received."<< "
";
}
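For context, here is a sketch of the declarations the snippet relies on. Note this is a reconstruction for the post, not verbatim code - in particular, the mapped_file_source type is Boost.Iostreams', inferred from the open()/data() calls above:

#include <ctime>
#include <map>
#include <string>
#include <vector>
#include <boost/iostreams/device/mapped_file.hpp>

// Assumed declarations (reconstructed sketch, not copied verbatim):
std::string data_file;                                  // path to the .ras file
boost::iostreams::mapped_file_source data_file_mapped;  // read-only memory map
std::multimap<double, unsigned int> *data_multimap;     // cast target for the mapped bytes
std::vector<double> keys_to_find;                       // the times I want to look up
clock_t clock_start, clock_end;                         // crude timing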
Basically, I need to locate ranges of keys and pull those chunks out to work on. I get a segfault the first time I call a method on the multimap - for example, when the find method is called. I tried the upper_bound, lower_bound and other methods too, and still get a segfault.
This is what gdb gives me:
Program received signal SIGSEGV, Segmentation fault.
_M_lower_bound (this=<optimized out>, __k=<optimized out>, __y=<optimized out>, __x=0xa31202030303833) at /usr/include/c++/4.9.2/bits/stl_tree.h:1261
1261 if (!_M_impl._M_key_compare(_S_key(__x), __k))
Could someone please point out what I'm doing wrong? I've only been able to find simplistic examples of memory-mapped files - nothing like this yet. Thanks.
EDIT: More information:
The file I described above is basically a two-column plain-text file that a neural simulator gives me as the output of my simulations. It looks like this:
$ du -hsc 201501271755.e.ras
4.9G 201501271755.e.ras
4.9G total
$ head 201501271755.e.ras
0.013800 0
0.013800 1
0.013800 10
0.013800 11
0.013800 12
0.013800 13
0.013800 14
0.013800 15
0.013800 16
0.013800 17
The first column is time and the second column is the neuron that fired at that time (it's a spike-time raster file). Actually, each MPI rank used to run the simulation outputs a file like this, and the per-rank files have been merged into this master file using sort -g -m. More information on the file format is here: http://www.fzenke.net/auryn/doku.php?id=manual:ras
To calculate the firing rate and other metrics of the neuron set at certain times of the simulation, I need to locate the time in the file, pull out the chunk between [time - 1, time], and run the metrics on that chunk. This file is quite small, and I expect the sizes to increase quite a bit as my simulations get more complicated and run for longer periods - that's why I began looking into memory-mapped files. I hope that clarifies the problem statement somewhat. I only need to read the output file to process the information it contains; I do not need to modify it at all.
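To make the goal concrete, here is a minimal sketch of what I need per query, written against plain ifstream parsing (a hypothetical illustration, not my real code - it works, but it parses the file from the start on every query, which is the cost I hoped a memory map would avoid):

#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: collect all (time, neuron) pairs whose time lies in
// [t - 1.0, t]. Relies on the file being sorted by time (first column).
std::vector<std::pair<double, unsigned int>> chunk_at(const std::string &path, double t)
{
    std::vector<std::pair<double, unsigned int>> chunk;
    std::ifstream in(path);
    double time;
    unsigned int neuron;
    while (in >> time >> neuron)
    {
        if (time > t)
            break;                       // sorted input: nothing later can match
        if (time >= t - 1.0)
            chunk.emplace_back(time, neuron);
    }
    return chunk;
}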
To process the data, I'll use multiple threads again, but since all my operations on the map are read-only, I don't expect to run into trouble there.
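Roughly what I have in mind for that threaded stage (again only a sketch; process_window is a hypothetical stand-in for the actual metrics code):

#include <thread>
#include <vector>

// Hypothetical placeholder for the metrics over one window [t - 1.0, t].
void process_window(double t) { /* firing rate etc. */ }

// Each thread handles one time window; all of them only read the data,
// so no locking should be needed.
void run_all(const std::vector<double> &times)
{
    std::vector<std::thread> workers;
    workers.reserve(times.size());
    for (double t : times)
        workers.emplace_back(process_window, t);
    for (std::thread &w : workers)
        w.join();
}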
See Question&Answers more detail:
os