Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c++ - Is it possible to use threads to speed up file reading?

I want to read a file as fast as possible (40k lines). [Edit: the rest is obsolete.]

Edit: Andres Jaan Tack suggested a solution based on one thread per file, and I want to be sure I understood it correctly (and that it is therefore the fastest way):

  • One thread per input file reads the whole file and stores its content in an associated container (so there are as many containers as input files).
  • One thread computes the linear combination of every cell read by the input threads and stores the results in the output container (associated with the output file).
  • One thread writes the content of the output container in blocks (every 4 kB of data, so about 10 lines).

Should I conclude that I must not use memory-mapped (mmap) files, because the program would sit idle waiting for the data?

Thanks in advance.

Sincerely,

Mister mystère.



1 Answer


Your question got a little deeper as you asked further. I'll try to cover all your options...

Reading One File: How many threads?

Use one thread.

If you read straight through a file front-to-back from a single thread, the operating system will not fetch the file in small chunks like you're thinking. Rather, it will prefetch the file ahead of you in huge (exponentially growing) chunks, so you almost never pay a penalty for going to disk. You might wait for the disk a handful of times, but in general it will be as if the file were already in memory, and this holds whether or not you use mmap.

The OS is very good at this kind of sequential file reading, because it's predictable. When you read a file from multiple threads, you're essentially reading randomly, which is (obviously) less predictable. Prefetchers tend to be much less effective with random reads, in this case probably making the whole application slower instead of faster.

Note: This is even before you add the cost of setting up the threads and all the rest of it. That costs something, too, but it's basically nothing compared with the cost of more blocking disk accesses.
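
As a rough illustration of the single-thread approach, here is a minimal sketch that reads a file strictly front-to-back in large chunks. The function name read_whole_file and the 1 MiB default chunk size are assumptions made for this example, not something from the question; the point is only that the access pattern is sequential, which is what the OS read-ahead is designed for.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read an entire file sequentially from the calling thread.
// chunk_size is an arbitrary illustrative default, not a tuned value.
std::string read_whole_file(const std::string& path,
                            std::size_t chunk_size = 1 << 20) {
    assert(chunk_size > 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::string contents;
    std::string chunk(chunk_size, '\0');
    // Each read continues where the previous one stopped, so the access
    // pattern stays purely sequential.
    while (in) {
        in.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
        contents.append(chunk.data(), static_cast<std::size_t>(in.gcount()));
    }
    return contents;
}
```

Calling read_whole_file("input.txt") (the file name is just a placeholder) returns the full contents as one string, which can then be split into lines in memory.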

Reading Multiple Files: How many threads?

Use as many threads as you have files (or some reasonable number).

File prefetching is done separately for each open file. Once you start reading multiple files, you should read from several of them in parallel. This works because the disk I/O scheduler will try to figure out the fastest order in which to read all of them in. Often, there's a disk scheduler both in the OS and on the hard drive itself. Meanwhile, the prefetcher can still do its job.

Reading several files in parallel is generally better than reading the files one-by-one. If you did read them one at a time, your disk would idle between prefetches; that's valuable time to read more data into memory! The only way you can go wrong is if you have too little RAM to support many open files; that's not common anymore.

A word of caution: If you're overzealous with your multiple file reads, reading one file will start kicking bits of other files out of memory, and you're back to a random-read situation.
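
One possible way to get "one thread per file, up to some reasonable number" is sketched below. The names slurp and read_files_parallel and the default cap of 8 threads are assumptions for illustration only; the right cap depends on your disk and RAM. Each thread still reads its own file sequentially, and each writes only to its own slot in the result vector, so no locking is needed. On some toolchains you may need to link with a threading flag such as -pthread.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: read one file front-to-back.
// A file that cannot be opened yields an empty string in this sketch.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Read every file, running at most max_threads reader threads at a time.
std::vector<std::string> read_files_parallel(
        const std::vector<std::string>& paths,
        std::size_t max_threads = 8) {  // assumed cap, tune for your system
    std::vector<std::string> contents(paths.size());
    if (max_threads == 0) max_threads = 1;

    for (std::size_t start = 0; start < paths.size(); start += max_threads) {
        const std::size_t end = std::min(paths.size(), start + max_threads);
        std::vector<std::thread> readers;
        for (std::size_t i = start; i < end; ++i) {
            // Thread i touches only contents[i]: disjoint slots, no lock.
            readers.emplace_back([&paths, &contents, i] {
                contents[i] = slurp(paths[i]);
            });
        }
        for (auto& t : readers) t.join();
    }
    return contents;
}
```

Processing the files in batches keeps the number of simultaneously open files bounded, which is one simple guard against the cache-thrashing situation described above.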

Combining n Files into One

Processing and producing output from multiple threads might work, but it depends on how you need to combine them. You'll have to be careful about how you synchronize the threads in any case, though there are surely some relatively easy lock-free ways to do that.
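
One simple way to avoid locks, sketched here under assumptions, is to give each worker thread a disjoint slice of the output. The example assumes the input files have already been parsed into equally sized vectors of double and that the combination is a plain weighted sum; the function name combine, the coefficient vector, and the default of 4 threads are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Computes out[j] = sum over i of coeffs[i] * inputs[i][j].
// Assumes every input vector has the same length (not checked per row here).
std::vector<double> combine(const std::vector<std::vector<double>>& inputs,
                            const std::vector<double>& coeffs,
                            std::size_t num_threads = 4) {
    assert(inputs.size() == coeffs.size());
    if (inputs.empty()) return {};
    const std::size_t n = inputs.front().size();
    std::vector<double> out(n, 0.0);
    if (num_threads == 0) num_threads = 1;
    const std::size_t slice = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += slice) {
        const std::size_t end = std::min(n, begin + slice);
        // Each worker writes only out[begin, end): the slices never
        // overlap, so no mutex or atomic is required.
        workers.emplace_back([&inputs, &coeffs, &out, begin, end] {
            for (std::size_t j = begin; j < end; ++j)
                for (std::size_t i = 0; i < inputs.size(); ++i)
                    out[j] += coeffs[i] * inputs[i][j];
        });
    }
    for (auto& t : workers) t.join();
    return out;
}
```

For 40k lines the arithmetic is likely cheap enough that a single combining thread would do; the slicing only shows that the synchronization can be kept trivial if you do split the work.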

One thing to watch for, though: Don't bother writing the file in small (< 4K) blocks. Collect at least 4K of data at a time before you call write(). Also, since the kernel will lock the file while you write to it, don't call write() from all of your threads together; they'll all wait for each other instead of processing more data.
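
The batching advice could look roughly like the sketch below. BlockWriter is a hypothetical class written for this example, and it is meant to be owned by a single writer thread. Note that std::ofstream already buffers internally, so this wrapper mostly makes the batching explicit; the same pattern applies if you call the POSIX write() function directly on a file descriptor.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical single-threaded writer: collects output until at least
// block_size bytes are pending, then hands them over in one call.
class BlockWriter {
public:
    explicit BlockWriter(const std::string& path,
                         std::size_t block_size = 4096)
        : out_(path, std::ios::binary), block_size_(block_size) {}

    // Flush whatever is left when the writer goes out of scope.
    ~BlockWriter() { flush(); }

    void append(const std::string& line) {
        pending_ += line;
        if (pending_.size() >= block_size_) flush();
    }

    void flush() {
        if (pending_.empty()) return;
        out_.write(pending_.data(),
                   static_cast<std::streamsize>(pending_.size()));
        pending_.clear();
        ++writes_;
    }

    // Number of batched writes issued so far (for illustration only).
    std::size_t writes() const { return writes_; }

private:
    std::ofstream out_;
    std::size_t block_size_;
    std::string pending_;
    std::size_t writes_ = 0;
};
```

The processing threads would hand finished lines to the one thread that owns the BlockWriter, rather than each calling write() themselves.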

