Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a 486.6 MB file with 153,895 rows and 644 columns.
Python 3.3 example
import pandas as pd
import time
start = time.time()
myData = pd.read_csv("C:/myFile.txt", sep="|", header=None, low_memory=False)
print(time.time() - start)
Output: 19.90
R 3.0.2 example
system.time(myData <- read.delim("C:/myFile.txt", sep="|", header=FALSE,
                                 stringsAsFactors=FALSE, na.strings=""))
Output:
   user  system elapsed
 181.13    1.07  182.32
Julia 0.2.0 (Julia Studio 0.4.4) example #1
using DataFrames
timing = @time myData = readtable("C:/myFile.txt", separator='|', header=false)
Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)
Julia 0.2.0 (Julia Studio 0.4.4) example #2
timing = @time myData = readdlm("C:/myFile.txt", '|', header=false)
Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)
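One variant I considered but have not benchmarked: if every field in the file were numeric, I believe readdlm can be given an element type so it fills a Float64 matrix directly instead of inferring a type per cell. This only applies to a fully numeric file, so it may not fit mine:

# Only valid if every field parses as a Float64.
# On Julia >= 0.7 readdlm needs `using DelimitedFiles` first; it was in Base on 0.2.
M = @time readdlm("C:/myFile.txt", '|', Float64)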
Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
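For comparison, here is a bare-bones reader I sketched that skips type inference entirely and just splits each line on the delimiter (modern Julia syntax; readfast is my own name, not a library function, and I have not timed it on the file above):

# Bare-bones line splitter: no type inference, no NA handling, no quoting.
function readfast(path)
    rows = Vector{Vector{String}}()
    open(path) do io
        for line in eachline(io)
            push!(rows, String.(split(line, '|')))   # one Vector{String} per row
        end
    end
    return rows
end

@time rows = readfast("C:/myFile.txt")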
A separate issue: the in-memory size is 18x the on-disk file size in Julia, but only 2.5x in Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2x the on-disk size. Is there a particular reason for the large in-memory footprint in Julia?
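For what it's worth, this is roughly how I compare in-memory vs. on-disk size. Base.summarysize only exists on newer Julia versions, so on 0.2 I can only go by the bytes-allocated figure that @time reports:

# Sketch assuming a recent Julia (Base.summarysize and `using DelimitedFiles`).
using DelimitedFiles
A = readdlm("C:/myFile.txt", '|')              # mixed columns come back as an Any matrix
mem_mb  = Base.summarysize(A) / 1024^2         # bytes the object occupies in memory
disk_mb = filesize("C:/myFile.txt") / 1024^2   # bytes of the file on disk
println("in memory: ", round(mem_mb, digits=1), " MB (",
        round(mem_mb / disk_mb, digits=1), "x disk)")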