
python - Why is this numpy array too big to load?

I have a 3.374 GB .npz file, myfile.npz.

I can read it in and view the filenames:

import numpy as np

a = np.load('myfile.npz')
a.files

gives

['arr_1','arr_0']

I can read in 'arr_1' OK:

a1 = a['arr_1']

However, I cannot load arr_0 or read its shape:

a0 = a['arr_0']
a['arr_0'].shape

Both of the above operations give the following error:

ValueError: array is too big

I have 16 GB of RAM, of which 8.370 GB is available, so the problem doesn't seem to be related to memory. My questions are:

  1. Should I be able to read this file in?

  2. Can anyone explain this error?

  3. I have been looking at using np.memmap to get around this - is this a reasonable approach?

  4. What debugging approach should I use?

EDIT:

I got access to a computer with more RAM (48 GB) and the file loaded. The dtype was in fact complex128, and the uncompressed size of a['arr_0'] was 5750784000 bytes. It seems that some RAM overhead may be required, or else my estimate of the available RAM was wrong (I used Windows Sysinternals RAMMap).


1 Answer


An np.complex128 array with dimensions (200, 1440, 3, 13, 32) ought to take up about 5.35 GiB uncompressed, so if you really did have 8.3 GB of free, addressable memory then in principle you ought to be able to load the array.
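
As a quick sanity check (not part of the original answer), that figure can be reproduced directly from the shape and the itemsize of np.complex128, and it matches the 5750784000 bytes reported in the question's edit:

import numpy as np

# expected in-memory size of a (200, 1440, 3, 13, 32) complex128 array
shape = (200, 1440, 3, 13, 32)
n_elements = np.prod(shape)                               # 359,424,000 elements
n_bytes = n_elements * np.dtype(np.complex128).itemsize   # 16 bytes per element
print(n_bytes)           # 5750784000
print(n_bytes / 2**30)   # ~5.36 GiB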

However, based on your responses in the comments below, you are using 32-bit versions of Python and numpy. On Windows, a 32-bit process can only address up to 2 GB of memory (or 4 GB if the binary was compiled with the IMAGE_FILE_LARGE_ADDRESS_AWARE flag; most 32-bit Python distributions are not). Consequently, your Python process is limited to 2 GB of address space regardless of how much physical memory you have.
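
If you want to confirm which build you are running (not something the original answer spells out), a minimal check from inside the interpreter is:

import struct
import sys

# prints 32 on a 32-bit Python build and 64 on a 64-bit build
print(struct.calcsize('P') * 8)

# False on 32-bit builds (sys.maxsize == 2**31 - 1), True on 64-bit builds
print(sys.maxsize > 2**32)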

You can either install 64-bit versions of Python, numpy, and any other Python libraries you need, or live with the 2 GB limit and try to work around it. In the latter case you might get away with storing arrays that exceed the 2 GB limit mainly on disk (e.g. using np.memmap), but I'd advise you to go for option #1, since operations on memmapped arrays are in most cases a lot slower than on normal numpy arrays that reside wholly in RAM.


If you already have access to another machine with enough RAM to load the whole array into core memory, then I would suggest you save the array in a different format (either as a plain binary that np.memmap can read, or perhaps better, in an HDF5 file using PyTables or h5py; a sketch of the HDF5 route is given after the example below). It's also possible (although slightly trickier) to extract the problem array from the .npz file without loading it into RAM, so that you can then open it as an np.memmap array residing on disk:

import numpy as np

# some random sparse (compressible) data
x = np.random.RandomState(0).binomial(1, 0.25, (1000, 1000))

# save it as a compressed .npz file
np.savez_compressed('x_compressed.npz', x=x)

# now load it as a numpy.lib.npyio.NpzFile object
obj = np.load('x_compressed.npz')

# contains a list of the stored arrays in the format '<name>.npy'
namelist = obj.zip.namelist()

# extract 'x.npy' into the current directory
obj.zip.extract(namelist[0])

# now we can open the array as a memmap
x_memmap = np.load(namelist[0], mmap_mode='r+')

# check that x and x_memmap are identical
assert np.all(x == x_memmap[:])
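
For the HDF5 route mentioned above, a minimal sketch using h5py might look like the following (hypothetical filenames; it assumes the array can be loaded into RAM once on the larger machine, after which it can be read back slice by slice on the 32-bit machine without ever holding the whole thing in memory):

import h5py
import numpy as np

# same random compressible data as in the example above
x = np.random.RandomState(0).binomial(1, 0.25, (1000, 1000))

# on the machine with enough RAM: write a chunked, compressed HDF5 dataset
with h5py.File('x.h5', 'w') as f:
    f.create_dataset('x', data=x, chunks=True, compression='gzip')

# later (e.g. on the 32-bit machine): read it back one slice at a time,
# so only the requested rows are ever loaded into memory
with h5py.File('x.h5', 'r') as f:
    first_rows = f['x'][:100]

assert np.all(first_rows == x[:100])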
