memory - Python MemoryError or ValueError in np.loadtxt and iter_loadtxt

My starting point was a problem with NumPy's function loadtxt:

X = np.loadtxt(filename, delimiter=",")

which raised a MemoryError inside np.loadtxt(..). I googled the error and found a question on StackOverflow whose answer suggested the following solution:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember how many items the last line had; used for the reshape below.
        iter_loadtxt.rowlength = len(line)

    # Stream all values into a flat 1-D array, then fold it into rows.
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')

So I tried that, but then encountered the following error message:

> data = data.reshape((-1, iter_loadtxt.rowlength))
> ValueError: total size of new array must be unchanged
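As far as I understand the error, reshape complains because the total number of elements in the flat array is not divisible by the row length it derives from the last line. That is easy to reproduce with made-up numbers:

import numpy as np

# 10 elements cannot be folded into rows of length 3 (10 % 3 != 0),
# which raises the same "total size of new array must be unchanged".
flat = np.arange(10, dtype=float)
flat.reshape((-1, 3))  # ValueError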

Then I tried to supply the number of rows and the maximum number of columns to the code, using the fragments below, which I partly took from another question and partly wrote myself:

num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    # body unchanged from the version above
    pass

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))

But this still gave the same error message, even though I thought that should have fixed it. On the other hand, I'm not sure whether I'm calling data.reshape(..) correctly.
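One thing I did notice in the NumPy docs: the count argument of np.fromiter counts individual scalar items, not rows, so for a 2-D result it would apparently have to be num_rows * max_cols. A minimal sketch of that behaviour:

import numpy as np

# count limits how many scalar items fromiter reads, not how many rows.
items = iter([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
data = np.fromiter(items, dtype=float, count=2)
print(data)  # [ 1.  2.] -- only 2 of the 6 values were read
# so data.reshape((2, 3)) would fail: 2 items cannot fill a 2x3 array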

I commented out the line where data.reshape(..) is called to see what would happen. That gave this error message:

> ValueError: need more than 1 value to unpack

This happened at the first point where something is done with X, the variable this whole problem is about.
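If the reshape is skipped, X stays one-dimensional, so its shape is a one-element tuple; any line in the open-source code that unpacks it into two values would then fail exactly like that. A hypothetical example (the actual line in that code is not shown here):

import numpy as np

X = np.fromiter(iter([1.0, 2.0, 3.0]), dtype=float)  # 1-D, shape is (3,)
n, d = X.shape  # ValueError: need more than 1 value to unpack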

I know this code can work on the input files I have, because I have seen it used with them, but I can't figure out why it fails for me. My best guess is that because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For reference: I have 8 GB of RAM and a 1.2 GB file, yet according to Task Manager my RAM is not full.

What I want is to use this open-source code, which needs to read and parse the given file just like np.loadtxt(filename, delimiter=","), but within my memory limits. I know the code originally worked on MacOSX and Linux, and to be more precise: "MacOsx 10.9.2 and Linux (version 2.6.18-194.26.1.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."

I don't care that much about speed. My file contains about 200,000 lines with 100 or 1000 items per line (depending on the input file: one kind always has 100, the other always 1000). Each item is a floating-point number with 3 decimals, possibly negative, and the items are separated by a comma and a space. E.g.: [..] 0.194, -0.007, 0.004, 0.243, [..], so 100 or 1000 of those items per line (you see 4 of them here), for about 200,000 lines.
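(Splitting such a line on "," alone already works, since float() tolerates the leading space left after each comma:)

line = "0.194, -0.007, 0.004, 0.243"
values = [float(tok) for tok in line.split(",")]
print(values)  # [0.194, -0.007, 0.004, 0.243]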

I'm using Python 2.7 because the open source code needs that.

Does any of you have the solution for this? Thanks in advance.


1 Answer


On Windows, a 32-bit process is given at most 2 GB of address space by default, and numpy.loadtxt is notorious for being heavy on memory, so that explains why the first approach doesn't work.
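A quick way to confirm which interpreter is actually running (just a sanity check, not part of the fix):

import struct
import sys

print sys.maxsize           # 2**31 - 1 on a 32-bit build, 2**63 - 1 on 64-bit
print struct.calcsize("P")  # pointer size in bytes: 4 on 32-bit, 8 on 64-bit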

The second problem you appear to be facing is that the particular file you are testing with has missing data, i.e. not all lines have the same number of values. This is easy to check, for example:

import numpy as np

filename = 'your_file.ext'  # the file that fails to load
delimiter = ','

# Count the values on each line: N delimiters mean N + 1 fields.
numbers_per_line = []
with open(filename) as infile:
    for line in infile:
        numbers_per_line.append(line.count(delimiter) + 1)

# Check where there might be problems
numbers_per_line = np.array(numbers_per_line)
expected_number = 100
print np.where(numbers_per_line != expected_number)
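If that does show short lines, one option is to pre-allocate the array and pad the missing values with NaN. This is my own sketch, not a drop-in fix; num_rows and width come from a counting pass like the one above, and float32 is used to halve the memory footprint:

import numpy as np

def load_padded(filename, num_rows, width, delimiter=','):
    # Pre-allocate so no huge temporary Python lists are built.
    data = np.full((num_rows, width), np.nan, dtype=np.float32)
    with open(filename) as infile:
        for i, line in enumerate(infile):
            tokens = line.rstrip().split(delimiter)
            for j, tok in enumerate(tokens[:width]):
                if tok.strip():  # skip empty trailing fields
                    data[i, j] = float(tok)
    return data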
