Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pandas dataframe memory python

I want to convert a sparse matrix (156060 x 11780) to a DataFrame, but I get a MemoryError. This is my code:

vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', 
                       stop_words='english' , tokenizer=tokenize,
                       strip_accents = 'ascii') 

X = vect.fit_transform(df.pop('Phrase')).toarray()

for i, col in enumerate(vect.get_feature_names()):
    df[col] = X[:, i]

The error is raised by the line X = vect.fit_transform(df.pop('Phrase')).toarray(). How can I solve it?

1 Answer

Try this:

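import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# float32 halves the memory of the default float64. float16 is not supported
# by the TF-IDF step: recent scikit-learn versions warn and fall back to float64.
vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', stop_words='english',
                       tokenizer=tokenize,
                       strip_accents='ascii', dtype=np.float32)
X = vect.fit_transform(df.pop('Phrase'))  # NOTE: `.toarray()` was removed

# Densify one column at a time and store it sparsely
# (pd.SparseSeries is only available in pandas < 1.0)
for i, col in enumerate(vect.get_feature_names()):
    df[col] = pd.SparseSeries(X[:, i].toarray().reshape(-1,), fill_value=0)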
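
For context, here is a rough back-of-the-envelope sketch of why the original `.toarray()` call runs out of memory while the sparse matrix does not. The shape comes from the question; the number of non-zero terms per phrase is not given there, so `assumed_terms_per_row` is a purely illustrative guess.

```python
# Shape of the TF-IDF matrix reported in the question.
n_rows, n_cols = 156060, 11780

# A dense float64 array stores every cell, zero or not (8 bytes each).
dense_bytes = n_rows * n_cols * 8
print(f"dense float64: {dense_bytes / 1e9:.1f} GB")  # dense float64: 14.7 GB

# ASSUMPTION (not from the question): about 10 distinct terms per phrase.
assumed_terms_per_row = 10
nnz = n_rows * assumed_terms_per_row

# CSR stores, per non-zero entry, one float64 value and one int32 column
# index, plus an int32 row-pointer array of length n_rows + 1.
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4
print(f"CSR sparse:    {csr_bytes / 1e6:.1f} MB")  # CSR sparse:    19.4 MB
```

Under that assumption the dense array is several hundred times larger than the sparse one, which is why keeping the data sparse all the way into the DataFrame matters.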

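UPDATE: for Pandas 0.20 through 0.25 we can construct a SparseDataFrame directly from a SciPy sparse matrix (SparseDataFrame was removed in Pandas 1.0):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# float32 instead of float16: the TF-IDF step only supports float32/float64
vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', stop_words='english',
                       tokenizer=tokenize,
                       strip_accents='ascii', dtype=np.float32)

# Build the sparse frame straight from the sparse TF-IDF matrix
df = pd.SparseDataFrame(vect.fit_transform(df.pop('Phrase')),
                        columns=vect.get_feature_names(),
                        index=df.index)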

...