python - Convert large csv to sparse matrix for use in sklearn

I have a ~30 GB (~1.7 GB compressed; 180K rows x 32K columns) matrix saved in CSV format. I would like to convert this matrix to a sparse format so that the full dataset can be loaded into memory for machine learning with sklearn. The populated cells contain floats less than 1. One caveat: the target variable is stored as the last column of the matrix. What is the best way to make this large matrix usable in sklearn? That is, how can I convert the ~30 GB CSV into a scipy sparse format without ever loading the original dense matrix into memory?

Pseudocode

  1. Remove the target variable (keeping row order intact)
  2. Convert the ~30 GB matrix to sparse format (help!)
  3. Load the sparse matrix and the target variable into memory to run a machine-learning pipeline (how would I do this?)

1 Answer


You can build a sparse matrix row by row in memory fairly easily:

import numpy as np
import scipy.sparse as sps

input_file_name = "something.csv"
sep = ","  # field delimiter; the original snippet had sep = "", which makes np.fromstring fail on text

def _process_data(row_array):
    # Per-row preprocessing hook, e.g. splitting off the target stored in the last column.
    return row_array

sp_data = []
with open(input_file_name) as csv_file:
    for row in csv_file:
        # Parse one line of text into a 1-D float array.
        # np.fromstring is deprecated for text input in newer NumPy;
        # np.fromiter(row.split(sep), dtype=float) is a drop-in alternative.
        data = np.fromstring(row, sep=sep)
        data = _process_data(data)
        # Wrap the dense row as a 1 x n_columns sparse matrix (zeros are dropped).
        sp_data.append(sps.coo_matrix(data))

# Stack the per-row matrices into one sparse matrix.
sp_data = sps.vstack(sp_data)
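
The question also asks how to separate the target and feed the result to sklearn (steps 1 and 3 of the pseudocode). Here is a minimal sketch of those steps, assuming the target really is the last column; SGDRegressor is a placeholder chosen only because it accepts sparse input.

from sklearn.linear_model import SGDRegressor

X_sparse = sp_data.tocsr()             # CSR is the format most sklearn estimators handle best
X = X_sparse[:, :-1]                   # features: every column except the last
y = X_sparse[:, -1].toarray().ravel()  # target: the last column, densified to a 1-D array

model = SGDRegressor()
model.fit(X, y)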

The assembled sparse matrix will also be easier to write to HDF5, which is a far better way to store numbers at this scale than a text file.
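
scipy has no built-in HDF5 writer for sparse matrices, so one common pattern is to store the CSR component arrays with h5py and reassemble them on load. A minimal sketch, assuming h5py is installed; the file name matrix.h5 is arbitrary:

import h5py

csr = sp_data.tocsr()
# Save: a CSR matrix is fully described by data, indices, indptr, and its shape.
with h5py.File("matrix.h5", "w") as f:
    f.create_dataset("data", data=csr.data)
    f.create_dataset("indices", data=csr.indices)
    f.create_dataset("indptr", data=csr.indptr)
    f.attrs["shape"] = csr.shape

# Load: rebuild the matrix from the stored components.
with h5py.File("matrix.h5", "r") as f:
    loaded = sps.csr_matrix(
        (f["data"][:], f["indices"][:], f["indptr"][:]),
        shape=tuple(f.attrs["shape"]))

If HDF5 is not a hard requirement, scipy.sparse.save_npz and load_npz persist the matrix in one call each.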

