I have torch tensors that I need to save to disk, as they are large and will consume all of my memory if I keep them in RAM.
I am new to h5py and I am having trouble figuring out how to build the dataset efficiently; the process is VERY slow.
Below is an MWE that I intend to turn into a loop.
import numpy as np
import h5py
data = np.random.random((13, 8, 512, 768))
f = h5py.File(r'C:\Users\Andrew\Desktop\test_h5\xd.h5', 'w')
dset = f.create_dataset('embeds', shape=(13, 8, 512, 768),
                        maxshape=(None, 8, 512, 768), chunks=(13, 8, 512, 768),
                        dtype=np.float16)
# add first chunk of rows
dset[0:13] = data
# Resize the dataset to accommodate the next chunk of rows
dset.resize(26, axis=0)
# Write the next chunk
dset[13:] = np.random.random((13, 8, 512, 768))
# close the write handle, then re-open and check data
f.close()
with h5py.File(r'C:\Users\Andrew\Desktop\test_h5\xd.h5', 'r') as f:
    print(f['embeds'][0:26].shape)
    print(f['embeds'][0:26])
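For context, this is roughly the loop I have in mind, sketched with placeholders (the number of batches and the torch.rand tensors stand in for my real tensors, which I would convert with .numpy()):

import numpy as np
import h5py
import torch

n_batches = 4    # placeholder: however many batches I actually generate
batch_rows = 13

with h5py.File(r'C:\Users\Andrew\Desktop\test_h5\xd.h5', 'w') as f:
    dset = f.create_dataset('embeds', shape=(batch_rows, 8, 512, 768),
                            maxshape=(None, 8, 512, 768),
                            chunks=(batch_rows, 8, 512, 768),
                            dtype=np.float16)
    for cnt in range(n_batches):
        # placeholder: in the real code each batch is a torch tensor from my model
        batch = torch.rand(batch_rows, 8, 512, 768).numpy().astype(np.float16)
        if cnt > 0:
            # grow axis 0 by one batch before writing the new rows
            dset.resize((cnt + 1) * batch_rows, axis=0)
        dset[cnt * batch_rows:(cnt + 1) * batch_rows] = batch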
Edit:
I am now having issues figuring out how to ensure that the last appended data really is the last generated data; consider the following:
import numpy as np
import h5py
data = np.random.random((13, 8, 512, 768)).astype(np.float32)
batch_size = 8
with h5py.File('SO_65606675.h5', 'w') as f:
    # create empty data set
    dset = f.create_dataset('embeds', shape=(13, 16, 512, 768),
                            maxshape=(13, None, 512, 768), chunks=(13, 8, 512, 768),
                            dtype=np.float32)
    for cnt in range(2):
        # add chunk of rows
        start = cnt*batch_size
        dset[:, start:start+batch_size, :, :] = data[:, :, :, :]
        # Create attribute with last_index value
        dset.attrs['last_index'] = (cnt+1)*batch_size
# check data
with h5py.File('SO_65606675.h5', 'r') as f:
    print(f['embeds'].attrs['last_index'])
    print(f['embeds'].shape)
    x = f['embeds'][:, 8:16, :, :]       # get last entry
    print(np.array_equal(x, data))       # passes
Edit 2: I think I had an error above and this works; I will check my "real" data.
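For completeness, this is the pattern I plan to try on the real data, sketched with placeholders (the batch count and the torch.rand tensors stand in for my actual embeddings); axis 1 only grows once the preallocated space is used up, and last_index records how much of the dataset is valid:

import numpy as np
import h5py
import torch

batch_size = 8
n_batches = 3    # placeholder: in the real run this comes from my data loader

with h5py.File('SO_65606675.h5', 'w') as f:
    dset = f.create_dataset('embeds', shape=(13, batch_size, 512, 768),
                            maxshape=(13, None, 512, 768),
                            chunks=(13, batch_size, 512, 768),
                            dtype=np.float32)
    for cnt in range(n_batches):
        # placeholder: the real batch is a torch tensor coming out of the model
        batch = torch.rand(13, batch_size, 512, 768).numpy()
        end = (cnt + 1) * batch_size
        if end > dset.shape[1]:
            # grow axis 1 only when the preallocated space is exhausted
            dset.resize(end, axis=1)
        dset[:, end - batch_size:end, :, :] = batch
        dset.attrs['last_index'] = end

# read back only the slice that was actually written
with h5py.File('SO_65606675.h5', 'r') as f:
    last = f['embeds'].attrs['last_index']
    print(f['embeds'][:, :last, :, :].shape)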