Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
0 votes · 182 views · in Technique[技术] by (71.8m points)
python - How to find a corrupted data file in Tensorflow data pipelines

I'm using TensorFlow's CsvDataset to read data from disk during training.

import pathlib

import tensorflow as tf

def preprocess(*fields):
    print(len(fields))
    features = tf.stack(fields[:-1])                   # all columns but the last -> x
    labels = tf.stack([int(x) for x in fields[-1:]])   # last column -> y
    return features, labels

training_csvs =  sorted(str(p) for p in pathlib.Path('./../Dataset/Train').glob("*/*.csv"))

# `defaults` and `selected_indices` are defined elsewhere in the script
training_dataset = tf.data.experimental.CsvDataset(
    training_csvs,
    record_defaults=defaults,
    compression_type=None, 
    buffer_size=None,
    header=True, 
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)

training_dataset = training_dataset.map(preprocess)
training_dataset = training_dataset.shuffle(50000)
validate_ds = training_dataset.batch(50).take(100)
train_ds = training_dataset.batch(50, drop_remainder=True).skip(100)


for f,l in train_ds.take(1):   # Here it throws error for one of three datasets
    print(f)
    print(l)

This reading code works for two of my datasets but throws the following error for a third:

InvalidArgumentError: Expect 712 fields but have 711 in record [Op:IteratorGetNext]

As I understand it, some of my CSV files contain corrupted rows, but how can I debug the data iterator to get the file names/folder names of those files?
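One way to locate the offending file without going through the TensorFlow iterator at all is to scan each CSV directly and flag any row whose field count differs from that file's header. The sketch below is an assumption-based helper (the function name `find_bad_csvs` and the directory layout mirroring the question's `./../Dataset/Train/*/*.csv` glob are mine, not from the original post):

```python
import csv
import pathlib

def find_bad_csvs(root):
    """Yield (path, line_number, n_fields) for every row whose field
    count differs from the header row of its own file."""
    for path in sorted(pathlib.Path(root).glob("*/*.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:          # empty file: nothing to check
                continue
            expected = len(header)
            # data rows start on physical line 2 (line 1 is the header)
            for line_no, row in enumerate(reader, start=2):
                if row and len(row) != expected:
                    yield path, line_no, len(row)
```

Running `for p, line, n in find_bad_csvs('./../Dataset/Train'): print(p, line, n)` should print exactly the file(s) behind an "Expect 712 fields but have 711" error, along with the bad line numbers. Note this checks raw comma counts only; quoted fields are handled by `csv.reader`, but `select_cols` filtering is deliberately not replicated here.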



1 Answer

0 votes
by (71.8m points)
Awaiting a reply from an expert.


...