Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
669 views
in Technique[技术] by (71.8m points)

tensorflow - Parallelism isn't reducing the time in dataset map

TF Map function supports parallel calls. I'm seeing no improvements passing num_parallel_calls to map. With num_parallel_calls=1 and num_parallel_calls=10, there is no improvement in performance run time. Here is a simple code

import time
def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False, 
    batch_size=1, repeat=1, num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
        squarer, [x], [tf.int64]), 
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
       squarer, [x], [tf.int64]), num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_y = dataset_x.batch(batch_size)
        X = dataset_x.make_one_shot_iterator().get_next()
        Y = dataset_x.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                pass

Here are the timings

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=2, batch_size=2, batch=True)
370ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=5, batch_size=2, batch=True)
372ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=10, batch_size=2, batch=True)
384ms

I used %timeit in Juypter notebook. What am I doing it wrong?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).

Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++, and can execute in parallel. For example—and just guessing by the name—you could replace it with an invocation of tf.square(x), and you might then enjoy some parallel speedup.

Note however that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...