python - How does one debug NaN values in TensorFlow?

Question

asked Mar 6, 2021 in Technique by 深蓝 (71.8m points)

I was running TensorFlow and something in my graph is yielding a NaN.

I'd like to know what it is, but I do not know how to find out.

In a "normal" procedural program I would just write a print statement right before the operation is executed. With TensorFlow I cannot do that, because I first declare (or define) the graph, so adding print statements to the graph definition does not help.

Are there any rules, advice, or heuristics for tracking down what might be causing the NaN?

In this case I know more precisely which lines to look at, because I have the following:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it is a pairwise Euclidean distance
Z = tf.sqrt(Delta_tilde)
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

When these lines are present, my summary writers report NaN.

Why is this?

Is there a way to at least inspect the value Z has after the square root is taken?

For the specific example I posted, I tried tf.Print(0,Z), but without success: it printed nothing.

As in:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it is a pairwise Euclidean distance
Z = tf.sqrt(Delta_tilde)
tf.Print(0,[Z]) # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

I actually don't understand what tf.Print is supposed to do. Why does it need two arguments? If I want to print one tensor, why would I need to pass two? It seems bizarre to me.

I was looking at the function tf.add_check_numerics_ops(), but the docs don't say how to use it (and they don't seem to be very helpful in general). Does anyone know how to use it?

Since I've had comments suggesting that the data might be bad: I am using standard MNIST. However, I am computing a quantity that is positive (pairwise Euclidean distance) and then taking its square root, so I don't see how the data specifically would be the issue.

Asked by Pinocchio; translated from Stack Overflow

1 Answer

深蓝 · Answer 1 · 2021-03-06T04:20:58+0000

There are several reasons why you can get a NaN result. Often it is because the learning rate is too high, but plenty of other causes are possible, for example corrupt data in your input queue or computing the log of 0.
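
To make these failure modes concrete, here is a minimal NumPy sketch (NumPy rather than TensorFlow, only because the float arithmetic is the same and it runs without a graph). The values are made up for illustration. It shows a log of 0 turning into NaN, an overflow of the kind a too-high learning rate can cause, and the square root of a slightly negative number, which is worth ruling out whenever a sqrt is involved.

```python
import numpy as np

# Silence NumPy's RuntimeWarnings so the invalid operations just return values.
with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    # log(0) is -inf; multiplying it by 0 (as in p * log(p)) gives NaN.
    p = np.float32(0.0)
    log_p = np.log(p)
    entropy_term = p * log_p

    # The square root of a negative number is NaN, however small the number.
    root = np.sqrt(np.float32(-1e-6))

    # exp overflows to inf in float32, and inf - inf is NaN.
    big = np.exp(np.float32(100.0))
    diff = big - big

    # Once a NaN exists, it propagates through all later arithmetic.
    loss = np.float32(1.0) + root

for name, value in [("p * log(p)", entropy_term), ("sqrt(-1e-6)", root),
                    ("inf - inf", diff), ("1 + nan", loss)]:
    print(name, value, np.isnan(value))
```

The last line is the reason a NaN in the loss says little about where it started: you have to inspect intermediate values to find the first op that produced it.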

Anyhow, the kind of debugging you describe cannot be done with a plain Python print, as that would only print the tensor's information in the graph (name, shape, dtype) and not any actual values.
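
A small sketch of what that looks like, assuming TensorFlow 2.x with an explicit tf.Graph to reproduce the graph-mode behavior the question describes (the placeholder and the names are only for illustration):

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Graph-mode tensors are symbolic: no values exist until the graph is run.
    Delta_tilde = tf.compat.v1.placeholder(tf.float32, shape=[2], name="Delta_tilde")
    Z = tf.sqrt(Delta_tilde, name="Z")

    # This is the text a plain print(Z) would show: metadata only, no numbers.
    description = str(Z)
    print(description)
```

The printed text describes the tensor (its name, shape and dtype) at graph-construction time, which is why it cannot reveal a NaN.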

However, if you use tf.Print as an op when building the graph, then when the graph gets executed you will get the actual values printed (and it IS a good exercise to watch these values to debug and understand the behavior of your net).

That said, you are not using the print op quite correctly. It is an op, so you need to pass it a tensor and take the result tensor it returns, and then work with that result later on in the executing graph. Otherwise the op is not going to be executed and no printing occurs.

Try this:

Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z,[Z], message="my Z-values:") # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
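
For anyone who wants to run this end to end, below is a self-contained sketch under several assumptions that are not in the question: TensorFlow 2.x (where the TF 1.x op is reachable as tf.compat.v1.Print), tiny made-up inputs instead of MNIST, and WW and XX taken to be the squared norms of the columns of W and the rows of x. It also adds tf.debugging.check_numerics as an alternative to printing, since the question asked about numeric checks; this is a different call from tf.add_check_numerics_ops, swapped in because it checks one tensor explicitly.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy inputs: one data point and one centre (stored as a column).
x_val = np.array([[1.0, 2.0]], dtype=np.float32)
W_val = np.array([[0.0], [1.0]], dtype=np.float32)

graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[1, 2])
    W = tf.compat.v1.placeholder(tf.float32, shape=[2, 1])
    # Assumption: XX and WW hold squared norms (the question does not show them).
    XX = tf.reduce_sum(tf.square(x), axis=1, keepdims=True)
    WW = tf.reduce_sum(tf.square(W), axis=0, keepdims=True)

    Delta_tilde = 2.0 * tf.matmul(x, W) - tf.add(WW, XX)
    Z = tf.sqrt(Delta_tilde)
    # Print is an identity op with a side effect: keep using the tensor it returns.
    Z = tf.compat.v1.Print(Z, [Delta_tilde, Z], message="Delta_tilde, Z: ")
    A = tf.exp(tf.pow(Z, 2.0))

    # Alternative: raise an error as soon as the tensor contains NaN or Inf.
    Z_checked = tf.debugging.check_numerics(tf.sqrt(Delta_tilde), message="Z after sqrt")

feed = {x: x_val, W: W_val}
with tf.compat.v1.Session(graph=graph) as sess:
    # The Print op fires here (on stderr) because A depends on its output.
    delta_val, a_val = sess.run([Delta_tilde, A], feed_dict=feed)
    try:
        sess.run(Z_checked, feed_dict=feed)
        check_failed = False
    except tf.errors.InvalidArgumentError:
        check_failed = True

print("Delta_tilde:", delta_val, "A:", a_val, "check_numerics raised:", check_failed)
```

Under these assumptions Delta_tilde comes out negative: 2 x·w - (|w|^2 + |x|^2) is minus the squared distance |x - w|^2, so the square root is NaN. If WW and XX are defined this way in the original code too, flipping the sign of the expression would be the first thing to try.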
