timeout - CUDA apps time out & fail after several seconds - how to work around this?


深蓝 · Answer 1 · 2021-10-16T23:30:23+0000

I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.

You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious (a runaway kernel could then hang the display indefinitely). To disable it, open regedit, go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, create a REG_DWORD value named DisableBugCheck and set it to 1. You may also need to change something in the NVIDIA control panel. Look for a reference to "VPU Recovery" in the CUDA docs.

Ideally, you should be able to break your kernel operations up into multiple passes over your data, so that each pass runs within the time limit.
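A minimal CUDA sketch of the multi-pass idea, assuming the work is iterative and each iteration can run as its own launch. The 3-point smoothing kernel `smoothPass` and the function `smoothMultiPass` are hypothetical stand-ins for your own computation. Instead of one kernel that loops `passes` times internally, the host issues `passes` short launches and ping-pongs between two device buffers, so the data never leaves GPU memory between passes. Whether each pass actually fits under the timeout still depends on your GPU and problem size.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// One short pass (hypothetical work): 3-point average of the interior,
// endpoints copied unchanged.
__global__ void smoothPass(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i == 0 || i == n - 1) {
        out[i] = in[i];
    } else {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}

// Runs `passes` iterations as separate kernel launches rather than one
// long-running kernel. Data stays in device memory between passes; it
// crosses the bus once on the way in and once on the way out.
std::vector<float> smoothMultiPass(const std::vector<float>& host, int passes) {
    assert(passes >= 0);
    const int n = static_cast<int>(host.size());
    if (n == 0) return {};
    const size_t bytes = host.size() * sizeof(float);

    float* a = nullptr;
    float* b = nullptr;
    check(cudaMalloc(&a, bytes), "cudaMalloc a");
    check(cudaMalloc(&b, bytes), "cudaMalloc b");
    check(cudaMemcpy(a, host.data(), bytes, cudaMemcpyHostToDevice), "upload");

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int p = 0; p < passes; ++p) {
        smoothPass<<<blocks, threads>>>(a, b, n);
        check(cudaGetLastError(), "smoothPass launch");
        std::swap(a, b);  // this pass's output is the next pass's input
    }

    std::vector<float> result(host.size());
    // cudaMemcpy on the default stream waits for the preceding launches.
    check(cudaMemcpy(result.data(), a, bytes, cudaMemcpyDeviceToHost), "download");
    check(cudaFree(a), "cudaFree a");
    check(cudaFree(b), "cudaFree b");
    return result;
}
```

For example, `smoothMultiPass({0.f, 3.f, 0.f, 3.f, 0.f}, 1)` returns `{0, 1, 2, 1, 0}`.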

Alternatively, you can divide the problem domain so that each command computes fewer output pixels. For example, instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the GPU that compute 100,000 each.
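A minimal CUDA sketch of splitting the output domain, assuming each output pixel depends only on its own index so any contiguous range can be computed independently. `shadeRange` and `shadeInChunks` are hypothetical names and the per-pixel formula is placeholder work. Each chunk becomes its own kernel launch with an `offset` into the full output buffer, so `total = 1000000, chunk = 100000` yields 10 launches.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Placeholder per-pixel work (hypothetical): computes pixels
// [offset, offset + count) of the full output.
__global__ void shadeRange(float* out, int offset, int count) {
    int local = blockIdx.x * blockDim.x + threadIdx.x;
    if (local >= count) return;
    int i = offset + local;  // global pixel index
    out[i] = 0.5f * static_cast<float>(i);
}

// Computes `total` pixels with one launch per chunk of at most `chunk` pixels.
// If `launchCount` is non-null, it receives the number of launches issued.
std::vector<float> shadeInChunks(int total, int chunk, int* launchCount = nullptr) {
    assert(total >= 0 && chunk > 0);
    std::vector<float> result(total);
    int launches = 0;
    if (total > 0) {
        const size_t bytes = static_cast<size_t>(total) * sizeof(float);
        float* d = nullptr;
        check(cudaMalloc(&d, bytes), "cudaMalloc");

        const int threads = 256;
        for (int offset = 0; offset < total; offset += chunk) {
            const int count = std::min(chunk, total - offset);
            const int blocks = (count + threads - 1) / threads;
            shadeRange<<<blocks, threads>>>(d, offset, count);
            check(cudaGetLastError(), "shadeRange launch");
            ++launches;
        }

        check(cudaMemcpy(result.data(), d, bytes, cudaMemcpyDeviceToHost), "download");
        check(cudaFree(d), "cudaFree");
    }
    if (launchCount) *launchCount = launches;
    return result;
}
```

The chunk size is a tuning knob: too large and a single launch may still hit the watchdog, too small and launch overhead starts to dominate.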

The basic unit that has to fit within the time slice is not your entire application but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be spread across multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
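On the CUDA side, a workaround commonly cited for Windows WDDM drivers is to call `cudaStreamQuery(0)` after a launch, which tends to make the driver submit batched launches to the GPU instead of accumulating them; treat this as an assumption about driver behavior, not a documented equivalent of CtxFlush(). The sketch below launches chunks of hypothetical work (`incrementRange`), queries the stream after each launch, and blocks in `cudaDeviceSynchronize()` every `syncEvery` launches. Note that `cudaErrorNotReady` from the query only means work is still running. Each individual kernel must still be short; whether batching alone can trip the watchdog depends on the driver and GPU generation.

```cuda
#include <algorithm>
#include <cassert>
#include <cuda_runtime.h>

// Placeholder work (hypothetical): increment elements [offset, offset + count).
__global__ void incrementRange(int* data, int offset, int count) {
    int local = blockIdx.x * blockDim.x + threadIdx.x;
    if (local < count) data[offset + local] += 1;
}

// Launches chunks on the default stream, querying the stream after each
// launch (assumed to encourage WDDM drivers to submit queued work) and
// synchronizing every `syncEvery` launches. Returns the first real error.
cudaError_t launchWithFlushes(int* d, int total, int chunk, int syncEvery) {
    assert(total >= 0 && chunk > 0 && syncEvery > 0);
    const int threads = 256;
    int launched = 0;
    for (int offset = 0; offset < total; offset += chunk) {
        const int count = std::min(chunk, total - offset);
        incrementRange<<<(count + threads - 1) / threads, threads>>>(d, offset, count);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) return err;

        err = cudaStreamQuery(0);
        if (err == cudaErrorNotReady) {
            // Work still in flight: not a failure. Clear it defensively in
            // case the runtime recorded it as the last error.
            (void)cudaGetLastError();
        } else if (err != cudaSuccess) {
            return err;
        }

        if (++launched % syncEvery == 0) {
            err = cudaDeviceSynchronize();  // wait for everything queued so far
            if (err != cudaSuccess) return err;
        }
    }
    return cudaDeviceSynchronize();
}
```

The periodic synchronize is optional; it trades some throughput for a guarantee that no more than `syncEvery` launches are ever outstanding.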

You should not have to read all of your data back and forth across the PCIe bus on every time slice; you can leave your textures, etc. in GPU local memory and just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.

Finally, GPUs are fast, so if your application cannot do useful work within those 5 or 10 seconds, I'd take that as a sign that something is wrong.

[EDIT Mar 2010:] (outdated again; see the later updates for the most recent information) The registry key above is out of date. I think it was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7, documented here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx

[EDIT Apr 2015:] This is getting really out of date. The easiest way to disable TDR for CUDA programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This changes the registry setting for you. Close and reboot: any change to the TDR registry setting won't take effect until you reboot.

[EDIT August 2018:] Although the NVIDIA tools now allow disabling TDR, the same question is relevant for AMD/OpenCL developers. For them, the current documentation of the TDR settings is at https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
