video - How to use GPU to accelerate the processing speed of ffmpeg filter?

According to NVIDIA's developer website, you can use GPU to speed up the rendering of the ffmpeg filter.

Create high-performance end-to-end hardware-accelerated video processing, 1:N encoding and 1:N transcoding pipeline using built-in filters in FFmpeg

Ability to add your own custom high-performance CUDA filters using the shared CUDA context implementation in FFmpeg

The problem I am having now is how to use the GPU to speed up a chain of several ffmpeg filters.

For example:

ffmpeg -loop 1 -i dog.jpg -filter_complex "scale=iw*4:-1,zoompan=z='zoom+0.002':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=720x960" -pix_fmt yuv420p -vcodec libx264 -preset ultrafast -y -r:v 25 -t 5 -crf 28 dog.mp4

1 Answer

When it comes to hardware acceleration in FFmpeg, you can expect the following implementations by type:

1. Hardware-accelerated encoders: In the case of NVIDIA, NVENC is supported and implemented via the h264_nvenc and the hevc_nvenc wrappers. See this answer on how to tune them, and any limitations you may run into depending on the generation of hardware you're on.

2. Hardware-accelerated filters: Filters that perform duties such as scaling and post-processing (deinterlacing, etc) are available in FFmpeg, and some implementations are hardware-accelerated. For NVIDIA, the following filters can take advantage of hardware-acceleration:

(a). scale_cuda: This is a scaling filter analogous to the generic scale filter, implemented in CUDA. Its dependency is the ffnvcodec project, the set of headers that is also needed to enable the NVENC-based encoders. When the ffnvcodec headers are present, the filters that depend on them (scale_cuda and yadif_cuda) are enabled automatically. In production, it may be wise to prefer scale_npp over this filter, as scale_cuda has a very limited set of options.

(b). scale_npp: This is a scaling filter implemented with NVIDIA's Performance Primitives. Its primary dependency is the CUDA SDK, and it must be explicitly enabled by passing the --enable-libnpp, --enable-cuda-nvcc and --enable-nonfree flags to ./configure when building FFmpeg from source. Use this filter in place of scale_cuda wherever possible.

(c). yadif_cuda: This is a deinterlacer, implemented in CUDA. Its dependency, as stated above, is the ffnvcodec package of headers.

(d). All OpenCL-based filters: All NVENC-capable GPUs supported by both the mainline NVIDIA driver and the CUDA SDK implement OpenCL support. I started this section with this clarification because there's news in the wind that NVIDIA will be deprecating mobile Kepler GPUs in their mainline driver, relegating them to Legacy support status. For this reason, if you're on such a platform, take this into consideration.

To enable these filters, pass --enable-opencl to FFmpeg's ./configure script at build time. Note that this requires the OpenCL headers to be present on your system, and can be safely satisfied by your package manager on whatever Linux distribution you're on. On other operating systems, your mileage may vary.

To see all OpenCL-based filters, run:

ffmpeg -filters | grep opencl

A few notable examples are unsharp_opencl and avgblur_opencl. See this wiki section for more options.

(e). All Vulkan-based filters:

If FFmpeg is built with support for the Vulkan back-end, new filters will be available, which can be listed via:

ffmpeg -filters | grep vulkan

These filters are mostly beneficial for VAAPI and AMD's AMF interoperability, where shared HWContexts can be used to massively speed up functions such as scaling, etc. AMD's use case, in particular, allows you to perform hardware-accelerated scaling with Vulkan, which is critical for real-time throughput with the AMF's encoders because the current implementation of AMF in FFmpeg lacks scaling filters. This could change in the future as Khronos finishes up on Vulkan extensions for video encoding.

An example of a Vulkan-based scale filter with FFmpeg running on an NVIDIA GPU with NVDEC H/W acceleration with NVENC encoding is shown below:

ffmpeg -threads 1 -loglevel info -nostdin -y \
   -fflags +genpts-fastseek \
   -init_hw_device cuda=cuda:0 -filter_hw_device cuda \
   -hwaccel nvdec -hwaccel_output_format cuda -extra_hw_frames 3 \
   -reinit_filter 1 -vsync 1 -async 1 -filter_threads 2 -filter_complex_threads 2 \
   -i input.mp4 -filter_complex \
  "[0:v]hwupload=derive_device=vulkan,split=2[s0][s1]; \
   [s0]scale_vulkan=w=1920:h=1080:scaler=0,hwupload=derive_device=cuda[v0]; \
   [s1]scale_vulkan=w=1280:h=720:scaler=0,hwupload=derive_device=cuda[v1]" \
  -map "[v0]" -b:v:0 5800k -minrate:v:0 5800k -maxrate:v:0 5800k -bufsize:v:0 5800k -c:v:0 h264_nvenc -r:v:0 ntsc \
  -profile:v:0 high -preset:v:0 llhp -rc:v:0 cbr_ld_hq -g:v:0 60 -gpu:v:0 0 -strict_gop:v:0 1 -bf:v:0 0 \
  -map "[v1]"  -b:v:1 4000k -minrate:v:1 4000k -maxrate:v:1 4000k -bufsize:v:1 4000k -c:v:1 h264_nvenc -r:v:1 ntsc \
  -profile:v:1 high -preset:v:1 llhp -rc:v:1 cbr_ld_hq -g:v:1 60 -gpu:v:1 0 -strict_gop:v:1 1 -bf:v:1 0 \
  -map 0:a -c:a libfdk_aac -ac 2 -ar 48000 -b:a 128k \
  -flags +global_header+cgop \
  -max_muxing_queue_size 9000000 -f tee \
  "[select='v:0,a':f=mp4]'hq.mp4'| \
   [select='v:1,a':f=mp4]'med.mp4'"

Note how the snippet above uses the hwupload filter's device derivation capability to insert a Vulkan H/W context into the complex filter chain.

A note on performance with OpenCL- and Vulkan-based filters: take into account the overhead that hwupload and hwdownload add to your pipeline, as copying frames between system memory and the accelerator will affect performance, and so will format conversions (via the format filter) where required. In such cases, it may be beneficial to use the hwmap filter and to derive contexts where applicable. For instance, VAAPI has a mechanism that allows OpenCL device derivation and reverse mapping via hwmap, if the cl_intel_va_api_media_sharing OpenCL extension is present. This extension is typically provided by the Beignet ICD, and is absent in others, such as the newer Neo OpenCL driver.
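To make the trade-off above concrete, here is a minimal sketch of two filter paths, one that keeps frames in GPU memory end to end and one that pays for an upload and a download around an OpenCL filter. The helper names (gpu_resident_scale, opencl_unsharp), the FFMPEG variable, the device name ocl and the 1280x720 target are my own placeholders, not something from the notes above, and the sketch assumes a build with libnpp, OpenCL and NVENC enabled.

```shell
# Hypothetical helpers; set FFMPEG=echo to print a command instead of running it.
FFMPEG="${FFMPEG:-ffmpeg}"

# 1) Decode (NVDEC), scale (scale_npp) and encode (NVENC) without leaving GPU memory,
#    so no hwupload/hwdownload is needed in the filter chain.
gpu_resident_scale() {
  "$FFMPEG" -y -threads 1 \
    -hwaccel nvdec -hwaccel_output_format cuda \
    -i "$1" \
    -vf "scale_npp=1280:720" \
    -c:v h264_nvenc -c:a copy "$2"
}

# 2) Software decode, then upload to an OpenCL device, filter, and download again.
#    The hwupload/hwdownload/format steps are the overhead discussed above.
opencl_unsharp() {
  "$FFMPEG" -y \
    -init_hw_device opencl=ocl -filter_hw_device ocl \
    -i "$1" \
    -vf "format=yuv420p,hwupload,unsharp_opencl,hwdownload,format=yuv420p" \
    -c:v h264_nvenc -c:a copy "$2"
}
```

Usage would be along the lines of gpu_resident_scale input.mp4 output.mp4. Whether the second path is still a net win depends on how expensive the filter is relative to the two copies.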

3. Hardware-accelerated decoders (and their associated wrappers): Depending on your input source, and the capabilities of your NVIDIA GPU, based on generation, you may also tap into hardware accelerations based on either CUVID or NVDEC. These methods differ in how they handle textures in-flight on the accelerator, and it is wise to evaluate other factors, such as VRAM utilization, when they are in use. Typically, you can take advantage of the CUVID-based hwaccels for operations such as deinterlacing, if so desired. See their usage via:

ffmpeg -h decoder=h264_cuvid
ffmpeg -h decoder=hevc_cuvid
ffmpeg -h decoder=mpeg2_cuvid

However, be aware that using these decoders on MBAFF-encoded content, where double deinterlacing is required, is not advisable, as NVIDIA has not yet implemented MBAFF support in the backend. Take a look at this thread for more on the same.
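As a rough illustration of deinterlacing in the CUVID decoder itself, the sketch below uses the decoder's -deint and -drop_second_field options, which show up in the ffmpeg -h decoder=h264_cuvid output. The helper name cuvid_deinterlace and the FFMPEG variable are placeholders of mine, and the choice of adaptive mode is an assumption, not a recommendation.

```shell
# Hypothetical helper; set FFMPEG=echo to print the command instead of running it.
FFMPEG="${FFMPEG:-ffmpeg}"

# Deinterlace inside the h264_cuvid decoder, then encode with NVENC.
# -drop_second_field 1 keeps one output frame per input frame.
cuvid_deinterlace() {
  "$FFMPEG" -y -threads 1 \
    -c:v h264_cuvid -deint adaptive -drop_second_field 1 \
    -i "$1" \
    -c:v h264_nvenc -c:a copy "$2"
}
```

For example: cuvid_deinterlace deint-testfile.mkv out.mp4. The MBAFF caveat above still applies.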

In closing: It is wise to evaluate where and when hardware-accelerated offloading (filtering, encoding and decoding) offers an advantage or an acceptable trade-off (in quality, feature support and reliability) in your pipeline prior to deployment in production. This is a vendor-neutral approach to deciding which parts of your pipeline to offload, and when, and the same applies to NVIDIA's solutions.
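One rough way to do that evaluation is to time the same operation on both paths with ffmpeg's -benchmark option while discarding the output through the null muxer. This is only a sketch: the helper names and the FFMPEG variable are placeholders, the 1280x720 target is arbitrary, and format=nv12 assumes an 8-bit 4:2:0 source.

```shell
# Hypothetical helpers; set FFMPEG=echo to print a command instead of running it.
FFMPEG="${FFMPEG:-ffmpeg}"

# CPU path: software decode and the generic scale filter.
bench_cpu_scale() {
  "$FFMPEG" -hide_banner -benchmark -i "$1" \
    -vf "scale=1280:720" -an -f null -
}

# GPU path: NVDEC decode and scale_npp, downloaded at the end for the null muxer.
bench_gpu_scale() {
  "$FFMPEG" -hide_banner -benchmark -threads 1 \
    -hwaccel nvdec -hwaccel_output_format cuda -i "$1" \
    -vf "scale_npp=1280:720,hwdownload,format=nv12" -an -f null -
}
```

Running both against a representative input and comparing the timing summary each prints at the end gives a first-order answer. It does not say anything about output quality, which needs a separate comparison.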

For more information, refer to the hardware acceleration entry in FFmpeg's wiki.

Warning: Be sure to lower the decoder's thread count to 1. These hwaccels, particularly cuvid (and the nvdec wrapper) do not implement threading support. In fact, they'll throw warnings at you if the thread count exceeds 32. For these decoders, thread count(s) explicitly assume the surface count.

Pass -threads 1 to ffmpeg before input. The argument position of threads is important. In this case, it sets the thread count for the decoder to 1. After the input, it sets the thread count used by FFmpeg's encoders and muxers (if threading is supported) to the configured value.

Also note the -extra_hw_frames 3 parameter passed directly to FFmpeg when using NVDEC. It ensures that the surface pool allocated to the decoder and encoder instances is large enough, which typically matters when other filters, such as yadif_cuda (deinterlacing) or scale_npp, are chained along. See this ticket for more information.
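Putting the two notes above together, a minimal sketch of the option placement could look like this. The helper name nvdec_deint_encode, the FFMPEG variable and the output-side thread count of 2 are placeholders for illustration only.

```shell
# Hypothetical helper; set FFMPEG=echo to print the command instead of running it.
FFMPEG="${FFMPEG:-ffmpeg}"

# Options before -i apply to the input: -threads 1 limits the decoder and
# -extra_hw_frames 3 enlarges the hardware surface pool for the filter chain.
# The second -threads, after -i, applies to the output side instead.
nvdec_deint_encode() {
  "$FFMPEG" -y \
    -threads 1 \
    -hwaccel nvdec -hwaccel_output_format cuda -extra_hw_frames 3 \
    -i "$1" \
    -vf "yadif_cuda" \
    -threads 2 -c:v h264_nvenc -c:a aac "$2"
}
```

For example: nvdec_deint_encode deint-testfile.mkv out.mp4. If FFmpeg complains about running out of surfaces with a longer filter chain, raising the -extra_hw_frames value is the first thing I would try.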

Samples demonstrating the use of hardware-accelerated filtering, encoding and decoding based on the notes above:

1. Demonstrate the use of 1:N encoding with NVENC:

The following assumption is made: The test-bed only has one NVENC-capable GPU present, a simple GTX 1070. For this reason I'm limited to two simultaneous NVENC sessions, and that is taken into account with the snippets below. Be warned that cases needing to utilize multiple NVENC-capable GPUs will need the command line(s) modified as appropriate.

My sample files are in ~/Desktop/src

I'll be working with a sample file as shown below:

ffprobe -i deint-testfile.mkv -show_format -hide_banner -show_streams

Input #0, matroska,webm, from 'deint-testfile.mkv':
  Metadata:
    encoder         : libebml v1.3.3 + libmatroska v1.4.4
    creation_time   : 2016-03-02T23:20:05.000000Z
  Duration: 00:04:56.97, start: 0.066000, bitrate: 31036 kb/s
    Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, top first), 1920x1080 [SAR 1:1 DAR 16:9], 59.94 fps, 59.94 tbr, 1k tbn, 59.94 tbc (default)
    Metadata:
      BPS             : 29131349
      BPS-eng         : 29131349
      DURATION        : 00:04:56.896000000
      DURATION-eng    : 00:04:56.896000000
      NUMBER_OF_FRAMES: 17598
      NUMBER_OF_FRAMES-eng: 17598
      NUMBER_OF_BYTES : 1081122637
      NUMBER_OF_BYTES-eng: 1081122637
      _STATISTICS_WRITING_APP: mkvm
