Unlocking The Power Of CUDA For Video Processing: Insights On GstCUDA

Jafet Chaves
Apr 28, 2023

Updated: Dec 18, 2024

Introduction

By using GstCUDA, developers can avoid the burden of developing GStreamer elements from scratch to wrap around their CUDA algorithms. This allows them to focus on the algorithm's logic itself rather than the boilerplate code, shortening the time required to deliver the product.

GstCUDA is a powerful tool for high-speed video processing. Many industries, including surveillance, healthcare, defense, and transportation, rely on video processing. GstCUDA achieves real-time video processing by harnessing the parallel processing capabilities of NVIDIA GPUs along with zero-copy GStreamer pipeline configurations.

GstCUDA is also very customizable and adaptable, making it appropriate for a wide range of applications. This way developers can utilize GStreamer, the popular open-source multimedia framework, to build custom video analysis pipelines that are suited to their needs.

There are several key reasons to consider GstCUDA

High-Speed Video Processing: GstCUDA combines the power of NVIDIA GPUs with advanced GStreamer zero-copy techniques to enable high-speed video processing. This means you can analyze vast amounts of video data in real time, enabling quick decision-making and improved efficiency.
Flexible and Customizable: GstCUDA is highly customizable and flexible, making it suitable for a wide range of applications. Developers can use the GStreamer framework to create custom video analysis pipelines tailored to their specific needs.
Hardware Acceleration: GstCUDA runs the video processing on the GPU through CUDA, which allows for faster analysis and frees the CPU for other tasks.
Support and Expertise: RidgeRun provides comprehensive support and expertise for GstCUDA, ensuring that you get the most out of the tool. With RidgeRun, you can rely on expert guidance and assistance to help you integrate and optimize GstCUDA for your specific needs.
Multi-Platform Support: GstCUDA is designed to work seamlessly across multiple platforms. This means you can use GstCUDA regardless of your platform (x86 or NVIDIA Jetson), providing greater flexibility and compatibility.
Easy Integration and Prototyping: GstCUDA is easy to integrate into existing video processing pipelines, thanks to its compatibility with GStreamer. This means you can quickly and easily add GstCUDA to your existing video processing workflow without significant disruption or processing penalties.

How can GstCUDA be used then?

GstCUDA can be best understood as a framework to integrate any CUDA-based algorithm into a GStreamer element. The project offers a library and sample GStreamer elements. This allows developers to focus on writing the CUDA algorithm, instead of complex GStreamer boilerplate code.

GstCUDA is typically used as a preprocessing element in a media processing pipeline or as the core processing stage in computer vision applications, although it is not limited to these roles. Thanks to the interoperability of CUDA with popular frameworks like OpenCV and NVIDIA VPI, you can easily integrate your computer vision algorithms into a GStreamer processing pipeline and deliver complex solutions in less time. GstCUDA is implemented to avoid unnecessary memory copies between the elements of the GStreamer pipeline, and it performs CPU-GPU transfers only when they are needed, which keeps processing overhead low.

Additionally, GstCUDA is not limited to simple single-input/single-output algorithms. If your CUDA algorithm has multiple inputs or multiple outputs and you need to integrate it into a GStreamer processing pipeline, GstCUDA has you covered: it has been designed to abstract multiple filter element topologies into different base classes. You can read all the details in our developer's wiki.

Demo Example

To demonstrate the capabilities of the GstCUDA framework, we will show the usage of some of the sample GStreamer elements the project offers. First we will focus on a single-input/single-output example, and then proceed with a more complex multiple-input/single-output processing pipeline.

The examples in this section were run on the following hardware and software setup:

NVIDIA Jetson Xavier NX (developer kit)
JetPack 5.1.1
GStreamer 1.16.3
GstPerf (to measure frame rate and CPU load)
GstCUDA

Single input/single output case

Figure 1 illustrates the pipeline described below. The pipeline uses the cudafilter element, which ships with GstCUDA to allow quick prototyping of CUDA algorithms. The cudafilter element loads a CUDA algorithm dynamically through its "location" property, which takes the path to the CUDA kernel compiled as a shared object. This makes it possible to test a CUDA kernel without any GStreamer programming at all! Once the kernel works properly, it can be migrated to a custom, production-ready GStreamer element by subclassing one of the provided GstCUDA base classes.

cudafilter+opencvwarp GStreamer pipeline example:

gst-launch-1.0 videotestsrc is-live=true ! "video/x-raw,width=640,height=480" ! nvvidconv ! "video/x-raw,width=3840,height=2160,framerate=30/1" ! queue ! opencvwarp demo=true ! cudafilter in-place=true location=gst-cuda/tests/examples/cudafilter_algorithms/gray-scale-filter/gray-scale-filter.so ! nvvidconv ! "video/x-raw(memory:NVMM),format=I420" ! perf ! queue ! nv3dsink sync=false

Figure 1. CUDA filter pipeline representation
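
To give an idea of how small such a prototype algorithm can be, here is a CPU reference sketch in Python of a gray-scale filter. It is not the code of the gray-scale-filter sample, whose source is not shown in this post. It assumes I420 frames and assumes the filter simply neutralizes the chroma planes; the function name is hypothetical. In a CUDA version, each thread would typically write one chroma sample.

```python
def i420_to_gray_in_place(frame: bytearray, width: int, height: int) -> None:
    """Turn an I420 frame gray by neutralizing its chroma planes in place.

    Assumed layout (standard I420): a full-resolution Y plane followed by
    U and V planes, each subsampled by 2 horizontally and vertically.
    """
    if width % 2 or height % 2:
        raise ValueError("I420 requires even width and height")

    luma_size = width * height
    chroma_size = (width // 2) * (height // 2)
    if len(frame) != luma_size + 2 * chroma_size:
        raise ValueError("buffer size does not match an I420 frame")

    # The Y plane already holds the gray image, so it is left untouched.
    # A chroma value of 128 means "no color" for both U and V.
    frame[luma_size:] = bytes([128]) * (2 * chroma_size)


# Tiny 4x4 frame: 16 luma bytes followed by 4 U bytes and 4 V bytes.
demo_frame = bytearray(range(24))
i420_to_gray_in_place(demo_frame, 4, 4)
```

Because the frame is modified in place, no second buffer is needed, which is the same idea behind the in-place=true setting used in the pipeline.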

Figure 2 shows the grayscale, warped output of the example pipeline at different instants. Note that this pipeline runs stably at 4K@30 fps, with no particular optimization in the pipeline description.

Figure 2. OpenCV + CUDA warp and grayscale filter pipeline output
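
To put the 4K@30 fps figure in perspective, the short Python sketch below estimates how much raw data flows through the pipeline. It assumes I420 frames (the format named in the final caps of the pipeline), which take 12 bits per pixel; the helper names are only for illustration.

```python
def i420_frame_bytes(width: int, height: int) -> int:
    """Size of one I420 frame: full Y plane plus two quarter-size chroma planes."""
    return width * height * 3 // 2


def stream_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Raw data rate of an uncompressed I420 stream."""
    return i420_frame_bytes(width, height) * fps


frame_4k = i420_frame_bytes(3840, 2160)            # 12,441,600 bytes per frame
rate_4k = stream_bytes_per_second(3840, 2160, 30)  # 373,248,000 bytes per second

print(f"4K I420 frame: {frame_4k / 1e6:.1f} MB")
print(f"4K@30 stream:  {rate_4k / 1e6:.1f} MB/s")
```

Every extra full-frame copy between elements would add roughly this much memory traffic again, which is why avoiding unnecessary copies matters at this resolution.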

Multiple instances of GstCUDA elements can easily be cascaded as well. The cudafilter element can be used to implement any processing filter that follows a single-input/single-output topology, for example:

Debayering.
Image denoising.
Image warping.
Edge detection.
Image scaling and enhancing.
Image deblurring.
Lens distortion correction.

Multiple input/single output case

Similarly, cudamux is a quick-prototyping utility element that allows loading a CUDA algorithm dynamically into a pipeline, without any GStreamer programming at all. Again, once the kernel is finished, it can be migrated to its own production-ready element by subclassing one of GstCUDA's provided base classes.

Figure 3 shows a representation of this pipeline. The pipeline takes two different input patterns from videotestsrc elements and combines them through cudamux into a single image with the same resolution as each input. In a later stage the cudamux output image is warped. Note also that any of the cudamux input streams could itself be chained with a cudafilter, for example after upscaling.

cudamux example:

gst-launch-1.0 cudamux name=mux in-place=true location=gst-cuda/tests/examples/cudamux_algorithms/mixer/mixer.so videotestsrc is-live=true ! "video/x-raw,width=640,height=480,format=I420,framerate=30/1" ! nvvidconv ! "video/x-raw,width=2560,height=1440,format=I420,framerate=30/1" ! queue ! mux.sink_0 videotestsrc is-live=true pattern=ball ! "video/x-raw,width=640,height=480,format=I420,framerate=30/1" ! nvvidconv ! "video/x-raw,width=2560,height=1440,format=I420,framerate=30/1" ! queue ! mux.sink_1 mux. ! opencvwarp demo=true ! nvvidconv ! "video/x-raw(memory:NVMM),format=I420" ! perf ! queue ! nv3dsink sync=false

Figure 3. CUDA muxer pipeline representation
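
The cudamux one-liner is easier to read when split into its parts: the mux element, one branch per input, and the output chain. The Python sketch below rebuilds exactly the command shown above from those parts. Every element, property, and path comes from that command; only the Python helper and variable names are introduced here for illustration.

```python
MIXER = "gst-cuda/tests/examples/cudamux_algorithms/mixer/mixer.so"
SRC_CAPS = '"video/x-raw,width=640,height=480,format=I420,framerate=30/1"'
MUX_CAPS = '"video/x-raw,width=2560,height=1440,format=I420,framerate=30/1"'


def input_branch(sink_pad: str, pattern: str = "") -> str:
    """One test source, upscaled by nvvidconv and linked to a cudamux sink pad."""
    source = "videotestsrc is-live=true"
    if pattern:
        source += f" pattern={pattern}"
    return " ! ".join(
        [source, SRC_CAPS, "nvvidconv", MUX_CAPS, "queue", f"mux.{sink_pad}"]
    )


# The mux element is declared first so that both branches can refer to it by name.
mux = f"cudamux name=mux in-place=true location={MIXER}"

# Everything downstream of the mux: warp, convert to NVMM memory, measure, display.
output = " ! ".join(
    [
        "mux.",
        "opencvwarp demo=true",
        "nvvidconv",
        '"video/x-raw(memory:NVMM),format=I420"',
        "perf",
        "queue",
        "nv3dsink sync=false",
    ]
)

command = "gst-launch-1.0 " + " ".join(
    [mux, input_branch("sink_0"), input_branch("sink_1", pattern="ball"), output]
)
print(command)
```

Adding a third input would, under the same assumptions, only require one more input_branch call, provided the loaded algorithm supports it.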

Figure 4 shows the resulting warped image, which is composed of the ball pattern and the color bars pattern from the two videotestsrc inputs. This pipeline runs stably at 2K@30 fps (2560x1440 in the pipeline description), with no particular optimization in the pipeline description.

Figure 4. cudamux and OpenCV + CUDA warp pipeline output
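
As a rough mental model of a two-input algorithm, the Python sketch below blends two frames of equal size on the CPU. This post does not describe how mixer.so combines its inputs, so the equal-weight average used here is purely an assumption, and the function name is hypothetical. In CUDA, each thread would typically compute one output byte.

```python
def blend_frames(frame_a: bytes, frame_b: bytes) -> bytes:
    """Equal-weight blend of two frames with identical format and size.

    Assumption: the combination is a per-byte rounded average. The real
    mixer sample may combine its inputs differently.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("both inputs must have the same size")
    # Adding 1 before the integer division rounds halves up.
    return bytes((a + b + 1) // 2 for a, b in zip(frame_a, frame_b))


# Three sample bytes from each input; the result is their rounded average.
mixed = blend_frames(bytes([0, 100, 255]), bytes([0, 200, 255]))
```

The size check mirrors what the pipeline does with caps: both inputs are scaled to the same resolution and format before they reach the mux.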

The cudamux element can be used to implement any processing filter that follows a multiple-input/single-output topology, for example:

Image registration.
Image stitching.
Image colorization.
Stereo depth estimation.
Image inpainting.
Image fusion.

Closing Thoughts

Overall, GstCUDA offers a modular and straightforward solution. Its high-level architecture, library, and robust capabilities make it possible to incorporate hardware-accelerated video processing into your product quickly and simply, without having to deal with the difficulties of integrating CUDA kernels into application code or with GStreamer programming. Whether you are working on a video streaming service, a video analytics platform, or any other kind of video-based product, GstCUDA can help you achieve faster and more efficient processing, which improves performance and user experience.

What’s Next?

Find out more in our developer's wiki:

https://developer.ridgerun.com/wiki/index.php/GstCUDA

Check out these other products that use GstCUDA:

CUDA Undistort: CUDA Accelerated GStreamer Camera Undistort.
CUDA Stitcher: Image Stitching for NVIDIA Jetson.
CUDA ISP: Image signal processing on NVIDIA GPUs.

For technical questions, or to request a free evaluation version of the plugin, please send an email to support@ridgerun.com or send a message through https://www.ridgerun.com/contact.