
Improving Latency on the Holoscan Sensor Bridge with CUDA ISP

  • Writer: Luis G. Leon-Vega
  • Dec 13, 2024
  • 4 min read

Updated: Mar 13

Lower the glass-to-glass latency by using CUDA ISP in your Holoscan Sensor Bridge video pipeline

In previous blog posts, we have covered the NVIDIA Holoscan Sensor Bridge, from a brief introduction to measuring the glass-to-glass latency of an example video capture and live display implementation. This time, we propose an optimisation that shortens the video pipeline by replacing the image signal processing stages with our in-house CUDA ISP solution.


If you want an overview of the platform and how to get started, please visit our other blog posts:



What is CUDA ISP?



RidgeRun CUDA Image Signal Processing Library

CUDA ISP is a library for image signal processing that provides GPU-accelerated debayering, binary shifting, and white balancing, along with CUDA buffer allocators. This makes it easy to integrate into multimedia applications and computer vision algorithms. The library provides an intuitive C++ API that can be seamlessly integrated into existing workflows and is GStreamer-friendly, making it ideal for streaming applications, computer vision, and image analysis.
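
As a rough sketch of what one of these primitives boils down to, the CUDA kernel below illustrates the binary-shifting idea: RAW sensor samples (for example, 10-bit values stored in 16-bit words) are shifted into the full 16-bit range before debayering. This is our own illustration under those assumptions, not the CUDA ISP API; all names and the shift value are ours.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Shift RAW samples stored in 16-bit words into the full 16-bit range
// (e.g. shift = 6 promotes 10-bit samples) before debayering.
__global__ void binary_shift(const uint16_t* in, uint16_t* out,
                             int num_pixels, int shift) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_pixels) {
    out[idx] = (uint16_t)(in[idx] << shift);
  }
}

int main() {
  const int num_pixels = 3840 * 2160;  // one 4K IMX274 frame
  uint16_t *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, num_pixels * sizeof(uint16_t));
  cudaMalloc(&d_out, num_pixels * sizeof(uint16_t));

  const int block = 256;
  const int grid = (num_pixels + block - 1) / block;
  binary_shift<<<grid, block>>>(d_in, d_out, num_pixels, 6);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```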



What's the Baseline Performance of Holoscan Sensor Bridge?


The baseline performance of the Holoscan Sensor Bridge is taken from the Linux IMX274 example provided in the GitHub repository, which executes the following pipeline:


Baseline Image Signal Processing Pipeline

In this pipeline, the receiver stage is handled purely by the Holoscan Sensor Bridge, which transmits the data to the Jetson through UDP Linux sockets. The Image Processor is CUDA-accelerated through custom CUDA kernels, mainly performing white and black balancing. The Bayer Demosaic is an operator already present in the Holoscan Framework, based on the NPP library. The Gamma Correction is also a custom CUDA kernel.
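
For reference, a gamma-correction stage of this kind typically reduces to a per-pixel power law. The kernel below is a minimal, generic sketch of such a stage operating on normalised float channel values; it is not the actual kernel shipped with the Holoscan example, and the gamma value is only the common 2.2 default.

```cpp
#include <cuda_runtime.h>
#include <math.h>

// Generic per-pixel gamma correction: out = in^(1/gamma), on values in [0, 1].
__global__ void gamma_correct(const float* in, float* out, int num_values,
                              float gamma) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_values) {
    out[idx] = powf(in[idx], 1.0f / gamma);
  }
}

// Launch example for a 4K RGBA frame with the common gamma of 2.2:
//   int n = 3840 * 2160 * 4;
//   gamma_correct<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.2f);
```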


For this baseline pipeline, the glass-to-glass latency is 41.61 ms when using all the optimisations and the maximum power mode available on the AGX Orin.



Optimizing the Pipeline with CUDA ISP


CUDA ISP integrates an outstanding algorithm for colour correction and auto white balancing in RGB space. It adjusts the histogram of each colour channel within a confidence interval, leading to more complete colour balancing.
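
To make the idea concrete, the sketch below shows one simplified way a histogram-stretch white balance can be organised on the GPU: build a per-channel histogram, derive low/high cut points from a confidence interval (done on the host here for brevity), and linearly stretch each channel between them. This is our own simplification over 8-bit RGB, not CUDA ISP's actual implementation.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Per-channel 256-bin histogram of an interleaved 8-bit RGB image.
__global__ void rgb_histogram(const uint8_t* rgb, int num_pixels,
                              unsigned int* hist /* 3 x 256 bins */) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_pixels) {
    for (int c = 0; c < 3; ++c) {
      atomicAdd(&hist[c * 256 + rgb[3 * idx + c]], 1u);
    }
  }
}

// Linearly stretch each channel between its low/high cut points
// (lo and hi are device arrays of 3 floats each).
__global__ void stretch_channels(uint8_t* rgb, int num_pixels,
                                 const float* lo, const float* hi) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_pixels) {
    for (int c = 0; c < 3; ++c) {
      float v = (rgb[3 * idx + c] - lo[c]) * 255.0f / (hi[c] - lo[c]);
      rgb[3 * idx + c] = (uint8_t)fminf(fmaxf(v, 0.0f), 255.0f);
    }
  }
}

// Host helper: per channel, find the intensities that leave a 'clip'
// fraction of pixels outside on each side (e.g. clip = 0.01 for a
// 1%-99% confidence interval).
void cut_points(const unsigned int* hist, int num_pixels, float clip,
                float lo[3], float hi[3]) {
  const unsigned long long target =
      (unsigned long long)(clip * (double)num_pixels);
  for (int c = 0; c < 3; ++c) {
    unsigned long long cum = 0;
    int low = 0;
    for (int v = 0; v < 256; ++v) {
      cum += hist[c * 256 + v];
      if (cum >= target) { low = v; break; }
    }
    cum = 0;
    int high = 255;
    for (int v = 255; v >= 0; --v) {
      cum += hist[c * 256 + v];
      if (cum >= target) { high = v; break; }
    }
    lo[c] = (float)low;
    hi[c] = (float)(high > low ? high : low + 1);
  }
}
```

In a production implementation the cut-point search would also run on the GPU; the separation here is only to keep the sketch short.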


Recalling the baseline pipeline, our optimisation consists of dropping the ISP Processor block and replacing the Gamma Correction block with the CUDA ISP block. Internally, the CUDA ISP block performs the following operations:


  1. Downsample the image from RGBA64 to RGBA32

  2. Auto-White Balancing

  3. Upsample the image from RGBA32 to RGBA64


Operations 1 and 3 are needed because of the surrounding processing blocks: Holoviz requires RGBA64, and the Bayer Demosaic is configured to output RGBA64 rather than RGBA32. For simplicity, we propose removing the ISP Processor, replacing the Gamma Correction, and performing the down/upsampling. We will cover RGBA64 compatibility with CUDA ISP in future blog posts.
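
The two extra hops are simple bit-depth conversions. As an illustration (our own sketch, not the exact conversion used in the example), dropping and restoring the low byte of each 16-bit channel looks like this, which also shows where the potential quality loss comes from:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// RGBA64 -> RGBA32: keep only the high byte of each 16-bit channel.
__global__ void rgba64_to_rgba32(const uint16_t* in, uint8_t* out,
                                 int num_channels) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_channels) out[idx] = (uint8_t)(in[idx] >> 8);
}

// RGBA32 -> RGBA64: place the 8-bit value back in the high byte; the low
// byte is lost, which is the source of the possible quality degradation.
__global__ void rgba32_to_rgba64(const uint8_t* in, uint16_t* out,
                                 int num_channels) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < num_channels) out[idx] = (uint16_t)in[idx] << 8;
}
```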


CUDA ISP Based Image Signal Processing Pipeline

The optimized pipeline is shown above, highlighting the removal of the ISP Processor block and the replacement of the Gamma Correction block. Each of these blocks executes in parallel for pipeline-like acceleration, so removing one of the blocks shortens the frame processing time (latency), and further optimizing any of the remaining blocks will also decrease the latency.
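
As a loose illustration of the pipelining idea only (Holoscan schedules its operators itself, so this is not how the example is actually wired), the CUDA sketch below overlaps the stages of several in-flight frames using streams; every name in it is ours.

```cpp
#include <cuda_runtime.h>

// Placeholder stage kernels standing in for Bayer Demosaic, CUDA ISP, etc.
__global__ void stage_demosaic(float* frame) { /* placeholder stage */ }
__global__ void stage_cuda_isp(float* frame) { /* placeholder stage */ }

int main() {
  const int kFrames = 4;                    // frames in flight
  const int kFrameSize = 3840 * 2160 * 4;   // RGBA values per 4K frame
  cudaStream_t streams[kFrames];
  float* frames[kFrames];

  for (int i = 0; i < kFrames; ++i) {
    cudaStreamCreate(&streams[i]);
    cudaMalloc(&frames[i], kFrameSize * sizeof(float));
  }

  // Each frame gets its own stream, so the stages of different frames can
  // overlap: removing a stage shortens the per-frame latency, and speeding
  // up any stage shortens it further.
  for (int i = 0; i < kFrames; ++i) {
    stage_demosaic<<<1024, 256, 0, streams[i]>>>(frames[i]);
    stage_cuda_isp<<<1024, 256, 0, streams[i]>>>(frames[i]);
  }
  cudaDeviceSynchronize();

  for (int i = 0; i < kFrames; ++i) {
    cudaStreamDestroy(streams[i]);
    cudaFree(frames[i]);
  }
  return 0;
}
```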


For more information about CUDA ISP, visit our developer wiki.


Final Results


For this experiment, we use the Holoscan Sensor Bridge and the NVIDIA Orin with the IMX274 Sensor. For the entire setup, you can visit our previous blog: Glass to Glass Assessment of the Holoscan Sensor Bridge on an NVIDIA AGX Orin.


| Monitor and Display Mode | Power Profile / Jetson Clocks | Image Signal Processing Pipeline | Glass-to-Glass Latency |
| --- | --- | --- | --- |
| Display Port / Exclusive Display | MAXN / Enabled | Debayer only | 35.95 ms |
| Display Port / Exclusive Display | MAXN / Enabled | Baseline Pipeline | 41.61 ms |
| Display Port / Exclusive Display | MAXN / Enabled | CUDA ISP Pipeline + Histogram Stretch Auto White Balance | 37.93 ms |
| Display Port / Exclusive Display | MAXN / Enabled | CUDA ISP Pipeline + GrayWorld Auto White Balance | 49.76 ms |


According to the table, the image signal processing pipeline optimisation with CUDA ISP achieves better results than the baseline pipeline provided in the Holoscan Sensor Bridge example for the IMX274, improving the glass-to-glass latency by 8.8% (from 41.61 ms to 37.93 ms) when using the Histogram Stretch Auto White Balance module. Moreover, without any refinements, the glass-to-glass latency with only debayering is almost 36 ms, offering low-latency capture that is ideal for medical, robotics, and other critical applications.



Further Improvement


The downsampling and upsampling for the RGBA64-RGBA32 conversion are required because of a compatibility limitation in CUDA ISP, and they may sacrifice some image quality. The next step is to add RGBA64 support to CUDA ISP, which will come in the near future.


On the other hand, further improvement can be achieved by offloading the image signal processing to the FPGA, reducing the load on the Jetson. The FPGA can potentially reduce the latency thanks to the dataflow execution pattern offered by FPGA hardware acceleration. RidgeRun is exploring new ways to keep the latency at 50 ms or less for critical applications by optimizing ISP algorithms and using FPGAs.




Important Remarks


  • For stereo capture, the processing platform (i.e. the NVIDIA Jetson) must have two separate network interfaces, given that each camera stream is delivered on a separate port.

  • The NVIDIA Jetson AGX Orin developer kit does not include a DPDK-compatible network card, so the pipeline falls back to the Linux socket system, which increases the glass-to-glass latency. To use DPDK, it is necessary to connect a DPDK-compatible card or to use a custom carrier board with a compatible NIC.


Expect More Information From Us


At RidgeRun, we are experts in image signal processing, algorithm optimisation, and hardware acceleration. If you want to know more about how to leverage this technology in your project, Contact Us.


