In this blog we'll cover the process of creating a Birds Eye View (BEV) scene from a collection of cameras. For that, we'll be using Libpanorama: a new library for efficient image transformations developed by RidgeRun. By the end of this article you'll have a general idea of:
How BEV images work
How easy it is to generate them with Libpanorama
How Libpanorama can be used for other image transformation pipelines
What are Birds Eye View Images?
The easiest way to explain what a Birds Eye View image is, is with a picture.
As you can imagine by now, the image on the right is the BEV. It is an aerial view of the scene: what you would see if you were a bird flying over it, looking down. The BEV is a virtual composition formed by a collection of images that capture the surroundings. It is interesting to notice that none of the cameras is actually facing down. The aerial view is generated from the limited floor information provided by each camera, after performing a carefully crafted perspective transformation.
The cameras must have some overlap between them. This allows the system to generate a full BEV image, without patches of missing information. The center of a BEV image, where the object of interest sits, is typically a black region: since no camera faces the object itself, we effectively have a void of pixels there. Commercial applications typically overlay an avatar of the object on top of it; in this case, the black car.
Birds Eye View images have become very popular, especially among terrestrial vehicles. It is common to see them on modern cars, where they help the driver judge the car's dimensions and the obstacles around it. Autonomous robots also use them to simplify collision avoidance and path planning algorithms. Lately, heavy machinery vehicles use them to get a complete view of their surroundings and avoid accidents, something that would otherwise be impossible given the size of these systems.
Generating Birds Eye View Scenes
The general process of creating a Birds Eye View image can be easily understood. In practice, there are several more details to take care of, but we'll cover some of them later. The main idea behind classical BEV systems is the Inverse Perspective Mapping (or IPM, for short). The following figure exemplifies this concept.
Figure 2. Illustration of the Inverse Perspective Transformation. Taken from here.
I'll spare you the math since it's out of the scope of this article. The following figure shows the IPM process in the form of an image processing chain.
The process is self-explanatory, but it's worth focusing on the third step: the lens distortion removal. The IPM needs to be performed on rectilinear images. This means that, in order to use fisheye lens cameras, you need to remove this distortion first. Even if the image was not captured using a fisheye lens, perspective cameras still have some slightly noticeable curvature that may be corrected. Rectifying these images results in higher quality BEV images.
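As a rough illustration of this step only (using OpenCV rather than Libpanorama, and with placeholder intrinsics and distortion coefficients), the rectification could look something like this:

```cpp
// Illustrative sketch: removing lens distortion before the IPM.
// The camera matrix and distortion coefficients are placeholders; in practice
// they come from a per-camera calibration. For strong fisheye lenses you
// would use the cv::fisheye module instead.
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat frame = cv::imread("camera_front.png");

    // Placeholder intrinsics (fx, fy, cx, cy) and distortion coefficients.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 640, 0, 640,
                                             0, 640, 360,
                                             0,   0,   1);
    cv::Mat dist = (cv::Mat_<double>(1, 5) << -0.30, 0.09, 0.0, 0.0, 0.0);

    // Produce a rectilinear image: straight lines in the world stay straight.
    cv::Mat rectified;
    cv::undistort(frame, rectified, K, dist);

    cv::imwrite("camera_front_rectified.png", rectified);
    return 0;
}
```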
The next interesting step is the fourth one: the perspective transform. The following image shows the sub-process.
Notice how, after the perspective transform, the image is enlarged. This typically corrects the BEV aspect ratio so that the resulting image has the natural dimensions of the scene. Finally, the resulting image is a cropped sub-section. This serves two purposes: on the one hand, it removes sections where there is no information available and, on the other hand, it removes the parts of the image where stretching is too noticeable due to extreme pixel interpolation. Both of these effects are caused by the IPM process.
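To make the warp-enlarge-crop idea concrete, here is a minimal sketch using OpenCV (again, not Libpanorama); the homography and the crop rectangle are placeholder values that would normally come from calibration:

```cpp
// Illustrative sketch: apply a perspective transform into an enlarged canvas,
// then keep only the sub-section that contains useful, non-stretched pixels.
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat rectified = cv::imread("camera_front_rectified.png");

    // Placeholder homography; a real one is estimated during calibration.
    cv::Mat H = (cv::Mat_<double>(3, 3) << 1.0, -0.4,   120,
                                           0.0,  0.6,    80,
                                           0.0, -0.001, 1.0);

    // Warp into a larger canvas so the stretched ground plane fits.
    cv::Mat warped;
    cv::warpPerspective(rectified, warped, H, cv::Size(2560, 1440));

    // Crop away the empty and heavily interpolated regions (placeholder ROI).
    cv::Mat bevPatch = warped(cv::Rect(480, 360, 1600, 900)).clone();

    cv::imwrite("bev_patch.png", bevPatch);
    return 0;
}
```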
Finally, you may be wondering how this perspective transformation is found. Don't worry, we'll cover that in a moment.
Introducing Libpanorama
You may have noticed how, from the original capture to the final BEV, the image undergoes several transformations. It would be great if these transformations could be combined together instead of performing each of them one by one. Moreover, you may also have noticed that the resulting BEV image consists only of a small, internal portion of the original capture. It would be ideal if we could only process the pixels that are actually needed by the BEV, ignoring the rest and saving precious processing time. That is precisely what Libpanorama was designed for.
Libpanorama is a library designed to perform efficient image transformations by composing mapping stages and processing only what is required, when it is required.
Libpanorama is written in C++ and heavily relies on template meta-programming and static polymorphism to delegate as much processing as possible to the compiler, rather than to runtime. A typical Libpanorama application consists of the following high level parts:
Image capture / read
Parameter loading / definition
Transformation map generation
Image warping
Image display / write
Programmatically, these parts would look something like the following (error handling left out for simplicity):
1. Image Capture / Read
Images and maps are represented by an lp::Image. This is a generic container capable of handling different content types (RGBA and Gray8 for the time being). Images can be read from a file using an IO object like lp::io::Qt, as in the example above. This is shown from lines 8 to 10.
Reading images from files is intended mostly for prototypes. More complex applications will use more specialized IO objects that interact directly with 3rd party frameworks like GStreamer, V4L2, NVIDIA Libargus, etc...
2. Parameter Loading / Definition
Parameters control the behavior of the transformation chain. They consist of a structure that, internally, hosts the configuration for each processing stage. Parameters may be modified at runtime.
More often than not, you may want to save parameters for later use, for example after a successful Birds Eye View calibration. For those scenarios, JSON serializers and deserializers (SerDe) are provided. As the parameters may be extended by the user, the serialization policy for the struct is defined externally, so that these extra fields are properly saved and restored as well.
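To illustrate what externally-defined serialization means in general, here is a small sketch using nlohmann::json; the struct, its fields and the library choice are assumptions made for the example, not Libpanorama's actual SerDe:

```cpp
// Illustrative sketch: a parameter struct whose JSON serialization policy
// lives outside the struct, so user-added fields can follow the same pattern.
#include <array>
#include <fstream>
#include <nlohmann/json.hpp>

struct BevParams {                 // hypothetical parameter struct
    double rotation_deg = 0.0;
    double scale = 1.0;
    std::array<double, 2> translation{0.0, 0.0};
};

// Serialization policy defined externally to the struct.
void to_json(nlohmann::json &j, const BevParams &p) {
    j = {{"rotation_deg", p.rotation_deg},
         {"scale", p.scale},
         {"translation", p.translation}};
}

void from_json(const nlohmann::json &j, BevParams &p) {
    j.at("rotation_deg").get_to(p.rotation_deg);
    j.at("scale").get_to(p.scale);
    j.at("translation").get_to(p.translation);
}

int main() {
    // Save calibrated parameters...
    BevParams params{12.5, 0.8, {40.0, -15.0}};
    std::ofstream out("bev_params.json");
    out << nlohmann::json(params).dump(2);
    out.close();

    // ...and restore them later.
    std::ifstream in("bev_params.json");
    BevParams restored = nlohmann::json::parse(in).get<BevParams>();
    (void)restored;
    return 0;
}
```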
3. Transformation Map Generation
This is arguably the most important step of the process. Using the parameters and a transformation recipe, a dispatcher generates the image warp maps for the X and Y coordinates (lines 22 to 27). Typically, these maps don't change from frame to frame (hence the reason to decouple this step from the remap stage). The recipe is designed so that two very important things are true:
The steps in the processing chain (the recipe) are defined at compile time and combined into a single processing unit.
Only the pixels that need to be processed will be processed.
Libpanorama provides pre-defined recipes, such as Birds Eye View, Panorama, Fisheye Rectification, etc... The example above uses the Birds Eye View recipe (line 27). However, it is completely possible (and even encouraged) for users to define their own chains. For example, a simple toy recipe that rotates an image by a certain number of degrees can be defined as:
Forgive the slightly noisy notation; there is a reason for it: having the compiler perform much of the heavy lifting! The chain, shown in lines 7 to 11, defines a 3-step process:
Normalize and center the image: Rotate the image by its center, rather than the upper left corner.
Actually perform the rotation.
Denormalize back to the image coordinate system.
While this may seem overkill for a simple image rotation, it absolutely pays off on more complex chains (which real-world chains are).
These steps are known as kernels and serve as building blocks for complex processing chains. Libpanorama provides many out-of-the-box kernels ready to be used in your own custom chain. Many use cases won't require such low-level programming and will instead take advantage of the predefined recipes.
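To make the idea of compile-time composition more concrete, here is a small, self-contained sketch. This is not Libpanorama's API; every type and name below is made up for illustration. Small coordinate kernels (center, rotate, un-center) are composed into a single destination-to-source mapping that the compiler can inline, and that mapping is then sampled once into X/Y warp maps:

```cpp
// Illustrative sketch only: compile-time composition of coordinate kernels.
#include <cmath>
#include <tuple>
#include <vector>

struct Point { float x, y; };

// Each kernel maps a destination coordinate to the coordinate it should be
// read from in the previous stage.
struct Center {            // move the origin to the image center
    float cx, cy;
    Point operator()(Point p) const { return {p.x - cx, p.y - cy}; }
};
struct Rotate {            // rotate around the origin by `deg`
    float c, s;
    explicit Rotate(float deg)
        : c(std::cos(deg * 3.14159265f / 180.f)),
          s(std::sin(deg * 3.14159265f / 180.f)) {}
    Point operator()(Point p) const {
        return {c * p.x - s * p.y, s * p.x + c * p.y};
    }
};
struct Uncenter {          // move the origin back to the top-left corner
    float cx, cy;
    Point operator()(Point p) const { return {p.x + cx, p.y + cy}; }
};

// Compose kernels into one callable; the whole chain can be inlined.
template <typename... Kernels>
struct Chain {
    std::tuple<Kernels...> kernels;
    Point operator()(Point p) const {
        std::apply([&p](const auto &...k) { ((p = k(p)), ...); }, kernels);
        return p;
    }
};

int main() {
    const int width = 1280, height = 720;
    Chain<Center, Rotate, Uncenter> chain{std::make_tuple(
        Center{width / 2.f, height / 2.f}, Rotate{15.f},
        Uncenter{width / 2.f, height / 2.f})};

    // Sample the composed mapping once into X/Y maps, one entry per output pixel.
    std::vector<float> mapX(width * height), mapY(width * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            Point src = chain({float(x), float(y)});
            mapX[y * width + x] = src.x;
            mapY[y * width + x] = src.y;
        }
    return 0;
}
```

Because the maps depend only on the parameters, they are computed once and reused for every frame, which is exactly why the map generation and the remap stages are decoupled.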
4. Image Warping
The maps previously generated are passed to a remap object. Its job is to move pixels from the source image to the destination image as indicated by the maps, interpolate inter-pixel colors and discard pixels that fall out of the rendering section. This step is performed for every new frame, while the maps can be recycled.
Of course, Libpanorama provides specialized remap objects that make use of HW acceleration units available on the different platforms. For example, on NVIDIA Jetson boards, you would use the NPP (CUDA) remapper.
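The operation itself is conceptually the same as OpenCV's cv::remap. The snippet below is an illustration only (it uses an identity map so it stays self-contained), not the Libpanorama remapper:

```cpp
// Illustrative sketch: the warp step, expressed with cv::remap.
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat src = cv::imread("camera_front.png");

    // mapX/mapY would normally come from the dispatcher; here they encode a
    // trivial identity mapping to keep the example self-contained.
    cv::Mat mapX(src.size(), CV_32FC1), mapY(src.size(), CV_32FC1);
    for (int y = 0; y < src.rows; ++y)
        for (int x = 0; x < src.cols; ++x) {
            mapX.at<float>(y, x) = static_cast<float>(x);
            mapY.at<float>(y, x) = static_cast<float>(y);
        }

    // For every destination pixel, read the source location given by the maps
    // and interpolate between neighboring source pixels.
    cv::Mat dst;
    cv::remap(src, dst, mapX, mapY, cv::INTER_LINEAR);

    cv::imwrite("warped.png", dst);
    return 0;
}
```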
5. Image Display / Write
Finally, the resulting image is ready to be consumed. For prototyping stages, Libpanorama provides the IO classes to open a window and visualize the result, or to write the content to an image file. In more complex use cases you would use specialized IO objects that push this result to other frameworks, like GStreamer, V4L2, etc...
RidgeRun is actively developing Libpanorama at the time of this writing so not all the mentioned features are production ready. Stay tuned for updates!
Birds Eye View in Libpanorama
The example application code shown in the previous section performs the IPM over a single image. It needs to be slightly extended to combine all the available images. A simplified version can be seen below:
The main difference is that a pair of maps is generated for each image, which has its own set of parameters. Next, these maps are used to transform that specific image to the destination image, which remains the same for all input images.
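As a rough sketch of that idea (written with OpenCV instead of the actual Libpanorama listing, and with placeholder file names), each camera gets its own pair of maps and is rendered into the same destination canvas:

```cpp
// Illustrative sketch: one pair of maps per camera, all cameras rendered
// into the same shared destination image.
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

int main() {
    const cv::Size bevSize(1080, 1920);
    cv::Mat bev(bevSize, CV_8UC3, cv::Scalar::all(0));  // shared destination

    const std::vector<std::string> files = {
        "cam0.png", "cam1.png", "cam2.png", "cam3.png", "cam4.png", "cam5.png"};

    for (const auto &file : files) {
        cv::Mat frame = cv::imread(file);

        // In the real pipeline these maps are generated per camera from its
        // own calibrated parameters; here we load precomputed ones for brevity.
        cv::Mat mapX, mapY;
        cv::FileStorage fs(file + ".maps.yml", cv::FileStorage::READ);
        fs["mapX"] >> mapX;
        fs["mapY"] >> mapY;

        // BORDER_TRANSPARENT leaves untouched the destination pixels whose
        // source coordinates fall outside the input image, so each camera
        // only paints its own region of the canvas.
        cv::remap(frame, bev, mapX, mapY, cv::INTER_LINEAR,
                  cv::BORDER_TRANSPARENT);
    }

    cv::imwrite("birds_eye_view.png", bev);
    return 0;
}
```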
The BirdsEyeView Chain Recipe
The Birds Eye View recipe provided by Libpanorama is defined as follows:
Now that you are familiar with transformation chains, it should be easier for you to grasp what is going on here. The steps, read from the bottom to the top, are:
(line 9) Move the image back to image coordinates.
(line 8) Perform a correction for the input size aspect-ratio. This is necessary to avoid image deformations when moving from one aspect ratio to another.
(line 7) Perform the perspective transform.
(line 6) Rotate the image, if necessary.
(line 5) Move the image to the correct location, if necessary.
(line 4) Perform a correction for the output size aspect-ratio. Required for the same reason as in step 2.
(line 3) Only paint the result in the given portion of the output image (avoid overlapping results).
(line 2) Center and normalize the resulting image.
It may seem counter-intuitive that the chain is defined in reverse order; however, this is standard practice in image processing. The full justification falls out of the scope of this reading, but in short, mapping backwards from destination pixels to source pixels guarantees that every output pixel receives a value, with no holes. For the time being, just know that chains are defined backwards, starting on the destination image and ending on the source image.
The following image shows the result of applying the Birds Eye View chain to a single image. Notice the effect of the ROI and how we used the rotation and translation steps to position the result in the appropriate location.
Calibrating the System
Calibrating the system simply means finding the appropriate set of parameters that correctly produce the desired Birds Eye View scene. In our case, we need to find the parameters for:
Perspective transform
Rotation
Translation
Scaling
The remainder of the parameters are taken directly from the input and output image sizes and, hence, don't require calibration.
For the remainder of the article, I'll be using six images taken from cameras installed in a test setup in my office space, as shown in the following image:
To perform this calibration, we make use of the examples/birds_eye_view.cpp example application included with Libpanorama. By far, the most interesting step in the calibration process is the IPM. This step requires a square reference object, typically a chessboard. In our scenario, we can leverage the square tiles on the floor. The following figure shows the result of performing the IPM calibration on one of the images.
On the top image, a green polygon can be seen. That polygon is composed of 4 points that the user defines by clicking on the screen. These points must enclose the reference square object (in our case, the tiles). Once the 4 points are defined, they are used to estimate the aerial view, as shown in the bottom image.
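Conceptually, those four clicks are enough to compute the IPM homography: the clicked quadrilateral is mapped onto an axis-aligned square of a chosen size. The sketch below shows the idea with OpenCV (the point coordinates are placeholders standing in for the user clicks, and this is not the example application's code):

```cpp
// Illustrative sketch: turn four clicked corners of a floor tile into the
// perspective transform that produces the aerial view.
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Corners of the reference tile as clicked in the camera image
    // (placeholder values), ordered TL, TR, BR, BL.
    std::vector<cv::Point2f> clicked = {
        {512.f, 402.f}, {790.f, 415.f}, {905.f, 640.f}, {430.f, 618.f}};

    // The same tile as it should appear from above: a perfect square.
    const float side = 200.f;  // pixels per tile in the aerial view (arbitrary)
    std::vector<cv::Point2f> square = {
        {0.f, 0.f}, {side, 0.f}, {side, side}, {0.f, side}};

    // The transform mapping the clicked quadrilateral onto the square is the
    // Inverse Perspective Mapping (up to scale and translation).
    cv::Mat H = cv::getPerspectiveTransform(clicked, square);

    cv::Mat frame = cv::imread("camera_front_rectified.png"), aerial;
    cv::warpPerspective(frame, aerial, H, cv::Size(1600, 1600));
    cv::imwrite("aerial_preview.png", aerial);
    return 0;
}
```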
The translation, scale and rotation parameters are calibrated using a more manual process. Basically, the resulting IPM images are rendered onto the destination image and the user places, rotates and scales them appropriately using keyboard commands.
Birds Eye View Results
After calibrating the system, the resulting image looks like the following:
As can be seen, the Birds Eye View recipe provided by Libpanorama successfully generates an aerial view from a set of images. The utilities and examples included in the project allowed me to generate the image above without modifying the source code.
Performance-wise, we achieved the following results:
Platform: NVIDIA AGX Orin 64GB
Cameras: 6x 1280x720@30fps 110° FOV USB cameras
Capture IO: GStreamer
Software: Ubuntu 20.04
Acceleration: CUDA 11.4
Output: 1080x1920 RGBA
Achieved framerate: 30fps
GPU usage: %
CPU usage: %
Addressing the Remaining Issues
The image above still shows a few noticeable defects. Specifically:
Color differences between cameras.
Mismatches between images in some places (bottom left, for example).
Visible seams between images (seamless stitching would be preferable).
The color differences are typically addressed in one of two ways. The first (and recommended) one is at the sensor level: the gains, white balance and exposure level can be kept in sync among all the cameras. You can even perform a color calibration so that the color responses of the cameras are as similar as possible. The second one post-processes the image content in an attempt to match the color distribution among all the images.
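One very simple form of the second approach is to match the per-channel means over the overlapping region of two adjacent cameras. The sketch below (OpenCV, with a placeholder overlap region) shows the idea; real pipelines use more robust color transfer methods:

```cpp
// Illustrative sketch: equalize the average color of two adjacent BEV patches
// over their overlap region.
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat left = cv::imread("bev_cam_left.png");
    cv::Mat right = cv::imread("bev_cam_right.png");

    // Placeholder overlap region, expressed in each image's own coordinates.
    cv::Rect overlapInLeft(900, 0, 100, left.rows);
    cv::Rect overlapInRight(0, 0, 100, right.rows);

    cv::Scalar meanLeft = cv::mean(left(overlapInLeft));
    cv::Scalar meanRight = cv::mean(right(overlapInRight));

    // Scale the right image channel-wise so both overlaps share the same
    // average color.
    std::vector<cv::Mat> channels;
    cv::split(right, channels);
    for (int c = 0; c < 3; ++c)
        channels[c].convertTo(channels[c], -1,
                              meanLeft[c] / (meanRight[c] + 1e-6));
    cv::merge(channels, right);

    cv::imwrite("bev_cam_right_matched.png", right);
    return 0;
}
```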
The mismatches between images are likely due to unwanted curvatures introduced by the camera lenses. These curvatures cause straight lines to appear curved as they approach the image borders. The solution is to perform a lens distortion calibration and add the rectification as part of the Birds Eye View recipe. Users will always have to face this tradeoff: either they calibrate each camera unit they ship, or they absorb the curvature errors.
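For reference, a minimal per-camera lens calibration with OpenCV and a set of chessboard captures might look like the sketch below; the board size, capture count and file names are placeholders:

```cpp
// Illustrative sketch: estimate camera intrinsics and distortion coefficients
// from chessboard captures. These are the inputs the rectification step needs.
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

int main() {
    const cv::Size boardCorners(9, 6);  // inner corners of the chessboard
    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;

    // Known 3D corner positions on the (flat) board, in board units.
    std::vector<cv::Point3f> board;
    for (int y = 0; y < boardCorners.height; ++y)
        for (int x = 0; x < boardCorners.width; ++x)
            board.emplace_back(float(x), float(y), 0.f);

    cv::Size imageSize;
    for (int i = 0; i < 15; ++i) {  // placeholder capture count
        cv::Mat img = cv::imread("chessboard_" + std::to_string(i) + ".png",
                                 cv::IMREAD_GRAYSCALE);
        imageSize = img.size();
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardCorners, corners)) {
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }

    cv::Mat K, dist;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, imageSize, K, dist,
                        rvecs, tvecs);

    cv::FileStorage fs("camera_intrinsics.yml", cv::FileStorage::WRITE);
    fs << "K" << K << "dist" << dist;
    return 0;
}
```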
Finally, having a single seamless image would be preferable to the image with fixed separators. Image stitching is a well-known technique that can be applied to the resulting IPM images. One typical challenge at this stage is that floors don't usually contain much texture or many feature points to use as overlapping features in the stitching algorithm.
At the time of this writing, solutions to these items are a work in progress and not yet included in Libpanorama.
Libpanorama for your Project
RidgeRun is actively developing Libpanorama. We are at a point where we are starting to give early access to the evaluation version of the project. Please don't hesitate to contact us if Libpanorama is a good fit for your project.
For the time being, Libpanorama includes:
IO:
GStreamer
QT
OpenCV
Dispatcher:
CPU
Remap:
CPU
CUDA (NPP)
Supported Platforms:
x86
NVIDIA Jetson family
NVIDIA Orin family
Color spaces:
RGBA
Gray8
Recipes:
BirdsEyeView
SphericalVideo
RectilinearFromSpherical
Kernels:
Fisheye projections
Spherical projections
Rectilinear projections
Basic image transformations (scale, rotate, translate, perspective)
Pinhole transformations (rotate in the 3D space)
ROI
Not sure if Libpanorama meets your needs? Let's have a conversation! Email us at contactus@ridgerun.com.