
Exploring Text Generation on NVIDIA Jetson with Generative AI Lab

Updated: Apr 17

Introduction

At RidgeRun, we are dedicated to actively seeking new opportunities and technologies to enhance our services. Our commitment to continuous improvement allows us to better serve our clients by providing cutting-edge solutions and staying ahead of industry trends. 


We believe in the power of leveraging the latest advancements to create value for our clients and contribute to their success. In this blog post, we explore NVIDIA’s Jetson Generative AI Lab and its potential for embedded applications.


Generative AI is a type of artificial intelligence technology that helps users and developers generate new content instead of just analyzing or manipulating existing data. This content can be text (as with GPT-4, Claude, Bard or Llama), images (as with DALL·E, Stable Diffusion or Midjourney), audio (as with MusicLM or MAGNeT), and so on. These models can learn from multiple sources, such as wiki pages, GitHub repositories, images, songs, movies and user input: they are supplied with datasets and can create new content using what they have learned [1].


Among the many applications of generative AI, Large Language Models (LLMs) in particular have gained huge popularity lately. LLMs encode a massive statistical model of our spoken language, which gives them the capability to hold human-like conversations.


Other models that have gained popularity are Vision Language Models (VLMs) and diffusion models. VLMs extend language models with image analysis capabilities, which allows them to explore, understand and describe the real world through a camera. Diffusion models, on the other hand, can receive simple text prompts and convert them into visual content [2].


With the new NVIDIA Jetson boards, such as the AGX Orin, Orin NX and Orin Nano, it is possible to run not only LLMs but also VLMs and Stable Diffusion models locally, using the Jetson Generative AI Lab that NVIDIA brought to light. This new tool allows developers and users to go through the generation of new content in a real-world setting with Jetson edge devices and technology [2].


At RidgeRun we are interested in exploring NVIDIA’s Jetson Generative AI Lab to present users with the results of running these generative models on NVIDIA Jetson boards. In this post, we explore the Text Generation, Image Generation, Text Plus Vision and NanoSAM tutorials. You will find the environment details for each tutorial, including the models that are used, along with CPU, GPU and memory performance metrics and the results obtained from executing the tutorials.


Environment


Hardware details

The tutorials explored in this post are Text Generation, Image Generation, Text + Vision and NanoSAM. They were executed on a Jetson AGX Orin 32 GB with 64 GB of eMMC. This Jetson, with its Ampere GPU architecture and 12 Arm CPU cores, is capable of the heavy computation that generative AI tasks demand.


The JetPack used is version 5.1.2, the latest JetPack 5 release delivered by NVIDIA as of the writing of this post. It ships with Ubuntu 20.04 and can run the tutorials in Docker containers based on Linux for Tegra, from both JetPack 5 and the newer JetPack 6.


The Jetson Orin supports different power modes. The one used for this blog is the maximum power mode, which has no power budget and keeps all 12 CPU cores online. Furthermore, to improve performance, the jetson_clocks script was enabled for all the tests.
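
For reference, this configuration can usually be applied with the following commands (a sketch: the MAXN mode index is 0 on the AGX Orin, but it is worth confirming with sudo nvpmodel -q, since mode indices vary between boards and JetPack releases):

    # Select the maximum (MAXN) power mode; confirm the index first with: sudo nvpmodel -q
    sudo nvpmodel -m 0

    # Lock the CPU, GPU and memory clocks to their maximum frequencies
    sudo jetson_clocks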


Model details

Each tutorial explored in this post uses a different model to achieve its purpose. The model details are summarized in the following table:


Table 1. Model details for the tutorials explored.

Text Generation

  • Model name: Llama-2-7B-Chat-GGUF
  • Quantization file: llama-2-7b-chat.Q4_K_M.gguf
  • Quantization method: Q4_K_M
  • Bits: 4
  • Size: 4.08 GB
  • Max RAM required: 6.58 GB
  • Description: Medium-size model with a good balance between quality and memory usage

Image Generation

  • Model name: v1-5-pruned-emaonly.safetensors
  • Quantization: N/A
  • Quantization method: N/A
  • Bits: N/A
  • Size: 4.27 GB
  • Max RAM required: N/A
  • Description: Generates photo-realistic images from text input

Text + Vision

  • Model name: TheBloke/llava-v1.5-13B-GPTQ
  • Quantization: 4-bit
  • Quantization method: N/A
  • Bits: 4
  • Size: 7.26 GB
  • Max RAM required: N/A
  • Description: 4-bit model; it consumes less than 64 GB of VRAM, but has lower accuracy

NanoSAM (image encoder)

  • Model name: ResNet18 image encoder
  • Quantization: FP16
  • Quantization method: N/A
  • Bits: 16
  • Size: N/A
  • Max RAM required: N/A
  • Use case: ResNet18 is used for image classification

NanoSAM (mask decoder)

  • Model name: MobileSAM
  • Quantization: FP32
  • Quantization method: N/A
  • Bits: 32
  • Size: N/A
  • Max RAM required: N/A
  • Description: A Segment Anything (SAM) model for mobile applications


There are some performance notes available for the models used in these tutorials.


Furthermore, the tutorials' benchmarks can be consulted in the NVIDIA Jetson Generative AI Lab Benchmarks.


Results summary

Detailed instructions on how to run the NVIDIA generative AI tutorials can be found in our developer’s wiki, here: NVIDIA Jetson Generative AI Lab RidgeRun Exploration

In addition, the detailed performance metrics can be taken from Figure 2, Figure 5, Figure 7 and Figure 9, and are summarized in the following bullet points:

  • The text generation tutorial consumes 91.3% of the GPU on average, while CPU consumption averages 20.37%. In this case, the output was generated at 16.69 tokens/s.

  • The image generation tutorial consumes 99% of the GPU and 52.6% of the CPU on average. In this case, a 512x512 image was generated in 4.3 seconds.

  • The Text Plus Vision tutorial consumes 36.71% of the GPU and 111.86% of the CPU on average (CPU usage is accumulated across the 12 cores, so values can exceed 100%). The output was generated at 3.83 tokens/s.

  • The NanoSAM tutorial consumes 6% of the GPU but 982.13% of the CPU on average. Memory usage averaged 12.64%.

  • For the first three tutorials, the RAM usage increased by less than 1%. According to the plots, the memory was not freed afterwards.


Results and Performance


One of the main purposes of this post is to explore the performance that can be achieved by generative models on a Jetson platform. There are plenty of metrics that can be extracted from these executions; the ones measured here are:

  • CPU usage

  • GPU usage

  • RAM usage

  • Tokens/s


The first three metrics are obtained by parsing the output of the NVIDIA tegrastats utility and plotting the measurements to better understand resource consumption. The generation rate and execution time are reported by the application itself.
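
As an illustration, here is a minimal Python sketch of the kind of parser we mean. The regular expressions assume the tegrastats line format from JetPack 5 (RAM used/totalMB, per-core CPU [..%@freq,...], GR3D_FREQ ..%), so they may need adjusting on other releases. Note that the CPU figure is the sum over all cores, which is why CPU percentages in this post can exceed 100%.

    import re
    import subprocess

    # Regexes assume the JetPack 5 tegrastats line format; adjust for other releases
    RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")
    CPU_RE = re.compile(r"CPU \[([^\]]+)\]")   # per-core entries such as "12%@2201" or "off"
    GPU_RE = re.compile(r"GR3D_FREQ (\d+)%")   # GR3D is the Jetson's integrated GPU

    def parse_line(line):
        """Return (RAM %, summed per-core CPU %, GPU %) for one tegrastats line."""
        ram, cpu, gpu = RAM_RE.search(line), CPU_RE.search(line), GPU_RE.search(line)
        if not (ram and cpu and gpu):
            return None
        used, total = map(int, ram.groups())
        cores = [int(c.split("%")[0]) for c in cpu.group(1).split(",") if c != "off"]
        return 100.0 * used / total, sum(cores), int(gpu.group(1))

    # Sample once per second and print the three metrics
    proc = subprocess.Popen(["tegrastats", "--interval", "1000"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        sample = parse_line(line)
        if sample:
            print("RAM {:.1f}%  CPU {}%  GPU {}%".format(*sample))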


Text Generation

The text generation model was tested on its ability to generate a response about which approach to follow to prevent race conditions in a multithreaded application. The following prompt was used:

What approach would you use to detect and prevent race conditions in a multithreaded application?


This is just one example of the many questions a user can ask the AI model.


This request took 41.46 seconds to complete and successfully produced the output shown in the results below.
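
The tutorial itself serves this model through the oobabooga text-generation-webui, so no code is required. As a rough illustration of what happens behind the scenes, though, a minimal llama-cpp-python sketch for the same GGUF file could look like the following (the model path and generation parameters are our assumptions, not the tutorial's exact configuration):

    from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

    # The model path is an assumption; point it to the downloaded GGUF file
    llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf",
                n_gpu_layers=-1,  # offload all layers to the Orin GPU
                n_ctx=2048)

    prompt = ("What approach would you use to detect and prevent race conditions "
              "in a multithreaded application?")
    # Llama-2-chat models expect the [INST] ... [/INST] instruction format
    out = llm("[INST] {} [/INST]".format(prompt), max_tokens=512)
    print(out["choices"][0]["text"])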


Results

The following GIF shows how the text generation demo creates the output when an input prompt is provided:




Figure 1. Text generation demo result.


The above output was generated at a rate of 16.69 tokens/s.


This same execution was used to measure performance and profile resource usage for the model on the Jetson. The following plots show a summary of the performance achieved by the model.




Figure 2. Text generation performance metrics.


These plots show the impact on CPU, GPU and memory usage when the model starts. On the GPU, usage reached an average of 91.3% during inference. On the CPU, however, we see a smaller increase, with an average usage of 20.37% and some spikes at the start and end of the execution. Regarding memory usage, we see no significant change, with a variation of less than 1%; this is because the model was already loaded beforehand.


Image Generation

The main idea of the Image Generation with Stable Diffusion tutorial is to generate an image based on an input text prompt. The following prompt was used to create the image below and to measure resource consumption:


Futuristic city with sunset, high quality, 4K image.


Results

The following GIF shows how the image generation demo creates the image when an input prompt description is provided:




Figure 3. Image generation demo result.


The image can be downloaded; the result is the following:




Figure 4. A futuristic city with a sunset.


The tutorial took 4.3 seconds to generate the image, which was created under the following conditions (see the sketch after this list for how they map onto code):

  • Steps: 20

  • Sampler: Euler a

  • CFG scale: 7

  • Seed: 3341820767

  • Size: 512x512
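
The tutorial drives Stable Diffusion through the stable-diffusion-webui interface, but as a rough sketch of how the settings above map onto code, here is an approximate Hugging Face diffusers equivalent (the model ID and dtype are assumptions; the webui's Euler a sampler corresponds to the Euler ancestral scheduler):

    import torch
    from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

    # Model ID is an assumption; the tutorial loads the same v1.5 weights from a .safetensors file
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                                   torch_dtype=torch.float16).to("cuda")
    # "Euler a" in the webui corresponds to the Euler ancestral scheduler
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

    image = pipe("Futuristic city with sunset, high quality, 4K image.",
                 num_inference_steps=20,  # Steps: 20
                 guidance_scale=7,        # CFG scale: 7
                 height=512, width=512,   # Size: 512x512
                 generator=torch.Generator("cuda").manual_seed(3341820767)  # Seed
                 ).images[0]
    image.save("futuristic_city.png")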


Performance and resource usage for the model were measured during this same execution on the Jetson. The following plots show a summary of the performance achieved by the model.





Figure 5. Image generation performance metrics.


These plots show the CPU, GPU and memory usage percentages when the tutorial is run. The GPU usage reached an average of 99% during inference. On the CPU side, we see a smaller increase, with an average usage of 52.6%. Regarding memory usage, we see behavior similar to the text generation tutorial: the variation was less than 1%, because the model was already loaded beforehand.


Text Plus Vision

The Text Plus Vision tutorial presents several different methods for running the demo on the Jetson. The one explored in this blog is Chat with Llava using text-generation-webui.
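
Here again, the model is driven through the text-generation-webui chat interface. For a rough idea of the equivalent programmatic flow, the sketch below uses the Hugging Face transformers LLaVA integration; note that it loads an unquantized llava-hf checkpoint instead of the 4-bit GPTQ model from the tutorial, so memory usage will differ:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    # llava-hf checkpoint is an assumption; the tutorial runs TheBloke/llava-v1.5-13B-GPTQ instead
    model_id = "llava-hf/llava-1.5-13b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id,
                                                          torch_dtype=torch.float16,
                                                          device_map="cuda")

    # Assumes the Pexels test photo was downloaded locally beforehand
    image = Image.open("market.jpg")

    # LLaVA-1.5 expects the image placeholder inside a USER/ASSISTANT turn
    prompt = "USER: <image>\nDescribe this image.\nASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))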


Results

The test image is a free-to-use photo taken from https://www.pexels.com/photo/people-walking-in-market-439818/. The following GIF shows how the image was loaded and how the output was generated:




Figure 6. Text Plus Vision demo result.


The above output was generated at a rate of 3.83 tokens/s. Furthermore, here are the performance metrics generated by running this tutorial method:




Figure 7. Text Plus Vision performance metrics.


These plots show the CPU, GPU and memory usage percentages while the model is inferring. The GPU usage had an average of 36.71%. On the CPU side, the usage reached an average of 111.86%. Regarding memory usage, we see behavior similar to the previous tutorials: the variation was less than 1%, because the model was already loaded beforehand.


NanoSAM

This tutorial takes NVIDIA's NanoSAM project and runs Example 1 - Segment with bounding box, which uses the basic_usage.py example code.
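
An abridged version of that example looks roughly as follows (the engine file paths follow the NanoSAM repository's build instructions and are assumptions on our side):

    import numpy as np
    import PIL.Image
    from nanosam.utils.predictor import Predictor

    # Engine paths follow the NanoSAM build instructions (assumptions on our side)
    predictor = Predictor("data/resnet18_image_encoder.engine",
                          "data/mobile_sam_mask_decoder.engine")

    predictor.set_image(PIL.Image.open("assets/dogs.jpg"))

    # Bounding box from the demo: top-left (100,100), bottom-right (850,759);
    # labels 2 and 3 mark the points as the box's top-left and bottom-right corners
    points = np.array([[100, 100], [850, 759]])
    point_labels = np.array([2, 3])
    mask, _, _ = predictor.predict(points, point_labels)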


Results

The example segments the dog enclosed in the bounding box defined by the points (100,100) and (850,759). The output image looks like the following:




Figure 8. NanoSAM demo results.


Here are the performance metrics generated by running this tutorial:




Figure 9. NanoSAM performance metrics.


The above plots show the CPU, GPU and memory usage percentages when the example is run. The GPU usage had an average of only 6%, while on the CPU side the usage reached an average of 982.13%. This time, the memory usage behavior differs from the previous tutorials, averaging 12.64%.


Conclusions

With the results presented in this blog, it can be seen that the generative AI tutorials provided by NVIDIA can be successfully executed, with good results, on the Jetson AGX Orin board. Furthermore, they were easy to use thanks to the instructions provided by the NVIDIA Jetson Generative AI Lab.


On the performance side, the text generation and image generation tutorials occupied more than 90% of GPU resources on average, while the Text Plus Vision and NanoSAM tutorials consumed less than 40% of the GPU but more than 100% of a CPU core on average. Consequently, it would be difficult for the Jetson board to execute other demanding real-time tasks while a generative AI process is running, since few spare resources remain available.


Finally, at RidgeRun we see a lot of potential in the NVIDIA Generative AI tutorials and their execution on Jetson; there is still a lot more to explore in depth in order to adapt this technology to our clients' custom applications. Watch out for more blogs on the topic and technical documentation in our developer's wiki.


References
