nvidia

Open Source

GenerativeAIExamples

![](docs/images/[email protected]) # NVIDIA Generative AI Examples This repository is a starting point for developers looking to integrate with the NVIDIA software ecosystem to speed up their generative AI systems. Whether you are building RAG pipelines, agentic workflows, or fine-tuning models, this repository will help you integrate NVIDIA, seamlessly and natively, with your development stack. ## Table of Contents  * [What's New?](#whats-new) * [Data Flywheel](#data-flywheel) * [Safer Agentic AI](#safer-agentic-ai) * [Knowledge Graph RAG](#knowledge-graph-rag) * [Agentic Workflows with Llama 3.1](#agentic-workflows-with-llama-31) * [RAG with Local NIM Deployment and LangChain](#rag-with-local-nim-deployment-and-langchain) * [Vision NIM Workflows](#vision-nim-workflows) * [Try it Now!](#try-it-now) * [Data Flywheel](#data-flywheel) * [Tool-Calling Notebooks](#tool-calling-notebooks) * [RAG](#rag) * [RAG Notebooks](#rag-notebooks) * [RAG Examples](#rag-examples) * [RAG Tools](#rag-tools) * [RAG Projects](#rag-projects) * [Documentation](#documentation) * [Getting Started](#getting-started) * [How To's](#how-tos) * [Reference](#reference) * [Community](#community)  ## What's New? ### Data Flywheel These tutorials demonstrate Data Flywheel workflows that use NVIDIA NeMo Microservices. They include components such as NVIDIA NeMo Datastore, NeMo Entity Store, NeMo Customizer, NeMo Evaluator, NeMo Guardrails microservices, and NVIDIA NIMs. - [Tool Calling Fine-tuning, Inference, Evaluation, and Guardrailing with NVIDIA NeMo Microservices and NIMs](./nemo/data-flywheel/tool-calling) - [Embedding Fine-tuning, Inference, and Evaluation with NVIDIA NeMo Microservices and NIMs](./nemo/data-flywheel/embedding-finetuning/) ### Safer Agentic AI The following tutorials illustrate how to audit your large language models with NeMo Auditor to identify vulnerabilities to unsafe prompts, and how to run inference with multiple rails in parallel to reduce latency and improve throughput. - [Audit your LLMs](./nemo/NeMo-Auditor/Getting_Started_With_NeMo_Auditor.ipynb) - [Inference with Parallel Rails](./nemo/NeMo-Guardrails/Parallel_Rails_Tutorial.ipynb) ### Knowledge Graph RAG This example implements a GPU-accelerated pipeline for creating and querying knowledge graphs using RAG by leveraging NIM microservices and the RAPIDS ecosystem to process large-scale datasets efficiently. - [Knowledge Graphs for RAG with NVIDIA AI Foundation Models and Endpoints](community/knowledge_graph_rag) ### Agentic Workflows with Llama 3.1 - Build an Agentic RAG Pipeline with Llama 3.1 and NVIDIA NeMo Retriever NIM microservices [[Blog](https://developer.nvidia.com/blog/build-an-agentic-rag-pipeline-with-llama-3-1-and-nvidia-nemo-retriever-nims/), [Notebook](RAG/notebooks/langchain/agentic_rag_with_nemo_retriever_nim.ipynb)] - [NVIDIA Morpheus, NIM microservices, and RAG pipelines integrated to create LLM-based agent pipelines](https://github.com/NVIDIA/GenerativeAIExamples/blob/v0.7.0/experimental/event-driven-rag-cve-analysis) ### RAG with Local NIM Deployment and LangChain - Tips for Building a RAG Pipeline with NVIDIA AI LangChain AI Endpoints by Amit Bleiweiss. [[Blog](https://developer.nvidia.com/blog/tips-for-building-a-rag-pipeline-with-nvidia-ai-langchain-ai-endpoints/), [Notebook](https://github.com/NVIDIA/GenerativeAIExamples/blob/v0.7.0/notebooks/08_RAG_Langchain_with_Local_NIM.ipynb)] For more information, refer to the [Generative AI Example releases](https://github.com/NVIDIA/GenerativeAIExamples/releases/). ### Vision NIM Workflows A collection of Jupyter notebooks, sample code and reference applications built with Vision NIMs. To pull the vision NIM workflows, clone this repository recursively: ``` git clone https://github.com/nvidia/GenerativeAIExamples --recurse-submodules ``` The workflows will then be located at [GenerativeAIExamples/vision_workflows](vision_workflows/README.md) Follow the links below to learn more: - [Learn how to use VLMs to automatically monitor a video stream for custom events.](nim_workflows/vlm_alerts/README.md) - [Learn how to search images with natural language using NV-CLIP.](nim_workflows/nvclip_multimodal_search/README.md) - [Learn how to combine VLMs, LLMs and CV models to build a robust text extraction pipeline.](nim_workflows/vision_text_extraction/README.md) - [Learn how to use embeddings with NVDINOv2 and a Milvus VectorDB to build a few shot classification model.](nim_workflows/nvdinov2_few_shot/README.md) ## Try it Now! Experience NVIDIA RAG Pipelines with just a few steps! 1. Get your NVIDIA API key. 1. Go to the [NVIDIA API Catalog](https://build.ngc.nvidia.com/explore/). 1. Select any model. 1. Click **Get API Key**. 1. Run: ```console export NVIDIA_API_KEY=nvapi-... ``` 1. Clone the repository. ```console git clone https://github.com/nvidia/GenerativeAIExamples.git ``` 1. Build and run the basic RAG pipeline. ```console cd GenerativeAIExamples/RAG/examples/basic_rag/langchain/ docker compose up -d --build ``` 1. Go to <https://localhost:8090/> and submit queries to the sample RAG Playground. 1. Stop containers when done. ```console docker compose down ``` ## Data Flywheel A [Data Flywheel](https://www.nvidia.com/en-us/glossary/data-flywheel/) is a self-reinforcing cycle where user interactions generate data that improves AI models or products, leading to better outcomes that attract more users and further enhance data quality. This feedback loop relies on continuous data processing, model refinement, and guardrails to ensure accuracy and compliance while compounding value over time. Real-world applications range from personalized customer experiences to operational systems like inventory management, where improved predictions drive efficiency and growth. ### Tool-Calling Notebooks Tool calling empowers Large Language Models (LLMs) to integrate with external APIs, execute dynamic workflows, and retrieve real-time data beyond their training scope. The NVIDIA NeMo microservices platform offers a modular infrastructure for deploying AI pipelines that includes fine-tuning, evaluation, inference, and guardrail enforcement—across Kubernetes clusters in cloud or on-premises environments. This end-to-end [tutorial](./nemo/data-flywheel/tool-calling) demonstrates how to leverage NeMo Microservices to customize [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) by using the [xLAM](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) function-calling dataset, assess its accuracy, and implement safety constraints to govern its behavior. ## RAG ### RAG Notebooks NVIDIA has first-class support for popular generative AI developer frameworks like [LangChain](https://python.langchain.com/v0.2/docs/integrations/chat/nvidia_ai_endpoints/), [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia/), and [Haystack](https://haystack.deepset.ai/integrations/nvidia). These end-to-end notebooks show how to integrate NIM microservices using your preferred generative AI development framework. Use these [notebooks](./RAG/notebooks/README.md) to learn about the LangChain and LlamaIndex connectors. #### LangChain Notebooks - RAG - [Basic RAG with CHATNVIDIA LangChain Integration](./RAG/notebooks/langchain/langchain_basic_RAG.ipynb) - [RAG using local NIM microservices for LLMs and Retrieval](./RAG/notebooks/langchain/RAG_Langchain_with_Local_NIM.ipynb) - [RAG for HTML Documents](./RAG/notebooks/langchain/RAG_for_HTML_docs_with_Langchain_NVIDIA_AI_Endpoints.ipynb) - [Chat with NVIDIA Financial Reports](./RAG/notebooks/langchain/Chat_with_nvidia_financial_reports.ipynb) - Agents - [NIM Tool Calling 101](https://github.com/langchain-ai/langchain-nvidia/blob/main/cookbook/nvidia_nim_agents_llama3.1.ipynb) - [Agentic RAG with NeMo Retriever](./RAG/notebooks/langchain/agentic_rag_with_nemo_retriever_nim.ipynb) - [Agents with Human in the Loop](./RAG/notebooks/langchain/LangGraph_HandlingAgent_IntermediateSteps.ipynb) #### LlamaIndex Notebooks - [Basic RAG with LlamaIndex Integration](./RAG/notebooks/llamaindex/llamaindex_basic_RAG.ipynb) ### RAG Examples By default, these end-to-end [examples](RAG/examples/README.md) use preview NIM endpoints on [NVIDIA API Catalog](https://catalog.ngc.nvidia.com). Alternatively, you can run any of the examples [on premises](./RAG/examples/local_deploy/). #### Basic RAG Examples - [LangChain example](./RAG/examples/basic_rag/langchain/README.md) - [LlamaIndex example](./RAG/examples/basic_rag/llamaindex/README.md) #### Advanced RAG Examples - [Multi-Turn](./RAG/examples/advanced_rag/multi_turn_rag/README.md) - [Multimodal Data](./RAG/examples/advanced_rag/multimodal_rag/README.md) - [Structured Data](./RAG/examples/advanced_rag/structured_data_rag/README.md) (CSV) - [Query Decomposition](./RAG/examples/advanced_rag/query_decomposition_rag/README.md) ### RAG Tools Example tools and tutorials to enhance LLM development and productivity when using NVIDIA RAG pipelines. - [Evaluation](./RAG/tools/evaluation/README.md) - [Observability](./RAG/tools/observability/README.md) ### RAG Projects - [NVIDIA Tokkio LLM-RAG](https://docs.nvidia.com/ace/latest/workflows/tokkio/text/Tokkio_LLM_RAG_Bot.html): Use Tokkio to add avatar animation for RAG responses. - [Hybrid RAG Project on AI Workbench](https://github.com/NVIDIA/workbench-example-hybrid-rag): Run an NVIDIA AI Workbench example project for RAG. ## Documentation ### Getting Started - [Prerequisites](./docs/common-prerequisites.md) ### How To's - [Changing the Inference or Embedded Model](./docs/change-model.md) - [Customizing the Vector Database](./docs/vector-database.md) - [Customizing the Chain Server](./docs/chain-server.md): - [Chunking Strategy](./docs/text-splitter.md) - [Prompting Template Engineering](./docs/prompt-customization.md) - [Configuring LLM Parameters at Runtime](./docs/llm-params.md) - [Supporting Multi-Turn Conversations](./docs/multiturn.md) - [Speaking Queries and Listening to Responses with NVIDIA Riva](./docs/riva-asr-tts.md) ### Reference - [Support Matrix](./docs/support-matrix.md) - [Architecture](./docs/architecture.md) - [Using the Sample Chat Web Application](./docs/using-sample-web-application.md) - [RAG Playground Web Application](./docs/frontend.md) - [Software Component Configuration](./docs/configuration.md) ## Community We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions! Open a GitHub issue or pull request! See [contributing](docs/contributing.md) Check out the [community](./community/README.md) examples and notebooks.

AI & Machine Learning DevOps & Infrastructure

4.1K Github Stars

Open Source

DALI

|License| |Documentation| |Format| NVIDIA DALI =========== .. overview-begin-marker-do-not-remove The NVIDIA Data Loading Library (DALI) is a GPU-accelerated library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video and audio data. It can be used as a portable drop-in replacement for built in data loaders and data iterators in popular deep learning frameworks. Deep learning applications require complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and many other augmentations. These data processing pipelines, which are currently executed on the CPU, have become a bottleneck, limiting the performance and scalability of training and inference. DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline. Features such as prefetching, parallel execution, and batch processing are handled transparently for the user. In addition, the deep learning frameworks have multiple data pre-processing implementations, resulting in challenges such as portability of training and inference workflows, and code maintainability. Data processing pipelines implemented using DALI are portable because they can easily be retargeted to TensorFlow, PyTorch, and PaddlePaddle. .. image:: /dali.png :width: 800 :align: center :alt: DALI Diagram .. github display off .. GitHub discards custom styles when rendering Markdown. We can use it to our advantage to hide the GitHub-style admonition in the sphinx-rendered documentation .. only:: html .. raw:: html <style> .github-only { display: none !important; } </style> .. rst-class:: github-only .. pull-quote:: [!TIP] The `dali-dynamic-mode <https://github.com/NVIDIA/skills/blob/main/skills/dali-dynamic-mode/SKILL.md>`_ skill provides AI agents with guidance on the Dynamic Mode API and best practices. It can be installed as follows: .. code-block:: sh npx skills add nvidia/skills --skill dali-dynamic-mode For more information, see the `NVIDIA/skills <https://github.com/NVIDIA/skills>`_ GitHub repository. DALI in action -------------- .. container:: dali-tabs **Pipeline mode:** .. code-block:: python from nvidia.dali.pipeline import pipeline_def import nvidia.dali.types as types import nvidia.dali.fn as fn from nvidia.dali.plugin.pytorch import DALIGenericIterator import os # To run with different data, see documentation of nvidia.dali.fn.readers.file # points to https://github.com/NVIDIA/DALI_extra data_root_dir = os.environ['DALI_EXTRA_PATH'] images_dir = os.path.join(data_root_dir, 'db', 'single', 'jpeg') def loss_func(pred, y): pass def model(x): pass def backward(loss, model): pass @pipeline_def(num_threads=4, device_id=0) def get_dali_pipeline(): images, labels = fn.readers.file( file_root=images_dir, random_shuffle=True, name="Reader") # decode data on the GPU images = fn.decoders.image_random_crop( images, device="mixed", output_type=types.RGB) # the rest of processing happens on the GPU as well images = fn.resize(images, resize_x=256, resize_y=256) images = fn.crop_mirror_normalize( images, crop_h=224, crop_w=224, mean=[0.485 * 255, 0.456 * 255, 0.406 * 255], std=[0.229 * 255, 0.224 * 255, 0.225 * 255], mirror=fn.random.coin_flip()) return images, labels train_data = DALIGenericIterator( [get_dali_pipeline(batch_size=16)], ['data', 'label'], reader_name='Reader' ) for i, data in enumerate(train_data): x, y = data[0]['data'], data[0]['label'] pred = model(x) loss = loss_func(pred, y) backward(loss, model) **Dynamic mode:** .. code-block:: python import os import nvidia.dali.types as types import nvidia.dali.experimental.dynamic as ndd import torch # To run with different data, see documentation of ndd.readers.File # points to https://github.com/NVIDIA/DALI_extra data_root_dir = os.environ['DALI_EXTRA_PATH'] images_dir = os.path.join(data_root_dir, 'db', 'single', 'jpeg') def loss_func(pred, y): pass def model(x): pass def backward(loss, model): pass reader = ndd.readers.File(file_root=images_dir, random_shuffle=True) for images, labels in reader.next_epoch(batch_size=16): images = ndd.decoders.image_random_crop(images, device="gpu", output_type=types.RGB) # the rest of processing happens on the GPU as well images = ndd.resize(images, resize_x=256, resize_y=256) images = ndd.crop_mirror_normalize( images, crop_h=224, crop_w=224, mean=[0.485 * 255, 0.456 * 255, 0.406 * 255], std=[0.229 * 255, 0.224 * 255, 0.225 * 255], mirror=ndd.random.coin_flip(), ) x = torch.as_tensor(images) y = torch.as_tensor(labels.gpu()) pred = model(x) loss = loss_func(pred, y) backward(loss, model) Highlights ---------- - Easy-to-use functional style Python API. - Multiple data formats support - LMDB, RecordIO, TFRecord, COCO, JPEG, JPEG 2000, WAV, FLAC, OGG, H.264, VP9 and HEVC. - Portable across popular deep learning frameworks: TensorFlow, PyTorch, PaddlePaddle, JAX. - Supports CPU and GPU execution. - Scalable across multiple GPUs. - Flexible graphs let developers create custom pipelines. - Extensible for user-specific needs with custom operators. - Accelerates image classification (ResNet-50), object detection (SSD) workloads as well as ASR models (Jasper, RNN-T). - Allows direct data path between storage and GPU memory with `GPUDirect Storage <https://developer.nvidia.com/gpudirect-storage>`__. - Easy integration with `NVIDIA Triton Inference Server <https://developer.nvidia.com/nvidia-triton-inference-server>`__ with `DALI TRITON Backend <https://github.com/triton-inference-server/dali_backend>`__. - Open source. .. overview-end-marker-do-not-remove ---- DALI success stories: --------------------- - `During Kaggle computer vision competitions <https://www.kaggle.com/code/theoviel/rsna-breast-baseline-faster-inference-with-dali>`__: `"DALI is one of the best things I have learned in this competition" <https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/391059>`__ - `Lightning Pose - state of the art pose estimation research model <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168383/>`__ - `To improve the resource utilization in Advanced Computing Infrastructure <https://arcwiki.rs.gsu.edu/en/dali/using_nvidia_dali_loader>`__ - `MLPerf - the industry standard for benchmarking compute and deep learning hardware and software <https://developer.nvidia.com/blog/mlperf-hpc-v1-0-deep-dive-into-optimizations-leading-to-record-setting-nvidia-performance/>`__ - `"we optimized major models inside eBay with the DALI framework" <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62578/>`__ ---- DALI Roadmap ------------ `The following issue represents <https://github.com/NVIDIA/DALI/issues/5320>`__ a high-level overview of our 2024 plan. You should be aware that this roadmap may change at any time and the order of its items does not reflect any type of priority. We strongly encourage you to comment on our roadmap and provide us feedback on the mentioned GitHub issue. ---- Installing DALI --------------- To install the latest DALI release for the latest CUDA version (12.x):: pip install nvidia-dali-cuda120 # or pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120 DALI requires `NVIDIA driver <https://www.nvidia.com/drivers>`__ supporting the appropriate CUDA version. In case of DALI based on CUDA 12, it requires `CUDA Toolkit <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`__ to be installed. DALI comes preinstalled in the `TensorFlow <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow>`__, `PyTorch <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__, and `PaddlePaddle <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle>`__ containers on `NVIDIA GPU Cloud <https://ngc.nvidia.com>`__. For other installation paths (TensorFlow plugin, older CUDA version, nightly and weekly builds, etc), and specific requirements please refer to the `Installation Guide <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html>`__. To build DALI from source, please refer to the `Compilation Guide <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/compilation.html>`__. ---- Examples and Tutorials ---------------------- An introduction to DALI can be found in the `Getting Started <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/getting_started.html>`__ page. More advanced examples can be found in the `Examples and Tutorials <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/index.html>`__ page. For an interactive version (Jupyter notebook) of the examples, go to the `docs/examples <https://github.com/NVIDIA/DALI/blob/main/docs/examples>`__ directory. **Note:** Select the `Latest Release Documentation <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html>`__ or the `Nightly Release Documentation <https://docs.nvidia.com/deeplearning/dali/main-user-guide/docs/index.html>`__, which stays in sync with the main branch, depending on your version. ---- Additional Resources -------------------- - GPU Technology Conference 2024; **Optimizing Inference Model Serving for Highest Performance at eBay**; Yiheng Wang: `event <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62578/>`__ - GPU Technology Conference 2023; **Developer Breakout: Accelerating Enterprise Workflows With Triton Server and DALI**; Brandon Tuttle: `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring23-se52140/>`__. - GPU Technology Conference 2023; **GPU-Accelerating End-to-End Geospatial Workflows**; Kevin Green: `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51796/>`__. - GPU Technology Conference 2022; **Effective NVIDIA DALI: Accelerating Real-life Deep-learning Applications**; Rafał Banaś: `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41442/>`__. - GPU Technology Conference 2022; **Introduction to NVIDIA DALI: GPU-accelerated Data Preprocessing**; Joaquin Anton Guirao: `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41443/>`__. - GPU Technology Conference 2021; **NVIDIA DALI: GPU-Powered Data Preprocessing** by Krzysztof Łęcki and Michał Szołucha: `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31298/>`__. - GPU Technology Conference 2020; **Fast Data Pre-Processing with NVIDIA Data Loading Library (DALI)**; Albert Wolant, Joaquin Anton Guirao: `recording <https://developer.nvidia.com/gtc/2020/video/s21139>`__. - GPU Technology Conference 2019; **Fast AI data pre-preprocessing with DALI**; Janusz Lisiecki, Michał Zientkiewicz: `slides <https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9925-fast-ai-data-pre-processing-with-nvidia-dali.pdf>`__, `recording <https://developer.nvidia.com/gtc/2019/video/S9925/video>`__. - GPU Technology Conference 2019; **Integration of DALI with TensorRT on Xavier**; Josh Park and Anurag Dixit: `slides <https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9818-integration-of-tensorrt-with-dali-on-xavier.pdf>`__, `recording <https://developer.nvidia.com/gtc/2019/video/S9818/video>`__. - GPU Technology Conference 2018; **Fast data pipeline for deep learning training**, T. Gale, S. Layton and P. Trędak: `slides <http://on-demand.gputechconf.com/gtc/2018/presentation/s8906-fast-data-pipelines-for-deep-learning-training.pdf>`__, `recording <https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8906/>`__. - `Developer Page <https://developer.nvidia.com/DALI>`__. - `Blog Posts <https://developer.nvidia.com/blog/tag/dali/>`__. ---- Contributing to DALI -------------------- We welcome contributions to DALI. To contribute to DALI and make pull requests, follow the guidelines outlined in the `Contributing <https://github.com/NVIDIA/DALI/blob/main/CONTRIBUTING.md>`__ document. If you are looking for a task good for the start please check one from `external contribution welcome label <https://github.com/NVIDIA/DALI/labels/external%20contribution%20welcome>`__. Reporting Problems, Asking Questions ------------------------------------ We appreciate feedback, questions or bug reports. When you need help with the code, follow the process outlined in the `Stack Overflow <https://stackoverflow.com/help/mcve>`__ document. Ensure that the posted examples are: - **minimal**: Use as little code as possible that still produces the same problem. - **complete**: Provide all parts needed to reproduce the problem. Check if you can strip external dependency and still show the problem. The less time we spend on reproducing the problems, the more time we can dedicate to the fixes. - **verifiable**: Test the code you are about to provide, to make sure that it reproduces the problem. Remove all other problems that are not related to your request. Acknowledgements ---------------- DALI was originally built with major contributions from Trevor Gale, Przemek Tredak, Simon Layton, Andrei Ivanov and Serge Panev. .. |License| image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg :target: https://opensource.org/licenses/Apache-2.0 .. |Documentation| image:: https://img.shields.io/badge/NVIDIA%20DALI-documentation-brightgreen.svg?longCache=true :target: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html .. |Format| image:: https://img.shields.io/badge/code%20style-black-000000.svg :target: https://github.com/psf/black

ML Frameworks System Monitoring

5.7K Github Stars

Open Source

DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Education & Learning ML Frameworks

14.8K Github Stars

Open Source

TensorRT

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Documentation](https://img.shields.io/badge/TensorRT-documentation-brightgreen.svg)](https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html) [![Roadmap](https://img.shields.io/badge/Roadmap-Q3_2026-brightgreen.svg)](documents/tensorrt_roadmap_2026q3.pdf) # :mega::mega: Announcement :mega::mega: TensorRT 11.0 is now released with powerful new capabilities designed to accelerate your AI inference workflows. With this major version bump, TensorRT's API has been streamlined and a few legacy features have been removed. Below provides migration guides for the following features: - Weakly-typed networks and related APIs have been removed, replaced by [Strongly Typed Networks](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#strongly-typed-networks). - Implicit quantization and related APIs have been removed, replaced by [Explicit Quantization](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#explicit-quantization) - IPluginV2 and related APIs have been removed, replaced by [IPluginV3](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/extending-custom-layers.html#migrating-v2-plugins-to-ipluginv3) - TREX tool has been removed, replaced by [Nsight Deep Learning Designer](https://docs.nvidia.com/nsight-dl-designer/UserGuide/index.html#visualizing-a-tensorrt-engine) - Python bindings for Python 3.9 and older versions have been removed. RPM packages for RHEL/Rocky Linux 8 and RHEL/Rocky Linux 9 now depend on Python 3.12. # TensorRT Open Source Software This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. It includes the sources for TensorRT plugins and ONNX parser, as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes. - For step-by-step walkthroughs of the TensorRT import paths (ONNX, Torch-TensorRT, HuggingFace/Optimum, Network Definition API) with examples and tooling tips, see the [Import Workflows Guide](documents/import_workflows.md). - For the per-model support matrix across import paths (LLM, encoder-NLP, vision, audio, diffusion, multimodal), see [Supported Models](documents/supported_models.md). - For code contributions to TensorRT-OSS, please see our [Contribution Guide](CONTRIBUTING.md) and [Coding Guidelines](CODING-GUIDELINES.md). - For a summary of new additions and updates shipped with TensorRT-OSS releases, please refer to the [Changelog](CHANGELOG.md). - For business inquiries, please contact [[email protected]](mailto:[email protected]) - For press and other inquiries, please contact Hector Marinez at [[email protected]](mailto:[email protected]) Need enterprise support? NVIDIA global support is available for TensorRT with the [NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/). Check out [NVIDIA LaunchPad](https://www.nvidia.com/en-us/launchpad/ai/ai-enterprise/) for free access to a set of hands-on labs with TensorRT hosted on NVIDIA infrastructure. Join the [TensorRT and Triton community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and stay current on the latest product updates, bug fixes, content, best practices, and more. # Prebuilt TensorRT Python Package We provide the TensorRT Python package for an easy installation. \ To install: ```bash pip install tensorrt ``` You can skip the **Build** section to enjoy TensorRT with Python. # Build ## Prerequisites To build the TensorRT-OSS components, you will first need the following software packages. **TensorRT GA build** - TensorRT v11.0.0.114 - Available from direct download links listed below **System Packages** - [CUDA](https://developer.nvidia.com/cuda-toolkit) - Recommended versions: - cuda-13.2.0 - cuda-12.9.0 - [CUDNN (optional)](https://developer.nvidia.com/cudnn) - cuDNN 8.9 - [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1 - [cmake](https://github.com/Kitware/CMake/releases) >= v3.31 - [python](https://www.python.org/downloads/) >= v3.10, <= v3.13.x - [pip](https://pypi.org/project/pip/#history) >= v19.0 - Essential utilities - [git](https://git-scm.com/downloads), [pkg-config](https://www.freedesktop.org/wiki/Software/pkg-config/), [wget](https://www.gnu.org/software/wget/faq.html#download) **Optional Packages** - [NCCL](https://developer.nvidia.com/nccl/nccl-download) >= v2.19, < v3.0 — only when building with multi-device support (`-DTRT_BUILD_ENABLE_MULTIDEVICE=ON`) for the `sampleDistCollective` sample. - Containerized build - [Docker](https://docs.docker.com/install/) >= 19.03 - [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) - PyPI packages (for demo applications/tests) - [onnx](https://pypi.org/project/onnx/) - [onnxruntime](https://pypi.org/project/onnxruntime/) - [tensorflow-gpu](https://pypi.org/project/tensorflow/) >= 2.5.1 - [Pillow](https://pypi.org/project/Pillow/) >= 9.0.1 - [pycuda](https://pypi.org/project/pycuda/) < 2021.1 - [numpy](https://pypi.org/project/numpy/) - [pytest](https://pypi.org/project/pytest/) - Code formatting tools (for contributors) - [Clang-format](https://clang.llvm.org/docs/ClangFormat.html) - [Git-clang-format](https://github.com/llvm-mirror/clang/blob/master/tools/clang-format/git-clang-format) > NOTE: [onnx-tensorrt](https://github.com/onnx/onnx-tensorrt), [cub](http://nvlabs.github.io/cub/), and [protobuf](https://github.com/protocolbuffers/protobuf.git) packages are downloaded along with TensorRT OSS, and not required to be installed. ## Downloading TensorRT Build 1. #### Download TensorRT OSS ```bash git clone -b main https://github.com/nvidia/TensorRT TensorRT cd TensorRT git submodule update --init --recursive ``` 2. #### (Optional - if not using TensorRT container) Specify the TensorRT GA release build path If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step. Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below: - [TensorRT 11.0.0.114 for CUDA 13.2, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/11.0.0/tars/TensorRT-Enterprise-11.0.0.114-Linux-x86_64-cuda-13.2-Release-external.tar.zst) - [TensorRT 11.0.0.114 for CUDA 12.9, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/11.0.0/tars/TensorRT-Enterprise-11.0.0.114-Linux-x86_64-cuda-12.9-Release-external.tar.zst) - [TensorRT 11.0.0.114 for CUDA 13.2, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/11.0.0/zip/TensorRT-Enterprise-11.0.0.114-Windows-amd64-cuda-13.2-Release-external.zip) - [TensorRT 11.0.0.114 for CUDA 12.9, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/11.0.0/zip/TensorRT-Enterprise-11.0.0.114-Windows-amd64-cuda-12.9-Release-external.zip) **Example: Ubuntu 22.04 on x86-64 with cuda-13.2** ```bash cd ~/Downloads tar --zstd -xvf TensorRT-Enterprise-11.0.0.114-Linux-x86_64-cuda-13.2-Release-external.tar.zst export TRT_LIBPATH=`pwd`/TensorRT-11.0.0.114/lib ``` **Example: Windows on x86-64 with cuda-12.9** ```powershell Expand-Archive -Path TensorRT-Enterprise-11.0.0.114-Windows-amd64-cuda-12.9-Release-external.zip $env:TRT_LIBPATH="$pwd\TensorRT-11.0.0.114\lib" ``` ## Setting Up The Build Environment For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, please install the [prerequisite](#prerequisites) _System Packages_. 1. #### Generate the TensorRT-OSS build container. **Example: Ubuntu 24.04 on x86-64 with cuda-13.2 (default)** ```bash ./docker/build.sh --file docker/ubuntu-24.04.Dockerfile --tag tensorrt-ubuntu24.04-cuda13.2 ``` **Example: Rockylinux8 on x86-64 with cuda-13.2** ```bash ./docker/build.sh --file docker/rockylinux8.Dockerfile --tag tensorrt-rockylinux8-cuda13.2 ``` **Example: Ubuntu 24.04 cross-compile for Jetson (aarch64) with cuda-13.2 (JetPack SDK)** ```bash ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda13.2 ``` **Example: Ubuntu 24.04 on aarch64 with cuda-13.2** ```bash ./docker/build.sh --file docker/ubuntu-24.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu24.04-cuda13.2 ``` 2. #### Launch the TensorRT-OSS build container. **Example: Ubuntu 24.04 build container** ```bash ./docker/launch.sh --tag tensorrt-ubuntu24.04-cuda13.2 --gpus all ``` > NOTE: > 1. Use the `--tag` corresponding to build container generated in Step 1. > 2. [NVIDIA Container Toolkit](#prerequisites) is required for GPU access (running TensorRT applications) inside the build container. > 3. `sudo` password for Ubuntu build containers is 'nvidia'. > 4. Specify port number using `--jupyter <port>` for launching Jupyter notebooks. > 5. Write permission to this folder is required as this folder will be mounted inside the docker container for uid:gid of 1000:1000. ## Building TensorRT-OSS - Generate Makefiles and build **Example: Linux (x86-64) build with default cuda-13.2** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out make -j$(nproc) ``` **Example: Linux (aarch64) build with default cuda-13.2** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain make -j$(nproc) ``` **Example: Native build on Jetson Thor (aarch64) with cuda-13.2** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 CC=/usr/bin/gcc make -j$(nproc) ``` > NOTE: C compiler must be explicitly specified via CC= for native aarch64 builds of protobuf. **Example: Ubuntu 24.04 Cross-Compile for Jetson Thor (aarch64) with cuda-13.2 (JetPack)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64_cross.toolchain make -j$(nproc) ``` **Example: Ubuntu 24.04 Cross-Compile for DriveOS (aarch64) with cuda-13.2** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64_dos_cross.toolchain make -j$(nproc) ``` **Example: Native builds on Windows (x86) with cuda-13.2** ```bash cd $TRT_OSSPATH New-Item -ItemType Directory -Path build cd build cmake .. -DTRT_LIB_DIR="$env:TRT_LIBPATH" -DTRT_OUT_DIR="$pwd\\out" msbuild TensorRT.sln /property:Configuration=Release -m:$env:NUMBER_OF_PROCESSORS ``` > NOTE: The default CUDA version used by CMake is 13.2. To override this, for example to 12.9, append `-DCUDA_VERSION=12.9` to the cmake command. - Required CMake build arguments are: - `TRT_LIB_DIR`: Path to the TensorRT installation directory containing libraries. - `TRT_OUT_DIR`: Output directory where generated build artifacts will be copied. - Optional CMake build arguments: - `CMAKE_BUILD_TYPE`: Specify if binaries generated are for release or debug (contain debug symbols). Values consists of [`Release`] | `Debug` - `CUDA_VERSION`: The version of CUDA to target, for example [`12.9.9`]. - `CUDNN_VERSION`: The version of cuDNN to target, for example [`8.9`]. - `PROTOBUF_VERSION`: The version of Protobuf to use, for example [`3.20.1`]. Note: Changing this will not configure CMake to use a system version of Protobuf, it will configure CMake to download and try building that version. - `CMAKE_TOOLCHAIN_FILE`: The path to a toolchain file for cross compilation. - `BUILD_PARSERS`: Specify if the parsers should be built, for example [`ON`] | `OFF`. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples. First in `${TRT_LIB_DIR}`, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available. - `BUILD_PLUGINS`: Specify if the plugins should be built, for example [`ON`] | `OFF`. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples. First in `${TRT_LIB_DIR}`, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available. - `BUILD_SAMPLES`: Specify if the samples should be built, for example [`ON`] | `OFF`. - `BUILD_SAFE_SAMPLES`: Specify if safety samples should be built, for example [`ON`] | `OFF`. - `TRT_SAFETY_INFERENCE_ONLY`: Specify if only build the safety inference components, for example [`ON`] | `OFF`. If turned ON, all other components will be turned OFF except `BUILD_SAFE_SAMPLES`. - `TRT_PLATFORM_ID`: Bare-metal build (unlike containerized cross-compilation). Currently supported options: `x86_64` (default). - `TRT_BUILD_ENABLE_MULTIDEVICE`: Enable the multi-device sample (`sampleDistCollective`). Use `-DTRT_BUILD_ENABLE_MULTIDEVICE=ON` to build it; requires [NCCL](https://developer.nvidia.com/nccl/nccl-download) >= v2.19, < v3.0. - `TRT_BUILD_TESTING` : Build gTests for samples. Requires [gtest](https://github.com/google/googletest) if available; otherwise fetches googletest at configure time. ## Building TensorRT DriveOS Samples - Generate Makefiles and build **Example: Cross-Compile for DOS7 Linux (aarch64)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DBUILD_SAMPLES=ON -DBUILD_PLUGINS=OFF -DBUILD_PARSERS=OFF -DTRT_OUT_DIR=`pwd`/bin_dynamic_cross -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64_dos_cross.toolchain make -j$(nproc) ``` **Example: Cross-Compile for DOS6.5 Linux (aarch64)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DBUILD_SAMPLES=ON -DBUILD_PLUGINS=OFF -DBUILD_PARSERS=OFF -DTRT_OUT_DIR=`pwd`/bin_dynamic_cross -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64_dos_cross.toolchain -DCUDA_VERSION=11.4 -DCMAKE_CUDA_ARCHITECTURES=87 make -j$(nproc) ``` **Example: Native build for DOS6.5 and DOS7 Linux (aarch64)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain -DBUILD_SAMPLES=ON -DBUILD_PLUGINS=OFF -DBUILD_PARSERS=OFF make -j$(nproc) ``` **Example: Cross-Compile for DOS6.5 QNX (aarch64)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build export CUDA_VERSION=11.4 export CUDA=cuda-$CUDA_VERSION export CUDA_ROOT=/usr/local/cuda-safe-$CUDA_VERSION export QNX_BASE=/drive/toolchains/qnx_toolchain # Set to your QNX toolchain installation path export QNX_HOST=$QNX_BASE/host/linux/x86_64/ export QNX_TARGET=$QNX_BASE/target/qnx7/ export PATH=$PATH:$QNX_HOST/usr/bin cmake .. -DBUILD_SAMPLES=ON -DBUILD_PLUGINS=OFF -DBUILD_PARSERS=OFF -DBUILD_SAFE_SAMPLES=OFF -DCMAKE_CUDA_COMPILER=$CUDA_ROOT/bin/nvcc -DTRT_OUT_DIR=`pwd`/bin_dynamic_cross -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_qnx.toolchain -DCUDA_VERSION=$CUDA_VERSION -DCMAKE_CUDA_ARCHITECTURES=87 make -j$(nproc) ``` > NOTE: Set `QNX_BASE` to your QNX toolchain installation path. > If your CUDA version is not the same as in the example, set `CUDA_VERSION` (for examples that use it in multiple places) or add `-DCUDA_VERSION=<version>` to the cmake command. **Example: Cross-Compile for DOS6.5 QNX Safety (aarch64)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build export CUDA_VERSION=11.4 export QNX_BASE=/drive/toolchains/qnx_toolchain # Set to your QNX toolchain installation path export QNX_HOST=$QNX_BASE/host/linux/x86_64/ export QNX_TARGET=$QNX_BASE/target/qnx7/ export PATH=$PATH:$QNX_HOST/usr/bin export CUDA=cuda-$CUDA_VERSION export CUDA_ROOT=/usr/local/cuda-safe-$CUDA_VERSION cmake .. -DBUILD_SAMPLES=OFF -DBUILD_SAFE_SAMPLES=ON -DBUILD_PLUGINS=OFF -DBUILD_PARSERS=OFF -DTRT_SAFETY_INFERENCE_ONLY=ON -DTRT_OUT_DIR=`pwd`/bin_dynamic_cross -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_qnx_safe.toolchain -DCUDA_VERSION=$CUDA_VERSION -DCMAKE_CUDA_COMPILER=$CUDA_ROOT/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=87 make -j$(nproc) ``` > NOTE: Set `QNX_BASE` to your QNX toolchain installation path. > If your CUDA version is not the same as in the example, set `CUDA_VERSION` (for examples that use it in multiple places) or add `-DCUDA_VERSION=<version>` to the cmake command. **Example: Cross-Compile for DOS7 QNX (aarch64)** ```bash cd $TRT_OSSPATH mkdir -p build && cd build export CUDA_VERSION=13.2 export CUDA=cuda-$CUDA_VERSION export CUDA_ROOT=/usr/local/cuda-safe-$CUDA_VERSION export QNX_BASE=/drive/toolchains/qnx_toolchain # Set to your QNX toolchain installation path export QNX_HOST=$QNX_BASE/host/linux/x86_64/ export QNX_TARGET=$QNX_BASE/target/qnx/ export PATH=$PATH:$QNX_HOST/usr/bin cmake .. -DBUILD_SAMPLES=ON -DBUILD_PLUGINS=OFF -DBUILD_PARSERS=OFF -DBUILD_SAFE_SAMPLES=OFF -DCMAKE_CUDA_COMPILER=$CUDA_ROOT/bin/nvcc -DTRT_OUT_DIR=`pwd`/bin_dynamic_cross -DTRT_LIB_DIR=$TRT_LIBPATH -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_qnx.toolchain -DCUDA_VERSION=$CUDA_VERSION -DCMAKE_CUDA_ARCHITECTURES=110 make -j$(nproc) ``` > NOTE: Set `QNX_BASE` to your QNX toolchain installation path. > If your CUDA version is not the same as in the example, set `CUDA_VERSION` (for examples that use it in multiple places) or add `-DCUDA_VERSION=<version>` to the cmake command. # References ## TensorRT Resources - [TensorRT Developer Home](https://developer.nvidia.com/tensorrt) - [TensorRT QuickStart Guide](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html) - [TensorRT Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html) - [TensorRT Sample Support Guide](https://docs.nvidia.com/deeplearning/tensorrt/sample-support-guide/index.html) - [TensorRT ONNX Tools](https://docs.nvidia.com/deeplearning/tensorrt/index.html#tools) - [TensorRT Discussion Forums](https://devtalk.nvidia.com/default/board/304/tensorrt/) - [TensorRT Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html) ## Known Issues - Please refer to [TensorRT Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes)

ML Frameworks Code Editors & IDEs

13.1K Github Stars

Open Source

cutlass

![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition") # Overview # CUTLASS 4.5.2 _CUTLASS 4.5.2 - May 2026_ CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions. Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications. CUTLASS has been providing CUDA C++ template abstractions for high-performance linear algebra since 2017 and these abstractions provide extensive support for a wide range of computations including mixed-precision computations, specialized data-movement (async copy) and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, [FP32 emulation via tensor core instruction](https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm), 8b floating point types (e5m2 and e4m3), block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8), narrow integer types (4 and 8b signed and unsigned integers), and binary 1b data types (where architectures allow for the native support of such data types) across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures. To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise. Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy. CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations targeting the programmable, high-throughput _Tensor Cores_ implemented by NVIDIA's Ampere, Hopper, and Blackwell architectures. We believe it will become an indispensable tool for students, researchers, and performance engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel designs, and bringing optimized solutions into production. CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025. To get started quickly - please refer : - [CUTLASS C++ Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html). - [CuTe DSL Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html). # What's New in CUTLASS 4.5 ## CuTe DSL * New features - New Block API `block_copy()` to simplify TMA and S2T copy. Users can ignore detail about multicast and 2CTA partition for TMA by `block_copy()` and need not to invoke `tma_partition()`. And users can remove bulk of S2T initialization to simplify S2T copy. - MXF8F6F4 mixed precision support - BlockScaled MMA now supports MXF8*MXF4 or MXF8*MXF6 - Block Scaled MMA for SM120 now works on Spark - EFC broadcast semantics support - EFC epilogue functions can now broadcast and remap tensor modes via `C.remap_modes[:, 0, 1]` subscript syntax (where `:` marks a broadcast dimension and integers select source mode indices). Covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations. - Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy - dataclasses.dataclass is now supported for JIT compilaton and cute.compile for both plain and tvm-ffi path - cute.copy now supports user specified loop unrolling - Python 3.14t is now supported with GIL enabled * Bug fixing and improvements - Improved source code correlation for profiling/debugging - Fixed an aarch64 segfault issue with tvm-ffi - Re-organization for CuTe DSL examples/tutorials for better discoverability - Fixed following issues: https://github.com/NVIDIA/cutlass/issues/3219 https://github.com/NVIDIA/cutlass/issues/3218 https://github.com/NVIDIA/cutlass/issues/3212 https://github.com/NVIDIA/cutlass/issues/3210 https://github.com/NVIDIA/cutlass/issues/3208 https://github.com/NVIDIA/cutlass/issues/3201 https://github.com/NVIDIA/cutlass/issues/3227 https://github.com/NVIDIA/cutlass/issues/3240 https://github.com/NVIDIA/cutlass/issues/3241 - Fixed Jax int64 stride divisibility issue - Fixed issues for SM120 blockscaled MMAs - added missing MXFP8MMAOP and MXF8F6F4MMAOP for sm120. * More examples of authorizing peak-performance kernels - MOE examles - A new style of grouped-gemm that aligns to torch's grouped_mm and scaled_groued_mm interface. - Expert-wise tensormap descriptor setup by a cheap helper kernel (~2us) to avoid long latency in tile switching, kernel structure is much more closer to a normal GEMM. - Compared to torch_210_cu13, very few problem has worse perf in B200. - mxfp8_2dx3d: avg 1.29 speedup; - mxfp8_2dx2d: avg 1.41 speedup; - nvfp4_2dx3d: avg 1.11 speedup; - nvfp4_2dx2d: avg 1.12 speedup (worst case 0.98) - bf16_2dx3d: avg 1.15 speedup (worst case 0.98) - bf16_2dx2d: avg 1.17 speedup (worst case 0.96) - Note: The perf is measured from torch profiler, this impl includes the helper kernel + main kernel, while torch's includes its setup kernel and cutlass_cpp main kernel. * API changes - ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from blackwell_helpers.py. Please specify a_dtype and b_dtype separately instead. ## CUTLASS C++ * Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels. - Mixed TMA+CpAsync can now accept static, but non trivial cluster shapes. - Uses TMA multicast for A tile when using non-trivial cluster size along N mode. - Uses an additional barrier (mma_trampoline_barrier) to track cp.async arrivals in both CTAs. - Changes included in [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm). * Add support for 128x32xK and 128x64xK tile sizes for SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement on Blackwell SM121 related kernels. * Add static load to tensor memory support, included in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/). * Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation. * Add [example 95](https://github.com/NVIDIA/cutlass/tree/main/examples/95_blackwell_gemm_green_context) to support green context SM partition - Enables launching GEMM on stream with partial SM allocation. * Add [Snake](https://github.com/NVIDIA/cutlass/blob/main/test/unit/epilogue/thread/activation.cu#L409) activation functor for EVT. * Fix SM100 F8F6F4 SS MMA (1SM and 2SM) traits to use typed op templates. * Add UE8M0 (uniform exponent distribution) initialization support in tensor fill utilities. * Add `cvt.rn.bf16x2.e4m3x2` conversion instruction support to `numeric_conversion.h`. * Update [example 93](https://github.com/NVIDIA/cutlass/tree/main/examples/93_blackwell_low_latency_gqa) with paged KV cache support for Blackwell low-latency GQA. * Fix some kernel issues: - Fix l2_capacity=0 handling in Blackwell SM100/SM120 kernel templates - Fix CUTLASS clang build issues - Remove `PipelineStorage` shadowing in SM100 complex epilogue - Fix build issue in SM90 epilogue fusion visitor TMA warpspecialized - Fix missing convert fucntion in EVT for fp4 kernels * Fix some profiler issues: - Add missing reference kernels for blockwise GEMM profiler. - Avoid instantiate 2sm tma kernels where ctaN is none power of 64 when ctaN > 128 in profiler. Note: CUTLASS 4.x builds are known to be down on Windows platforms for all CUDA toolkits. CUTLASS team is working on a fix. **See the [CHANGELOG](https://docs.nvidia.com/cutlass/latest/CHANGELOG.html) for details of all past releases and updates.** # Performance CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical throughput. The figure below shows CUTLASS 3.8's performance as a % of theoretical peak utilization on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU. ![ALT](media/images/cutlass-3.8-blackwell-gemm-peak-performance.svg "") The two figures below show the continual CUTLASS performance improvements on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since CUTLASS 3.1. CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads). Tensor Core operations are implemented using CUDA's [mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and [wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions. ![ALT](media/images/cutlass-3.5.1-gemm-peak-performance.png "") ![ALT](media/images/cutlass-3.5.1-gemm-peak-performance-fp8.png "") # CuTe CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data. CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations. The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning. CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design and improves code composability and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/cute/00_quickstart.html). # Compatibility Minimum requirements: - Architecture: Volta (compute capability 7.0) - Compiler: Must support at least C++17 - CUDA Toolkit version: 11.4 CUTLASS requires a C++17 host compiler and performs best when built with the [**CUDA 12.8 Toolkit**](https://developer.nvidia.com/cuda-downloads). It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions. ## Operating Systems We have tested the following environments. |**Operating System** | **Compiler** | |-----------------|----------| | Ubuntu 18.04 | GCC 7.5.0 | | Ubuntu 20.04 | GCC 10.3.0 | | Ubuntu 22.04 | GCC 11.2.0 | Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended. Note: CUTLASS 3.x builds are known to be down on Windows platforms for all CUDA toolkits. CUTLASS team is working on a fix. ## Hardware CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs. |**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**| |---|---|---| |NVIDIA V100 Tensor Core GPU |7.0|11.4| |NVIDIA TitanV |7.0|11.4| |NVIDIA GeForce RTX 20x0 series |7.5|11.4| |NVIDIA T4 |7.5|11.4| |NVIDIA A100 Tensor Core GPU |8.0|11.4| |NVIDIA A10 |8.6|11.4| |NVIDIA GeForce RTX 30x0 series |8.6|11.4| |NVIDIA GeForce RTX 40x0 series |8.9|11.8| |NVIDIA L40 |8.9|11.8| |NVIDIA H100 Tensor Core GPU |9.0|11.8| |NVIDIA H200 Tensor Core GPU |9.0|11.8| |NVIDIA B200 Tensor Core GPU |10.0|12.8| |NVIDIA B300 Tensor Core GPU |10.3|13.0| |NVIDIA DRIVE Thor |11.0|13.0| |NVIDIA GeForce RTX 50x0 series |12.0|12.8| |NVIDIA DGX Spark |12.1|13.0| ## Target Architecture In general, PTX code generated for one target architecture can be run on future architectures (i.e., it is forward compatible). However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose PTX does not have forward compatibility guarantees. Several Hopper and Blackwell PTX instructions fall under this category of architecture-accelerated features, and thus require a `sm_90a` or `sm100a` target architecture (note the "a" appended). For more details on this and other architecture-accelerated instructions, please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability). The target architecture information is passed on to CUTLASS via the cmake flag `CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100, users are required to build CUTLASS with `90a` as the target architecture. If a user accidentally builds a kernel which uses SM90a features (e.g. Hopper Tensor Core Instructions), using the SM90 target (note the lack of "a"), with either CUDA Toolkit 12 or 11.8, the kernel is expected to fail with a runtime error. ``` cmake .. -DCUTLASS_NVCC_ARCHS="90a" ``` Or ``` cmake .. -DCUTLASS_NVCC_ARCHS="100a" ``` Note: The NVIDIA Blackwell SM100 architecture used in the datacenter products has a different compute capability than the one underpinning NVIDIA Blackwell GeForce RTX 50 series GPUs (SM120). As a result, kernels compiled for Blackwell SM100 architecture with arch conditional features (using `sm100a`) are not compatible with RTX 50 series GPUs. Please refer to the [functionality documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/functionality.html) for details on which kernels require which target architectures. # Documentation CUTLASS is described in the following documents and the accompanying [Doxygen documentation](https://nvidia.github.io/cutlass). - [Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html) - basics of building and running CUTLASS - [Functionality](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/functionality.html) - summarizes functionality available in CUTLASS - [Efficient GEMM in CUDA](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/efficient_gemm.html) - describes how GEMM kernels may be implemented efficiently in CUDA - [CUTLASS 3.x Design](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/cutlass_3x_design.html) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components - [GEMM API 3.x](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/gemm_api_3x.html) - describes the CUTLASS 3.x GEMM model and C++ template concepts - [GEMM API 2.x](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/gemm_api.html) - describes the CUTLASS 2.x GEMM model and C++ template concepts - [Implicit GEMM Convolution](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/implicit_gemm_convolution.html) - describes 2-D and 3-D convolution in CUTLASS - [Code Organization](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/code_organization.html) - describes the organization and contents of the CUTLASS project - [Terminology](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/terminology.html) - describes terms used in the code - [Programming Guidelines](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/programming_guidelines.html) - guidelines for writing efficient modern CUDA C++ - [Fundamental types](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/fundamental_types.html) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays - [Layouts](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/layout.html) - describes layouts of matrices and tensors in memory - [Tile Iterators](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/tile_iterator_concept.html) - describes C++ concepts for iterating over tiles of matrices in memory - [CUTLASS Profiler](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html) - command-line driven profiling application - [CUTLASS Utilities](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/utilities.html) - additional templates used to facilitate rapid development - [Dependent kernel launch](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/dependent_kernel_launch.html) - describes a new feature in Hopper which allows overlapping dependent kernels in the same stream, and how it is used in CUTLASS. # Resources We have also described the structure of an efficient GEMM in our talk at the [GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf). - [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/) - [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/) - [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/) - [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/) - [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/) # Building CUTLASS CUTLASS is a header-only template library and does not need to be built to be used by other projects. Client applications should target CUTLASS's `include/` directory in their include paths. CUTLASS unit tests, examples, and utilities can be build with CMake. The minimum version of CMake is given in the [Quickstart guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html). Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed on your system. ```bash $ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc ``` Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0. To reduce compile time you can specify the architectures to build CUTLASS for by changing the CMake configuration setting `CUTLASS_NVCC_ARCHS`. ```bash $ mkdir build && cd build $ cmake .. -DCUTLASS_NVCC_ARCHS=80 # compiles for NVIDIA's Ampere Architecture ``` From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make. The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS, and they may be executed in parallel via make's `-j` command line argument. ```bash $ make test_unit -j ... ... ... [----------] Global test environment tear-down [==========] 946 tests from 57 test cases ran. (10812 ms total) [ PASSED ] 946 tests. ``` All tests should pass on supported platforms, though the exact number of tests may vary over time. # Project Structure CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests. [Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes, and template concepts defined in the CUTLASS project. A detailed explanation of the source code organization may be found in the [CUTLASS documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/code_organization.html), but several main components are summarized below. ## CUTLASS Template Library ``` include/ # client applications should target this directory in their build's include paths cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only arch/ # direct exposure of architecture features (including instruction-level GEMMs) conv/ # code specialized for convolution epilogue/ # code specialized for the epilogue of gemm/convolution gemm/ # code specialized for general matrix product computations layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory platform/ # CUDA-capable Standard Library components reduction/ # bandwidth-limited reduction kernels that do not fit the "gemm" model thread/ # simt code that can be performed within a CUDA thread transform/ # code specialized for layout, type, and domain transformations * # core vocabulary types, containers, and basic numeric operations cute/ # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy algorithm/ # Definitions of core operations such as copy, gemm, and operations on cute::tuples arch/ # Bare bones PTX wrapper structs for copy and math instructions atom/ # Meta-information either link to or built from arch/ operators mma_atom.hpp # cute::Mma_Atom and cute::TiledMma copy_atom.hpp # cute::Copy_Atom and cute::TiledCopy *sm*.hpp # Arch specific meta-information for copy and math operations * # Core library types such as Shape, Stride, Layout, Tensor, and associated operations ``` ### CUTLASS SDK Examples [CUTLASS SDK examples](https://github.com/NVIDIA/cutlass/tree/main/examples) apply CUTLASS templates to implement basic computations. ### Tools ``` tools/ library/ # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates include/ cutlass/ library/ profiler/ # CUTLASS Profiler - command-line utility for executing operations in the # CUTLASS Library util/ # CUTLASS Utilities - contains numerous helper classes for include/ # managing tensors in device memory, reference cutlass/ # implementations for GEMM, random initialization util/ # of tensors, and I/O. ``` ### Test The `test/unit/` directory consist of unit tests implemented with Google Test that demonstrate basic usage of Core API components and complete tests of the CUTLASS GEMM computations. Instructions for building and running the Unit tests are described in the [Quickstart guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html). # Performance Profiling The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels. It can be built as follows: ```bash $ make cutlass_profiler -j16 ``` ## Building all GEMM and Convolution kernels (_long_ build times) By default, only one tile size is instantiated for each data type, math instruction, and layout. To instantiate all, set the following environment variable when running CMake from an empty `build/` directory. Beware, this results in *tens of thousands* of kernels and long build times. This would also result in a large binary size and on some platforms linker to fail on building the library. Therefore, it's highly recommended to generate only a subset of kernels as demonstrated in the sub-section below. ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all ... $ make cutlass_profiler -j16 ``` ## Building a subset of GEMM and Convolution kernels (_reduced_ build times) To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one or a subset of kernels for NVIDIA Ampere and Turing architecture: ### Building a subset Tensor Core GEMM kernels To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line: ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8 ... $ make cutlass_profiler -j16 ``` Example command line for profiling a subset of Tensor Core GEMM kernels is as follows: ```bash ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096 ... ============================= Problem ID: 1 Provider: CUTLASS OperationKind: gemm Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8 Status: Success Verification: ON Disposition: Passed reference_device: Passed cuBLAS: Passed Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \ --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \ --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \ --max_cc=1024 Bytes: 118489088 bytes FLOPs: 115992428544 flops Runtime: 1.55948 ms Memory: 70.7616 GiB/s Math: 74378.8 GFLOP/s ============================= ... ``` ### Building one CUDA Core GEMM kernel To compile one SGEMM kernel targeting NVIDIA Ampere and Turing architecture, use the below cmake command line: ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1 ... $ make cutlass_profiler -j16 ``` Example command line for profiling single SGEMM CUDA kernel is as follows: ```bash $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096 ============================= Problem ID: 1 Provider: CUTLASS OperationKind: gemm Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1 Status: Success Verification: ON Disposition: Passed cuBLAS: Passed Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \ --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \ --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024 Bytes: 180355072 bytes FLOPs: 115992428544 flops Runtime: 6.73655 ms Memory: 24.934 GiB/s Math: 17218.4 GFLOP/s ============================= ``` ### Building a subset of Tensor Core Convolution kernels To compile a subset of Tensor core convolution kernels implementing forward propagation (fprop) with FP32 accumulation and FP16 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line: ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16 ... $ make cutlass_profiler -j16 ``` Example command line for profiling a subset of Tensor Core convolution kernels is as follows: ```bash $ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 ... ============================= Problem ID: 1 Provider: CUTLASS OperationKind: conv2d Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc Status: Success Verification: ON Disposition: Passed reference_device: Passed Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \ --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \ --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \ --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \ --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024 Bytes: 1130659840 bytes FLOPs: 118482796544 flops Runtime: 0.711496 ms Memory: 1479.99 GiB/s Math: 166526 GFLOP/s ============================= ... ``` ### Building one Convolution CUDA kernel To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with F32 accumulation and FP32 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line: ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc ... $ make cutlass_profiler -j16 ``` Example command line for profiling one CUDA Core convolution kernel: ```bash $ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 ============================= Problem ID: 1 Provider: CUTLASS OperationKind: conv2d Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc Status: Success Verification: ON Disposition: Passed reference_device: Passed Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \ --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \ --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \ --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \ --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024 Bytes: 2055798784 bytes FLOPs: 118482796544 flops Runtime: 7.34266 ms Memory: 260.752 GiB/s Math: 16136.2 GFLOP/s ============================= ``` ## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler - Please follow the links for more CMake examples on selectively compiling CUTLASS kernels: - [GEMM CMake Examples](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html#gemm-cmake-examples) - [Implicit GEMM convolution CMake Examples](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html#convolution-cmake-examples) - [Further details about the CUTLASS Profiler are described here.](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html) # About CUTLASS is released by NVIDIA Corporation as Open Source software under the [3-clause "New" BSD license](LICENSE.txt). # Contributors The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md). # Copyright Copyright (c) 2017 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: BSD-3-Clause ``` Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ```

ML Frameworks Code Editors & IDEs

9.9K Github Stars

Open Source

pix2pixHD

Synthesizing and manipulating 2048x1024 images with conditional GANs

ML Frameworks Image Editing

6.9K Github Stars

Open Source

nccl

# NCCL Optimized primitives for inter-GPU communication. ## Introduction NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications. For more information on NCCL usage, please refer to the [NCCL documentation](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html). ## Build Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds. To build the library : ```shell $ cd nccl $ make -j src.build ``` If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with : ```shell $ make src.build CUDA_HOME=<path to cuda install> ``` NCCL will be compiled and installed in `build/` unless `BUILDDIR` is set. By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining `NVCC_GENCODE` (defined in `makefiles/common.mk`) to only include the architecture of the target platform : ```shell $ make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90" ``` ## Install To install NCCL on the system, create a package then install it as root. Debian/Ubuntu : ```shell $ # Install tools to create debian packages $ sudo apt install build-essential devscripts debhelper fakeroot $ # Build NCCL deb package $ make pkg.debian.build $ ls build/pkg/deb/ ``` RedHat/CentOS : ```shell $ # Install tools to create rpm packages $ sudo yum install rpm-build rpmdevtools $ # Build NCCL rpm package $ make pkg.redhat.build $ ls build/pkg/rpm/ ``` OS-agnostic tarball : ```shell $ make pkg.txz.build $ ls build/pkg/txz/ ``` Python wheel : ```shell $ # Install uv to create the Python wheel (uv manages Python deps in a venv) $ # See: https://docs.astral.sh/uv/getting-started/installation/ $ curl -LsSf https://astral.sh/uv/install.sh | sh $ # Build NCCL Python wheel (this also builds the .txz archive as an intermediate) $ make pkg.python_wheel.build $ ls build/pkg/python_wheel/ ``` ## Tests Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests. ```shell $ git clone https://github.com/NVIDIA/nccl-tests.git $ cd nccl-tests $ make $ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus> ``` ## Copyright All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

ML Frameworks Code Editors & IDEs

4.8K Github Stars

Open Source

nim-anywhere

# NVIDIA NIM Anywhere [![Clone Me with AI Workbench](https://img.shields.io/badge/Open_In-AI_Workbench-76B900)](https://ngc.nvidia.com/open-ai-workbench/aHR0cHM6Ly9naXRodWIuY29tL05WSURJQS9uaW0tYW55d2hlcmUK) [![NVIDIA: LLM NIM](https://img.shields.io/badge/NVIDIA-LLM%20NIM-green?logo=nvidia&logoColor=white&color=%2376B900)](https://docs.nvidia.com/nim/#large-language-models) [![NVIDIA: Embedding NIM](https://img.shields.io/badge/NVIDIA-Embedding%20NIM-green?logo=nvidia&logoColor=white&color=%2376B900)](https://docs.nvidia.com/nim/#nemo-retriever) [![NVIDIA: Reranker NIM](https://img.shields.io/badge/NVIDIA-Reranker%20NIM-green?logo=nvidia&logoColor=white&color=%2376B900)](https://docs.nvidia.com/nim/#nemo-retriever) [![CI Pipeline Status](https://github.com/nvidia/nim-anywhere/actions/workflows/ci.yml/badge.svg?query=branch%3Amain)](https://github.com/NVIDIA/nim-anywhere/actions/workflows/ci.yml?query=branch%3Amain) ![Python: 3.10 | 3.11 | 3.12](https://img.shields.io/badge/Python-3.10%20|%203.11%20|%203.12-yellow?logo=python&logoColor=white&color=%23ffde57) Please join \#cdd-nim-anywhere slack channel if you are a internal user, open an issue if you are external for any question and feedback. One of the primary benefit of using AI for Enterprises is their ability to work with and learn from their internal data. Retrieval-Augmented Generation ([RAG](https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/)) is one of the best ways to do so. NVIDIA has developed a set of micro-services called [NIM micro-service](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) to help our partners and customers build effective RAG pipeline with ease. NIM Anywhere contains all the tooling required to start integrating NIMs for RAG. It natively scales out to full-sized labs and up to production environments. This is great news for building a RAG architecture and easily adding NIMs as needed. If you're unfamiliar with RAG, it dynamically retrieves relevant external information during inference without modifying the model itself. Imagine you're the tech lead of a company with a local database containing confidential, up-to-date information. You don’t want OpenAI to access your data, but you need the model to understand it to answer questions accurately. The solution is to connect your language model to the database and feed them with the information. To learn more about why RAG is an excellent solution for boosting the accuracy and reliability of your generative AI models, [read this blog](https://developer.nvidia.com/blog/enhancing-rag-applications-with-nvidia-nim/). Get started with NIM Anywhere now with the [quick-start](#quick-start) instructions and build your first RAG application using NIMs! ![NIM Anywhere Screenshot](.static/_static/nim-anywhere.png) - [Quick-start](#quick-start) - [Generate your NGC Personal Key](#generate-your-ngc-personal-key) - [Authenticate with Docker](#authenticate-with-docker) - [Install AI Workbench](#install-ai-workbench) - [Download this project](#download-this-project) - [Configure this project](#configure-this-project) - [Start This Project](#start-this-project) - [Populating the Knowledge Base](#populating-the-knowledge-base) - [Developing Your Own Applications](#developing-your-own-applications) - [Application Configuration](#application-configuration) - [Config from a file](#config-from-a-file) - [Config from a custom file](#config-from-a-custom-file) - [Config from env vars](#config-from-env-vars) - [Chain Server config schema](#chain-server-config-schema) - [Chat Frontend config schema](#chat-frontend-config-schema) - [Contributing](#contributing) - [Code Style](#code-style) - [Updating the frontend](#updating-the-frontend) - [Updating documentation](#updating-documentation) - [Managing your Development Environment](#managing-your-development-environment) - [Environment Variables](#environment-variables) - [Python Environment Packages](#python-environment-packages) - [Operating System Configuration](#operating-system-configuration) - [Updating Dependencies](#updating-dependencies) # Quick-start ## Generate your NGC Personal Key To allow AI Workbench to access NVIDIA’s cloud resources, you’ll need to provide it with a Personal Key. These keys begin with `nvapi-`. <details> <summary> Expand this section for instructions for creating this key. </summary> 1. Go to the [NGC Personal Key Manager](https://org.ngc.nvidia.com/setup/personal-keys). If you are prompted to, then register for a new account and sign in. > **HINT** You can find this tool by logging into > [ngc.nvidia.com](https://ngc.nvidia.com), expanding your profile > menu on the top right, selecting *Setup*, and then selecting > *Generate Personal Key*. 2. Select *Generate Personal Key*. ![Generate Personal Key](.static/_static/generate_personal_key.png) 3. Enter any value as the Key name, an expiration of 12 months is fine, and select all the services. Press *Generate Personal Key* when you are finished. ![Personal Key Form](.static/_static/personal_key_form.png) 4. Save your personal key for later. Workbench will need it and there is no way to retrieve it later. If the key is lost, a new one must be created. Protect this key as if it were a password. ![Personal Key](.static/_static/personal_key.png) </details> ## Authenticate with Docker Workbench will use your system's Docker client to pull NVIDIA NIM containers, so before continuing, make sure to follow these steps to authenticate your Docker client with your NGC Personal Key. 1. Run the following Docker login command ``` bash docker login nvcr.io ``` 2. When prompted for your credentials, use the following values: - Username: `$oauthtoken` - Password: Use your NGC Personal key beggining with `nv-api` ## Install AI Workbench This project is designed to be used with [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/). While this is not a requirement, running this demo without AI Workbench will require manual work as the pre-configured automation and integrations may not be available. This quick start guide will assume a remote lab machine is being used for development and the local machine is a thin-client for remotely accessing the development machine. This allows for compute resources to stay centrally located and for developers to be more portable. Note, the remote lab machine must run Ubuntu, but the local client can run Windows, MacOS, or Ubuntu. To install this project local only, simply skip the remote install. ``` mermaid flowchart LR local subgraph lab environment remote-lab-machine end local <-.ssh.-> remote-lab-machine ``` ### Client Machine Install Ubuntu is required if the local client will also be used for developent. When using a remote lab machine, this can be Windows, MacOS, or Ubuntu. <details> <summary> Expand this section for a Windows install. </summary> For full instructions, see the [NVIDIA AI Workbench User Guide](https://docs.nvidia.com/ai-workbench/user-guide/latest/installation/windows.html). 1. Install Prerequisite Software 1. If this machine has an NVIDIA GPU, ensure the GPU drivers are installed. It is recommended to use the [GeForce Experience](https://www.nvidia.com/en-us/geforce/geforce-experience/) tooling to manage the GPU drivers. 2. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) for local container support. Please be mindful of Docker Desktop's licensing for enterprise use. [Rancher Desktop](https://rancherdesktop.io/) may be a viable alternative. 3. *\[OPTIONAL\]* If Visual Studio Code integration is desired, install [Visual Studio Code](https://code.visualstudio.com/). 2. Download the [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/) installer and execute it. Authorize Windows to allow the installer to make changes. 3. Follow the instructions in the installation wizard. If you need to install WSL2, authorize Windows to make the changes and reboot local machine when requested. When the system restarts, the NVIDIA AI Workbench installer should automatically resume. 4. Select Docker as your container runtime. 5. Log into your GitHub Account by using the *Sign in through GitHub.com* option. 6. Enter your git author information if requested. </details> <details> <summary> Expand this section for a MacOS install. </summary> For full instructions, see the [NVIDIA AI Workbench User Guide](https://docs.nvidia.com/ai-workbench/user-guide/latest/installation/macos.html). 1. Install Prerequisite Software 1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) for local container support. Please be mindful of Docker Desktop's licensing for enterprise use. [Rancher Desktop](https://rancherdesktop.io/) may be a viable alternative. 2. *\[OPTIONAL\]* If Visual Studio Code integration is desired, install [Visual Studio Code](https://code.visualstudio.com/). When using VSCode on a Mac, an a[dditional step must be performed](https://code.visualstudio.com/docs/setup/mac#_launching-from-the-command-line) to install the VSCode CLI interface used by Workbench. 2. Download the [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/) disk image (*.dmg* file) and open it. 3. Drag AI Workbench into the Applications folder and run *NVIDIA AI Workbench* from the application launcher. ![Mac DMG Install Interface](.static/_static/mac_dmg_drag.png) 4. Select Docker as your container runtime. 5. Log into your GitHub Account by using the *Sign in through GitHub.com* option. 6. Enter your git author information if requested. </details> <details> <summary> Expand this section for an Ubuntu install. </summary> For full instructions, see the [NVIDIA AI Workbench User Guide](https://docs.nvidia.com/ai-workbench/user-guide/latest/installation/ubuntu-local.html). Run this installation as the user who will be user Workbench. Do not run these steps as `root`. 1. Install Prerequisite Software 1. *\[OPTIONAL\]* If Visual Studio Code integration is desired, install [Visual Studio Code](https://code.visualstudio.com/). 2. Download the [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/) installer, make it executable, and then run it. You can make the file executable with the following command: ``` bash chmod +x NVIDIA-AI-Workbench-*.AppImage ``` 3. AI Workbench will install the NVIDIA drivers for you (if needed). You will need to reboot your local machine after the drivers are installed and then restart the AI Workbench installation by double-clicking the NVIDIA AI Workbench icon on your desktop. 4. Select Docker as your container runtime. 5. Log into your GitHub Account by using the *Sign in through GitHub.com* option. 6. Enter your git author information if requested. </details> ### Remote Machine Install Only Ubuntu is supported for remote machines. <details> <summary> Expand this section for a remote Ubuntu install. </summary> For full instructions, see the [NVIDIA AI Workbench User Guide](https://docs.nvidia.com/ai-workbench/user-guide/latest/installation/ubuntu-remote.html). Run this installation as the user who will be using Workbench. Do not run these steps as `root`. 1. Ensure SSH Key based authentication is enabled from the local machine to the remote machine. If this is not currently enabled, the following commands will enable this is most situations. Change `REMOTE_USER` and `REMOTE-MACHINE` to reflect your remote address. - From a Windows local client, use the following PowerShell: ``` powershell ssh-keygen -f "C:\Users\local-user\.ssh\id_rsa" -t rsa -N '""' type $env:USERPROFILE\.ssh\id_rsa.pub | ssh REMOTE_USER@REMOTE-MACHINE "cat >> .ssh/authorized_keys" ``` - From a MacOS or Linux local client, use the following shell: ``` bash if [ ! -e ~/.ssh/id_rsa ]; then ssh-keygen -f ~/.ssh/id_rsa -t rsa -N ""; fi ssh-copy-id REMOTE_USER@REMOTE-MACHINE ``` 2. SSH into the remote host. Then, use the following commands to download and execute the NVIDIA AI Workbench Installer. ``` bash mkdir -p $HOME/.nvwb/bin && \ curl -L https://workbench.download.nvidia.com/stable/workbench-cli/$(curl -L -s https://workbench.download.nvidia.com/stable/workbench-cli/LATEST)/nvwb-cli-$(uname)-$(uname -m) --output $HOME/.nvwb/bin/nvwb-cli && \ chmod +x $HOME/.nvwb/bin/nvwb-cli && \ sudo -E $HOME/.nvwb/bin/nvwb-cli install ``` 3. AI Workbench will install the NVIDIA drivers for you (if needed). You will need to reboot your remote machine after the drivers are installed and then restart the AI Workbench installation by re-running the commands in the previous step. 4. Select Docker as your container runtime. 5. Log into your GitHub Account by using the *Sign in through GitHub.com* option. 6. Enter your git author information if requested. 7. Once the remote installation is complete, the Remote Location can be added to the local AI Workbench instance. Open the AI Workbench application, click *Add Remote Location*, and then enter the required information. When finished, click *Add Location*. - \*Location Name: \* Any short name for this new location - \*Description: \* Any brief metadata for this location. - \*Hostname or IP Address: \* The hostname or address used to remotely SSH. If step 1 was followed, this should be the same as `REMOTE-MACHINE`. - \*SSH Port: \* Usually left blank. If a nonstandard SSH port is used, it can be configured here. - \*SSH Username: \* The username used for making an SSH connection. If step 1 was followed, this should be the same as `REMOTE_USER`. - \*SSH Key File: \* The path to the private key for making SSH connections. If step 1 was followed, this should be: `/home/USER/.ssh/id_rsa`. - \*Workbench Directory: \* Usually left blank. This is where Workbench will remotely save state. </details> ## Download this project There are two ways to download this project for local use: Cloning and Forking. Cloning this repository is the recommended way to start. This will not allow for local modifications, but is the fastest to get started. This also allows for the easiest way to pull updates. Forking this repository is recommended for development as changes will be able to be saved. However, to get updates, the fork maintainer will have to regularly pull from the upstream repo. To work from a fork, follow [GitHub's instructions](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) and then reference the URL to your personal fork in the rest of this section. <details> <summary> Expand this section for a details on downloading this project. </summary> 1. Open the local NVIDIA AI Workbench window. From the list of locations displayed, select either the remote one you just set up, or local if you're going to work locally. ![AI Workbench Locations Menu](.static/_static/nvwb_locations.png) 2. Once inside the location, select *Clone Project*. ![AI Workbench Projects Menu](.static/_static/nvwb_projects.png) 3. In the 'Clone Project' pop up window, set the Repository URL to `https://github.com/NVIDIA/nim-anywhere.git`. You can leave the Path as the default of `/home/REMOTE_USER/nvidia-workbench/nim-anywhere.git`. Click *Clone*.\` ![AI Workbench Clone Project Menu](.static/_static/nvwb_clone.png) 4. You will be redirected to the new project’s page. Workbench will automatically bootstrap the development environment. You can view real-time progress by expanding the Output from the bottom of the window. ![AI Workbench Log Viewer](.static/_static/nvwb_logs.png) </details> ## Configure this project The project must be configured to use your NGC personal key. <details> <summary> Expand this section for a details on configuring this project. </summary> 1. Before running for the first time, your NGC personal key must be configured in Workbench. This is done using the *Environment* tab from the left-hand panel. ![AI Workbench Side Menu](.static/_static/nvwb_left_menu.png) 2. Scroll down to the **Secrets** section and find the *NGC_API_KEY* entry. Press *Configure* and provide the personal key for NGC that was generated earlier. </details> ## Start This Project Even the most basic of LLM Chains depend on a few additional microservices. These can be ignored during development for in-memory alternatives, but then code changes are required to go to production. Thankfully, Workbench manages those additional microservices for development environments. <details> <summary> Expand this section for details on starting the demo application. </summary> > **HINT:** For each application, the debug output can be monitored in > the UI by clicking the Output link in the lower left corner, selecting > the dropdown menu, and choosing the application of interest (or > **Compose** for applications started via compose). Since you can either pull NIMs and run them locally, or utilize the endpoints from *ai.nvidia.com* you can run this project with *or* without GPUs. 1. The applications bundled in this workspace can be controlled by navigating to two tabs: - **Environment** \> **Compose** - **Environment** \> **Applications** 2. First, navigate to the **Environment** \> **Compose** tab. If you're not working in an environment with GPUs, you can just click **Start** to run the project using a lightweight deployment. This default configuration will run the following containers: - *Milvus Vector DB*: An unstructured knowledge base - *Redis*: Used to store conversation histories 3. If you have access to GPU resources and want to run any NIMs locally, use the dropdown menu under **Compose** and select which set of NIMs you want to run locally. Note that you *must* have at least 1 available GPU per NIM you plan to run locally. Below is an outline of the available configurations: - Local LLM (min 1 GPU required) - The first time the LLM NIM is started, it will take some time to download the image and the optimized models. - During a long start, to confirm the LLM NIM is starting, the progress can be observed by viewing the logs by using the *Output* pane on the bottom left of the UI. - If the logs indicate an authentication error, that means the provided *NGC_API_KEY* does not have access to the NIMs. Please verify it was generated correctly and in an NGC organization that has NVIDIA AI Enterprise support or trial. - If the logs appear to be stuck on `..........: Pull complete`. `..........: Verifying complete`, or `..........: Download complete`; this is all normal output from Docker that the various layers of the container image have been downloaded. - Any other failures here need to be addressed. - Local LLM + Embedding (min 2 GPUs required) - Local LLM + Embedding + Reranking (min 3 GPUs required) > **NOTE:** > > - Each profile will also run *Milvus Vector DB* and *Redis* > - Due to the nature of Docker Compose profiles, the UI will let > you select multiple profiles at the same time. In the context of > this project, selecting multiple profiles does not make sense. > It will not cause any errors, however we recommend only > selecting one profile at a time for simplicity. 4. Once the compose services have been started, navigate to the **Environment** \> **Applications** tab. Now, the *Chain Server* can safely be started. This contains the custom LangChain code for performing our reasoning chain. By default, it will use the local Milvus and Redis, but use *ai.nvidia.com* for LLM, Embedding, and Reranking model inferencing. 5. Once the *Chain Server* is up, the *Chat Frontend* can be started. Starting the interface will automatically open it in a browser window. If you are running any local NIMs, you can edit the config to connect to them via the *Chat Frontend* ![NIM Anywhere Frontend](.static/_static/na_frontend.png) </details> ## Populating the Knowledge Base To get started developing demos, a sample dataset is provided along with a Jupyter Notebook showing how data is ingested into a Vector Database. 1. To import PDF documentation into the vector Database, open Jupyter using the app launcher in AI Workbench. 2. Use the Jupyter Notebook at `code/upload-pdfs.ipynb` to ingest the default dataset. If using the default dataset, no changes are necessary. 3. If using a custom dataset, upload it to the `data/` directory in Jupyter and modify the provided notebook as necessary. # Developing Your Own Applications This project contains applications for a few demo services as well as integrations with external services. These are all orchestrated by [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/). The demo services are all in the `code` folder. The root level of the code folder has a few interactive notebooks meant for technical deep dives. The Chain Server is a sample application utilizing NIMs with LangChain. (Note that the Chain Server here gives you the option to experiment with and without RAG). The Chat Frontend folder contains an interactive UI server for exercising the chain server. Finally, sample notebooks are provided in the Evaluation directory to demonstrate retrieval scoring and validation. ``` mermaid mindmap root((AI Workbench)) Demo Services Chain Server LangChain + NIMs Frontend Interactive Demo UI Evaluation Validate the results Notebooks Advanced usage Integrations RedisConversation History MilvusVector Database LLM NIMOptimized LLMs ``` # Application Configuration The Chain Server can be configured with either a configuration file or environment variables. ## Config from a file By default, the application will search for a configuration file in all of the following locations. If multiple configuration files are found, values from lower files in the list will take precedence. - ./config.yaml - ./config.yml - ./config.json - ~/app.yaml - ~/app.yml - ~/app.json - /etc/app.yaml - /etc/app.yml - /etc/app.json ## Config from a custom file An additional config file path can be specified through an environment variable named `APP_CONFIG`. The value in this file will take precedence over all the default file locations. ``` bash export APP_CONFIG=/etc/my_config.yaml ``` ## Config from env vars Configuration can also be set using environment variables. The variable names will be in the form: `APP_FIELD__SUB_FIELD` Values specified as environment variables will take precedence over all values from files. ## Chain Server config schema ``` yaml # Your API key for authentication to AI Foundation. # ENV Variables: NGC_API_KEY, NVIDIA_API_KEY, APP_NVIDIA_API_KEY # Type: string, null nvidia_api_key: ~ # The Data Source Name for your Redis DB. # ENV Variables: APP_REDIS_DSN # Type: string redis_dsn: redis://localhost:6379/0 llm_model: # The name of the model to request. # ENV Variables: APP_LLM_MODEL__NAME # Type: string name: meta/llama3-8b-instruct # The URL to the model API. # ENV Variables: APP_LLM_MODEL__URL # Type: string url: https://integrate.api.nvidia.com/v1 embedding_model: # The name of the model to request. # ENV Variables: APP_EMBEDDING_MODEL__NAME # Type: string name: nvidia/nv-embedqa-e5-v5 # The URL to the model API. # ENV Variables: APP_EMBEDDING_MODEL__URL # Type: string url: https://integrate.api.nvidia.com/v1 reranking_model: # The name of the model to request. # ENV Variables: APP_RERANKING_MODEL__NAME # Type: string name: nv-rerank-qa-mistral-4b:1 # The URL to the model API. # ENV Variables: APP_RERANKING_MODEL__URL # Type: string url: https://integrate.api.nvidia.com/v1 milvus: # The host machine running Milvus vector DB. # ENV Variables: APP_MILVUS__URL # Type: string url: http://localhost:19530 # The name of the Milvus collection. # ENV Variables: APP_MILVUS__COLLECTION_NAME # Type: string collection_name: collection_1 log_level: ``` ## Chat Frontend config schema The chat frontend has a few configuration options as well. They can be set in the same manner as the chain server. ``` yaml # The URL to the chain on the chain server. # ENV Variables: APP_CHAIN_URL # Type: string chain_url: http://localhost:3030/ # The url prefix when this is running behind a proxy. # ENV Variables: PROXY_PREFIX, APP_PROXY_PREFIX # Type: string proxy_prefix: / # Path to the chain server's config. # ENV Variables: APP_CHAIN_CONFIG_FILE # Type: string chain_config_file: ./config.yaml log_level: ``` # Contributing All feedback and contributions to this project are welcome. When making changes to this project, either for personal use or for contributing, it is recommended to work on a fork on this project. Once the changes have been completed on the fork, a Merge Request should be opened. ## Code Style This project has been configured with Linters that have been tuned to help the code remain consistent while not being overly burdensome. We use the following Linters: - Bandit is used for security scanning - Pylint is used for Python Syntax Linting - MyPy is used for type hint linting - Black is configured for code styling - A custom check is run to ensure Jupyter Notebooks do not have any output - Another custom check is run to ensure the README.md file is up to date The embedded VSCode environment is configured to run the linting and checking in realtime. To manually run the linting that is done by the CI pipelines, execute `/project/code/tools/lint.sh`. Individual tests can be run be specifying them by name: `/project code/tools/lint.sh [deps|pylint|mypy|black|docs|fix]`. Running the lint tool in fix mode will automatically correct what it can by running Black, updating the README, and clearing the cell output on all Jupyter Notebooks. ## Updating the frontend The frontend has been designed in an effort to minimize the required HTML and Javascript development. A branded and styled Application Shell is provided that has been created with vanilla HTML, Javascript, and CSS. It is designed to be easy to customize, but it should never be required. The interactive components of the frontend are all created in Gradio and mounted in the app shell using iframes. Along the top of the app shell is a menu listing the available views. Each view may have its own layout consisting of one or a few pages. ### Creating a new page Pages contain the interactive components for a demo. The code for the pages is in the `code/frontend/pages` directory. To create a new page: 1. Create a new folder in the pages directory 2. Create an `__init__.py` file in the new directory that uses Gradio to define the UI. The Gradio Blocks layout should be defined in a variable called `page`. 3. It is recommended that any CSS and JS files needed for this view be saved in the same directory. See the `chat` page for an example. 4. Open the `code/frontend/pages/__init__.py` file, import the new page, and add the new page to the `__all__` list. > **NOTE:** Creating a new page will not add it to the frontend. It must > be added to a view to appear on the Frontend. ### Adding a view View consist of one or a few pages and should function independently of each other. Views are all defined in the `code/frontend/server.py` module. All declared views will automatically be added to the Frontend's menu bar and made available in the UI. To define a new view, modify the list named `views`. This is a list of `View` objects. The order of the objects will define their order in the Frontend menu. The first defined view will be the default. View objects describe the view name and layout. They can be declared as follow: ``` python my_view = frontend.view.View( name="My New View", # the name in the menu left=frontend.pages.sample_page, # the page to show on the left right=frontend.pages.another_page, # the page to show on the right ) ``` All of the page declarations, `View.left` or `View.right`, are optional. If they are not declared, then the associated iframes in the web layout will be hidden. The other iframes will expand to fill the gaps. The following diagrams show the various layouts. - All pages are defined ``` mermaid block-beta columns 1 menu["menu bar"] block columns 2 left right end ``` - Only left is defined ``` mermaid block-beta columns 1 menu["menu bar"] block columns 1 left:1 end ``` ### Frontend branding The frontend contains a few branded assets that can be customized for different use cases. #### Logo The frontend contains a logo on the top left of the page. To modify the logo, an SVG of the desired logo is required. The app shell can then be easily modified to use the new SVG by modifying the `code/frontend/_assets/index.html` file. There is a single `div` with an ID of `logo`. This box contains a single SVG. Update this to the desired SVG definition. ``` html <div id="logo" class="logo"> <svg viewBox="0 0 164 30">...</svg> </div> ``` #### Color scheme The styling of the App Shell is defined in `code/frontend/_static/css/style.css`. The colors in this file may be safely modified. The styling of the various pages are defined in `code/frontend/pages/*/*.css`. These files may also require modification for custom color schemes. #### Gradio theme The Gradio theme is defined in the file `code/frontend/_assets/theme.json`. The colors in this file can safely be modified to the desired branding. Other styles in this file may also be changed, but may cause breaking changes to the frontend. The [Gradio documentation](https://www.gradio.app/guides/theming-guide) contains more information on Gradio theming. ### Messaging between pages > **NOTE:** This is an advanced topic that most developers will never > require. Occasionally, it may be necessary to have multiple pages in a view that communicate with each other. For this purpose, Javascript's `postMessage` messaging framework is used. Any trusted message posted to the application shell will be forwarded to each iframe where the pages can handle the message as desired. The `control` page uses this feature to modify the configuration of the `chat` page. The following will post a message to the app shell (`window.top`). The message will contain a dictionary with the key `use_kb` and a value of true. Using Gradio, this Javascript can be executed by [any Gradio event](https://www.gradio.app/guides/custom-CSS-and-JS#adding-custom-java-script-to-your-demo). ``` javascript window.top.postMessage({"use_kb": true}, '*'); ``` This message will automatically be sent to all pages by the app shell. The following sample code will consume the message on another page. This code will run asynchronously when a `message` event is received. If the message is trusted, a Gradio component with the `elem_id` of `use_kb` will be updated to the value specified in the message. In this way, the value of a Gradio component can be duplicated across pages. ``` javascript window.addEventListener( "message", (event) => { if (event.isTrusted) { use_kb = gradio_config.components.find((element) => element.props.elem_id == "use_kb"); use_kb.props.value = event.data["use_kb"]; }; }, false); ``` ## Updating documentation The README is rendered automatically; direct edits will be overwritten. In order to modify the README you will need to edit the files for each section separately. All of these files will be combined and the README will be automatically generated. You can find all of the related files in the `docs` folder. Documentation is written in Github Flavored Markdown and then rendered to a final Markdown file by Pandoc. The details for this process are defined in the Makefile. The order of files generated are defined in `docs/_TOC.md`. The documentation can be previewed in the Workbench file browser window. ### Header file The header file is the first file used to compile the documentation. This file can be found at `docs/_HEADER.md`. The contents of this file will be written verbatim, without any manipulation, to the README before anything else. ### Summary file The summary file contains quick description and graphic that describe this project. The contents of this file will be added to the README immediately after the header and just before the table of contents. This file is processed by Pandoc to embed images before writing to the README. ### Table of Contents file The most important file for the documentation is the table of contents file at `docs/_TOC.md`. This file defines a list of files that should be concatenated in order to generate the final README manual. Files must be on this list to be included. ### Static Content Save all static content, including images, to the `_static` folder. This will help with organization. ### Dynamic documentation It may be helpful to have documents that update and write themselves. To create a dynamic document, simply create an executable file that writes the Markdown formatted document to stdout. During build time, if an entry in the table of contents file is executable, it will be executed and its stdout will be used in its place. ### Rendering documentation When a documentation related commit is pushed, a GitHub Action will render the documentation. Any changes to the README will be automatially committed. # Managing your Development Environment ## Environment Variables Most of the configuration for the development environment happens with Environment Variables. To make permanent changes to environment variables, modify [`variables.env`](./variables.env) or use the Workbench UI. ## Python Environment Packages This project uses one Python environment at `/usr/bin/python3` and dependencies are managed with `pip`. Because all development is done inside a container, any changes to the Python environment will be ephemeral. To permanently install a Python package, add it to the [`requirements.txt`](./requirements.txt) file or use the Workbench UI. ## Operating System Configuration The development environment is based on Ubuntu 22.04. The primary user has password-less sudo access, but all changes to the system will be ephemeral. To make permanent changes to installed packages, add them to the \[`apt.txt`\] file. To make other changes to the operating system such as manipulating files, adding environment variables, etc; use the [`postBuild.bash`](./postBuild.bash) and [`preBuild.bash`](./preBuild.bash) files. ## Updating Dependencies It is typically good practice to update dependencies monthly to ensure no CVEs are exposed through misused dependencies. The following process can be used to patch this project. It is recommended to run the regression testing after the patch to ensure nothing has broken in the update. 1. **Update Environment:** In the workbench GUI, open the project and navigate to the Environment pane. Check if there is an update available for the base image. If an updated base image is available, apply the update and rebuild the environment. Address any build errors. Ensure that all of the applications can start. 2. **Update Python Packages and NIMs:** The Python dependencies and NIM applications can be updated automatically by running the `/project/code/tools/bump.sh` script. 3. **Update Remaining applications:** For the remaining applications, manually check their default tag and compare to the latest. Update where appropriate and ensure that the applications still start up successfully. 4. **Restart and rebuild the environment.** 5. **Audit Python Environment:** It is now best to check the installed versions of ALL Python packages, not just the direct dependencies. To accomplish this, run `/project/code/tools/audit.sh`. This script will print out a report of all Python packages in a warning state and all packages in an error state. Anything in an error state must be resolved as it will have active CVEs and known vulnerabilities. 6. **Check Dependabot Alerts:** Check all of the [Dependabot](https://github.com/NVIDIA/nim-anywhere/security/dependabot) alerts and ensure they should be resolved. 7. **Regression testing:** Run through the entire demo, from document ingesting to the frontend, and ensure it is still functional and that the GUI looks correct.

ML Frameworks Vector Databases

241 Github Stars

Open Source

MinkowskiEngine

[pypi-image]: https://badge.fury.io/py/MinkowskiEngine.svg [pypi-url]: https://pypi.org/project/MinkowskiEngine/ [pypi-download]: https://img.shields.io/pypi/dm/MinkowskiEngine [slack-badge]: https://img.shields.io/badge/slack-join%20chats-brightgreen [slack-url]: https://join.slack.com/t/minkowskiengine/shared_invite/zt-piq2x02a-31dOPocLt6bRqOGY3U_9Sw # Minkowski Engine [![PyPI Version][pypi-image]][pypi-url] [![pypi monthly download][pypi-download]][pypi-url] [![slack chat][slack-badge]][slack-url] The Minkowski Engine is an auto-differentiation library for sparse tensors. It supports all standard neural network layers such as convolution, pooling, unpooling, and broadcasting operations for sparse tensors. For more information, please visit [the documentation page](http://nvidia.github.io/MinkowskiEngine/overview.html). ## News - 2021-08-11 Docker installation instruction added - 2021-08-06 All installation errors with pytorch 1.8 and 1.9 have been resolved. - 2021-04-08 Due to recent errors in [pytorch 1.8 + CUDA 11](https://github.com/NVIDIA/MinkowskiEngine/issues/330), it is recommended to use [anaconda for installation](#anaconda). - 2020-12-24 v0.5 is now available! The new version provides CUDA accelerations for all coordinate management functions. ## Example Networks The Minkowski Engine supports various functions that can be built on a sparse tensor. We list a few popular network architectures and applications here. To run the examples, please install the package and run the command in the package root directory. | Examples | Networks and Commands | |:---------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Semantic Segmentation | <img src="https://nvidia.github.io/MinkowskiEngine/_images/segmentation_3d_net.png"> <img src="https://nvidia.github.io/MinkowskiEngine/_images/segmentation.png" width="256"> `python -m examples.indoor` | | Classification | ![](https://nvidia.github.io/MinkowskiEngine/_images/classification_3d_net.png) `python -m examples.classification_modelnet40` | | Reconstruction | <img src="https://nvidia.github.io/MinkowskiEngine/_images/generative_3d_net.png"> <img src="https://nvidia.github.io/MinkowskiEngine/_images/generative_3d_results.gif" width="256"> `python -m examples.reconstruction` | | Completion | <img src="https://nvidia.github.io/MinkowskiEngine/_images/completion_3d_net.png"> `python -m examples.completion` | | Detection | <img src="https://nvidia.github.io/MinkowskiEngine/_images/detection_3d_net.png"> | ## Sparse Tensor Networks: Neural Networks for Spatially Sparse Tensors Compressing a neural network to speedup inference and minimize memory footprint has been studied widely. One of the popular techniques for model compression is pruning the weights in convnets, is also known as [*sparse convolutional networks*](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf). Such parameter-space sparsity used for model compression compresses networks that operate on dense tensors and all intermediate activations of these networks are also dense tensors. However, in this work, we focus on [*spatially* sparse data](https://arxiv.org/abs/1409.6070), in particular, spatially sparse high-dimensional inputs and 3D data and convolution on the surface of 3D objects, first proposed in [Siggraph'17](https://wang-ps.github.io/O-CNN.html). We can also represent these data as sparse tensors, and these sparse tensors are commonplace in high-dimensional problems such as 3D perception, registration, and statistical data. We define neural networks specialized for these inputs as *sparse tensor networks* and these sparse tensor networks process and generate sparse tensors as outputs. To construct a sparse tensor network, we build all standard neural network layers such as MLPs, non-linearities, convolution, normalizations, pooling operations as the same way we define them on a dense tensor and implemented in the Minkowski Engine. We visualized a sparse tensor network operation on a sparse tensor, convolution, below. The convolution layer on a sparse tensor works similarly to that on a dense tensor. However, on a sparse tensor, we compute convolution outputs on a few specified points which we can control in the [generalized convolution](https://nvidia.github.io/MinkowskiEngine/sparse_tensor_network.html). For more information, please visit [the documentation page on sparse tensor networks](https://nvidia.github.io/MinkowskiEngine/sparse_tensor_network.html) and [the terminology page](https://nvidia.github.io/MinkowskiEngine/terminology.html). | Dense Tensor | Sparse Tensor | |:---------------------------------------------------------------------------:|:----------------------------------------------------------------------------:| | <img src="https://nvidia.github.io/MinkowskiEngine/_images/conv_dense.gif"> | <img src="https://nvidia.github.io/MinkowskiEngine/_images/conv_sparse.gif"> | -------------------------------------------------------------------------------- ## Features - Unlimited high-dimensional sparse tensor support - All standard neural network layers (Convolution, Pooling, Broadcast, etc.) - Dynamic computation graph - Custom kernel shapes - Multi-GPU training - Multi-threaded kernel map - Multi-threaded compilation - Highly-optimized GPU kernels ## Requirements - Ubuntu >= 14.04 - CUDA >= 10.1.243 and **the same CUDA version used for pytorch** (e.g. if you use conda cudatoolkit=11.1, use CUDA=11.1 for MinkowskiEngine compilation) - pytorch >= 1.7 To specify CUDA version, please use conda for installation. You must match the CUDA version pytorch uses and CUDA version used for Minkowski Engine installation. `conda install -y -c nvidia -c pytorch pytorch=1.8.1 cudatoolkit=10.2`) - python >= 3.6 - ninja (for installation) - GCC >= 7.4.0 ## Installation You can install the Minkowski Engine with `pip`, with anaconda, or on the system directly. If you experience issues installing the package, please checkout the [the installation wiki page](https://github.com/NVIDIA/MinkowskiEngine/wiki/Installation). If you cannot find a relevant problem, please report the issue on [the github issue page](https://github.com/NVIDIA/MinkowskiEngine/issues). - [PIP](https://github.com/NVIDIA/MinkowskiEngine#pip) installation - [Conda](https://github.com/NVIDIA/MinkowskiEngine#anaconda) installation - [Python](https://github.com/NVIDIA/MinkowskiEngine#system-python) installation - [Docker](https://github.com/NVIDIA/MinkowskiEngine#docker) installation ### Pip The MinkowskiEngine is distributed via [PyPI MinkowskiEngine][pypi-url] which can be installed simply with `pip`. First, install pytorch following the [instruction](https://pytorch.org). Next, install `openblas`. ``` sudo apt install build-essential python3-dev libopenblas-dev pip install torch ninja pip install -U MinkowskiEngine --install-option="--blas=openblas" -v --no-deps # For pip installation from the latest source # pip install -U git+https://github.com/NVIDIA/MinkowskiEngine --no-deps ``` If you want to specify arguments for the setup script, please refer to the following command. ``` # Uncomment some options if things don't work # export CXX=c++; # set this if you want to use a different C++ compiler # export CUDA_HOME=/usr/local/cuda-11.1; # or select the correct cuda version on your system. pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \ # \ # uncomment the following line if you want to force cuda installation # --install-option="--force_cuda" \ # \ # uncomment the following line if you want to force no cuda installation. force_cuda supercedes cpu_only # --install-option="--cpu_only" \ # \ # uncomment the following line to override to openblas, atlas, mkl, blas # --install-option="--blas=openblas" \ ``` ### Anaconda MinkowskiEngine supports both CUDA 10.2 and cuda 11.1, which work for most of latest pytorch versions. #### CUDA 10.2 We recommend `python>=3.6` for installation. First, follow [the anaconda documentation](https://docs.anaconda.com/anaconda/install/) to install anaconda on your computer. ``` sudo apt install g++-7 # For CUDA 10.2, must use GCC < 8 # Make sure `g++-7 --version` is at least 7.4.0 conda create -n py3-mink python=3.8 conda activate py3-mink conda install openblas-devel -c anaconda conda install pytorch=1.9.0 torchvision cudatoolkit=10.2 -c pytorch -c nvidia # Install MinkowskiEngine export CXX=g++-7 # Uncomment the following line to specify the cuda home. Make sure `$CUDA_HOME/nvcc --version` is 10.2 # export CUDA_HOME=/usr/local/cuda-10.2 pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas" # Or if you want local MinkowskiEngine git clone https://github.com/NVIDIA/MinkowskiEngine.git cd MinkowskiEngine export CXX=g++-7 python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas ``` #### CUDA 11.X We recommend `python>=3.6` for installation. First, follow [the anaconda documentation](https://docs.anaconda.com/anaconda/install/) to install anaconda on your computer. ``` conda create -n py3-mink python=3.8 conda activate py3-mink conda install openblas-devel -c anaconda conda install pytorch=1.9.0 torchvision cudatoolkit=11.1 -c pytorch -c nvidia # Install MinkowskiEngine # Uncomment the following line to specify the cuda home. Make sure `$CUDA_HOME/nvcc --version` is 11.X # export CUDA_HOME=/usr/local/cuda-11.1 pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas" # Or if you want local MinkowskiEngine git clone https://github.com/NVIDIA/MinkowskiEngine.git cd MinkowskiEngine python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas ``` ### System Python Like the anaconda installation, make sure that you install pytorch with the same CUDA version that `nvcc` uses. ``` # install system requirements sudo apt install build-essential python3-dev libopenblas-dev # Skip if you already have pip installed on your python3 curl https://bootstrap.pypa.io/get-pip.py | python3 # Get pip and install python requirements python3 -m pip install torch numpy ninja git clone https://github.com/NVIDIA/MinkowskiEngine.git cd MinkowskiEngine python setup.py install # To specify blas, CXX, CUDA_HOME and force CUDA installation, use the following command # export CXX=c++; export CUDA_HOME=/usr/local/cuda-11.1; python setup.py install --blas=openblas --force_cuda ``` ### Docker ``` git clone https://github.com/NVIDIA/MinkowskiEngine cd MinkowskiEngine docker build -t minkowski_engine docker ``` Once the docker is built, check it loads MinkowskiEngine correctly. ``` docker run MinkowskiEngine python3 -c "import MinkowskiEngine; print(MinkowskiEngine.__version__)" ``` ## CPU only build and BLAS configuration (MKL) The Minkowski Engine supports CPU only build on other platforms that do not have NVidia GPUs. Please refer to [quick start](https://nvidia.github.io/MinkowskiEngine/quick_start.html) for more details. ## Quick Start To use the Minkowski Engine, you first would need to import the engine. Then, you would need to define the network. If the data you have is not quantized, you would need to voxelize or quantize the (spatial) data into a sparse tensor. Fortunately, the Minkowski Engine provides the quantization function (`MinkowskiEngine.utils.sparse_quantize`). ### Creating a Network ```python import torch.nn as nn import MinkowskiEngine as ME class ExampleNetwork(ME.MinkowskiNetwork): def __init__(self, in_feat, out_feat, D): super(ExampleNetwork, self).__init__(D) self.conv1 = nn.Sequential( ME.MinkowskiConvolution( in_channels=in_feat, out_channels=64, kernel_size=3, stride=2, dilation=1, bias=False, dimension=D), ME.MinkowskiBatchNorm(64), ME.MinkowskiReLU()) self.conv2 = nn.Sequential( ME.MinkowskiConvolution( in_channels=64, out_channels=128, kernel_size=3, stride=2, dimension=D), ME.MinkowskiBatchNorm(128), ME.MinkowskiReLU()) self.pooling = ME.MinkowskiGlobalPooling() self.linear = ME.MinkowskiLinear(128, out_feat) def forward(self, x): out = self.conv1(x) out = self.conv2(out) out = self.pooling(out) return self.linear(out) ``` ### Forward and backward using the custom network ```python # loss and network criterion = nn.CrossEntropyLoss() net = ExampleNetwork(in_feat=3, out_feat=5, D=2) print(net) # a data loader must return a tuple of coords, features, and labels. coords, feat, label = data_loader() input = ME.SparseTensor(feat, coordinates=coords) # Forward output = net(input) # Loss loss = criterion(output.F, label) ``` ## Discussion and Documentation For discussion and questions, please use `[email protected]`. For API and general usage, please refer to the [MinkowskiEngine documentation page](http://nvidia.github.io/MinkowskiEngine/) for more detail. For issues not listed on the API and feature requests, feel free to submit an issue on the [github issue page](https://github.com/NVIDIA/MinkowskiEngine/issues). ## Known Issues ### Specifying CUDA architecture list In some cases, you need to explicitly specify which compute capability your GPU uses. The default list might not contain your architecture. ```bash export TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX"; python setup.py install --force_cuda ``` ### Unhandled Out-Of-Memory thrust::system exception There is [a known issue](https://github.com/NVIDIA/thrust/issues/1448) in thrust with CUDA 10 that leads to an unhandled thrust exception. Please refer to the [issue](https://github.com/NVIDIA/MinkowskiEngine/issues/357) for detail. ### Too much GPU memory usage or Frequent Out of Memory There are a few causes for this error. 1. Out of memory during a long running training MinkowskiEngine is a specialized library that can handle different number of points or different number of non-zero elements at every iteration during training, which is common in point cloud data. However, pytorch is implemented assuming that the number of point, or size of the activations do not change at every iteration. Thus, the GPU memory caching used by pytorch can result in unnecessarily large memory consumption. Specifically, pytorch caches chunks of memory spaces to speed up allocation used in every tensor creation. If it fails to find the memory space, it splits an existing cached memory or allocate new space if there's no cached memory large enough for the requested size. Thus, every time we use different number of point (number of non-zero elements) with pytorch, it either split existing cache or reserve new memory. If the cache is too fragmented and allocated all GPU space, it will raise out of memory error. **To prevent this, you must clear the cache at regular interval with `torch.cuda.empty_cache()`.** ### CUDA 11.1 Installation ``` wget https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda_11.1.1_455.32.00_linux.run sudo sh cuda_11.1.1_455.32.00_linux.run --toolkit --silent --override # Install MinkowskiEngine with CUDA 11.1 export CUDA_HOME=/usr/local/cuda-11.1; pip install MinkowskiEngine -v --no-deps ``` ### Running the MinkowskiEngine on nodes with a large number of CPUs The MinkowskiEngine uses OpenMP to parallelize the kernel map generation. However, when the number of threads used for parallelization is too large (e.g. OMP_NUM_THREADS=80), the efficiency drops rapidly as all threads simply wait for multithread locks to be released. In such cases, set the number of threads used for OpenMP. Usually, any number below 24 would be fine, but search for the optimal setup on your system. ``` export OMP_NUM_THREADS=<number of threads to use>; python <your_program.py> ``` ## Citing Minkowski Engine If you use the Minkowski Engine, please cite: - [4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks, CVPR'19](https://arxiv.org/abs/1904.08755), [[pdf]](https://arxiv.org/pdf/1904.08755.pdf) ``` @inproceedings{choy20194d, title={4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks}, author={Choy, Christopher and Gwak, JunYoung and Savarese, Silvio}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, pages={3075--3084}, year={2019} } ``` For multi-threaded kernel map generation, please cite: ``` @inproceedings{choy2019fully, title={Fully Convolutional Geometric Features}, author={Choy, Christopher and Park, Jaesik and Koltun, Vladlen}, booktitle={Proceedings of the IEEE International Conference on Computer Vision}, pages={8958--8966}, year={2019} } ``` For strided pooling layers for high-dimensional convolutions, please cite: ``` @inproceedings{choy2020high, title={High-dimensional Convolutional Networks for Geometric Pattern Recognition}, author={Choy, Christopher and Lee, Junha and Ranftl, Rene and Park, Jaesik and Koltun, Vladlen}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, year={2020} } ``` For generative transposed convolution, please cite: ``` @inproceedings{gwak2020gsdn, title={Generative Sparse Detection Networks for 3D Single-shot Object Detection}, author={Gwak, JunYoung and Choy, Christopher B and Savarese, Silvio}, booktitle={European conference on computer vision}, year={2020} } ``` ## Unittest For unittests and gradcheck, use torch >= 1.7 ## Projects using Minkowski Engine Please feel free to update [the wiki page](https://github.com/NVIDIA/MinkowskiEngine/wiki/Usage) to add your projects! - [Projects using MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine/wiki/Usage) - Segmentation: [3D and 4D Spatio-Temporal Semantic Segmentation, CVPR'19](https://github.com/chrischoy/SpatioTemporalSegmentation) - Representation Learning: [Fully Convolutional Geometric Features, ICCV'19](https://github.com/chrischoy/FCGF) - 3D Registration: [Learning multiview 3D point cloud registration, CVPR'20](https://arxiv.org/abs/2001.05119) - 3D Registration: [Deep Global Registration, CVPR'20](https://arxiv.org/abs/2004.11540) - Pattern Recognition: [High-Dimensional Convolutional Networks for Geometric Pattern Recognition, CVPR'20](https://arxiv.org/abs/2005.08144) - Detection: [Generative Sparse Detection Networks for 3D Single-shot Object Detection, ECCV'20](https://arxiv.org/abs/2006.12356) - Image matching: [Sparse Neighbourhood Consensus Networks, ECCV'20](https://www.di.ens.fr/willow/research/sparse-ncnet/)

ML Frameworks

2.9K Github Stars

Open Source

farm

# Farm Job Scheduler Farm is a job scheduler for executing tasks as CLI programs or wrapped in containers. Groups of dependent tasks can be scheduled hierarchically as a Taskflow which are described as a Directed acyclic graph (DAG). All components of Farm are known to work in the following operating systems: - Linux - macOS - Windows It can be configured to execute tasks locally (via Agents), or schedule tasks to downstream schedulers such as K8s and [Cloud Tasks](https://docs.nvidia.com/cloud-functions/user-guide/latest/cloud-function/tasks.html). Farm can be run with all microservices executing within a single python process (stand-alone mode), or can be deployed into K8s as a scalable distributed system using the provided [Helm](helm/nv.svc.farm) charts. By default, when Farm is run in stand-alone mode data will be persisted to a local [SQLite](https://sqlite.org/) database. [MariaDB](https://mariadb.org/) is the recommended database for Helm chart deployments, although [Postgresql](https://www.postgresql.org/) is well supported. ## Quickstart - Install Python 3.12+, tox, poetry, then build and run a local stand-alone instance of Farm: ```shell pip3 install tox poetry env svc='python3 -m nv.svc.farm.standalone' make start ``` - Access the [local Farm OpenAPI docs](http://127.0.0.1:8222/docs). ## Included Documentation - [Omniverse Farm docs](/user-manual/) - The best place to get started, includes K8s and Omniverse integration docs. - [Modern Workflow docs](/user-docs/index.md) - A practical example user job submission that implements Taskflows (Directed Acyclic Graph) for job submission. # Extended Development Getting Started ## Prerequisites This project uses **[Python](https://www.python.org/)** minimum version 3.12, **[Poetry](https://python-poetry.org/)** to manage dependencies and **[Tox](https://tox.wiki/en/stable/)** to automate and standardize testing across multiple versions of Python. | Requirement | Description | Installation | ------------- | ----------- | ------------ | | [Python 3.12+](https://www.python.org/) | Interpreter | [Ubuntu](https://phoenixnap.com/kb/how-to-install-python-3-ubuntu) | | [Poetry](https://python-poetry.org/docs/) | Dependency Management / Packaging | [Pip](https://python-poetry.org/docs/#installation) | | [Tox](https://tox.wiki/en/latest/index.html) | Standardize testing across multiple Python versions | [Pip](https://tox.wiki/en/latest/installation.html#via-pip) | | [Make]() | Entrypoints to commands | | ## Quickstart - Clone this repository using `git clone` - From within the repository folder, create a virtual environment `make venv`. Note: if this command is not run before opening VS Code, VS Code will not be able to find your virtual environment and you may have to re-open it. - Open VSCode `code .` ## Development It is recommended that most development related commands be executed through `make` commands. This allows the CI system to use an interface rather than an implementation. For example, `make test` will internally call a `tox` target which knows how to run tests on your project allowing us to change how we run tests without modifying CI. Most `make` targets call `tox` commands which create and maintain a virtual environment, install required dependencies and optionally install your package. This is the recommended way to do your development as it allows the tooling to do most of the work. ### Exploring Make Targets The following is a list of commonly used `make` targets. To see the entire list refer to the `Makefile` at the root of your repository. Keep in mind that `tox` will create and manage its own virtual environment for you. | Command | Action | | :---------------- | :----------------------------------------------------- | | `start` | Starts the application locally | | `quicktest` | Runs available testcases | | `build` | Builds Python wheel | | `clean` | Cleans workspace / venvs | | `coverage-report` | Generates a code coverage report | | `docs` | Generates documenation site | | `docs-server` | Starts a live server, previewing your docs in realtime | | `check-format` | Checks the format of your code | | `fix-format` | Rectifies the output of `check-format` | | `e2e-up` | Starts E2E test environment (Docker Compose) | | `e2e-test` | Runs Playwright E2E tests | | `e2e-down` | Stops E2E test environment | ## Exploring pyproject.toml This file contains all configuration necessary for use with packaging tools, in our case `poetry`. To see more on what this file may contain consult the [packaging guide](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/). ## Dependency Manipulation The `pyproject.toml` file contains all of your projects dependencies logically grouped. - `tool.poetry.dependencies` - dependencies to be included for your final artifact - `tool.poetry.group.dev.dependencies` - dependencies included for development purposes only (flake8, pylint, etc) - `tool.poetry.group.docs.dependencies` - dependencies included for generating documentation ### Adding Dependencies Add a dependency to a group: `poetry add mdx-include --group docs` ### Removing Dependencies Remove a dependency from a group: `poetry remove markdown-callouts --group docs` ### Updating Dependencies Update all dependencies: `poetry update` ## Testing We use [pytest](https://docs.pytest.org/en/8.0.x/) to run test-cases which are located in the `tests` folder at the root of the project. ### Running Tests `make quicktest` - Creates an environment using tox, installs tests dependencies, installs your package and runs through your test-cases. Keep in mind you can still use `unittest` but some features may not be [compatible](https://docs.pytest.org/en/7.1.x/how-to/unittest.html). ## E2E Testing End-to-end tests use [Playwright](https://playwright.dev/) to test the full application stack including the dashboard UI. ### Prerequisites - Docker and Docker Compose - Node.js 20+ (for running Playwright locally) ### Running E2E Tests ```bash # Start the E2E environment (Farm + MySQL in Docker) make e2e-up # Run Playwright tests make e2e-test # Stop the environment when done make e2e-down ``` The E2E environment runs: - **Farm** service on `http://localhost:8222` - **Dashboard** at `http://localhost:8222/queue/management/dashboard/` - **MySQL** database on port 3306 ### Writing E2E Tests E2E tests are located in `dashboard-ui/e2e/`. To add new tests: 1. Create a new `.spec.ts` file in `dashboard-ui/e2e/` 2. Use Playwright's test API to interact with the dashboard 3. Run `make e2e-test` to verify Example test structure: ```typescript import { test, expect } from "@playwright/test"; test("can perform action", async ({ page }) => { await page.goto("/queue/management/dashboard/"); // ... test steps }); ``` ### Debugging E2E Tests Run Playwright in headed mode to see the browser: ```bash cd dashboard-ui && npx playwright test --headed ``` Run a specific test file: ```bash cd dashboard-ui && npx playwright test e2e/login.spec.ts ``` ## Starting the Application It is recommended to start your application using `make start` or by using the pre-configured launch configuration found in `./vscode/launch.json` See [Debugging with VSCode](https://code.visualstudio.com/docs/editor/debugging) for more information. ## Accessing OpenAPI Spec With your application running using `make start` or otherwise, the OpenAPI Spec can be found by navigating to the `/docs` endpoint. For example `http://localhost:8222/docs` ## Building the Container Image The `Dockerfile` used to build the image is located at the project's root. You can build an image using the command `docker build . -t "your-image-name:your-version"` and you can run this image by using the command `docker run -p 8222:8222 "your-image-name:your-version"` ## Versioning The `pyproject.toml` contains all the necessary information to build your Python artifact along with its version. This version should be updated each time you wish to merge into your main branch. The build system will build, package, push, and tag your repository with this version. ## Documentation This project is configured to produce documentation using `mkdocs` which will traverse your docstrings and generate a static HTML site. - Start a live docs server `make docs-server` - This starts a webserver that live updates on changes in your repository for preview - Generate static site `make docs` - This produces the static site ## VSCode All VSCode configuration can be found in the `.vscode` directory at the root of the application. Here you can find settings, launch configurations for debugging and recommended extensions.

Cron & Job Scheduling Artifact & Package Registries

18 Github Stars

Software by nvidia