# NVIDIA: What You're Looking For

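Most of NVIDIA's new technologies, including those slated to be open-sourced within the next six months, are showcased in one place called On-Demand. Why "On-Demand"? A quick dictionary lookup gives roughly "available whenever you ask for it". In other words, whatever technology you want or are curious about is probably here!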

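• Autonomous driving, robotics
• Big data, networking, visualization
• Data science
• Deep learning
• GPU programming
• Graphics, imaging, and design
• High-performance computing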
• Simulation and modeling

# TensorRT Quick Start Guide

We’ll walk you through the TensorRT Quick Start Guide. The newly-published TensorRT Quick Start Guide provides a quick introduction to new users starting out with TensorRT. It includes Jupyter notebooks and C++ examples of the most common TensorRT workflows and examples for using TensorRT with TensorFlow, PyTorch, and ONNX.

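TensorRT moves fast. While Lao Pan was still writing about TensorRT-7.2.3.4, TensorRT quietly shipped the EA (early access) release of version 8. As Lao Pan mentioned before, what came out then was the TensorRT EA release, and the GA (general availability) release followed just half a month later. That is a rapid pace! TensorRT 8 GA is now available, and both performance and ease of use have improved over the previous version.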

# Accelerate Deep Learning Inference with TensorRT 8.0

TensorRT is an SDK for high-performance deep learning inference used in production to minimize latency and maximize throughput. The upcoming TensorRT 8.0 release provides features such as sparsity optimized for NVIDIA Ampere GPUs, quantization-aware training, and enhanced compiler to accelerate transformer-based networks. Deep learning compilers need to have a robust method to import, optimize, and deploy models. New users can learn about the common workflow, while experienced users can learn more about new TensorRT 8.0 features.

TensorRT 8 has finally been released, and NVIDIA has promoted it heavily, with not only a blog post but also slides and accompanying course material.

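• Supports QAT (quantization-aware training), so models quantized during training in other frameworks can be imported directly into TensorRT
• On Ampere-architecture GPUs, supports sparse networks, which can improve throughput by 50%
• Better optimization for transformer-based networks such as BERT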

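TensorRT 8 changes quite a lot, as you would expect from a major release. For the details, start with the slides from this talk. Lao Pan will also cover it in depth later (a promise to keep, hehe).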

# Introduction to TensorRT and Triton: A Walkthrough of Optimizing Your First Deep Learning Inference Model

NVIDIA TensorRT is a deep learning platform that optimizes neural network models and speeds up inference across GPU-accelerated platforms running in the data center and embedded devices. We’ll provide an overview of TensorRT, show how to optimize a PyTorch model, and demonstrate how to deploy this highly optimized model using NVIDIA Triton Inference Server. By the end of this workshop, developers will see the substantial benefits of integrating TensorRT and get started on optimizing their own deep learning models.

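Triton is remarkably pleasant to use. As a server, Triton Inference Server offers the same features you would expect from other serving frameworks. Its supported backends include TensorRT, onnxruntime, libtorch, TensorFlow, PyTorch, and OpenVINO. It speaks HTTP and gRPC, and you can add a custom protocol (it is open source, after all). It also supports multiple GPUs, multiple model instances, and hot reloading of models.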
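To make the HTTP side concrete, here is a minimal sketch of the JSON body an HTTP inference request carries, following the KServe v2 inference protocol that Triton's HTTP endpoint implements. The server address, model name `resnet50_trt`, and input name `input` are hypothetical placeholders, and the helper only builds the request; it does not send anything.

```python
import json


def build_infer_request(server, model_name, input_name, shape, datatype, data):
    """Build the URL and JSON body for a v2-protocol inference request.

    Nothing is sent over the network; the caller can POST the body with any
    HTTP client. All argument values used below are illustrative only.
    """
    # The flattened data must contain exactly prod(shape) elements.
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(f"expected {expected} values, got {len(data)}")

    url = f"http://{server}/v2/models/{model_name}/infer"
    body = json.dumps({
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": datatype,
            "data": list(data),
        }]
    })
    return url, body


# Hypothetical model and input names; 8000 is Triton's default HTTP port.
url, body = build_infer_request(
    "localhost:8000", "resnet50_trt", "input", [1, 4], "FP32",
    [0.1, 0.2, 0.3, 0.4],
)
print(url)
print(body)
```

In practice the official Triton client libraries wrap this for you; the sketch is only meant to show that a request is a plain JSON document describing named, typed, shaped tensors.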

Features of the latest Triton release, 21.06:

# Quantization Aware Training in PyTorch with TensorRT 8.0

Quantization is used to improve latency and resource requirements of Deep Neural Networks during inference. Quantization Aware Training (QAT) improves accuracy of quantized networks by emulating quantization errors in the forward and backward passes during training. TensorRT 8.0 brings improved support for QAT with PyTorch, in conjunction with NVIDIA’s open-source pytorch-quantization toolkit. This session gives an overview of the improvements in QAT with TensorRT 8.0, and walks through an end-to-end usage example.

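TensorRT 8 can directly load a model that has been quantized with QAT and exported to ONNX, and NVIDIA also provides a companion quantization toolkit for PyTorch, so the whole workflow is covered in one go.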
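As a rough illustration of what "emulating quantization errors" means, the sketch below applies symmetric per-tensor fake quantization: values are rounded to an INT8 grid and immediately converted back to floats, so the rest of the network sees the rounding error. This is a plain-Python assumption-laden sketch, not the API of the pytorch-quantization toolkit; the function name and sample weights are made up for illustration.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to a signed integer grid and dequantize back to floats.

    Symmetric, per-tensor scaling: the largest magnitude maps to qmax.
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)                 # nothing to scale
    scale = amax / qmax
    result = []
    for v in values:
        q = round(v / scale)                # snap to the integer grid
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        result.append(q * scale)            # back to float, error included
    return result


weights = [0.5, -1.27, 0.003, 1.0]          # illustrative values only
dequantized = fake_quantize(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print(dequantized)
print(max(errors))
```

The rounding error of each value is bounded by half a quantization step; during QAT the network is trained with this error present, which is why it tends to hold up better after real INT8 deployment.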

# Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture

In this session, we’ll share details of Sparse Tensor Cores in the NVIDIA Ampere Architecture and the unique 2:4 sparse format they support. Learn how we’ve simplified maintaining accuracy when pruning all types of networks, including classification networks, language models, and GANs. Finally, find out how to accelerate your own workloads using Sparse Tensor Cores from start to finish with ASP and TensorRT 8.0 and cuSPARSELt.
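The 2:4 pattern mentioned above means that in every group of four consecutive weights, at most two are non-zero. A minimal sketch of producing that pattern by magnitude is shown below; it keeps the two largest-magnitude values in each group and zeroes the rest. This is only an illustration of the format, not the algorithm ASP uses, and the function name and sample row are hypothetical.

```python
def prune_2_of_4(row):
    """Zero out the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]   # illustrative weights
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Because the pattern is fixed per group of four, hardware can store only the two kept values plus small position metadata, which is what lets Sparse Tensor Cores skip the zeros.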

# Prototyping and Debugging Deep Learning Inference Models Using TensorRT’s ONNX-Graphsurgeon and Polygraphy Tools

Deep learning researchers and engineers usually have to spend a significant amount of time debugging accuracy and performance of their deep learning inference models before deploying them. TensorRT recently open-sourced some more tools to assist with the development and debugging of deep neural networks for inference. ONNX GraphSurgeon is a tool that allows you to easily generate new ONNX graphs, or modify existing ones. This can be useful in scenarios like using custom implementations for parts of the ONNX graph, in place of those provided by TensorRT. Polygraphy is a toolkit designed to assist in running and debugging deep learning models in various frameworks. It includes a Python API and several command-line tools built using this API. These tools allow displaying information about models, such as network structure; determining which layers of a TensorRT network need to be run in a higher precision for accuracy; and comparing inference results across frameworks, among other features.

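Polygraphy is a very powerful tool and is strongly recommended: it may well cut your debugging time at work in half. It has not been publicized much so far, and many people still do not know about it.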

• View the structure of an ONNX model: polygraphy inspect model mymodel.onnx
• View the structure of an engine: polygraphy inspect model mytrt.trt --model-type engine
• View the TensorRT network that would be built from an ONNX model: polygraphy inspect model mymodel.onnx --display-as=trt --mode basic
• Compare TensorRT and ONNX results
First, generate and save the ONNX Runtime outputs:
polygraphy run mymodel.onnx --onnxrt --save-outputs onnx_res.json
Then run the converted engine and compare against the saved outputs:
polygraphy run mytrt.trt --model-type engine --trt --load-outputs onnx_res.json --abs 1e-4
• Modify the ONNX graph
polygraphy surgeon sanitize modele2-nms.onnx \
--override-input-shapes input_name:[1,3,224,224] \
-o modele2-nms-static-shape.onnx
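Conceptually, the comparison step with a 1e-4 absolute tolerance asks whether every output element from one runtime is within that distance of the corresponding element from the other. The sketch below shows that idea in plain Python; it is an assumption about the general technique, not Polygraphy's actual implementation, and the output values are invented for illustration.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference, candidate, atol=1e-4):
    """True when every element differs by at most the absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol


# Made-up outputs standing in for ONNX Runtime and TensorRT results.
onnx_out = [0.12345, 0.50000, 0.99990]
trt_out = [0.12349, 0.50003, 0.99985]

print(max_abs_diff(onnx_out, trt_out))
print(outputs_match(onnx_out, trt_out, atol=1e-4))
print(outputs_match(onnx_out, trt_out, atol=1e-5))
```

A loose tolerance passes while a tight one fails on the same data, which is why the threshold should be chosen with the precision mode (FP32, FP16, INT8) in mind.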

# Achieve Best Inference Performance on NVIDIA GPUs by Combining TensorRT with TVM Compilation Using SageMaker Neo

Amazon SageMaker Neo allows customers to compile models from any framework for optimized inference on many compilation targets, including NVIDIA Jetson devices and T4 GPU instances. We’ll dive into the details of how Neo uses the open-source deep learning compiler TVM and NVIDIA TensorRT together to provide the best inference performance across popular deep learning model types.

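Combining TVM and TensorRT sounds powerful just thinking about it. Both are among the industry's top inference acceleration frameworks, so what happens when the two are put together?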

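Lao Pan has covered TVM before: it is an excellent deep learning compiler, and TensorRT needs no introduction. The combination works the way I imagined, through an integration or partitioning approach: part of the computation graph runs in TVM and part runs in TensorRT, each handling what it does best.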
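The partitioning idea can be sketched on a simplified model: treat the network as a linear chain of operators, and group consecutive operators by whether the accelerated backend supports them. The supported-operator set and operator names below are hypothetical, and real compilers partition a full graph rather than a chain, so this is an illustration of the concept only.

```python
from itertools import groupby

# Hypothetical set of operators the TensorRT side can handle.
TENSORRT_SUPPORTED = {"conv2d", "relu", "batch_norm", "dense"}


def partition(ops, supported=TENSORRT_SUPPORTED):
    """Split a linear op sequence into contiguous per-backend segments."""
    segments = []
    for on_trt, group in groupby(ops, key=lambda op: op in supported):
        backend = "tensorrt" if on_trt else "tvm"
        segments.append((backend, list(group)))
    return segments


# Illustrative model: one unsupported op splits the chain into three parts.
ops = ["conv2d", "batch_norm", "relu", "custom_nms", "dense", "relu"]
segments = partition(ops)
for backend, segment in segments:
    print(backend, segment)
```

Each switch between backends costs a hand-off, so fewer and larger segments are generally preferable to many small ones.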

# New Features in TRTorch, a PyTorch/TorchScript Compiler Targeting NVIDIA GPUs Using TensorRT

We’ll cover new features of TRTorch, a compiler for PyTorch and TorchScript that optimizes deep learning models for inference on NVIDIA GPUs. Programs are internally optimized using TensorRT but maintain full compatibility with standard PyTorch or TorchScript code. This allows users to continue to feel like they’re writing PyTorch code in their inference applications while fully leveraging TensorRT. We’ll discuss new capabilities enabled in recent releases of TRTorch, including direct integration into PyTorch and post-training quantization.

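TRTorch: the name seemed odd at first. After a closer look, the library turns out to be quite practical in certain scenarios, and the conversion path to TensorRT becomes:

• PyTorch -> TorchScript -> TensorRT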

# Low-Latency, High-Throughput Inferencing for Transformer-Based Models

Transformer-based models provide state-of-the-art accuracy for many NLP tasks. Recent models contain a large number of parameters, which makes meeting low latency requirements challenging for online inferencing. We’ll cover highly optimized inferencing solutions for transformer-based models to tackle online and offline inferencing scenarios. We’ll demonstrate that low latency and high throughput can be achieved with the combination of NVIDIA hardware and software. We’ll briefly go over BERT inferencing with FasterTransformer, TensorRT, and MXNet, and also present performance data from the latest NVIDIA GPUs.

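Transformers need little introduction: they are the best combined encoder-decoder architecture so far. Many models are built on the transformer, BERT being the most famous. Beyond NLP, any task that involves an encoding or decoding stage can use a transformer almost by default to improve accuracy. Transformers are accurate and map well onto GPU computation; their one shortcoming is that they are still not as fast as pure convolutional networks.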

TensorRT 8 brings deeper optimizations for transformer architectures and is worth trying:

TensorRT's open-source projects related to transformers are as follows:

# Inference with Tensorflow 2 Integrated with TensorRT Session

Learn how to inference using Tensorflow 2 with TensorRT integrated and the performance this can offer. Tensorflow is a machine learning platform and TensorRT is an SDK for high-performance deep learning inference using NVIDIA GPUs. Tensorflow models are usually written in FP32 precision to work for both training and inference. Tensorflow-TensorRT integration automatically offloads portions of the Tensorflow graph to run with TensorRT using precisions FP16 or INT8 to improve inference throughput without sacrificing much accuracy. We’ll describe: how to use Tensorflow-TensorRT integration in Tensorflow 2; the dynamic shape feature we recently added to better handle Tensorflow graph with unknown shapes; the lazy calibration mode we recently added to improve the workflow for inferencing with INT8 precision; some details on how Tensorflow-TensorRT works; and the performance benefits of using Tensorflow-TensorRT for inference.

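Lao Pan is not very familiar with TensorFlow 2, so there is not much to add here. For those who do use TensorFlow 2, though, accelerating with TensorRT has become more convenient; see the slides for more details.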

# Designing and Optimizing Deep Neural Networks for High-Throughput and Low-Latency Production Deployment

When integrating DNNs into applications the project teams need to consider much more than just model accuracy. Factors such as throughput affect the size and the cost of the infrastructure required to host the application. Similarly, latency of model response is important for a wide range of time-sensitive applications and a hard requirement when building safety-critical applications. We’ll discuss how to select efficient models that allow us to meet the throughput and latency requirements (including multitask DNNs) as well as key approaches for their further optimization, such as quantization-aware training, post-training quantization, pruning, distillation, and other forms of model compression. We’ll explain how those techniques interact with the GPU architecture. Finally, we’ll reprise key tools that can simplify the model optimization and deployment process, such as TensorRT or Triton Inference Server.

# Afterword

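The AI wave has never stopped. We need to keep exploring and following cutting-edge work on AI algorithms and AI deployment so that we do not fall behind the times.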

