
2080 Ti TensorFlow GPU Benchmarks - 2080 Ti vs V100 vs 1080 Ti vs Titan V

Oldpan, October 13, 2018

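NVIDIA's newest flagship, the RTX 2080 Ti, has been out for a while now.

It is still hard to buy, but it is a sizable step up from the previous-generation 1080 Ti.

So how large is the improvement? The rest of this article works through the numbers.

The comparison covers five of the newest and most powerful GPUs currently available: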

RTX 2080 Ti
RTX 2080
GTX 1080 Ti
Titan V
Tesla V100

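Summary of results

As of October 8, 2018, the NVIDIA RTX 2080 Ti is the best GPU for deep learning research on a single-GPU system running TensorFlow. In a typical single-GPU system, the 2080 Ti is:

37% faster than the 1080 Ti in FP32, 62% faster in FP16, and 25% more expensive.
35% faster than the 2080 in FP32, 47% faster in FP16, and 25% more expensive.
96% as fast as the Titan V in FP32, 3% faster in FP16, and about 1/2 the cost.
80% as fast as the Tesla V100 in FP32, 82% as fast in FP16, and about 1/5 the cost.

All experiments used Tensor Cores wherever they were available, since Tensor Core support is expected to become standard across deep learning frameworks. Costs are computed for a complete single-GPU system (multi-GPU setups are not considered). Lambda publishes a detailed summary table scoring the price/performance of these GPUs.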

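In-depth evaluation

To evaluate each GPU, we measure FP32 and FP16 throughput (training samples processed per second) while training common models on synthetic data. For each model, we then divide each GPU's throughput by the 1080 Ti's throughput on that same model, using the 1080 Ti as the baseline.

This normalizes the data and yields each GPU's per-model speedup over the 1080 Ti. Speedup measures the relative performance of two systems processing the same workload.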

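We then average the speedups over all models, separately for each precision:

[Figure: average FP32 speedup over the 1080 Ti]
[Figure: average FP16 speedup over the 1080 Ti]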
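
The normalization and averaging described above can be sketched in a few lines of Python. This is only an illustration, not the code Lambda used: the dictionary layout and the function names `speedups_vs_baseline` and `average_speedup` are assumptions made for this sketch. The throughput numbers are copied from the FP32 throughput table in the raw performance data section of this article.

```python
# FP32 training throughput (images/sec), copied from this article's FP32 table.
# The data layout and function names are illustrative assumptions.
FP32_THROUGHPUT = {
    "ResNet-50":   {"2080": 209.89,  "2080 Ti": 286.05,  "Titan V": 298.28,  "V100": 368.63,  "1080 Ti": 203.99},
    "ResNet-152":  {"2080": 82.78,   "2080 Ti": 110.24,  "Titan V": 110.13,  "V100": 131.69,  "1080 Ti": 82.83},
    "InceptionV3": {"2080": 141.9,   "2080 Ti": 189.31,  "Titan V": 204.35,  "V100": 242.7,   "1080 Ti": 130.2},
    "InceptionV4": {"2080": 61.6,    "2080 Ti": 81.0,    "Titan V": 78.64,   "V100": 90.6,    "1080 Ti": 56.98},
    "VGG16":       {"2080": 123.01,  "2080 Ti": 169.28,  "Titan V": 190.38,  "V100": 233.0,   "1080 Ti": 133.16},
    "AlexNet":     {"2080": 2567.38, "2080 Ti": 3550.11, "Titan V": 3729.64, "V100": 4707.67, "1080 Ti": 2720.59},
    "SSD300":      {"2080": 111.04,  "2080 Ti": 148.51,  "Titan V": 153.55,  "V100": 186.8,   "1080 Ti": 107.71},
}


def speedups_vs_baseline(throughput, baseline="1080 Ti"):
    """Divide every GPU's throughput by the baseline GPU's throughput, per model."""
    return {
        model: {gpu: value / scores[baseline] for gpu, value in scores.items()}
        for model, scores in throughput.items()
    }


def average_speedup(speedups):
    """Average each GPU's speedup over all models."""
    gpus = next(iter(speedups.values())).keys()
    return {
        gpu: sum(per_model[gpu] for per_model in speedups.values()) / len(speedups)
        for gpu in gpus
    }


per_model = speedups_vs_baseline(FP32_THROUGHPUT)
avg_fp32 = average_speedup(per_model)
for gpu, value in avg_fp32.items():
    print(f"{gpu}: {value:.2f}x")
```

With these table values the 2080 Ti's average comes out at roughly 1.37x, consistent with the 37% FP32 advantage over the 1080 Ti quoted in the summary.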

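Finally, we divide each GPU's average speedup by the cost of the single-GPU system built around it (in thousands of USD), which gives:

[Figure: FP32 speedup per $1,000]
[Figure: FP16 speedup per $1,000]

By this measure, the 2080 Ti clearly offers the best price/performance.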
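
As a worked example of the price/performance step, the sketch below reproduces the ResNet-50 row of the FP32 price/performance table. It is a sketch under assumptions: the function name `speedup_per_kusd` and the dictionaries are made up for illustration, while the throughput values and single-GPU system prices are taken from the tables in this article.

```python
# ResNet-50 FP32 throughput (images/sec) and single-GPU system prices (k$),
# both copied from the tables in this article. Names are illustrative.
RESNET50_FP32 = {"2080": 209.89, "2080 Ti": 286.05, "Titan V": 298.28, "V100": 368.63, "1080 Ti": 203.99}
SYSTEM_PRICE_KUSD = {"2080": 1.99, "2080 Ti": 2.49, "Titan V": 4.29, "V100": 11.09, "1080 Ti": 1.99}


def speedup_per_kusd(throughput, price_kusd, baseline="1080 Ti"):
    """Speedup over the baseline GPU divided by system price in thousands of USD."""
    base = throughput[baseline]
    return {gpu: (value / base) / price_kusd[gpu] for gpu, value in throughput.items()}


value_per_kusd = speedup_per_kusd(RESNET50_FP32, SYSTEM_PRICE_KUSD)
# Print GPUs from best to worst value.
for gpu, value in sorted(value_per_kusd.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gpu}: {value:.2f} speedup per $1,000")
```

Rounded to two decimals, the results agree with the ResNet-50 row of the FP32 price/performance table, with the 2080 Ti ranked first.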

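2080 Ti vs V100: is the 2080 Ti really that fast?

Why does the 2080 Ti deliver 80% of the Tesla V100's performance at roughly one-eighth of the price (about $1,200 versus $9,800 for the GPU alone)? The answer is simple: NVIDIA segments its market. Wealthy customers and companies that are willing and able to pay buy the Tesla line, where a V100 retails for about $9,800. For ordinary needs, the RTX and GTX cards offer much better value.

Unless you are building a cloud service on the scale of AWS, Azure, or Google Cloud, the 2080 Ti is probably the far better buy. There are, however, a few key use cases where the V100 comes in handy:

If you need FP64 compute. If you are doing computational fluid dynamics, n-body simulation, or other work that demands high numerical precision (FP64), you will need a Titan V or V100. If you are not sure whether you need FP64, you don't.
If you need 32 GB of memory, meaning 11 GB is not enough for your task even at a batch size of 1, or you are building your own architecture with extreme memory requirements, the V100 may be the right choice. This is rare, though: fewer than 5% of cases. Most people use models such as ResNet, VGG, Inception, SSD, or YOLO, so only a small minority will buy a V100.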

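Raw performance data

FP32 throughput

FP32 (single precision) arithmetic is the precision most commonly used when training CNNs. The FP32 data comes from the code in the Lambda TensorFlow benchmarking repository. Values are training throughput in images per second.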

Model / GPU	2080	2080 Ti	Titan V	V100	1080 Ti
ResNet-50	209.89	286.05	298.28	368.63	203.99
ResNet-152	82.78	110.24	110.13	131.69	82.83
InceptionV3	141.9	189.31	204.35	242.7	130.2
InceptionV4	61.6	81	78.64	90.6	56.98
VGG16	123.01	169.28	190.38	233	133.16
AlexNet	2567.38	3550.11	3729.64	4707.67	2720.59
SSD300	111.04	148.51	153.55	186.8	107.71

FP16 throughput (Sako)

FP16 (half precision) is also suitable for training many neural networks. The benchmark code here comes from Yusaku Sako's benchmark scripts, which report results for both FP16 and FP32.

Model/GPU	2080	2080 Ti	Titan V	V100	1080 Ti
VGG16	181.2	238.45	270.27	333.33	149.39
ResNet-152	62.67	103.29	84.92	108.54	62.74

FP32 (Sako)

Model/GPU	2080	2080 Ti	Titan V	V100	1080 Ti
VGG16	120.39	163.26	168.59	222.22	130.8
ResNet-152	43.43	75.18	61.82	80.08	53.45

FP16 training speedup over the 1080 Ti

Model/GPU	2080	2080 Ti	Titan V	V100	1080 Ti
VGG16	1.21	1.60	1.81	2.23	1.00
ResNet-152	1.00	1.65	1.35	1.73	1.00

FP32 training speedup over the 1080 Ti

Model/GPU	2080	2080 Ti	Titan V	V100	1080 Ti
VGG16	0.92	1.25	1.29	1.70	1.00
ResNet-152	0.81	1.41	1.16	1.50	1.00

Price Performance Data (Speedup / $1,000 USD) FP32

Model/GPU	2080	2080 Ti	Titan V	V100	1080 Ti
Price Per GPU (k$)	0.7	1.2	3	9.8	0.7
Price Per 1 GPU System (k$)	1.99	2.49	4.29	11.09	1.99
AVG	0.51	0.55	0.33	0.16	0.50
ResNet-50	0.52	0.56	0.34	0.16	0.50
ResNet-152	0.50	0.53	0.31	0.14	0.50
InceptionV3	0.55	0.58	0.37	0.17	0.50
InceptionV4	0.54	0.57	0.32	0.14	0.50
VGG16	0.46	0.51	0.33	0.16	0.50
AlexNet	0.47	0.52	0.32	0.16	0.50
SSD300	0.52	0.55	0.33	0.16	0.50

Price Performance Data (Speedup / $1,000 USD) FP16

Model/GPU	2080	2080 Ti	Titan V	V100	1080 Ti
AVG	0.56	0.65	0.37	0.18	0.50
VGG16	0.61	0.64	0.42	0.20	0.50
ResNet-152	0.50	0.66	0.32	0.16	0.50

Methodology

All models were trained on a synthetic dataset. This isolates GPU performance from CPU pre-processing performance.
For each GPU, 10 training experiments were conducted on each model. The number of images processed per second was measured and then averaged over the 10 experiments.
The speedup benchmark is calculated by taking the images/sec score and dividing it by the baseline images/sec score for that particular model. The baseline is the 1080 Ti, which the speedup tables normalize to 1.00, so the result is the relative improvement over the 1080 Ti.
The 2080 Ti, 2080, Titan V, and V100 benchmarks utilized Tensor Cores.
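
The averaging and normalization steps listed above could look roughly like this in Python. The run values are made-up placeholders for illustration, not measurements from this benchmark, and the function names `mean_throughput` and `speedup` are hypothetical.

```python
from statistics import mean


def mean_throughput(runs):
    """Average images/sec over repeated training runs of one (GPU, model) pair."""
    return mean(runs)


def speedup(candidate_runs, baseline_runs):
    """Ratio of the candidate GPU's mean throughput to the baseline GPU's."""
    return mean_throughput(candidate_runs) / mean_throughput(baseline_runs)


# Placeholder numbers for illustration only: NOT results from this benchmark.
# Ten runs each, mirroring the ten experiments per GPU and model.
candidate_runs = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0]
baseline_runs = [80.0, 81.0, 79.0, 80.0, 80.0, 82.0, 78.0, 80.0, 80.0, 80.0]

ratio = speedup(candidate_runs, baseline_runs)
print(f"speedup over baseline: {ratio:.2f}x")
```

Averaging before dividing keeps a single noisy run from distorting the reported ratio.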

Batch sizes used

Model	Batch Size
ResNet-50	64
ResNet-152	32
InceptionV3	64
InceptionV4	16
VGG16	64
AlexNet	512
SSD	32

Hardware

All benchmarks, except for those of the V100, were conducted using a Lambda Quad Basic with swapped GPUs. The exact specifications are:

RAM: 64 GB DDR4 2400 MHz
Processor: Intel Xeon E5-1650 v4
Motherboard: ASUS X99-E WS/USB 3.1
GPUs: EVGA XC RTX 2080 Ti GPU TU102, ASUS 1080 Ti Turbo GP102, NVIDIA Titan V, and Gigabyte RTX 2080.

The V100 benchmark utilized an AWS P3 instance with an E5-2686 v4 (16 core) and 244 GB DDR4 RAM.

Software

All benchmarks except the V100's used the following environment:

Ubuntu 18.04 (Bionic)
CUDA 10.0
TensorFlow 1.11.0-rc1
cuDNN 7.3

The V100 benchmark ran on an AWS P3 instance with:

Ubuntu 16.04 (Xenial)
CUDA 9.0
TensorFlow 1.12.0.dev20181004
cuDNN 7.1

How we define a “typical single GPU system”

The price we use in our calculations is based on the estimated price of the minimal system that avoids CPU, memory, and storage bottlenecking for Deep Learning training. Note that this won’t be upgradable to anything more than 1 GPU.

CPU: i7-8700K or equivalent (6 cores, 16 PCI-e lanes). ~$380.00 on Amazon.
CPU Cooler: Noctua L-Type Premium. ~$50 on Amazon.
Memory: 32 GB DDR4. ~$280.00 on Amazon.
Motherboard: ASUS Prime B360-Plus (16x pci-e lanes for GPU). ~$105.00 on Amazon.
Power supply: EVGA SuperNOVA 750 G2 (750W). ~$100.00 on Amazon.
Case: NZXT H500 ATX case. ~$70.00 on Amazon.
Labor: About $200 in labor if you want somebody else to build it for you.

Cost (excluding GPU): $1,291.65 after 9% sales tax.
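
The quoted total can be checked with a few lines of arithmetic. The sketch below sums the component prices listed above and applies the 9% sales tax; the published $1,291.65 is reproduced when the tax is applied to all seven items, labor included. The variable names are illustrative, and the GPU prices are taken from the "Price Per GPU (k$)" row of the price/performance table.

```python
# Component prices in USD, copied from the parts list above.
PARTS_USD = {
    "CPU (i7-8700K)": 380.00,
    "CPU cooler": 50.00,
    "Memory (32 GB DDR4)": 280.00,
    "Motherboard": 105.00,
    "Power supply": 100.00,
    "Case": 70.00,
    "Labor": 200.00,
}
SALES_TAX = 0.09

# GPU prices in USD, from the "Price Per GPU (k$)" row of the price table.
GPU_PRICE_USD = {"2080": 700, "2080 Ti": 1200, "Titan V": 3000, "V100": 9800, "1080 Ti": 700}

# Tax is applied to every listed item, which reproduces the published total.
base_cost = sum(PARTS_USD.values()) * (1 + SALES_TAX)
print(f"Cost excluding GPU: ${base_cost:,.2f}")

# Adding the (untaxed) GPU price gives the single-GPU system price in k$.
system_price_kusd = {
    gpu: round((base_cost + price) / 1000, 2) for gpu, price in GPU_PRICE_USD.items()
}
for gpu, kusd in system_price_kusd.items():
    print(f"{gpu}: {kusd:.2f} k$")
```

Adding each GPU's price to this base gives 1.99, 2.49, 4.29, and 11.09 k$, the same values as the "Price Per 1 GPU System (k$)" row.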

Note that this doesn’t include any of the time that it takes to do the driver and software installation to actually get up and running. That alone can take days of full time work.

Reproduce the benchmarks yourself

All benchmarking code is available in Lambda Labs' GitHub repo. Share your results by emailing s@lambdalabs.com or tweeting @LambdaAPI. Be sure to include the hardware specifications of the machine you used.

Step One: Clone benchmark repo

git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive

Step Two: Run benchmark

Input a proper gpu_index (default 0) and num_iterations (default 10)

cd lambda-tensorflow-benchmark
./benchmark.sh gpu_index num_iterations

Step Three: Report results

Check the repo directory for folder <cpu>-<gpu>.logs (generated by benchmark.sh)
Use the same num_iterations in benchmarking and reporting.

./report.sh <cpu>-<gpu>.logs num_iterations


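This article is a translated excerpt of "2080 Ti TensorFlow GPU benchmarks – 2080 Ti vs V100 vs 1080 Ti vs Titan V".

This article is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International.
When reposting, be sure to credit the source: https://oldpan.me/archives/2080-ti-vs-v100-vs-1080-ti-vs-titan-v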
