TensorFlow multiplication with a constant is slower than with tf.random

Problem description (votes: 1, answers: 2)

I'm using TensorFlow for some non-DL computation, and I've run into behavior I don't understand. I'm testing the multiplication of a square matrix with itself: tf.matmul(a,a)

  1. when the matrix is created with tf.constant
  2. when the matrix is randomly initialized on each run

My expectation was that the first case would pay some one-time overhead for transferring the initial data, 100 MB (a 5000x5000 float32 matrix: 5000 × 5000 × 4 bytes), but that the second case would execute slightly slower because of the random initialization on every run.

What I see instead is that multiplying the constant is much slower, even on consecutive runs within the same session.

The code

import tensorflow as tf
import numpy as np
from timeit import timeit
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # suppress TF C++ info/warning log spam
SIZE = 5000
NUM_RUNS = 10

a = np.random.random((SIZE, SIZE))
_const_a = tf.constant(a, dtype=tf.float32, name="Const_A")
_mul_const_a = tf.matmul(_const_a, _const_a, name="Mul_Const")

_random_a = tf.random_uniform((SIZE, SIZE), dtype=tf.float32, name="Random_A")
_mul_random_a = tf.matmul(_random_a, _random_a, name="Mul_Random")

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as s:
    # Run once to make sure everything is initialised
    s.run((_const_a, _mul_const_a, _random_a, _mul_random_a))

    # timeit
    print("TF with const\t", timeit(lambda: s.run((_mul_const_a.op)), number=NUM_RUNS))
    print("TF with random\t", timeit(lambda: s.run((_mul_random_a.op)), number=NUM_RUNS))

The output

Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
Random_A/sub: (Sub): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/RandomUniform: (RandomUniform): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
Random_A: (Add): /job:localhost/replica:0/task:0/device:GPU:0
Mul_Random: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
Mul_Const: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/max: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/min: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Const_A: (Const): /job:localhost/replica:0/task:0/device:GPU:0
TF with const    2.9953213009994215
TF with random   0.513827863998813
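
One way to check where the time actually goes is TensorFlow's timeline tracing, which records per-op execution times for a single run. Below is a minimal sketch against the graph above; it assumes the session s and _mul_const_a from the code, and trace_const.json is just an arbitrary output name (the file can be opened in Chrome at chrome://tracing).

from tensorflow.python.client import timeline

# Request a full trace of the next run; step_stats records per-op timings.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
s.run(_mul_const_a.op, options=run_options, run_metadata=run_metadata)

# Convert the collected step stats into a Chrome-trace JSON file.
tl = timeline.Timeline(run_metadata.step_stats)
with open("trace_const.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())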
Tags: python, performance, tensorflow, tensorflow-gpu, tensor
2 Answers
0 votes

YMMV; I get the opposite result on my K1100M.

Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0, compute capability: 3.0
Random_A/sub: (Sub): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/RandomUniform: (RandomUniform): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
Random_A: (Add): /job:localhost/replica:0/task:0/device:GPU:0
Mul_Random: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
Mul_Const: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/max: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/min: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Random_A/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Const_A: (Const): /job:localhost/replica:0/task:0/device:GPU:0
TF with const    4.3167382130868175
TF with random   9.889055849542306

0 votes

The first call to session.run() in TensorFlow is very expensive. If you want to time something, make sure to run it repeatedly.
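
A minimal sketch of that timing pattern, where op and s are placeholder names (any op to benchmark and an open tf.Session), not taken from the question:

from timeit import timeit

s.run(op)  # one untimed warm-up run absorbs one-time setup costs
n = 100    # then average over many runs
total = timeit(lambda: s.run(op), number=n)
print("avg per run: %.4f s" % (total / n))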

However, in your case, unless you disable constant folding, you will likely see almost no time spent in the constant case, because your graph will simply fetch the pre-folded constant.
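
To test that, constant folding can be switched off through the session config. A sketch for TF 1.x, reusing _mul_const_a from the question's code; note that newer 1.x releases also run the Grappler optimizer, whose folding pass is disabled separately:

from tensorflow.core.protobuf import rewriter_config_pb2

# Turn off the classic GraphDef-level constant folding...
opt_options = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0,
                                  do_constant_folding=False)
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=opt_options))
# ...and Grappler's constant-folding rewrite as well.
config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF

with tf.Session(config=config) as s:
    s.run(_mul_const_a.op)  # now actually executes the MatMul at run time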
