slow, single threaded

Hi, I've been trying to use ngraph to accelerate my tensorflow detector/testing pipeline, but, unfortunately, without any success so far. The inference process either has the same performance, or becomes painstakingly slow.

I'm not quite sure whether I'm installing and using ngraph right.

I'm not quite sure whether this is the right place to ask these questions, since it might just be something obvious that I've missed, thus not being an actual *issue*, but I couldn't find any other support channel. If there is a different, more appropriate one, please direct me to it.

For installation, I've used pip inside my own dockerfile to install `ngraph-tensorflow-bridge`, following the instructions on this repo (and also installed `plaidml`, since I've noticed `ngraph` looks for it during initialization; I didn't build the `ngraph` library myself, since I noticed that the bridge supplies the .so, and it doesn't complain when loading it).

Also, I've tried turning on xla, but it has no effect
```python
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
tf.keras.backend.set_session(tf.Session(config=config))
```

Also, I've tested ngraph with [intel-tensorflow](https://pypi.org/project/intel-tensorflow/). When ngraph is off, intel-tf gets about twice as fast as vanilla tf. When ngraph bridge is imported, the performance is really, really low (i.e., I've got bored of waiting for an operation to finish that takes a few seconds when ngraph is not used).

Also, I've tried both 'NCHW' and 'NHWC' under both the vanilla and the intel distributions of tensorflow.

For usage, I only added `import ngraph_bridge` after importing `tensorflow`. Is there something else I'm supposed to do?

I didn't get any stdout/stderr message to help me figure out whether ngraph is actually on or not. I've looked through the output of `tensorflow.python.client.device_lib.list_local_devices()`, but nothing seems to change when adding the import. The only indication that ngraph is used is when I don't disable my GPU (`os.environ["CUDA_VISIBLE_DEVICES"] = ""`), and I get an error message.

Here is the code I've used for testing out ngraph (it's based on the [keras example](https://github.com/NervanaSystems/ngraph-tf/blob/master/examples/keras_sample.py) in this repo). I think the longest I've waited to see some training progress was 10 minutes (without ngraph, I get a progress bar update after under half a minute).

```python
import numpy as np
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ['KERAS_BACKEND'] = 'tensorflow'

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
import ngraph_bridge

# A simple script to run inference and training on resnet 50

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 4
config.inter_op_parallelism_threads = 4

# config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

tf.keras.backend.set_session(tf.Session(config=config))
# tf.keras.backend.set_image_data_format('channels_first')

model = ResNet50(weights=None)

batch_size = 128
img = np.random.rand(batch_size, 224, 224, 3)
# img = np.random.rand(batch_size, 3, 224, 224)

preds = model.predict(preprocess_input(img))
print('Predicted:', decode_predictions(preds, top=3)[0])
model.compile(tf.keras.optimizers.SGD(), loss='categorical_crossentropy')
preds = model.fit(
    preprocess_input(img), np.zeros((batch_size, 1000), dtype='float32'))
print('Ran a train round')
```

I've also tried ngraph for a different code that doesn't use keras (it uses tensorflow's object_detection API instead). Speed is at least 20% lower when using ngraph. For some models the process would have a much larger memory footprint when using ngraph vs. when not (I noticed that because my laptop started paging and crashed). 

I've also noticed that while training the keras examples, only one of the 8 logical cores of my cpu is used.  This happens when running inference on the detection model, but a fraction of the time more than one core is saturated.

Thanks


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slow, single threaded #459

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

slow, single threaded #459

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions