Support TensorFlow (2.0) #77

@zw0610

Description

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Status:

So far FTLib does not support TensorFlow. When adopted in ElasticDL, we take a NumPy NDArray and wrap it into the Tensor data structure defined by PyTorch. Such an approach not only incurs overhead but is also inelegant. It would be much better if FTLib supported TensorFlow natively.
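For context, a minimal sketch of the current workaround, assuming gradients arrive as NumPy arrays; `ftlib.allreduce` below is a hypothetical stand-in for FTLib's PyTorch-based collective call:

```python
import numpy as np
import torch

# Rough sketch of the current ElasticDL workaround.
grad_np = np.zeros(16, dtype=np.float32)  # gradient produced on the TF/NumPy side
grad_torch = torch.from_numpy(grad_np)    # wrap into a PyTorch tensor (shares memory)
# ftlib.allreduce(grad_torch)             # hypothetical FTLib collective call
grad_np = grad_torch.numpy()              # back to NumPy for the rest of the pipeline
```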

Potential Approach(es):

Distribution Strategy (tf.distribute.Strategy) is introduced with TF 2.0. The implementation of CollectiveAllReduceStrategy suggests that we can build a customized strategy whose fault-tolerant/elastic ops are defined in FTLib.

Regarding the enhanced ops,

  1. the logic FTLib uses to enhance collective ops can be assembled into a new cross_device_ops library customized by FTLib (see the sketch after this list)
  2. the logic FTLib uses to reconfigure the member list can be folded into the new FTLib distributed strategy
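A very rough sketch of item 1, assuming we subclass tf.distribute.CrossDeviceOps and route reductions through FTLib. The overridden method names follow the TF 2.0 extension points (later releases add an `options` argument); `ftlib_allreduce` and `ftlib_broadcast` are hypothetical FTLib hooks, stubbed here so the snippet is self-contained:

```python
import tensorflow as tf


def ftlib_allreduce(value, reduce_op):
    """Hypothetical stand-in for FTLib's fault-tolerant all-reduce."""
    raise NotImplementedError


def ftlib_broadcast(tensor):
    """Hypothetical stand-in for FTLib's broadcast."""
    raise NotImplementedError


class FTLibAllReduceCrossDeviceOps(tf.distribute.CrossDeviceOps):
    """CrossDeviceOps variant that delegates collectives to FTLib."""

    def reduce_implementation(self, reduce_op, per_replica_value, destinations):
        # Aggregate per-replica values through FTLib's fault-tolerant
        # collective instead of TF's built-in collective ops.
        return ftlib_allreduce(per_replica_value, reduce_op)

    def batch_reduce_implementation(self, reduce_op, value_destination_pairs):
        return [self.reduce_implementation(reduce_op, value, dest)
                for value, dest in value_destination_pairs]

    def broadcast_implementation(self, tensor, destinations):
        # Membership changes could also be detected and handled here.
        return ftlib_broadcast(tensor)
```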

Steps:

  1. Prepare new collective ops with the elastic enhancement
  2. Create a customized distributed strategy built on those ops (a usage sketch follows below)
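As a sketch of step 2, a custom CrossDeviceOps instance can already be injected into tf.distribute.MirroredStrategy today; the FTLib strategy would follow the same pattern, but for an elastic, multi-worker member list. `FTLibAllReduceCrossDeviceOps` is the hypothetical class sketched after the previous list:

```python
import tensorflow as tf

# Inject the FTLib-backed collectives into a strategy so that gradient
# aggregation inside strategy.scope() goes through FTLib.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=FTLibAllReduceCrossDeviceOps())

with strategy.scope():
    # Variables created here are managed by the strategy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```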

Potential Issues:

  1. While this proposal should work for TF 2.0, it cannot be applied to earlier versions.
  2. While it may look transparent to TF 2.0 users, this design bears little resemblance to what FTLib currently does with PyTorch and NumPy, so the two code paths would diverge.

/cc @gaocegege @QiJune @skydoorkai
