Support TensorFlow (2.0) #77

@zw0610

Description

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Status:

So far FTLib does not support TensorFlow. When adopted in ElasticDL, we take a NumPy NDArray and wrap it into the Tensor data structure defined by PyTorch. Such an approach not only incurs overhead but is also inelegant. It would be much better if FTLib supported TensorFlow natively.
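For context, a minimal sketch of the current workaround, assuming gradients arrive as NumPy arrays; `ftlib.allreduce` below is a hypothetical stand-in for FTLib's PyTorch-based collective call:

```python
import numpy as np
import torch

# Rough sketch of the current ElasticDL workaround.
grad_np = np.zeros(16, dtype=np.float32)  # gradient produced on the TF/NumPy side
grad_torch = torch.from_numpy(grad_np)    # wrap into a PyTorch tensor (shares memory)
# ftlib.allreduce(grad_torch)             # hypothetical FTLib collective call
grad_np = grad_torch.numpy()              # back to NumPy for the rest of the pipeline
```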

Potential Approach(es):

Distribution Strategy (tf.distribute.Strategy) is introduced with TF 2.0. The implementation of CollectiveAllReduceStrategy suggests that we can build a customized strategy whose fault-tolerant/elastic ops are defined in FTLib.

Regarding the enhanced ops,

  1. the logic FTLib uses to enhance collective ops can be assembled into a new cross_device_ops library customized by FTLib (see the sketch after this list)
  2. the logic FTLib uses to reconfigure the member list can be folded into the new FTLib distributed strategy
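A very rough sketch of item 1, assuming we subclass tf.distribute.CrossDeviceOps and route reductions through FTLib. The overridden method names follow the TF 2.0 extension points (later releases add an `options` argument); `ftlib_allreduce` and `ftlib_broadcast` are hypothetical FTLib hooks, stubbed here so the snippet is self-contained:

```python
import tensorflow as tf


def ftlib_allreduce(value, reduce_op):
    """Hypothetical stand-in for FTLib's fault-tolerant all-reduce."""
    raise NotImplementedError


def ftlib_broadcast(tensor):
    """Hypothetical stand-in for FTLib's broadcast."""
    raise NotImplementedError


class FTLibAllReduceCrossDeviceOps(tf.distribute.CrossDeviceOps):
    """CrossDeviceOps variant that delegates collectives to FTLib."""

    def reduce_implementation(self, reduce_op, per_replica_value, destinations):
        # Aggregate per-replica values through FTLib's fault-tolerant
        # collective instead of TF's built-in collective ops.
        return ftlib_allreduce(per_replica_value, reduce_op)

    def batch_reduce_implementation(self, reduce_op, value_destination_pairs):
        return [self.reduce_implementation(reduce_op, value, dest)
                for value, dest in value_destination_pairs]

    def broadcast_implementation(self, tensor, destinations):
        # Membership changes could also be detected and handled here.
        return ftlib_broadcast(tensor)
```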

Steps:

  1. Prepare new collective ops with the elastic enhancement
  2. Create a customized distributed strategy built on those ops (a usage sketch follows below)
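As a sketch of step 2, a custom CrossDeviceOps instance can already be injected into tf.distribute.MirroredStrategy today; the FTLib strategy would follow the same pattern, but for an elastic, multi-worker member list. `FTLibAllReduceCrossDeviceOps` is the hypothetical class sketched after the previous list:

```python
import tensorflow as tf

# Inject the FTLib-backed collectives into a strategy so that gradient
# aggregation inside strategy.scope() goes through FTLib.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=FTLibAllReduceCrossDeviceOps())

with strategy.scope():
    # Variables created here are managed by the strategy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```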

Potential Issues:

  1. While this proposal should work for TF 2.0, it cannot be applied to earlier versions.
  2. While it may look transparent to TF 2.0 users, this design bears little resemblance to what FTLib currently does with PyTorch and NumPy, so the two code paths would diverge.

/cc @gaocegege @QiJune @skydoorkai
