generated from caicloud/python-template-project
Open
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
Description
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Status:
So far, FTLib does not support TensorFlow. When FTLib was adopted in ElasticDL, we took a NumPy NDArray and wrapped it in the Tensor data structure defined in PyTorch. This approach not only incurs overhead but is also inelegant. It would be much better if FTLib supported TensorFlow natively.
Potential Approach(es):
Distributed Strategy was introduced with TF 2.0. The implementation of CollectiveAllReduceStrategy suggests that we can build a custom strategy on top of fault-tolerant/elastic ops defined in FTLib.
Regarding the enhanced ops:
- the logic FTLib uses to enhance collective ops can be packaged into a new cross_device_ops library provided by FTLib
- the logic FTLib uses to reconfigure the member list can be built into the new distributed strategy in FTLib
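As a rough illustration of the first point, the sketch below wraps a simulated all-reduce so that it retries after dropping failed members. It is pure Python and does not use TensorFlow or FTLib. `ElasticAllReduce`, `WorkerLost`, `is_alive`, and `max_retries` are hypothetical names invented for this example, and the liveness check stands in for whatever consensus mechanism FTLib actually uses.

```python
from typing import Callable, Dict, List


class WorkerLost(Exception):
    """Raised by the simulated collective when a member has dropped out."""


class ElasticAllReduce:
    """Hypothetical elastic wrapper around an all-reduce (not a TF or FTLib API)."""

    def __init__(self, members: List[str], is_alive: Callable[[str], bool],
                 max_retries: int = 3):
        self.members = list(members)
        self.is_alive = is_alive
        self.max_retries = max_retries

    def _raw_all_reduce(self, values: Dict[str, float]) -> float:
        # Stand-in for the underlying collective: it fails if any member is gone.
        for member in self.members:
            if not self.is_alive(member):
                raise WorkerLost(member)
        return sum(values[member] for member in self.members)

    def _reconfigure(self) -> None:
        # Stand-in for the membership agreement step: keep only live members.
        self.members = [m for m in self.members if self.is_alive(m)]
        if not self.members:
            raise RuntimeError("no live members left")

    def all_reduce_mean(self, values: Dict[str, float]) -> float:
        # Try the collective; on failure, shrink the member list and retry.
        for _ in range(self.max_retries):
            try:
                return self._raw_all_reduce(values) / len(self.members)
            except WorkerLost:
                self._reconfigure()
        raise RuntimeError("all-reduce did not succeed after retries")


# Usage: worker w2 has failed, so the mean is taken over w0 and w1 only.
alive = {"w0": True, "w1": True, "w2": False}
op = ElasticAllReduce(["w0", "w1", "w2"], is_alive=lambda m: alive[m])
grads = {"w0": 1.0, "w1": 3.0, "w2": 5.0}
mean = op.all_reduce_mean(grads)
```

In a real cross_device_ops implementation the retry would wrap an actual collective kernel, and the reconfiguration would have to be agreed on by all surviving workers rather than decided locally.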
Steps:
- Prepare new collective ops with elastic enhancements
- Create a customized distributed strategy
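The second step could look roughly like the toy below: a strategy object that owns the replica list and delegates reductions to a pluggable ops object, so the member list can change between steps. This is a pure-Python sketch under assumptions, not the TensorFlow API. `ToyElasticStrategy`, `SumReduceOps`, and `reconfigure` are hypothetical, and the method names `run` and `num_replicas_in_sync` only loosely echo names used by TF distribution strategies.

```python
from typing import Callable, List, Sequence


class SumReduceOps:
    """Toy stand-in for a cross-device ops object; it just sums the values."""

    def reduce(self, values: Sequence[float]) -> float:
        return sum(values)


class ToyElasticStrategy:
    """Hypothetical strategy: owns the replica list and delegates reductions."""

    def __init__(self, replicas: List[str], cross_device_ops: SumReduceOps):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.replicas = list(replicas)
        self.cross_device_ops = cross_device_ops

    @property
    def num_replicas_in_sync(self) -> int:
        return len(self.replicas)

    def run(self, fn: Callable[[str], float]) -> List[float]:
        # Call fn once per replica, sequentially, and collect the results.
        return [fn(replica) for replica in self.replicas]

    def reduce_mean(self, per_replica: Sequence[float]) -> float:
        # Delegate the sum to the pluggable ops, then average over live replicas.
        return self.cross_device_ops.reduce(per_replica) / self.num_replicas_in_sync

    def reconfigure(self, replicas: List[str]) -> None:
        # Swap in a new member list, e.g. after a worker joins or leaves.
        if not replicas:
            raise ValueError("at least one replica is required")
        self.replicas = list(replicas)


# Usage: average per-replica gradients, then drop w3 and average again.
strategy = ToyElasticStrategy(["w0", "w1", "w2", "w3"], SumReduceOps())
local_grad = {"w0": 1.0, "w1": 2.0, "w2": 3.0, "w3": 6.0}
before = strategy.reduce_mean(strategy.run(lambda r: local_grad[r]))
strategy.reconfigure(["w0", "w1", "w2"])
after = strategy.reduce_mean(strategy.run(lambda r: local_grad[r]))
```

Keeping the member list in the strategy and the reduction logic in a separate ops object mirrors the split proposed above, so either part can be replaced independently.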
Potential Issues:
- This proposal mainly targets TF 2.0 and cannot be applied to earlier versions.
- While it may look transparent to TF 2.0 users, this design is only remotely similar to what FTLib does with PyTorch and NumPy.