Dynamic Data Sharding for Elasticity and Fault Tolerance
Dynamic data sharding splits the dataset into many small shards, each containing only a few batches of training samples. With dynamic sharding, DLRover can:
- recover a shard if its worker fails before consuming all of the shard's samples.
- mitigate worker stragglers by assigning more shards to faster workers.
Integration with Offline and Online Deep Learning
With the data-source transparency provided by dynamic data sharding, DLRover can be integrated with offline training, which consumes batch data, and also supports online learning with real-time streaming data.
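The data-source transparency can be illustrated by two shard producers with the same output shape: one cuts a finite dataset into shards (offline), the other groups an unbounded stream into shards as records arrive (online). The function names below (`batch_shards`, `stream_shards`) are illustrative assumptions, not DLRover interfaces; the point is that training code consuming shards is indifferent to which producer feeds it.

```python
from typing import Iterable, Iterator, List


def batch_shards(dataset: List[int], shard_size: int) -> Iterator[List[int]]:
    """Offline mode: slice a finite dataset into fixed-size shards."""
    for i in range(0, len(dataset), shard_size):
        yield dataset[i:i + shard_size]


def stream_shards(stream: Iterable[int], shard_size: int) -> Iterator[List[int]]:
    """Online mode: buffer an unbounded stream and emit a shard once full."""
    buf: List[int] = []
    for record in stream:
        buf.append(record)
        if len(buf) == shard_size:
            yield buf
            buf = []
    if buf:  # flush a final partial shard when the stream ends
        yield buf
```

Either generator can feed the same downstream consumer, so switching between batch data and a streaming source requires no change to the training loop itself.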