
About DLRover

Fault Tolerance to Reduce the Downtime of a Large-Scale Training Job

DLRover can restore training when a process fails without stopping the whole training job. With fault tolerance, the goodput of a GLM-65B training job on thousands of GPUs increased from 69% to 95%.
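
To illustrate how a restarted process can pick up where it left off, here is a minimal PyTorch sketch of the resume-from-checkpoint pattern that this kind of fault tolerance relies on. The checkpoint path and helper functions are hypothetical, not DLRover's actual API.

```python
import os
import torch

CKPT_PATH = "/tmp/ckpt.pt"  # hypothetical checkpoint location

def save_checkpoint(model, optimizer, step):
    # Persist enough state to resume after a process failure.
    torch.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict(), "step": step},
        CKPT_PATH,
    )

def restore_checkpoint(model, optimizer):
    # On (re)start, load the latest checkpoint if one exists and
    # continue from the recorded step instead of from scratch.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
start_step = restore_checkpoint(model, optimizer)  # 0 on a fresh start

for step in range(start_step, 1000):
    loss = model(torch.randn(4, 8)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        save_checkpoint(model, optimizer, step)
```

Because the restarted process resumes from the last checkpoint rather than step 0, only the work since the last save is lost, which is what keeps the downtime of a failure small.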

Auto-Scaling to Improve Training Performance and Resource Utilization

DLRover automatically scales resources (parameter servers or workers) up and down while a training job runs. By monitoring node workload and training throughput, DLRover diagnoses bottlenecks in the resource configuration and allocates resources according to the actual demands of model training, reducing wasted resources.
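
The sketch below shows the kind of heuristic such a diagnosis can apply: compare utilization across node types and pick a scaling action. The thresholds, names, and rules are illustrative assumptions, not DLRover's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    name: str
    cpu_util: float        # fraction of allocated CPU in use, 0.0-1.0
    samples_per_sec: float

def diagnose(ps_stats, worker_stats, high=0.9, low=0.3):
    # Saturated parameter servers bottleneck every worker: adding
    # workers would not raise throughput, so scale PS up first.
    if any(ps.cpu_util > high for ps in ps_stats):
        return "scale-up parameter servers"
    # Busy workers with spare PS capacity mean compute is the bottleneck.
    if all(w.cpu_util > high for w in worker_stats):
        return "scale-up workers"
    # Mostly idle workers are wasted resources; release some of them.
    if all(w.cpu_util < low for w in worker_stats):
        return "scale-down workers"
    return "keep current configuration"

ps = [NodeStats("ps-0", 0.95, 0.0)]
workers = [NodeStats("worker-0", 0.5, 120.0), NodeStats("worker-1", 0.6, 110.0)]
print(diagnose(ps, workers))  # -> "scale-up parameter servers"
```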

Dynamic Data Sharding for Elasticity and Fault Tolerance

Dynamic data sharding splits the dataset into many small shards, each containing only a few batches of training samples. With dynamic sharding, DLRover can (a minimal sketch of the bookkeeping follows the list):

  • recover a shard if the worker fails before consuming all of its samples.
  • mitigate stragglers by assigning more shards to faster workers.
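
Below is a minimal sketch of the TODO/DOING bookkeeping a dynamic sharding service of this kind needs; the class and method names are hypothetical, not DLRover's actual API.

```python
import queue

class ShardManager:
    """Split a sample range into small shards and hand them to workers.

    A shard acquired by a worker stays in the DOING table until the
    worker reports completion; if the worker fails first, its shards
    return to the TODO queue so another worker can recover them.
    """

    def __init__(self, num_samples, shard_size):
        self._todo = queue.Queue()
        self._doing = {}  # shard_id -> (worker_id, start, end)
        for shard_id, start in enumerate(range(0, num_samples, shard_size)):
            self._todo.put((shard_id, start, min(start + shard_size, num_samples)))

    def acquire(self, worker_id):
        # Fast workers call acquire() more often, so they naturally
        # receive more shards, which mitigates stragglers.
        if self._todo.empty():
            return None
        shard_id, start, end = self._todo.get()
        self._doing[shard_id] = (worker_id, start, end)
        return shard_id, start, end

    def report_done(self, shard_id):
        self._doing.pop(shard_id, None)

    def recover_worker(self, worker_id):
        # Requeue every unfinished shard held by a failed worker.
        for shard_id, (owner, start, end) in list(self._doing.items()):
            if owner == worker_id:
                del self._doing[shard_id]
                self._todo.put((shard_id, start, end))

manager = ShardManager(num_samples=1000, shard_size=100)
shard = manager.acquire("worker-0")  # worker-0 takes a shard
manager.recover_worker("worker-0")   # worker-0 fails; shard returns to TODO
```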

Integration with Offline and Online Deep Learning

With the data-source transparency provided by dynamic data sharding, DLRover integrates with offline training that consumes batch data, and it also supports online learning with real-time streaming data.
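
As a sketch of what data-source transparency can look like, the example below feeds the same shard-driven training loop from either a static dataset or a buffered stream; all names here are illustrative assumptions, not DLRover's API.

```python
from typing import Iterator, Protocol

class SampleSource(Protocol):
    # Anything that can yield the samples named by a shard's index range.
    def read(self, start: int, end: int) -> Iterator[bytes]: ...

class BatchSource:
    """Offline training: samples come from a static, finite dataset."""
    def __init__(self, records):
        self._records = records
    def read(self, start, end):
        yield from self._records[start:end]

class StreamSource:
    """Online learning: samples arrive from a real-time stream and are
    buffered; a shard's index range addresses the buffered offsets."""
    def __init__(self):
        self._buffer = []
    def append(self, record):
        self._buffer.append(record)
    def read(self, start, end):
        yield from self._buffer[start:end]

def train_on_shard(source: SampleSource, start: int, end: int):
    # The training loop only sees index ranges, so it is identical
    # for offline batch data and online streaming data.
    for sample in source.read(start, end):
        pass  # forward/backward pass would go here
```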