DLRover makes the distributed training of large AI models easy, stable, fast and green.

DLRover automatically trains deep learning models on distributed clusters.

It lets model developers focus on model architecture rather than on engineering concerns such as hardware acceleration and distributed execution. It currently provides automated operation and maintenance for deep learning training jobs on K8s/Ray.


KEY FEATURES

Fault-Tolerance

Distributed training jobs keep running when failures occur.
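The idea of continuing after a failure can be illustrated with a minimal restart loop. This is only a sketch of the general technique, not DLRover's API: `run_with_restarts`, `flaky_train`, and `max_restarts` are hypothetical names, and the failure is simulated with a plain `RuntimeError`.

```python
# Illustrative only: not DLRover's API. All names here are hypothetical.
from typing import Callable


def run_with_restarts(train_fn: Callable[[int], str], max_restarts: int = 3) -> str:
    """Call train_fn(attempt) until it succeeds or the restart budget is used up."""
    attempt = 0
    while True:
        try:
            return train_fn(attempt)
        except RuntimeError as err:
            if attempt >= max_restarts:
                raise  # give up once the restart budget is exhausted
            attempt += 1
            print(f"worker failed ({err}); restart {attempt}/{max_restarts}")


def flaky_train(attempt: int) -> str:
    # Simulate a node failure on the first two attempts.
    if attempt < 2:
        raise RuntimeError("simulated node failure")
    return f"finished after {attempt} restarts"


result = run_with_restarts(flaky_train)
```

In a real cluster the restart would typically be driven by a job controller that replaces the failed worker, rather than by an in-process loop like this one.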

Flash Checkpoint

Distributed training can recover from failures within seconds by restoring from an in-memory checkpoint.
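To make the in-memory checkpoint idea concrete, here is a small single-process sketch. It is not DLRover's implementation: `InMemoryCheckpoint`, `train`, and `fail_at` are hypothetical names, the "model" is a single number, and a real system would need to hold the snapshot somewhere that survives the training process crashing.

```python
# Illustrative only: not DLRover's API. All names here are hypothetical.
import copy


class InMemoryCheckpoint:
    """Keeps the latest (step, state) snapshot in memory instead of on disk."""

    def __init__(self):
        self._snapshot = None

    def save(self, step, state):
        # Deep-copy so later training updates do not mutate the snapshot.
        self._snapshot = (step, copy.deepcopy(state))

    def load(self):
        if self._snapshot is None:
            return 0, None
        step, state = self._snapshot
        return step, copy.deepcopy(state)


def train(total_steps, ckpt, fail_at=None):
    # Resume from the latest snapshot if one exists, else start fresh.
    step, state = ckpt.load()
    if state is None:
        state = {"weight": 0.0}
    while step < total_steps:
        if step == fail_at:
            raise RuntimeError("simulated failure")
        state["weight"] += 1.0  # stand-in for one training step
        step += 1
        ckpt.save(step, state)
    return state


ckpt = InMemoryCheckpoint()
try:
    train(10, ckpt, fail_at=6)  # first run fails partway through
except RuntimeError:
    pass
final_state = train(10, ckpt)  # second run resumes from step 6
```

Because the snapshot is read from memory rather than from storage, restoring it avoids the I/O cost of a disk checkpoint, which is the intuition behind recovery in seconds.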

Auto-Scaling

Distributed training jobs can scale resources up or down to improve stability, throughput, and resource utilization.