
Zware AICloud
Intelligent Computing Control and Scheduling Platform
Background
With the rapid development of artificial intelligence technology, large model computing power infrastructure has become a key pillar in digital transformation, greatly empowering the digital economy.
To support tasks such as AI large model training, large-scale intelligent computing centers are needed, built on clusters of thousands, tens of thousands, or even hundreds of thousands of GPUs. These compute cards must work collaboratively to provide sufficient computing power to process and update the massive parameters in these models.
Key Challenges
Faced with challenges of ultra-large scale, diverse configurations, high performance, and fine granularity, the platform's key innovations center on how to efficiently manage and utilize computing power resources and complete large model training tasks with high quality.
Ultra-large Scale Computing
Scheduling of ultra-large scale and diverse heterogeneous computing power
Extreme Utilization
Ensuring task operations and improving computing power utilization
Automatic Fault Detection
Identifying and alerting on various software and hardware anomalies
Innovative Service Model
Providing open architecture and refined operation functions
Product Introduction
The Zware-AICloud Intelligent Computing Control and Scheduling Platform is designed and developed for the control and scheduling of AI large model pre-training, ensuring that training runs efficiently at scale. It consolidates ultra-large-scale computing power infrastructure with end-to-end intelligent computing capabilities.
Platform Capabilities
The platform adopts end-to-end integrated heterogeneous distributed computing and RDMA communication frameworks, equipped with high-performance task scheduling engines, heterogeneous GPU adaptation, intelligent monitoring, and operation and maintenance control capabilities.
Product Architecture

Product Features
3.1 Core Features
Task Submission and Related Services
The user end provides a convenient visual interface, supporting custom settings for submitting distributed tasks, and includes built-in common computing frameworks such as PyTorch and MPI; it also offers services related to task management, storage management, and image management.
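As an illustration of what a distributed-task submission might carry, the sketch below assembles a JSON payload for a PyTorch job. The field names and values are illustrative assumptions for this example, not the platform's actual API.

```python
import json

# Hypothetical sketch of a distributed-task submission payload.
# Field names are assumptions, not the actual Zware-AICloud API.
def build_task_spec(name, framework, image, command,
                    workers=2, gpus_per_worker=8, priority="high"):
    """Assemble a JSON-serializable spec for a distributed training task."""
    if framework not in ("pytorch", "mpi"):
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "name": name,
        "framework": framework,          # which built-in launcher to use
        "image": image,                  # container image with the training code
        "command": command,              # entrypoint inside the container
        "replicas": workers,             # number of distributed workers
        "resources": {"gpus_per_worker": gpus_per_worker},
        "priority": priority,            # consumed by the scheduling engine
    }

spec = build_task_spec(
    name="llm-pretrain",
    framework="pytorch",
    image="registry.example.com/train:latest",
    command=["torchrun", "--nnodes=2", "train.py"],
)
print(json.dumps(spec, indent=2))
```

A visual interface would gather the same fields through a form and submit the resulting spec to the scheduler.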
Large-Scale Distributed Scheduling
Equipped with a powerful distributed scheduling engine, it supports the scheduling and management of computing power resources at the scale of thousands and tens of thousands of cards. Through various scheduling methods such as priority scheduling, reclamation strategies, preemptive scheduling, and fault-tolerant task restart scheduling, it meets complex needs in different application scenarios.
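The interplay of priority, preemption, and fault-tolerant restart can be sketched as a toy model: a fixed GPU pool where a higher-priority job may evict lower-priority jobs, which are requeued as if restarting from a checkpoint. This is an illustrative simplification, not the platform's actual engine.

```python
import heapq

# Toy model of priority scheduling with preemption over a fixed GPU pool.
class Scheduler:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.running = []   # (priority, name, gpus)
        self.queue = []     # min-heap of (-priority, name, gpus)

    def submit(self, name, gpus, priority):
        heapq.heappush(self.queue, (-priority, name, gpus))
        self._dispatch()

    def _dispatch(self):
        while self.queue:
            neg_prio, name, gpus = self.queue[0]
            if gpus <= self.free:
                heapq.heappop(self.queue)
                self.free -= gpus
                self.running.append((-neg_prio, name, gpus))
            elif self._can_preempt(-neg_prio, gpus):
                self._preempt(-neg_prio, gpus)
            else:
                break  # head of queue cannot be placed yet

    def _can_preempt(self, prio, need):
        # GPUs held by strictly lower-priority jobs are reclaimable.
        reclaimable = sum(g for p, _, g in self.running if p < prio)
        return self.free + reclaimable >= need

    def _preempt(self, prio, need):
        # Evict lowest-priority jobs first; evicted jobs are requeued,
        # modeling fault-tolerant restart from a checkpoint.
        self.running.sort()
        while self.free < need and self.running and self.running[0][0] < prio:
            p, name, g = self.running.pop(0)
            self.free += g
            heapq.heappush(self.queue, (-p, name, g))

sched = Scheduler(total_gpus=16)
sched.submit("batch-job", gpus=16, priority=1)
sched.submit("urgent-pretrain", gpus=16, priority=9)
print([name for _, name, _ in sched.running])  # urgent job preempts the batch job
```

A production engine adds fairness, gang scheduling, and topology awareness on top of this basic preempt-and-requeue loop.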
Heterogeneous and Long-Distance Control Scheduling
It enables unified cluster management of heterogeneous computing power, allowing users to easily view the real-time utilization of resources within the cluster, such as accelerators (GPUs, etc.), CPUs, and memory. It also supports long-distance computing power scheduling, enabling cross-data-center large model training.
Automatic Fault Detection and Alerts
Integrated with the operations and maintenance system, the platform automatically captures operational status data from network devices and comprehensively monitors network health, achieving automatic fault detection. It also supports automatic anomaly alarms and automated equipment inspection.
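A minimal form of such detection is rule-based: poll device status and raise an alert whenever a metric crosses a threshold. The metric names and limits below are assumptions chosen for the example.

```python
# Illustrative sketch of rule-based fault detection over polled device status.
# Metric names and thresholds are assumptions for this example.
THRESHOLDS = {
    "gpu_temp_c": 85,          # overheating
    "ecc_errors": 1,           # any uncorrected ECC error triggers an alert
    "link_flaps_per_hour": 3,  # unstable network link
}

def detect_faults(node_status):
    """Return (node, metric, value) alerts for thresholds that are exceeded."""
    alerts = []
    for node, metrics in node_status.items():
        for metric, limit in THRESHOLDS.items():
            value = metrics.get(metric, 0)
            if value >= limit:
                alerts.append((node, metric, value))
    return alerts

status = {
    "node-01": {"gpu_temp_c": 78, "ecc_errors": 0, "link_flaps_per_hour": 0},
    "node-02": {"gpu_temp_c": 91, "ecc_errors": 0, "link_flaps_per_hour": 5},
}
for alert in detect_faults(status):
    print("ALERT:", alert)
```

In practice the alert feed would drive the automatic alarms and trigger equipment inspection described above.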
User Value
Production Proven Excellence
Through the Zware-AICloud platform, users can achieve control and scheduling of ultra-large-scale intelligent computing clusters, automatic fault tolerance, and other functions. The platform is currently deployed in multiple large-scale intelligent computing clusters.
4.1 Large-Scale Distributed Scheduling
A built-in, powerful distributed scheduling engine supports the scheduling and management of computing power resources at the scale of thousands to tens of thousands of cards. Through scheduling methods such as priority scheduling, reclamation strategies, preemptive scheduling, and fault-tolerant task-restart scheduling, it meets complex needs across application scenarios, ensuring high-quality, stable task completion and improved computing efficiency.
4.2 Multidimensional Heterogeneous Computing Power Scheduling
The platform supports unified scheduling of heterogeneous computing power. It provides a unified primitive interface for the AI collective communication library, with targeted adaptation to different manufacturers' GPUs handled inside the library.
GPU Compatibility
This allows AI models to be trained on different GPU cards without per-vendor adaptation work, and it enables collaborative training across GPUs from multiple manufacturers.
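The idea of a unified primitive interface can be sketched as an abstract collective-communication API with pluggable vendor backends. The backend registry and the single-process reference implementation below are illustrative assumptions; a real backend would wrap a vendor library such as NCCL or its equivalent from another GPU maker.

```python
from abc import ABC, abstractmethod

# Sketch of a unified collective-communication primitive interface with
# pluggable per-vendor backends. Names are illustrative assumptions.
class Collective(ABC):
    @abstractmethod
    def all_reduce(self, tensors):
        """Sum the per-rank tensors; every rank sees the same result."""

class ReferenceBackend(Collective):
    # Single-process stand-in for a vendor library (NCCL or equivalent).
    def all_reduce(self, tensors):
        length = len(tensors[0])
        return [sum(t[i] for t in tensors) for i in range(length)]

BACKENDS = {"reference": ReferenceBackend}

def get_backend(vendor):
    # Model code calls one interface; this mapping hides vendor differences.
    return BACKENDS[vendor]()

backend = get_backend("reference")
print(backend.all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # per-rank gradients summed
```

Because training code only ever calls `all_reduce` on the abstract interface, swapping the GPU vendor reduces to registering a different backend.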
Long-Distance Optimization
The platform automatically adapts to collaborative training across different distances through software-defined methods, using congestion-aware Priority Flow Control (PFC) for improved performance.
4.3 Automatic Fault Prediction and Recovery
The platform uses real-time monitoring and predictive maintenance to detect potential problems in advance and take measures that reduce system failures and downtime, improving service reliability and stability. Based on historical data and real-time performance indicators, the system predicts and identifies potential fault points and acts before they cause business interruptions.
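One simple shape such prediction can take is trend extrapolation: fit a linear trend to a recent window of a health metric and estimate how many polling intervals remain before it crosses a failure threshold. This is a simplified stand-in for the predictive maintenance described, with an assumed metric and threshold.

```python
# Sketch of trend-based fault prediction via least-squares extrapolation.
def steps_to_threshold(history, threshold):
    """Fit a linear trend to history; return the estimated number of steps
    until the metric crosses threshold, or None if it is not rising."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    if slope <= 0:
        return None  # metric stable or improving: no predicted fault
    return max(0.0, (threshold - history[-1]) / slope)

# A correctable-memory-error count creeping upward: predict the crossing.
errors = [0, 2, 4, 6, 8]                         # slope = 2 per step
print(steps_to_threshold(errors, threshold=20))  # 6.0 steps until alarm level
```

Flagging the node roughly six polling intervals ahead leaves time to drain its jobs before the metric reaches the alarm level.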
4.4 High-Efficiency Congestion Control Capability
The platform achieves automatic parameter tuning for DCQCN (Data Center Quantized Congestion Notification) congestion control, adopting a distributed architecture that supports dynamic scaling. It applies load-balancing encoding in congestion control, orchestrating synchronized data flows for large model training across the entire network.
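At its core, DCQCN sender-side rate control cuts the sending rate multiplicatively when a congestion notification packet (CNP) arrives and recovers toward the pre-cut rate during quiet periods. The simplified sketch below illustrates that loop; the gain constant and timer behavior are the kind of parameters the auto-tuning mentioned above would adjust, and the values here are illustrative.

```python
# Simplified sketch of DCQCN sender-side rate adjustment: multiplicative
# decrease on a CNP, fast recovery toward the pre-cut rate otherwise.
# Constants are illustrative; real deployments tune g and the timers.
class DcqcnSender:
    G = 1 / 256  # alpha update gain

    def __init__(self, line_rate_gbps):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate for recovery
        self.alpha = 1.0           # estimated congestion severity

    def on_cnp(self):
        # Congestion: remember the current rate, cut proportionally to alpha.
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.G) * self.alpha + self.G

    def on_quiet_period(self):
        # No CNPs for a timer interval: decay alpha, recover halfway to target.
        self.alpha = (1 - self.G) * self.alpha
        self.rc = (self.rc + self.rt) / 2

s = DcqcnSender(line_rate_gbps=100)
s.on_cnp()
print(round(s.rc, 1))   # 50.0: rate halved on the first CNP (alpha starts at 1)
for _ in range(4):
    s.on_quiet_period()
print(round(s.rc, 1))   # 96.9: recovering toward the pre-cut 100 Gbps
```

Tuning the gain and quiet-period timers trades how hard the sender backs off against how quickly it reclaims bandwidth, which is why per-network auto-tuning matters at cluster scale.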
Ready to Scale Your AI Infrastructure?
Experience the power of Zware AICloud and unlock the full potential of your large-scale AI workloads.