
Zware AICloud
Intelligent Computing Control and Scheduling Platform
Background
With the rapid development of artificial intelligence technology, large model computing power infrastructure has become a key pillar in digital transformation, greatly empowering the digital economy.
To support tasks such as AI large model training, large-scale intelligent computing centers are needed, built on clusters of thousands, tens of thousands, or even hundreds of thousands of GPUs. These compute cards must work collaboratively to provide sufficient computing power to process and update the massive parameters in these models.
Key Challenges
Faced with challenges of ultra-large scale, diverse configurations, high performance, and fine granularity, the platform's key innovations center on how to efficiently manage and utilize computing power resources and complete large model training tasks with high quality.
Ultra-large Scale Computing
Scheduling of ultra-large scale and diverse heterogeneous computing power
Extreme Utilization
Ensuring task operations and improving computing power utilization
Automatic Fault Detection
Identifying and alerting on various software and hardware anomalies
Innovative Service Model
Providing open architecture and refined operation functions
Product Introduction
The Zware-AICloud Intelligent Computing Control and Scheduling Platform is designed and developed for the control and scheduling of AI large model pre-training, ensuring that training runs efficiently at scale. It consolidates ultra-large-scale computing power infrastructure with end-to-end intelligent computing capabilities.
Platform Capabilities
The platform adopts end-to-end integrated heterogeneous distributed computing and RDMA communication frameworks, equipped with high-performance task scheduling engines, heterogeneous GPU adaptation, intelligent monitoring, and operation and maintenance control capabilities.
Product Architecture

Product Features
3.1 Core Features
Task Submission and Related Services
The user end provides a convenient visual interface, supporting custom settings for submitting distributed tasks, and includes built-in common computing frameworks such as PyTorch and MPI; it also offers services related to task management, storage management, and image management.
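As an illustration of what a distributed-task submission might carry, the sketch below assembles a JSON payload for a PyTorch job. The field names and values are illustrative assumptions for this example, not the platform's actual API.

```python
import json

# Hypothetical sketch of a distributed-task submission payload.
# Field names are assumptions, not the actual Zware-AICloud API.
def build_task_spec(name, framework, image, command,
                    workers=2, gpus_per_worker=8, priority="high"):
    """Assemble a JSON-serializable spec for a distributed training task."""
    if framework not in ("pytorch", "mpi"):
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "name": name,
        "framework": framework,          # which built-in launcher to use
        "image": image,                  # container image with the training code
        "command": command,              # entrypoint inside the container
        "replicas": workers,             # number of distributed workers
        "resources": {"gpus_per_worker": gpus_per_worker},
        "priority": priority,            # consumed by the scheduling engine
    }

spec = build_task_spec(
    name="llm-pretrain",
    framework="pytorch",
    image="registry.example.com/train:latest",
    command=["torchrun", "--nnodes=2", "train.py"],
)
print(json.dumps(spec, indent=2))
```

A visual interface would gather the same fields through a form and submit the resulting spec to the scheduler.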
Large-Scale Distributed Scheduling
Equipped with a powerful distributed scheduling engine, it supports the scheduling and management of computing power resources at the scale of thousands and tens of thousands of cards. Through various scheduling methods such as priority scheduling, reclamation strategies, preemptive scheduling, and fault-tolerant task restart scheduling, it meets complex needs in different application scenarios.
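The interplay of priority, preemption, and fault-tolerant restart can be sketched as a toy model: a fixed GPU pool where a higher-priority job may evict lower-priority jobs, which are requeued as if restarting from a checkpoint. This is an illustrative simplification, not the platform's actual engine.

```python
import heapq

# Toy model of priority scheduling with preemption over a fixed GPU pool.
class Scheduler:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.running = []   # (priority, name, gpus)
        self.queue = []     # min-heap of (-priority, name, gpus)

    def submit(self, name, gpus, priority):
        heapq.heappush(self.queue, (-priority, name, gpus))
        self._dispatch()

    def _dispatch(self):
        while self.queue:
            neg_prio, name, gpus = self.queue[0]
            if gpus <= self.free:
                heapq.heappop(self.queue)
                self.free -= gpus
                self.running.append((-neg_prio, name, gpus))
            elif self._can_preempt(-neg_prio, gpus):
                self._preempt(-neg_prio, gpus)
            else:
                break  # head of queue cannot be placed yet

    def _can_preempt(self, prio, need):
        # GPUs held by strictly lower-priority jobs are reclaimable.
        reclaimable = sum(g for p, _, g in self.running if p < prio)
        return self.free + reclaimable >= need

    def _preempt(self, prio, need):
        # Evict lowest-priority jobs first; evicted jobs are requeued,
        # modeling fault-tolerant restart from a checkpoint.
        self.running.sort()
        while self.free < need and self.running and self.running[0][0] < prio:
            p, name, g = self.running.pop(0)
            self.free += g
            heapq.heappush(self.queue, (-p, name, g))

sched = Scheduler(total_gpus=16)
sched.submit("batch-job", gpus=16, priority=1)
sched.submit("urgent-pretrain", gpus=16, priority=9)
print([name for _, name, _ in sched.running])  # urgent job preempts the batch job
```

A production engine adds fairness, gang scheduling, and topology awareness on top of this basic preempt-and-requeue loop.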
Heterogeneous and Long-Distance Control Scheduling
It enables unified cluster management of heterogeneous computing power, allowing users to easily view the real-time utilization of resources within the cluster, such as accelerators (GPUs, etc.), CPUs, and memory. It also supports long-distance computing power scheduling, enabling cross-data-center large model training.
Automatic Fault Detection and Alerts
Integrated with the operations and maintenance system, the platform automatically captures operational status data from network devices and comprehensively monitors network health, achieving automatic fault detection. It also supports automatic anomaly alarms and automated equipment inspection.
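A minimal form of such detection is rule-based: poll device status and raise an alert whenever a metric crosses a threshold. The metric names and limits below are assumptions chosen for the example.

```python
# Illustrative sketch of rule-based fault detection over polled device status.
# Metric names and thresholds are assumptions for this example.
THRESHOLDS = {
    "gpu_temp_c": 85,          # overheating
    "ecc_errors": 1,           # any uncorrected ECC error triggers an alert
    "link_flaps_per_hour": 3,  # unstable network link
}

def detect_faults(node_status):
    """Return (node, metric, value) alerts for thresholds that are exceeded."""
    alerts = []
    for node, metrics in node_status.items():
        for metric, limit in THRESHOLDS.items():
            value = metrics.get(metric, 0)
            if value >= limit:
                alerts.append((node, metric, value))
    return alerts

status = {
    "node-01": {"gpu_temp_c": 78, "ecc_errors": 0, "link_flaps_per_hour": 0},
    "node-02": {"gpu_temp_c": 91, "ecc_errors": 0, "link_flaps_per_hour": 5},
}
for alert in detect_faults(status):
    print("ALERT:", alert)
```

In practice the alert feed would drive the automatic alarms and trigger equipment inspection described above.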
User Value
Production Proven Excellence
Through the Zware-AICloud platform, users can achieve control and scheduling of ultra-large-scale intelligent computing clusters, automatic fault tolerance, and other functions. The platform is currently deployed in multiple large-scale intelligent computing clusters.
4.1 Large-Scale Distributed Scheduling
A built-in, powerful distributed scheduling engine supports the scheduling and management of computing power resources at the scale of thousands to tens of thousands of cards. Through scheduling methods such as priority scheduling, reclamation strategies, preemptive scheduling, and fault-tolerant task-restart scheduling, it meets complex needs across application scenarios, ensuring high-quality, stable task completion and improved computing efficiency.
4.2 Multidimensional Heterogeneous Computing Power Scheduling
The platform supports unified scheduling of heterogeneous computing power. It provides a unified primitive interface for the AI collective communication library, with targeted adaptation to different manufacturers' GPUs handled inside the library.
GPU Compatibility
This allows AI models to be trained on different GPU cards without per-vendor adaptation work, and it enables collaborative training across GPUs from multiple manufacturers.
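The idea of a unified primitive interface can be sketched as an abstract collective-communication API with pluggable vendor backends. The backend registry and the single-process reference implementation below are illustrative assumptions; a real backend would wrap a vendor library such as NCCL or its equivalent from another GPU maker.

```python
from abc import ABC, abstractmethod

# Sketch of a unified collective-communication primitive interface with
# pluggable per-vendor backends. Names are illustrative assumptions.
class Collective(ABC):
    @abstractmethod
    def all_reduce(self, tensors):
        """Sum the per-rank tensors; every rank sees the same result."""

class ReferenceBackend(Collective):
    # Single-process stand-in for a vendor library (NCCL or equivalent).
    def all_reduce(self, tensors):
        length = len(tensors[0])
        return [sum(t[i] for t in tensors) for i in range(length)]

BACKENDS = {"reference": ReferenceBackend}

def get_backend(vendor):
    # Model code calls one interface; this mapping hides vendor differences.
    return BACKENDS[vendor]()

backend = get_backend("reference")
print(backend.all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # per-rank gradients summed
```

Because training code only ever calls `all_reduce` on the abstract interface, swapping the GPU vendor reduces to registering a different backend.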
Long-Distance Optimization
The platform automatically adapts to collaborative training across different distances through software-defined methods, using congestion-aware Priority Flow Control (PFC) for improved performance.
4.3 Automatic Fault Prediction and Recovery
The platform uses real-time monitoring and predictive maintenance to detect potential problems in advance and take measures that reduce system failures and downtime, improving service reliability and stability. Based on historical data and real-time performance indicators, the system predicts and identifies potential fault points and acts before they cause business interruptions.
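One simple shape such prediction can take is trend extrapolation: fit a linear trend to a recent window of a health metric and estimate how many polling intervals remain before it crosses a failure threshold. This is a simplified stand-in for the predictive maintenance described, with an assumed metric and threshold.

```python
# Sketch of trend-based fault prediction via least-squares extrapolation.
def steps_to_threshold(history, threshold):
    """Fit a linear trend to history; return the estimated number of steps
    until the metric crosses threshold, or None if it is not rising."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    if slope <= 0:
        return None  # metric stable or improving: no predicted fault
    return max(0.0, (threshold - history[-1]) / slope)

# A correctable-memory-error count creeping upward: predict the crossing.
errors = [0, 2, 4, 6, 8]                         # slope = 2 per step
print(steps_to_threshold(errors, threshold=20))  # 6.0 steps until alarm level
```

Flagging the node roughly six polling intervals ahead leaves time to drain its jobs before the metric reaches the alarm level.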
4.4 High-Efficiency Congestion Control Capability
The platform achieves automatic parameter tuning for DCQCN (Data Center Quantized Congestion Notification) congestion control, adopting a distributed architecture that supports dynamic scaling. It applies load-balancing encoding in congestion control, orchestrating synchronized data flows for large model training across the entire network.
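At its core, DCQCN sender-side rate control cuts the sending rate multiplicatively when a congestion notification packet (CNP) arrives and recovers toward the pre-cut rate during quiet periods. The simplified sketch below illustrates that loop; the gain constant and timer behavior are the kind of parameters the auto-tuning mentioned above would adjust, and the values here are illustrative.

```python
# Simplified sketch of DCQCN sender-side rate adjustment: multiplicative
# decrease on a CNP, fast recovery toward the pre-cut rate otherwise.
# Constants are illustrative; real deployments tune g and the timers.
class DcqcnSender:
    G = 1 / 256  # alpha update gain

    def __init__(self, line_rate_gbps):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate for recovery
        self.alpha = 1.0           # estimated congestion severity

    def on_cnp(self):
        # Congestion: remember the current rate, cut proportionally to alpha.
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.G) * self.alpha + self.G

    def on_quiet_period(self):
        # No CNPs for a timer interval: decay alpha, recover halfway to target.
        self.alpha = (1 - self.G) * self.alpha
        self.rc = (self.rc + self.rt) / 2

s = DcqcnSender(line_rate_gbps=100)
s.on_cnp()
print(round(s.rc, 1))   # 50.0: rate halved on the first CNP (alpha starts at 1)
for _ in range(4):
    s.on_quiet_period()
print(round(s.rc, 1))   # 96.9: recovering toward the pre-cut 100 Gbps
```

Tuning the gain and quiet-period timers trades how hard the sender backs off against how quickly it reclaims bandwidth, which is why per-network auto-tuning matters at cluster scale.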
Ready to Scale Your AI Infrastructure?
Experience the power of Zware AICloud and unlock the full potential of your large-scale AI workloads.