
Zware AINOC
Intelligent Operation and Maintenance Control Platform
Background
With the rapid development of artificial intelligence (AI) technology, especially generative AI, intelligent computing centers, which serve as centralized hubs for computing resources and data processing, are becoming key infrastructure that drives technological innovation and supports digital transformation.
Government Initiative
On October 8, 2023, the Ministry of Industry and Information Technology, in conjunction with five other departments, jointly released the "High-Quality Development Action Plan for Computing Power Infrastructure". This plan proposes that new computing power infrastructure will integrate information computing power, network carrying capacity, and data storage capacity.
To support AI large model training and other services, large-scale intelligent computing centers with clusters of thousands, tens of thousands, or even hundreds of thousands of GPUs are needed to meet the demand for computing power. The control and operation of such centers are becoming increasingly cloud-native and intelligent.
Ultra-Large Scale
Managing clusters of thousands to hundreds of thousands of GPUs
Intelligent Operations
Cloud-native and AI-driven control systems
Product Introduction
2.1 Product Overview
Zware-AINOC is an intent-based intelligent operation and maintenance control platform, developed through research on network automation and network computing technology systems.
Comprehensive Monitoring
The system monitors the operational status, network traffic, network topology, and other relevant information of equipment such as computing nodes, storage nodes, switches, and servers in the intelligent computing center.
Web Interface
Network topology discovery, verification, anomaly alerts, automated configuration, super-visualized monitoring, and user account management.
CLI Interface
Command-line management for advanced users, scripting capabilities, and integration with existing DevOps workflows.
2.2 System Architecture
From top to bottom, the system architecture comprises the user interface layer, the core logic layer, and the network data access layer. The southbound interface connects to switches and servers, while the northbound interface provides unified standard services supporting both CLI and web interactions.

ZwareMaster
Core logic layer software deployed on management servers, providing centralized control and orchestration capabilities.
ZwareAgent
Data acquisition agent module on switches and servers, enabling comprehensive network monitoring and server linkage management.
Product Features
3.1 Core Features
The Zware-AINOC intelligent operation and maintenance control platform adopts a heterogeneous distributed computing framework, providing data-driven, comprehensive, and integrated intelligent operations, maintenance, and monitoring for intelligent computing centers.

Core Capabilities
Network Topology Discovery and Verification
Automatically discovers network devices, including switches and servers, and their topology connections, generating a network topology view. It can automatically compare the generated network topology with the planned topology according to specified policies, and provide anomaly alerts for inconsistencies.
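The comparison between a discovered topology and a planned one can be pictured as a set difference over links. The following is a minimal sketch only, not the platform's implementation: the `Link` model, the port naming scheme, and the `diff_topology` function are all hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Link(NamedTuple):
    """An undirected link between two device ports (hypothetical model)."""
    a: str  # e.g. "leaf1:eth1"
    b: str  # e.g. "gpu-server1:nic0"


def normalize(link: Link) -> Tuple[str, str]:
    # Sort the endpoints so that (a, b) and (b, a) compare as the same link.
    first, second = sorted((link.a, link.b))
    return (first, second)


def diff_topology(planned: Iterable[Link], discovered: Iterable[Link]) -> Dict[str, List[Tuple[str, str]]]:
    """Return links that are planned but not found, and found but not planned."""
    planned_set = {normalize(link) for link in planned}
    discovered_set = {normalize(link) for link in discovered}
    return {
        "missing": sorted(planned_set - discovered_set),
        "unexpected": sorted(discovered_set - planned_set),
    }


planned = [
    Link("leaf1:eth1", "gpu-server1:nic0"),
    Link("leaf1:eth2", "gpu-server2:nic0"),
]
discovered = [
    Link("gpu-server1:nic0", "leaf1:eth1"),   # same link, endpoints reversed
    Link("leaf1:eth3", "gpu-server2:nic0"),   # cabled to the wrong port
]

report = diff_topology(planned, discovered)
for kind, links in report.items():
    for link in links:
        print(f"{kind}: {link[0]} <-> {link[1]}")
```

Each entry in `missing` or `unexpected` would correspond to one anomaly alert. A real policy would likely also compare link speed or device role, which this sketch leaves out.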
Centralized Management of Network Integrated Facilities
The control and maintenance system covers detailed information and operational status of all network devices, providing one-stop query services to achieve centralized management of network infrastructure.
Automated Configuration
The system offers embedded device configuration templates, automatically adapting to device functions. The generated configurations can be deployed to all devices with one click, ensuring configuration accuracy and reducing manual workload.
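Template-based configuration generation might look roughly like the sketch below. It assumes one template per device role and uses Python's standard `string.Template`. The template text, the role names, and the `render_config` helper are hypothetical and do not reflect the platform's actual templates or any vendor's configuration syntax.

```python
from string import Template

# Hypothetical templates keyed by device role. Real templates would be
# vendor-specific and considerably longer.
TEMPLATES = {
    "leaf": Template("hostname $hostname\ninterface $uplink\n mtu $mtu\n"),
    "spine": Template("hostname $hostname\nrouter-id $router_id\n"),
}


def render_config(device: dict) -> str:
    """Pick the template for the device's role and fill in its variables."""
    template = TEMPLATES[device["role"]]
    # substitute() raises KeyError when a variable is missing, so incomplete
    # inventory data fails here rather than producing a partial configuration.
    return template.substitute(device)


devices = [
    {"role": "leaf", "hostname": "leaf1", "uplink": "eth48", "mtu": 9000},
    {"role": "spine", "hostname": "spine1", "router_id": "10.0.0.1"},
]

# Render every device in one pass; pushing the result to devices is out of scope.
configs = {d["hostname"]: render_config(d) for d in devices}
print(configs["leaf1"])
```

Failing on a missing variable before deployment is one simple way to support the accuracy goal described above.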
Super-Visualized Monitoring
Automatically captures operational status data and dynamic traffic data of network devices, displaying highly visualized network operation views and dynamic traffic views. It provides comprehensive monitoring of network operational status and effectively predicts network traffic trends to support decision-making.
Error Alerts
By setting automatic inspection policies, the control and maintenance system can automatically inspect and diagnose controlled devices, report errors at various severity levels in a timely manner, and support prompt fault recovery.
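An inspection policy can be thought of as a list of rules, each with a severity level, evaluated against the metrics collected from a device. The sketch below is illustrative only: the metric names, thresholds, and severity labels are assumptions, not the platform's actual policy format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    severity: str  # hypothetical levels: "warning" or "critical"
    is_violated: Callable[[Dict[str, float]], bool]


# Hypothetical policy: metric names and thresholds are made up for illustration.
RULES = [
    Rule("high_cpu", "warning", lambda m: m.get("cpu_percent", 0) > 90),
    Rule("link_errors", "critical", lambda m: m.get("crc_errors", 0) > 0),
]


def run_inspection(device: str, metrics: Dict[str, float]) -> List[str]:
    """Evaluate every rule and return one alert line per violated rule."""
    return [
        f"[{rule.severity}] {device}: {rule.name}"
        for rule in RULES
        if rule.is_violated(metrics)
    ]


alerts = run_inspection("leaf1", {"cpu_percent": 95.0, "crc_errors": 3})
for line in alerts:
    print(line)
```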
Host-Side Integrated Network Functions
Host servers integrate high-performance communication libraries and embedded intelligent traffic control mechanisms to achieve ultra-lossless and ultra-balanced traffic. The platform also supports RoCE-based high-performance storage on the storage side.
Application Scenarios
Currently, intelligent computing centers are built primarily around GPU clusters. These GPU clusters include in-rack interconnections, data center interconnections, and cross-data-center interconnections.

Large-Scale Deployment
Large-scale intelligent computing centers are characterized by having over 1,000 GPU cards. The diagram below shows a typical network layout of a large-scale intelligent computing center that supports 2,048 GPU cards, which can be used for AI large model training.

The intelligent operation and maintenance control platform is applied to intelligent computing centers, connecting to computing power servers/storage servers, network equipment, etc., to achieve automated and intelligent control and maintenance of computing nodes, storage nodes, and networks.
User Value
Proven in Production
The Zware-AINOC intelligent operation and maintenance control platform demonstrates powerful features and efficient operational capabilities in the deployment scenarios of intelligent computing centers.
5.1 Innovation Points
End-to-End Integrated Control Technology
End-to-end integrated control technology can significantly improve the operational efficiency and security of intelligent computing centers. By unifying the management of computing, storage, and network resources, the platform can achieve optimized resource allocation and efficient scheduling, reducing the complexity of maintenance and operations.
Efficient Fault Prediction and Automated Fault Recovery Technology
The platform has efficient fault prediction and automated fault recovery capabilities. Based on historical data and real-time performance indicators, the platform can predict and identify potential fault points and take proactive measures to avoid business interruptions.
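As a rough illustration of combining historical data with a real-time indicator, the sketch below flags a reading that deviates strongly from its own history using a simple z-score. This is a deliberately basic stand-in and is not the platform's prediction method. The function name, the threshold of 3.0, and the sample temperature values are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is notable
    return abs(latest - mu) / sigma > threshold


# Hypothetical history, e.g. an optical module temperature in degrees Celsius.
history = [41.0, 42.0, 40.0, 43.0, 41.0, 42.0]

print(is_anomalous(history, 42.5))  # within the normal range
print(is_anomalous(history, 55.0))  # far outside the historical range
```

A flagged reading would be a candidate for proactive action, such as draining traffic from the affected link before it fails.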
Network Optimization Capability
Optimization technologies include AI-DCQCN and Deep-Routing. AI-DCQCN analyzes and configures over 30 RoCE/NCCL parameters to achieve intelligent parameter tuning.
AI-DCQCN Technology
With principal component analysis and intelligent optimization algorithms at its core, AI-DCQCN combines microsecond-level connection status with human-machine collaboration to obtain the most suitable and interpretable configuration for each business scenario.

Deep-Routing Technology
The static ECMP of traditional networks is poorly suited to large model workloads, resulting in link conflicts, traffic imbalance, and low utilization. Deep-Routing, a deep load balancing technology, enhances the topology awareness of GPUs through end-to-end integration.
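The limitation of static ECMP can be seen in a small simulation. Static ECMP picks a path from a hash of the flow's header fields and ignores how much traffic each path already carries, so a few large flows can land on the same link. The sketch below contrasts that with a greedy least-loaded placement. The greedy picker is only a hypothetical stand-in for load-aware routing and is not the Deep-Routing algorithm; the flow list, the CRC32 hash, and the 100 Gbps flow size are assumptions.

```python
import zlib


def static_ecmp_pick(flow, n_paths):
    """Static ECMP: hash the flow's header fields, ignoring flow size and link load."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths


def least_loaded_pick(path_load):
    """Greedy stand-in for load-aware routing: choose the currently least-loaded path."""
    return min(range(len(path_load)), key=lambda i: path_load[i])


def place_flows(flows, n_paths, load_aware):
    """Assign each (flow, gbps) pair to a path and return the load per path."""
    path_load = [0] * n_paths
    for flow, gbps in flows:
        if load_aware:
            path = least_loaded_pick(path_load)
        else:
            path = static_ecmp_pick(flow, n_paths)
        path_load[path] += gbps
    return path_load


# Eight hypothetical 100 Gbps RoCEv2 flows (UDP destination port 4791) over four paths.
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "udp"), 100)
    for i in range(8)
]

ecmp_load = place_flows(flows, 4, load_aware=False)
aware_load = place_flows(flows, 4, load_aware=True)
print("static ECMP load per path:", ecmp_load)
print("load-aware load per path: ", aware_load)
```

With equal-sized flows, the greedy placement spreads the load evenly, while the hashed placement depends entirely on how the header fields happen to hash and can never do better than even.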

Heterogeneous Control Capability
Heterogeneous control is important for integration when building domestic intelligent computing centers. The platform has already integrated with a range of domestic GPUs and commercial data center switches.
GPU Support
- NVIDIA Series GPUs
- Huawei, TianShu, Cambricon
Switch Support
- Cisco, H3C, Ruijie
- SONiC-based Systems
5.2 Application Value
For large-scale intelligent computing centers, the intelligent operation and maintenance control platform delivers the following value to rapidly developing intelligent computing infrastructure:
Improving Operational Efficiency
Automated and intelligent management tools significantly reduce manual operations and improve maintenance efficiency.
Optimizing Resource Utilization
Dynamic resource allocation and optimization ensure the most effective use of computing, storage, and network resources.
Enhancing System Stability
Real-time monitoring and predictive maintenance detect potential problems in advance, improving reliability and stability.
Strengthening Security Management
Real-time monitoring, alerting, and response to security incidents protect data and system security.
Supporting Business Innovation
Efficient and reliable IT services help enterprises respond quickly to market changes and innovate.
Enhancing User Experience
Stable operation and rapid response improve end-user experience and satisfaction.
Promoting Intelligent Decision-Making
Collecting and analyzing maintenance data helps achieve data-driven decisions with improved accuracy.
Supporting Sustainable Development
Optimized energy management and resource utilization help achieve green operations.
Ready to Transform Your Infrastructure?
Experience the power of Zware AINOC and take your intelligent computing center to the next level.