
Zware AINOC

Intelligent Operation and Maintenance Control Platform

Background

With the rapid development of artificial intelligence (AI) technology, especially generative AI, intelligent computing centers, which serve as centralized hubs for computing resources and data processing, are gradually becoming key infrastructures that drive technological innovation and support digital transformation.

Government Initiative

On October 8, 2023, the Ministry of Industry and Information Technology, in conjunction with five other departments, jointly released the "High-Quality Development Action Plan for Computing Power Infrastructure". This plan proposes that new computing power infrastructure will integrate information computing power, network carrying capacity, and data storage capacity.

To support large AI model training and other services, large-scale intelligent computing centers need GPU clusters comprising thousands, tens of thousands, or even hundreds of thousands of GPUs to meet the demand for computing power. The control and operation of such centers are moving toward cloud-native, intelligent designs.

Ultra-Large Scale

Managing clusters of thousands to hundreds of thousands of GPUs

Intelligent Operations

Cloud-native and AI-driven control systems

Product Introduction

2.1 Product Overview

Zware-AINOC is an intent-based intelligent operation and maintenance control platform, developed through research on network automation and network computing technology systems.

One-click deployment
Self-testing collection
Intelligent analysis

Comprehensive Monitoring

The system monitors the operational status, network traffic, network topology, and other relevant information of equipment such as computing nodes, storage nodes, switches, and servers in the intelligent computing center.

DCQCN: Deep Connection
Packet Error Rate: 0.01%
GPU Availability: 99.9%
Congestion Detection: Real-time

Web Interface

Network topology discovery, verification, anomaly alerts, automated configuration, super-visualized monitoring, and user account management.

CLI Interface

Command-line management for advanced users, scripting capabilities, and integration with existing DevOps workflows.

2.2 System Architecture

From top to bottom, the system architecture comprises the user interface layer, the core logic layer, and the network data access layer. The southbound interface connects to switches and servers, while the northbound interface provides unified standard services that support both CLI and web interaction.

Figure: System Architecture

ZwareMaster

Core logic layer software deployed on management servers, providing centralized control and orchestration capabilities.

ZwareAgent

Data acquisition agent module on switches and servers, enabling comprehensive network monitoring and server linkage management.

Product Features

3.1 Core Features

The Zware-AINOC intelligent operation and maintenance control platform adopts a heterogeneous distributed computing framework, providing data-driven, comprehensive, and integrated intelligent operations, maintenance, and monitoring for intelligent computing centers.

Figure: AINOC Core Features

Core Capabilities

Network Topology Discovery and Verification

Automatically discovers network devices, including switches and servers, and their topology connections, generating a network topology view. It can automatically compare the generated network topology with the planned topology according to specified policies, and provide anomaly alerts for inconsistencies.
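The comparison of a discovered topology with a planned one can be sketched as a set difference over links. This is a minimal illustration, not the platform's implementation: the `Link` model, the port naming, and the `verify_topology` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """An undirected link between two device ports (hypothetical model)."""
    a: str  # e.g. "leaf1:eth1"
    b: str  # e.g. "node01:nic0"

    def key(self) -> frozenset:
        # Order-independent identity so (a, b) and (b, a) compare equal.
        return frozenset((self.a, self.b))


def verify_topology(planned, discovered):
    """Return links missing from, or unexpected in, the discovered topology."""
    planned_keys = {link.key() for link in planned}
    found_keys = {link.key() for link in discovered}
    return {
        "missing": sorted(sorted(k) for k in planned_keys - found_keys),
        "unexpected": sorted(sorted(k) for k in found_keys - planned_keys),
    }


# Made-up example data: one link is cabled to the wrong leaf port.
planned = [Link("leaf1:eth1", "node01:nic0"), Link("leaf1:eth2", "node02:nic0")]
discovered = [Link("node01:nic0", "leaf1:eth1"), Link("leaf1:eth3", "node02:nic0")]

report = verify_topology(planned, discovered)
for kind, links in report.items():
    for link in links:
        print(f"ALERT {kind}: {' <-> '.join(link)}")
```

Treating links as unordered pairs means the direction in which a link is discovered does not cause a false alert.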

Auto-discovery
Topology Validation
Anomaly Detection

Centralized Management of Network Integrated Facilities

The control and maintenance system covers detailed information and operational status of all network devices, providing one-stop query services to achieve centralized management of network infrastructure.

Unified Dashboard
Real-time Status
One-stop Query

Automated Configuration

The system offers embedded device configuration templates, automatically adapting to device functions. The generated configurations can be deployed to all devices with one click, ensuring configuration accuracy and reducing manual workload.
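Template-based configuration generation can be sketched with Python's standard `string.Template`. The roles, template text, and field names below are assumptions for illustration only; they do not reflect the platform's actual templates or any vendor's configuration syntax.

```python
from string import Template

# Hypothetical per-role templates; real device syntax differs by vendor.
TEMPLATES = {
    "leaf": Template("hostname $name\ninterface $uplink\n mtu $mtu\n"),
    "spine": Template("hostname $name\nrouter-id $router_id\n"),
}


def render_config(device):
    """Pick the template for the device role and fill in its values."""
    template = TEMPLATES[device["role"]]
    # substitute() raises KeyError on a missing value instead of
    # silently emitting an incomplete configuration.
    return template.substitute(device)


devices = [
    {"role": "leaf", "name": "leaf1", "uplink": "eth49", "mtu": 9000},
    {"role": "spine", "name": "spine1", "router_id": "10.0.0.1"},
]

configs = {d["name"]: render_config(d) for d in devices}
print(configs["leaf1"])
```

Failing loudly on a missing field is one simple way to get the error prevention the feature describes: a device never receives a half-rendered configuration.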

Smart Templates
One-click Deploy
Error Prevention

Super-Visualized Monitoring

Automatically captures operational status data and dynamic traffic data of network devices, displaying highly visualized network operation views and dynamic traffic views. It provides comprehensive monitoring of network operational status and effectively predicts network traffic trends to support decision-making.
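The source does not say how traffic trends are predicted. As one simple possibility, a least-squares line fitted to recent utilization samples can be extrapolated a few intervals ahead. The function names and sample values here are made up for illustration.

```python
def linear_trend(samples):
    """Fit y = slope * t + intercept by least squares over t = 0..n-1."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(samples))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_t


def forecast(samples, steps_ahead):
    """Extrapolate the fitted line steps_ahead intervals past the last sample."""
    slope, intercept = linear_trend(samples)
    return slope * (len(samples) - 1 + steps_ahead) + intercept


# Illustrative link utilization in Gbit/s, one sample per interval.
utilization = [100.0, 110.0, 120.0, 130.0]
print(forecast(utilization, steps_ahead=2))
```

A production system would likely use a richer model (seasonality, bursts from collective communication), but the input and output have the same shape: a window of samples in, a projected value out.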

Real-time Visualization
Traffic Analysis
Predictive Analytics

Error Alerts

By setting automatic inspection policies, the control and maintenance system automatically inspects and diagnoses controlled devices, raises timely error alerts at multiple severity levels, and promptly recovers from faults.
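An inspection policy with multi-level alerts might look like the following sketch. The metric names, thresholds, and the `inspect` function are hypothetical and chosen only to show the idea of mapping measurements to severity levels.

```python
from enum import IntEnum


class Level(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical policy: metric name -> (warning threshold, critical threshold).
POLICY = {
    "cpu_percent": (80.0, 95.0),
    "port_error_rate": (0.0001, 0.001),
}


def inspect(device, metrics, policy=POLICY):
    """Compare each reported metric with its thresholds and return alerts."""
    alerts = []
    for name, (warn, crit) in policy.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported by this device
        if value >= crit:
            alerts.append((Level.CRITICAL, device, name, value))
        elif value >= warn:
            alerts.append((Level.WARNING, device, name, value))
    # Most severe first, so operators see critical faults at the top.
    return sorted(alerts, key=lambda a: a[0], reverse=True)


alerts = inspect("leaf1", {"cpu_percent": 85.0, "port_error_rate": 0.002})
for level, device, name, value in alerts:
    print(f"{level.name} {device} {name}={value}")
```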

Auto-inspection
Multi-level Alerts
Fault Recovery

Host-Side Integrated Network Functions

Host servers integrate high-performance communication libraries and embedded intelligent traffic control mechanisms to achieve ultra-lossless and ultra-balanced traffic. On the storage side, the platform also supports high-performance storage based on RoCE.

High Performance
Traffic Control
RoCE Support

Application Scenarios

Currently, intelligent computing centers are characterized primarily by GPU clusters. These GPU clusters include in-rack interconnections, data center interconnections, and cross-data center interconnections.

Figure: AINOC Application Scenario

Large-Scale Deployment

Large-scale intelligent computing centers are characterized by having over 1,000 GPU cards. The diagram below shows a typical network layout of a large-scale intelligent computing center that supports 2,048 GPU cards, which can be used for AI large model training.
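To give a feel for the scale involved, the sketch below sizes a non-blocking two-tier leaf-spine fabric. The source does not specify the switch radix or the number of tiers, so every input here is an assumption: one network port per GPU card, identical 64-port switches, and an even split of leaf ports between downlinks and uplinks.

```python
import math


def leaf_spine_size(gpu_ports, switch_ports):
    """Size a non-blocking two-tier leaf-spine fabric.

    Assumes one network port per GPU and identical switches whose ports
    are split evenly between downlinks and uplinks on every leaf.
    """
    down_per_leaf = switch_ports // 2
    up_per_leaf = switch_ports - down_per_leaf
    leaves = math.ceil(gpu_ports / down_per_leaf)
    # Each spine needs one port per leaf, which caps the number of leaves.
    if leaves > switch_ports:
        raise ValueError("two tiers are not enough for this port count")
    # One uplink from each leaf to each spine requires up_per_leaf spines.
    spines = up_per_leaf
    return {"leaves": leaves, "spines": spines, "fabric_links": leaves * up_per_leaf}


print(leaf_spine_size(gpu_ports=2048, switch_ports=64))
```

Under these assumptions the result is 64 leaf switches, 32 spine switches, and 2,048 leaf-to-spine links, which is the largest non-blocking two-tier fabric that 64-port switches allow.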

GPU Cards Supported: 2,048
Network Architecture: Multi-tier
Large Model Training: AI-Ready

AINOC Maintenance Control Platform

The intelligent operation and maintenance control platform is applied to intelligent computing centers, connecting to computing power servers/storage servers, network equipment, etc., to achieve automated and intelligent control and maintenance of computing nodes, storage nodes, and networks.

User Value

Proven in Production

The Zware-AINOC intelligent operation and maintenance control platform demonstrates powerful features and efficient operational capabilities in the deployment scenarios of intelligent computing centers.

10+ GPU centers with 1,000+ cards

5.1 Innovation Points

End-to-End Integrated Control Technology

End-to-end integrated control technology can significantly improve the operational efficiency and security of intelligent computing centers. By unifying the management of computing, storage, and network resources, the platform can achieve optimized resource allocation and efficient scheduling, reducing the complexity of maintenance and operations.

Improved Efficiency
Enhanced Security
Reduced Latency

Efficient Fault Prediction and Automated Fault Recovery Technology

The platform has efficient fault prediction and automated fault recovery capabilities. Based on historical data and real-time performance indicators, the platform can predict and identify potential fault points and take proactive measures to avoid business interruptions.

Predictive Analytics
Auto-recovery
Business Continuity

Network Optimization Capability

Optimization technologies include AI-DCQCN and Deep-Routing. AI-DCQCN involves the analysis and configuration of over 30 parameters for RoCE/NCCL, achieving intelligent parameter tuning.
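To show why such parameters matter, the sketch below models only the sender-side rate decrease and congestion-estimate update from the publicly described DCQCN algorithm. It is not Zware's AI-DCQCN; the class name is hypothetical, the rate-increase phases are omitted, and the default values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ReactionPoint:
    """Sender-side DCQCN state (rate decrease and alpha update only)."""
    current_rate: float   # R_C, e.g. in Gbit/s
    target_rate: float    # R_T
    alpha: float = 1.0    # congestion estimate in [0, 1]
    g: float = 1 / 256    # alpha update gain, one of the tunable parameters

    def on_cnp(self):
        """A congestion notification packet arrived: cut the rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP arrived during the timer period: decay the estimate."""
        self.alpha *= 1 - self.g


rp = ReactionPoint(current_rate=100.0, target_rate=100.0)
rp.on_cnp()
print(rp.current_rate, rp.alpha)
```

Even in this reduced form, the gain `g` controls how quickly the sender forgets past congestion, which in turn decides how deep the next rate cut is. Tuning many such interacting values by hand is the problem that automated parameter tuning addresses.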

AI-DCQCN Technology

AI-DCQCN is built around principal component analysis and intelligent optimization algorithms. Combined with microsecond-level connection status data, it uses human-machine collaboration to arrive at an interpretable configuration suited to the business scenario.

Figure: AINOC Optimization Diagram

Deep-Routing Technology

Static ECMP (equal-cost multi-path) routing in traditional networks is poorly matched to large-model workloads, resulting in link conflicts, traffic imbalances, and low utilization rates. Deep-Routing, a deep load-balancing technology, improves the topology awareness of GPUs through end-to-end integration.
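The imbalance of static ECMP can be illustrated with a short simulation: each flow is hashed once to a path, so a handful of long-lived, high-bandwidth flows may share one link while other links stay idle. The flow values and the CRC32 hash are assumptions for illustration; real switches use vendor-specific hash functions.

```python
import zlib
from collections import Counter


def ecmp_path(flow, num_paths):
    """Static ECMP: hash the flow 5-tuple and pick one equal-cost path."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths


# Hypothetical flows: (src_ip, dst_ip, src_port, dst_port, protocol).
# UDP port 4791 is the RoCEv2 destination port.
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 49152, 4791, "UDP") for i in range(8)]

load = Counter(ecmp_path(flow, num_paths=8) for flow in flows)
for path in range(8):
    print(f"path {path}: {load[path]} flow(s)")
```

Because the choice depends only on header fields and never on link load, any collision persists for the lifetime of the flow. That static behavior is what load-aware routing aims to replace.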

Heterogeneous Control Capability

Heterogeneous control is significant for integration work in the construction of domestic intelligent computing centers. The platform has already been integrated with a range of GPUs, including domestic models, and with commercial data center switches.

GPU Support
  • NVIDIA Series GPUs
  • Huawei, TianShu, Cambricon
Switch Support
  • Cisco, H3C, Ruijie
  • SONiC-based Systems

5.2 Application Value

In the case of the large-scale intelligent computing center, the intelligent operation and maintenance control platform demonstrates the following application values for the rapidly developing intelligent computing infrastructure:

Improving Operational Efficiency

Automated, intelligent management tools significantly reduce manual operations and improve maintenance efficiency.

Optimizing Resource Utilization

Dynamic allocation and optimization ensure the most effective use of computing, storage, and network resources.

Enhancing System Stability

Real-time monitoring and predictive maintenance detect potential problems in advance, improving reliability and stability.

Strengthening Security Management

Real-time monitoring, alerting, and response to security incidents protect data and systems.

Supporting Business Innovation

Efficient and reliable IT services help enterprises respond quickly to market changes and innovate.

Enhancing User Experience

Stable operation and rapid response improve end-user experience and satisfaction.

Promoting Intelligent Decision-Making

Collecting and analyzing maintenance data helps achieve data-driven decisions with improved accuracy.

Supporting Sustainable Development

Optimized energy management and resource utilization help achieve green operations.

Ready to Transform Your Infrastructure?

Experience the power of Zware AINOC and take your intelligent computing center to the next level.