AIDC Network Solution Based on RoCE Switches

Design ideas for networking solutions

High performance

The AIGC bearer network needs high bandwidth to support fast data transmission and processing. Content generation may involve large-scale text, image, or video data, so a high-bandwidth network connection is required to move data quickly to the compute nodes for processing. The network must also offer low latency so that generated content is delivered responsively and in real time: after a user submits a task or request, the network needs to respond quickly and carry out task allocation or resource scheduling.

Large scale

The AIGC bearer network must handle a large number of user requests and tasks and support concurrent access by many users. The network architecture therefore needs high scalability and load-balancing capability. For example, by adopting distributed computing and storage technologies, the network can scale horizontally and automatically adjust resource allocation to meet growing user demand.

High availability

The AIGC bearer network needs high availability to ensure service continuity and stability. Because AIGC is based on artificial intelligence technology, generation can take a long time and consume a large amount of computing resources. The network therefore needs fault-tolerance mechanisms and fault-recovery strategies to cope with hardware failures, network interruptions, and other unexpected events.

Overall solution architecture

Compute network design, Plan 1: 1:1 convergence ratio (no oversubscription) across the entire network

This plan does not constrain how the 8 interfaces of each GPU server are attached: they may connect to one ToR or to multiple ToRs.

  • Switches: 10 Leaf + 20 ToR = 30 units, providing 640 access ports (20 * 32 = 640). With 8 ports per GPU server, up to 80 GPU servers can be connected.
  • Both the access side and the internal Fabric interconnects can use 200G AOCs (each including the 200G optical modules at both ends): 600 on the access side and 600 on the Fabric side, 1200 in total.
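The port arithmetic in the Plan 1 bullets can be checked with a short Python sketch. The function name `access_capacity` and its parameter names are illustrative only; the figure of 32 access ports per ToR is read off the 20-by-32 product quoted in the text.

```python
def access_capacity(tor_count: int, access_ports_per_tor: int, ports_per_server: int):
    """Return (total access ports, maximum GPU servers) for a ToR layer."""
    total_ports = tor_count * access_ports_per_tor
    return total_ports, total_ports // ports_per_server

# Figures from the Plan 1 bullets: 20 ToRs, 32 access ports each, 8 ports per GPU server.
ports, servers = access_capacity(tor_count=20, access_ports_per_tor=32, ports_per_server=8)

# AOC counts exactly as stated in the text: 600 access side + 600 Fabric side.
total_aocs = 600 + 600

print(ports, servers, total_aocs)
```

Note that the 600 access-side AOCs cover fewer ports than the 640 available, so the cable count reflects the deployed servers rather than the maximum capacity.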

Scalability of Plan 1

  • Based on this architecture, up to 64 ToRs can be connected, expanding the network to 2048 200G interfaces and meeting the scalability requirement of 1280 access interfaces.
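As a rough check of the expansion headroom, the sketch below assumes the expanded design keeps the same 32 access ports per ToR as the base design (the text does not state this explicitly). The ToR and server counts it derives for the 1280-interface target are arithmetic consequences of that assumption, not figures from the solution itself.

```python
import math

ACCESS_PORTS_PER_TOR = 32  # assumption: same per-ToR figure as the 20 * 32 = 640 base design
PORTS_PER_SERVER = 8       # from the text: each GPU server has 8 ports

def max_access_ports(tor_count: int) -> int:
    """Total 200G access interfaces offered by tor_count ToRs."""
    return tor_count * ACCESS_PORTS_PER_TOR

def tors_needed(required_ports: int) -> int:
    """Smallest number of ToRs that offers at least required_ports."""
    return math.ceil(required_ports / ACCESS_PORTS_PER_TOR)

required = 1280                        # expansion target quoted in the text
fits = max_access_ports(64) >= required
print(max_access_ports(64), fits, tors_needed(required), required // PORTS_PER_SERVER)
```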

Compute network design, Plan 2: 1:1 convergence ratio (no oversubscription) across the entire network

This plan takes the attachment of the 8 interfaces of each GPU server into account: the 8 interfaces connect to 8 different ToRs, and every 8 ToRs form a group.

  • Switches: 13 Leaf + 24 ToR = 37 units, with 600 access ports (75 GPU servers). Each group of 8 ToRs connects 25 GPU servers, so 3 groups of ToRs connect 75 servers.
  • Each group of ToRs connects 25 GPU servers, giving a downlink access bandwidth of 200 * 200GE, so the uplink must also provide at least 200 * 200GE. Each ToR has 2 200G links to each Leaf, so the total uplink bandwidth per group is 2 * 13 * 8 * 200GE (208 * 200GE), which meets the 1:1 convergence requirement.
  • Both the access side and the internal Fabric interconnects can use 200G AOCs (each including the 200G optical modules at both ends): 600 on the access side and 624 on the Fabric side, 1224 in total.
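The 1:1 check for one ToR group, and the Fabric-side cable count, can be reproduced as follows. This is a minimal sketch: the function and parameter names are hypothetical, and all links are taken to be 200G so that link counts stand in for bandwidth.

```python
from fractions import Fraction

def group_convergence(servers_per_group, ports_per_server,
                      tors_per_group, leaf_count, links_per_tor_leaf):
    """Return (downlinks, uplinks, downlink:uplink ratio) for one ToR group."""
    downlinks = servers_per_group * ports_per_server
    uplinks = tors_per_group * leaf_count * links_per_tor_leaf
    return downlinks, uplinks, Fraction(downlinks, uplinks)

# Figures from the text: 25 servers * 8 ports, 8 ToRs per group, 13 Leafs, 2 links per ToR-Leaf pair.
down, up, ratio = group_convergence(25, 8, 8, 13, 2)
non_blocking = ratio <= 1          # uplink capacity covers the downlink capacity

fabric_links = 24 * 13 * 2         # 24 ToRs, each with 2 links to each of 13 Leafs
access_links = 75 * 8              # 75 GPU servers, 8 ports each
print(down, up, non_blocking, access_links + fabric_links)
```

A ratio at or below 1 means the group's uplinks carry at least as much as its downlinks, which is the condition the text calls 1:1 convergence.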

Scalability of Plan 2

  • Based on this architecture, up to 8 groups of ToRs can be connected. With each group of 8 ToRs connecting 32 GPU servers, 8 groups connect 256 servers.
  • This scales to 2048 200G access interfaces, meeting the scalability requirement of 1280 access interfaces.
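The base and fully expanded Plan 2 layouts can be compared with a small data structure. The class name `TorGroupPlan` and its fields are illustrative; the per-ToR port figure assumes each server places exactly one interface on each of the 8 ToRs in its group.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TorGroupPlan:
    groups: int              # number of ToR groups
    tors_per_group: int      # ToRs in each group
    servers_per_group: int   # GPU servers attached to each group
    ports_per_server: int    # 200G interfaces per GPU server

    @property
    def servers(self) -> int:
        return self.groups * self.servers_per_group

    @property
    def access_ports(self) -> int:
        return self.servers * self.ports_per_server

    @property
    def access_ports_per_tor(self) -> int:
        # Assumption: a server's interfaces are spread evenly over the ToRs of its group.
        return self.servers_per_group * self.ports_per_server // self.tors_per_group

base = TorGroupPlan(groups=3, tors_per_group=8, servers_per_group=25, ports_per_server=8)
expanded = TorGroupPlan(groups=8, tors_per_group=8, servers_per_group=32, ports_per_server=8)

print(base.servers, base.access_ports)
print(expanded.servers, expanded.access_ports, expanded.access_ports >= 1280)
```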

Storage network design: 3:1 convergence ratio across the entire network

  • Switches: 2 Leaf + 3 ToR = 5 units, providing up to 144 access ports (meeting the requirement of 100 access ports).
  • If a highly reliable (redundant) Leaf deployment is not required, a single Leaf can be used instead.
  • Both the access side and the internal Fabric interconnects can use 200G AOCs (each including the 200G optical modules at both ends): 100 on the access side and 36 on the Fabric side, 136 in total.
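The stated 3:1 convergence can be sanity-checked against the cable counts. The sketch assumes every access and Fabric link runs at the same 200G speed, so the ratio of link counts equals the ratio of bandwidth; `convergence_ratio` is an illustrative helper name.

```python
def convergence_ratio(access_links: int, fabric_links: int) -> float:
    """Downlink-to-uplink ratio, assuming all links run at the same speed."""
    return access_links / fabric_links

access_links = 100   # access-side AOCs from the text
fabric_links = 36    # Fabric-side AOCs from the text

ratio = convergence_ratio(access_links, fabric_links)
within_target = ratio <= 3.0   # the design target is 3:1

print(f"{ratio:.2f}:1", within_target, access_links + fabric_links)
```

With 100 access links in use, the ratio works out slightly below the 3:1 target.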

Storage network scalability

  • Switches: 2 Leaf + 5 ToR = 7 units, providing up to 240 access ports (meeting the expansion requirement of 240 access ports).
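The storage switch counts follow from a simple sizing rule. The sketch below infers 48 access ports per storage ToR from the quoted totals (144 ports on 3 ToRs, 240 ports on 5 ToRs); that per-ToR figure and the helper names are assumptions, not stated in the text.

```python
import math

ACCESS_PORTS_PER_STORAGE_TOR = 48  # assumption inferred from 144 / 3 and 240 / 5
LEAF_COUNT = 2                     # from the text: 2 Leaf switches

def storage_tors_needed(required_ports: int) -> int:
    """Smallest ToR count offering at least required_ports access ports."""
    return math.ceil(required_ports / ACCESS_PORTS_PER_STORAGE_TOR)

def storage_switch_total(required_ports: int) -> int:
    """Total switches (Leaf + ToR) for the required access ports."""
    return LEAF_COUNT + storage_tors_needed(required_ports)

for required in (100, 240):
    tors = storage_tors_needed(required)
    print(required, tors, storage_switch_total(required), tors * ACCESS_PORTS_PER_STORAGE_TOR)
```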

Jaguar-network RoCE value and advantages

Ultra-low TCO, ultra-high cost performance

Compared with an InfiniBand (IB) solution, it significantly reduces the user's network TCO while maintaining ultra-high performance.

Smooth horizontal expansion, 1:1 convergence without blocking

The non-converged network design ensures a non-blocking, high-capacity network that can be expanded horizontally on demand.

RoCEv2 across the entire network

Based on PFC/ECN and end-to-end collaboration capabilities, it provides lossless network services with performance comparable to IB.

Expert service

A professional, comprehensive, and reliable solution and service team provides customers with rapid, hour-level response services.