

#### A Preemptive Buffer Management for On-chip Shared-memory Switches

Danfeng Shan, Yunguang Li, Jinchao Ma, Zhenxing Zhang, Zeyu Liang, Xinyu Wen, Hao Li, Wanchun Jiang, Nan Li, Fengyuan Ren

https://github.com/ants-xjtu/Occamy









#### Switch Buffer

- Short flows: Absorb transient bursts
- Long flows: Maintain high throughput



Microbursts encompass most congestion events<sup>[1]</sup>

DCTCP requires ~60-70% BDP buffering for 100% throughput<sup>[2]</sup>

[1] Zhang Q, Liu V, Zeng H, et al. High-resolution measurement of data center microbursts. Proceedings of ACM IMC, 2017: 78-85.

[2] Bai W, Hu S, Chen K, et al. One more config is enough: Saving (DC)TCP for high-speed extremely shallow-buffered datacenters. IEEE/ACM Transactions on Networking, 2020, 29(2): 489-502.

□ Today's DCN Switch: On-chip Shared-Memory

[Figure: switch chip block diagram. Port blocks (8x50Gbps each) connect to packet-processing blocks containing ingress and egress pipelines, all attached to a central traffic manager with shared buffer.]

Globally shared on-chip packet buffer

Broadcom Tomahawk 4 switch chip<sup>[1]</sup>

[1] https://docs.broadcom.com/docs/12398014

#### □ Trends of Switch Buffer



Doubling the switching capacity every two years<sup>[1]</sup>




SRAM scaling appears to have completely collapsed<sup>[2]</sup>



[1] https://www.broadcom.com/blog/driving-the-data-center-into-the-future

[2] https://fuse.wikichip.org/news/7343/iedm-2022-did-we-just-witness-the-death-of-sram

#### □ Trends of Switch Buffer



The switch buffer (relative to switching capacity) has decreased by 4x

### Buffer Management (BM)

Buffer Management (BM): Allocate buffer across queues



#### Goals of BM

- Fair: Don't starve queues when facing dynamic traffic
- Efficient: Don't waste scarce buffer, so that burst absorption is maximized
- Simple: Easy to implement in high-speed switch chips

### Buffer Management (BM)

□ Analogy: iCloud Storage Sharing

#### Six members share iCloud storage






I require 90GB

We don't require any storage, for now. But we may require 100GB in the future

#### □ Scheme 1: Sufficient Reservation

◆ Example BMs: Complete Partition, DT (Dynamic Threshold) with a small $\alpha$



#### Scheme 2: On-demand Allocation

**100GB** 

◆ Example BMs: Complete Sharing, DT with a large $\alpha$
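For reference, Dynamic Threshold (DT) admits a packet to a queue only while that queue is shorter than $\alpha$ times the currently free buffer. The sketch below shows how $\alpha$ moves DT between the two schemes. It assumes that all congested queues sit exactly at the threshold; the function names and the 100-unit buffer are illustrative, not taken from the slides.

```python
def dt_threshold(alpha: float, buffer_size: float, occupied: float) -> float:
    """DT threshold: alpha times the currently free buffer."""
    return alpha * (buffer_size - occupied)


def dt_steady_state(alpha: float, buffer_size: float, n_congested: int):
    """Per-queue length and free buffer when n_congested queues sit at the threshold.

    Solving q = alpha * (B - n * q) for q gives q = alpha * B / (1 + alpha * n).
    """
    per_queue = alpha * buffer_size / (1 + alpha * n_congested)
    free = buffer_size - n_congested * per_queue
    return per_queue, free


# Hypothetical numbers: 100 units of buffer, one congested queue.
small_alpha = dt_steady_state(alpha=0.5, buffer_size=100, n_congested=1)
large_alpha = dt_steady_state(alpha=8, buffer_size=100, n_congested=1)
```

With $\alpha = 0.5$, two thirds of the buffer stays reserved for queues that are currently idle (close to Scheme 1). With $\alpha = 8$, only one ninth stays free (close to Scheme 2).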






1<sup>st</sup> year




10 years later

### The Buffer Choking Problem




Experiments on Huawei CE6865 switch



Buffer choking can significantly degrade the transmission performance

#### Scheme 2: On-demand Allocation

◆ Example BMs: Complete Sharing, DT with a large $\alpha$

We are poor and can only afford 100GB. Let's buy 100GB of storage and <u>dynamically share</u> it among members.



□ Why is on-demand allocation not fair?

Non-preemption: Passively wait for others to naturally free the space
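A toy calculation can make the cost of waiting concrete. The sketch below assumes a DT-style admission rule with a large $\alpha$ and a first queue that does not drain at all while the burst arrives; all numbers and names are hypothetical.

```python
def admitted_under_dt(alpha: float, capacity: int, other_occupancy: int, arrivals: int) -> int:
    """Packets admitted to an initially empty queue under a DT-style rule.

    The other queues hold `other_occupancy` units and are assumed not to drain,
    which models non-preemption: nobody takes buffer back from them.
    """
    qlen = 0
    for _ in range(arrivals):
        free = capacity - other_occupancy - qlen
        if qlen < alpha * free:  # admit only while below alpha * free buffer
            qlen += 1
    return qlen


# A slow queue first grabs what DT allows it from an empty 100-unit buffer.
hog = admitted_under_dt(alpha=8, capacity=100, other_occupancy=0, arrivals=1000)
# A 50-packet burst then arrives at another queue before the hog has drained.
burst = admitted_under_dt(alpha=8, capacity=100, other_occupancy=hog, arrivals=50)
```

In this setting the first queue holds 89 units and the burst gets only 10 of its 50 packets admitted, although its queue is far shorter than the first one.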


□ An optimal scheme for (poor) people (*i.e.*, Pushout)



**1** Everyone can get space whenever there is free storage


**2** If someone requires space while storage is full, reclaim the storage of the person <u>using the most storage</u>.


✓ Efficient

✓ Fair

✗ Simple
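The two rules can be sketched in a few lines. This is a toy model, not a switch implementation: every packet occupies one buffer unit, and dropping from the head of the longest queue is an assumption of the sketch (the rules only say whose storage is reclaimed).

```python
from collections import deque


class PushoutBuffer:
    """Shared buffer with Pushout: when full, reclaim space from the longest queue."""

    def __init__(self, capacity: int, n_queues: int):
        self.capacity = capacity
        self.queues = [deque() for _ in range(n_queues)]

    def occupancy(self) -> int:
        return sum(len(q) for q in self.queues)

    def enqueue(self, qid: int, packet):
        """Admit `packet` to queue `qid`; return the pushed-out packet, or None."""
        pushed_out = None
        if self.occupancy() >= self.capacity:
            # Rule 2: storage is full, so reclaim one unit from the longest queue.
            longest = max(range(len(self.queues)), key=lambda i: len(self.queues[i]))
            pushed_out = self.queues[longest].popleft()
        # Rule 1: there is free space now, so the arrival is always admitted.
        self.queues[qid].append(packet)
        return pushed_out


buf = PushoutBuffer(capacity=4, n_queues=2)
for i in range(4):
    buf.enqueue(0, i)              # queue 0 fills the whole buffer
victim = buf.enqueue(1, "burst")   # queue 1 still gets space immediately
```

Here `victim` is packet `0`, taken from queue 0, and the queue lengths become 3 and 1. Note that `max` scans every queue on each arrival to a full buffer.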



Unacceptable for traditional off-chip shared-memory switches

Status quo: On-chip shared-memory switches <u>significantly extend memory bandwidth</u>

### Why the Optimal Scheme is not Simple

Difficulty 2: Requires complex enqueue operations



• Notify the ingress side to enqueue the packet

### Why the Optimal Scheme is not Simple

Difficulty 3: Monitoring the longest queue in real time



8-input maximum finder based on binary comparator tree
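To see what such a maximum finder involves, here is a small software model of a binary comparator tree. It is an illustrative sketch rather than hardware code; the function name and return format are made up for this example.

```python
def comparator_tree_max(lengths):
    """Find the longest queue with a binary comparator tree.

    Returns (index, length, levels), where `levels` is the number of
    comparator stages between the inputs and the output.
    """
    if not lengths:
        raise ValueError("need at least one input")
    layer = list(enumerate(lengths))  # (queue index, queue length) pairs
    levels = 0
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer) - 1, 2):
            a, b = layer[i], layer[i + 1]
            next_layer.append(a if a[1] >= b[1] else b)  # one 2-input comparator
        if len(layer) % 2:  # an unpaired input passes through unchanged
            next_layer.append(layer[-1])
        layer = next_layer
        levels += 1
    index, length = layer[0]
    return index, length, levels


result = comparator_tree_max([3, 9, 1, 4, 7, 2, 8, 5])
```

For 8 inputs the tree needs 7 comparators arranged in 3 levels. In general, N queues need N-1 comparators and about log2(N) levels, and the whole tree must be re-evaluated whenever queue lengths change.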



Occamy: a preemptive buffer management scheme

- ✓ Efficient: (Almost) fully utilizes the buffer
- ✓ Fair: Quickly adjusts the buffer allocation
- ✓ Simple: Easy to implement in switch chips

- Expels packets in a round-robin manner
  - Overcomes the 3<sup>rd</sup> difficulty
- Proactively reserves a small fraction of free buffer
  - Overcomes the 2<sup>nd</sup> difficulty
  - Keeps admission and expulsion mutually independent
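One way to picture these ideas together is the toy model below. It is a rough sketch, not the authors' implementation: the DT-style threshold used for both admission and expulsion, the parameter `alpha`, and all names are assumptions made for illustration.

```python
from collections import deque


class PreemptiveBuffer:
    """Toy model: threshold-based admission plus independent round-robin expulsion."""

    def __init__(self, capacity: int, n_queues: int, alpha: float):
        self.capacity = capacity
        self.alpha = alpha  # assumed DT-style parameter; keeps some buffer free
        self.queues = [deque() for _ in range(n_queues)]
        self.rr_next = 0    # round-robin pointer used by expulsion only

    def threshold(self) -> float:
        free = self.capacity - sum(len(q) for q in self.queues)
        return self.alpha * free

    def admit(self, qid: int, packet) -> bool:
        """Admission path: checks the threshold only, never waits for an expulsion."""
        if len(self.queues[qid]) < self.threshold():
            self.queues[qid].append(packet)
            return True
        return False

    def expel_one(self):
        """Expulsion path: head-drop one packet from the next over-threshold queue."""
        n = len(self.queues)
        limit = self.threshold()
        for step in range(n):
            qid = (self.rr_next + step) % n
            if len(self.queues[qid]) > limit:
                self.queues[qid].popleft()
                self.rr_next = (qid + 1) % n
                return qid
        return None
```

Because `admit` relies on the reserved free buffer, an arriving packet can be enqueued without first triggering a drop elsewhere. `expel_one` can then run whenever spare cycles exist, and it needs no comparison between queues.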













□ Head-drop selector: selects a head-drop queue
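A round-robin selector can be modeled with a bitmap and a pointer. The sketch below assumes one "over threshold" bit per queue; the bitmap encoding and the function name are hypothetical and not taken from the slides.

```python
def select_head_drop_queue(over_threshold: int, n_queues: int, last: int):
    """Round-robin pick over a bitmap of candidate queues.

    over_threshold: bit i is 1 if queue i is currently a head-drop candidate.
    last: index of the previously selected queue; the search starts after it.
    Returns the selected queue index, or None if no bit is set.
    """
    for step in range(1, n_queues + 1):
        qid = (last + step) % n_queues
        if (over_threshold >> qid) & 1:
            return qid
    return None


# Queues 1, 4 and 7 are candidates; queue 1 was served last.
first = select_head_drop_queue(0b10010010, n_queues=8, last=1)
second = select_head_drop_queue(0b10010010, n_queues=8, last=first)
```

Starting after queue 1, the selector picks queue 4 and then queue 7. Each candidate is served in turn, and no queue lengths are compared against each other.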












| Memory | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|---|---|---|---|---|
| PD Memory | ① Read PD | ② Dequeue PD | Read Next PD | Dequeue Next PD |
| Cell Pointer Memory | Read Prev. Cell Ptr | ③ Read Cell Ptr | ③ Read Cell Ptr | Read Next Cell Ptr |
| Cell Pointer Memory | Free Prev. Cell | Free Prev. Cell | ④ Free Cell | ④ Free Cell |
| Cell Data Memory | Read Prev. Cell Data | Read Prev. Cell Data | ⑤ Read Cell Data | ⑤ Read Cell Data |

#### TX pipeline

- ① Read a PD from PD memory
- ② Dequeue the PD
- ③ Read cell pointer from cell pointer memory
- ④ Free cell (by moving the cell pointer to the free cell ptr list)
- ⑤ Read cell data

| Memory | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|---|---|---|---|---|
| PD Memory | ① Read PD | ② Dequeue PD | Read Next PD | Dequeue Next PD |
| Cell Pointer Memory | Read Prev. Cell Ptr | ③ Read Cell Ptr | ③ Read Cell Ptr | Read Next Cell Ptr |
| Cell Pointer Memory | Free Prev. Cell | Free Prev. Cell | ④ Free Cell | ④ Free Cell |

#### Head-drop pipeline

- ① Read a PD from PD memory
- ② Dequeue the PD
- ③ Read cell pointer from cell pointer memory
- ④ Free cell (by moving the cell pointer to the free cell ptr list)
- ⑤ ~~Read cell data~~ (not needed when the packet is dropped)

| Memory | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 |
|---|---|---|---|---|
| PD Memory | ① Read PD | ② Dequeue PD | Read Next PD | Dequeue Next PD |
| Cell Pointer Memory | Read Prev. Cell Ptr | ③ Read Cell Ptr | ③ Read Cell Ptr | Read Next Cell Ptr |
| Cell Pointer Memory | Free Prev. Cell | Free Prev. Cell | ④ Free Cell | ④ Free Cell |
| Cell Data Memory | Read Prev. Cell Data | Read Prev. Cell Data | ⑤ Read Cell Data | ⑤ Read Cell Data |

#### Synthesized pipeline

- ① Read a PD from PD memory
- ② Dequeue the PD
- ③ Read cell pointer from cell pointer memory
- ④ Free cell (by moving the cell pointer to the free cell ptr list)
- ⑤ Read cell data if TX
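The shared steps can be modeled in software as one routine with a single TX-only branch. The data layout here is an assumption made for illustration: a PD (packet descriptor) is a dict holding a hypothetical `first_cell` field, and a packet's cells form a linked list in `next_cell`.

```python
from collections import deque


def dequeue_packet(pd_queue: deque, next_cell: dict, free_list: list,
                   cell_data: dict, is_tx: bool):
    """Run the shared dequeue steps for one packet; read cell data only for TX."""
    pd = pd_queue[0]                 # 1. read a PD from PD memory
    pd_queue.popleft()               # 2. dequeue the PD
    payload = []
    cell = pd["first_cell"]
    while cell is not None:
        following = next_cell[cell]  # 3. read cell pointer from cell pointer memory
        free_list.append(cell)       # 4. free the cell (pointer goes to the free list)
        if is_tx:
            payload.append(cell_data[cell])  # 5. read cell data, TX only
        cell = following
    return payload if is_tx else None


pds = deque([{"first_cell": 0}, {"first_cell": 2}])
links = {0: 1, 1: None, 2: None}
data = {0: b"he", 1: b"llo", 2: b"expelled"}
free_cells = []
sent = dequeue_packet(pds, links, free_cells, data, is_tx=True)      # transmit
dropped = dequeue_packet(pds, links, free_cells, data, is_tx=False)  # head drop
```

Both calls free their cells in the same way. Only the TX call touches `cell_data`, which mirrors the missing cell data memory row in the head-drop table.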

### Implementations



- Verilog implementation of core components
- P4-based hardware prototype
- DPDK-based software prototype
- ns-3-based simulator

https://github.com/ants-xjtu/Occamy

#### Evaluations

| Module | FPGA LUTs | FPGA Flip-Flops | ASIC Timing (ns) | ASIC Area (mm<sup>2</sup>) | ASIC Power (mW) |
|---|---|---|---|---|---|
| Selector | 1262 | 47 | 1.49 | 0.023 | 0.895 |
| Arbiter | 3 | 0 | 0.17 | 2.3e-5 | 0.003 |
| Executor | 47 | 7 | 0.38 | 7.3e-4 | 0.044 |

- FPGA cost by Vivado
  - <1300 LUTs and 60 flip-flops
- ASIC cost by Design Compiler
  - 1.5ns timing
  - 0.03mm<sup>2</sup> area cost and 1mW power

#### Evaluations --- P4-based HW Prototype



Occamy can absorb 57% more bursty traffic than DT ($\alpha = 4$)

#### Evaluations --- DPDK-based SW Prototype



Occamy can reduce the average query completion time by up to ~55%



Occamy achieves similar performance to Pushout when facing buffer choking

#### Evaluations --- ns-3 simulations



Occamy significantly improves the query completion time under various traffic patterns

#### Conclusion

□ This paper answers 3 questions:

- What are the fundamental requirements of BMs with insufficient buffer and intense traffic bursts?
  - Answer: BM should be <u>highly agile</u>
- What are the intrinsic limitations of current BMs in meeting these requirements in DCNs?
  - Answer: It is the *non-preemptive nature* that confines the agility of current BMs
- Is it possible to break through these limitations with the recent advances in buffer architecture?
  - Answer: Yes. We design <u>Occamy</u>, a simple yet effective preemptive BM

## Thank you!

https://github.com/ants-xjtu/Occamy