Data Center Paper Reading

Some terminologies

CSE 599

Reading List

A Case for NOW (Networks of Workstations)

  • Background
    • Commercial Motivations

      Smaller computers offer better cost-performance than larger computers.

    • Lessons from Massively Parallel Processors (MPPs)
      • Engineering lag
      • Software incompatibility
    • Motivation for NOW:
      • Emergence of faster networks.
      • Emergence of more powerful workstations.
      • I/O bottleneck: I/O devices have grown in capacity rather than in performance, so NOW proposes using a huge pool of memory instead.
  • Opportunities
    • Memory
      • Network RAM, connected with high bandwidth and low latency networks.
      • Cooperative cache
      • Perform RAID at software level instead of hardware level.
    • Parallel Computing
    • Workloads of a building-wide system
  • Overview of the NOW project
    • Low overhead communication
    • GLUnix

      A global operating system of inter-connected workstations.

    • xFS (Serverless Network File Service)

      File System as a service.
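
The "RAID at software level" opportunity listed under Memory can be illustrated with XOR parity. This is a minimal sketch, not the NOW project's implementation: the block contents, the three-workstation layout, and the helper names `xor_blocks` and `reconstruct` are all hypothetical. It assumes one data block per workstation plus one parity block, and shows that any single lost block can be rebuilt from the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (the parity computation)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_blocks(surviving_blocks + [parity])

# Hypothetical stripe: one block stored on each of three workstations.
data = [b"node", b"work", b"disk"]
parity = xor_blocks(data)

# Pretend the workstation holding block 1 has failed.
lost = 1
survivors = data[:lost] + data[lost + 1:]
print(reconstruct(survivors, parity) == data[lost])
```

Because the parity is computed by ordinary code on ordinary machines, no dedicated RAID controller is needed, which is the point the bullet makes.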

DONE SiP-ML: high-bandwidth optical network interconnects for machine learning training

  • Goal

    Reduce the time-to-accuracy metric.

    Increasing data-processing throughput (weak scaling) does not necessarily reduce time-to-accuracy.

    We want strong scaling: reduce the computation time per worker. This requires more extensive data exchange between workers and therefore much higher bandwidth.

    The demand for communication bandwidth grows super-linearly.

  • Optical Networks

    Silicon Photonics (SiP) offers order-of-magnitude higher bandwidth.

    Supports dedicated bandwidth as long as there is a path between the source and destination nodes.

    • Limitation

      Each node has a limited number of circuits (a limited degree, if the network is viewed as a graph) unless we reconfigure.

      Reconfiguration has latency.
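
The degree limitation can be made concrete by treating circuits as graph edges. The sketch below is illustrative only: the 4-GPU job, the limit of 2 circuits per node, and the function names are assumptions, not values from the paper. It counts circuits per node and reports which nodes would need more circuits than they have.

```python
from collections import defaultdict

def circuit_degrees(circuits):
    """Count how many circuits terminate at each node (its degree in the circuit graph)."""
    degree = defaultdict(int)
    for src, dst in circuits:
        degree[src] += 1
        degree[dst] += 1
    return dict(degree)

def violates_degree_limit(circuits, max_degree):
    """Return the nodes whose circuit count exceeds the per-node limit."""
    return sorted(n for n, d in circuit_degrees(circuits).items() if d > max_degree)

# Hypothetical 4-GPU job with two candidate traffic patterns.
all_to_all = [(a, b) for a in range(4) for b in range(a + 1, 4)]
ring = [(i, (i + 1) % 4) for i in range(4)]

print(violates_degree_limit(all_to_all, max_degree=2))  # all-to-all needs degree 3 per node
print(violates_degree_limit(ring, max_degree=2))        # a ring needs only degree 2
```

A pattern that exceeds the limit either has to be served over multiple hops or requires reconfiguring circuits over time, which is where the reconfiguration latency matters.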

  • Optical Solution for ML?

    optical-interconnects.png

    Figure 1: Spectrum of Optical Interconnects

    Control over the traffic patterns by choosing the parallelization strategy and device placement.

    The two ends of the spectrum are SiP-OCS and SiP-Ring.

    • SiP-OCS

      A topology in which each optical circuit switch (OCS) is connected to all GPUs.

      sip-ocs.png

      • :( Long reconfiguration latency (30 ms)
      • Single-shot: rather than reconfiguring periodically, set up the circuits once and keep using them.
      • :) Support any permutation of inputs/outputs.
    • SiP-Ring

      GPUs are located on an optical fiber ring and use micro-ring resonators to add/drop wavelengths to/from the ring.

      • :) Fast reconfiguration (20 us)
      • Wavelength reuse
      • :( Limited degree
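
Wavelength reuse on a ring can be sketched as follows: two circuits may share a wavelength as long as they do not share a fiber segment. This is a simplified model, not the paper's allocation algorithm. It assumes all circuits travel clockwise, uses a plain first-fit assignment, and the 6-node ring and circuit list are made up for illustration.

```python
def segments(src, dst, n):
    """Fiber segments used by a clockwise circuit src -> dst on an n-node ring.

    Segment i is the link from node i to node (i + 1) % n.
    """
    segs = []
    node = src
    while node != dst:
        segs.append(node)
        node = (node + 1) % n
    return segs

def assign_wavelengths(circuits, n):
    """First-fit: give each circuit the lowest wavelength unused on all of its segments."""
    used = [set() for _ in range(n)]  # wavelengths already in use on each segment
    assignment = {}
    for src, dst in circuits:
        segs = segments(src, dst, n)
        w = 0
        while any(w in used[s] for s in segs):
            w += 1
        for s in segs:
            used[s].add(w)
        assignment[(src, dst)] = w
    return assignment

# Hypothetical circuits on a 6-node ring.
circuits = [(0, 2), (2, 4), (1, 3), (4, 0)]
print(assign_wavelengths(circuits, n=6))
```

In this example three of the four circuits end up on wavelength 0 because their segments are disjoint; only the circuit that overlaps another one needs a second wavelength.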

Homa: a receiver-driven low-latency transport protocol using network priorities

PowerTCP: Pushing the Performance Limits of Datacenter Networks

Paper: https://www.usenix.org/system/files/nsdi22-paper-addanki_3.pdf

Slides: https://www.usenix.org/system/files/nsdi22_slides_addanki.pdf

Key idea: power = voltage x current

Viewed in networks:

  1. Voltage: BDP (bandwidth-delay product) + queue length
  2. Current: transmission rate (bits/s)
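
As a worked example of the analogy, the sketch below multiplies the two quantities exactly as the notes define them. The link speed, base RTT, and queue length are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def voltage_bits(bandwidth_bps, base_rtt_s, queue_bits):
    """Voltage: bandwidth-delay product plus the current queue length, in bits."""
    return bandwidth_bps * base_rtt_s + queue_bits

def power(bandwidth_bps, base_rtt_s, queue_bits, tx_rate_bps):
    """Power = voltage x current, in bit^2/s."""
    return voltage_bits(bandwidth_bps, base_rtt_s, queue_bits) * tx_rate_bps

# Hypothetical link: 100 Gbit/s, 20 microsecond base RTT.
BANDWIDTH = 100e9
BASE_RTT = 20e-6

empty_queue = power(BANDWIDTH, BASE_RTT, queue_bits=0, tx_rate_bps=BANDWIDTH)
backlogged = power(BANDWIDTH, BASE_RTT, queue_bits=1e6, tx_rate_bps=BANDWIDTH)

# With the same transmission rate, a standing queue raises the voltage term
# and therefore the power.
print(backlogged / empty_queue)
```

The ratio is about 1.5 here because the assumed 1 Mbit queue is half of the 2 Mbit BDP. Power therefore reflects queue buildup and sending rate in a single number.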

TODO HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM

  • DRAM + NVM tiered memory

    Properties of NVM (relative to DRAM)

    • 8x capacity
    • 2x latency
    • asymmetric read/write bandwidth
    • high overhead for small accesses
  • Hardware solution

    Example: Intel Optane

    • :) No need for OS support
    • :) Low overhead (why?)
    • :( Not visible to apps
    • :( Naive memory management
  • Software tiered memory

    Examples: HeteroOS, Nimble Page Management

    • :) insights into apps
    • :) complex memory management
    • :( evaluated on emulated environment
      • do not scale to NVM capacity
      • do not support asymmetric read/write
      • limited flexibility (why?)
  • Scalable Software solution to Tiered Memory

    HeMem leverages asynchrony in the following components.

    • Memory access sampling

      Scanning page table entries is not scalable:

      hemem-scan-page.png

      • PEBS (processor event-based sampling)

        Samples records of processor events.

        A 0.02% sample rate provides sufficient fidelity.

    • Async hot/cold classification

      HeMem launches an asynchronous thread that collects PEBS information and maintains hot/cold queues.

      async-hot-classification.png

    • Async memory migration

      Migration is done via DMA and performed in batches.

      async-mem-migration.png
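
The three steps above (sampling, hot/cold classification, batched migration) can be sketched in a few lines. This is a simplified, synchronous illustration and not HeMem's implementation: the sample list stands in for PEBS records, and `HOT_THRESHOLD`, `BATCH_SIZE`, and the function names are hypothetical. In the real system this work runs in background threads and the copies are done by DMA.

```python
from collections import Counter, deque

HOT_THRESHOLD = 2  # hypothetical: samples needed before a page counts as hot
BATCH_SIZE = 2     # hypothetical: pages migrated per batch

def classify(samples, hot_threshold=HOT_THRESHOLD):
    """Split sampled page addresses into hot and cold queues by sample count."""
    counts = Counter(samples)
    hot = deque(p for p, c in counts.items() if c >= hot_threshold)
    cold = deque(p for p, c in counts.items() if c < hot_threshold)
    return hot, cold

def migration_batches(hot, in_dram, batch_size=BATCH_SIZE):
    """Group hot pages that are not yet in DRAM into fixed-size promotion batches."""
    pending = [p for p in hot if p not in in_dram]
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]

# Stand-in for sampled memory accesses (page addresses).
samples = [0x1000, 0x2000, 0x1000, 0x3000, 0x2000, 0x4000, 0x5000, 0x5000]
hot, cold = classify(samples)
batches = migration_batches(hot, in_dram={0x1000})

print([hex(p) for p in hot])
print([hex(p) for p in cold])
print([[hex(p) for p in batch] for batch in batches])
```

Classifying from samples rather than from a page-table scan keeps the cost proportional to the number of samples rather than to total memory, which is the scalability argument made above.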

TODO NanoPU: A Nanosecond Network Stack for Data Centers

Project

TODO Coordinated and Efficient Huge Page Management with Ingens

Other Papers

TODO LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation

Author: expye(Zihao Ye)

Email: expye@outlook.com

Date: 2022-03-28 Mon 00:00

Last modified: 2022-09-23 Fri 00:04

Licensed under CC BY-NC 4.0