GPUdirect

On Github Kaikas / GPUdirect-presentation

GPUdirect by NVIDIA

Hardware

  • Mellanox ConnectX-3 MT27500
  • NVIDIA Tesla K40c

Software

Problems

Mellanox OFED does not compile on ARM

Solution: switch to an Intel Atom based host.

Problems

The NVIDIA driver sometimes crashes during boot (dmesg shows an “Oops”).

Solution: reboot until it works (the NVIDIA driver is closed source, so the crash cannot be debugged on our side).

Problems

The Mellanox driver “mlx4_core” is not bound to the Mellanox card (check with lspci -k | grep -A 2 Mellanox).

Solution: remove the device from the PCI bus and rescan, which re-triggers driver binding:

echo "1" > /sys/bus/pci/devices/0000\:03\:00.0/remove
echo "1" > /sys/bus/pci/rescan

Problems

A GPUdirect transfer does not mark caches as dirty, so data read afterwards may be stale.

Solution: always copy data back with cudaMemcpyDeviceToHost before inspecting it.
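
A minimal sketch of that check, assuming a device buffer d_buf of mem_size bytes that was filled via a GPUdirect transfer (the names are illustrative, not from the presentation):

// Sketch: force a fresh device-to-host copy before inspecting the data.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static void inspect (void *d_buf, size_t mem_size)
{
  unsigned char *h_buf = malloc (mem_size);
  if (!h_buf)
    return;

  // Copy out of device memory instead of trusting possibly stale host data.
  if (cudaMemcpy (h_buf, d_buf, mem_size, cudaMemcpyDeviceToHost) == cudaSuccess)
    printf ("first byte: 0x%02x\n", h_buf[0]);

  free (h_buf);
}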

Check status via script

#!/bin/bash
OOPS=`dmesg | grep "Oops"`
if [ -n "$OOPS" ]; then
  echo "ERROR: nvidia driver crashed during boot."
fi
MLX=`lspci -k | grep "mlx4_core"`
if [ -z "$MLX" ]; then
  echo "ERROR: mlx4_core is not assigned to ib card."
fi
IFCONFIG=`ifconfig | grep 192.168.11.72`
if [ -z "$IFCONFIG" ]; then
  echo "ERROR: IP is not set."
fi
IBHOSTS=`ibhosts | grep gpudirect`
if [ -z "$IBHOSTS" ]; then
  echo "ERROR: Infiniband hostname not set."
fi

NVIDIA interface

The NVIDIA closed-source driver exposes an API to pin GPU memory and make it accessible to third-party devices.
// From nv-p2p.h
/*
 * @brief
 *   Make the pages underlying a range of GPU virtual memory
 *   accessible to a third-party device.
 *
...
 */
int nvidia_p2p_get_pages(uint64_t p2p_token, uint32_t va_space,
        uint64_t virtual_address,
        uint64_t length,
        struct nvidia_p2p_page_table **page_table,
        void (*free_callback)(void *data),
        void *data);
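
As a rough sketch of how a kernel module might use this API (not taken from the presentation; gpu_vaddr, len and the zero tokens are assumptions):

// Sketch only: pin a range of GPU virtual memory from a kernel module.
// Passing 0 for p2p_token and va_space assumes a setup where the token
// mechanism is not used; the free callback is invoked by the NVIDIA driver
// when it tears the mapping down (e.g. on cudaFree).
#include "nv-p2p.h"

static struct nvidia_p2p_page_table *page_table;

static void gpu_mem_free_callback (void *data)
{
  nvidia_p2p_free_page_table (page_table);
  page_table = NULL;
}

static int pin_gpu_buffer (uint64_t gpu_vaddr, uint64_t len)
{
  return nvidia_p2p_get_pages (0, 0, gpu_vaddr, len,
                               &page_table, gpu_mem_free_callback, NULL);
}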

Mellanox kernel module

The graphics card is registered as a peer memory client with the InfiniBand driver.
// From nv_peer_mem.c
static int __init nv_mem_client_init(void)
{
  strcpy(nv_mem_client.name, DRV_NAME);
  strcpy(nv_mem_client.version, DRV_VERSION);
  reg_handle = ib_register_peer_memory_client(&nv_mem_client,
               &mem_invalidate_callback);
  if (!reg_handle)
    return -EINVAL;

  return 0;
}

libibverbs InfiniBand library

Memory is registered from a userland program. If addr was allocated with cudaMalloc(), libibverbs knows to use the previously registered peer memory client.
// From verbs.h
/**
 * ibv_reg_mr - Register a memory region
 */
struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
        size_t length, int access);
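
A minimal userland sketch of such a registration (the protection domain pd is assumed to come from ibv_alloc_pd(); error handling is reduced to returning NULL):

// Sketch: register a cudaMalloc'd buffer as an RDMA memory region.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stddef.h>

struct ibv_mr *register_gpu_buffer (struct ibv_pd *pd, size_t length)
{
  void *gpu_buf = NULL;

  if (cudaMalloc (&gpu_buf, length) != cudaSuccess)
    return NULL;

  // With nv_peer_mem loaded, the InfiniBand stack resolves this GPU virtual
  // address through the registered peer memory client.
  return ibv_reg_mr (pd, gpu_buf, length,
                     IBV_ACCESS_LOCAL_WRITE |
                     IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE);
}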

kiro RDMA extended for GPUdirect

If the nv_peer_mem kernel module has registered the graphics card as a "peer memory client", it is sufficient to pass ibv_reg_mr a virtual address that maps to a physical address on the graphics card.
// From kiro-rdma.h
#ifdef GPUDIRECT
void *mem_handle = mem;
int error;

if (!mem_handle) {
  error = cudaMalloc (&mem_handle, mem_size);
  ...
  *mr = ibv_reg_mr (pd, mem_handle, mem_size, access);

GPUdirect Benchmark

Default case: receive data over InfiniBand and store it in main memory, copy it to graphics card memory, run a simple CUDA math algorithm, and copy the data back into main memory for inspection. Average throughput: 1.75 Gbit/s.

GPUdirect case: receive data over InfiniBand and store it directly in graphics card memory, run the same CUDA math algorithm, then copy the data into main memory for inspection. Average throughput: 3.70 Gbit/s.

GPUdirect Benchmark

The Mellanox InfiniBand PCI card in the test system runs at 5 GT/s with 4 lanes. For a fixed amount of data the transfer time is inversely proportional to throughput, so the speedup in this setup is $$S_p = \frac{T_{default}}{T_{GPUdirect}} = \frac{B_{GPUdirect}}{B_{default}} = \frac{3.70}{1.75} \approx 2.11$$ Using GPUdirect in a typical workflow therefore makes us about 111% faster.