Hardware
- Mellanox ConnectX-3 MT27500
- NVIDIA Tesla K40c
Problems
Mellanox OFED does not compile on ARM
Solution: use an Intel Atom (x86) host instead
Problems
NVIDIA driver sometimes crashes during boot (dmesg => “Oops”)
Solution: reboot until it comes up cleanly (the NVIDIA driver is closed source, so the crash itself cannot be fixed)
Problems
Mellanox driver “mlx4_core” is not bound to the Mellanox card (check with lspci -k | grep -A 2 Mellanox)
Solution: remove the device from the PCI bus and rescan so the driver re-binds
echo "1" > /sys/bus/pci/devices/0000\:03\:00.0/remove
echo "1" > /sys/bus/pci/rescan
Problems
GPUdirect transmission does not mark caches as dirty, so the data you read may be stale.
Solution:
Always copy the data back with cudaMemcpyDeviceToHost before inspecting it (see the sketch below).
Check status via script
#!/bin/bash
# Sanity checks after boot.

# Did the NVIDIA driver crash during boot?
OOPS=`dmesg | grep "Oops"`
if [ -n "$OOPS" ]; then
    echo "ERROR: nvidia driver crashed during boot."
fi

# Is mlx4_core bound to the InfiniBand card?
MLX=`lspci -k | grep "mlx4_core"`
if [ -z "$MLX" ]; then
    echo "ERROR: mlx4_core is not assigned to ib card."
fi

# Is the IP address configured?
IFCONFIG=`ifconfig | grep 192.168.11.72`
if [ -z "$IFCONFIG" ]; then
    echo "ERROR: IP is not set."
fi

# Is the InfiniBand hostname visible on the fabric?
IBHOSTS=`ibhosts | grep gpudirect`
if [ -z "$IBHOSTS" ]; then
    echo "ERROR: Infiniband hostname not set."
fi
NVIDIA interface
The closed-source NVIDIA driver exposes an API to pin GPU memory for access by third-party devices; a usage sketch follows the declaration.
// From nv-p2p.h
/*
 * @brief
 *   Make the pages underlying a range of GPU virtual memory
 *   accessible to a third-party device.
 *
 * ...
 */
int nvidia_p2p_get_pages(uint64_t p2p_token, uint32_t va_space,
                         uint64_t virtual_address,
                         uint64_t length,
                         struct nvidia_p2p_page_table **page_table,
                         void (*free_callback)(void *data),
                         void *data);
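A minimal sketch of how a kernel module might call this API; pin_gpu_range() and free_cb() are illustrative names, the 64 KiB length is arbitrary, and the zero token/va_space values reflect the non-tokenized usage assumed here:

// Sketch only: pin one 64 KiB range of GPU virtual memory for DMA.
#include "nv-p2p.h"

static void free_cb(void *data)
{
    /* Called by the NVIDIA driver if the GPU memory goes away
     * (e.g. cudaFree); drop all references to the pages here. */
}

static int pin_gpu_range(uint64_t gpu_va)
{
    struct nvidia_p2p_page_table *page_table = NULL;
    int ret;

    /* Assumption: 0 for p2p_token and va_space (non-tokenized use). */
    ret = nvidia_p2p_get_pages(0, 0, gpu_va, 64 * 1024,
                               &page_table, free_cb, NULL);
    if (ret)
        return ret;

    /* page_table now lists the pinned physical GPU pages; they can be
     * handed to a third-party device (here: the HCA) for DMA.
     * Release them later with nvidia_p2p_put_pages(). */
    return 0;
}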
Mellanox kernel module
The graphics card is registered as a peer memory client with the InfiniBand driver.
// From nv_peer_mem.c
static int __init nv_mem_client_init(void)
{
    strcpy(nv_mem_client.name, DRV_NAME);
    strcpy(nv_mem_client.version, DRV_VERSION);
    reg_handle = ib_register_peer_memory_client(&nv_mem_client,
                                                &mem_invalidate_callback);
    if (!reg_handle)
        return -EINVAL;

    return 0;
}
libibverbs InfiniBand library
Memory is registered from the userland program. If addr has been allocated with cudaMalloc(), libibverbs knows to use the previously registered peer memory client (see the sketch after the declaration).
// From verbs.h
/**
* ibv_reg_mr - Register a memory region
*/
struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
                          size_t length, int access);
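A minimal sketch of this from userspace; register_gpu_buffer() is an illustrative helper, the access flags are just an example, and a protection domain pd is assumed to exist already:

// Sketch: register a cudaMalloc'd buffer as an RDMA memory region.
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t size)
{
    void *gpu_buf = NULL;

    if (cudaMalloc(&gpu_buf, size) != cudaSuccess)
        return NULL;

    /* With nv_peer_mem loaded, ibv_reg_mr() recognizes that this
     * virtual address belongs to the GPU and pins it through the
     * registered peer memory client instead of host page pinning. */
    return ibv_reg_mr(pd, gpu_buf, size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}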
kiro rdma extended for GPUdirect
If the nv_peer_mem kernel module has registered the graphics card as a "peer memory client", it is sufficient to give ibv_reg_mr() a virtual address that maps to a physical address on the graphics card.
// From kiro-rdma.h
#ifdef GPUDIRECT
    void *mem_handle = mem;
    int error;

    if (!mem_handle) {
        error = cudaMalloc (&mem_handle, mem_size);
        ...
    }
    *mr = ibv_reg_mr (pd, mem_handle, mem_size, access);
GPUdirect Benchmark
Default case: receive data over InfiniBand and store it in main memory. Copy it to graphics card memory, run a simple CUDA math algorithm, and copy the data back into main memory for inspection.
Average throughput: 1.75 Gbit/s.
GPUdirect case: receive data over InfiniBand and store it directly in graphics card memory. Run a simple CUDA math algorithm, then copy the data into main memory for inspection.
Average throughput: 3.70 Gbit/s.
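A rough sketch of the two measured paths; recv_into() and run_kernel() are hypothetical stand-ins for the RDMA receive and the CUDA math step:

// Sketch: default path vs. GPUdirect path of the benchmark.
#include <cuda_runtime.h>
#include <stddef.h>

void recv_into(void *buf, size_t size);      /* RDMA receive (assumed)   */
void run_kernel(void *gpu_buf, size_t size); /* CUDA math step (assumed) */

void default_path(void *host_buf, void *gpu_buf, size_t size)
{
    recv_into(host_buf, size);               /* NIC -> main memory       */
    cudaMemcpy(gpu_buf, host_buf, size, cudaMemcpyHostToDevice);
    run_kernel(gpu_buf, size);
    cudaMemcpy(host_buf, gpu_buf, size, cudaMemcpyDeviceToHost);
}

void gpudirect_path(void *host_buf, void *gpu_buf, size_t size)
{
    recv_into(gpu_buf, size);                /* NIC -> GPU memory        */
    run_kernel(gpu_buf, size);
    cudaMemcpy(host_buf, gpu_buf, size, cudaMemcpyDeviceToHost);
}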
GPUdirect Benchmark
The Mellanox InfiniBand PCI card in the test system runs at 5 GT/s with 4 lanes. Since throughput is inversely proportional to transfer time, the speedup in this setup is:
$$S_p = \frac{T_1}{T_p} = \frac{1/1.75}{1/3.70} = \frac{3.70}{1.75} \approx 2.11$$
Using GPUdirect in this typical workflow makes the transfer about 111% faster.