CuVec: Unifying Python/C++/CUDA memory

xkcd#138

Python buffered array ↔ C++11 std::vector ↔ CUDA managed memory


3 May 2023 Casper da Costa-Luis

  • originally thought of this nearly a decade ago while writing CUDA/C++/Python full-time at Sony EU HQ on realtime machine vision (before PhD)
  • sick of reimplementing boilerplate
  • amazingly nobody had solved the problem, so I snapped and implemented a solution ~2 years ago
  • docs site has full tutorials/examples
  • but presentation covers main points

Motivation¶

  • One problem, one language
  • Two problems, two languages
    • interfaces
  • start with issue
  • programming languages tend to target one problem
    • Python for quick prototyping, C++ for quick runtime, CUDA likewise but for highly parallelisable problems
  • issue: what if you have multiple problems?
    • write in 2 languages
    • pass data between them
    • data interfaces are painful

Prior Art¶

  • Lots of boilerplate (CPython API)
  • Pre-TensorFlow/PyTorch (no high-level frameworks)
  • Pre-SSDs (HDD I/O)
  • ^
  • ^
  • a few lines of code in one language to save an array to disk, a few lines in a different language to load it back
  • "easy" to implement, but slow to run & error-prone (dtype conversion, big/little endian, C vs FORTRAN array ordering)

New Hope: Language Features¶

  • C++11 (ISO/IEC#14882:2011 alias templates, container/std::vector allocators)
  • CUDA 6 (unified memory)
  • Python 3 (PEP#3118 array buffer protocol), 3.6 (float16)
  • Python __cuda_array_interface__ (Numba, CuPy, PyTorch, PyArrow, ArrayViews, JAX, PyCUDA, DALI, RAPIDS, ...)
  • 2011-12: milestone C++ release with many features (lambdas); relevant here: alias templates to reduce boilerplate, custom allocators for containers (e.g. std::vector)
  • 2014: single address space with implicit auto sync between CPU RAM & GPU device memory
  • 2006-16: major support (eg numpy.frombuffer)
  • 2018: inspired by NumPy's __array_interface__ dunder, Numba led the Python __cuda_array_interface__ dunder
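A minimal stdlib-only illustration of the PEP 3118 buffer protocol mentioned above (nothing CuVec-specific): memoryview gives typed, zero-copy access to any exporting object, which is what lets consumers like numpy.frombuffer share storage instead of copying it:

```python
import array

# array.array exports its storage via the PEP 3118 buffer protocol
a = array.array('f', [1.0, 2.0, 3.0])
m = memoryview(a)  # zero-copy, typed view of the same memory
assert m.format == 'f' and m.itemsize == 4 and m.nbytes == 12

# writing through the view mutates the underlying storage: no copy exists
m[0] = 9.0
assert a[0] == 9.0
```

The __cuda_array_interface__ dunder plays the analogous role for device memory: a dict of shape/typestr/data that consumers read instead of copying.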

More Hope: Build Tooling¶

  • Python 3.6 build-system dependencies (PEP#517, PEP#518)
  • scikit-build (CMake-driven build-system generator for CPython extensions)
  • PIP availability: CMake generator & Ninja build-system
  • (optional) SWIG
  • TL;DR
    • dev: pyproject.toml + setup.py + CMakeLists.txt
    • user: C++/CUDA compiler (any OS, no IDE) + python3.6 -m pip install cuvec
  • 2015-16: pyproject.toml compilation dependencies downloaded & run in an isolated environment during pip installation
  • skbuild library helps Python call CMake to build CPython exts
  • cmake & ninja (alt to make) binaries pip installable
  • swig reduces interface boilerplate (around since 2003; on pip since 2016)
  • so all build deps available on pip
  • user "just adds compiler"; pip automagically does the rest (in theory)

Solution: Building CuVec¶

  • 1 decade of ~~frustration~~ careful thought
  • 2 days of prototyping + 5 days of testing
git_fame_plot("./cuvec", bytype=True)
  • mostly docs + examples + tests
  • meat ~1k loc divided evenly between template headers & source
  • one of the smallest libs I've made, yet ridiculously powerful

Docs

Summary: Winning¶

  • Less boilerplate code (fewer bugs, easier debugging, and faster prototyping)
  • Fewer memory copies (faster execution)
  • Lower memory usage (do more with less hardware)

What Next¶

  • Integration tutorials (Numba, CuPy, PyTorch, PyArrow, ArrayViews, JAX, PyCUDA, DALI, RAPIDS, ...)
  • ... any feature requests?

Casper da Costa-Luis @casperdcl sponsor