CuVec: Unifying Python/C++/CUDA memory

xkcd#138

Python buffered array ↔ C++11 std::vector ↔ CUDA managed memory


3 May 2023 Casper da Costa-Luis

  • originally thought of this nearly a decade ago while writing CUDA/C++/Python full-time at Sony EU HQ on realtime machine vision (before PhD)
  • sick of reimplementing boilerplate
  • amazingly nobody had solved the problem, so I snapped and implemented a solution ~2 years ago
  • docs site has full tutorials/examples
  • but presentation covers main points

Motivation¶

  • One problem, one language
  • Two problems, two languages
    • interfaces
  • start with issue
  • programming languages tend to target one problem
    • Python for quick prototyping, C++ for quick runtime, CUDA likewise but for highly parallelisable problems
  • issue: what if you have multiple problems?
    • write in 2 languages
    • pass data between them
    • data interfaces are painful

Prior Art¶

  • Lots of boilerplate (CPython API)
  • Pre-TensorFlow/PyTorch (no high-level frameworks)
  • Pre-SSDs (HDD I/O)
  • ^
  • ^
  • a few lines of code in one language to save an array to disk, a few lines in a different language to load it back
  • "easy" to implement, but slow to run & error-prone (dtype conversion, big/little endian, C vs FORTRAN array ordering)

New Hope: Language Features¶

  • C++11 (ISO/IEC#14882:2011 alias templates, container/std::vector allocators)
  • CUDA 6 (unified memory)
  • Python 3 (PEP#3118 array buffer protocol), 3.6 (float16)
  • Python __cuda_array_interface__ (Numba, CuPy, PyTorch, PyArrow, ArrayViews, JAX, PyCUDA, DALI, RAPIDS, ...)
  • 2011-12: milestone C++ release with many features (lambdas); relevant here: alias templates to reduce boilerplate, custom allocators for containers (e.g. std::vector)
  • 2014: single address space with implicit auto sync between CPU RAM & GPU device memory
  • 2006-16: major support (eg numpy.frombuffer)
  • 2018: inspired by NumPy's __array_interface__ dunder, Numba led the Python __cuda_array_interface__ dunder
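A minimal stdlib-only illustration of the PEP 3118 buffer protocol mentioned above (nothing CuVec-specific): memoryview gives typed, zero-copy access to any exporting object, which is what lets consumers like numpy.frombuffer share storage instead of copying it:

```python
import array

# array.array exports its storage via the PEP 3118 buffer protocol
a = array.array('f', [1.0, 2.0, 3.0])
m = memoryview(a)  # zero-copy, typed view of the same memory
assert m.format == 'f' and m.itemsize == 4 and m.nbytes == 12

# writing through the view mutates the underlying storage: no copy exists
m[0] = 9.0
assert a[0] == 9.0
```

The __cuda_array_interface__ dunder plays the analogous role for device memory: a dict of shape/typestr/data that consumers read instead of copying.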

More Hope: Build Tooling¶

  • Python 3.6 build-system dependencies (PEP#517, PEP#518)
  • scikit-build (CMake-driven build-system generator for CPython extensions)
  • PIP availability: CMake generator & Ninja build-system
  • (optional) SWIG
  • TL;DR
    • dev: pyproject.toml + setup.py + CMakeLists.txt
    • user: C++/CUDA compiler (any OS, no IDE) + python3.6 -m pip install cuvec
  • 2015-16: pyproject.toml compilation dependencies downloaded & run in an isolated environment during pip installation
  • skbuild library helps Python call CMake to build CPython exts
  • cmake & ninja (alt to make) binaries pip installable
  • swig reduces interface boilerplate (around since 2003; on pip since 2016)
  • so all build deps available on pip
  • user "just adds compiler"; pip automagically does the rest (in theory)

Solution: Building CuVec¶

  • 1 decade of ~~frustration~~ careful thought
  • 2 days of prototyping + 5 days of testing
git_fame_plot("./cuvec", bytype=True)
  • mostly docs + examples + tests
  • meat ~1k loc divided evenly between template headers & source
  • one of the smallest libs I've made, yet ridiculously powerful

Docs

Summary: Winning¶

  • Less boilerplate code (fewer bugs, easier debugging, and faster prototyping)
  • Fewer memory copies (faster execution)
  • Lower memory usage (do more with less hardware)

What Next¶

  • Integration tutorials (Numba, CuPy, PyTorch, PyArrow, ArrayViews, JAX, PyCUDA, DALI, RAPIDS, ...)
  • ... any feature requests?

Casper da Costa-Luis @casperdcl sponsor