Enoki: structured vectorization and differentiation on modern processors

Abstract

Enoki is a C++ template library that enables automatic transformations of numerical code, for instance to create a “wide” vectorized variant of an algorithm that runs on the CPU or GPU, or to compute gradients via transparent forward/reverse-mode automatic differentiation.

The core parts of the library are implemented as a set of header files with no dependencies other than a sufficiently C++17-capable compiler (GCC >= 8.2, Clang >= 7.0, Visual Studio >= 2017). Enoki code reduces to efficient SIMD instructions available on modern CPUs and GPUs. In particular, Enoki supports:

  • Intel: AVX512, AVX2, AVX, and SSE4.2,
  • ARM: NEON/VFPV4 on armv7-a, Advanced SIMD on 64-bit armv8-a,
  • NVIDIA: CUDA via a Parallel Thread Execution (PTX) just-in-time compiler,
  • Fallback: a scalar fallback mode that ensures programs still run even if none of the above are available.

Deploying a program on top of Enoki usually serves three goals:

  1. Enoki ships with a convenient library of special functions and data structures that facilitate the implementation of numerical code (vectors, matrices, complex numbers, quaternions, etc.).

  2. Programs built using these can be instantiated as wide versions that process many arguments at once (either on the CPU or the GPU). Enoki is also structured in the sense that it handles complex programs with custom data structures, lambda functions, loadable modules, virtual method calls, and many other modern C++ features.

  3. If derivatives are desired (e.g. for stochastic gradient descent), Enoki performs transparent forward- or reverse-mode automatic differentiation of the entire program.

Finally, Enoki can do all of the above simultaneously: if desired, it can compile the same source code to multiple different implementations (e.g. scalar, AVX512, and CUDA+autodiff).