OpenMP is one of the easiest ways to make existing code run across CPU cores. In the simplest cases you simply add a single #pragma to C code and it goes N times faster; that's when you're running a function in a loop with no side effects. Some examples I've done (a rough sketch follows the list):
1) ray tracing. Looping over all the pixels in an image using ray tracing to determine the color of each pixel. The algorithm and data structures are complex but don't change during the rendering. N cores is about N times as fast.
2) in Solvespace we had a small loop which calls a tessellation function on a bunch of NURBS surfaces. The function was appending triangles to a list, so I made a thread-local list for each call and combined them afterwards to avoid writes to a shared data structure. Again N times faster with very little effort.
The code is also fine to build single threaded without change if you don't have OpenMP. Your compiler will just ignore the #pragmas.
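For a picture of what that first case looks like, here's a rough sketch; trace_pixel and the types are made up stand-ins, not the actual renderer:

    #include <cstddef>
    #include <vector>

    struct Color { float r, g, b; };
    struct Scene { /* camera, geometry, BVH, ... (read-only while rendering) */ };

    // Hypothetical pure per-pixel function: reads the scene, returns a colour.
    Color trace_pixel(const Scene &scene, int x, int y);

    void render(const Scene &scene, int width, int height, std::vector<Color> &image)
    {
        image.resize((std::size_t)width * height);
        // One pragma: each iteration writes only its own pixels, so there is
        // no shared mutable state and the rows parallelise cleanly.
        #pragma omp parallel for schedule(dynamic)
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image[(std::size_t)y * width + x] = trace_pixel(scene, x, y);
            }
        }
        // Built without -fopenmp (or /openmp), the pragma is ignored and this
        // is an ordinary serial loop.
    }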
> OpenMP is one of the easiest ways to make existing code run across CPU cores.
True (and the same goes for Intel TBB). However, as someone with a lot of experience optimising HPC algorithms for rendering, geometry processing and simulation, there are caveats: quite often, existing code that has been parallelised this way naively ends up spending a disproportionate amount of CPU time on spinlocks inside OpenMP or TBB instead of doing useful work. (I've noticed the same thing happening with Rayon in Rust.)
Sometimes I've looked at code colleagues have "parallelised" this way, and they've said "yes, it's using multiple threads". But when you profile it with perf or VTune, it's clearly not doing that much *useful* parallel work, and sometimes it's even slower than single-threaded from a wall-clock standpoint. People just didn't check whether it was actually faster; they looked at the CPU usage and didn't notice the spinlocks.
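A quick way to do that check, for anyone reading along: time the region by wall clock and compare a single-threaded run against the default thread count. run_workload here is just a stand-in for whatever was parallelised:

    #include <cstdio>
    #include <omp.h>

    // Stand-in for the region under test.
    void run_workload();

    // Compare wall-clock time, not CPU usage: run once with OMP_NUM_THREADS=1
    // and once with the default thread count. If the two times are close (or
    // the parallel run is slower), the threads are mostly contending on locks
    // rather than doing useful work.
    void time_workload()
    {
        double t0 = omp_get_wtime();   // wall-clock seconds
        run_workload();
        double t1 = omp_get_wtime();
        std::printf("wall clock: %.3f s on %d threads\n",
                    t1 - t0, omp_get_max_threads());
    }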
Here's some reading that I personally have found helpful for optimizing parallel programs:
The best I've found so far:
https://cdn.kernel.org/pub/linux/kernel/people/paulmck/perfb...
And some other good reading:
https://www.amazon.com/Systems-Performance-Brendan-Gregg/dp/...
https://fgiesen.wordpress.com/2014/08/18/atomics-and-content...
https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
OpenMP is great. I've done something similar to your second case (thread-local objects that are filled in parallel and later combined). In the case of "OpenMP off" (pragmas ignored), is it possible to avoid the overhead of the thread-local object essentially getting copied into the final object (since no OpenMP means only a single thread-local object)? I avoided this by implementing a separate code path, but I'm just wondering if there are any tricks I missed that would still allow a single code path.
Give one of the threads (thread ID 0, for instance) special privileges: its list is the one everything else gets appended to, so there's only concatenation or copying when you have more than one thread.
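Something like this, sketched with a made-up tessellate_surface interface (not Solvespace's actual code):

    #include <vector>
    #ifdef _OPENMP
    #include <omp.h>
    #else
    static int omp_get_thread_num() { return 0; }   // serial build: only "thread 0"
    #endif

    struct Triangle { /* vertices, normal, ... */ };

    // Hypothetical tessellator that appends the triangles for one surface to 'out'.
    void tessellate_surface(int surface, std::vector<Triangle> &out);

    void tessellate_all(int nsurfaces, std::vector<Triangle> &final_list)
    {
        #pragma omp parallel
        {
            // Thread 0 appends straight into the caller's list; every other
            // thread fills a private list that is merged afterwards.
            std::vector<Triangle> local;
            std::vector<Triangle> &mine =
                (omp_get_thread_num() == 0) ? final_list : local;

            #pragma omp for schedule(dynamic)
            for (int i = 0; i < nsurfaces; i++)
                tessellate_surface(i, mine);

            // The implicit barrier at the end of the 'for' guarantees thread 0
            // has finished writing final_list before anyone merges into it.
            if (!local.empty()) {
                #pragma omp critical
                final_list.insert(final_list.end(), local.begin(), local.end());
            }
        }
        // With the pragmas ignored there is only "thread 0", so the triangles
        // land directly in final_list and nothing gets copied.
    }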
Or, pre-allocate the memory and let each thread write to its own subset of the final collection and avoid the combine step entirely. This works regardless of the number of threads you use so long as you know the maximum amount of memory you might need to allocate. If it has no calculable upper bound, you will need to use other techniques.
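A sketch of the pre-allocation version, assuming each item produces a known number of outputs (the fixed count K and process() are made up):

    #include <cstddef>
    #include <vector>

    struct Result { /* ... */ };

    // Hypothetical per-item work producing exactly K results per item.
    constexpr int K = 4;
    Result process(int item, int slot);

    void process_all(int nitems, std::vector<Result> &out)
    {
        out.resize((std::size_t)nitems * K);   // allocate the whole output once
        #pragma omp parallel for
        for (int i = 0; i < nitems; i++) {
            // Iteration i owns out[i*K .. i*K + K-1], so no two threads ever
            // write the same element and there is no combine step at the end.
            for (int k = 0; k < K; k++)
                out[(std::size_t)i * K + k] = process(i, k);
        }
    }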
You can now (as of OpenMP 5) use it to write GPU programs. Intel's oneAPI uses OpenMP 5.x to write programs for the Intel Ponte Vecchio GPUs, which are on a par with the Nvidia A100.
https://www.intel.com/content/www/us/en/docs/oneapi/optimiza...
gcc also provides offloading support for Nvidia and AMD GPUs:
https://gcc.gnu.org/wiki/Offloading
Here is an example of how you can use OpenMP to run a kernel on an Nvidia A100:
https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/...
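For anyone who hasn't seen the target directives, a generic saxpy-style sketch (not taken from the linked notes; it needs a compiler built with offload support, e.g. gcc/clang with an nvptx backend or Intel's icpx):

    #include <vector>

    // map() copies x and y to the device, the loop runs there, and y is copied back.
    void saxpy(float a, const std::vector<float> &x, std::vector<float> &y)
    {
        const float *xp = x.data();
        float *yp = y.data();
        const long long n = (long long)y.size();

        #pragma omp target teams distribute parallel for \
            map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (long long i = 0; i < n; i++)
            yp[i] = a * xp[i] + yp[i];
    }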
I've wanted to mess around with those intel GPUs but haven't found a great source that deals with individuals/small orders.
OpenMP was pivotal to my last workplace, but because some customers required MSVC, we barely had support for OpenMP 2.0.
I was just googling to see if there's any Emscripten/WASM implementation of OpenMP. The Emscripten GitHub issue [1] has a link to this "simpleomp" [2][3], where:
> In ncnn project, we implement a minimal openmp runtime for webassembly target
> It only works for #pragma omp parallel for num_threads(N)
[1] https://github.com/emscripten-core/emscripten/issues/13892
[2] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h
[3] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cp...
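So the one supported form is the explicit num_threads clause on a parallel for, i.e. something like this (made-up example, not from ncnn):

    #include <vector>

    // Each iteration writes only its own element, split across 4 threads.
    void scale(std::vector<float> &v, float s)
    {
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < (int)v.size(); i++)
            v[i] *= s;
    }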