I’ve been playing around with SSE recently to explore the benefits of using it in some performance-sensitive sections of code. As a first experiment I’ve applied it to updating an AABB (Axis Aligned Bounding Box) from a rotated previous bounding box.
I’ve created a small repo that micro-benchmarks the non-SIMD and SIMD versions using picobench, in case it’s of use to others. The results when running a release build under gcc 7.4 (64-bit) with optimisation disabled (-O0):
Name (baseline is *) | Dim  | Total ms | ns/op | Baseline | Ops/second
benchmark_normal *   | 8    | 0.002    | 208   | -        | 4799040.2
benchmark_simd       | 8    | 0.001    | 62    | 0.301    | 15968063.9
benchmark_normal *   | 64   | 0.013    | 205   | -        | 4876933.6
benchmark_simd       | 64   | 0.004    | 58    | 0.285    | 17112299.5
benchmark_normal *   | 512  | 0.100    | 194   | -        | 5128667.4
benchmark_simd       | 512  | 0.031    | 61    | 0.315    | 16269980.0
benchmark_normal *   | 4096 | 0.838    | 204   | -        | 4885630.2
benchmark_simd       | 4096 | 0.251    | 61    | 0.299    | 16328548.3
benchmark_normal *   | 8192 | 1.669    | 203   | -        | 4907605.0
benchmark_simd       | 8192 | 0.474    | 57    | 0.284    | 17265216.7
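Not quite the 4x we’d expect (the SIMD version takes roughly 0.28 to 0.32 of the baseline time, i.e. about a 3.2x to 3.5x speed-up), but pretty close 🙂 However, with -O3 the two versions come out at 1:1.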
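For readers who haven’t seen this kind of AABB update before, here is a minimal sketch of what a scalar and an SSE version might look like. It is not the code from the repo: the `Aabb` and `Transform` types, the padded column-major layout and all function names are assumptions made for illustration. The idea is that each matrix column is scaled by the box’s min and max on that axis, and the per-lane minimum and maximum of the two products are accumulated.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical types for illustration; each vector is padded to 4 floats.
struct Aabb {
    float min[4];  // x, y, z, padding (kept at 0)
    float max[4];
};

// Column-major 3x3 rotation plus a translation; padding lanes are kept at 0.
struct Transform {
    float col[3][4];
    float translation[4];
};

// Scalar version: accumulate the smaller / larger product per output axis.
Aabb transform_scalar(const Aabb& box, const Transform& t) {
    Aabb out;
    for (int i = 0; i < 3; ++i) {
        out.min[i] = out.max[i] = t.translation[i];
        for (int j = 0; j < 3; ++j) {
            const float a = t.col[j][i] * box.min[j];
            const float b = t.col[j][i] * box.max[j];
            out.min[i] += std::min(a, b);
            out.max[i] += std::max(a, b);
        }
    }
    out.min[3] = out.max[3] = 0.0f;
    return out;
}

// SSE version: the loop over output axes becomes one 4-wide operation.
Aabb transform_sse(const Aabb& box, const Transform& t) {
    __m128 lo = _mm_loadu_ps(t.translation);
    __m128 hi = lo;
    for (int j = 0; j < 3; ++j) {
        const __m128 col = _mm_loadu_ps(t.col[j]);
        const __m128 a = _mm_mul_ps(col, _mm_set1_ps(box.min[j]));
        const __m128 b = _mm_mul_ps(col, _mm_set1_ps(box.max[j]));
        lo = _mm_add_ps(lo, _mm_min_ps(a, b));
        hi = _mm_add_ps(hi, _mm_max_ps(a, b));
    }
    Aabb out;
    _mm_storeu_ps(out.min, lo);
    _mm_storeu_ps(out.max, hi);
    return out;
}

// Sanity check worth running before benchmarking: both versions must agree.
void check_versions_agree(const Aabb& box, const Transform& t) {
    const Aabb s = transform_scalar(box, t);
    const Aabb v = transform_sse(box, t);
    for (int i = 0; i < 3; ++i) {
        assert(std::fabs(s.min[i] - v.min[i]) <= 1e-5f);
        assert(std::fabs(s.max[i] - v.max[i]) <= 1e-5f);
    }
}
```

With this layout the SSE version uses three of the four lanes, which is one reason a speed-up somewhat below 4x would not be surprising in a sketch like this.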
To be continued…
I have a side-project that I use to explore technical interests. It’s a plugin-based Entity-Component-System 3D game engine / editor and SDK for Linux, Windows and macOS. The engine can be extended by adding new components and systems in the form of plugins (generated by the in-editor code wizard).
As SimulationStarterKit is C++ / CMake based, I use CTest as the unit testing framework, which is especially helpful when you’re trying to cover a few platforms (i.e. Linux, macOS and Windows). Every time I implement a feature in the engine I try to create a corresponding test first. This helps determine what constitutes correct operation, and sometimes also serves as a programming aid for discovering what a usable API might look like for a new feature. The video below demonstrates how this looks in Visual Studio.
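As a rough illustration of the test-first workflow, here is a sketch of what the body of a small test executable might look like. Everything in it is hypothetical: `lerp` stands in for an engine feature and is not part of SimulationStarterKit, and `expect_eq` is a made-up helper. The only thing CTest itself needs is a process exit code, where zero counts as a pass by default.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical feature under test; writing the test first pins down its API.
float lerp(float a, float b, float t) { return a + (b - a) * t; }

// Returns 0 when the check passes, 1 (after logging) when it fails.
int expect_eq(float actual, float expected, const char* what) {
    assert(what != nullptr);
    if (actual == expected) return 0;
    std::fprintf(stderr, "FAIL: %s (got %f, expected %f)\n", what, actual, expected);
    return 1;
}

// Returns the number of failed checks. A test executable's main() would
// return this value, so any failure produces a non-zero exit code.
int run_lerp_tests() {
    int failures = 0;
    failures += expect_eq(lerp(0.0f, 10.0f, 0.0f), 0.0f, "t = 0 returns a");
    failures += expect_eq(lerp(0.0f, 10.0f, 1.0f), 10.0f, "t = 1 returns b");
    failures += expect_eq(lerp(2.0f, 4.0f, 0.5f), 3.0f, "t = 0.5 returns the midpoint");
    return failures;
}
```

On the CMake side, an executable like this would typically be registered with `enable_testing()` and `add_test(NAME ... COMMAND ...)`, after which `ctest` runs it the same way on every platform.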
When rendering objects I want to:
- Minimise state switches (e.g. shader activation via glUseProgram())
- Render opaque objects front to back to reduce overdraw
- Render transparent objects back to front to get correct transparency
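One common way to meet all three goals with a single sort is to pack them into an integer key. The sketch below is an assumption about how that could look, not the engine’s actual layout: `DrawItem`, the bit positions and the 24-bit depth quantisation are all invented for illustration, and depth is assumed to be normalised to [0, 1].

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical draw record; field names are illustrative only.
struct DrawItem {
    std::uint16_t shader;  // shader program index
    bool transparent;
    float depth;           // distance from the camera, assumed in [0, 1]
};

// Builds a key so that a plain ascending sort gives the desired draw order.
std::uint64_t make_sort_key(const DrawItem& item) {
    assert(item.depth >= 0.0f && item.depth <= 1.0f);
    // Quantise depth to 24 bits.
    const std::uint64_t depth_bits =
        static_cast<std::uint64_t>(item.depth * 16777215.0f);
    if (!item.transparent) {
        // Opaque: bit 63 = 0, then shader (fewer state switches),
        // then depth ascending (front to back, less overdraw).
        return (static_cast<std::uint64_t>(item.shader) << 24) | depth_bits;
    }
    // Transparent: bit 63 = 1 so these draw after all opaque items, then
    // inverted depth (back to front), with the shader as a tie-breaker.
    const std::uint64_t inverted = 0xFFFFFFu - depth_bits;
    return (std::uint64_t{1} << 63) | (inverted << 16) | item.shader;
}

// Sorts items into draw order by comparing their keys.
void sort_for_rendering(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return make_sort_key(a) < make_sort_key(b);
              });
}
```

For transparent objects correctness has to win over batching, which is why depth sits above the shader bits in that half of the key.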
You can accelerate Python code by re-implementing performance-sensitive sections in C/C++ (e.g. by utilizing SSE) and then making them callable from Python, either through Python’s ctypes library or by exposing your C/C++ code as a Python extension module.
I’ve put together a minimal GitHub repo that demonstrates the ctypes approach here and another small project that demonstrates the Python extension module approach here in case it’s of use to anyone.
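To give a feel for the C/C++ side of the ctypes approach, here is a minimal sketch. It is not taken from either repo, and `sum_of_squares` is a hypothetical function chosen only because it is small. The important detail is `extern "C"`, which turns off C++ name mangling so that the symbol can be looked up by its plain name.

```cpp
#include <cassert>
#include <cstddef>

// extern "C" gives the function an unmangled name and a C-compatible signature.
extern "C" float sum_of_squares(const float* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    float total = 0.0f;
    // A hot loop like this is the kind of place an SSE version could slot in.
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i] * values[i];
    }
    return total;
}
```

Assuming this is built as a shared library (for example with `g++ -O2 -shared -fPIC`), the Python side would load it with `ctypes.CDLL`, set `argtypes` to `[ctypes.POINTER(ctypes.c_float), ctypes.c_size_t]` and set `restype` to `ctypes.c_float`. Setting `restype` matters because ctypes assumes an `int` return value by default.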