C++ - Getting started with SSE

Introduction

SSE stands for Streaming SIMD Extensions. It is a set of CPU instructions aimed at workloads such as signal processing, scientific computing, and 3D graphics.
SIMD is itself an acronym: Single Instruction, Multiple Data. An instruction is SIMD when it applies the same operation to several data elements at once. SSE registers are 128 bits wide, so a single instruction can process four single-precision floats.
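To make the "single instruction, multiple data" idea concrete, here is a minimal sketch that adds four floats first lane by lane and then with one packed addition. The helper names add4_scalar and add4_sse are hypothetical, chosen for illustration only.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Hypothetical helper: out[i] = x[i] + y[i], one element at a time.
void add4_scalar(const float* x, const float* y, float* out) {
  assert(x && y && out);
  for (int i = 0; i < 4; ++i) out[i] = x[i] + y[i];
}

// Hypothetical helper: the same result with a single packed addition.
// The unaligned load/store variants are used so any float pointer works.
void add4_sse(const float* x, const float* y, float* out) {
  assert(x && y && out);
  __m128 vx = _mm_loadu_ps(x);             // load 4 floats from x
  __m128 vy = _mm_loadu_ps(y);             // load 4 floats from y
  _mm_storeu_ps(out, _mm_add_ps(vx, vy));  // 4 additions at once
}
```

Both functions write the same four sums; the SSE version replaces the four scalar additions with one _mm_add_ps.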
Naming convention: _mm_<intrin_op>_<suffix>

// alignas(16) guarantees the 16-byte alignment that _mm_load_ps requires
alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 t = _mm_load_ps(a); // p: packed, s: single precision
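The suffix says which lanes and which element type an intrinsic works on. The sketch below applies the same "add" operation with four common suffixes; suffix_demo is a hypothetical function written only to show the differences.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2; also pulls in the SSE header <xmmintrin.h>

// Hypothetical demo: fills each output array using a differently suffixed add.
void suffix_demo(float ps[4], float ss[4], double pd[2], int epi32[4]) {
  assert(ps && ss && pd && epi32);

  // _mm_set_ps takes its arguments from the highest lane to the lowest,
  // so the lanes of a, low to high, are 1 2 3 4.
  __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
  __m128 b = _mm_set1_ps(10.0f);  // 10 in every lane

  // _ps: packed single precision, all four lanes -> 11 12 13 14
  _mm_storeu_ps(ps, _mm_add_ps(a, b));

  // _ss: scalar single precision, lowest lane only; the upper three
  // lanes are copied from the first operand -> 11 2 3 4
  _mm_storeu_ps(ss, _mm_add_ss(a, b));

  // _pd: packed double precision, two lanes -> 2 4
  __m128d c = _mm_set_pd(2.0, 1.0);
  _mm_storeu_pd(pd, _mm_add_pd(c, c));

  // _epi32: packed signed 32-bit integers, four lanes -> 2 4 6 8
  __m128i d = _mm_set_epi32(4, 3, 2, 1);
  _mm_storeu_si128(reinterpret_cast<__m128i*>(epi32), _mm_add_epi32(d, d));
}
```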

Issues

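Alignment matters: the aligned load and store intrinsics require addresses that are multiples of 16 bytes (128 bits).
Loads and stores between memory and SSE registers must be coded explicitly.
Shuffles, which rearrange data within a register, add overhead.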
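The three issues above show up together when summing an array. In the sketch below, sum_sse is a hypothetical helper: it uses the unaligned load _mm_loadu_ps so that it accepts any pointer, and it assumes the length is a multiple of 4. The final reduction needs shuffles because the four partial sums live in different lanes of one register.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: sum of n floats, n a multiple of 4.
float sum_sse(const float* p, std::size_t n) {
  assert(n % 4 == 0);
  __m128 acc = _mm_setzero_ps();
  for (std::size_t i = 0; i < n; i += 4) {
    // Explicit load; the "u" variant tolerates unaligned addresses.
    acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
  }
  // acc holds four partial sums [a0, a1, a2, a3]. Adding them together
  // requires moving lanes around first; this is the shuffle overhead.
  __m128 hi = _mm_movehl_ps(acc, acc);  // [a2, a3, a2, a3]
  __m128 sum2 = _mm_add_ps(acc, hi);    // lane 0: a0+a2, lane 1: a1+a3
  __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
  __m128 total = _mm_add_ss(sum2, odd); // lane 0: a0+a1+a2+a3
  return _mm_cvtss_f32(total);          // extract lane 0
}
```

When the data is known to be 16-byte aligned, _mm_load_ps can replace _mm_loadu_ps.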

Example

_mm_store_ps stores the four floats of an SSE register into an array.
_mm_sqrt_ps computes the square roots of four floats in a single operation.
posix_memalign allocates aligned data on the heap.
__attribute__ ((aligned (alignment))) aligns a variable, such as a stack or static array, without a heap allocation. (This attribute specifies a minimum alignment for the variable or structure field, measured in bytes.) For example: float a[] __attribute__ ((aligned (16))) = { 41982., 81.5091, 3.14, 42.666 };
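The two ways of obtaining aligned storage can be sketched as follows. The helper names is_aligned16, alloc_floats16 and sqrt4_demo are hypothetical; alignas(16) is used as the standard C++11 spelling of the aligned attribute, and posix_memalign assumes a POSIX system.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>      // posix_memalign, free (POSIX systems)
#include <emmintrin.h>

// Hypothetical helper: true when p is a multiple of 16 bytes.
bool is_aligned16(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: heap allocation of n floats on a 16-byte boundary.
// Returns nullptr on failure; release the block with free().
float* alloc_floats16(std::size_t n) {
  void* p = nullptr;
  if (posix_memalign(&p, 16, n * sizeof(float)) != 0) return nullptr;
  return static_cast<float*>(p);
}

// Hypothetical demo: square roots of four values held in an aligned
// local array, written to out (which may itself be unaligned).
void sqrt4_demo(float out[4]) {
  alignas(16) float a[4] = {1.0f, 4.0f, 9.0f, 16.0f};
  assert(is_aligned16(a));                 // safe to use _mm_load_ps
  __m128 r = _mm_sqrt_ps(_mm_load_ps(a));  // four square roots at once
  _mm_storeu_ps(out, r);
}
```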

#include <emmintrin.h> // IMPORTANT: SSE2 intrinsics (also pulls in the SSE header)
#include <math.h>
#include <stdio.h>
#include <stdlib.h> // atoi, posix_memalign, free
#include <chrono>
#include <iostream>

using namespace std;

class Timer {
 public:
  Timer() { start = Clock::now(); }

  void Start() { start = Clock::now(); }

  void End() {
    auto end = Clock::now();
    std::cout << "Elapsed time: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(end -
                                                                      start)
                     .count()
              << " nanoseconds" << std::endl;
  }

 private:
  using Clock = std::chrono::high_resolution_clock;
  Clock::time_point start;  // portable: no libstdc++-internal _V2 namespace
};

void normal(float* a, int N) {
  for (int i = 0; i < N; ++i) {
    a[i] = sqrt(a[i]);
  }
}

// Compute the square root of every element of a very large float array, in place.
// Preconditions: a is 16-byte aligned and N % 4 == 0.
void sse(float* a, int N) {
  for (int i = 0; i < N; i += 4) {
    __m128 v = _mm_load_ps(a + i);        // aligned load of 4 floats
    _mm_store_ps(a + i, _mm_sqrt_ps(v));  // 4 square roots, aligned store
  }
}

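// Add two arrays of char together element-wise (a[i] += b[i]).
// Preconditions: a and b are 16-byte aligned and N % 16 == 0.
void sse(char* a, const char* b, int N) {
  int nb_iters = N / 16;

  __m128i* l = (__m128i*)a;
  const __m128i* r = (const __m128i*)b;  // keep b read-only

  for (int i = 0; i < nb_iters; ++i, ++l, ++r)
    _mm_store_si128(l, _mm_add_epi8(_mm_load_si128(l), _mm_load_si128(r)));
}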

int main(int argc, char** argv) {
  if (argc != 2) {
    cout << "Usage: ./sse_test N (N represents the number of floats)" << endl;
    return 1;
  }
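  int N = atoi(argv[1]);
  if (N <= 0 || N % 4 != 0) {
    cout << "N must be a positive multiple of 4" << endl;
    return 1;
  }

  // posix_memalign returns 0 on success and a 16-byte aligned heap block in a.
  float* a = nullptr;
  if (posix_memalign((void**)&a, 16, N * sizeof(float)) != 0) {
    cout << "Allocation failed" << endl;
    return 1;
  }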

  for (int i = 0; i < N; ++i) {
    a[i] = 3141592.65358;
  }

  Timer timer;

  {
    timer.Start();
    normal(a, N);
    timer.End();
  }

  for (int i = 0; i < N; ++i) {
    a[i] = 3141592.65358;
  }

  {
    timer.Start();
    sse(a, N);
    timer.End();
  }

  free(a);  // memory obtained from posix_memalign is released with free()
}

g++ -o sse_test sse_test.cc -std=c++11 -O3 -msse2

./sse_test 64000000

Elapsed time: 336269639 nanoseconds
Elapsed time: 29902925 nanoseconds
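The sse routine in the example assumes N % 4 == 0 and 16-byte aligned data. One possible way to lift both restrictions is sketched below: process full groups of four with unaligned loads and stores, then finish the remaining zero to three elements with scalar code. The name sqrt_any_length is hypothetical, and this variant was not part of the timing run above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical variant: in-place square root for any length n and any alignment.
void sqrt_any_length(float* a, std::size_t n) {
  assert(a != nullptr || n == 0);
  std::size_t i = 0;
  // Vector part: full groups of four floats.
  for (; i + 4 <= n; i += 4) {
    _mm_storeu_ps(a + i, _mm_sqrt_ps(_mm_loadu_ps(a + i)));
  }
  // Scalar tail: the 0 to 3 elements that do not fill a register.
  for (; i < n; ++i) {
    a[i] = std::sqrt(a[i]);
  }
}
```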
