
Discovering SIMD: SSE2 and NEON

This post covers the basics of writing Intel/AMD SSE2 and ARM NEON code in C/C++ using GCC/Clang and Visual Studio.

For several years I built projects in C and C++ using Visual Studio's release mode or -O3 in GCC.

The compiler makes a lot of optimizations, but it might not be using the data-parallel instruction sets available on current CPUs.

Intel and AMD have implemented a CPU instruction set called SSE2 (Streaming SIMD Extensions 2).

And ARM (since ARMv7) has an instruction set similar to SSE2, called NEON. I discovered it because I bought a Raspberry Pi Model B+ very recently.

But what is SSE in practice?

Intel and AMD CPUs have special 128-bit registers that can hold several data elements at once, and special instructions that process the data in those registers.

For example, we can issue a single instruction that performs four float multiplications and places the four results in one register, instead of issuing four float multiplication instructions one by one.

SSE offers many instructions that operate on those registers.

But what is NEON in practice?

Some ARM chips have NEON (Advanced SIMD), starting with ARMv7 (Cortex-A8, etc.).

Some instructions work with 64-bit registers and others with 128-bit registers.

In the same fashion as SSE, we can perform 2 or 4 float operations with a single instruction.

A Note About Memory Layout

We cannot use SSE or NEON with just any data type or data structure.

The input and output addresses need to be aligned: 16 bytes for SSE, and 8 or 16 bytes for NEON. To keep the code compatible across systems, I align everything to 16 bytes.

The compilers I have access to are: Clang (Mac), GCC (Linux, Mac, Windows) and Visual Studio (Windows).

Each operating system and each compiler has its own way to configure data alignment.

clang / gcc

We can use __attribute__ (( __aligned__ (16))) at the end of the declaration of a structure, class or array.

GCC and Clang make this the easiest, because they also align the data inside the STL and through the new operator.

Using the data alignment attribute on stack-allocated data:

class MyClass {
public:
  float a;
  float b;
  float c;
  float d;
} __attribute__ (( __aligned__ (16)));

class MyClass {
public:
  char a;
  float b[4] __attribute__ (( __aligned__ (16)));
};

float data[4] __attribute__ (( __aligned__ (16)));

GCC automatically aligns the data allocated on the heap with the new operator.

To allocate a raw aligned memory block we can use either:
aligned_alloc / free
or
_mm_malloc / _mm_free .

float * alignedData = (float *) aligned_alloc( 16, sizeof(float) * n ); // size should be a multiple of the alignment
free( alignedData );

float * alignedData = (float *) _mm_malloc( sizeof(float) * n, 16 );
_mm_free( alignedData );

Visual Studio

We can use __declspec(align(16)) at the beginning of the data declaration.

Using the data alignment attribute on stack-allocated data:

__declspec(align(16)) class MyClass {
public:
  float a;
  float b;
  float c;
  float d;
};

class MyClass {
public:
  char a;
  __declspec(align(16)) float b[4];
};

__declspec(align(16)) float data[4];

These declarations only work for stack data. On the heap we need to take special care, because Visual Studio doesn't align the object data allocated with the new operator.

The STL doesn't align the data either.

To make the new operator work properly in Visual Studio, we need to overload it for our classes and structs.

#include <new>         // placement new declarations
#include <xmmintrin.h> // _mm_malloc / _mm_free

__declspec(align(16)) class MyClass {
public:
    __declspec(align(16)) float data[4];
    // operator new
    void* operator new(size_t size) {
        return _mm_malloc(size, 16);
    }
    // operator delete
    void operator delete(void* p) { 
        _mm_free(p);
    }
    // operator new[]
    void* operator new[](size_t size) {
        return _mm_malloc(size, 16);
    }
    // operator delete[]
    void operator delete[](void* p) {
        _mm_free(p);
    }
    // placement operator new
    void* operator new (std::size_t n, void* ptr){
        return ptr;
    }
    // placement operator delete
    void operator delete(void *objectAllocated, void* ptr) {
    }
};

Placement new and placement delete are not commonly overloaded, but we need them here because the STL uses them, for example in the vector template.

To allocate a raw aligned memory block we can use: _mm_malloc / _mm_free .

float * alignedData = (float *) _mm_malloc( sizeof(float) * n, 16 );
_mm_free( alignedData );

How About The C++ STL ?

The STL works fine with GCC/Clang, but crashes on Visual Studio when the stored elements require 16-byte alignment.

To get rid of sudden crashes on Visual Studio caused by unaligned data, we need to create an allocator template for the STL containers.

I tested the result using std::vector<> .

The allocator tells the STL container how its data is allocated, released, constructed and destroyed.

The ssealign example:

#include <cstddef>     // size_t, ptrdiff_t
#include <new>         // placement new
#include <xmmintrin.h> // _mm_malloc / _mm_free

template <typename T, size_t N = 16>
class ssealign {
public:
  typedef T value_type;
  typedef size_t size_type;
  typedef ptrdiff_t difference_type;

  typedef T * pointer;
  typedef const T * const_pointer;

  typedef T & reference;
  typedef const T & const_reference;

public:
  inline ssealign() throw () { }

  template <typename T2>
  inline ssealign(const ssealign<T2, N> &) throw () { }

  inline ~ssealign() throw () { }

  inline pointer address(reference r) {
    return &r;
  }

  inline const_pointer address(const_reference r) const {
    return &r;
  }

  inline pointer allocate(size_type n) {
    return (pointer)_mm_malloc(n * sizeof(value_type),N);
  }

  inline void deallocate(pointer p, size_type) {
    _mm_free(p);
  }

  inline void construct(pointer p, const value_type & wert) {
    new (p) value_type(wert);
  }

  inline void destroy(pointer p) {
    p->~value_type();
  }

  inline size_type max_size() const throw () {
    return size_type(-1) / sizeof(value_type);
  }

  template <typename T2>
  struct rebind {
    typedef ssealign<T2, N> other;
  };

  bool operator!=(const ssealign<T, N>& other) const {
    return !(*this == other);
  }

  // Returns true if and only if storage allocated from *this
  // can be deallocated from other, and vice versa.
  // Always returns true for stateless allocators.
  bool operator==(const ssealign<T, N>& other) const {
    return true;
  }
};

After defining this custom allocator, we can use it with std::vector<> .

For example:

// This declaration makes the vector template
// to align data to 16 bytes
std::vector<float, ssealign<float, 16> > floats;

// another example with MyClass (using the new operator overload)
std::vector<MyClass, ssealign<MyClass, 16> > objects;

Note About the Data Structures

We can make a header that holds both kinds of definitions we saw here and use

#ifdef _MSC_VER

to check whether the code is being compiled by Visual Studio or by GCC/Clang.

For example

#ifdef _MSC_VER
  #define _ALIGN_PRE __declspec(align(16))
  #define _ALIGN_POS
#else
  #define _ALIGN_PRE
  #define _ALIGN_POS __attribute__ (( __aligned__ (16)))
#endif

And the class could be written as follows:

_ALIGN_PRE class MyClass { 
public:
  _ALIGN_PRE float data[4] _ALIGN_POS;

  // operator new and delete overload
  ...
} _ALIGN_POS;

How can we use SSE or NEON in C/C++ ?

After we know how to define data and how to align it, we can start coding our SIMD algorithm.

Intel / AMD / SSE2

I chose the SSE2 instruction set for this post because it can be used on both Intel and AMD hardware. Other instruction sets are specific to either Intel or AMD hardware.

To be able to use SSE2 instructions you need to configure the compiler.

On Visual Studio you need to set the arch flag: /arch:SSE2 .

In GCC you need to add the compiler flags: -mmmx -msse -msse2 -mfpmath=sse .

In SSE we need to include:

#include <xmmintrin.h> // SSE1
#include <emmintrin.h> // SSE2

After that we can create functions or methods that use the internal types.

In SSE we use the type __m128, which holds 4 floats (the integer variants use __m128i, which can hold 4 32-bit ints, 8 16-bit ints, etc., and __m128d holds 2 doubles).

All SSE intrinsics start with _mm_*. So if we want to add two packed vectors of 4 floats we can use _mm_add_ps().

Example:

// to avoid typecasting float to __m128, 
// we can use the union
_ALIGN_PRE struct vec4 {
  union{
    struct{ float x,y,z,w; };
    __m128 sse_data;
  };
}_ALIGN_POS;

// define variables
vec4 a,b,c;

// add instruction : C = A + B
c.sse_data = _mm_add_ps( a.sse_data, b.sse_data );

ARM / NEON

I chose the NEON instruction set for this post because it can be used on the Raspberry Pi Model B+ that I bought very recently. But I saw that it is also available on a lot of Android and iOS hardware.

To be able to use NEON instructions you need to configure the compiler.

In GCC you need to add the compiler flag: -mfpu=neon .

In NEON we need to include:

#include <arm_neon.h>

After that we can create functions or methods that use the internal types.

In NEON we use the type float32x4_t (it contains 4 floats). There are other types also.

All NEON intrinsics start with v*. So if we want to add two vectors of 4 floats we can use vaddq_f32().

// to avoid typecasting float to float32x4_t, 
// we can use the union
_ALIGN_PRE struct vec4 {
  union{
    struct{ float x,y,z,w; };
    float32x4_t neon_data;
  };
}_ALIGN_POS;

// define variables
vec4 a,b,c;

// add instruction : C = A + B
c.neon_data = vaddq_f32( a.neon_data, b.neon_data );

What Next?

I think this is just the beginning.

Now you can take a look at the SSE2 or NEON documentation to see what instructions they have and try to come up with new SIMD parallel algorithms you need.

SSE2 Intrinsics Listing

Neon Intrinsics Listing

SIMD Version of Vector Math for 3D Application

I ported the vector math code from my library to use both SSE2 and NEON intrinsics. It has vectors with 2, 3 and 4 components, quaternions, and 4x4 matrix structures and operations.

You can check the code on GitHub.
