Trying to work around an apparent GCC bug on SSE2 (thanks, Samuel Neves). Added an experimental NEON arch. It's fast, but it's not yet GCC clean and still needs more general cleanup.