This is largely a save-your-work checkin.
Created p521/arch_ref64 code to make sure E-521 basically works.
Fixed some of the testing code around E-521. The E-521 tests do not
all pass yet.
Created p521/arch_x86_64 code with an optimized multiply. In this
checkin the multiply is fast and works, but all the other code in
that directory is still the completely unoptimized ref64 build, which
reduces after every add and sub. So E-521 as a whole isn't fast yet.
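
For illustration only, here is a minimal sketch of what "reduces after
every add and sub" can look like for p = 2^521 - 1. It is not the code
in this checkin. The limb layout (nine 64-bit limbs in radix 2^58, top
limb 57 bits) and the names fe521, fe521_weak_reduce, fe521_add and
fe521_sub are assumptions made up for the example.

```c
#include <stdint.h>

/* Hypothetical layout: 8 limbs of 58 bits + 1 limb of 57 bits = 521 bits. */
#define FE521_NLIMBS 9
#define FE521_LIMB_BITS 58
#define FE521_TOP_BITS 57

typedef struct { uint64_t limb[FE521_NLIMBS]; } fe521;

static const uint64_t FE521_LIMB_MASK = (1ULL << FE521_LIMB_BITS) - 1;
static const uint64_t FE521_TOP_MASK = (1ULL << FE521_TOP_BITS) - 1;

/* Weak reduction: one carry pass. Anything above bit 521 is folded back
 * into limb 0, because 2^521 == 1 (mod p). The result is congruent to the
 * input but not necessarily the canonical representative. */
static void fe521_weak_reduce(fe521 *a) {
    uint64_t carry = a->limb[FE521_NLIMBS - 1] >> FE521_TOP_BITS;
    a->limb[FE521_NLIMBS - 1] &= FE521_TOP_MASK;
    for (int i = 0; i < FE521_NLIMBS - 1; i++) {
        a->limb[i] += carry;
        carry = a->limb[i] >> FE521_LIMB_BITS;
        a->limb[i] &= FE521_LIMB_MASK;
    }
    a->limb[FE521_NLIMBS - 1] += carry;
}

/* Add, then reduce immediately (the unoptimized pattern). */
static void fe521_add(fe521 *out, const fe521 *a, const fe521 *b) {
    for (int i = 0; i < FE521_NLIMBS; i++) {
        out->limb[i] = a->limb[i] + b->limb[i];
    }
    fe521_weak_reduce(out);
}

/* Subtract, then reduce immediately. 2p is added limb-wise first so that
 * no limb underflows; this assumes b is already weakly reduced. */
static void fe521_sub(fe521 *out, const fe521 *a, const fe521 *b) {
    for (int i = 0; i < FE521_NLIMBS; i++) {
        uint64_t two_p = 2 * (i == FE521_NLIMBS - 1 ? FE521_TOP_MASK
                                                    : FE521_LIMB_MASK);
        out->limb[i] = a->limb[i] + two_p - b->limb[i];
    }
    fe521_weak_reduce(out);
}
```

In this sketch every add and sub pays for a full carry pass over all
nine limbs. An optimized build would typically let limbs grow across
several adds and subs and reduce only when needed, for example before
a multiply.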