tidy up

10 years ago · b5f7c2eff8
--- a/HISTORY.txt
+++ b/HISTORY.txt
@@ -1,3 +1,25 @@
 April 23, 2015:
    Removed the original Goldilocks code; Decaf now stands on its own.
    This cuts the source code approximately in half, to a still-large
    13.7k wc-lines.  (Most of these lines are in the arch-specific
    field implementations.)
    
    Note that the decaf_crypto routines are not intended to set
    standards.  They should be secure, but they're intended more as
    examples of how the core ECC library could be used.
    
    The SHAKE stuff is also mostly an experiment, particularly the
    STROBE protocol/mode stuff.  This is all fine, because the ECC
    library itself is the core, and doesn't require the SHAKE stuff.
    (Except for the C++ header, which should probably also be factored
    so that it doesn't need the SHAKE stuff.)
    
    I've started work on making a Decaf BAT, but not done yet.
    
    I haven't ripped out all old multi-field code, because I intend
    to add support for other fields eventually.  Maybe properly this
    time, instead of with a million compile flags like the original.

 March 23, 2015:
    I've been fleshing out Decaf, and hopefully the API is somewhere
    near final.  I will probably move a few things around and add a
--- a/README.txt
+++ b/README.txt
@@ -1,42 +1,24 @@
 Ed448-Goldilocks
 Ed448-Goldilocks, Decaf version.

 This software is an experimental implementation of a new 448-bit elliptic
 curve called Ed448-Goldilocks. The implementation itself is based on that of
 an earlier, unnamed 252-bit curve which should probably be referred to as
 Ed252-MontgomeryStation. See http://eprint.iacr.org/2012/309 for details of
 that implementation.
 curve called Ed448-Goldilocks, with "Decaf" cofactor removal.

 The source files here are all by Mike Hamburg. Most of them are (c) 2014
 Cryptography Research, Inc (a division of Rambus). The cRandom
 implementation is the exception: these files are from the OpenConflict video
 game protection system out of Stanford, and are (c) 2011 Stanford
 University. All of these files are usable under the MIT license contained in
 LICENSE.txt.
 The source files here are all by Mike Hamburg. Most of them are (c)
 2014-2015 Cryptography Research, Inc (a division of Rambus). All of these
 files are usable under the MIT license contained in LICENSE.txt.

 The Makefile is set for my 2013 MacBook Air. You can `make bench` to run
 a completely arbitrary set of benchmarks and tests, or `make
 build/goldilocks.so` to build a stripped-down version of the library. For
 non-Haswell platforms, you need to replace -mavx2 -mbmi2 by an appropriate
 vector declaration. For non-Mac platforms, you won't be able to build a
 library with this Makefile. This is fine, because you shouldn't be using
 this for much at this stage anyway.
 a completely arbitrary set of benchmarks and tests, or `make lib` to build
 a stripped-down version of the library. For non-Haswell platforms, you may
 need to replace -mavx2 -mbmi2 by an appropriate vector declaration.

 I've attempted to protect against timing attacks and invalid point attacks,
 but as of yet no attempt to protect against power analysis. This is an early
 revision, so I haven't done much analysis or correctness testing of
 corner-cases.

 The code in ec_point.c and ec_point.h was generated with the help of a tool
 written in SAGE. The field code in p448.h doesn't reduce after add/sub, and
 so it requires care to prevent overflow. The SAGE tool figures out where to
 put reductions and adjustments to prevent overflow. It also formally
 verifies that the formulas produce points on the curve. I'm planning to add
 more features to it eventually. That tool is even more experimental than
 this library, though, and so I won't be releasing it just yet.
 but as of yet no attempt to protect against power analysis.

 This software is incomplete, and lacks documentation. None of the APIs are
 stable. The software is probably not secure. Please consult TODO.txt for
 additional agenda items. Do not taunt happy fun ball.
 yet stable, though they may be getting there. The software is probably not
 secure. Please consult TODO.txt for additional agenda items. Do not taunt
 happy fun ball.

 Cheers,
 -- Mike Hamburg
--- a/TODO.txt
+++ b/TODO.txt
@@ -1,16 +1,10 @@
 Important work items for Ed448-Goldilocks:

 * Better architecture detection / factoring of arch-related headers.
    [PROGRESS]

 * Better factoring of high-level vs low-level library.
 Important work items for Ed448-Goldilocks / decaf:

 * Factor out hash, crandom from core library?
    [DONE, except for C++ headers]

 * Signed 32-bit NEON implementation to avoid bias/reduce after subtract



 * Documentation: write high-level API docs, and internal docs to help
  other implementors.
    * Partial progress on Doxygenating the code.
@@ -20,65 +14,27 @@ Important work items for Ed448-Goldilocks:
 * Cleanup: rename everything consistently.
    * namespace_op or op_namespace?  namespace_op_type?
    * We don't have to be super-careful with the namespacing, because
      symbols will be scrubbed by exported.sym.

 * Cleanup: hard-coded tables (probably?)
    * This reduces the work required for goldilocks_init() at the expense
      of library size.
     
    * Makes error-handling and thread safety easier.
    
    * Use the SAGE tool?
      symbols will be scrubbed by visibility

 * Cleanup: unify intrinsics code
    * Word_t, mask_t, bigregister_t, etc.
    * Generate asm intrinsics with a script?

 * [DONE] Bugfix: make sure that init() and randomization are thread-safe.

 * [DONE] Security: check on deserialization that points are < p.
    * [NEEDS TESTING] Check also that they're nonzero or otherwise non-pathological?

 * Testing:
    * Corner-case testing
    * More bulk random testing
    * Negative testing.
    * SAGE-(auto?)-generated test vectors
    * Test the Barrett fields
    * More testing.  Testing, testing and testing.
    * Test corner cases better.

 * Safety: add static analysis attributes for compilers that support them
    * Most functions now have warn on ignored return.
    * [ MOSTLY DONE ]

 * Safety:
    * [DONE] Check for init() if it's still required once we've done the above
    * Decide what to do about RNG failures
        * abort
        * return error and zeroize
        * return error but continue if RNG is kind of mostly OK
    
 * Flexibility: decide which API options are good.
    * [DONE?] Eg, should functions take nbits and table sizes?
    
    * [DONE] Remove hardcoded adjustments from comb control.
        * These adjustments make the output wrong when it's not 450 bits.
        
    * Other slow Barrett fields?  Montgomery fields?

 * Mid-level API
    * Make it easier to work with untwisted Edwards objects.
    * Probably use extended or projective, not extensible coordinates.
    * Scalarmul with other cofactor modes.

 * High-level API:
    * SHA512 Elligator Edition?  Maybe write a paper first.
    
    * Elligator.
        * Need to write Elligator inverse.  Might not be Elligator-2S.
    
    * FHMQV? Is this patented?
    
    * What low-level APIs to expose?
        * Edwards points with add, sub, scalarmul, =, ==, ser/deser?

 * High-level API: [DONE]

 * Portability: test and make clean with other compilers
    * Using a fair amount of __attribute__ code.
@@ -89,47 +45,14 @@ Important work items for Ed448-Goldilocks:
    * I can't get a simple for-loop to autovectorize :-/
    * SAGE tool?

 * Portability: make the inner layers of the code 32-bit clean.
    * Write new versions of the field code.
        * [DONE] 28-bit limbs give less headroom for carries.
        * [DONE] Now have a vectorless ARM version; need NEON.
        * Improve speed of 32-bit field code.
    
    * [DONE] Run through the SAGE tool to generate new bias & bound.

 * [DONE] Portability: make the outer layers of the code 32-bit clean.

 * [DONE] Performance/flexibility: decide which parameters should be hard-coded.
    * Perhaps useful for comb precomputation.

 * Performance: Improve SHA512.
    * [DONE?] Improve portability.
    * Improve speed.
        * Except not, because this adds too much code size.
        * Link OpenSSL if a fast SHA is desired.

 * Protocol:
    * Decide what things to stir into hashes for various functions.
    
 * Performance: improve the Barrett field code.
    * Support other primes?
    * Capture prime shape into a struct instead of passing 3 params.
    * [DONE] Make 32-bit clean.

 * Automation:
    * Improve the SAGE tool to cover more cases
        * Real SSA classes to cover branching and looping
        * Constant-time selection
        * Intrinsics code
        * Field code?
    
    * SAGE tool is impossibly slow on 32-bit
         * Currently stuck on Elligator after 19 hours.
         * [FIXED] at least for now.
        
    * Vector-mul-chains
    * Negation "bubble pushing" optimization
 * Performance: Improve SHAKE.
    * Improve speed.  (Maybe)

 * Clear other TODO/FIXME/HACK/PERF items in the code

 * [DONE?] Submit to SUPERCOP
 * Submit Decaf to SUPERCOP
--- a/include/decaf_crypto.h
+++ b/include/decaf_crypto.h
@@ -4,7 +4,10 @@
 *   Copyright (c) 2015 Cryptography Research, Inc.  \n
 *   Released under the MIT License.  See LICENSE.txt for license information.
 * @author Mike Hamburg
 * @brief Decaf crypto routines. 
 * @brief Example Decaf crypto routines.
 * @warning These are merely examples, though they ought to be secure.  But real
 * protocols will decide differently on magic numbers, formats, which items to
 * hash, etc.
 * @warning Experimental!  The names, parameter orders etc are likely to change.
 */

--- a/src/decaf_crypto.c
+++ b/src/decaf_crypto.c
@@ -5,7 +5,7 @@
 *   Copyright (c) 2015 Cryptography Research, Inc.  \n
 *   Released under the MIT License.  See LICENSE.txt for license information.
 * @author Mike Hamburg
 * @brief Decaf crypto routines. 
 * @brief Example Decaf crypto routines.
 */

 #include "decaf_crypto.h"
--- a/src/decaf_fast.c
+++ b/src/decaf_fast.c
@@ -1106,12 +1106,6 @@ static void gf_batch_invert (
    /* const */ gf *in,
    unsigned int n
 ) {
    // if (n==0) {
    //     return;
    // } else if (n==1) {
    //     field_inverse(out[0],in[0]);
    //     return;
    // }
    assert(n>1);
  
    gf_cpy(out[1], in[0]);
@@ -1254,7 +1248,7 @@ void decaf_448_precomputed_scalarmul (
         
            for (k=0; k<t; k++) {
                unsigned int bit = i + s*(k + j*t);
                if (bit < SCALAR_WORDS * WBITS) {
                if (bit < DECAF_448_SCALAR_BITS) {
                    tab |= (scalar1x->limb[bit/WBITS] >> (bit%WBITS) & 1) << k;
                }
            }
--- a/src/include/field.h
+++ b/src/include/field.h
@@ -24,45 +24,6 @@ typedef struct field_t field_a_t[1];
 #define IF32(s)
 #endif

 /** @brief Bytes in a field element */
 #define FIELD_BYTES          (1+(FIELD_BITS-1)/8)

 /** @brief Words in a field element */
 #define FIELD_WORDS          (1+(FIELD_BITS-1)/sizeof(word_t))

 /* TODO: standardize notation */
 /** @brief The number of words in the Goldilocks field. */
 #define GOLDI_FIELD_WORDS DIV_CEIL(FIELD_BITS,WORD_BITS)

 /** @brief The number of bits in the Goldilocks curve's cofactor (cofactor=4). */
 #define COFACTOR_BITS 2

 /** @brief The number of bits in a Goldilocks scalar. */
 #define SCALAR_BITS (FIELD_BITS - COFACTOR_BITS)

 /** @brief The number of bytes in a Goldilocks scalar. */
 #define SCALAR_BYTES (1+(SCALAR_BITS)/8)

 /** @brief The number of words in the Goldilocks field. */
 #define SCALAR_WORDS WORDS_FOR_BITS(SCALAR_BITS)

 /**
 * @brief For GMP tests: little-endian representation of the field modulus.
 */
 extern const uint8_t FIELD_MODULUS[FIELD_BYTES];

 /**
 * Copy one field element to another.
 */
 static inline void
 __attribute__((unused,always_inline))        
 field_copy (
    field_a_restrict_t a,
    const field_a_restrict_t b
 ) {
    memcpy(a,b,sizeof(*a));
 }

 /**
 * Returns 1/sqrt(+- x).
 * 
@@ -76,38 +37,21 @@ field_isr (
    field_a_t       a,
    const field_a_t x
 );
    
 /**
 * Batch inverts out[i] = 1/in[i]
 * 
 * If any input is zero, all the outputs will be zero.
 */     
 void
 field_simultaneous_invert (
    field_a_t *__restrict__ out,
    const field_a_t *in,
    unsigned int n
 );

 /**
 * Returns 1/x.
 * 
 * If x=0, returns 0.
 *
 * TODO: this is currently unused in Decaf, but I've left a decl
 * for it because field_inverse is different (and simpler) than
 * field_isqrt for 5-mod-8 fields.
 */
 void
 field_inverse (
    field_a_t       a,
    const field_a_t x
 );

 /**
 * Returns -1 if a==b, 0 otherwise.
 */
 mask_t
 field_eq (
    const field_a_t a,
    const field_a_t b
 );
    
 /**
 * Square x, n times.
@@ -135,53 +79,6 @@ field_sqrn (
    }
 }

 static __inline__ mask_t
 __attribute__((unused,always_inline))
 field_high_bit (const field_a_t f) {
    field_a_t red;
    field_copy(red,f);
    field_weak_reduce(red);
    field_add_RAW(red,red,red);
    field_strong_reduce(red);
    return -(1&red->limb[0]);
 }

 static __inline__ mask_t
 __attribute__((unused,always_inline))
 field_make_nonzero (field_a_t f) {
    mask_t z = field_is_zero(f);
    field_addw( f, -z );
    return z;
 }

 /* Multiply by signed curve constant */
 static __inline__ void
 field_mulw_scc (
    field_a_restrict_t out,
    const field_a_t a,
    int64_t scc
 ) {
    if (scc >= 0) {
        field_mulw(out, a, scc);
    } else {
        field_mulw(out, a, -scc);
        field_neg_RAW(out,out);
        field_bias(out,2);
    }
 }

 /* Multiply by signed curve constant and weak reduce if biased */
 static __inline__ void
 field_mulw_scc_wr (
    field_a_restrict_t out,
    const field_a_t a,
    int64_t scc
 ) {
    field_mulw_scc(out, a, scc);
    if (scc < 0)
        field_weak_reduce(out);
 }

 static __inline__ void
 field_subx_RAW (
    field_a_t d,
@@ -214,40 +111,6 @@ field_add (
    field_weak_reduce ( d );
 }

 static __inline__ void
 field_subw (
    field_a_t d,
    word_t c
 ) {
    field_subw_RAW ( d, c );
    field_bias( d, 1 );
    field_weak_reduce ( d );
 }

 static __inline__ void
 field_neg (
    field_a_t d,
    const field_a_t a
 ) {
    field_neg_RAW ( d, a );
    field_bias( d, 2 );
    field_weak_reduce ( d );
 }

 /**
 * Negate a in place if doNegate.
 */
 static inline void
 __attribute__((unused,always_inline)) 
 field_cond_neg (
    field_a_t a,
    mask_t doNegate
 ) {
 	field_a_t negated;
    field_neg(negated, a);
 	constant_time_select(a, negated, a, sizeof(negated), doNegate);
 }

 /** Require the warning annotation on raw routines */
 #define ANALYZE_THIS_ROUTINE_CAREFULLY const int ANNOTATE___ANALYZE_THIS_ROUTINE_CAREFULLY = 0;
 #define MUST_BE_CAREFUL (void) ANNOTATE___ANALYZE_THIS_ROUTINE_CAREFULLY
--- a/src/p448/arch_32/p448.h
+++ b/src/p448/arch_32/p448.h
@@ -17,12 +17,6 @@ typedef struct p448_t {
 extern "C" {
 #endif

 static __inline__ void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_add_RAW (
    p448_t *out,
@@ -37,24 +31,6 @@ p448_sub_RAW (
    const p448_t *b
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_addw (
    p448_t *a,
    uint32_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_subw (
    p448_t *a,
    uint32_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_copy (
    p448_t *out,
@@ -70,11 +46,6 @@ void
 p448_strong_reduce (
    p448_t *inout
 );

 mask_t
 p448_is_zero (
    const p448_t *in
 );
             
 static __inline__ void
 p448_bias (
@@ -116,19 +87,6 @@ p448_deserialize (

 /* -------------- Inline functions begin here -------------- */

 void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x & ((1<<28)-1);
    out->limb[1] = x>>28;
    for (i=2; i<16; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p448_add_RAW (
    p448_t *out,
@@ -165,39 +123,6 @@ p448_sub_RAW (
    */
 }

 void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) {
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(uint32xn_t); i++) {
        ((uint32xn_t*)out)[i] = -((const uint32xn_t*)a)[i];
    }
    /*
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(out->limb[0]); i++) {
        out->limb[i] = -a->limb[i];
    }
    */
 }

 void
 p448_addw (
    p448_t *a,
    uint32_t x
 ) {
  a->limb[0] += x;
 }
             
 void
 p448_subw (
    p448_t *a,
    uint32_t x
 ) {
  a->limb[0] -= x;
 }

 void
 p448_copy (
    p448_t *out,
--- a/src/p448/arch_arm_32/p448.h
+++ b/src/p448/arch_arm_32/p448.h
@@ -17,12 +17,6 @@ typedef struct p448_t {
 extern "C" {
 #endif

 static __inline__ void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_add_RAW (
    p448_t *out,
@@ -37,24 +31,6 @@ p448_sub_RAW (
    const p448_t *b
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_addw (
    p448_t *a,
    uint32_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_subw (
    p448_t *a,
    uint32_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_copy (
    p448_t *out,
@@ -70,11 +46,6 @@ void
 p448_strong_reduce (
    p448_t *inout
 );

 mask_t
 p448_is_zero (
    const p448_t *in
 );
             
 static __inline__ void
 p448_bias (
@@ -116,19 +87,6 @@ p448_deserialize (

 /* -------------- Inline functions begin here -------------- */

 void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x & ((1<<28)-1);
    out->limb[1] = x>>28;
    for (i=2; i<16; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p448_add_RAW (
    p448_t *out,
@@ -165,39 +123,6 @@ p448_sub_RAW (
    */
 }

 void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) {
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(uint32xn_t); i++) {
        ((uint32xn_t*)out)[i] = -((const uint32xn_t*)a)[i];
    }
    /*
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(out->limb[0]); i++) {
        out->limb[i] = -a->limb[i];
    }
    */
 }

 void
 p448_addw (
    p448_t *a,
    uint32_t x
 ) {
  a->limb[0] += x;
 }
             
 void
 p448_subw (
    p448_t *a,
    uint32_t x
 ) {
  a->limb[0] -= x;
 }

 void
 p448_copy (
    p448_t *out,
--- a/src/p448/arch_neon_experimental/p448.h
+++ b/src/p448/arch_neon_experimental/p448.h
@@ -27,12 +27,6 @@ typedef struct p448_t {
 extern "C" {
 #endif

 static __inline__ void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_add_RAW (
    p448_t *out,
@@ -47,24 +41,6 @@ p448_sub_RAW (
    const p448_t *b
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_addw (
    p448_t *a,
    uint32_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_subw (
    p448_t *a,
    uint32_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_copy (
    p448_t *out,
@@ -80,11 +56,6 @@ void
 p448_strong_reduce (
    p448_t *inout
 );

 mask_t
 p448_is_zero (
    const p448_t *in
 );
             
 static __inline__ void
 p448_bias (
@@ -169,39 +140,6 @@ p448_sub_RAW (
    */
 }

 void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) {
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(uint32xn_t); i++) {
        ((uint32xn_t*)out)[i] = -((const uint32xn_t*)a)[i];
    }
    /*
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(out->limb[0]); i++) {
        out->limb[i] = -a->limb[i];
    }
    */
 }

 void
 p448_addw (
    p448_t *a,
    uint32_t x
 ) {
  a->limb[0] += x;
 }
             
 void
 p448_subw (
    p448_t *a,
    uint32_t x
 ) {
  a->limb[0] -= x;
 }

 void
 p448_copy (
    p448_t *out,
--- a/src/p448/arch_ref64/p448.h
+++ b/src/p448/arch_ref64/p448.h
@@ -18,12 +18,6 @@ typedef struct p448_t {
 extern "C" {
 #endif

 static __inline__ void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) __attribute__((unused));

 static __inline__ void
 p448_add_RAW (
    p448_t *out,
@@ -38,24 +32,6 @@ p448_sub_RAW (
    const p448_t *b
 ) __attribute__((unused));
             
 static __inline__ void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) __attribute__((unused));

 static __inline__ void
 p448_addw (
    p448_t *a,
    uint64_t x
 ) __attribute__((unused));
             
 static __inline__ void
 p448_subw (
    p448_t *a,
    uint64_t x
 ) __attribute__((unused));
             
 static __inline__ void
 p448_copy (
    p448_t *out,
@@ -72,22 +48,11 @@ p448_strong_reduce (
    p448_t *inout
 );

 mask_t
 p448_is_zero (
    const p448_t *in
 );

 static __inline__ void
 p448_bias (
    p448_t *inout,
    int amount
 ) __attribute__((unused));

 static __inline__ void
 p448_really_bias (
    p448_t *inout,
    int amount
 ) __attribute__((unused));
         
 void
 p448_mul (
@@ -123,18 +88,6 @@ p448_deserialize (

 /* -------------- Inline functions begin here -------------- */

 void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x;
    for (i=1; i<8; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p448_add_RAW (
    p448_t *out,
@@ -162,39 +115,6 @@ p448_sub_RAW (
    p448_weak_reduce(out);
 }

 void
 p448_neg_RAW (
    struct p448_t *out,
    const p448_t *a
 ) {
    unsigned int i;
    uint64_t co1 = ((1ull<<56)-1)*2, co2 = co1-2;
    for (i=0; i<8; i++) {
        out->limb[i] = ((i==4) ? co2 : co1) - a->limb[i];
    }
    p448_weak_reduce(out);
 }

 void
 p448_addw (
    p448_t *a,
    uint64_t x
 ) {
  a->limb[0] += x;
  a->limb[1] += a->limb[0]>>56;
  a->limb[0] &= (1ull<<56)-1;
 }
             
 void
 p448_subw (
    p448_t *a,
    uint64_t x
 ) {
  a->limb[0] -= x;
  p448_really_bias(a, 1);
  p448_weak_reduce(a);
 }

 void
 p448_copy (
    p448_t *out,
@@ -203,18 +123,6 @@ p448_copy (
    memcpy(out,a,sizeof(*a));
 }

 void
 p448_really_bias (
    p448_t *a,
    int amt
 ) {
    uint64_t co1 = ((1ull<<56)-1)*amt, co2 = co1-amt;
    int i;
    for (i=0; i<8; i++) {
        a->limb[i] += (i==4) ? co2 : co1;
    }
 }

 void
 p448_bias (
    p448_t *a,
--- a/src/p448/arch_x86_64/p448.h
+++ b/src/p448/arch_x86_64/p448.h
@@ -17,12 +17,6 @@ typedef struct p448_t {
 extern "C" {
 #endif

 static __inline__ void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_add_RAW (
    p448_t *out,
@@ -37,24 +31,6 @@ p448_sub_RAW (
    const p448_t *b
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_neg_RAW (
    p448_t *out,
    const p448_t *a
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p448_addw (
    p448_t *a,
    uint64_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_subw (
    p448_t *a,
    uint64_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p448_copy (
    p448_t *out,
@@ -71,11 +47,6 @@ p448_strong_reduce (
    p448_t *inout
 );

 mask_t
 p448_is_zero (
    const p448_t *in
 );

 static __inline__ void
 p448_bias (
    p448_t *inout,
@@ -116,18 +87,6 @@ p448_deserialize (

 /* -------------- Inline functions begin here -------------- */

 void
 p448_set_ui (
    p448_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x;
    for (i=1; i<8; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p448_add_RAW (
    p448_t *out,
@@ -164,39 +123,6 @@ p448_sub_RAW (
    */
 }

 void
 p448_neg_RAW (
    struct p448_t *out,
    const p448_t *a
 ) {
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(uint64xn_t); i++) {
        ((uint64xn_t*)out)[i] = -((const uint64xn_t*)a)[i];
    }
    /*
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(out->limb[0]); i++) {
        out->limb[i] = -a->limb[i];
    }
    */
 }

 void
 p448_addw (
    p448_t *a,
    uint64_t x
 ) {
  a->limb[0] += x;
 }
             
 void
 p448_subw (
    p448_t *a,
    uint64_t x
 ) {
  a->limb[0] -= x;
 }

 void
 p448_copy (
    p448_t *out,
--- a/src/p448/f_field.h
+++ b/src/p448/f_field.h
@@ -20,19 +20,12 @@
 #define field_add_RAW        p448_add_RAW
 #define field_sub_RAW        p448_sub_RAW
 #define field_mulw           p448_mulw
 #define field_addw           p448_addw
 #define field_subw_RAW       p448_subw
 #define field_neg_RAW        p448_neg_RAW
 #define field_set_ui         p448_set_ui
 #define field_bias           p448_bias
 #define field_inverse        p448_inverse
 #define field_eq             p448_eq
 #define field_isr            p448_isr
 #define field_simultaneous_invert p448_simultaneous_invert
 #define field_inverse        p448_inverse
 #define field_weak_reduce    p448_weak_reduce
 #define field_strong_reduce  p448_strong_reduce
 #define field_serialize      p448_serialize
 #define field_deserialize    p448_deserialize
 #define field_is_zero        p448_is_zero

 #endif /* __F_FIELD_H__ */
--- a/src/p480/arch_x86_64/p480.h
+++ b/src/p480/arch_x86_64/p480.h
@@ -17,12 +17,6 @@ typedef struct p480_t {
 extern "C" {
 #endif

 static __inline__ void
 p480_set_ui (
    p480_t *out,
    uint64_t x
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p480_add_RAW (
    p480_t *out,
@@ -37,24 +31,6 @@ p480_sub_RAW (
    const p480_t *b
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p480_neg_RAW (
    p480_t *out,
    const p480_t *a
 ) __attribute__((unused,always_inline));

 static __inline__ void
 p480_addw (
    p480_t *a,
    uint64_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p480_subw (
    p480_t *a,
    uint64_t x
 ) __attribute__((unused,always_inline));
             
 static __inline__ void
 p480_copy (
    p480_t *out,
@@ -70,11 +46,6 @@ void
 p480_strong_reduce (
    p480_t *inout
 );

 mask_t
 p480_is_zero (
    const p480_t *in
 );
  
 static __inline__ void
 p480_bias (
@@ -116,18 +87,6 @@ p480_deserialize (

 /* -------------- Inline functions begin here -------------- */

 void
 p480_set_ui (
    p480_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x;
    for (i=1; i<8; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p480_add_RAW (
    p480_t *out,
@@ -164,39 +123,6 @@ p480_sub_RAW (
    */
 }

 void
 p480_neg_RAW (
    struct p480_t *out,
    const p480_t *a
 ) {
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(uint64xn_t); i++) {
        ((uint64xn_t*)out)[i] = -((const uint64xn_t*)a)[i];
    }
    /*
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(out->limb[0]); i++) {
        out->limb[i] = -a->limb[i];
    }
    */
 }

 void
 p480_addw (
    p480_t *a,
    uint64_t x
 ) {
  a->limb[0] += x;
 }
             
 void
 p480_subw (
    p480_t *a,
    uint64_t x
 ) {
  a->limb[0] -= x;
 }

 void
 p480_copy (
    p480_t *out,
--- a/src/p480/f_field.h
+++ b/src/p480/f_field.h
@@ -20,19 +20,12 @@
 #define field_add_RAW        p480_add_RAW
 #define field_sub_RAW        p480_sub_RAW
 #define field_mulw           p480_mulw
 #define field_addw           p480_addw
 #define field_subw_RAW       p480_subw
 #define field_neg_RAW        p480_neg_RAW
 #define field_set_ui         p480_set_ui
 #define field_bias           p480_bias
 #define field_inverse        p480_inverse
 #define field_eq             p480_eq
 #define field_isr            p480_isr
 #define field_simultaneous_invert p480_simultaneous_invert
 #define field_inverse        p480_inverse
 #define field_weak_reduce    p480_weak_reduce
 #define field_strong_reduce  p480_strong_reduce
 #define field_serialize      p480_serialize
 #define field_deserialize    p480_deserialize
 #define field_is_zero        p480_is_zero

 #endif /* __F_FIELD_H__ */
--- a/src/p521/arch_ref64/p521.h
+++ b/src/p521/arch_ref64/p521.h
@@ -18,12 +18,6 @@ typedef struct p521_t {
 extern "C" {
 #endif

 static __inline__ void
 p521_set_ui (
    p521_t *out,
    uint64_t x
 ) __attribute__((unused));

 static __inline__ void
 p521_add_RAW (
    p521_t *out,
@@ -38,24 +32,6 @@ p521_sub_RAW (
    const p521_t *b
 ) __attribute__((unused));
             
 static __inline__ void
 p521_neg_RAW (
    p521_t *out,
    const p521_t *a
 ) __attribute__((unused));

 static __inline__ void
 p521_addw (
    p521_t *a,
    uint64_t x
 ) __attribute__((unused));
             
 static __inline__ void
 p521_subw (
    p521_t *a,
    uint64_t x
 ) __attribute__((unused));
             
 static __inline__ void
 p521_copy (
    p521_t *out,
@@ -72,22 +48,11 @@ p521_strong_reduce (
    p521_t *inout
 );

 mask_t
 p521_is_zero (
    const p521_t *in
 );

 static __inline__ void
 p521_bias (
    p521_t *inout,
    int amount
 ) __attribute__((unused));

 static __inline__ void
 p521_really_bias (
    p521_t *inout,
    int amount
 ) __attribute__((unused));
         
 void
 p521_mul (
@@ -123,18 +88,6 @@ p521_deserialize (

 /* -------------- Inline functions begin here -------------- */

 void
 p521_set_ui (
    p521_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x;
    for (i=1; i<9; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p521_add_RAW (
    p521_t *out,
@@ -162,39 +115,6 @@ p521_sub_RAW (
    p521_weak_reduce(out);
 }

 void
 p521_neg_RAW (
    struct p521_t *out,
    const p521_t *a
 ) {
    unsigned int i;
    uint64_t co1 = ((1ull<<58)-1)*4, co2 = ((1ull<<57)-1)*4;
    for (i=0; i<9; i++) {
        out->limb[i] = ((i==8) ? co2 : co1) - a->limb[i];
    }
    p521_weak_reduce(out);
 }

 void
 p521_addw (
    p521_t *a,
    uint64_t x
 ) {
  a->limb[0] += x;
  a->limb[1] += a->limb[0]>>58;
  a->limb[0] &= (1ull<<58)-1;
 }
             
 void
 p521_subw (
    p521_t *a,
    uint64_t x
 ) {
  a->limb[0] -= x;
  p521_really_bias(a, 1);
  p521_weak_reduce(a);
 }

 void
 p521_copy (
    p521_t *out,
@@ -203,18 +123,6 @@ p521_copy (
    memcpy(out,a,sizeof(*a));
 }

 void
 p521_really_bias (
    p521_t *a,
    int amt
 ) {
    uint64_t co1 = ((1ull<<58)-1)*2*amt, co2 = ((1ull<<57)-1)*2*amt;
    int i;
    for (i=0; i<9; i++) {
        a->limb[i] += (i==8) ? co2 : co1;
    }
 }

 void
 p521_bias (
    p521_t *a,
--- a/src/p521/arch_x86_64_r12/p521.h
+++ b/src/p521/arch_x86_64_r12/p521.h
@@ -22,12 +22,6 @@ typedef struct p521_t {
 extern "C" {
 #endif

 static __inline__ void
 p521_set_ui (
    p521_t *out,
    uint64_t x
 ) __attribute__((unused));

 static __inline__ void
 p521_add_RAW (
    p521_t *out,
@@ -42,24 +36,6 @@ p521_sub_RAW (
    const p521_t *b
 ) __attribute__((unused));
             
 static __inline__ void
 p521_neg_RAW (
    p521_t *out,
    const p521_t *a
 ) __attribute__((unused));

 static __inline__ void
 p521_addw (
    p521_t *a,
    uint64_t x
 ) __attribute__((unused));
             
 static __inline__ void
 p521_subw (
    p521_t *a,
    uint64_t x
 ) __attribute__((unused));
             
 static __inline__ void
 p521_copy (
    p521_t *out,
@@ -76,11 +52,6 @@ p521_strong_reduce (
    p521_t *inout
 );

 mask_t
 p521_is_zero (
    const p521_t *in
 );

 static __inline__ void
 p521_bias (
    p521_t *inout,
@@ -134,18 +105,6 @@ timesW (
  return u.zxyw + u.zwww;
 }

 void
 p521_set_ui (
    p521_t *out,
    uint64_t x
 ) {
    int i;
    out->limb[0] = x;
    for (i=1; i<12; i++) {
      out->limb[i] = 0;
    }
 }

 void
 p521_add_RAW (
    p521_t *out,
@@ -170,33 +129,6 @@ p521_sub_RAW (
    }
 }

 void
 p521_neg_RAW (
    struct p521_t *out,
    const p521_t *a
 ) {
    unsigned int i;
    for (i=0; i<sizeof(*out)/sizeof(uint64xn_t); i++) {
        ((uint64xn_t*)out)[i] = -((const uint64xn_t*)a)[i];
    }
 }

 void
 p521_addw (
    p521_t *a,
    uint64_t x
 ) {
    a->limb[0] += x;
 }
             
 void
 p521_subw (
    p521_t *a,
    uint64_t x
 ) {
    a->limb[0] -= x;
 }

 void
 p521_copy (
    p521_t *out,
--- a/src/p521/f_field.h
+++ b/src/p521/f_field.h
@@ -20,19 +20,12 @@
 #define field_add_RAW        p521_add_RAW
 #define field_sub_RAW        p521_sub_RAW
 #define field_mulw           p521_mulw
 #define field_addw           p521_addw
 #define field_subw_RAW       p521_subw
 #define field_neg_RAW        p521_neg_RAW
 #define field_set_ui         p521_set_ui
 #define field_bias           p521_bias
 #define field_inverse        p521_inverse
 #define field_eq             p521_eq
 #define field_isr            p521_isr
 #define field_simultaneous_invert p521_simultaneous_invert
 #define field_inverse        p521_inverse
 #define field_weak_reduce    p521_weak_reduce
 #define field_strong_reduce  p521_strong_reduce
 #define field_serialize      p521_serialize
 #define field_deserialize    p521_deserialize
 #define field_is_zero        p521_is_zero

 #endif /* __F_FIELD_H__ */
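
Note on the batch inversion touched above (gf_batch_invert in
src/decaf_fast.c, and the field_simultaneous_invert declaration in
src/include/field.h, documented as "Batch inverts out[i] = 1/in[i]. If any
input is zero, all the outputs will be zero."): the sketch below shows one
common way to get that behaviour, usually called Montgomery's trick, which
replaces n inversions with one inversion and about 3n multiplications. It
is not the library's code. The toy prime TOY_P and the names toy_mul,
toy_inverse and toy_batch_invert are assumptions made up for illustration,
and unlike gf_batch_invert (which asserts n>1) this sketch also accepts
n==1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy field: integers mod a small prime, so that products
 * fit in 64 bits.  The real code works on multi-limb gf elements. */
#define TOY_P 1000003ULL

/* Multiply two reduced elements mod TOY_P. */
static uint64_t toy_mul(uint64_t a, uint64_t b) {
    return (a * b) % TOY_P;
}

/* Returns 1/x as x^(TOY_P-2) mod TOY_P (Fermat); returns 0 when x == 0. */
static uint64_t toy_inverse(uint64_t x) {
    uint64_t result = 1, base = x % TOY_P, e = TOY_P - 2;
    while (e) {
        if (e & 1) result = toy_mul(result, base);
        base = toy_mul(base, base);
        e >>= 1;
    }
    return result;
}

/* Sets out[i] = 1/in[i] for all i using a single inversion.
 * out must not alias in.  If any input is zero, the running product is
 * zero, its "inverse" is zero, and so every output is zero. */
static void toy_batch_invert(uint64_t *out, const uint64_t *in, size_t n) {
    uint64_t acc = 1;
    size_t i;

    if (n == 0) return;

    /* Forward pass: out[i] = in[0] * ... * in[i-1], with out[0] = 1. */
    for (i = 0; i < n; i++) {
        out[i] = acc;
        acc = toy_mul(acc, in[i]);
    }

    /* One inversion of the full product. */
    acc = toy_inverse(acc);

    /* Backward pass: acc holds 1/(in[0] * ... * in[i]) on entry. */
    for (i = n; i-- > 0; ) {
        out[i] = toy_mul(out[i], acc); /* prefix product * acc = 1/in[i] */
        acc = toy_mul(acc, in[i]);     /* drop in[i] from the inverse    */
    }
}
```

Calling toy_batch_invert(out, in, 3) with in = {2, 3, 5} leaves each out[i]
satisfying toy_mul(in[i], out[i]) == 1. A constant-time implementation
would additionally need constant-time multiplication and inversion, which
this toy arithmetic does not attempt.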