Important work items for Ed448-Goldilocks:

* Better architecture detection / factoring of arch-related headers.
    [PROGRESS]

* Better factoring of high-level vs low-level library.

* Factor out hash, crandom from core library?

* Signed 32-bit NEON implementation to avoid bias/reduce after subtract


* Documentation: write high-level API docs, and internal docs to help
  other implementors.
    * Partial progress on Doxygenating the code.

* Documentation: write a spec or add to Watson's

* Cleanup: rename everything consistently.
    * namespace_op or op_namespace?  namespace_op_type?
    * We don't have to be super-careful with the namespacing, because
      symbols will be scrubbed by exported.sym.

* Cleanup: hard-coded tables (probably?)
    * This reduces the work required for goldilocks_init() at the expense
      of library size.
     
    * Makes error-handling and thread safety easier.
    
    * Use the SAGE tool?

* Cleanup: unify intrinsics code
    * Word_t, mask_t, bigregister_t, etc.
    * Generate asm intrinsics with a script?

* [DONE] Bugfix: make sure that init() and randomization are thread-safe.

* [DONE] Security: check on deserialization that points are < p.
    * [NEEDS TESTING] Check also that they're nonzero or otherwise non-pathological?

* Testing:
    * Corner-case testing
    * More bulk random testing
    * Negative testing.
    * SAGE-(auto?)-generated test vectors
    * Test the Barrett fields

* Safety: add static analysis attributes for compilers that support them
    * Most functions now have warn on ignored return.

* Safety:
    * [DONE] Check for init() if it's still required once we've done the above
    * Decide what to do about RNG failures
        * abort
        * return error and zeroize
        * return error but continue if RNG is kind of mostly OK
    
* Flexibility: decide which API options are good.
    * [DONE?] Eg, should functions take nbits and table sizes?
    
    * [DONE] Remove hardcoded adjustments from comb control.
        * These adjustments make the output wrong when it's not 450 bits.
        
    * Other slow Barrett fields?  Montgomery fields?

* Mid-level API
    * Make it easier to work with untwisted Edwards objects.
    * Probably use extended or projective, not extensible coordinates.
    * Scalarmul with other cofactor modes.

* High-level API:
    * SHA512 Elligator Edition?  Maybe write a paper first.
    
    * Elligator.
        * Need to write Elligator inverse.  Might not be Elligator-2S.
    
    * FHMQV? Is this patented?
    
    * What low-level APIs to expose?
        * Edwards points with add, sub, scalarmul, =, ==, ser/deser?

* Portability: test and make clean with other compilers
    * Using a fair amount of __attribute__ code.
    * [DONE] Should work for GCC now.

* Portability: try to make the vector code as portable as possible
    * Currently using clang ext_vector_length.
    * I can't get a simple for-loop to autovectorize :-/
    * SAGE tool?

* Portability: make the inner layers of the code 32-bit clean.
    * Write new versions of the field code.
        * [DONE] 28-bit limbs give less headroom for carries.
        * [DONE] Now have a vectorless ARM version; need NEON.
        * Improve speed of 32-bit field code.
    
    * [DONE] Run through the SAGE tool to generate new bias & bound.

* [DONE] Portability: make the outer layers of the code 32-bit clean.

* [DONE] Performance/flexibility: decide which parameters should be hard-coded.
    * Perhaps useful for comb precomputation.

* Performance: Improve SHA512.
    * [DONE?] Improve portability.
    * Improve speed.
        * Except not, because this adds too much code size.
        * Link OpenSSL if a fast SHA is desired.

* Protocol:
    * Decide what things to stir into hashes for various functions.
    
* Performance: improve the Barrett field code.
    * Support other primes?
    * Capture prime shape into a struct instead of passing 3 params.
    * [DONE] Make 32-bit clean.

* Automation:
    * Improve the SAGE tool to cover more cases
        * Real SSA classes to cover branching and looping
        * Constant-time selection
        * Intrinsics code
        * Field code?
    
    * SAGE tool is impossibly slow on 32-bit
         * Currently stuck on Elligator after 19 hours.
         * [FIXED] at least for now.
        
    * Vector-mul-chains
    * Negation "bubble pushing" optimization

* Clear other TODO/FIXME/HACK/PERF items in the code

* [DONE?] Submit to SUPERCOP