We improve the performance of the lattice-based cryptosystem Dilithium on AVX2 and NEON by deeply exploiting its algorithmic properties, such as small coefficient bounds and high sparsity, together with the distinct instruction-level profiles of the underlying architectures. On AVX2, we deploy a single-modulus 16-bit NTT for and a multi-moduli 16-bit NTT coupled with a vectorized CRT reconstruction for . These instruction-level optimizations accelerate the res
Parameter-Aware and Instruction-Driven Dilithium Optimization on AVX2 and NEON
Wang Kunpeng
