We improve the performance of the lattice-based cryptosystem Dilithium on AVX2 and NEON by exploiting its algorithmic properties, such as small coefficient bounds and high sparsity, together with the distinct instruction-level profiles of the underlying architectures. On AVX2, we deploy a single-modulus 16-bit NTT for $c \cdot \mathbf{s}_i$ and a multi-moduli 16-bit NTT coupled with a vectorized CRT reconstruction for $c \cdot \mathbf{t}_0$. These instruction-level optimizations accelerate the res