Abstract: Data-parallel workloads, such as machine learning, computer vision, and data analytics, increasingly run on mobile SoCs (Systems on Chip) with SIMD (Single Instruction, Multiple Data) engines ...