As I reached the halfway point of my Outreachy internship, I wanted to share what I’ve been working on for the AArch64 back-end in LLVM. My first task involved optimizing a pattern of two large vector shifts followed by a vector combine, which can be narrowed down to a single combine and one vector shift.
The Problem
We start with the following unoptimized code, where two right shifts by 20 bits (ushr) are applied to two vectors, followed by a combine instruction (uzp1) that gathers the least significant 16 bits of each 32-bit element from both vectors into a new vector.
#include <arm_neon.h>

/*
ushr v0.4s, v0.4s, #20
ushr v1.4s, v1.4s, #20
uzp1 v0.8h, v0.8h, v1.8h
*/
uint16x8_t unoptimized(uint32x4_t a, uint32x4_t b) {
a = vshrq_n_u32(a, 20);
b = vshrq_n_u32(b, 20);
return vcombine_u16(vmovn_u32(a), vmovn_u32(b));
}
In this example, because the shift amount is greater than the output scalar bit size (20 > 16), we can skip shifting both vectors and instead take the most significant 16 bits of each element directly. Since the original vectors are already at the 128-bit register limit, we first need to narrow the elements down to 16 bits, which lets us combine the two halves into a single 16×8 vector.
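To convince myself the rewrite is sound, it helps to check the bit arithmetic in plain scalar code. The snippet below is a quick sanity check of my own, not part of the patch: shifting right by 20 and keeping the low 16 bits gives the same result as taking the top 16 bits first and then shifting right by the leftover 4, since 20 = 16 + 4.

#include <assert.h>
#include <stdint.h>

void check_equivalence(uint32_t x) {
    /* Original order: shift the 32-bit value right by 20, then truncate. */
    uint16_t shift_then_narrow = (uint16_t)(x >> 20);
    /* Optimized order: take the top 16 bits (what uzp2 does per element),
       then apply the leftover shift of 20 - 16 = 4. */
    uint16_t top_half = (uint16_t)(x >> 16);
    uint16_t narrow_then_shift = top_half >> 4;
    assert(shift_then_narrow == narrow_then_shift);
}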
In the optimized version, we narrow the 32-bit elements by taking their most significant 16 bits directly (uzp2) and then shift the combined vector right by the remaining 4 bits (ushr).
/*
uzp2 v0.8h, v0.8h, v1.8h
ushr v0.8h, v0.8h, #4
*/
uint16x8_t optimized(uint32x4_t a, uint32x4_t b) {
uint16x4_t a_u16 = vshrn_n_u32(a, 16);
uint16x4_t b_u16 = vshrn_n_u32(b, 16);
uint16x8_t r = vcombine_u16(a_u16, b_u16);
return vshrq_n_u16(r, 4);
}
In this version, the sequence drops from three vector instructions to two, cutting out a shift and giving a small speedup.
TableGen approach
At first, I thought I could handle this optimization with a few lines of TableGen, LLVM’s domain-specific language for defining instructions, patterns, and optimizations. I believed it would be straightforward, but I soon found myself struggling for nearly a month, even with guidance from my mentor.
Challenges with TableGen
The main issue I struggled with was a fundamental misunderstanding of how concat_vectors worked in TableGen. I initially thought concat_vectors was just a simple operation gluing nodes together for the purpose of the pattern, but in reality it joins multiple smaller vectors together into a single wider vector.
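To make that concrete, here is a small illustration of my own (not code from the patch): the vcombine_u16 intrinsic from the earlier examples is exactly the kind of operation that becomes a concat_vectors node, joining two real 64-bit vectors into one 128-bit vector.

#include <arm_neon.h>

/* Joins two real v4i16 values into one v8i16 value; in the SelectionDAG
   this shows up as concat_vectors (v4i16, v4i16) -> v8i16, not as glue
   between pattern fragments. */
uint16x8_t concat_example(uint16x4_t lo, uint16x4_t hi) {
    return vcombine_u16(lo, hi);
}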
I only came to this realization while working on the C++ custom lowering version of the fix, by which point it was easier to finish the custom lowering than to go back to TableGen.
By then I had been stuck on the TableGen approach for around a month, so the logical conclusion was that it might not be the right tool for this particular optimization, and I shifted my focus to implementing the optimization directly in C++ with a custom lowering.
Shifting to C++
Once I switched to C++, things started to make more sense. The optimization code essentially checks whether both operands of a concat_vector are right shifts (ushr). If the shift amounts are equal and greater than the scalar element size in bits of the concat_vector, it replaces the concat_vector with a uzp2 followed by a single shift.
Here’s a simplified version of the code logic:
- Check for Right Shifts: Ensure both operands of the concat_vector are right shifts.
- Compare Shift Amounts: Verify that the shift amounts are the same and greater than the size of the scalar elements.
- Replace with Optimized Sequence: Replace the concat_vector with a uzp2 instruction followed by a single shift (a sketch of this logic follows the list).
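Below is a minimal sketch of that logic in C++, written in the style of a DAG combine in AArch64ISelLowering.cpp. The node names (ISD::TRUNCATE, AArch64ISD::VLSHR for the immediate vector right shift, AArch64ISD::UZP2) are real names in the back-end, but the function itself is my simplified illustration; the actual patch carries more type and legality checks.

static SDValue performConcatShiftCombine(SDNode *N, SelectionDAG &DAG) {
  // N is a concat_vectors node, e.g. (v8i16 concat_vectors (v4i16 ...),
  // (v4i16 ...)), which is what vcombine_u16 produces.
  EVT VT = N->getValueType(0);
  unsigned ScalarBits = VT.getScalarSizeInBits();

  // Step 1: each operand should be a truncate of a vector right shift by a
  // constant, which is how vmovn_u32(vshrq_n_u32(x, 20)) looks in the DAG.
  auto getShift = [](SDValue Op) -> SDValue {
    if (Op.getOpcode() == ISD::TRUNCATE)
      Op = Op.getOperand(0);
    if (Op.getOpcode() == AArch64ISD::VLSHR &&
        isa<ConstantSDNode>(Op.getOperand(1)))
      return Op;
    return SDValue();
  };

  SDValue Shift0 = getShift(N->getOperand(0));
  SDValue Shift1 = getShift(N->getOperand(1));
  if (!Shift0 || !Shift1)
    return SDValue();

  // Step 2: the shift amounts must match and exceed the narrow element size
  // (20 > 16 in the running example).
  uint64_t ShiftAmt = Shift0.getConstantOperandVal(1);
  if (ShiftAmt != Shift1.getConstantOperandVal(1) || ShiftAmt <= ScalarBits)
    return SDValue();

  // Step 3: uzp2 keeps the high 16 bits of every 32-bit element from both
  // inputs; the leftover shift (20 - 16 = 4) is applied once to the result.
  SDLoc DL(N);
  SDValue Uzp2 = DAG.getNode(AArch64ISD::UZP2, DL, VT,
                             DAG.getBitcast(VT, Shift0.getOperand(0)),
                             DAG.getBitcast(VT, Shift1.getOperand(0)));
  return DAG.getNode(AArch64ISD::VLSHR, DL, VT, Uzp2,
                     DAG.getConstant(ShiftAmt - ScalarBits, DL, MVT::i32));
}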
This approach was much more straightforward in C++, and I was able to implement the optimization successfully. The patch will soon be up for review, and I’m excited to see the feedback.
Closing notes
While this has all taken longer than I expected, I am excited to be working on the AArch64 back-end. I never expected that working on something so hard to wrap my head around would make me feel so proud of myself and my ability to learn.