As I reached the halfway point of my Outreachy internship, I wanted to share what I’ve been working on for the AArch64 back-end in LLVM. My first task involved optimizing a pattern of two large vector shifts followed by a combine, which can be narrowed down to a single combine and one vector shift.

## The Problem

We start with the following unoptimized code, where two right shifts by 20 bits (`ushr`) are applied to two vectors, followed by a combine instruction (`uzp1`) that takes the least significant half of the bits of each element from both vectors and packs them into a new vector.

```
/*
ushr v0.4s, v0.4s, #20
ushr v1.4s, v1.4s, #20
uzp1 v0.8h, v0.8h, v1.8h
*/
uint16x8_t unoptimized(uint32x4_t a, uint32x4_t b) {
a = vshrq_n_u32(a, 20);
b = vshrq_n_u32(b, 20);
return vcombine_u16(vmovn_u32(a), vmovn_u32(b));
}
```

In this example, because the shift amount is greater than the output scalar bit width (20 > 16), we can skip shifting both vectors and instead take the most significant 16 bits of each element directly, placing them into a new vector. Since the original vectors are already at the 128-bit register limit, we first need to narrow them down to 16-bit pieces. This allows us to combine them into a 16×8 vector.

In the optimized version, we narrow the 32-bit scalars by taking the most significant 16 bits of each directly (`uzp2`), and then shift the combined vector right by 4 (`ushr`).

```
/*
uzp2 v0.8h, v0.8h, v1.8h
ushr v0.8h, v0.8h, #4
*/
uint16x8_t optimized(uint32x4_t a, uint32x4_t b) {
uint16x4_t a_u16 = vshrn_n_u32(a, 16);
uint16x4_t b_u16 = vshrn_n_u32(b, 16);
uint16x8_t r = vcombine_u16(a_u16, b_u16);
return vshrq_n_u16(r, 4);
}
```

In this version, we save one shift instruction, resulting in a small speedup.

## TableGen approach

At first, I thought I could handle this optimization with a few lines of TableGen, the domain-specific language used in LLVM to define instructions, patterns, and optimizations. I believed it would be straightforward, but I soon found myself struggling for nearly a month, even with guidance from my mentor.

### Challenges with TableGen

The main issue I struggled with was a fundamental misunderstanding of how `concat_vectors` worked in TableGen. I initially thought `concat_vectors` was just a simple operation gluing nodes together for the purposes of the pattern, but in reality it joins multiple smaller vectors together into a single larger vector.

This realization came while I was working on the C++ custom-lowering version of the fix, at which point it was easier to finish the custom lowering than to go back to TableGen.

After being stuck on the TableGen attempt for around a month, I concluded that it might not be the right tool for this particular optimization, and shifted my focus to implementing the optimization directly in C++ with a custom lowering.

## Shifting to C++

Once I switched to C++, things started to make more sense. The optimization code essentially checks whether both operands of a `concat_vectors` are right shifts (`ushr`). If the shift amounts are equal and greater than the scalar size in bits of the `concat_vectors` result, it replaces the `concat_vectors` with a `uzp2` followed by a shift.

Here’s a simplified version of the code logic:

1. **Check for right shifts**: ensure both operands of the `concat_vectors` are right shifts.
2. **Compare shift amounts**: verify that the shift amounts are the same and greater than the size of the scalar elements.
3. **Replace with optimized sequence**: replace the `concat_vectors` with a `uzp2` instruction followed by a single shift.

This approach was much more straightforward in C++, and I was able to implement the optimization successfully. The code will soon be under review, and I’m excited to see the feedback.

## Closing notes

While this all took longer than I was expecting, I am excited to keep working on the AArch64 back-end. I never expected that working on something so hard to wrap my head around would make me feel so proud of myself and my ability to learn.
