← Max Bo

Off-label GF2P8AFFINEQB

Often used as a way to take the piss out of CISC ISAs, GF2P8AFFINEQB, the Galois Field Affine Transformation, part of the "Galois Field New Instructions" x86 extension, has intruiged me for a while now. Unlike GF2P8AFFINEINVQB and GF2P8MULB, which have more obvious relevance for performing the Rijndael S-box and Rijndael MixColumns steps of AES respectively, GF2P8AFFINEQB is a little more general-purpose, finding "off-label" usage outside of cryptography.

What is a Galois field?

Intel's Galois Field New Instructions (GFNI) Technology Guide gives a rather succinct definition:

A Galois Field is a field containing a finite number of elements. As with other fields, a Galois Field has well defined elements, and operations for addition, subtraction, and so on, when applied to those elements generates a result in the same field. The GF ( 2 ) field is simply a Galois Field with only two elements – 0 and 1. The addition of two values in GF ( 2 ) is equivalent to addition-modulo-2, or an exclusive-OR. The multiplication of two values in GF ( 2 ) is equivalent to multiplication-modulo-2, or the logical AND operation.
+ 0 1
0 0 1
1 1 0
× 0 1
0 0 0
1 0 1

What does GF2P8AFFINEQB do?

The GF2P8AFFINEQB instruction performs an affine transformation in GF ( 2 8 ) . For this instruction, an affine transformation is defined by A x + b where A is an 8 by 8 bit matrix, and x and b are 8-bit vectors.

Note that
[i]n typical matrix fashion, each of the rows of the matrix determines how the resulting bits of the new 8-bit integer is defined. The bytes are defined in reverse order though, as the definition of the function defines its byte-indexing in reverse(7 - i) order. Byte 0 defines how bit-7 is built, byte 1 defines how bit-6 is built, byte 2 defines how bit-5 is built, and so on.
— Wunk, gf2p8affineqb: Bit reversal

The visualizations' matrices' below are unreversed for clarity, but with the example code bit-shifting configured such that the matrix bitpattern is is in the reserved order that GF2P8AFFINEQB expects.

Identity transformation

Bit reversal

5-bit sign extension

Usage in the real world

Through GitHub searches I found that Wunkolo, author of gf2p8affineqb: Int8 shifting and gf2p8affineqb: Bit reversal , submitted optimizations to Ymir, a Sega Saturn emulator (1), and Xenia, an Xbox 360 emulator (1), (2), (3). has contributed to several projects utilizing GF2P8AFFINEQB.

Unexpected Uses for the Galois Field Affine Transformation Instruction
What are the AVX-512 Galois-field-related instructions for?

Reed-Solomon error correction

klauspost/reedsolomon https://github.com/search?q=repo%3Aklauspost%2Freedsolomon%20GF2P8AFFINEQB&type=code

PS3 Cell floating point emulation

The PS3's Cell processor's SPU does not use IEEE 754 floating point: it uses a custom floating point format, nicknamed "Xfloat" by the RPCS3 community:

Slide showing RPCS3 emulator's approach to floating-point accuracy. Title reads 'RPCS3 > CPU > SPU'. Content describes optional solutions to edge cases for accurate vs approximate xfloat modes. Accurate Xfloat emulates non-IEEE F32 behavior using ±smax-clamp and round-to-zero by: 1) Extending sign/exponent/mantissa bits from U32[4] to F64[4], 2) Doing the operation on AVX2/YMMI, 3) Bitcasting/truncating back to U32[4] with clamping and rounding. Approximate Xfloat uses <smin for math operations by applying clamps and setting denorms to 0, and for comparisons by bitcasting to integer. Right side shows God of War 3 gameplay comparison between ASMJIT (FPS: 12.02) and LLVM (FPS: 25.28) with caption thanking Nekotekina as accurate xfloat setting is no longer needed for most games with SPU LLVM Recompiler issues. Source cited as RPCS3 YouTube channel.
Alexandro Sanchez Bach, PlayStation 3 Emulation (Re)implementing the impossible

We can take a look at page 195 of the IBM Synergistic Processor Unit Instruction Set Architecture documentation for more detail:

For single-precision operations, the range of normalized numbers is extended. However, the full range defined in the standard is not implemented. The range of nonzero numbers that can be represented and operated on in the SPU is between the minimum and maximum listed in Table 9-1. Table 9-1 also demonstrates converting from a register value to a decimal value.

Table 9-1. Single-Precision (Extended-Range Mode) Minimum and Maximum Values
Number Format Minimum Positive
Magnitude (Smin)
Maximum Positive
Magnitude (Smax)
Notes
Register Value 0x00800000 0x7FFFFFFF
Bit Fields Sign 8-Bit Biased
Exponent
Fraction (implied
[1] and 23 bits)
Sign 8-Bit Biased
Exponent
Fraction (implied
[1] and 23 bits)
1
0 00000001 [1.]000...000 0 11111111 [1.]111...111
Value in Powers of 2 + 2(1 - 127) 1 + 2(255 - 127) 2 - 2-23 2
Combined Exponent and Fraction 2-126 * (+1) 2128 * (+[2 - 2-23])
Value of Register in Decimal 1.2 * 10-38 6.8 * 1038
Notes:
  1. The exponent field is biased by +127.
  2. The value 2 - 2-23 is one least significant bit (LSb) less than 2.

Zero has two representations: As inputs, both kinds of zero are supported; however, a zero result is always a positive zero.
Single-precision operations in the SPU have the following characteristics:
A different-from-IEEE exception indicates that the result produced with extended-range arithmetic could be different from the IEEE result. This occurs when one of the following conditions exists:

RPCS3 has an optimized path for shufb:

Shuffle Bytes

shufb    rt,ra,rb,rc

1 0 1 1
0 1 2 3 RT RB RA RC
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Registers RA and RB are logically concatenated with the least-significant bit of RA adjacent to the most-significant bit of RB. The bytes of the resulting value are considered to be numbered from 0 to 31.

For each byte slot in registers RC and RT:

Table 5-1. Binary Values in Register RC and Byte Results

Value in Register RC
(Expressed in Binary)
Result Byte
10xxxxxx 0x00
110xxxxx 0xFF
111xxxxx 0x80
Otherwise The byte of the concatenated register addressed by the rightmost 5 bits of register RC
Rconcat ← RA || RB
for j = 0 to 15
    b ← RCʲ
    If b₀:₁ = 0b10 then c ← 0x00
    else If b₀:₂ = 0b110 then c ← 0xFF
    else If b₀:₂ = 0b111 then c ← 0x80
    else
        b ← b & 0x1F;
        c ← Rconcatᵇ;
    RTʲ ← c
end

Here's the video that discusses it

PR

another PR

Updated version

Go GC

https://go.dev/blog/greenteagc

What can we apply it to?

<search for an application and get a speedup>

https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0 https://bitmath.blogspot.com/2023/04/not-transposing-16x16-bitmatrix.html https://github.com/riscv/riscv-bitmanip/wiki i don't even fucking know man