Off-label GF2P8AFFINEQB

December 21, 2025

Often used as a way to take the piss out of CISC ISAs, GF2P8AFFINEQB, the Galois Field Affine Transformation, part of the "Galois Field New Instructions" x86 extension (GFNI), has intruiged me for a while now.

GFNI's 3 instructions, GF2P8AFFINEQB GF2P8AFFINEINVQB, GF2P8MULB have obvious applications for the Rijndael S-box and Rijndael MixColumns steps of AES.

But GF2P8AFFINEQB has found general-purpose, "off-label" usage outside of cryptography, particularly in the game console emulation community.

Let's investigate.

What is a Galois field?

Intel's Galois Field New Instructions (GFNI) Technology Guide gives a rather succinct definition:

A Galois Field is a field containing a finite number of elements. As with other fields, a Galois Field has well defined elements, and operations for addition, subtraction, and so on, when applied to those elements generates a result in the same field. The $GF(2)$ field is simply a Galois Field with only two elements – 0 and 1. The addition of two values in $GF(2)$ is equivalent to addition-modulo-2, or an exclusive-OR. The multiplication of two values in $GF(2)$ is equivalent to multiplication-modulo-2, or the logical AND operation.

+	0	1
0	0	1
1	1	0

×	0	1
0	0	0
1	0	1

What does GF2P8AFFINEQB do?

The GF2P8AFFINEQB instruction performs an affine transformation in $GF(2^8)$. For this instruction, an affine transformation is defined by $A \cdot x + b$ where $A$ is an 8 by 8 bit matrix, and $x$ and $b$ are 8-bit vectors.

Thus: $$ \begin{aligned} A \cdot x + b \quad & \\ &= \begin{bmatrix} a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} & a_{0,4} & a_{0,5} & a_{0,6} & a_{0,7} \\ a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} & a_{1,5} & a_{1,6} & a_{1,7} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} & a_{2,5} & a_{2,6} & a_{2,7} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} & a_{3,5} & a_{3,6} & a_{3,7} \\ a_{4,0} & a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} & a_{4,5} & a_{4,6} & a_{4,7} \\ a_{5,0} & a_{5,1} & a_{5,2} & a_{5,3} & a_{5,4} & a_{5,5} & a_{5,6} & a_{5,7} \\ a_{6,0} & a_{6,1} & a_{6,2} & a_{6,3} & a_{6,4} & a_{6,5} & a_{6,6} & a_{6,7} \\ a_{7,0} & a_{7,1} & a_{7,2} & a_{7,3} & a_{7,4} & a_{7,5} & a_{7,6} & a_{7,7} \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} \\ &= \begin{bmatrix} (a_{0,0} \times x_0) + (a_{0,1} \times x_1) + (a_{0,2} \times x_2) + (a_{0,3} \times x_3) + (a_{0,4} \times x_4) + (a_{0,5} \times x_5) + (a_{0,6} \times x_6) + (a_{0,7} \times x_7) + b_0 \\ (a_{1,0} \times x_0) + (a_{1,1} \times x_1) + (a_{1,2} \times x_2) + (a_{1,3} \times x_3) + (a_{1,4} \times x_4) + (a_{1,5} \times x_5) + (a_{1,6} \times x_6) + (a_{1,7} \times x_7) + b_1 \\ (a_{2,0} \times x_0) + (a_{2,1} \times x_1) + (a_{2,2} \times x_2) + (a_{2,3} \times x_3) + (a_{2,4} \times x_4) + (a_{2,5} \times x_5) + (a_{2,6} \times x_6) + (a_{2,7} \times x_7) + b_2 \\ (a_{3,0} \times x_0) + (a_{3,1} \times x_1) + (a_{3,2} \times x_2) + (a_{3,3} \times x_3) + (a_{3,4} \times x_4) + (a_{3,5} \times x_5) + (a_{3,6} \times x_6) + (a_{3,7} \times x_7) + b_3 \\ (a_{4,0} \times x_0) + (a_{4,1} \times x_1) + (a_{4,2} \times x_2) + (a_{4,3} \times x_3) + (a_{4,4} \times x_4) + (a_{4,5} \times x_5) + (a_{4,6} \times x_6) + (a_{4,7} \times x_7) + b_4 \\ (a_{5,0} \times x_0) + (a_{5,1} \times x_1) + (a_{5,2} \times x_2) + (a_{5,3} \times x_3) + (a_{5,4} \times x_4) + (a_{5,5} \times x_5) + (a_{5,6} \times x_6) + (a_{5,7} \times x_7) + b_5 \\ (a_{6,0} \times x_0) + (a_{6,1} \times x_1) + (a_{6,2} \times x_2) + (a_{6,3} \times x_3) + (a_{6,4} \times x_4) + (a_{6,5} \times x_5) + (a_{6,6} \times x_6) + (a_{6,7} \times x_7) + b_6 \\ (a_{7,0} \times x_0) + (a_{7,1} \times x_1) + (a_{7,2} \times x_2) + (a_{7,3} \times x_3) + (a_{7,4} \times x_4) + (a_{7,5} \times x_5) + (a_{7,6} \times x_6) + (a_{7,7} \times x_7) + b_7 \end{bmatrix} \end{aligned} $$

Note that

[i]n typical matrix fashion, each of the rows of the matrix determines how the resulting bits of the new 8-bit integer is defined. The bytes are defined in reverse order though, as the definition of the function defines its byte-indexing in reverse(7 - i) order. Byte 0 defines how bit-7 is built, byte 1 defines how bit-6 is built, byte 2 defines how bit-5 is built, and so on.
— Wunk, gf2p8affineqb: Bit reversal

The visualizations' matrices' below are unreversed for clarity, but with the example code bit-shifting configured such that the matrix bitpattern is is in the reserved order that GF2P8AFFINEQB expects.

Identity transformation

Bit reversal

5-bit sign extension

Usage in the real world

Through GitHub searches I found that Wunkolo, author of gf2p8affineqb: Int8 shifting and gf2p8affineqb: Bit reversal , submitted optimizations to Ymir, a Sega Saturn emulator (1), and Xenia, an Xbox 360 emulator (1), (2), (3). has contributed to several projects utilizing GF2P8AFFINEQB.

Unexpected Uses for the Galois Field Affine Transformation Instruction
What are the AVX-512 Galois-field-related instructions for?

Reed-Solomon error correction

klauspost/reedsolomon https://github.com/search?q=repo%3Aklauspost%2Freedsolomon%20GF2P8AFFINEQB&type=code

PS3 Cell floating point emulation

The PS3's Cell processor's SPU does not use IEEE 754 floating point: it uses a custom floating point format, nicknamed "Xfloat" by the RPCS3 community:

Slide showing RPCS3 emulator's approach to floating-point accuracy. Title reads 'RPCS3 > CPU > SPU'. Content describes optional solutions to edge cases for accurate vs approximate xfloat modes. Accurate Xfloat emulates non-IEEE F32 behavior using ±smax-clamp and round-to-zero by: 1) Extending sign/exponent/mantissa bits from U32[4] to F64[4], 2) Doing the operation on AVX2/YMMI, 3) Bitcasting/truncating back to U32[4] with clamping and rounding. Approximate Xfloat uses <smin for math operations by applying clamps and setting denorms to 0, and for comparisons by bitcasting to integer. Right side shows God of War 3 gameplay comparison between ASMJIT (FPS: 12.02) and LLVM (FPS: 25.28) with caption thanking Nekotekina as accurate xfloat setting is no longer needed for most games with SPU LLVM Recompiler issues. Source cited as RPCS3 YouTube channel. — Alexandro Sanchez Bach, PlayStation 3 Emulation (Re)implementing the impossible

We can take a look at page 195 of the IBM Synergistic Processor Unit Instruction Set Architecture documentation for more detail:

For single-precision operations, the range of normalized numbers is extended. However, the full range defined in the standard is not implemented. The range of nonzero numbers that can be represented and operated on in the SPU is between the minimum and maximum listed in Table 9-1. Table 9-1 also demonstrates converting from a register value to a decimal value.

Table 9-1. Single-Precision (Extended-Range Mode) Minimum and Maximum Values

Number Format Minimum Positive
Magnitude (Smin) Maximum Positive
Magnitude (Smax) Notes

Register Value 0x00800000 0x7FFFFFFF

Bit Fields Sign 8-Bit Biased
Exponent Fraction (implied
[1] and 23 bits) Sign 8-Bit Biased
Exponent Fraction (implied
[1] and 23 bits) 1

0 00000001 [1.]000...000 0 11111111 [1.]111...111

Value in Powers of 2 + 2^{(1 - 127)} 1 + 2^{(255 - 127)} 2 - 2^-23 2

Combined Exponent and Fraction 2^-126 * (+1) 2¹²⁸ * (+[2 - 2^-23])

Value of Register in Decimal 1.2 * 10^-38 6.8 * 10³⁸

Notes:

The exponent field is biased by +127.

The value 2 - 2^-23 is one least significant bit (LSb) less than 2.

Zero has two representations:

For a positive zero, all bits are zero; that is, the sign, exponent, and fraction are zero.

For a negative zero, the sign is one; that is, the exponent and fraction are zero.

As inputs, both kinds of zero are supported; however, a zero result is always a positive zero.
Single-precision operations in the SPU have the following characteristics:

Not a Number (NaN) is not supported as an operand and is not produced as a result.

Infinity (Inf) is not supported. An operation that produces a magnitude greater than the largest number representable in the target floating-point format instead produces a number with the appropriate sign, the largest biased exponent, and a magnitude of all (binary) ones. It is important to note that the representation of Inf, which conforms to the IEEE standard, is interpreted by the SPU as a number that is smaller than the largest number used on the SPU.

Denorms are not supported and are treated as zero. Thus, an operation that would generate a denorm under IEEE rules instead generates a positive zero. If a denorm is used as an operand, it is treated as a zero.

The only supported rounding mode is truncation (toward zero).

Number Format	Minimum Positive Magnitude (Smin)	Maximum Positive Magnitude (Smax)	Notes
Register Value	0x00800000	0x7FFFFFFF
Bit Fields	Sign	8-Bit Biased Exponent	Fraction (implied [1] and 23 bits)	Sign	8-Bit Biased Exponent	Fraction (implied [1] and 23 bits)	1
	0	00000001	[1.]000...000	0	11111111	[1.]111...111
Value in Powers of 2	+	2^{(1 - 127)}	1	+	2^{(255 - 127)}	2 - 2^-23	2
Combined Exponent and Fraction	2^-126 * (+1)	2¹²⁸ * (+[2 - 2^-23])
Value of Register in Decimal	1.2 * 10^-38	6.8 * 10³⁸
Notes: The exponent field is biased by +127. The value 2 - 2^-23 is one least significant bit (LSb) less than 2.

A different-from-IEEE exception indicates that the result produced with extended-range arithmetic could be different from the IEEE result. This occurs when one of the following conditions exists:

Any of the inputs or the result has a maximal exponent (IEEE arithmetic treats such an operand as NaN or Infinity; extended-range arithmetic treats them as normalized values.)

Any of the inputs has a zero exponent and a nonzero fraction (IEEE arithmetic treats such an oper- and as a denormal number; extended-range arithmetic treats them as a zero.)

An underflow occurs; that is, the result before rounding is different from zero and the result after rounding is zero.

RPCS3 has an optimized path for shufb:

Shuffle Bytes

shufb rt,ra,rb,rc

1 0 1 1

0 1 2 3 RT RB RA RC

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Registers RA and RB are logically concatenated with the least-significant bit of RA adjacent to the most-significant bit of RB. The bytes of the resulting value are considered to be numbered from 0 to 31.

For each byte slot in registers RC and RT:

The value in register RC is examined, and a result byte is produced as shown in Table 5-1.

The result byte is inserted into register RT.

Table 5-1. Binary Values in Register RC and Byte Results

Value in Register RC
(Expressed in Binary) Result Byte

10xxxxxx 0x00

110xxxxx 0xFF

111xxxxx 0x80

Otherwise The byte of the concatenated register addressed by the rightmost 5 bits of register RC
Rconcat ← RA || RB
for j = 0 to 15
    b ← RCʲ
    If b₀:₁ = 0b10 then c ← 0x00
    else If b₀:₂ = 0b110 then c ← 0xFF
    else If b₀:₂ = 0b111 then c ← 0x80
    else
        b ← b & 0x1F;
        c ← Rconcatᵇ;
    RTʲ ← c
end

Value in Register RC (Expressed in Binary)	Result Byte
10xxxxxx	0x00
110xxxxx	0xFF
111xxxxx	0x80
Otherwise	The byte of the concatenated register addressed by the rightmost 5 bits of register RC

Here's the video that discusses it

another PR

Updated version

Go GC

https://go.dev/blog/greenteagc

What can we apply it to?

https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0 https://bitmath.blogspot.com/2023/04/not-transposing-16x16-bitmatrix.html https://github.com/riscv/riscv-bitmanip/wiki i don't even fucking know man

1	0	1	1
0	1	2	3	RT							RB							RA							RC
0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31

1	0	1	1
0	1	2	3	RT							RB							RA							RC
0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31

1	0	1	1
0	1	2	3	RT							RB							RA							RC
0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31