Off-label GF2P8AFFINEQB
Often used as a way to take the piss out of CISC ISAs, GF2P8AFFINEQB, the Galois Field Affine Transformation, part of the "Galois Field New Instructions" x86 extension, has intruiged me for a while now. Unlike GF2P8AFFINEINVQB and GF2P8MULB, which have more obvious relevance for performing the Rijndael S-box and Rijndael MixColumns steps of AES respectively, GF2P8AFFINEQB is a little more general-purpose, finding "off-label" usage outside of cryptography.
What is a Galois field?
Intel's Galois Field New Instructions (GFNI) Technology Guide gives a rather succinct definition:
A Galois Field is a field containing a finite number of elements. As with other fields, a Galois Field has well defined elements, and operations for addition, subtraction, and so on, when applied to those elements generates a result in the same field. The field is simply a Galois Field with only two elements – 0 and 1. The addition of two values in is equivalent to addition-modulo-2, or an exclusive-OR. The multiplication of two values in is equivalent to multiplication-modulo-2, or the logical AND operation.
| + | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| × | 0 | 1 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
What does GF2P8AFFINEQB do?
The GF2P8AFFINEQB instruction performs an affine transformation in . For this instruction, an affine transformation is defined by where is an 8 by 8 bit matrix, and and are 8-bit vectors.
Note that[i]n typical matrix fashion, each of the rows of the matrix determines how the resulting bits of the new 8-bit integer is defined. The bytes are defined in reverse order though, as the definition of the function defines its byte-indexing in reverse(7 - i) order. Byte 0 defines how bit-7 is built, byte 1 defines how bit-6 is built, byte 2 defines how bit-5 is built, and so on.— Wunk, gf2p8affineqb: Bit reversal
The visualizations' matrices' below are unreversed for clarity, but with the example code bit-shifting configured such that the matrix bitpattern is is in the reserved order that GF2P8AFFINEQB expects.
Identity transformation
Bit reversal
5-bit sign extension
Usage in the real world
Through GitHub searches I found that Wunkolo, author of gf2p8affineqb: Int8 shifting and gf2p8affineqb: Bit reversal , submitted optimizations to Ymir, a Sega Saturn emulator (1), and Xenia, an Xbox 360 emulator (1), (2), (3). has contributed to several projects utilizing GF2P8AFFINEQB.
Unexpected Uses for the Galois Field Affine Transformation InstructionWhat are the AVX-512 Galois-field-related instructions for?
Reed-Solomon error correction
klauspost/reedsolomon https://github.com/search?q=repo%3Aklauspost%2Freedsolomon%20GF2P8AFFINEQB&type=codePS3 Cell floating point emulation
The PS3's Cell processor's SPU does not use IEEE 754 floating point: it uses a custom floating point format, nicknamed "Xfloat" by the RPCS3 community:
We can take a look at page 195 of the IBM Synergistic Processor Unit Instruction Set Architecture documentation for more detail:
For single-precision operations, the range of normalized numbers is extended. However, the full range defined in the standard is not implemented. The range of nonzero numbers that can be represented and operated on in the SPU is between the minimum and maximum listed in Table 9-1. Table 9-1 also demonstrates converting from a register value to a decimal value.
Table 9-1. Single-Precision (Extended-Range Mode) Minimum and Maximum Values
Number Format Minimum Positive
Magnitude (Smin)Maximum Positive
Magnitude (Smax)Notes Register Value 0x00800000 0x7FFFFFFF Bit Fields Sign 8-Bit Biased
ExponentFraction (implied
[1] and 23 bits)Sign 8-Bit Biased
ExponentFraction (implied
[1] and 23 bits)1 0 00000001 [1.]000...000 0 11111111 [1.]111...111 Value in Powers of 2 + 2(1 - 127) 1 + 2(255 - 127) 2 - 2-23 2 Combined Exponent and Fraction 2-126 * (+1) 2128 * (+[2 - 2-23]) Value of Register in Decimal 1.2 * 10-38 6.8 * 1038 Notes:
- The exponent field is biased by +127.
- The value 2 - 2-23 is one least significant bit (LSb) less than 2.
Zero has two representations:As inputs, both kinds of zero are supported; however, a zero result is always a positive zero.
- For a positive zero, all bits are zero; that is, the sign, exponent, and fraction are zero.
- For a negative zero, the sign is one; that is, the exponent and fraction are zero.
Single-precision operations in the SPU have the following characteristics:
- Not a Number (NaN) is not supported as an operand and is not produced as a result.
- Infinity (Inf) is not supported. An operation that produces a magnitude greater than the largest number representable in the target floating-point format instead produces a number with the appropriate sign, the largest biased exponent, and a magnitude of all (binary) ones. It is important to note that the representation of Inf, which conforms to the IEEE standard, is interpreted by the SPU as a number that is smaller than the largest number used on the SPU.
- Denorms are not supported and are treated as zero. Thus, an operation that would generate a denorm under IEEE rules instead generates a positive zero. If a denorm is used as an operand, it is treated as a zero.
- The only supported rounding mode is truncation (toward zero).
A different-from-IEEE exception indicates that the result produced with extended-range arithmetic could be different from the IEEE result. This occurs when one of the following conditions exists:
- Any of the inputs or the result has a maximal exponent (IEEE arithmetic treats such an operand as NaN or Infinity; extended-range arithmetic treats them as normalized values.)
- Any of the inputs has a zero exponent and a nonzero fraction (IEEE arithmetic treats such an oper- and as a denormal number; extended-range arithmetic treats them as a zero.)
- An underflow occurs; that is, the result before rounding is different from zero and the result after rounding is zero.
RPCS3 has an optimized path for shufb:
Shuffle Bytes
shufb rt,ra,rb,rc
1 0 1 1 0 1 2 3 RT RB RA RC 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Registers RA and RB are logically concatenated with the least-significant bit of RA adjacent to the most-significant bit of RB. The bytes of the resulting value are considered to be numbered from 0 to 31.
For each byte slot in registers RC and RT:
- The value in register RC is examined, and a result byte is produced as shown in Table 5-1.
- The result byte is inserted into register RT.
Table 5-1. Binary Values in Register RC and Byte Results
Value in Register RC
(Expressed in Binary)Result Byte 10xxxxxx 0x00 110xxxxx 0xFF 111xxxxx 0x80 Otherwise The byte of the concatenated register addressed by the rightmost 5 bits of register RC Rconcat ← RA || RB for j = 0 to 15 b ← RCʲ If b₀:₁ = 0b10 then c ← 0x00 else If b₀:₂ = 0b110 then c ← 0xFF else If b₀:₂ = 0b111 then c ← 0x80 else b ← b & 0x1F; c ← Rconcatᵇ; RTʲ ← c end
Here's the video that discusses it
Go GC
https://go.dev/blog/greenteagc
What can we apply it to?
<search for an application and get a speedup>
https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0 https://bitmath.blogspot.com/2023/04/not-transposing-16x16-bitmatrix.html https://github.com/riscv/riscv-bitmanip/wiki i don't even fucking know man