Whirlpool, in cryptol

editor’s note: this is based on some heavily edited notes, originally published on June 8th, 2017.

Permutation functions are the underlying primitive of most modern block ciphers and hash functions, which are in turn the building blocks of most digital security. Once you have a permutation function, you can put it to work a couple of different ways; such as Merkle–Damgård, or a block cipher.

Here I’ll be discussing a sketch of one such function: the whirlpool permutation, used in both the (you guessed it) whirlpool hashing function; though my original encounter with it was from the much newer whirlbob AEAD function.whirlbob was a rejected candidate for the CAESAR competition. While it has since languished, I still see it as solid example of simple and straightforward cryptographic design. But I won’t be using C or Haskell to describe it—today we will be using cryptol to define it instead.

A first approximation of this function, which we’ll call wbob_pi, has a skeletal definition that the following C code can demonstrate:

⊕The origin of the name “π” for referring to permutation functions is unknown to me, but was used in the description of whirlbob and so I use it here.

#define ROUNDS 12

void
wbob_pi(uint8_t state[64])
{
  for (int i = 0p; i < ROUNDS; i++) {
    // perform in-place modifications to 'state'
  }
  return; // finished, 'state' is permuted
}

To implement this in Cryptol, the first step is to work out the type of the wbob_pi function and go from there. It’s quite easy to describe the type of a permutation function—as a function from a bag of bytes to another bag of bytes:

wbob_pi : { r }           // The number of rounds
        ( fin r, r >= 1   // Must have at least 1 round
        , 8 >= width r    // inferred: size constraint
        ) => [64][8]      // Input state: 64-bytes
          -> [64][8]      // Output state: 64-bytes
wbob_pi input = ...

The only notable part of the above type is that it is parametric in the number of rounds applied, unlike the C implementation.Obviously any competent C compiler could handle passing in a rounds parameter to handle this as well, but it’s not materially relevent here. Any concrete use of this function has to choose the r parameter to specify this.

Number of rounds is a significant factor in both the security margin and speed of the underlying primitive. The π function, in our case, comes with two choices of rounds that are used “in the wild”: 10 rounds as used in the original whirlpool, and 12 rounds when used as part of whirlbob:

// WhirlPool 3.0 permutation function, also known as the `W` function
wpool : [64][8] -> [64][8]
wpool = wbob_pi`{r=10}

// The "WhirlBob" permutation, using "miniboxes" to compute sbox
// This is the same as the `W` function, but with 12 rounds instead of 10.
wbob : [64][8] -> [64][8]
wbob = wbob_pi`{r=12}

Describing the recurrence relation

The original sketch of the C code clearly had two separate parts: the loop and the loop body. The body actually does a single pass of state manipulation, while the loop itself applies this given a number of rounds.

This seems obvious, but in Cryptol, loops are typically the trickiest part of an algorithm to model.While this sounds absurd at a glance, it is not difficult to model most loops in practice as we’ll see. Because Cryptol is both purely functional and symbolic, it does not allow general recursion,Symbolic execution is what allows Cryptol to generate test vectors and perform equivalence checking; yet it prohibits general recursion, as algorithms using it almost always require the ability to inspect the concrete values of loop induction variables. and in its stead only combinators exposing loop-like constructs are provided. Working in this domain rather requires us to specify loop bodies as recurrence relations.

But this all sounds a bit more complex than it actually is:

// note: ! is list indexing, but from the back, not the front
wbob_pi x0 = join (results ! 0)
  where
    results = v where
      // The number of times to apply the permutation, specified
      // as a sequence.
      rng = [ 0 .. r-1 ] : [r][8]
      // Iterate over the initial state and apply the permutation
      // a specified number of times. # is the concatenation operator
      v   = [ split x0 ] # [ perm i x | x <- v | i <- rng ]

The above code actually a typical, “boring” style of lazy functional programming. The core of this relation is described by the last line, which defines the variable v in terms of itself using a list comprehensionThis is commonly referred to as “tying the knot” in functional programming, and allows infinite recurrences or infinite sequences to behave like finite ones.. Simply expanding the definition of v is enough to reveal what’s happening here.

⊕The list comprehension here can also be defined in terms of zip as zip perm rng v

v = [ split x0 ] # [ perm i x | x <- v | i <- rng ]
v = [ split x0, perm 0 (split x0), ... ] # ...
v = [ split x0, perm 0 (split x0), perm 1 (perm 0 (split x0)), ... ] # ...

Thus v is a list, the last entry of which has had the permutation applied r number of times. Taking the last entry is all that’s needed to approximate the original loop, and decouple the loop action from the body itself.

So now we only need to define perm.

Core permutation

With the loop recurrence relation defined, we can actually now define the perm function as noted in the whirlbob paper:

$$ x_{r+1} = L(P(S(x_{r}))) \oplus C_{r} $$

… and its equal in Cryptol:

⊕Do note that the i parameter here is implicit in the definition of $ C_r $ as we’ll see later.

perm : [8] -> [8][8][8] -> [8][8][8]
perm i x = L (P (S x)) ^ C i

There’s nothing too strange about this; note that here $ x $ is also defined as a recurrent relation, and this definition is taken from the formal specification itself. But all this is saying is that “Each state of $ x $ is defined in terms of the previous state” and nothing more.

Substitute bytes

The first step of the process is defined by the $ S $ function, called “SubBytes”, and is the the easiest of them all: simply replace all the bytes in the state with some other bytes. This is done using a “substitution box” or “sbox”, also known as a lookup table: given a byte $ x \in [0..255] $ simply index into an array and return that value instead.

$$ M[i][j] \leftarrow Sbox(M[i][j]) \quad for \quad 0 \leq i, j < 8 $$

S : [8][8][8] -> [8][8][8]
S = map (map sbox)

sbox : [8] -> [8]
sbox i = box @ i where // @ is indexing, from the front
  box : [256][8]
  box = [ ... ]

⊕The sbox lookup table is actually derived from a set of equations, in terms of a series of smaller boxes—3 of them, each 16 bytes long. Such a definition may be more appropriate for e.g. hardware implementations, where larger static constant tables may increase the critical path due to higher LUT fanout. The 256-byte array box is defined ahead of time as an array of constants. I’ve omitted the definition here for brevity but there’s nothing particularly unusual about a bunch of constants in an algorithm like this.

Shifting columns

Lorem ipsum…

$$ M[(i+j)\%8][j] \leftarrow M[i][j] \quad for \quad 0 \leq i,j < 8 $$

P : [8][8][8] -> [8][8][8]
P x = transpose [ v >>> i | v <- transpose x | i <- ([ 0 .. 7 ] : [_][8]) ]

Mixing rows

The next step of the process, known as “MixRows” and called $ L $, is actually incredibly straightforward though the specification doesn’t spell it out immediately. Put bluntly: $ L $ is simply defined as using matrix multiplication to multiply the input state matrix $ M $ by a predefined matrix $ Z $.$ Z $ is defined as a “circulant, low-weight 8x8 MDS matrix” in the whirlbob paper. This multiplication takes place in the field $ GF(2^8) $ with polynomial reduction $ p(x) = x^8 + x^4 + x^3 + x^2 + 1 $.

The paper defines MixRows with two equations. The first defines $ W_i $ as a row of the matrix $ M $. And the second says to multiply every row of $ M $ by $ Z $, which is just matrix multiplication with more words.

$$ W_{i} = (M[i][0], \ldots M[i][7]) $$

$$ W_{i} \leftarrow W_{i} \cdot Z \quad for \quad 0 \leq i < 8 $$

Matrix-by-vector multiplication is of course defined as the dot product operation:

$$ dot(x,y) = \sum_{i=0}^{i<8} x_{i} y_{i} $$

And matrix-by-vector multiplication is simply the dot product of the given vector against every row:

$$ x \cdot M = dot(x,M_{i}) \quad for \quad 0 \leq i < 8 $$

As said above, the multiplication above is performed in $ GF(2^8) $. Luckily, Cryptol has this built in natively as a primitive. We start with the definition of the multiplier:

mult : [8] -> [8] -> [8]
mult a b = pmod (pmult a b) <| x^^8 + x^^4 + x^^3 + x^^2 + 1 |>

Now we can define the dot product and matrix-by-vector multiplication:

add : {n} (fin n) => [n][8] -> [8]
add ps = sums ! 0 where
  sums = [zero] # [ p ^ s | p <- ps | s <- sums ]

dot : {n} (fin n) => [n][8] -> [n][8] -> [8]
dot y z = add (zipWith mult y z)

vect : {n, m} (fin n) => [n][8] -> [m][n][8] -> [m][8]
vect v ys = [ dot v y | y <- ys ]

Finally, the definition of $ L $ is simply defined as matrix multiplication between the state and the $ Z $ matrix:

L : [8][8][8] -> [8][8][8]
L xs = [ vect x (transpose mds) | x <- xs ]

Finally: adding the round key

Lorem ipsum…

$$ M[0][j] \leftarrow M[0][j] \oplus Sbox(8r+j) \quad for \quad 0 \leq j < 8 $$

$$ M[i][j] \leftarrow M[i][j] \quad for \quad 1 \leq i < 8 \land 0 \leq j < 8 $$

C : [8] -> [8][8][8]
C i = [ Cr @ i ] # zero

Cr : [12][8][8]
Cr = [ [ ... ] // first 8 bytes of sbox
       [ ... ] // second 8 bytes of sbox
       ...
       [ ... ] // twelfth 8 bytes of sbox
     ]

Cr : [12][8][8]
Cr = split [ sbox ((8*r) + j) | r <- [ 0 .. 11 ], j <- [ 0 .. 7  ] ]

Equivalence checking

A big draw of Cryptol is that it has built in tools for equivalence checking specifications like the above against real software programs. I won’t cover it here, but the above Whirlpool function has in fact been equivalence checked against a software implementation, which showed them to be equivalent on all inputs. The intent is that Cryptol code can act as a “ground truth” specification, that is readable, executable, and can be mechanically validated against other programs.

Doing so was extremely expensive, however: it took a little over 20 hours to complete this check using a “bitsliced” software implementation of the π function, written in C. Equivalence checking against a naive C implementation of π did not complete after several days.

Alternative π functions, such as the $ F $ permutation from NORX, are vastly more amenable to techniques like equivalence checking.

Closing notes

The full code for this article is available as a GitHub Gist. This includes some extra code I didn’t cover here; for example, the mini-sbox construct for deriving the precomputed sbox and a proof of correctness.

The whirlpool π function described here is actually a variant of the AES key schedule (a permutation function); it performs a “ShiftRows” and “MixColumns”, as opposed to our own “ShiftColumns and “MixRows”. So you can certainly derive the AES π function from the above one with some work.