Introduction

Some time ago Guido Vranken (please check out his blog) stumbled upon the cryptographic library Barretenberg that is being developed by our team at Aztec Protocol. As the creator of probably the most successful differential cryptographic fuzzer there is, he quickly found several bugs that our internal audits had overlooked. One of the more serious ones that he found was a bug in the assembly code intended to perform Montgomery squaring. It reminded me of the last cryptopals challenge that I had read but never actually tried to solve. As one of the organizers of CTFZone I was in search of a CTF task idea, so I decided to make a task out of it while analyzing what could actually happen and if there are any similar vulnerabilities in our code.

If you want to try and solve the task youself, it is available here and the server is running on port 1353 of cryptraining.zone. If you just want to look at the code of the solution, you can find it here.

This writeup assumes familiarity with how finite field and elliptic curves work, so I won’t be covering those topics in depth. I’m sure that whoever is reading this will easily find a nice introduction into those topics online.

Montgomery multiplication

The core bug was in the way we performed squaring or field elements in Montgomery form. Montgomery multiplication is one of optimizations that can be used to speed up finite field multiplication. Let’s say we have some finite field modulo prime $p$ (in the task $p$ is around 254 bits, it’s the alt_bn128 curve). We find the closest convenient power of two (in this case $2^{256}$) and convert field elements into the form $a=a_{original}*R\ mod\ p$. Then, when we multiply two elements $a$ and $b$ we get the following value:

$a⋅b = a_{original}⋅ R⋅ b_{original}⋅ R= a_{original}⋅ b_{original}⋅ R^2\ mod\ p$

Now, to get the result back to Montgomery form we need to divide it by $R$. The cool thing is that since $R$ is a power of 2, it is very easy to perform this division if we manage to make the intermediate product a multiple of $R$ in integer form. Let’s look at an example:

  • $p=15541=\text{0x3cb5}$
  • $R=2^{16} = 3372\ mod\ p$
  • $a_{original}=2\ mod\ p$
  • $b_{original}=2^{-1}\ mod\ p = 7771\ mod\ p$
  • $a=a_{original}⋅R=6744\ mod\ p$
  • $b=b_{original}⋅ R= 1686\ mod\ p$

Now that we have the converted values:

$a⋅b = 6744⋅1686 = 11370384 = \text{0xad7f90}$

The value is obviously too large to fit the modulus

If we compute the integer product of two values:

  • $(-p)^{-1}\ mod\ R = 11875 =\text{0x2e63}$
  • $a⋅b= \text{0x7f90}\ mod\ R$

the result:

$l=(a⋅b\ mod\ R) ⋅ (-p^{-1}\ mod\ R) ⋅ p$

would have several cool features:

  1. $l = - a⋅b\ mod\ R$; or, to put it simply, if we add this value to the product, then the result will be 0 modulo $R$
  2. $l = 0\ mod\ p$; which means that by adding the value we are not changing the result modulo $p$

$l = \text{0x7f90} ⋅ \text{0x2e63}\ mod\ R ⋅ \text{0x3cb5}= \text{0x34b0}⋅ \text{0x3cb5} = \text{0xc7e8070}$

$a⋅b + l = \text{0xad7f90} + \text{0xc7e8070} = \text{0xd2c0000}$

We can now easily divide this value by $R$, getting

$(a⋅b + l)/\text{0x10000} = \text{0xd2c}$.

In assembly this would be even simpler, since we’d just have to forget the values of certain registers. The final value is less then $p$. If we convert it back to original form, $\text{0xd2c}*R^{-1}= 1\ mod\ p$ as we expected

N.B. The conversions themselves are also really simple. To convert $a_{original}$ to $a$ all we have to do is to do montgomery multiplication of $a_{original}$ by $R$ in Montgomery form ($R^2$):

$a_{original}⋅(R⋅R)/R = a_{original}⋅R = a\ mod\ p$.

To convert back to the original form we simply multiply by 1 and automatic division takes care of one $R$.

The curve in barretenberg obviously uses a bigger field than in the example (254-bit one), but overall the same logic applies. The only difference is that instead of computing the full product and then adding a multiple of $p$ that would nullify the lowest 256 bits ($R=2^{256}$ in this case), once we’ve computed each 64-bit chunk of $a⋅b$ we nullify just that chunk by adding an appropriate multiple of $p$. This makes the algorithm a bit less complex.

Bugs in Barretenberg’s Montgomery squaring

Barretenberg has 2 assembly versions of Montgomery squaring and multiplication algorithms. One uses BMI2 ADX extensions taking advantage of parallel additions with adcx and adox instructions and one is a simplified version that was only written to accommodate those with x86_64 CPUs without this extension. Both versions can be found in asm_macros.hpp. The non-ADX version was created from ADX version, but in doing so we created several bugs.

The assembly implementation is pretty simple:

  • We multiply single digits
  • We add them to compute the digits of the intermediate product
  • When we have computed the digit of the intermediate product completely, we use montgomery reduction on the digit to nullify it through the process described in the previous paragraph
  • We simply forget the nullified digits, so in the end we are left with 4 registers r12-r15 with the resulting value

We do squaring separately from regular multiplication, because there is an opportunity to speed up computation by performing certain digit multiplication once and then doubling the result (for example instead of multiplying a[0]⋅a[1] and a[1]⋅a[0] and then adding them together, we simply add the first product to itself).

There are several bugs in this function, but let’s look at the first one. This was the one triggered by one of the traffic captures distributed with the task.

        "movq 0(" a "), %%rdx                     \n\t" /* load a[0] into %rdx */                                       \
                                                                                                                        \
        "xorq %%r8, %%r8                          \n\t" /* clear flags                                              */  \
        /* compute a[0] *a[1], a[0]*a[2], a[0]*a[3], a[1]*a[2], a[1]*a[3], a[2]*a[3]                                */  \
        "mulxq 8(" a "), %%r9, %%r10              \n\t" /* (r[1], r[2]) <- a[0] * a[1]                              */  \
        "mulxq 16(" a "), %%r8, %%r15             \n\t" /* (t[1], t[2]) <- a[0] * a[2]                              */  \
        "mulxq 24(" a "), %%r11, %%r12            \n\t" /* (r[3], r[4]) <- a[0] * a[3]                              */  \
                                                                                                                        \
                                                                                                                        \
        /* accumulate products into result registers */                                                                 \
        "addq %%r8, %%r10                         \n\t" /* r[2] += t[1]                                             */  \
        "adcq %%r15, %%r11                        \n\t" /* r[3] += t[2]                                             */  \
        "movq 8(" a "), %%rdx                     \n\t" /* load a[1] into %r%dx                                     */  \
        "mulxq 16(" a "), %%r8, %%r15             \n\t" /* (t[5], t[6]) <- a[1] * a[2]                              */  \
        "mulxq 24(" a "), %%rdi, %%rcx            \n\t" /* (t[3], t[4]) <- a[1] * a[3]                              */  \
        "movq 24(" a "), %%rdx                    \n\t" /* load a[3] into %%rdx                                     */  \
        "mulxq 16(" a "), %%r13, %%r14            \n\t" /* (r[5], r[6]) <- a[3] * a[2]                              */  \
        "adcq %%rdi, %%r12                        \n\t" /* r[4] += t[3]                                             */  \
        "adcq %%rcx, %%r13                        \n\t" /* r[5] += t[4] + flag_c                                    */  \
        "adcq $0, %%r14                           \n\t" /* r[6] += flag_c                                           */  \
        "addq %%r8, %%r11                         \n\t" /* r[3] += t[5]                                             */  \
        "adcq %%r15, %%r12                        \n\t" /* r[4] += t[6]                                             */  \
        "adcq $0, %%r13                           \n\t" /* r[5] += flag_c                                           */  \
                                                                                                                        \
        /* double result registers  */                                                                                  \
        "addq %%r9, %%r9                          \n\t" /* r[1] = 2r[1]                                             */ 

Here we multiply a[0] by a[0-3] and start sequential additions. Then we multiply a[1] by a[2] and a[3], and a[2] by a[3]. The last in the first serie of additions you see here is

"adcq $0, %%r13"

and after that we start doubling registers, while using the instruction “addq”. The bug here is that this “adcq” can actually set the carry flag, which should be once again added to the r14 register, but this insturction is missing. Instead the flag is overwritten by the following instruction. Having lost the flag, we compute an erroneous value.

This code contains several places where carry flags are simply lost. In fact, since there are fewer and fewer people who use non-ADX version we decided that it’s easier to replace this SQR with the multiplication primitive because of the time-cost of fixing it.

The CTF task required contestants to find several faulting inputs to the SQR function as part of the solution and we gave teams an example of an buggy input. It is possible to implement smart bruteforce, where you try to generate inputs that would force r13 to be equal to 0xffffffffffffffff before the last addition (the tester of the task solved it that way). I’m lazy, so I used automation instead of thinking. Since there are very few different instructions it was easy to encode them as z3 (SMT solver) formulas, like this:

def addq(args):
    var_in_1 = get_asm_input(args[0])
    var_in_2 = get_asm_input(args[1])
    var_out = get_output_variable(args[1])
    CF_out = get_output_flag("CF")
    temp = z3.ZeroExt(1, var_in_1) + z3.ZeroExt(1, var_in_2)
    machine_state.instructions.append(
        var_out == z3.Extract(register_width - 1, 0, temp)
    )
    machine_state.instructions.append(
        CF_out == z3.Extract(register_width, register_width, temp)
    )

Then the script would go through all carry flags that were never used again and check if it was possible to find corresponding inputs. I chose several flags that were being solved by z3 fast enough. In the end I had a module, which rotated through several flag options, trying to solve them and banning previous solutions. The process is time-consuming, so once the teams found the root cause of the bug it would have been wise to set up the algorithm to find several buggy inputs before analyzing the task further. In my case on 1 thread it took around 30 minutes to find enough inputs for the other part of the task.

Now that we’ve covered the core bug, let’s look at the task itself.

Buggy Asm Task

Task structure

The contestants were given an archive with:

  • A server (source code) for the task for analysis
  • A client (source code) for the task to interact with the server
  • A capture of network traffic with the legitimate user authenticating on the server
  • A partial capture of a previous attack (they could find the asm bug this way)

The client-server workflow was the following:

  • Client connects to the server, they do an ECDH exchange
  • Clients sends encrypted password to the server
  • Server answers with the encrypted flag

The goal was to find out server’s private key. Nothing new, pretty standard for CTFs. Session logic is implemented like this:

extern "C" bool createSession(ServerState* pServerState, uint8_t* pPointCoordinates, char* pErrorMessage)
{
    std::stringstream errorStream;
    fq clientPointX(*(uint256_t*)(pPointCoordinates)),
        clientPointY(*(uint256_t*)(pPointCoordinates + sizeof(uint256_t)));
    g1::affine_element clientPoint(clientPointX, clientPointY);
    if (!clientPoint.on_curve()) {
        errorStream << "Input point " << clientPoint << " is not on curve" << std::endl;
        strcpy(pErrorMessage, errorStream.str().c_str());
        return false;
    }
    g1::affine_element sharedPoint = g1::affine_element(
        g1::element(clientPoint * fr(2)).mul_without_endomorphism(pServerState->privateKey * fr(2).invert()));
    if (!sharedPoint.on_curve()) {

        errorStream << "Shared point " << sharedPoint << " is not on curve" << std::endl;
        strcpy(pErrorMessage, errorStream.str().c_str());

        return false;
    }
    uint256_t sharedPointX(sharedPoint.x), sharedPointY(sharedPoint.y);
    std::vector<uint8_t> hasherInput;
    std::vector<uint8_t> temp;
    for (size_t i = 0; i < sizeof(uint256_t); i++) {
        hasherInput.push_back(((uint8_t*)(&sharedPointX.data[0]))[i]);
    }
    for (size_t i = 0; i < sizeof(uint256_t); i++) {
        hasherInput.push_back(((uint8_t*)(&sharedPointY.data[0]))[i]);
    }
    temp.insert(temp.begin(), hasherInput.begin(), hasherInput.end());
    temp.push_back(0);
    auto encryptionKey = Sha256Hasher::hash(temp);
    temp.pop_back();
    temp.push_back(1);
    auto macKey = Sha256Hasher::hash(temp);

    memcpy(pServerState->sessionEncryptionKey, encryptionKey.data(), 32);
    memcpy(pServerState->sessionMACKey, macKey.data(), 32);
    return true;
}

If during ECDH key exchange the client’s public point was not on the curve, the server responded with this information and ended the handshake. The weird thing was that the server also responded with shared point information, if the computed shared point was off the curve. This is a situation you should never run into, since you’re getting an invalid point by scaling a valid one. However, as we covered earlier, the core bug resulted in erroneous field element squaring. This would make the point fall off the curve onto another one, which might not be as secure. To help solvers the original public point was actually doubled initially and then all operations were performed on the doubled point. This was done to stabilize the task, as I’m not sure that a task with regular scalar multiplication would be solvable at all.

Finding failure-inducing points

The goal was to submit a valid point that would fall off the curve during the doubling operation. To understand how we could do that,let’s look at how doubling of EC points is performed in barretenberg.

template <class Fq, class Fr, class T> constexpr void element<Fq, Fr, T>::self_dbl() noexcept
{
    if constexpr (Fq::modulus.data[3] >= 0x4000000000000000ULL) {
        if (is_point_at_infinity()) {
            self_set_infinity();
            return;
        }
    } else {
        if (x.is_msb_set_word()) {
            self_set_infinity();
            return;
        }
    }

    // T0 = x*x
    Fq T0 = x.sqr();

    // T1 = y*y
    Fq T1 = y.sqr();

    // T2 = T2*T1 = y*y*y*y
    Fq T2 = T1.sqr();

    // T1 = T1 + x = x + y*y
    T1 += x;

    // T1 = T1 * T1
    T1.self_sqr();

    // T3 = T0 + T2 = xx + y*y*y*y
    Fq T3 = T0 + T2;

    // T1 = T1 - T3 = x*x + y*y*y*y + 2*x*x*y*y*y*y - x*x - y*y*y*y = 2*x*x*y*y*y*y = 2*S
    T1 -= T3;

    // T1 = 2T1 = 4*S
    T1 += T1;

    // T3 = 3T0
    T3 = T0 + T0;
    T3 += T0;
    if constexpr (T::has_a) {
        T3 += (T::a * z.sqr().sqr());
    }

    // z2 = 2*y*z
    z += z;
    z *= y;

    // T0 = 2T1
    T0 = T1 + T1;

    // x2 = T3*T3
    x = T3.sqr();

    // x2 = x2 - 2T1
    x -= T0;

    // T2 = 8T2
    T2 += T2;
    T2 += T2;
    T2 += T2;

    // y2 = T1 - x2
    y = T1 - x;

    // y2 = y2 * T3 - T2
    y *= T3;
    y -= T2;
}

We see that doubling uses projective coordinates (instead of $(x,y)$ we have $(X,Y,Z)$). If you are unfamiliar with this form, it is used to speed up computation. Affine form $(x,y)$ depends on division during computation of $\lambda$ during addition and doubling operation. Division in finite fields is extremely computationally expensive, since you need to either exponentiate the value to the power of $p-2$ or use extended euclidean algorithm to find the inverse. Projective coordinate form deals with these pains by saving division for last. In projective form you have to perform expensive divison to convert back to affine form only after you’ve computed the final point.

In the code we see several sqr and self_sqr calls that we are interested in here. For example, both $x$ and $y$ get squared, so why don’t we simply place the bug-inducing value in $x$ or $y$ and try to compute a conforming point? There are 2 issues:

  1. Such a point might simply not exist. For example, if we set $x$ to a particular value, that doesn’t automatically mean that the curve equation $y^2=x^3+b$ (the curve in barretenberg has $a=0$) will have roots (there is a 50% chance).
  2. The initial check performed by the server ensures that $y^2=x^3+b$, so if we put the buggy value directly in $x$ or $y$ the first check will fail, so we need to be smarter.

So we need to find other “squares”. Here they are:

  1. $3x^2$. We needed to check that $\frac{\mathit{buggy\_value}}{3}\ mod\ p$ is a quadratic residue, then compute two possible $x$ if it was and check if we could derive valid points from those.
    // T0 = x*x
    Fq T0 = x.sqr();

    // T3 = 3T0
    T3 = T0 + T0;
    T3 += T0;

    // x2 = T3*T3
    x = T3.sqr();
  1. $y^2$. Here we needed to check that the buggy value is a quadratic residue and find solutions to equation $x^3+3-\mathit{buggy\_value}=0$
    // T1 = y*y
    Fq T1 = y.sqr();

    // T2 = T2*T1 = y*y*y*y
    Fq T2 = T1.sqr();
  1. $x+y^2$ We needed to find solutions to the equation $x^3+x+3-\mathit{buggy\_value}=0$. Then check that there are actually points on those $x$.
    // T0 = x*x
    Fq T0 = x.sqr();

    // T1 = y*y
    Fq T1 = y.sqr();


    // T1 = T1 + x = x + y*y
    T1 += x;

    // T1 = T1 * T1
    T1.self_sqr();

To simplify checks, I added a library function that just doubled the point using buggy code.

Barretenberg assumes that the point is on curve $y^2=x^3+b$, where $b$ is embedded in the point itself, so it takes no part in computation. Knowing $(x,y)$ it is easy to compute $b$ and then it’s possible to compute the curve’s order.

I factored all the orders and collected points with subgroups small enough to be useful. Once I had enough buggy points, I could send them to the server, get erroneous shared points and solve discrete logarithm in subgroups using pollard’s rho algorithm. Using Chinese remainder theorem they could reconstruct the private key.

After reconstructing the private key contestants could simply assume the role of the server, replay the conversation with authentication and decrypt the password. Then all that was left is to submit it to the server.

Complete Solution

  1. Discover buggy flags in assembly, create a script to generate various inputs triggering those bugs and producing erroneous squared values
  2. Convert those values back from Montgomery form.
  3. Put them into the three equations and check if there are roots.
  4. If there are roots, compute the points.
  5. Check that the points actually trigger bugs and if they do, save them and resulting points on other curves.
  6. Check curves' orders. Factor them and collect subgroups with solvable dlog.
  7. Once there are enough subgroups so that the CRT of their orders is larger than the original scalar group order by a few factors, switch to interacting with the server.
  8. Send buggy points to the server and receive shared points that should now be on other curves.
  9. Solve appropriate subgroup dlogs.
  10. Reconstuct server’s private key
  11. Replay the legitimate capture that is distributed with the task, assuming the role of the server and get the decrypted password.
  12. Send it to the server and get the flag.