Building a CUDA GPU Big Integer Library from Scratch

Learning Practical CUDA on a Budget

I got free GPU credits from Runpod. Receive free credits at this link.

Quick intro
LeetArxiv is Leetcode for implementing Arxiv and other research papers.

Stop reading papers. Start coding them. Subscribe for weekly paper implementations, one semicolon at a time.

This is part of our series on Practical Index Calculus for Computer Programmers.

Part 1: Discrete Logarithms and the Index Calculus Solution.

Part 2: Solving Pell Equations with Index Calculus and Algebraic Numbers.

Part 3: Solving Index Calculus Equations over Integers and Finite Fields.

Part 4: Pollard Kangaroo when Index Calculus Fails.

Part 5: Smart Attack on Anomalous Curves.

Part 6: Hacking Dormant Bitcoin Wallets in C.

1.0 Introduction

I had 20 dollars in free GPU credits from Runpod so I spent the weekend learning CUDA by building a big integer library.

Big integers are numbers with lots of bits. They are used in cryptography to protect passwords. Libraries like LibGMP, FLINT and OpenSSL work great on CPU. On GPU however, that’s another story.

Here’s the Colab notebook and here’s the GitHub with the C and CUDA code.

This article is split into these sections:

  1. CPU Benchmark code.

    • We code CPU addition, multiplication and modular inverse for 256 bit numbers.

    • Compare our code’s running time and accuracy to libGMP.

  2. GPU Benchmark code.

    • We log into Colab and quickly convert C code to CUDA.

2.0 CPU Benchmark Code

We need a CPU baseline. Ours is a custom-made C library that handles 256 bit integers, benchmarked against LibGMP in C.

*Lol I used the tricks I saw in LibGMP’s source code. idk if it counts haha

2.1 Big Integer Data Structures

A common pattern observed in OpenSSL and LibGMP is to typedef structs around fixed-size, one-dimensional arrays that are never malloc'd. This helps us:

  1. Avoid unsafe memory gotchas.

  2. Keep everything on the stack and avoid slow heap allocations.

  3. Improve code readability.

    *I loathe seeing symbols in my code so much that I avoid Rust and C++ because of the syntax.

We store our integers in limbs. Our example uses unsigned 32 bit integer arrays. These are 8 limbs for 256 bit arithmetic and 16 limbs for 512 bit arithmetic.

*The most significant bit is in the leftmost limb (the array index MAX_LIMB-1)
Unsigned 32 bit arrays as limbs: 1-dimensional arrays instead of pointers

BTW using 1 dimensional arrays further makes ops like zeroing and setting all bits to 1 pretty fast

Setting all bits to either 0 or 1.
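A minimal sketch of this pattern in C (the names `BigInt`, `MAX_LIMBS`, and the sign-plus-magnitude layout are illustrative assumptions, not the library's actual identifiers):

```c
#include <stdint.h>

#define MAX_LIMBS 8  /* 8 x 32-bit limbs = 256 bits */

/* Fixed-size struct: no malloc, lives entirely on the stack. */
typedef struct {
    uint32_t limbs[MAX_LIMBS];  /* limbs[0] = least significant limb */
    int sign;                   /* +1 or -1 */
} BigInt;

/* Zeroing and setting every bit are simple loops over one flat array. */
static void bigint_set_zero(BigInt *x) {
    for (int i = 0; i < MAX_LIMBS; i++) x->limbs[i] = 0;
    x->sign = 1;
}

static void bigint_set_all_ones(BigInt *x) {
    for (int i = 0; i < MAX_LIMBS; i++) x->limbs[i] = 0xFFFFFFFFu;
    x->sign = 1;
}
```

Because the array has a fixed compile-time size, the compiler can unroll these loops, which is why the zero/all-ones ops are so cheap.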

2.2 Signed Addition and Subtraction

We implement signed addition by performing magnitude tests. Magnitudes are compared from the most significant limb down to the least significant limb; the addition itself proceeds from the least significant limb upward so carries can propagate. We use 64 bit unsigned integers to hold the carries.

There are two addition cases to handle:

  1. Signs are similar.

  2. Signs are different, like negative plus positive.

2.2.1 Addition/Subtraction with Similar Signs

When the signs are similar, we ignore the sign and just add like you would on pencil and paper.

*Computers use 2’s complement, so addition and subtraction reduce to the same hardware operation.
Absolute value comparison from MSB to LSB

First we write an absolute value comparison function. It ignores the sign and compares from the most significant limb down to the least significant limb.
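Sketched in C (identifiers are illustrative), the comparison walks from the top limb down and returns at the first limb that differs:

```c
#include <stdint.h>

#define MAX_LIMBS 8

typedef struct {
    uint32_t limbs[MAX_LIMBS];  /* limbs[0] = least significant limb */
    int sign;
} BigInt;

/* Returns 1 if |a| > |b|, -1 if |a| < |b|, 0 if equal. Sign is ignored. */
static int bigint_cmp_abs(const BigInt *a, const BigInt *b) {
    for (int i = MAX_LIMBS - 1; i >= 0; i--) {
        if (a->limbs[i] > b->limbs[i]) return 1;
        if (a->limbs[i] < b->limbs[i]) return -1;
    }
    return 0;
}
```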

Then we write a pretty simple addition function that carries in an unsigned 64 bit data type.

Absolute value addition ignores sign
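A hedged sketch of that addition (again with illustrative names): each limb sum is computed in a `uint64_t` so the carry is just the high 32 bits.

```c
#include <stdint.h>

#define MAX_LIMBS 8

typedef struct {
    uint32_t limbs[MAX_LIMBS];  /* limbs[0] = least significant limb */
    int sign;
} BigInt;

/* r = |a| + |b|, least significant limb first; sign is handled elsewhere. */
static void bigint_add_abs(BigInt *r, const BigInt *a, const BigInt *b) {
    uint64_t carry = 0;
    for (int i = 0; i < MAX_LIMBS; i++) {
        uint64_t sum = (uint64_t)a->limbs[i] + b->limbs[i] + carry;
        r->limbs[i] = (uint32_t)sum;   /* low 32 bits */
        carry = sum >> 32;             /* high bits become the next carry */
    }
    /* A carry out of the top limb is discarded: arithmetic wraps mod 2^256. */
}
```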

2.2.2 Addition with Different Signs

Different signs mean we’re working with one positive and one negative integer. In this case, we subtract the smaller absolute value from the larger, and the result takes the sign of the operand with the larger magnitude.

Subtracting absolute values to handle pos/neg
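A sketch of the magnitude subtraction (illustrative names; the caller must ensure `|a| >= |b|`), using a borrow instead of a carry:

```c
#include <stdint.h>

#define MAX_LIMBS 8

typedef struct {
    uint32_t limbs[MAX_LIMBS];  /* limbs[0] = least significant limb */
    int sign;
} BigInt;

/* r = |a| - |b|; assumes |a| >= |b| (check with a magnitude compare first). */
static void bigint_sub_abs(BigInt *r, const BigInt *a, const BigInt *b) {
    uint64_t borrow = 0;
    for (int i = 0; i < MAX_LIMBS; i++) {
        uint64_t d = (uint64_t)a->limbs[i] - b->limbs[i] - borrow;
        r->limbs[i] = (uint32_t)d;
        borrow = d >> 63;  /* top bit set means the subtraction wrapped below zero */
    }
}
```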

2.2.3 Final Addition/Sub Function

We use a single addition function to wrap around our different edge cases.

Complete addition function handling different test cases
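Putting the pieces together, the wrapper can be sketched like this (a hedged reconstruction combining the magnitude compare, absolute add, and absolute subtract described above; all identifiers are assumptions):

```c
#include <stdint.h>

#define MAX_LIMBS 8

typedef struct {
    uint32_t limbs[MAX_LIMBS];  /* limbs[0] = least significant limb */
    int sign;                   /* +1 or -1 */
} BigInt;

static int cmp_abs(const BigInt *a, const BigInt *b) {
    for (int i = MAX_LIMBS - 1; i >= 0; i--)
        if (a->limbs[i] != b->limbs[i])
            return a->limbs[i] > b->limbs[i] ? 1 : -1;
    return 0;
}

static void add_abs(BigInt *r, const BigInt *a, const BigInt *b) {
    uint64_t carry = 0;
    for (int i = 0; i < MAX_LIMBS; i++) {
        uint64_t s = (uint64_t)a->limbs[i] + b->limbs[i] + carry;
        r->limbs[i] = (uint32_t)s;
        carry = s >> 32;
    }
}

static void sub_abs(BigInt *r, const BigInt *a, const BigInt *b) { /* needs |a| >= |b| */
    uint64_t borrow = 0;
    for (int i = 0; i < MAX_LIMBS; i++) {
        uint64_t d = (uint64_t)a->limbs[i] - b->limbs[i] - borrow;
        r->limbs[i] = (uint32_t)d;
        borrow = d >> 63;
    }
}

/* r = a + b. Same signs: add magnitudes. Different signs: subtract the
   smaller magnitude from the larger; result takes the larger one's sign. */
static void bigint_add(BigInt *r, const BigInt *a, const BigInt *b) {
    if (a->sign == b->sign) {
        add_abs(r, a, b);
        r->sign = a->sign;
    } else if (cmp_abs(a, b) >= 0) {
        sub_abs(r, a, b);
        r->sign = a->sign;
    } else {
        sub_abs(r, b, a);
        r->sign = b->sign;
    }
}
```

One design note: a zero result can come out with sign -1 here; a real library would normalize that case.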

2.3 Signed Multiplication and Modulo

Our library is designed for finite fields. General integer division isn’t needed there: dividing by a field element is just multiplying by its modular inverse. So we just need a good multiplier and a good (enough) modulo function.

Implement standard multiplication. Nothing fancy

We implement usual grade school multiplication. Nothing fancy. Observe that we store our multiplication result in a 512 bit data type.
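The grade-school loop can be sketched as follows (illustrative names; the 512 bit result type holds all 16 limbs of an 8-limb-by-8-limb product):

```c
#include <stdint.h>

#define MAX_LIMBS 8

typedef struct { uint32_t limbs[MAX_LIMBS]; int sign; } BigInt;          /* 256-bit */
typedef struct { uint32_t limbs[2 * MAX_LIMBS]; int sign; } BigIntWide;  /* 512-bit */

/* Schoolbook multiplication: each 32x32 -> 64 bit partial product is
   accumulated into the result at position i + j, carrying as we go. */
static void bigint_mul(BigIntWide *r, const BigInt *a, const BigInt *b) {
    for (int i = 0; i < 2 * MAX_LIMBS; i++) r->limbs[i] = 0;
    for (int i = 0; i < MAX_LIMBS; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < MAX_LIMBS; j++) {
            uint64_t cur = (uint64_t)a->limbs[i] * b->limbs[j]
                         + r->limbs[i + j] + carry;
            r->limbs[i + j] = (uint32_t)cur;
            carry = cur >> 32;
        }
        r->limbs[i + MAX_LIMBS] = (uint32_t)carry;  /* final carry of row i */
    }
    r->sign = a->sign * b->sign;
}
```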

2.5 CPU Benchmark

LibGMP absolutely mogs our code. However, our implementation is correct and that’s all that matters lol*.

*yes, the copium is strong with this one
Absolute cinema! Our code is nowhere close to libGMP.

3.0 CUDA Code

Here’s the thing: converting from C to CUDA is pretty simple. I did it in this Colab notebook.

Our struct definitions remain the same; we append __device__ to GPU functions and remove memset.

  1. Struct definitions and function logic remain the same.

  2. We affix the __device__ keyword to functions we want to run on the GPU.

  3. Replace memset with a loop over the array, because our device code can’t use memset.
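As an illustrative sketch (struct name and layout are assumptions), a CPU limb-zeroing function becomes a GPU one by adding the qualifier and swapping memset for a loop:

```cuda
#define MAX_LIMBS 8

typedef struct {
    unsigned int limbs[MAX_LIMBS];
    int sign;
} BigInt;

/* Identical logic to the CPU version; only __device__ and the loop are new. */
__device__ void bigint_set_zero(BigInt *x) {
    for (int i = 0; i < MAX_LIMBS; i++)
        x->limbs[i] = 0;  /* loop instead of memset */
    x->sign = 1;
}
```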

Next we write a multiplication kernel:

Multiplication GPU kernel
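The kernel can be sketched like this (a hedged reconstruction, not the repo’s actual code; names are assumptions). Each thread handles one independent 256 bit multiplication, which is why big-integer batches map so well onto the GPU:

```cuda
#define MAX_LIMBS 8

typedef struct { unsigned int limbs[MAX_LIMBS]; int sign; } BigInt;          /* 256-bit */
typedef struct { unsigned int limbs[2 * MAX_LIMBS]; int sign; } BigIntWide;  /* 512-bit */

/* Same schoolbook multiplication as the CPU version, marked __device__. */
__device__ void bigint_mul(BigIntWide *r, const BigInt *a, const BigInt *b) {
    for (int i = 0; i < 2 * MAX_LIMBS; i++) r->limbs[i] = 0;
    for (int i = 0; i < MAX_LIMBS; i++) {
        unsigned long long carry = 0;
        for (int j = 0; j < MAX_LIMBS; j++) {
            unsigned long long cur = (unsigned long long)a->limbs[i] * b->limbs[j]
                                   + r->limbs[i + j] + carry;
            r->limbs[i + j] = (unsigned int)cur;
            carry = cur >> 32;
        }
        r->limbs[i + MAX_LIMBS] = (unsigned int)carry;
    }
    r->sign = a->sign * b->sign;
}

/* One thread per multiplication: thread idx handles the pair (a[idx], b[idx]). */
__global__ void mul_kernel(BigIntWide *out, const BigInt *a, const BigInt *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        bigint_mul(&out[idx], &a[idx], &b[idx]);
}
```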

Finally, we add code to handle GPU threads and allocate GPU memory for our structs. Tbh, I didn’t expect it to be this easy.

Code to handle threads and memory allocation on GPU

Our main function is split into these parts:

  1. Allocating host memory for inputs and results.

    • Majority of the main function does this lol.

      Allocating host memory
  2. Next, we initialize our inputs and allocate device memory for the arrays.

    • Again, I’m baffled how intuitive this was.

      Allocating memory for arrays and device
  3. Finally, we run our batched multiplications and free allocated memory.

    Running multiplication kernel and freeing memory
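The three steps above can be sketched as host code (an illustrative reconstruction assuming the multiplication kernel from the article with the signature declared below; batch size and thread counts are made-up values):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define MAX_LIMBS 8

typedef struct { unsigned int limbs[MAX_LIMBS]; int sign; } BigInt;
typedef struct { unsigned int limbs[2 * MAX_LIMBS]; int sign; } BigIntWide;

/* Defined elsewhere: the batched multiplication kernel. */
__global__ void mul_kernel(BigIntWide *out, const BigInt *a, const BigInt *b, int n);

int main(void) {
    const int n = 1 << 20;  /* hypothetical batch size */
    size_t in_bytes = n * sizeof(BigInt), out_bytes = n * sizeof(BigIntWide);

    /* 1. Host allocations. */
    BigInt *h_a = (BigInt *)malloc(in_bytes), *h_b = (BigInt *)malloc(in_bytes);
    BigIntWide *h_out = (BigIntWide *)malloc(out_bytes);
    /* ... initialize h_a and h_b with test values ... */

    /* 2. Device allocations and host-to-device copies. */
    BigInt *d_a, *d_b; BigIntWide *d_out;
    cudaMalloc((void **)&d_a, in_bytes);
    cudaMalloc((void **)&d_b, in_bytes);
    cudaMalloc((void **)&d_out, out_bytes);
    cudaMemcpy(d_a, h_a, in_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, in_bytes, cudaMemcpyHostToDevice);

    /* 3. Launch the batched multiplications, copy back, free everything. */
    int threads = 256, blocks = (n + threads - 1) / threads;
    mul_kernel<<<blocks, threads>>>(d_out, d_a, d_b, n);
    cudaMemcpy(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(h_a); free(h_b); free(h_out);
    return 0;
}
```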

What are the results?

Our 256 bit integer GPU code took 12 milliseconds to perform as many multiplications as our CPU code.

GPU execution time

Our CPU library took 512 milliseconds while the GPU took 12 milliseconds. Amazing!

Sign up for part 2, where we add production-level multipliers and dividers from the research literature.

