A Random Number Generator in Only Three Lines of Assembly
Mar 25, 2025
I obviously need pseudorandom number generators for my projects.
I use low-spec microcontrollers, which are extremely resource-constrained. The code needs to run efficiently and compile down to as small a size as possible. I try to eke out every last byte of the few kilobytes of flash space I have to work with.
The Arduino random() function has far too much overhead.
What to do?
The FastLED LCG Random Generator (rand8())
I’ve always used rand8() from FastLED.
// X(n+1) = (2053 * X(n)) + 13849
#define FASTLED_RAND16_2053  ((uint16_t)(2053))
#define FASTLED_RAND16_13849 ((uint16_t)(13849))

/// random number seed
static uint16_t rand16seed = 7;

uint8_t lcr_random8() {
    rand16seed = (rand16seed * FASTLED_RAND16_2053) + FASTLED_RAND16_13849;
    return (uint8_t)(((uint8_t)(rand16seed & 0xFF)) + ((uint8_t)(rand16seed >> 8)));
}
It uses a Linear Congruential Generator (LCG), one of the oldest and most well-understood pseudorandom number generators.
What is an LCG and How Does It Work?
An LCG generates a sequence of pseudorandom numbers using this mathematical recurrence relation:
X(n+1) = (a × X(n) + c) mod m
Where:
- X(n) is the current value (rand16seed)
- a is the multiplier (2053)
- c is the increment (13849)
- m is the modulus (implicitly 2^16 due to 16-bit integer overflow)
The algorithm works by:
- Taking the current seed value
- Multiplying it by a carefully chosen constant (2053)
- Adding another carefully chosen constant (13849)
- Letting the natural overflow of 16-bit integers handle the modulo operation
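To make the arithmetic concrete, here’s one update worked through from the default seed of 7 (a little desktop sketch of my own, not FastLED code):

#include <stdint.h>
#include <stdio.h>

// One step of X(n+1) = 2053*X(n) + 13849 (mod 2^16), starting from seed 7.
int main(void) {
    uint16_t seed = 7;
    seed = (uint16_t)(seed * 2053u + 13849u);               // 7*2053 + 13849 = 28220 = 0x6E3C
    uint8_t out = (uint8_t)((seed & 0xFF) + (seed >> 8));   // 0x3C + 0x6E = 0xAA = 170
    printf("seed=%u out=%u\n", (unsigned)seed, (unsigned)out);
    return 0;
}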
These specific constants weren’t chosen randomly — it’s been proven that certain values produce optimal statistical properties for the pseudorandom sequence, and I assume FastLED utilized published values.
Why Mix High and Low Bytes?
The line that returns the value does something interesting:
return (uint8_t)(((uint8_t)(rand16seed & 0xFF)) + ((uint8_t)(rand16seed >> 8)));
This adds together:
- The low byte (rand16seed & 0xFF)
- The high byte (rand16seed >> 8)
Why do this? In LCGs, the lower bits tend to have shorter cycles and poorer randomness properties than the higher bits. For example, the lowest bit often alternates between odd and even in a completely predictable pattern. As you move to higher bits, the patterns become increasingly complex.
By adding the high and low bytes together, we’re mixing these different quality bits, allowing the better randomness from the high bits to influence our final 8-bit result. This simple addition creates better statistical properties than just returning either byte alone.
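Here’s a small desktop sketch (my own illustration, not library code) that makes the low-bit problem visible: because both 2053 and 13849 are odd, bit 0 of the new state is always the old bit 0 flipped, so printing it gives a strictly alternating pattern:

#include <stdint.h>
#include <stdio.h>

// Print the lowest bit of the raw 16-bit LCG state for a few steps.
int main(void) {
    uint16_t seed = 7;
    for (int i = 0; i < 8; i++) {
        seed = (uint16_t)(seed * 2053u + 13849u);
        printf("%u", (unsigned)(seed & 1u));   // prints 0 and 1 in strict alternation
    }
    printf("\n");
    return 0;
}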
Optimizing Multiplication
I optimized the multiplication by 2053 in the algorithm as follows:
#define MULTIPLY_FASTLED_RAND16_2053(x) (((x) << 11) + ((x) << 2) + (x))
This works because 2053 = 2048 + 4 + 1:
- x << 11 = x × 2048
- x << 2 = x × 4
- x + (x × 4) + (x × 2048) = x × (1 + 4 + 2048) = x × 2053
On microcontrollers without a hardware multiplier, bit shifts and additions can be much faster than a multiply. On the other hand, many such microcontrollers also lack a multi-bit shift instruction, so the compiler expands a shift by more than one place into repeated single-bit shifts or a small loop, especially for 16-bit values. It’s worth checking the generated code for your specific architecture with a tool such as Compiler Explorer (godbolt.org).
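For reference, here’s how the macro could slot into the update step; the lcr_random8_shift name and the repeated definitions are mine, mirroring the listing above:

#include <stdint.h>

// Shift-and-add replacement for the multiply by 2053 (2048 + 4 + 1).
#define MULTIPLY_FASTLED_RAND16_2053(x) ((uint16_t)(((x) << 11) + ((x) << 2) + (x)))
#define FASTLED_RAND16_13849 ((uint16_t)(13849))

static uint16_t rand16seed = 7;   // same default seed as the original

// Variant of lcr_random8() with the multiply replaced by shifts and adds.
uint8_t lcr_random8_shift(void) {
    rand16seed = MULTIPLY_FASTLED_RAND16_2053(rand16seed) + FASTLED_RAND16_13849;
    return (uint8_t)((rand16seed & 0xFF) + (rand16seed >> 8));
}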
The Mysterious TN13 Algorithm
Then, deep in the bowels of the internet, I discovered this:
uint8_t tn13_random8() {
    static uint8_t rand1 = 168, rand2 = 2;  // initialize with some seed
    asm volatile(
        "eor %0,%1"  "\n\t"
        "swap %0"    "\n\t"
        "add %1,%0"  "\n\t"
        : "+r"(rand1), "+r"(rand2)
        : /* No separate inputs */
    );
    return rand1;  // or return rand2, depending on what you need
}
All references to it had long since link-rotted away. What had I stumbled upon?
Breaking Down the Assembly
This function uses inline AVR assembly for maximum efficiency and control.
Let’s translate each instruction:
- eor %0,%1 - Exclusive OR (XOR) between rand1 and rand2, storing the result in rand1
- swap %0 - Swap the high and low nibbles (4-bit halves) of rand1
- add %1,%0 - Add rand1 to rand2, storing the result in rand2
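Read as portable C (my translation, not code from the original source), the three instructions amount to:

#include <stdint.h>

// C equivalent of the three AVR instructions, for reference.
uint8_t tn13_random8_c(void) {
    static uint8_t rand1 = 168, rand2 = 2;           // same seeds as the asm version
    rand1 ^= rand2;                                  // eor %0,%1
    rand1 = (uint8_t)((rand1 << 4) | (rand1 >> 4));  // swap %0: exchange the nibbles
    rand2 = (uint8_t)(rand2 + rand1);                // add %1,%0: 8-bit add, carry discarded
    return rand1;
}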
These three simple operations form a type of Nonlinear Feedback Shift Register (NFSR), though not a conventional one. It maintains two state variables that influence each other:
- XOR combines bits from both state variables in a non-linear way
- Similar to the byte mixing in the LCG, swapping nibbles creates a rotation effect that helps distribute randomness. The swap operation ensures that bits from both nibbles influence future states, creating an “avalanche” where a change to one bit affects many others over time.
- Addition provides carry propagation for additional mixing
How Efficient Is It?
The TN13 algorithm is remarkably efficient:
- Operates on 8-bit values only: tn13_random8() works entirely within 8-bit registers, while lcr_random8() requires 16-bit arithmetic.
- Minimal instruction count: Only three assembly instructions vs. multiple shifts, adds, and operations in the LCG.
- Perfect for AVR architecture: The chosen instructions (eor, swap, add) are all single-cycle operations on AVR microcontrollers.
- No memory access overhead: Everything happens in registers with minimal overhead.
- Smaller code size: The entire function compiles to just a few bytes of machine code.
- Compact state: Maintains two 8-bit values (2 bytes total), the same amount of state as the LCG’s 16-bit seed, but it never needs 16-bit intermediate results.
In terms of execution speed, this would likely be at least 5-10 times faster than the LCG implementation, especially on microcontrollers without hardware multiplication or native 16-bit support.
The name “tn13” suggests it was originally developed for the ATtiny13, an extremely low-spec microcontroller with only 64 bytes of RAM and 1k of flash space.
Comparison

                  lcr_random8 (LCG)                tn13_random8 (TN13)
State             16-bit seed (2 bytes)            two 8-bit values (2 bytes)
Work per call     16-bit multiply and add          three single-cycle instructions
Code size         larger (16-bit arithmetic)       a few bytes of machine code
Period            full 16-bit cycle (65,536)       shorter, not well characterized
Quality           well-studied LCG properties      adequate for aesthetic use
The Tradeoff: Quality vs. Speed
While maximally efficient, this algorithm has a shorter period (the sequence repeats sooner) than the 16-bit LCG. Its statistical properties are also less well studied than those of LCGs; I haven’t been able to find any reference for this type of PRNG.
In some ways the tn13 algorithm is actually superior: the LCG has a degree of correlation between consecutive pairs of values, whereas the tn13 algorithm does not.
But we’re not going for cryptographic security here. At a glance, no one would notice the difference. For aesthetic applications the quality is perfectly adequate, and the speed advantage is significant.
Conclusion
These two algorithms represent different ends of the spectrum:
- FastLED’s LCG: Better statistical properties, more computational work
- TN13 Algorithm: As fast as possible, tiny code size, adequate randomness
What amazes me about the TN13 algorithm is how much randomness can be extracted from such a simple operation sequence.