At its simplest, an EWMA is just a way to calculate an average that gives more importance to recent values and less importance to older values. The “weight” of each past value decreases exponentially as it gets older.

So, for example, we could compute a moving average as (0.5 × previous_average) + (0.5 × new_reading).

The general formula looks like this:

new_average = α × current_value + (1-α) × old_average

Where α (alpha) is a constant between 0 and 1 that determines how quickly the average adapts to new values:

  • α close to 1: Adapts quickly, emphasizes recent data
  • α close to 0: Adapts slowly, provides more smoothing

This simple recursive formula creates an average that “follows” your data while smoothing out noise and spikes.

In theory, old values affect the new average forever, with exponentially decreasing weight.
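
As a tiny illustration, here is the same update written out in Python (the readings here are made up):

def ewma_update(old_average, current_value, alpha=0.5):
    # Recent values get weight alpha; older values decay by (1 - alpha) each step
    return alpha * current_value + (1 - alpha) * old_average

average = 10.0
for reading in [10, 10, 20, 10]:
    average = ewma_update(average, reading)
    print(average)  # 10.0, 10.0, 15.0, 12.5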

Demo

This demo shows the effect of different smoothing parameters on example time-series data.

Note the trade-off of choosing a smaller alpha: the average is smoother, but slower to respond to sudden changes.

The Same Algorithm in Different Domains

Firmware Low-Pass Filters

When I started using analog sensors, I noticed the input was “spiky” and jittery. This is called “noise”. If I tried to tie LED animations to a sensor input like an accelerometer, the result looked erratic and jarring.

Eventually I discovered that you can use filtering techniques to remove this noise and get a smoothed input value.

Floating-Point Implementation

The floating-point version is the most intuitive:

float update_filter(float new_reading) {
  // Alpha determines how quickly the filter responds to changes
  const float alpha = 0.1f;
  static float filtered_value = 0.0f;  // Initial value

  filtered_value = alpha * new_reading + (1.0f - alpha) * filtered_value;

  return filtered_value;
}

Integer Implementation (basic)

Suppose we don’t want to use floating point.

A really basic integer implementation of a moving average is:

smoothed_value = (smoothed_value >> 1) + (new_value >> 1);

This will maintain a moving average by taking the sum of 50% of the new value and 50% of the previous moving average.

This simple example will work, but there is a fair amount of precision loss, since the low-order bits are truncated away on every update.
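
As a quick illustration (a Python sketch of the same update, with made-up readings), small values can be rounded away entirely:

# Each >> 1 discards the low-order bit of its operand
smoothed_value = 0
for new_value in [1, 1, 1, 1]:
    smoothed_value = (smoothed_value >> 1) + (new_value >> 1)
    print(smoothed_value)  # prints 0 every time; the readings of 1 are lost entirely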

Integer Implementation (advanced)

We can design a more precise filter using fixed point. This one uses Q16.16 without multiplication or division:

int16_t smoothInt(
  int16_t sample, uint8_t log_alpha, int32_t* filter) {
  // Convert the sample to Q16.16 fixed point
  int32_t local_sample = ((int32_t)sample) << 16;

  // filter += alpha * (sample - filter), where alpha = 1 / 2^log_alpha
  *filter += (local_sample - *filter) >> log_alpha;

  // Round to nearest and convert back from Q16.16
  return (int16_t)((*filter + 0x8000) >> 16);
}

(The 0x8000 is for rounding).

The trick is choosing your alpha as a fraction with a power-of-2 denominator which allows you to use bit shifts to accomplish fractional multiplication.
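
As a rough sketch of why that works (Python for brevity; the variable names are mine), the shift-based update with log_alpha = 4 tracks the floating-point formula with alpha = 1/16:

log_alpha = 4                 # alpha = 1 / 2**log_alpha = 0.0625
filter_fixed = 0              # Q16.16 fixed-point state
filter_float = 0.0            # floating-point reference
alpha = 1 / 2**log_alpha

for sample in [100, 100, 100, 100]:
    # Fixed point: shift right by log_alpha instead of multiplying by alpha
    filter_fixed += ((sample << 16) - filter_fixed) >> log_alpha
    # Floating point: the textbook EWMA update
    filter_float = alpha * sample + (1 - alpha) * filter_float
    print(filter_fixed / 65536.0, filter_float)  # the two stay in close agreement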

Distributed Decayed Counters

Let’s consider a completely different type of context: distributed web serving at scale.

Suppose you want to track something like “requests per minute” across a distributed system. You don’t want to maintain a sliding window of every request timestamp — that would be memory intensive. Instead, you can use a decayed counter:

import time

class DecayedCounter:
    def __init__(self, alpha=0.1):
        self.count = 0.0
        self.alpha = alpha
        self.last_update = time.time()
    
    def add(self, value=1):
        now = time.time()
        # Calculate time elapsed since last update
        dt = now - self.last_update
        
        # Decay factor based on time elapsed
        decay = (1 - self.alpha) ** dt
        
        # Decay the old count and add the new value
        self.count = self.count * decay + value
        self.last_update = now
        
        return self.count

This counter will automatically decay over time if no new events are added, giving more weight to recent events.

What’s the difference between this and the firmware sensor smoothing example we discussed earlier? Nothing. It’s the same math.

Distributed Key-Value Store

Using a distributed key-value store you can compute online moving averages keyed on (say) user ID to track average event counts across hundreds of millions of users. I’ve personally created literally billions of low-pass filters this way.

The idea is that per user ID you atomically store a “count” and the time of the last update.

When you increment the counter, you assume no other increments have come in since the last update, and decay the previous count according to the elapsed time (based on the half-life) before incrementing.

When reading the value, you also assume no increments have happened since the last updated time, and discount the value according to the half-life as you do when incrementing.

Pseudocode:

def atomicUpdateDecayedCounter(key, increment, decayFactor):
    # Attempt to update until successful
    while True:
        # Read the current value and timestamp atomically
        (currentValue, lastUpdateTime) = atomicRead(key)
        
        # Calculate the decayed value based on the time elapsed since the last update
        timeElapsed = currentTime() - lastUpdateTime
        decayedValue = currentValue * (decayFactor ** timeElapsed)
        
        # Calculate the new value by applying the increment to the decayed value
        newValue = decayedValue + increment
        
        # Attempt to update the value with compare-and-swap (CAS)
        # This ensures atomicity in a distributed environment
        success = compareAndSwap(key, 
                                (currentValue, lastUpdateTime), 
                                (newValue, currentTime()))
        
        # If the update was successful, exit the loop
        if success:
            break
            
        # If unsuccessful, another process updated the value;
        # the loop will retry with the freshly read value
    
    return newValue
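
The read path works the same way. Here is a matching sketch, using the same hypothetical atomicRead and currentTime helpers as above:

def readDecayedCounter(key, decayFactor):
    # Read the stored value and its timestamp
    (currentValue, lastUpdateTime) = atomicRead(key)

    # Discount the stored value for the time that has passed with no increments
    timeElapsed = currentTime() - lastUpdateTime
    return currentValue * (decayFactor ** timeElapsed)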

The amount of data you need per user is tiny: just a 32-bit float and a 32-bit integer.

Machine Learning Applications

In machine learning, EWMAs appear in various forms:

  1. Online feature engineering: When processing streaming data, you often need features that capture trends over time without storing the entire history. EWMA solves this elegantly.

  2. Rate limiting and anomaly detection: Detecting unusual patterns in user behavior or system metrics often uses EWMA-based algorithms.

  3. Optimization algorithms: The popular Adam optimizer for gradient descent uses EWMAs of both the gradient (its momentum) and the squared gradient; see the sketch after this list.
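
As a rough sketch (plain Python, not tied to any particular ML library), the two EWMAs inside a single Adam update look something like this:

import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # EWMA of the gradient (first moment, i.e. momentum)
    m = beta1 * m + (1 - beta1) * grad
    # EWMA of the squared gradient (second moment)
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for initializing both EWMAs at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v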

For example, calculating a user’s average session time as an online feature:

# In practice this would live in a distributed key-value store; a dict works for illustration
user_features = {}

def update_user_features(user_id, session_duration):
    # Get the current average, or initialize with the first observation for a new user
    current_avg = user_features.get(user_id, {}).get('avg_session_time', session_duration)
    
    # Update with EWMA (alpha = 0.1)
    new_avg = 0.1 * session_duration + 0.9 * current_avg
    
    # Store back in the user features
    if user_id not in user_features:
        user_features[user_id] = {}
    user_features[user_id]['avg_session_time'] = new_avg

This approach scales to hundreds of millions of users since you only need to store the current average for each user, not their entire history.

Why EWMA Can be Better than Windowed Approaches

When building machine learning systems at scale, you face a choice between a few different approaches to summarizing activity over time:

  1. Lambda architectures - These combine batch processing (accurate but slow) with stream processing (fast but less accurate), keeping two separate data paths

  2. Windowed approaches - These maintain a fixed sliding window of recent data points (last N events or last T time units)

  3. EWMA - Our elegant little formula that keeps just the current value

Here’s why EWMA is often the most effective solution:

Memory efficiency: A windowed approach storing the last 100 events requires 100x more storage than EWMA. For a system with millions of users and dozens of features, this becomes prohibitive quickly.

No arbitrary cutoffs: With a window, the oldest data point suddenly drops out, creating potential discontinuities. With EWMA, influence decays smoothly.

Adaptable memory horizon: By adjusting α, you can effectively change how much history matters without changing your data structures or processing logic.

Perfect for distributed systems: With distributed computing, you want minimal state transfer between nodes. EWMA requires transferring just a single value per feature, making it perfect for systems like real-time streaming frameworks.

More maintainable code: A windowed implementation, whether fully online, batch-based, or split across a lambda architecture, is far more complicated to build and maintain. In contrast, decayed counters are so simple there aren’t many places for bugs to hide.

Truly real-time: Unlike batch-based components of lambda architectures, EWMA updates are fully real-time. Each new event immediately influences the output value with no delay waiting for batch windows to complete.

This approach means you get an accurate real-time value even for sparsely or irregularly updated metrics, even with extremely high cardinality keys. It’s particularly useful for features like “user activity level” that should naturally decay if a user hasn’t been active recently.

I once migrated a recommendation system from a windowed approach to EWMA and reduced our feature store size by 95% while actually improving prediction accuracy. The simpler approach won on efficiency, effectiveness, and most importantly, maintainability. After six months, our on-call incidents related to feature generation had dropped to zero.

Hardware Low-Pass Filters (RC Networks)

We do the same thing in hardware with just a resistor and a capacitor.

An RC filter is just a resistor and a capacitor connected in series. Here’s how it works in plain terms:

Think of a capacitor like a tiny bucket that can store electrical charge. When you apply voltage to an RC circuit:

  • The capacitor doesn’t fill up instantly—the resistor restricts how quickly current can flow, causing the capacitor to fill gradually
  • If the input voltage suddenly drops, the capacitor doesn’t empty immediately—it slowly discharges through the resistor
  • The voltage across the capacitor at any moment represents a blend of the current input voltage and its previous state

This creates a natural smoothing effect. For example, if you feed a noisy signal into an RC filter:

  1. Brief spikes don’t have enough time to fully charge the capacitor before they disappear
  2. The capacitor’s voltage represents an average that gives more weight to recent voltage levels
  3. The longer a voltage level is maintained at the input, the more influence it has on the output

The size of the resistor and capacitor determine how quickly the circuit responds to changes—exactly like our alpha parameter in software EWMA. A smaller resistor or capacitor makes the circuit respond more quickly to changes (like a higher alpha), while larger values create more smoothing (like a lower alpha).

If you were to sample the voltage across the capacitor at regular intervals, the values you’d read would follow the exact same pattern as our software EWMA formula. The precision is set by the number of electrons the capacitor can store rather than by the number of bits in the float or int we would use in software.
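
To make that correspondence concrete, here is a small Python sketch (my own illustration, not part of the circuit demo below): sampling the RC filter every dt seconds gives an EWMA whose alpha is set by R, C, and dt:

import math

def sampled_rc_filter(input_samples, R, C, dt):
    # Discrete equivalent of the RC low-pass filter; alpha ≈ dt / (R*C) when dt << R*C
    alpha = 1.0 - math.exp(-dt / (R * C))
    v_cap = 0.0  # voltage across the capacitor
    outputs = []
    for v_in in input_samples:
        # Identical in form to the software EWMA update
        v_cap = alpha * v_in + (1.0 - alpha) * v_cap
        outputs.append(v_cap)
    return outputs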

Here’s a simulation of this type of circuit in falstad.

Truly Scalable: From Microcontrollers to Distributed Systems

What’s the difference between these completely different approaches to smoothing and averaging? At their core, nothing.

The beauty of EWMA is that its memory requirements are constant and small, regardless of how much history you’re effectively incorporating:

  • In a microcontroller, you just need 4-8 bytes per filter (or even less with fixed-point arithmetic)
  • In a distributed key-value store with petabytes of storage, you can track billions of time series with just a single value each
  • For a machine learning feature store, you can maintain rich time-based features without an explosion in storage requirements

Demo: 3D smoothing

This demo simulates 3D accelerometer noise. You can enable and tune smoothing to see the difference. If you are reading this on mobile, I’ve made it so that you can use the real, unfiltered accelerometer readings from your phone as input!

Notice how “jittery” it is without smoothing: user interfaces based on the raw signal would be annoyingly unstable.

Special Cases: Smoothing Circular Values

What if you wanted to calculate a moving average of an angle?

The problem is angles wrap around. 359° is only one degree away from 0°.

How do you compute a moving average of that?

The Sine-Cosine Trick

The solution is to separately smooth the sine and cosine components of the angle, then recombine them.

Store separate moving averages of the sine and cosine of the angle, then when you need the angle combine them using atan2().
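
A minimal sketch of the trick in Python (the class name is mine):

import math

class CircularEWMA:
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.sin_avg = 0.0
        self.cos_avg = 0.0

    def update(self, angle_radians):
        # Smooth the sine and cosine components independently
        self.sin_avg = self.alpha * math.sin(angle_radians) + (1 - self.alpha) * self.sin_avg
        self.cos_avg = self.alpha * math.cos(angle_radians) + (1 - self.alpha) * self.cos_avg
        # Recombine into an angle; atan2 handles the wrap-around correctly
        return math.atan2(self.sin_avg, self.cos_avg)

Feeding in angles of 359° and 1° (converted to radians) produces an average near 0°, rather than the naive arithmetic answer of 180°.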

Time of Day Example

Suppose you want to keep a moving average of the time of day when a user typically logs in.

The problem is 11:59:59 PM wraps around to 0:00:00 AM, so how do you calculate an average time of day?

The solution is to convert the time of day into an angle (e.g. (day_seconds / 86400) * 2 * pi) and use the sine and cosine trick. The same trick can be used to featurize time-of-day and similar calendar metrics for machine learning.
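
For example, a small sketch of the time-of-day conversion described above:

import math

def time_of_day_features(day_seconds):
    # Map the time of day onto the unit circle so 23:59:59 and 0:00:00 end up adjacent
    angle = (day_seconds / 86400) * 2 * math.pi
    return math.sin(angle), math.cos(angle)

These two components can then be smoothed with the same sine-cosine EWMA shown earlier, or fed directly into a model as features.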

Conclusion

It’s remarkable how often the EWMA pattern appears across different technical domains. The same core algorithm works equally well for smoothing sensor readings on a microcontroller, filtering signals in analog circuits, or handling massive distributed data streams.

This convergence isn’t just a mathematical curiosity—it’s practical engineering. When an algorithm is simple, effective, and requires minimal state, it tends to show up repeatedly as different fields independently discover or rediscover the same solution.