TransWikia.com

How should a moving average handle missing data points?

Signal Processing Asked on January 20, 2021

I’m writing a program that averages the user’s weight across different days. I’m planning to use a 5-point moving-average (current day, two before and two after). Sometimes, a data point is missing for 1-2 days. How are these cases usually handled?

(if there’s a better low-pass filter I could use, I’d love suggestions)

5 Answers

As a general impression, regression would probably do a better job of filling in the missing points automatically than the moving-average filter you have chosen.

If you use an AR (autoregressive) or ARMA filter, you can predict the value of a sample from past inputs.

$$ \hat{X}[i] = \sum_{k} \omega_{k} \, x[i-1-k] + \eta $$

where $\hat{X}[i]$ is the predicted value.

Specifically, in your case, say you know that the person's weight lies in a known range between $X_{min}$ and $X_{max}$. If you don't have the value $x[i-1]$, substitute it twice, once with $X_{min}$ and once with $X_{max}$. With the available model this gives two extreme-case results for $\hat{X}[i]$, and you can choose something between them.
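As a rough sketch of the prediction formula and the min/max substitution described above, assuming the weights $\omega_k$ and the offset $\eta$ have already been estimated somehow. The function names `ar_predict` and `ar_predict_bounds` are made up for illustration and are not from any library.

```rust
/// One-step AR prediction: X_hat[i] = sum_k w[k] * x[i-1-k] + eta.
/// `history` is ordered oldest to newest, so its last element is x[i-1].
/// If `history` is shorter than `weights`, the extra weights are ignored.
fn ar_predict(history: &[f64], weights: &[f64], eta: f64) -> f64 {
    weights
        .iter()
        .zip(history.iter().rev())
        .map(|(w, x)| w * x)
        .sum::<f64>()
        + eta
}

/// Bracket the prediction when x[i-1] is unknown by substituting the
/// plausible extremes `x_min` and `x_max` for it.
/// `older` holds the samples before x[i-1], oldest to newest.
fn ar_predict_bounds(
    older: &[f64],
    weights: &[f64],
    eta: f64,
    x_min: f64,
    x_max: f64,
) -> (f64, f64) {
    let mut with_min = older.to_vec();
    with_min.push(x_min);
    let mut with_max = older.to_vec();
    with_max.push(x_max);
    let a = ar_predict(&with_min, weights, eta);
    let b = ar_predict(&with_max, weights, eta);
    // Sort the pair so the result is (lower, upper) even for negative weights.
    (a.min(b), a.max(b))
}
```

The two bounds only bracket the prediction; picking a value between them (the midpoint, say) is still a judgment call.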

There are various other alternatives - you can keep

$$\hat{X}[i] = X[i-1]$$ or $$\hat{X}[i] = \text{long-term sample average of } X$$

Essentially, it is a game of predicting the missing value and then using the prediction as part of the signal. Of course, the prediction won't be the same as the original sample, but that's the price you pay for not having the data.
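The two simple substitutions (previous value, or long-term average) could look roughly like this. It is only a sketch: missing samples are represented as `None`, and the names `fill_with_last` and `fill_with_mean` as well as the `fallback` parameter are assumptions made for the example.

```rust
/// Replace each missing sample (None) with the most recent observed value.
/// Leading gaps, which have no earlier value, use `fallback` instead.
fn fill_with_last(data: &[Option<f64>], fallback: f64) -> Vec<f64> {
    let mut last = fallback;
    data.iter()
        .map(|&x| {
            if let Some(v) = x {
                last = v;
            }
            last
        })
        .collect()
}

/// Replace each missing sample with the mean of all observed samples.
/// Returns None if nothing was observed at all.
fn fill_with_mean(data: &[Option<f64>]) -> Option<Vec<f64>> {
    let observed: Vec<f64> = data.iter().flatten().copied().collect();
    if observed.is_empty() {
        return None;
    }
    let mean = observed.iter().sum::<f64>() / observed.len() as f64;
    Some(data.iter().map(|&x| x.unwrap_or(mean)).collect())
}
```

Either output can then be fed to an ordinary moving average, since it no longer contains gaps.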

Correct answer by Dipan Mehta on January 20, 2021

I needed this as well; thanks all for your answers. I wrote a function that takes a vector (v) and a window of weights (w) and applies w at each element of v. Two constraints are checked at each position: first, the total number of missing values in the window; second, the sum of the window weights that correspond to the missing values. If either of the two exceeds its threshold, NaN is pushed into the resulting vector and the function moves on to the next position. Otherwise, enough information is present to determine the value, and the result is a simple weighted moving average over the available values. Note that the code quality can surely be improved; I'm not a programmer and this is still a work in progress.

pub fn mavg(v: &[f64], w: &[f64], max_missing_v: usize, max_missing_wpct: f64) -> Vec<f64> {
    let len_v: i32 = v.len() as i32;
    let len_w: i32 = w.len() as i32;
    assert!(
        len_w < len_v,
        "the moving average window must be shorter than the vector"
    );
    assert!(
        len_w % 2 == 1,
        "the moving average window has an even number of elements, it should be odd"
    );
    let side: i32 = (len_w - 1) / 2;
    let sum_all_w: f64 = w.iter().sum();
    // Largest total window weight that may fall on missing values.
    let max_missing_w: f64 = sum_all_w * max_missing_wpct / 100.;
    let mut vout: Vec<f64> = Vec::with_capacity(len_v as usize);
    for i in 0..len_v {
        let mut missing_v = 0;
        let mut missing_w = 0.;
        let mut sum_ve_we = 0.;
        let mut sum_we = 0.;
        let mut ve: f64;
        let vl = i - side;
        let vr = i + side + 1;
        for (j, we) in (vl..vr).zip(w.iter()) {
            if (j < 0) || (j >= len_v) {
                missing_v += 1;
                missing_w += we;
            } else {
                ve = v[j as usize];
                if ve.is_nan() {
                    missing_v += 1;
                    missing_w += we;
                } else {
                    sum_ve_we += ve * we;
                    sum_we += we;
                }
            }
            if missing_v > max_missing_v {
                sum_ve_we = f64::NAN;
                println!(
                    "setting to NAN: {} missing data, limit is {}",
                    missing_v, max_missing_v
                );
                break;
            } else if missing_w > max_missing_w {
                sum_ve_we = f64::NAN;
                println!(
                    "setting to NAN: {} missed window weight, limit is {}",
                    missing_w, max_missing_w
                );
                break;
            }
        }
        vout.push(sum_ve_we / sum_we);
    }
    vout
}

Answered by Peruz on January 20, 2021

I think the simplest way would be to "predict" the data for the "hole" in the time series using the data that came before. Then you can use this completed time series for parameter estimation. (You could then re-predict the missing values using the parameters estimated from the completed time series, and repeat this until they converge.) You should derive the confidence bounds from the number of real data points you have, though, and not from the length of the completed series.
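One possible reading of this predict / estimate / re-predict loop, sketched under several assumptions that are not in the answer itself: the model is a mean-removed AR(1), holes start out as the last observed value, and the loop stops after `max_iter` passes or when no hole moves by more than `tol`. The name `iterative_fill` is hypothetical.

```rust
/// Iteratively fill holes (None): start from the last observed value,
/// estimate an AR(1) coefficient on the completed series, re-predict the
/// holes from their predecessors, and repeat until the values settle.
fn iterative_fill(data: &[Option<f64>], max_iter: usize, tol: f64) -> Vec<f64> {
    let observed: Vec<f64> = data.iter().flatten().copied().collect();
    if observed.is_empty() {
        // Nothing to learn from: every sample stays unknown.
        return vec![f64::NAN; data.len()];
    }
    let mean = observed.iter().sum::<f64>() / observed.len() as f64;

    // Initial guess from the data that came before (mean for leading holes).
    let mut last = mean;
    let mut filled: Vec<f64> = data
        .iter()
        .map(|&x| {
            if let Some(v) = x {
                last = v;
            }
            last
        })
        .collect();

    for _ in 0..max_iter {
        // Estimate the AR(1) coefficient phi on the completed series.
        let m = filled.iter().sum::<f64>() / filled.len() as f64;
        let mut num = 0.0_f64;
        let mut den = 0.0_f64;
        for t in 1..filled.len() {
            num += (filled[t] - m) * (filled[t - 1] - m);
            den += (filled[t - 1] - m).powi(2);
        }
        let phi = if den > 0.0 { num / den } else { 0.0 };

        // Re-predict the holes only; observed samples are never altered.
        let mut max_change = 0.0_f64;
        for t in 0..filled.len() {
            if data[t].is_none() {
                let prev = if t > 0 { filled[t - 1] } else { m };
                let new = m + phi * (prev - m);
                max_change = max_change.max((new - filled[t]).abs());
                filled[t] = new;
            }
        }
        if max_change < tol {
            break;
        }
    }
    filled
}
```

Convergence is not guaranteed in general, which is why the sketch caps the number of passes. As noted above, any confidence bounds should be based on the number of observed samples, not on the length of the returned vector.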

Answered by blabla on January 20, 2021

If you don't know some of the data, your best bet is not to average over it at all. Guessing it with linear regression and the like may help, but it may also introduce extra complexity and unintended bias into your data. I would say that if you're averaging over these five data points: $[a, b, c, ?, e]$, your answer should be

$$\frac{a+b+c+e}{4}$$
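A minimal sketch of that idea for a centered window, with missing days represented as `None`. The name `moving_average_skip_missing` and the choice to shorten the window at the two ends of the series are assumptions for the example.

```rust
/// Centered moving average that skips missing samples (None) and divides by
/// the number of samples actually present in each window.
/// `half_width` = 2 gives the 5-point window from the question.
/// A window with no observed samples at all yields None.
fn moving_average_skip_missing(data: &[Option<f64>], half_width: usize) -> Vec<Option<f64>> {
    (0..data.len())
        .map(|i| {
            // Clamp the window to the ends of the series.
            let lo = i.saturating_sub(half_width);
            let hi = (i + half_width + 1).min(data.len());
            let present: Vec<f64> = data[lo..hi].iter().flatten().copied().collect();
            if present.is_empty() {
                None
            } else {
                Some(present.iter().sum::<f64>() / present.len() as f64)
            }
        })
        .collect()
}
```

For the window `[70, 71, 72, ?, 74]` the centre value comes out as (70 + 71 + 72 + 74) / 4 = 71.75.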

Answered by Phonon on January 20, 2021

A simple and general method for filling in missing data, if you have runs of complete data, is to use
linear regression. Say you have 1000 runs of 5 in a row with none missing.
Set up the 1000 x 1 vector y and the 1000 x 4 matrix X:

y       X
wt[0]   wt[-2] wt[-1] wt[1] wt[2]
---------------------------------
68      67     70     70    68
...

Regression will give you 4 numbers a b c d that give a best match

wt[0] ~= a * wt[-2]  + b * wt[-1]  + c * wt[1]  + d * wt[2]

for your 1000 rows of data (different data will give different a b c d).
Then you use these a b c d to estimate (predict, interpolate) missing wt[0].
(For human weights, I'd expect a b c d to be all around 1/4.)

In Python, see numpy.linalg.lstsq.
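For comparison with the Rust code earlier on this page, here is a dependency-free sketch of the same fit. It solves the normal equations $(X^T X)\,c = X^T y$ directly, which is fine for a well-behaved 4-column problem but less robust than a library least-squares routine. The names `solve4`, `fit_neighbors` and `predict_missing` are invented for this example.

```rust
/// Solve the 4x4 system a * x = b by Gaussian elimination with partial
/// pivoting. Returns None if the matrix is (numerically) singular.
fn solve4(mut a: [[f64; 4]; 4], mut b: [f64; 4]) -> Option<[f64; 4]> {
    for col in 0..4 {
        // Pick the row with the largest pivot to limit round-off error.
        let mut pivot = col;
        for row in (col + 1)..4 {
            if a[row][col].abs() > a[pivot][col].abs() {
                pivot = row;
            }
        }
        if a[pivot][col].abs() < 1e-12 {
            return None;
        }
        a.swap(col, pivot);
        b.swap(col, pivot);
        // Eliminate this column from the rows below.
        for row in (col + 1)..4 {
            let f = a[row][col] / a[col][col];
            for k in col..4 {
                a[row][k] -= f * a[col][k];
            }
            b[row] -= f * b[col];
        }
    }
    // Back substitution.
    let mut x = [0.0_f64; 4];
    for row in (0..4).rev() {
        let mut s = b[row];
        for k in (row + 1)..4 {
            s -= a[row][k] * x[k];
        }
        x[row] = s / a[row][row];
    }
    Some(x)
}

/// Fit wt[0] ~= a*wt[-2] + b*wt[-1] + c*wt[1] + d*wt[2] by least squares.
/// Each row is ([wt[-2], wt[-1], wt[1], wt[2]], wt[0]) from a complete run.
fn fit_neighbors(rows: &[([f64; 4], f64)]) -> Option<[f64; 4]> {
    let mut xtx = [[0.0_f64; 4]; 4];
    let mut xty = [0.0_f64; 4];
    for (x, y) in rows {
        for i in 0..4 {
            xty[i] += x[i] * y;
            for j in 0..4 {
                xtx[i][j] += x[i] * x[j];
            }
        }
    }
    solve4(xtx, xty)
}

/// Estimate a missing wt[0] from its four neighbours.
fn predict_missing(coef: &[f64; 4], neighbors: &[f64; 4]) -> f64 {
    coef.iter().zip(neighbors).map(|(c, n)| c * n).sum()
}
```

`fit_neighbors` is called once on all the complete runs; `predict_missing` is then applied wherever the middle value of a run is absent but its four neighbours are known.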

(There are zillions of books and papers on regression, at all levels. For the connection with interpolation, though, I don't know of a good introduction; anyone?)

Answered by denis on January 20, 2021
