Friday, March 15, 2019

How does filtering change the degrees of freedom in time series data?

Sanity Checks are missives on a specific math point in need of clarification. I try to do so using the fewest words possible. Usually, this is still quite a few words.

Today's Sanity Check is brought to you by Google. Or not brought to you by Google as it were.

What it didn't bring was the gorram required information. To me. When I typed the post title as a search phrase.

I was attempting to suss how the degrees of freedom should be adjusted when testing the correlation of two time series signals if the data are filtered. That ain't exactly rocket science -- as we shall see, the answer is rather intuitive -- but it is some kind of science, and in science it's simply not cricket to guess. So Google I did.

Yikes.

Here is a question that has truly fallen through the cracks of the internet. I wasn't expecting a Spanish Inquisition, which was fortunate because what I got was no inquisition at all. There was exactly one hit I would call relevant in the pile of Google offal I obtained. But, alas, my attempts to decrypt the author's fingerbarfings led nowhere, perhaps because I lacked the requisite decoder ring but more because the explanatory figure she proffered was less an explanatory figure and more a collection of lines with some text underneath. And so here we are.

Drink more Ovaltine, indeed.

Note to the impatient: You can skip to the end for a summary, but in so doing you will miss out on much explanatory background and charming (or not) wiseassery.



Time series is the Boo Radley of statistics, a neighborhood misfit longing for acceptance but forever shunned by polite society by virtue of an extra chromosome. That extra chromosome is signal processing from the statistico's point of view, or statistics from a signal processing wonk's perspective. Temporally-ordered data knocks the wheels off any number of standard statistical assumptions, and temporal order is implied right there in the name (n.b. time series).

What we require to inquire after such beasties is a marriage of statistics and signal processing, as might be described in a magical textbook titled, say, Statistical Signal Processing. In fact, there exists such a volume -- a collection of three, actually -- comprising a very nice treatment by Steven Kay. I'm told it's something of a classic in the field, which might explain why a copy on Amazon costs more than a mortgage payment and furthermore explains why I do not own one. What can I say? My economic priorities go food, shelter, and insulin; any leftover jingle competes with red wine and bribing whichever pirate ISP is keeping the lights on at labkitty.com this month.

But I digress.

I will assume you know what a Nyquist frequency is; if not, go to A LabKitty Primer on the Fourier Transform (Part II) and make with the learning. Furthermore, I will assume you know what a correlation is. Briefly, assume we've made a measurement on two collections of n objects (or two measurements on a single collection of n objects). To obtain the correlation coefficient: 1) Multiply all corresponding data points together after subtracting the respective means, add, and divide by n (or n-1, to be pedantic). That's not correlation, that's covariance. Ergo, 2) Divide by the square root of the product of the variance of each sample. That's correlation.

In equation form:

    r = [ Σ (x_k − x̄)(y_k − ȳ) ]  /  [ √( Σ (x_k − x̄)² ⋅ Σ (y_k − ȳ)² ) ]

Here, x = { x_1, x_2, ..., x_n } and y = { y_1, y_2, ..., y_n } are the two samples, x̄ and ȳ (read "x-bar" and "y-bar") are the means of x and y, respectively, and the sums run from 1 to n. Note all of the dividing by n (or n-1, to be pedantic) cancels and so is omitted. It's neigh impossible for a horse to remember this equation, so here's a handy mnemonic: Convert the data to Z-scores then compute their covariance. Viola! You get the same number.
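
If you'd rather let a computer check the arithmetic, here is a minimal sketch in Python (the toy numbers are made up for illustration) that computes r both ways: via the formula above and via the Z-score mnemonic.

    import numpy as np

    # Toy data (hypothetical): two samples of n = 8 paired measurements
    x = np.array([2.1, 3.4, 1.8, 4.0, 3.3, 2.9, 3.7, 2.2])
    y = np.array([1.0, 2.2, 0.9, 2.8, 2.5, 1.9, 2.6, 1.1])

    # Route 1: the formula -- covariance over the product of spreads
    xd, yd = x - x.mean(), y - y.mean()
    r_formula = np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

    # Route 2: the mnemonic -- convert to Z-scores, then take their covariance
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    r_mnemonic = np.sum(zx * zy) / (len(x) - 1)

    print(r_formula, r_mnemonic, np.corrcoef(x, y)[0, 1])  # all three agree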

Footnote: Note covariance depends on the scaling of the data. For example, you'll get one number if your data are measured in millimeters and another if in furlongs. Dividing by the square root of the product of the variances (i.e., the denominator in the equation) normalizes the data. Ergo, correlation doesn't depend on scaling. That makes correlation a more meaningful description than covariance.

Footnote: If you square r, you get something called r² (or the coefficient of determination, if you're not into the whole brevity thing). This quantifies the predictive power of the correlation. More specifically, r² tells you how much of the variance in one variable is "explained" by a straight-line relationship with the other. In your mind's eye, imagine a scatterplot with a line fit through it. If the data form a tight cloud around the line, you have a big value of r². The line explains most of the variance (i.e., spread) of the data (r² = 1 in the limit, if the data points all lie exactly on the line). OTOH, if the data form a big cloud around the line, the line doesn't explain much of the variance. That is, you can be very wrong about the value of a data point at a given x if you use the value of the line at x as the predictor. Your coefficient of determination is small (r² = 0 in the limit, meaning a linear relation has no predictive value).
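
To make "variance explained" concrete, here is a quick numerical check (a sketch; the noise level and variable names are arbitrary choices of mine): fit a line, compute r² as one minus the residual variance over the total variance, and confirm it matches the squared correlation coefficient.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # a line plus noise

    # Fit a straight line and form its predictions
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept

    # r-squared as "variance explained": 1 - (residual variation / total variation)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)  # same number, two routes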

Footnote: "The amount of variance explained" is a horrid phrase, but one that has become entrenched in the statistics lexicon. You just have to accept it. It'll hurt less if you don't struggle, the bully assures us.

Footnote: Nerds go bananas over r² in papers, although nobody will ever define what a "good" value is. Consider yourself warned.

We now have a correlation coefficient. Like any statistic, we want to test whether its value is significant. There are several approaches for doing so. The general idea is the familiar broken record of statistical testing: value of thing divided by the standard error of the thing. There's a t-test version and an F-test version and a Z-test version involving logarithms (Fisher's transformation). The formulas are easy enough to look up and I hope you will not feel cheated if I don't provide them here. What is important is that all of these formulas require the degrees of freedom -- it goes into the calculation of the standard error.
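
That said, here is what the t-test flavor looks like in code, if only to show where the degrees of freedom sneak in. A minimal sketch, assuming scipy is available; the function name is my own invention.

    import numpy as np
    from scipy import stats

    def correlation_t_test(r, dof):
        """Test a correlation coefficient r given dof degrees of freedom.

        Returns the t statistic and a two-sided p-value. The point of this
        post: dof must reflect any temporal structure and filtering in the
        data, not just (number of data points - 2).
        """
        t = r * np.sqrt(dof) / np.sqrt(1.0 - r ** 2)
        p = 2.0 * stats.t.sf(abs(t), dof)
        return t, p

    # Example: r = 0.4 looks quite significant with 98 DOF...
    print(correlation_t_test(0.4, 98))
    # ...but not so much if filtering has whittled the DOF down to 10
    print(correlation_t_test(0.4, 10))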

You may recall the degrees of freedom in correlation is the number of data points minus two. But that is for vanilla correlation wherein the data have no temporal structure. Examples: Consider the correlation of punter height (X) with punter weight (Y), or the correlation of student hours studied (X) with student course grade (Y), or the correlation of page rank of a blog (X) with the number of squirrels the blog owner mails to Google HQ (Y). It doesn't matter how we shuffle the data -- we get the same scatterplot as long as we keep the X and Y pairings the same.

That is not true with time series data. In your mind's eye imagine a sine wave. You are probably imagining a wavy continuous line, but we're dealing with sampled data, so replace the line with discrete points. Here comes the boom: If we randomly shuffle the points in time, we destroy the sine wave. The temporal structure is important. That wasn't true in the examples of height versus weight, or study time versus grade, or page ranking versus squirrels.
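
Here is that shuffling experiment in code (a sketch; the lag-one autocorrelation is just one convenient way to quantify "temporal structure"): the mean and variance survive the shuffle, but the sine wave does not.

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.arange(200) / 200.0                # one second sampled at 200 Hz
    wave = np.sin(2 * np.pi * 5 * t)          # a sampled 5 Hz sine wave
    shuffled = rng.permutation(wave)          # same points, random order

    def lag1_autocorr(x):
        """Correlation of a signal with itself shifted by one sample."""
        return np.corrcoef(x[:-1], x[1:])[0, 1]

    # Mean and variance don't care about temporal order...
    print(wave.mean(), shuffled.mean())       # identical
    print(wave.var(), shuffled.var())         # identical

    # ...but the temporal structure is obliterated
    print(lag1_autocorr(wave))                # close to +1: neighbors agree
    print(lag1_autocorr(shuffled))            # close to 0: neighbors unrelated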

That temporal structure comes with a statistical price: It reduces the number of degrees of freedom in the data. Note carefully: reduces the degrees of freedom from the get-go. We haven't even gotten to filtering yet.

This issue opens a thoroughly Brobdingnagian statistical can of worms. The reduction in degrees of freedom due to temporal structure is typically quantified using autocorrelation and other unpleasantness. These are important and occasionally interesting topics, but they are not fish we are looking to fry. A proper treatment may appear in these pages someday, but today I will cut to the chase: Assume we know the degrees of freedom of a time series. How does that number change after we filter the data?

We can think about three types of filters: lowpass, highpass, and bandpass. For our purposes the distinction is unimportant. What is important is the bandwidth of the filter -- that is, the range of frequencies that are passed (retained) by the filter.

For example, suppose we have data containing frequencies from 0 (i.e., DC) to 1 kHz (i.e., 1000 Hz). If we lowpass filter below 500 Hz, or highpass filter above 500 Hz, or bandpass filter from 250-750 Hz, we have removed half of the original frequencies present (and kept the other half). It doesn't matter how the removal was done (low-, high-, or bandpass); only the width of the band that survives matters for our discussion.

So, explaining via hypothetical, suppose we begin with a time series containing 100 points. Applying the standard correlation DOF formula (n-2), this data contains 98 degrees of freedom (I am ignoring the autocorrelation adjustment). Apply any of the filters described in the previous paragraph that removes half of the frequencies present. A sensible guess is that we removed half of the degrees of freedom (98 => 49). Is that correct?

Probably not.

The Stats 101 definition of correlation DOF (i.e., n-2) is not meaningful here. The data live in the time domain, but a filter is described in the frequency domain. As such, we need to compute how many frequency-domain degrees of freedom a given time series provides. Short answer: The number of frequency DOF is the number of distinct frequency components (from DC up to the Nyquist frequency) appearing in the discrete Fourier transform of the data. That sounds terribly complicated. Let's unpack it.

Time series data has a sampling rate. And if data has a sampling rate, it has an associated Nyquist frequency (Ny). To wit: Ny = Fs/2, where Fs is the sampling rate in Hz. We can also express this as Ny = 1/(2⋅Ts), where Ts is the time interval between samples in seconds.

When we Fourier transform data, you know (or you should know) the frequencies represented span zero (DC) to Ny. Furthermore, you know (or you should know) the frequency resolution Δf (i.e., the frequency band represented by each Fourier coefficient) is the inverse of the sample duration. If you want finer frequency resolution, you must sample for a longer time. If the sample period is Ts, then the total sample duration is n⋅Ts, where n is the number of samples. Hence, the number of frequency DOF is given by:

   Ny/Δf = [ 1/(2⋅Ts) ] / [ 1/(n⋅Ts) ] = n/2

That is, the number of "frequency DOF" represented in a time series of length n is simply n/2.
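
You can watch this bookkeeping happen in the discrete Fourier transform itself. A quick numerical check (the sampling rate and length are arbitrary choices of mine):

    import numpy as np

    n, fs = 100, 1000.0          # 100 samples at 1 kHz
    ts = 1.0 / fs                # sample period (seconds)
    nyquist = fs / 2.0           # = 1/(2*Ts) = 500 Hz
    delta_f = 1.0 / (n * ts)     # frequency resolution = 10 Hz

    print(nyquist / delta_f)     # = n/2 = 50 frequency DOF

    # Same count from the FFT of a real signal: the bins span DC to Nyquist
    freqs = np.fft.rfftfreq(n, d=ts)
    print(len(freqs))            # n/2 + 1 = 51 (n/2 frequencies above DC, plus the DC bin)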

Now apply a filter. We characterize a filter by its bandwidth -- a lowpass filter below 500 Hz, a highpass filter above 500 Hz, and a bandpass filter from 250-750 Hz all have a characteristic bandwidth of 500 Hz. To compute the DOF reduction we express the filter bandwidth as a fraction of the Nyquist frequency. If the Nyquist frequency is 1 kHz (i.e., the data were sampled at 2 kHz), a 500 Hz filter reduces the DOF by a factor of 500/1000 = 0.5. If the Nyquist frequency is 5 kHz (i.e., the data were sampled at 10 kHz), a 500 Hz filter reduces the DOF by a factor of 500/5000 = 0.1. And if the Nyquist frequency is 100 Hz (i.e., the data were sampled at 200 Hz), none of these filters makes sense, because they're defined in terms of frequencies beyond Ny, and the data cannot contain those.

Finally, to get the actual number of post-filter DOF, multiply n/2 by the reduction factor. Easy peasy.
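
Putting the whole recipe in one place -- a minimal sketch with names of my own invention, subject to the ideal-filter caveats in the footnote below:

    def post_filter_dof(n, fs, passband_hz):
        """Approximate degrees of freedom after filtering a time series.

        n           -- number of samples
        fs          -- sampling rate in Hz
        passband_hz -- width of the band the filter passes, in Hz

        Pre-filter frequency DOF is ~n/2; the filter keeps the fraction
        (passband / Nyquist) of it. Assumes an ideal filter and ignores any
        autocorrelation already present in the unfiltered data.
        """
        nyquist = fs / 2.0
        if passband_hz > nyquist:
            raise ValueError("filter band extends beyond the Nyquist frequency")
        return (n / 2.0) * (passband_hz / nyquist)

    # 100 points sampled at 2 kHz (Ny = 1 kHz), filter passing 500 Hz of band:
    print(post_filter_dof(100, 2000.0, 500.0))   # 50 * 0.5 = 25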

Footnote: This result is only valid for an ideal filter. Specifically, 1) the transition between the pass-band and stop-band is assumed to be instantaneous (in signal processing lingo: the filter has an "infinitely steep skirt") and 2) the filter causes no wiggles in the output that might introduce new temporal correlations (in signal processing lingo: "the pass-band is infinitely flat"). We may as well throw in 3) the filter has zero phase distortion (in signal processing lingo: "the filter has zero phase distortion") while we're at it and complete the ideal filter trifecta. It probably won't surprise you to learn that perfect filters don't exist in the real world. Think of what we did here as a quick-n-dirty back-of-the-envelope kind of result.

So, then, in summary:
Filtering reduces the number of degrees of freedom of a time series by the ratio of the filter bandwidth to the Nyquist frequency of the data. If our filter passes 50% of Ny, then we reduce the DOF by a factor of two. If our filter passes 10% of Ny, then we reduce the DOF by a factor of ten. For the purposes of this calculation, the original (i.e., pre-filter) number of degrees of freedom in a time series with n samples is (approximately) n/2. This value ignores serial dependencies that may be present in the pre-filtered data -- a more rigorous analysis would further adjust the DOF downward using autocorrelation techniques. It also assumes the filter is ideal.
Hopefully, that answers the question that led you here.
