TimeSeries


Some thoughts from The Stubborn Mule.

 

I have commented a couple of times on other pages about the inappropriateness of certain statistical tests in the context of time series analysis, so I thought I should expand on those comments and explain what I mean. Apologies if I am teaching anyone to suck statistical eggs here.

 

Statistics

I will start with a broad description of what statistics is all about. Typically, the statistician is interested in exploring one or more quantitative variables which exhibit some randomness (either intrinsically or as a side-effect of the way the variable is measured). This exploration proceeds by analysing samples of the variable. So, if we are interested in a variable X, we will study a set of sample values x1, x2, x3,..., xn. Generally, we will do the best we can to ensure that these samples are independent of one another. For example, if X is the proportion of the population who believe in anthropogenic global warming, we could estimate X by taking a poll. If the poll is well-designed, then each person polled should represent an independent sample from the population, just like drawing a marble from the proverbial urn of blue and red marbles. The reason this is important is that we can then consider the sample values as a realisation of a series of "independent, identically distributed" (iid) random variables X1, X2, X3,..., Xn, each of which has the same probability distribution as X. Once we are in this situation, there are a lot of powerful tools from probability to draw on, many of which rely on the Central Limit Theorem, which states that (assuming the mean and variance of X are finite) the suitably scaled mean of X1, X2, X3,..., Xn converges to a normal distribution as n gets large.
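To make the polling example concrete, here is a minimal R sketch (simulated data only, nothing from this page): repeatedly draw polls of 1,000 people from a population in which 60% hold a given view and look at the distribution of the sample proportions, which the Central Limit Theorem says should be approximately normal.

# Sketch: simulated polls illustrating the Central Limit Theorem.
# Each poll samples 1000 people from a population where 60% agree.
set.seed(42)
polls <- replicate(5000, mean(rbinom(1000, size = 1, prob = 0.6)))
hist(polls, breaks = 50, main = "Sample proportions from simulated polls")
qqnorm(polls); qqline(polls)   # points close to the line => approximately normal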

 

The same approach is used when dealing with multiple variables. While X and Y may not be independent, we can still work with iid samples. So the samples X1, X2, X3,..., Xn are iid and Y1, Y2, Y3,..., Yn are iid, even though Xi and Yi may be anything but independent. As an example, we might be interested in the correlation of X and Y. It's easy enough to calculate the correlation of our samples, but the question is how accurate that estimate is. One approach is to use the Pearson product-moment coefficient (which can be done in R with the cor.test function), which will give a confidence interval for the correlation and can be used to test the hypothesis that the correlation is significantly different from zero. As usual, this test relies on the samples being iid.
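As a sketch of this test in action (with simulated data rather than anything from this page), cor.test reports the estimated correlation, a confidence interval and a p-value for the null hypothesis of zero correlation:

# Sketch with simulated data: two dependent variables, sampled as iid pairs.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)   # Y depends on X, but each (x, y) pair is an iid draw
cor.test(x, y)            # Pearson product-moment test: estimate, 95% confidence
                          # interval and p-value for zero correlation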

 

Correlation and Regression

Since I've mentioned correlation and Kieren has commented on it below, this is as good a place as any for a quick digression on correlation. On the Idiot page, Kieren used correlation to look for warming trends. The idea was that a positive correlation between the year and the temperature anomaly should indicate a warming trend. One way of interpreting this is in terms of a simple linear regression. Running a regression of two random variables X and Y gives a linear relationship

 

Y = a + βX + ε

 

where ε is the "residual" and is uncorrelated with X. The regression coefficient β is the slope of the least-squares line of best fit between X and Y. There is a close relationship between β and the correlation ρ between X and Y. In particular, we have

 

β = ρ σY / σX

 

where σY is the standard deviation of Y and σX is the standard deviation of X. Since the standard deviations of X and Y are both positive, the slope of the line of best fit is positive if and only if the correlation is positive. Another important relationship is that

 

σY² = ρ²σY² + σε²

 

So ρ² gives the proportion of the variance of Y explained by the relationship with X. In the context of regression, the square of the correlation is more commonly referred to as R².
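These identities are easy to check numerically. Here is a minimal R sketch (again using simulated data, not the temperature series) comparing the output of lm with the sample correlation and standard deviations:

# Sketch: check that beta = rho * sd(y) / sd(x) and R^2 = rho^2.
set.seed(2)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)
rho <- cor(x, y)
coef(fit)["x"]             # estimated slope beta
rho * sd(y) / sd(x)        # matches the slope
summary(fit)$r.squared     # matches rho^2
rho^2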

 

On the Idiot page, Kieren calculated the correlation between the IPCC temperature anomaly and the UAH satellite temperature data. For illustration here, I have calculated the same correlation for the full UAH series back to 1980. The correlation is 0.61, which is still positive, so the straight-line trend is upwards. The R² is therefore 0.37, so this trend over time explains just over a third of the variance of the temperature anomaly. Of course, the best way to visualise what is going on here is with a chart. The light grey line shows the least-squares line of best fit from the regression.

Figure 1. UAH Temperature Anomaly - line of best fit
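For anyone wanting to reproduce a chart like Figure 1, here is a sketch, assuming the UAH data sits in a hypothetical data frame called uah with columns year and anomaly (the actual data preparation is in the smoothing.R file mentioned at the end of this page):

# Sketch of Figure 1, assuming a hypothetical data frame 'uah' with
# columns 'year' and 'anomaly'.
fit <- lm(anomaly ~ year, data = uah)
plot(uah$year, uah$anomaly, type = "l",
     xlab = "Year", ylab = "Temperature anomaly (°C)")
abline(fit, col = "grey")            # least-squares line of best fit
cor(uah$year, uah$anomaly)           # the correlation quoted above (about 0.61)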

 

It would now be tempting to follow Kieren and attempt to determine the statistical significance of this correlation. However, the Pearson product-moment coefficient is not applicable here, which brings me to the next topic: why statistics gets harder with time series. Notice, though, that in this regression Yi is the temperature anomaly and Xi is the year. Whatever the complexities of Yi, it should be clear that the Xi are not iid. In fact, they are not even random. Like most statistical tests, this one relies on the samples being iid, which is not the case here. All is not lost though: if we think in terms of the regression rather than the correlation, there is more that can be said. More on that later.

 

Time Series

When it comes to time series, we again have sample values x1, x2, x3,..., xn, which are the values of some observed variable at a sequence of times t1, t2, t3,..., tn. However, in this case it is extremely unlikely that we can consider these values to arise from iid random variables X1, X2, X3,..., Xn. For a start, unlike in the usual statistical setup, we cannot arrange to have these variables independent. The fact that these are successive values through time is likely to impose relationships between the Xi. For example, it is often the case that values nearby in time are closely related and values further apart less so. The standard tool for getting a sense of these relationships is auto-correlation. The auto-correlation is simply the correlation between lagged versions of the sample series. So, for example, the auto-correlation of Xi with lag d is given by

 

ρd(i) = corr(Xi, Xi-d)

 

If the Xi were all iid then the auto-correlation would be 1 for d = 0 and zero for d > 0, so looking at the auto-correlations can give an indication of how far from independent the samples are. If we are lucky, this auto-correlation will not depend on i, which leads to the next problem that often arises with time series: not only are the samples likely to have dependencies, they may not even have the same distribution, so both of the "i"s of iid can fail to hold. Share prices are a good example. As prices tend to rise over time, if Xi represents the share price of, say, BHP then both the mean and the variance of Xi will tend to increase over time.
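As a quick sketch of what lagged correlation looks like in R (simulated series only), the lag-d auto-correlation can be estimated directly, and comparing iid noise with a serially dependent series shows the difference (the acf function used later does essentially this, with a slightly different scaling):

# Sketch: estimate the lag-d auto-correlation of a sample series directly.
lag_cor <- function(x, d) {
  n <- length(x)
  cor(x[(d + 1):n], x[1:(n - d)])
}
set.seed(3)
iid_noise <- rnorm(500)
ar1 <- arima.sim(model = list(ar = 0.8), n = 500)  # serially dependent series
lag_cor(iid_noise, 1)   # close to zero, as expected for iid samples
lag_cor(ar1, 1)         # close to 0.8: nearby values are strongly related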

 

Faced with stubbornly non-iid sample variables, the statistician may try to modify the time series to get something that is closer to being iid. In the case of share prices, the share's returns (often calculated as the difference of the log of the price) are much closer to being iid than the original share price, although even then complexities can arise. In fact, differencing is a common technique for removing trends from a time series. Even if it does not result in independent samples, at least the distributions may be closer to being identical. If the distributions of the Xi are the same and, furthermore, the joint distribution of X1, X2, X3,..., Xk is the same as that of X1+j, X2+j, X3+j,..., Xk+j for any j, then the time series is called stationary. Being stationary is a very strong condition and often we have to make do with weakly stationary time series, which means that all of the Xi have the same mean and variance and the auto-correlation ρd(i) does not depend on i.
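Here is a minimal sketch of the differencing idea, using a simulated price series rather than actual BHP data:

# Sketch: differencing a trending series (simulated 'share price' only).
set.seed(4)
price <- 10 * exp(cumsum(rnorm(1000, mean = 0.0005, sd = 0.01)))
returns <- diff(log(price))    # log returns: much closer to iid
plot(price, type = "l")        # trending, clearly non-stationary
plot(returns, type = "l")      # roughly constant mean and variance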

 

A general approach is to model the time series as

 

Xi = modeli + residuali

 

where the model may involve another time series (as in modelling temperatures in terms of CO2 or other radiative forcings) or lagged values of the time series itself, as in the case of the popular auto-regressive integrated moving average (ARIMA) approach. The hope would be that the residual would then be close enough to iid, ideally normally distributed with zero mean, for some useful statistical tools to come into play. Needless to say, settling on a choice of model and then estimating the parameters tends to require a lot of data. As a general rule, when working with time series you need a lot more data than in other statistical applications before you can start saying very much with confidence.
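As a sketch of what fitting such a model might look like in R (a toy simulated series and an arbitrarily chosen model order, not a recommendation for the temperature data):

# Sketch: fit an ARIMA model to a toy series and check the residuals.
set.seed(5)
x <- arima.sim(model = list(ar = 0.7), n = 300) + 0.01 * (1:300)  # trend + AR noise
fit <- arima(x, order = c(1, 1, 1))        # order chosen purely for illustration
resids <- residuals(fit)
Box.test(resids, type = "Ljung-Box")       # are the residuals close to iid?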

 

Getting back to auto-correlation, it is worth having a look at the auto-correlations for the temperature data Kieren has been analysing. The chart below shows the auto-correlation of the IPCC temperature anomaly for a variety of lags (note that here I have used data from 1832, which corresponds to the period when CO2 data is available; the results are very similar if the whole anomaly series is used back to the year 900). The dashed blue lines show the 95% confidence interval around zero based on the hypothesis that the series is iid. It is very clear that the IPCC temperature anomaly has serial dependence. To produce a chart like this in R, use the acf function.

Figure 2. Auto-correlation of IPCC temperature anomaly (1832-2000)
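Here is a sketch of how Figure 2 might be produced, assuming the anomaly series sits in a hypothetical numeric vector called ipcc_anomaly (see smoothing.R for the real code):

# Sketch of Figure 2, assuming a hypothetical vector 'ipcc_anomaly'.
acf(ipcc_anomaly, lag.max = 20,
    main = "Auto-correlation of IPCC temperature anomaly")
# The dashed lines drawn by acf are the 95% bounds around zero under the
# hypothesis that the series is iid (white noise).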

 

While the answer is obvious in this case, it can sometimes be difficult to assess the significance of an auto-correlation plot, but there is a statistical test that can help. The Ljung-Box test essentially adds up the squares of the auto-correlations for a number of different lags. If the result is too large, you can conclude with reasonable confidence that the auto-correlations differ from zero. This test can be performed in R using the Box.test function and setting type="Ljung-Box". Here are the results for the IPCC temperature anomaly:

 

X-squared = 150.8535, df = 1, p-value < 2.2e-16

 

The vanishingly small p-value shows how far from iid the data is. The Ljung-Box test for the UAH temperature data gives a slightly higher p-value, but the series is still a long way from being iid.

 

X-squared = 11.7798, df = 1, p-value = 0.0005988
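For reference, here is a sketch of how results like these might be produced, again assuming the two series sit in hypothetical vectors ipcc_anomaly and uah_anomaly:

# Sketch: Ljung-Box tests as reported above (hypothetical vector names).
Box.test(ipcc_anomaly, lag = 1, type = "Ljung-Box")
Box.test(uah_anomaly,  lag = 1, type = "Ljung-Box")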

 

One conclusion to be drawn from these results is that applying traditional tests to this temperature data, such as using a correlation test to determine the significance of an observed correlation or a t-test to compare the means of the IPCC and UAH series, will not give meaningful results, as these tests are based on the assumption that the random variables involved are iid.

 

Now, let's get back to the regression of the UAH temperature anomaly against time. This regression takes the form

 

Yi = a + βti + εi

 

where ti is the year. Although we already know that the Yi are not iid, if we are lucky, the εi might be. Better still, the εi might be iid and normally distributed. The chart in Figure 1 above shows the results of this regression and the slope β is 0.013°C per year. The good news is that the residuals look a lot more promising under the Ljung-Box test:

 

X-squared = 0.0952, df = 1, p-value = 0.7577

 

So, at this point we cannot rule out the possibility that the residuals εi are iid. Better still, if we apply the Shapiro-Wilk normality test (shapiro.test in R) to these residuals, we get a p-value of 0.3 so we cannot rule out the hypothesis that the residuals are in fact iid normal. If we assume they are, then we can get a 95% confidence interval on the slope of the regression. I calculate this interval to be 0.006°C to 0.020°C per annum.
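A sketch of these residual checks and of the slope interval, using the same hypothetical uah data frame as before:

# Sketch: residual diagnostics and slope confidence interval for the
# regression of anomaly against year (hypothetical 'uah' data frame).
fit <- lm(anomaly ~ year, data = uah)
res <- residuals(fit)
Box.test(res, lag = 1, type = "Ljung-Box")  # serial dependence in the residuals?
shapiro.test(res)                           # are the residuals plausibly normal?
confint(fit, "year", level = 0.95)          # 95% confidence interval for the slope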

 

Trends and Smoothing

In the previous section, I talked about the fact that, to get to grips with a time series, it can help to put it in this form:

 

Xi = modeli + residuali

 

In this form, the model represents the underlying behaviour of the time series and the residual is random noise on top of that behaviour. The question is, how do we get a sense of what the underlying behaviour is? A common approach is to smooth the data. The straight-line regression in Figure 1 is an extreme example, but it would be nice to get a picture of the underlying behaviour with a bit more structure than a straight line.

 

Moving Averages

Ken made the suggestion in a comment that one could integrate under the temperature anomaly curve. This is not a bad start: integration does smooth the original function. The idea, then, would be to integrate from ti to ti+n and calculate the smoothed value for ti+n as the integral divided by the time period ti+n - ti. If you cast your mind back to high-school calculus, you may remember that a basic approximation technique for integration is the trapezoidal rule. This would give us the following result (assuming the observations are equally spaced one year apart, so the time period is n):

 

(xi/2 + xi+1 + xi+2 + ... + xi+n-1 + xi+n/2)/n

 

So, apart from the terms on the end, this looks just like a simple moving average. Of course, moving averages are commonly used to smooth time series, so Ken's intuition was on the right track here. The degree of smoothing can be varied by varying the window of the average. In the chart below I have used a five year moving average.

Figure 3. Moving Average of UAH Temperature Anomaly
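As a sketch of the moving average in Figure 3 (assuming the same hypothetical uah data frame and, as a further assumption, monthly observations, so that a five-year window is 60 points):

# Sketch of Figure 3: trailing five-year moving average of monthly data
# (hypothetical 'uah' data frame; 60 monthly observations per window).
ma <- stats::filter(uah$anomaly, rep(1 / 60, 60), sides = 1)
plot(uah$year, uah$anomaly, type = "l", col = "grey",
     xlab = "Year", ylab = "Temperature anomaly (°C)")
lines(uah$year, ma, lwd = 2)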

 

In this chart, the warming trend remains clear. There is a dip down at the end of the series, but this is no surprise given the sharp dip down in the data in 2008. Of course, the effect would be less pronounced if a longer time window was used for the moving average.

 

Splines and Lowess

Although moving averages do result in a smoother time series, the result is still fairly jagged. To get even smoother underlying behaviour, we can make use of smoothing techniques that involve fitting pieces of polynomial curves, which have the appeal of being very smooth. Probably the two most popular polynomial-based smoothing techniques are spline curves and LOWESS ("locally weighted scatterplot smoothing"). Both techniques have parameters that can be tweaked to vary how smooth the fitted curves are, in much the same way that the window of averaging can be varied in the case of the moving average. I will not go into the theory of how these techniques work here, but both are readily available in R. The chart below shows the results of applying the functions smooth.spline and lowess to the UAH temperature anomaly data. In the case of the lowess curve, I let the function determine the default level of smoothness, while in the case of the spline, I set the smoothing parameter to 0.7 as the default was a bit too smooth and looked just like the regression straight line.

Figure 4. Spline and Lowess curves
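A sketch of how Figure 4 might be produced, once more assuming the hypothetical uah data frame (spar is the smoothing parameter referred to above):

# Sketch of Figure 4: spline and lowess smoothing (hypothetical 'uah' data frame).
plot(uah$year, uah$anomaly, type = "l", col = "grey",
     xlab = "Year", ylab = "Temperature anomaly (°C)")
sp <- smooth.spline(uah$year, uah$anomaly, spar = 0.7)  # smoothing parameter 0.7
lines(sp, col = "blue")
lines(lowess(uah$year, uah$anomaly), col = "red")       # default lowess smoothness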

 

Here the spline shows a slight dip down at the end, much like the moving average, while the lowess curve has not responded to it.

 

So what do these smoothed curves achieve? By themselves, not very much beyond giving a clearer sense of what the underlying behaviour of the time series may look like. They do not prove that trends are reversing or that they will persist. In fact, from the discussion of time series above it should be clear that simply working with a limited set of time series data does not allow you to say very much with strong statistical confidence. For this reason, my own view in relation to climate change is that it makes sense to take into account as much evidence as possible from as many different sources and fields as possible, rather than trying to data-mine a single temperature time series. And there is a very wide range of research that supports the view that climate change is taking place. A somewhat bizarre example I heard about recently is the case of the shrinking sheep. So, given my proclivities for number-crunching, I will happily continue to look at the temperature data in a variety of ways, but I do not expect this approach to go too far in convincing anyone with an entrenched position to change their mind.

 

The main conclusion to draw is that, when analysing time series, statistical tests and methods used in other settings should be used with care. It certainly does not mean that statistical analysis of time series is intractable, just tricky. One key point is that you need as much data as possible. Throwing data away (such as only looking at temperature data from 1998 onwards) is a bad idea.

 

NOTE: the code for producing the charts here using R is now available in Kieren's code repository in a file called smoothing.R.

Comments (12)

Kieren Diment said

at 11:48 am on Aug 16, 2009

I generally dislike correlation methods because the statistical significance is just a function of the number of observations, so the conclusions are frequently trite or misleading. However for quick and dirty data/relationship visualisation, they're great.

Kieren Diment said

at 12:09 pm on Aug 16, 2009

by misleading, I mean a correlation coefficient of 0.1 with n=1000 will be statistically significant regardless of its practical significance.

Stubborn Mule said

at 2:53 pm on Aug 16, 2009

When it comes to time series, the chances are that the correlation will not even be statistically significant, which is what I am going to write about next

Kieren Diment said

at 4:41 pm on Aug 16, 2009

t in this case is actually a non-parametric estimate of CO2 level, which is a random variable (a rather odd way of looking at it, but in fact true, as CO2 concentration in the atmosphere has increased every year since well before 1979). So possibly a better statistic would be the Spearman correlation, but in practice the difference between Pearson and Spearman is actually quite small most of the time ...

Interested to see how the time series plays out though. That would suggest to me that time series have less statistical power than normal parametric stats.

Stubborn Mule said

at 7:54 pm on Aug 16, 2009

You are spot on about time series: when you're not working with iid samples, you need a lot more data!

Stubborn Mule said

at 9:11 pm on Aug 16, 2009

Kieran, you certainly could consider t to be a proxy for CO2 levels and thereby treat it as a random variable. The important point, however, is that the time series cannot be considered to be samples from iid random variables and therefore the hypothesis testing or confidence levels in the correlation test are meaningless.

kenlambert said

at 9:37 pm on Aug 16, 2009

Stubborn Mule:

Would you like to comment on the following paper:

http://wattsupwiththat.com/2009/08/12/is-global-temperature-a-random-walk/

It seems to me that the time series statistical issues covered therein might be relevant to your above analysis.

Ken Lambert

Kieren Diment said

at 10:12 pm on Aug 16, 2009

"hypothesis testing or confidence levels in the correlation test are meaningless"

Indeed, in this case time isn't really time. So we're in the strange position that in my analysis time isn't really time, it's reverse rank of atmospheric co2 concentration. That was kind of my original intention, but I was keeping quiet about it due to a lack of experience with time-series analysis, but wanting to keep things simple for the statistically challenged. [1]

[1] A trend can be negative, even when it's demonstrably zero ... I mean for goodness sake.

Stubborn Mule said

at 6:12 am on Aug 17, 2009

Kieran: note that here I am specifically referring to the Pearson product-moment coefficient used in cor.test that requires both series to be iid. It is perfectly legitimate to regress random variables (which may not even be iid) against variables that are not random, so regressing against time is actually fine, you just wouldn't use cor.test to analyze it.

Ken: I will have a look at that link and revert.

Stubborn Mule said

at 3:34 pm on Aug 17, 2009

Ken,

I have read the Random Walk post, and I am not convinced. I posted a comment there, which is currently awaiting moderation (nothing nasty there, so it should pass). In the meantime, I will reproduce the gist of the comment here.

The Hurst exponent is most useful for detecting long-term predictability in a time series. While that means there are implications for what you would calculate as the Hurst exponent for a series that follows a random walk, it does not follow that it is a particularly powerful tool (in the statistical sense) for detecting random walks. Why use binoculars to look at bacteria when you have a microscope in the lab? On top of that, the Hurst exponent is notoriously difficult to estimate accurately, so it is not even a particularly great pair of binoculars.

To me, an obvious example of the microscope we have to hand is the Phillips-Perron (PP) unit root test. Since a random walk has a unit root, if this test allows you to reject the hypothesis of a unit root, you can dismiss the hypothesis of a random walk. According to my calculations, the seasonally differenced series has a PP test p-value of 0.01 (the p-value for the original series is also 0.01), so I would suggest that we can reject the random walk hypothesis with 99% confidence.

kenlambert said

at 10:30 pm on Aug 17, 2009

Thank you - I cannot profess to more than an undergraduate knowledge of statistics so will be interested to see any author's reply to your post.

Stubborn Mule said

at 10:53 am on Aug 18, 2009

Ken: no reply as yet.

Kieren: I have changed a few "a"s to "e"s in your name. Sorry about that. I think I covered just about everything I wanted to say at this point and even salvaged something in the way of a confidence interval for your regression.
