• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


Stage 2 (redirected from Stage 1)

Page history last edited by Kieren Diment 14 years, 9 months ago

Predicting Temperature Anomaly with IPCC figures.


I can't remember doing stepwise regression in R before (I've used SPSS and JMP though), so having read chapter 10 of Faraway's exposition on variable selection in predictive models, I tend to agree with him that the hypothetico-deductive approaches to variable selection is a fairly silly and arbitrary approach, and that the criterion based method on  page 125 is much more sensible, as it grounds the researcher to the data better. 


The results from this analysis shoud provide some evidence for the model:


  Temperature anomaly = [various forcing estimates] + CO2 + error



Regression modelling of temperature anomaly.


This is fun!  This time there's some pretty technical stuff happening, so I think I'll present the R code then explain it step by step.


First we'll read the data into the system:


> climate <- read.csv("/Users/kd/Desktop/climatekaraoke/Climate-Karaoke/all_data.csv")


calculate the log of the difference of co2 levels from the pre-industrial level

> climate$co2.log.diff <- log(climate$co2_mean-270)


then put this into a convenient data structure for us to use in subsequent analysis


> climate.summary <- climate[grep("^(VOLCANIC|SOLAR|ANOMALY|co2.log)",names(climate))]


Just to show what variables we're interested in following on from eyeballing the correlations in Stage 1 ...

> names(climate.summary)

[1] "ANOMALY"      "VOLCANIC"     "SOLAR"        "co2.log.diff"


Just a quick note, we want 30 observations per predictor.  We have three predictors, so that's 90 data points.  Fortunately we have data for 160 years.


Now we'll make a linear regression model predicting ANOMALY from all of the other variables (we'll edit out the marginally relevant bits of the output):


> my.model <- lm(ANOMALY ~ ., data=climate.summary)

> summary(my.model)


              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  -0.734971   0.029781  -24.68   <2e-16 ***

VOLCANIC      0.122872   0.009777   12.57   <2e-16 ***

SOLAR         0.660034   0.052813   12.50   <2e-16 ***

co2.log.diff  0.257794   0.010935   23.57   <2e-16 ***


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05338 on 165 degrees of freedom

Multiple R-squared: 0.9409,    Adjusted R-squared: 0.9398

F-statistic:   876 on 3 and 165 DF,  p-value: < 2.2e-16


The F statistic at the bottom tells us that these three variables together predict ANOMALY better than chance.  The t-value columns above (along with the associated p values to the right), we can see that each parameter predicts ANOMALY better than chance.  Finally we see from the adjusted R squared, we've accounted for 93.98% of the variance of ANOMALY using these three variables.  Next note that the t statstic for co2.log.diff  is about double the t statistic for the other two variables.  This is indirect evidence that CO2 is a more important predictor of anomaly than either the VOLCANIC or SOLAR variables.  However, eyeballing the t statistic isn't good enough, as it doesn't have a strong theoretical basis. 


So instead we'll use the Akaike Information Criterion (AIC) to see if dropping predictors improve the fit of the model.  In the table below what we'll see is a bunch of AIC statistics, for the full model, and one for where each term is dropped from the model.  The one with the highest (absolute) AIC is the best model, and the one with the lowest AIC is the worst.  So using Faraway as a guide, let's go:


> step(my.model)

Start:  AIC=-986.52

ANOMALY ~ VOLCANIC + SOLAR + co2.log.diff

               Df Sum of Sq     RSS     AIC

<none>                         0.47 -986.52

- SOLAR         1      0.44    0.92 -875.95

- VOLCANIC      1      0.45    0.92 -875.04

- co2.log.diff  1      1.58    2.05 -739.35

So we see that the best model is the one with all predictors included.  If we ommit either SOLAR or VOLCANIC the AIC drops a bit, but if we omit co2.log.diff, the model a much worse fit.  This is pretty strong evidence that co2 is the main driver of warming in the present day.


But we shoud really run some regression diagnostics to make sure that we're not measuring anything spurious.  That's a job for later, and will mean a return to pretty graphs.


The outstanding things to do here, as I see it are:


  1. Deal with regression diagnostics as mentioned above, (and explain why they're important properly)
  2. Get comments from someone who deals with climate/meteorological data to comment on the methodology used in the analysis to date.
  3. Cherry pick Obtain data from different temperature anomaly sources to see how different anomaly data stand up to scrutiny ( as this is something delusional climate sceptics get really hung up about).
  4. Deal with any other issues as requested.


We'll deal with the regression diagnostics in Stage 3


Interlude:  Deciding whether to use log co2 or direct co2.


This is fairly easy.  We'll create a model that uses both co2.log.diff and co2_mean to predict ANOMALY:


climate.summary <- climate[grep("^(VOLCANIC|SOLAR|ANOMALY|co2(_mean|.log))",names(climate))]

my.model <- lm(ANOMALY ~ ., data=climate.summary)


And run the step() procedure again to get AIC with each predictor omitted.  The one that results in the smallest drop in AIC is the one that we keep:



Start:  AIC=-992.81

ANOMALY ~ VOLCANIC + SOLAR + co2_mean + co2.log.diff

               Df Sum of Sq     RSS     AIC

<none>                         0.45 -992.81

- co2_mean      1      0.02    0.47 -986.52

- co2.log.diff  1      0.16    0.60 -944.25

- SOLAR         1      0.32    0.77 -903.45

- VOLCANIC      1      0.47    0.92 -873.09


So we see here that co2_mean is slightly better than co2.log.diff.  This means we can run the original model but with co2_mean instead of the co2.log.diff:


climate.summary <- climate[grep("^(VOLCANIC|SOLAR|ANOMALY|co2_mean)",names(climate))]

my.model <- lm(ANOMALY ~ ., data=climate.summary)


Start:  AIC=-944.25


           Df Sum of Sq     RSS     AIC

<none>                     0.60 -944.25

- SOLAR     1      0.30    0.91 -877.20

- VOLCANIC  1      0.55    1.16 -836.36

- co2_mean  1      1.45    2.05 -739.35


lm(formula = ANOMALY ~ VOLCANIC + SOLAR + co2_mean, data = climate.summary)


(Intercept)     VOLCANIC        SOLAR     co2_mean 

  -1.938246     0.137729     0.586105     0.006872 

Which shows that for every 1 part per million increase in atmospheric CO2, the temperature anomaly increases by 0.007ºC


Now we have a more solid basis to proceed with Stage 3.

Comments (0)

You don't have permission to comment on this page.