· Q. 1. What is linear regression in simple terms and do I need to understand hierarchical regression?
A. It is always best to start with simple (linear) regression, in which we fit the best straight line to data that approximate a straight-line relationship between one independent (predictor) variable X and one dependent (response) variable Y. This line has an equation of the form Y = mX + c, where m is the gradient and (0, c) is the intercept with the y-axis. The equation is used to predict the value of the response variable for a specified value of the predictor variable, provided no attempt is made to extrapolate beyond the observed range of the predictor variable.
The coefficient of determination, R-squared, measures the extent to which the variation in the response variable can be explained in terms of the predictor variable. R-squared ranges from 0 (not at all explained) to 1 (completely explained).
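As a minimal sketch of these two ideas, assuming the Python libraries numpy and statsmodels are available (the data values below are invented purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical illustrative data (invented values)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor (independent) variable
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response (dependent) variable

# Fit Y = mX + c by ordinary least squares
model = sm.OLS(Y, sm.add_constant(X)).fit()
c, m = model.params
print(f"Fitted line: Y = {m:.2f}X + {c:.2f}")
print(f"R-squared: {model.rsquared:.3f}")  # proportion of variation explained

# Predict only within the observed range of X (no extrapolation)
print(model.predict([[1.0, 3.5]]))  # first column is the constant term
```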
For some further detail, I strongly recommend that you carefully study the material available in the Statistics at Square One chapter, Correlation and Regression (see Chapter 11) and at the following site: The idea of a regression equation. (Please ignore the ‘Note to users’ at the latter site.)
When it comes to looking at the independent effects of more than one predictor variable, the model becomes more complicated and we are looking at a multiple regression model. A hierarchical regression analysis is a multiple regression analysis in which we enter predictor variables in stages and use the change in the value of R-squared between these stages to investigate the extent to which one combination of variables explains the variation in the response variable relative to another.
For example, we might wish to define a regression model for predicting anxiety score in terms of age and hours of work, and then measure the improvement in the predictive value of that model (model 1) on introducing the additional variables ‘gender’ and ‘deprivation score’ to form the extended model, model 2. The results of this analysis might then assist us in deciding whether or not it is worthwhile extending model 1 to form model 2. Here, the critical question is whether the change in R-squared is statistically significant, which must be addressed by an accompanying hypothesis test known as an F-test.
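By way of illustration, here is a minimal sketch of such a two-block comparison in Python using the statsmodels library; the variable names and the randomly generated data are purely hypothetical stand-ins for a real data set:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical illustrative data, generated at random
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "age": rng.uniform(20, 65, n),
    "hours": rng.uniform(10, 50, n),
    "gender": rng.choice(["F", "M"], n),
    "deprivation": rng.uniform(0, 10, n),
})
df["anxiety"] = 0.05 * df["age"] + 0.1 * df["hours"] + rng.normal(0, 2, n)

# Model 1: first block of predictors
model1 = smf.ols("anxiety ~ age + hours", data=df).fit()

# Model 2: extended model, adding gender and deprivation score
model2 = smf.ols("anxiety ~ age + hours + C(gender) + deprivation", data=df).fit()

# Change in R-squared and the F-test of whether that change is significant
print(f"R-squared change: {model2.rsquared - model1.rsquared:.3f}")
f_stat, p_value, df_diff = model2.compare_f_test(model1)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```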
For your purposes, whether you are performing a simple linear regression or a hierarchical (multiple) regression analysis, you must first decide whether or not you actually have a dependent variable which ranges over measurement data (in the sense that it is not limited to a few categories, such as points on a scale from 1 to 11). For a simple linear regression analysis, your predictor variable will also have to range over measurement data, and for a hierarchical regression analysis (as with all other forms of multiple regression analysis) you will need at least two of the predictor variables to range over measurement data.
In actual fact, there are three types of multiple regression analysis: standard, stepwise and hierarchical. Please have a look at the solutions to Qs 3 and 4, below, to find out more.
· Q. 2. I wish to perform separate simple linear regression analyses to assess the relationship between the dependent variable pulse wave velocity and each of the predictor variables systolic blood pressure, diastolic blood pressure and number of minutes spent on an exercise machine. Which preliminary tests should I perform on my data, how may I identify outliers and how may I perform the individual regression analyses in a single run using SPSS?
A. You should find the worked examples and interpretation of SPSS output provided under Simple Linear Regression in SPSS very useful. This resource will help you make sense of the coefficient of determination (R-squared) within the context of a regression analysis.
· Q. 3. Where can I find a more comprehensive account of multiple regression analysis which covers: a) all exploratory tests which I must perform, including testing for outliers, with worked examples in SPSS, and b) an explanation of stepwise regression?
A. In response to a), go through the PowerPoint presentation available at: Assumptions and Outliers, where you will also find a 6-step plan of action for data exploration. Also note that before embarking on a multiple regression analysis, it is worth considering whether the Pearson correlation coefficient for the strength of the linear relationship between your principal continuous explanatory variable and the dependent variable is at least 0.3. If it is not, it is hard to see why you should wish to proceed further by extending the model to a multivariable model including the same explanatory variable.
To appreciate how to interpret the magnitude of linear regression coefficients and how these coefficients relate to the Pearson correlation coefficient (also known as the Pearson product moment correlation coefficient), please refer to the section Pearson Product Moment at Correlation Coefficients, where you will find some handy cut-off values for interpretation of your findings.
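If it helps, the preliminary correlation check described above can be scripted; the following minimal sketch assumes Python with numpy and scipy, and the data values are invented purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical measurement data, invented purely for illustration
x = np.array([54, 61, 47, 68, 59, 72, 50, 65])          # explanatory variable
y = np.array([118, 130, 115, 141, 128, 150, 120, 137])  # dependent variable

r, p = pearsonr(x, y)  # Pearson product moment correlation coefficient
print(f"r = {r:.2f} (p = {p:.4f})")
if abs(r) < 0.3:
    print("Weak linear association: a multivariable model built around this "
          "explanatory variable may not be worthwhile.")
```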
In response to b), please note that stepwise procedures have their limitations (see COMMON MISTEAKS MISTAKES IN USING STATISTICS: Spotting and Avoiding Them). Should you choose to go forward with this procedure, you should at least note the limitations implicit in the above resource when interpreting your findings. Useful resources for performing stepwise multiple linear regression with SPSS are:
- section 12.4.2 of PASW 17 Statistics Made Simple
- Stepwise linear regression.
Concerns about stepwise multiple regression arise from the researcher’s lack of control over the order in which potential explanatory variables are entered into the model, and over the criteria used to select the final set of variables for inclusion. By contrast, hierarchical linear regression allows the researcher to use their own background knowledge about the relevance of explanatory variables to build up a model, while noting the effect on the model of introducing new explanatory variables. Variables are entered in blocks. Therefore, if you wish to assess how well level of anxiety predicts blood pressure after controlling for age, you would have the capacity to enter age in block 1 followed by level of anxiety in block 2. Advice on using hierarchical multiple linear regression may be found under the solution to Q. 4, below.
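For readers working outside SPSS, the following minimal sketch illustrates the general flavour of forward stepwise selection in Python using scikit-learn’s SequentialFeatureSelector. Note that this selector chooses variables by cross-validated fit rather than by the p-value entry/removal criteria that SPSS stepwise procedures use, and the data below are randomly generated for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Randomly generated illustrative data: five candidate predictors, of which
# only columns 0 and 2 truly influence the response
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

# Greedy forward selection scored by cross-validated fit (not the p-value
# criteria used by SPSS stepwise procedures)
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print("Selected predictor columns:", selector.get_support(indices=True))
```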
· Q. 4. I wish to perform a hierarchical multiple regression analysis in order to identify the optimal multiple regression model for predicting diastolic blood pressure. Which model assumption checks should I perform on my data and how may I perform the regression analysis using SPSS?
A. You may wish to refer to the chapter entitled Multiple Regression in the book SPSS survival manual: a step by step guide to data analysis using IBM SPSS by Julie Pallant. If you are registered with the University of Edinburgh library, you can use the University’s search facility DiscoverEd to check on the availability of editions of this book for borrowing. I would also strongly recommend the Kindle version of this book, available via Amazon, as it reads well using the Kindle Cloud Reader.
· Q. 5. Within the context of a multiple linear regression analysis, how can I test whether height, weight and BMI are effect modifiers of the association between age (the independent variable) and 24-hour pain intensity (the dependent variable)?
A. Testing for confounding or effect modification is not very realistic for a very small sample size (see the later question and solution on a sample size calculation for a multiple linear regression analysis). However, assuming that you have sufficient data and at least a modest linear association between age and pain intensity (Pearson correlation coefficient of at least 0.3), future researchers may wish to pursue a multiple linear regression analysis to adjust for potential confounders and test for effect modification. For confounding, this would involve assessing the influence on the regression coefficient of adding in the potential confounders. For effect modification, modelling the potential effect modifiers as categorical variables would help you compare regression coefficients across strata. For helpful content comparing the concepts of confounding and effect modification and illustrating the different methods for detecting these phenomena, refer to Confounding and Effect Measure Modification.
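To make the two strategies concrete, here is a minimal sketch in Python using the statsmodels formula interface; the column names (pain, age, bmi_group) and the randomly generated data are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical illustrative data, generated at random
rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({
    "age": rng.uniform(25, 75, n),
    "bmi_group": rng.choice(["normal", "overweight", "obese"], n),
})
df["pain"] = 0.08 * df["age"] + rng.normal(0, 1.5, n)

# The interaction term allows the age slope to differ across BMI strata; a
# significant interaction suggests BMI group modifies the age-pain association
interaction = smf.ols("pain ~ age * C(bmi_group)", data=df).fit()
print(interaction.summary())

# For confounding, compare the age coefficient before and after adjustment
crude = smf.ols("pain ~ age", data=df).fit()
adjusted = smf.ols("pain ~ age + C(bmi_group)", data=df).fit()
print(crude.params["age"], adjusted.params["age"])
```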
· Q. 6. Is there a rule of thumb available for deciding in advance what sample size is required for a multiple regression analysis and should I treat hierarchical multiple regression as a special case in this respect?
A. There are a number of rules of thumb available for choosing an appropriate sample size for a multiple regression analysis (see, for example, the answer to the FAQ ‘How big a sample size do I need to do multiple regression?’ in the electronic book (e-book) Multiple Regression by David Garson). The 2014 version of this book can be requested for free by accessing the Statistical Associates Publishing E-Book Catalog. With regard to hierarchical multiple regression in particular, however, you may wish to experiment briefly with the relevant sample size calculator provided at: Statistics Calculators.
This will help you to appreciate how tweaking different parameters influences the required sample size.
In addition, in the SPSS Survival Manual, in response to the question, ‘How many cases or participants do you need?’, Pallant (6th edition, ch. 13, p. 3427) highlights the following rules of thumb for multiple linear regression (applied in the short sketch after this list):
- N > 50 + 8m, where ‘m’ denotes the number of independent variables (source: Tabachnick and Fidell (2013), p. 123)
- about 15 participants per predictor (Stevens (1996), p. 72)
- in the case of stepwise multiple linear regression, 40 cases for every independent variable.
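As a minimal sketch, these rules of thumb translate directly into arithmetic; the Python function name and structure below are my own invention, with only the arithmetic taken from the rules quoted above:

```python
# m = number of independent variables (predictors) planned for the model
def sample_size_rules(m: int) -> dict:
    return {
        "Tabachnick & Fidell: N > 50 + 8m": 50 + 8 * m,
        "Stevens: about 15 per predictor": 15 * m,
        "Stepwise: 40 per predictor": 40 * m,
    }

# Example: a planned model with 5 independent variables
for rule, n in sample_size_rules(5).items():
    print(f"{rule} -> at least {n} participants")
```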
· Q. 7. If I would like to perform a sample size calculation for a simple linear regression or multiple linear regression analysis, where can I find a suitable sample size calculator?
A. In any one of these cases, you can consult the relevant sample size calculator provided at: Statistics Calculators. As you will see from this resource, in order to perform the calculation(s) you should know in advance the approximate value of the coefficient of determination (R-squared) which you hope to obtain.
· Q. 8. I notice that when I use the sample size calculator for a regression analysis, I am invited to offer a value for the relevant regression coefficient in order that the desired effect size can be estimated for me. How is the effect size calculated?
A. The effect size is known as Cohen’s f². It is defined as:
f² = R² / (1 − R²)
where R² is the squared multiple correlation (that is, the coefficient of determination).
Here’s the citation for this effect size measure:
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
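As a worked example, a model with R-squared of 0.26 gives f² = 0.26/0.74 ≈ 0.35, which Cohen’s benchmarks would class as a large effect. A minimal sketch of the calculation in Python:

```python
def cohens_f2(r_squared: float) -> float:
    """Cohen's f-squared from the coefficient of determination R-squared."""
    return r_squared / (1.0 - r_squared)

print(round(cohens_f2(0.26), 3))  # 0.351: 'large' by Cohen's benchmarks
```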
· Q. 9. I am interested in designing a multiple regression model for predicting patient weight where the patient’s condition is such that they are unable to get out of bed to use weighing scales. I am not sure whether it is better to use standardized or non-standardized regression coefficients to express the relative effects of the different independent variables (such as waist circumference and silhouette area) when defining the regression model. Can you provide some relevant advice?
A. Standardized regression coefficients are obtained in a multiple regression analysis by adjusting the existing regression coefficients so that they are all in the same units, namely units of standard deviation. You can refer to the section Standardization at R-Square and Standardization in Regression to see how these are calculated.
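In outline, a standardized coefficient can be recovered from its unstandardized counterpart by rescaling with the sample standard deviations (beta = b × s.d. of X / s.d. of Y). A minimal sketch in Python, with invented data values:

```python
import numpy as np

# Invented illustrative data
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # independent variable
y = np.array([3.1, 4.8, 7.2, 8.9, 11.0])  # dependent variable

b = np.polyfit(x, y, 1)[0]                # unstandardized slope
beta = b * x.std(ddof=1) / y.std(ddof=1)  # slope in standard deviation units
print(f"b = {b:.3f}, standardized beta = {beta:.3f}")
# Equivalently, regressing the z-scores of y on the z-scores of x gives beta.
```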
Where you have more than one variable as a potential predictive factor, there is definitely a rationale for using them (see Standardized Coefficient). However, there is also a great weight of evidence in the literature that they can be misleading (and the material at the above link under ‘Disadvantages’ only gives a hint of the arguments involved). In particular, they tell us, for a 1 standard deviation (s.d.) increase in an independent variable, the corresponding increase in the dependent variable in units of s.d. While comparing standardized coefficients may seem appealing (as they are on the same scale), it is worth bearing in mind that, due to the way in which they are calculated, the real effect of an independent variable can be greatly distorted, particularly where its variance is high, leading to inflated effects. This can be problematic when comparing models across different populations, where the variability, but not the true effect, of a given variable may differ (Fox, 2008).
A similar problem arises with standardized categorical independent variables: if group sizes for individual categories are small, the effect size in switching from one category (of gender, say) to another can be inflated. Thus, expressing coefficients relative to a measure of spread (or variability), such as the s.d., can distort effects, and the validity of standardized regression coefficients can accordingly be an issue. A further troublesome issue (in addition to validity) is the transparency of these coefficients for the general or statistically uninitiated reader. For example, when assessing the costs of hospital transportation, it may be worth deciding in the first instance which of the following approaches to communicating findings is the more practically meaningful:
- A 1 s.d. increase in vehicle weight leads to a 0.24 s.d. decline in mileage
or
- An extra thousand pounds in vehicle weight leads on average to a 4.9 mile per gallon decline in mileage.
Standardized regression coefficients may have their use, however, when dealing with rather strange units, such as those obtained by raising variables to powers.
To keep everyone happy, I strongly encourage you to tabulate both the standardized and non-standardized coefficients with corresponding confidence intervals, but to use only the non-standardized version for your regression equation. All of this is good practice!
References
Fox J: Applied Regression Analysis and Generalized Linear Models. 2nd edn. London, UK: Sage; 2008.
Statistical Modeling, Causal Inference, and Social Science [blog]
Linear Regression Analysis (Simple and Multiple) by Margaret MacDougall is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.