Tests of Normality | StatsforMedics

· Q 1. I want to get a rough idea of whether the data for my continuous variable follow a normal distribution or are skewed. How can I do this using SPSS?

A. Try plotting a histogram and fitting a Normal curve to your data on the same plot. The instructions for this are available here: Creating histograms in SPSS.

However, if you are using this method to help you decide which is the right test for comparing two or more groups, then it would be best to proceed to Q 4 below and its accompanying solution.

· Q 2. What are parametric data?

A. Within the context of hypothesis testing, parametric data may be understood as data which follow a known distribution. Usually, the known distribution is the Normal distribution, which is why we often hear of the dichotomy between tests for Normally distributed data and tests for non-parametric data (or non-parametric tests). Examples of this dichotomy are provided in the two flowcharts on the page Some Useful Flowcharts of this site. These flowcharts are designed to assist you in choosing the right test(s) to address your study questions based on the type of data you have and your underlying hypotheses.

· Q 3. Why are tests of Normality so essential to parametric testing?

A. Often, when we perform a hypothesis test we are trying to refute a statement of no association or difference, called the Null Hypothesis. For example, if we were comparing two means for two samples, we would start with the null hypothesis that the means for the populations from which these samples were obtained are actually the same.

We might take this approach, for example, when comparing pain levels for two different groups of patients following use of a particular analgesic, provided the data are on a measurement scale (e.g. from 1 to 60) which is not limited to only a few categories.

In order to refute the null hypothesis we would need to have obtained a test statistic with a value which would be sufficiently extreme. But what is a test statistic, you may well ask?

Well, first of all each test statistic is dependent on a choice of test and therefore if we are going to draw the correct conclusion for our data, we had better be sure that we have chosen the correct hypothesis test.

For the example I have mentioned above, you would therefore use a t-test only if it was correct to assume that the two samples were taken from populations which were approximately Normally distributed. There are many ways of testing this assumption.

It is true that sometimes we can transform the data using logs or other functions to force the data to be Normally distributed but often this does not work. If this were the case with our example above, but the two samples did nevertheless come from populations with similar distributions, then you would perform a different test called the Mann-Whitney U-test.

Now back to the test statistic. A test statistic is a statistic which is calculated using the sample data and whose value determines whether or not we reject the null hypothesis. The formula for calculating the test statistic depends on the distributions of the populations from which our samples were taken. There is one test statistic for the independent samples t-test and another for the Mann-Whitney U-test when simply comparing for a difference between our samples. It is therefore important to check the distributions from which these samples were taken before calculating the test statistic.
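To make the contrast between the two test statistics concrete, here is a minimal Python sketch (not SPSS output) which calculates each of them by hand. The function names and the pain scores are made up for illustration, and the t statistic shown is the equal-variances (pooled) version only.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Independent-samples t statistic, assuming equal population variances."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def mann_whitney_u(a, b):
    """Mann-Whitney U for sample a, using average ranks for tied values."""
    pooled = sorted(list(a) + list(b))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j; each receives their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_a = sum(rank_of[x] for x in a)
    n1 = len(a)
    return rank_sum_a - n1 * (n1 + 1) / 2


# Hypothetical pain scores for two groups of patients (illustrative only).
group_a = [12, 15, 18, 20, 22]
group_b = [25, 28, 30, 35, 41]
print("t =", pooled_t_statistic(group_a, group_b))
print("U =", mann_whitney_u(group_a, group_b))
```

The t statistic is built from the means and variances, which is why the Normality assumption matters for it, whereas U is built only from the ranks of the observations.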

To see an example of the application of a test statistic in practice, have a look at Chapter 7. The t tests of the e-book Statistics at Square One and consider in particular how the test statistics, t, are calculated for different types of t-test.

However, I cannot stress too highly how useful the text Medical Statistics at a Glance would be in helping you get to grips with these essentials. If you are registered with the University of Edinburgh, you can consult the electronic version of this book via the University’s library discovery system, DiscoverEd.
Here are some reference details:

- Title: Medical statistics at a glance

- Authors: Aviva Petrie, Caroline Sabin

- Publisher: Hoboken : Wiley

- Publication Date: 2013

- Edition: Third edition

· Q 4. I am considering comparing time to hospital discharge data for groups of patients defined according to how long they waited to see the consultant. Where can I find more comprehensive information on tests of Normality for each group (including advice on more precise tests)?

A. Please refer to the resource

Testing data for Normality to assist in deciding whether to carry out a parametric or non-parametric test

NB. If you wish to test for Normality for one or more variables but these variables are in separate columns because the data are related across columns, you can still use the instructions in the tutorial. However, you will not have to enter a variable into the SPSS dialogue box for groups (the cell salvage variable in the example provided). Put simply, this means you don’t have to enter anything in the box Columns when creating a histogram or in the box Factor List when running the Shapiro-Wilk test. You should deal with each variable separately when creating histograms. However, when running the Shapiro-Wilk test, you can enter several variables into the box Dependent List and so save yourself a little time. In this context, you should refer to slide 53 in the first instance.

· Q 5. I have encountered a grey area in at least one of the following senses: a) the results of the Shapiro-Wilk test are not consistent with those which I would have anticipated on the basis of examination of my histograms; b) according to my findings, it appears that one of my groups is Normally distributed whilst the other is not. Are there any further tests which I could perform to help me develop cumulative evidence to decide for or against Normality?

A. Yes, there are various possible approaches which you can take here. The first of these is to examine the box-plots which were generated when you carried out the Shapiro-Wilk test. Are there any extreme outliers? If so, are they sufficiently reliable to keep in? If not, try removing them and re-generating your results. Please refer to the resource on box-plots to see how extreme outliers (denoted by asterisks rather than ‘o’s) can be identified. In a box-plot, the ‘o’s may be regarded as fairly harmless outliers and it is best to try to avoid removing them. The same resource also illustrates how to use box-plots as a crude preliminary test for assessing the Normality of your data.
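As a rough illustration of the distinction between ‘o’s and asterisks, the Python sketch below labels each value according to how far it lies beyond the box. It assumes the usual convention that an ‘o’ lies more than 1.5 box lengths (inter-quartile ranges) from the edge of the box and an asterisk more than 3 box lengths; please check this against the box-plot resource. The function name and data are hypothetical, and the quartiles calculated here may differ slightly from the hinges SPSS uses to draw the box.

```python
from statistics import quantiles


def classify_outliers(values):
    """Return (mild, extreme) outliers based on distance beyond the box."""
    # Quartiles by linear interpolation; SPSS box-plots may use a
    # slightly different definition (assumption).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    mild, extreme = [], []
    for x in values:
        # Distance beyond the nearer edge of the box (<= 0 if inside it).
        distance = max(q1 - x, x - q3)
        if distance > 3 * box_length:
            extreme.append(x)   # would be drawn as an asterisk
        elif distance > 1.5 * box_length:
            mild.append(x)      # would be drawn as an 'o'
    return mild, extreme


# Made-up measurements for illustration only.
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 30, 60]
mild, extreme = classify_outliers(data)
print("mild outliers (o):", mild)
print("extreme outliers (*):", extreme)
```

Only the values in the second list would be candidates for removal, and then only if you have reason to doubt their reliability.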

You will also have generated some statistics for kurtosis and skewness (definitions provided here). Divide each of these statistics by its standard error (which you will also have generated) to obtain a z-score for kurtosis and a z-score for skewness, and then take the absolute value of each. The journal article Statistical notes for clinical researchers: assessing normal distribution (2) using skewness and kurtosis will guide you in interpreting these absolute values as a means of assessing Normality.
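If you would like to see where these numbers come from, the following Python sketch carries out the division by hand. It assumes the standard bias-adjusted sample formulas for skewness and excess kurtosis and their standard errors, which I understand to be the ones reported by SPSS; the function name and the sample values are hypothetical. The cut-offs against which to compare the absolute values are left to the journal article.

```python
from math import sqrt
from statistics import mean, stdev


def skewness_and_kurtosis_z(values):
    """Skewness, excess kurtosis, their standard errors and z-scores."""
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 observations are needed")
    m, s = mean(values), stdev(values)
    sum3 = sum((x - m) ** 3 for x in values)
    sum4 = sum((x - m) ** 4 for x in values)
    # Bias-adjusted sample skewness and excess kurtosis.
    skew = n * sum3 / ((n - 1) * (n - 2) * s ** 3)
    kurt = (n * (n + 1) * sum4 / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    # Standard errors depend only on the sample size.
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return {
        "skewness": skew,
        "se_skewness": se_skew,
        "z_skewness": skew / se_skew,
        "kurtosis": kurt,
        "se_kurtosis": se_kurt,
        "z_kurtosis": kurt / se_kurt,
    }


# Made-up sample for illustration only.
sample = [1, 2, 2, 3, 3, 4, 5, 9, 20]
for name, value in skewness_and_kurtosis_z(sample).items():
    print(f"{name}: {value:.3f}")
```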

Q-Q plots – You will also have generated quantile-quantile plots, abbreviated Q-Q plots. These are highly recommended in helping you make your final decision when you encounter a grey area.

Q-Q plots are plots in which quantiles for the observed values are provided along the x-axis. The quantiles are formed by arranging the observed values in increasing order. The values along the y-axis are z-scores. For each plotted point (x,y), the y-co-ordinate is the z-score which would be expected under a standard Normal distribution for an observation of that rank, namely the standard Normal quantile corresponding to the cumulative proportion (k – 0.5)/n (where ‘k’ denotes the rank of the observed value under the above ordering and ‘n’ denotes the sample size). If the points lie approximately on the 45° straight line provided in this plot, then the data approximate to Normality; otherwise, the original sample data are non-Normal. The plot may also be used to identify outliers.
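The construction of the plotted points can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reproduction of SPSS: the z-score for rank k is taken to be the standard Normal quantile of the proportion (k – 0.5)/n, tied values are simply given consecutive ranks, and statistical packages may use a slightly different formula for the proportion. The function name and data are hypothetical.

```python
from statistics import NormalDist


def qq_points(values):
    """Return (observed value, expected z-score) pairs for a Normal Q-Q plot."""
    n = len(values)
    ordered = sorted(values)          # sample quantiles, smallest first
    standard_normal = NormalDist()    # mean 0, standard deviation 1
    points = []
    for k, x in enumerate(ordered, start=1):
        p = (k - 0.5) / n             # cumulative proportion for rank k
        z = standard_normal.inv_cdf(p)  # expected z-score for that rank
        points.append((x, z))
    return points


# Made-up observations for illustration only.
for x, z in qq_points([5.1, 3.2, 4.4, 6.0, 4.9]):
    print(f"observed = {x:4.1f}   expected z = {z:6.3f}")
```

Plotting these pairs and checking how closely they follow a straight line is the judgement described above.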

Please take time out to consider Yearsley’s examples on interpreting Q-Q plots and a cherry-picked video cross-examining Q-Q plots alongside histograms to train yourself in making the correct judgements.

· Q 6. I want to create a histogram for different variables in SPSS without splitting the output into groups. How can I do this?

A. Please refer to the quick and easy tutorial Histograms, which also clarifies how to superimpose a Normal curve on your data to assist in judging if the data are Normal.

Also, please use the accompanying practice data for the histogram so that you can learn interactively.

If, in turn, you also wish to perform some tests of Normality on your data, such as the Shapiro-Wilk test (which is thoroughly explained in the tutorial in the solution to Q 4, above) and the additional tests of Normality discussed in the solution to Q 5, above, but without splitting your data into groups, please have a look at the first half of the movie The Explore Command. You can use the same data as for the histogram in practising the techniques illustrated. To make sense of your output, please navigate to Q 4 and Q 5, above. The tutorial in the solution to Q 4 refers in this case to two groups but this doesn’t matter. It is the rationale behind the tests and the interpretation of output that you need to focus on.

The histogram plots will assist you in deciding whether your data are roughly Normal or skewed and thus whether it makes best sense to use a) the mean and standard deviation or b) the median and range together with the minimum and maximum (and possibly, the inter-quartile range), respectively, as summary measures for describing your data. As is explained in the solution to Q. 5, a boxplot can also be useful here in identifying outliers. If you are considering a clinical measurement, you should try to decide whether extreme outliers reflect unusual (but real) cases or instead, an error of measurement. In the latter case, you would clearly wish to remove the outliers before drawing further conclusions about your data. The StatsforMedics page Simple Descriptive Statistics will also assist you in obtaining further information about calculating and interpreting summary statistics such as those above.

· Q 7. I would like to test some transformations on my ESR data in an attempt to see if I can Normalize it. The reason for this is that for all tests of Normality I have considered, one of my patient groups (those with septic arthritis) has Normal data but the other (those without septic arthritis) has skewed data. I would like to try to make the data more uniform in distribution across the two groups. Can you suggest suitable functions for this purpose?

A. Finding a suitable function to transform your data in such circumstances can be challenging and you should allow plenty of time for exploration. A good starting point is the natural logarithm function (ln) or the logarithm to the base 10. Have a look at the PowerPoint presentation Computing Transformations to obtain some highly relevant information on techniques for transforming data in SPSS. The resource Tips for Recognizing and Transforming Non-normal Data is also a useful reference for considering more of the relevant theory.
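To give a feel for what a log transformation does, the Python sketch below applies ln and log to the base 10 to some made-up, right-skewed ESR-like values and compares the skewness before and after. The values, the function names and the choice of skewness formula (the bias-adjusted sample skewness) are assumptions for illustration; in practice you would repeat your tests of Normality on the transformed data within each group.

```python
from math import log, log10
from statistics import mean, stdev


def sample_skewness(values):
    """Bias-adjusted sample skewness (0 for perfectly symmetric data)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    sum3 = sum((x - m) ** 3 for x in values)
    return n * sum3 / ((n - 1) * (n - 2) * s ** 3)


def log_transform(values, base10=False):
    """Apply ln (default) or log10 to every value."""
    if min(values) <= 0:
        # Logarithms are undefined for zero or negative values.
        raise ValueError("all values must be strictly positive")
    f = log10 if base10 else log
    return [f(x) for x in values]


# Made-up ESR-like values (mm/hr), for illustration only.
esr = [4, 6, 7, 9, 10, 12, 15, 22, 38, 70]
print("skewness, raw data:", round(sample_skewness(esr), 2))
print("skewness, ln:      ", round(sample_skewness(log_transform(esr)), 2))
print("skewness, log10:   ",
      round(sample_skewness(log_transform(esr, base10=True)), 2))
```

The two logarithms differ only by a constant factor, so they change the shape of the distribution in the same way; the choice between them is a matter of convenience when reporting.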

· Q 8. Should I include the full results of my tests of Normality in my write-up?

A. No, this is not usually appropriate as these tests involve exploratory analysis of your data to help you decide on the correct hypothesis test(s). It would be a good idea, however, to indicate in the methods section which tests of Normality you used and in exactly what contexts.


Tests of Normality by Margaret MacDougall is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.