Epidemiology for the Rusty and Comparing a Study Cohort with a General Population

. Q 1. I have been advised to present my data within the context of an epidemiological study. But what does this mean?

A. Many things to many people (with varying degrees of clarity). You should not go wrong, however, if you consult the following useful BMJ resource:

Chapter 1. What is epidemiology?

It is also important for you to be aware of how to quantify disease and, with respect to reporting occurrence, to have a clear understanding of the distinction between the incidence and the prevalence of an outcome, such as mortality. Therefore, please take a look through the following useful BMJ resource:

Chapter 2. Quantifying disease in populations

Additionally, you should consider how best to present your data graphically. An appropriately chosen graph can lucidly summarize your data and therefore help the reader to grasp relationships in your data in an efficient way. The StatsforMedics WordPress page GRAPHICAL PRESENTATION OF DATA provides advice on how to generate many such graphs.

. Q 2. Can you point me to a resource which explains the difference between the notions ‘prevalence rate’ and ‘incidence rate’?

A. Yes; please refer to the solution to Q. 1, above.
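
As a rough illustration of the distinction, the sketch below contrasts a point prevalence (existing cases as a proportion of the population at one point in time) with an incidence rate (new cases per unit of person-time at risk). All of the figures and function names are made up for illustration and are not taken from the BMJ resource.

```python
def point_prevalence(existing_cases, population):
    """Proportion of the population who have the condition at one point in time."""
    return existing_cases / population


def incidence_rate(new_cases, person_years_at_risk):
    """Number of new cases per person-year of follow-up among those at risk."""
    return new_cases / person_years_at_risk


# Hypothetical figures, for illustration only.
prevalence = point_prevalence(existing_cases=150, population=10_000)
incidence = incidence_rate(new_cases=30, person_years_at_risk=9_850)

print(f"Prevalence: {prevalence * 100:.1f} per 100 persons")
print(f"Incidence rate: {incidence * 1000:.2f} per 1,000 person-years")
```

Note that the denominators differ: prevalence uses a count of persons, whereas the incidence rate uses person-time contributed by those who were free of the condition at the start.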

. Q 3. I have determined crude rates for accidental death and would like to provide the corresponding confidence intervals. Can you recommend a suitable reference?

A. Yes, please consult Confidence Intervals for a Crude Rate.
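
For orientation only, here is a minimal sketch of one common approximation: treating the number of events as a Poisson count and using a Normal approximation for the 95% CI. This is an assumption on my part and may not be the method used in the resource above; for small numbers of events, an exact method would usually be preferred. The counts are hypothetical.

```python
import math


def crude_rate_ci(events, person_years, per=100_000, z=1.96):
    """Crude rate with an approximate 95% CI, expressed per `per` person-years.

    Assumes the event count is Poisson distributed and large enough for a
    Normal approximation to be reasonable.
    """
    if events <= 0 or person_years <= 0:
        raise ValueError("events and person_years must both be positive")
    rate = events / person_years
    standard_error = math.sqrt(events) / person_years
    lower = max(0.0, rate - z * standard_error)
    upper = rate + z * standard_error
    return rate * per, lower * per, upper * per


# Hypothetical data: 64 accidental deaths over 250,000 person-years.
rate, lower, upper = crude_rate_ci(events=64, person_years=250_000)
print(f"Crude rate: {rate:.1f} per 100,000 (95% CI: {lower:.1f} to {upper:.1f})")
```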

. Q. 4. What are the advantages of the indirect method of standardisation as a means of adjusting risk estimates for possible confounders?

A. As is explained in Section 3 of the BMJ article Epidemiology for the Uninitiated, in the majority of cases the indirect method leads to more stable estimates of risk. This is another way of saying that if your study were repeated with a similar cohort, the new estimate would be more likely to be close to the original one. This property is particularly important where you are considering cohorts (such as ethnic minority groups) which are not very large and where you are considering evaluating multiple risk estimates for comparison, such as when you may wish to compare Bangladeshi, Indian and Pakistani cohorts with a White reference population.

One further highly important advantage of the indirect method of standardisation is that it is more amenable to tests of statistical significance for comparing study cohorts with a reference group within the context of a multivariate analysis. Such work is best carried out by a trained statistician, however.

You can read more about direct and indirect standardisation at the following links:

Epidemiology for specialists: standardisation

Information on Public Health Observatory recommended methods

Indirect age standardisation – ScotPho

. Q. 5. How can I correctly represent the prevalence of a disease (or other adverse outcome) in an exposed group by comparison with a reference population or unexposed group, whilst taking into consideration confounding due to factors such as age or gender? (It is one thing to report the prevalence but I would like to assess whether exposure to a risk factor such as smoking is really making a difference.)

A. You would do well in the first instance to familiarize yourself with notions such as attributable risk, relative risk and standardisation as defined and illustrated within the BMJ resource Comparing disease rates. Within this resource, the indirect standardisation method is recommended and explained as a means of calculating standardised mortality ratios (SMRs). The example provided of this method illustrates how it is possible to standardise symptom prevalence for a study cohort relative to a standard cohort or population, whilst making the appropriate adjustment for the age category of the patient.
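
The arithmetic behind the indirect method can be sketched as follows: apply the reference population's age-specific rates to the numbers in each age band of the study cohort to obtain the expected number of events, then divide the observed number by the expected number. The age bands, rates and counts below are hypothetical, and the function name is my own rather than anything defined in the BMJ resource.

```python
def indirect_standardised_ratio(observed_total, cohort_sizes, reference_rates):
    """Return (expected events, standardised ratio expressed as a percentage).

    cohort_sizes: number of study-cohort members in each age band.
    reference_rates: age-specific rates (as proportions) in the reference population.
    """
    expected = sum(cohort_sizes[band] * reference_rates[band] for band in cohort_sizes)
    return expected, 100 * observed_total / expected


# Hypothetical data, for illustration only.
cohort_sizes = {"18-39": 400, "40-59": 300, "60+": 300}
reference_rates = {"18-39": 0.05, "40-59": 0.10, "60+": 0.20}

expected, ratio = indirect_standardised_ratio(
    observed_total=132, cohort_sizes=cohort_sizes, reference_rates=reference_rates
)
print(f"Expected events: {expected:.1f}")
print(f"Standardised ratio: {ratio:.1f}")
```

A ratio above 100 indicates more events in the study cohort than would be expected if the reference rates applied, after allowing for the cohort's age structure.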

. Q. 6. While, ideally, I would like to adjust my calculation according to age or gender, I do not have age- or gender-specific data. Is it possible to obtain a ratio similar to that discussed under Q. 5 but without standardisation according to the above factors? Also, I am considering person years in the sense that each person is being followed up for one year and I want to look at prevalence of admissions for homeless persons versus the population as a whole. What should I call the resultant ratio?

A. It is really easy to adjust the approach in the solution to Q. 5 to meet your needs. On referring to Table 3.5 under the recommended resource Section 3 (and the advice underneath this table), just use a similar table which has one row for your homeless group as a whole. (You would only want multiple rows if you were stratifying, e.g. according to gender or age.) In your case, of course, the event of interest is admission rather than death. There is really no need for a row entitled ‘Total’, as you are not summing over multiple rows: you only have one row.

You are now at liberty to proceed to Q. 7 to make your answer that little more sophisticated (or, statistically valid). It makes sense to call your statistic a rate ratio rather than a standardised ratio. This is quite conventional!

. Q 7. I have calculated standardised mortality ratios using the reference above but I would like to obtain confidence intervals for these ratios. Can you recommend an alternative resource for this purpose?

A. Yes; in fact it is possible to perform the relevant calculations VERY quickly by referring to Confidence intervals and significance testing for a standardized ratio. Before using this resource, however, take care to consider whether use of confidence intervals (CIs) makes sense for your study. CIs and statistical significance (both of which are covered in this resource) belong to the world of statistical inference. If you are seeking to compare prevalence rates for a complete sample of volunteers for a future dementia study with those in the general population and a) your study sample is complete and b) you do not wish to consider this sample as only one of several such samples from the general population, your study sample is your study population and your standardised ratio(s) are actual values, not estimates. As such, they do not need to be assessed in terms of statistical accuracy by means of a CI, nor do they require to be assessed for statistical significance, as the study population which they are intended to represent is the study sample itself. There should therefore be no issue concerning sampling bias. However, you may find it useful to consider evaluating the magnitude(s) of your standardised ratio(s) as markers of potential volunteer bias. To learn more about this form of bias, you may find it helpful to consult the resource Volunteer Bias in Psychology: Definition and Importance.
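
Where a CI is appropriate, one widely used approximation can be sketched as below. It assumes that the observed count follows a Poisson distribution and uses a Normal approximation, so that the standard error of the ratio is the ratio divided by the square root of the observed count. This is my assumption for illustration and may differ from the method in the resource above; with small observed counts an exact method is preferable. The counts are hypothetical.

```python
import math


def standardised_ratio_ci(observed, expected, z=1.96):
    """Standardised ratio (as a percentage) with an approximate 95% CI.

    Assumes a Poisson-distributed observed count that is large enough
    for a Normal approximation to be reasonable.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must both be positive")
    ratio = 100 * observed / expected
    half_width = z * ratio / math.sqrt(observed)
    return ratio, ratio - half_width, ratio + half_width


# Hypothetical data: 100 observed events against 80 expected.
smr, smr_lower, smr_upper = standardised_ratio_ci(observed=100, expected=80)
print(f"SMR: {smr:.1f} (95% CI: {smr_lower:.1f} to {smr_upper:.1f})")
```

If the whole interval lies above 100, the observed number of events is significantly higher than expected at the 5% level under this approximation.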

. Q. 8. Is it possible to carry out the calculations for standardised ratios and their corresponding 95% CIs using a software programme?

A. There are no specific menu commands within Minitab or SPSS which will allow you to perform these calculations. Therefore, you might like to consider referring to the Excel template below, in which two examples have been set up: one involving the calculation of a standardised symptom ratio, adjusted for age group of the patient, and the other involving the calculation of a standardised retinopathy (prevalence) ratio, adjusted for gender of the patient (where the reference group is the white population). On saving this template, you can adjust it according to your purposes.

Examples involving indirect standardisation

. Q 9. I am conducting a study on yearly incidence of coronary heart disease (CHD) over a 5-year period. As part of this study, I would like to compare the distribution of deprivation within my CHD study cohort (irrespective of year) with the general population. I have used postcodes to determine Carstairs depcat categories but how can I appropriately: a) merge these categories to represent extremes of deprivation and b) represent the differences between the distributions for the study cohort and the general population numerically? I should add that I wish to consider the data separately for the two age-groups ‘18 – 59’ and ‘60+’.

A. In response to a), note that it is quite common in the literature to combine the 7 Carstairs depcat categories in the form ‘1 – 2’, ‘3 – 5’, ‘6 – 7’ so as to represent low, moderate and high levels of deprivation, respectively.

In response to b), note first that if you have frequency data for quintiles of the Scottish Index of Multiple Deprivation (SIMD) instead of for depcat categories, the methodology below can be readily adapted to suit this case.

As for comparing the distribution of deprivation for the study cohort with that of the general population for the merged depcat categories, a template is provided below to assist you in obtaining an age-specific deprivation ratio separately for each merged depcat category for each of your two age-groups. The template also offers the methodology for obtaining the corresponding 95% confidence intervals for these ratios. By clicking on the relevant cells, it is possible to see how the necessary calculations have been built up using arithmetic functions within Excel (see the top menu bar within the spreadsheet). Hopefully, you can now see how to apply this sort of approach with your own numbers.

Excel template for calculating deprivation ratios

deprivation_ratios_chdQ.4

To determine whether the observed number of CHD persons in a given depcat category is significantly different in each case from the expected number (based on the population), have a read of the material towards the end of the resource provided under Q. 7. Notice, for example, that for age-group 18 – 59 years, the depcat ratio for the least deprived group is 269.01 (95% CI: (234.24, 303.77)).

For the above age-group, CHD patients are more than 2.5 times as likely to fall into the least deprived category ‘1-2’ compared to the general population for the given sample studied.

Also, since both confidence limits are above 100%, we can be 95% sure that for the age-group 18 to 59 years there is a significantly higher number of CHD persons generally speaking (not just for this study) falling into the least deprived depcat category ‘1 – 2’ than in the case of the general population. Please refer to the link under Q. 7, above, for more general advice on how to interpret confidence intervals for standardised ratios.

Notice that the confidence interval is wider for the older age-group. This is to be expected given that the sample size is smaller for this group.

You could, of course, proceed in a similar way to that above for each of the original 7 depcat categories (and hence without merging).
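
For readers who prefer to see the logic outside a spreadsheet, the following is a minimal sketch of the kind of calculation described above for a single age-group: merge the 7 depcat categories into ‘1 – 2’, ‘3 – 5’ and ‘6 – 7’, work out the number of cohort members expected in each merged category if the cohort followed the population distribution, and express observed over expected as a percentage with an approximate 95% CI. The counts are hypothetical, the function names are my own, and the Normal approximation used for the CI is an assumption that may not match the template exactly.

```python
import math

# Merged categories representing low, moderate and high deprivation.
MERGED = {"1-2": (1, 2), "3-5": (3, 4, 5), "6-7": (6, 7)}


def merge_depcats(counts_by_depcat):
    """Sum counts for depcats 1 to 7 into the three merged categories."""
    return {
        label: sum(counts_by_depcat[d] for d in depcats)
        for label, depcats in MERGED.items()
    }


def deprivation_ratios(cohort_counts, population_counts, z=1.96):
    """Return {category: (ratio %, lower limit, upper limit)} for one age-group.

    Assumes every merged category has a non-zero observed count.
    """
    cohort = merge_depcats(cohort_counts)
    population = merge_depcats(population_counts)
    cohort_total = sum(cohort.values())
    population_total = sum(population.values())
    results = {}
    for label in MERGED:
        expected = cohort_total * population[label] / population_total
        ratio = 100 * cohort[label] / expected
        half_width = z * ratio / math.sqrt(cohort[label])
        results[label] = (ratio, ratio - half_width, ratio + half_width)
    return results


# Hypothetical counts by depcat (1 = least deprived, 7 = most deprived).
cohort_counts = {1: 40, 2: 60, 3: 50, 4: 50, 5: 50, 6: 25, 7: 25}
population_counts = {1: 1000, 2: 1000, 3: 2000, 4: 2000, 5: 2000, 6: 1000, 7: 1000}

ratios = deprivation_ratios(cohort_counts, population_counts)
for label, (ratio, lower, upper) in ratios.items():
    print(f"Depcat {label}: {ratio:.2f} (95% CI: {lower:.2f} to {upper:.2f})")
```

The same function could be run once per age-group, and the dictionary `MERGED` could be replaced by one entry per original depcat category if no merging is wanted.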

Adapting the Excel template for your needs

Please note that it is relatively straightforward to adapt the above template to address other clinical scenarios. For example, you may wish to compare prevalences of different BMI categories (normal, overweight and obese) in children, where a sample of children with learning disabilities is to be compared with a general population. As the BMI categories are calculated separately for each of males and females using centile charts, you may find it convenient to calculate a standardised ratio separately for males and females. In the latter case, the above template can easily be adapted to your needs by replacing age-groups by gender categories and depcat categories by BMI categories.

. Q 10. I am interested in including confidence intervals (CIs) in a chart. I have calculated separate standardised mortality ratios (SMRs) and corresponding 95% CIs for each year over a fixed time period. Can you recommend how best to plot these in order to look for a trend over time? I would also be interested in learning how to plot simple percentages in a similar way.

A. Please note the following:

a) Steps 1 – 4 below don’t apply uniquely to SMRs or to a scatter-plot. You could proceed very similarly if requiring to plot CIs for a standard proportion obtained in Minitab against time (or a variable other than time) using a simple bar-chart (or, “column chart”). Bear in mind, however, that in this case, by their very meaning, the increments required at step 5 below can be obtained by subtracting each sample proportion from the corresponding upper bound of the 95% CI. (Please refer to Q. 4 and the accompanying solution on the StatsforMedics page CONFIDENCE INTERVALS to access detailed instructions with a worked example for the corresponding calculations.)

b) If you want to plot an average proportion over time, where at each time point the average has been obtained over a considerable number of individual proportions, your data are best represented by a line graph, not a bar chart. Furthermore, assuming your data are Normally distributed at each timepoint – please check* – the relevant CI is that for the sample mean. In these circumstances, you are best advised to use the line graph plotting facility available via SPSS. On selecting the sequence Legacy dialogs –> Line from the menu Graphs, be sure to select the option Summaries of separate variables. This option assumes that you have your raw percentage data for calculating each average in a separate column for each time point in SPSS, while the option Simple assumes you have only one line graph to plot. On selecting the button Define, you will also see that there is a button Options. Use this button to help you request a 95% CI for the average you need to calculate for each time point.

*For any one timepoint, testing your data for Normality is only strictly necessary where you have fewer than 30 data points. If you are unclear about how to test for Normality for the percentage data in each of the time columns, please refer to Q. 4 and the accompanying solution on the StatsforMedics page Tests of Normality. The paragraph commencing ‘NB.’ in this solution applies to you!

Steps to follow – using Excel with the SMR and simple proportion

It would be helpful in the first instance if you created a scatter plot in Excel of SMR against year. You can then add in your calculated CIs using the steps below.

1. In the solution to Q. 7, above, you are provided with the correct formulae for calculating the increments you wish to add to or subtract from your SMR to create your confidence interval. These can be calculated separately for each SMR in a separate Excel column using the built-in arithmetic functions within Excel. However, if you are considering the sample proportion rather than the SMR, you will find the process in Excel a little more straightforward.

2. On the existing plot, click anywhere you like.

3. A green cross icon will appear to the right of your plot.

4. Click on this icon and select the option ‘Error Bars’.

5. Hover over the option ‘Error Bars’ and click on the grey forward pointing arrow on this icon.

6. Familiarise yourself with the options for horizontal and vertical error bars under the drop-down menu ERROR BAR OPTIONS and decide which you wish to display.

7. Choose More Options and in the resultant dialogue box, choose Custom followed by the button option Specify Value.

8. To the right of the box labelled ‘Positive Error Value’, you will see a little coloured box icon. Click on this icon, then select the data in your spreadsheet corresponding to the increments. This allows you to select the data you require for displaying the upper limits of your confidence intervals.

9. Repeat step 8 for the ‘Negative Error Value’ box to select the data you require for displaying the lower limits of your confidence intervals.

10. Click ‘OK’.

11. Admire!

12. Additional notes:

a) If you find that any of the above instructions do not work with your current version of Excel or you would like more details about the above instructions, please check under Add, change, or remove error bars in a chart.

b) If your sample size is not constant over time, it is important to make this absolutely clear by means of your chart. One way of doing this is by adding text boxes with text of the form ‘n = …’ underneath each 95% confidence interval, where “underneath” may involve including such text below the horizontal axis for your line graph. In the context of writing a report, this step could be carried out in MS Word. Positioning is up to your judgement of what looks clearest. By making the borders of your text boxes white, you can make the text appear as an integral part of your original graph!

c) If you are trying to create a chart of the above sort using Excel and your horizontal axis labels are in text rather than in numerical form, you will require to create a line graph rather than a scatter-plot to ensure that your labels appear in text form. (You should also have formatted the data as text form within the spreadsheet.) Once you have created the line graph, just delete the part joining the points together (unless you are trying to display trend) and your chart will look like a scatter-plot.
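
The increments used for the custom error bars in the steps above are simply the distances from each estimate to its upper and lower confidence limits. As a minimal sketch (with hypothetical SMRs and limits, and a function name of my own choosing), they could be computed outside Excel as follows:

```python
def error_bar_increments(estimates, lower_limits, upper_limits):
    """Return (minus, plus): distances from each estimate to its CI limits."""
    minus, plus = [], []
    for estimate, lower, upper in zip(estimates, lower_limits, upper_limits):
        if not lower <= estimate <= upper:
            raise ValueError("each estimate must lie within its confidence limits")
        minus.append(estimate - lower)  # corresponds to the negative error value
        plus.append(upper - estimate)   # corresponds to the positive error value
    return minus, plus


# Hypothetical yearly SMRs with their 95% confidence limits.
years = [2015, 2016, 2017]
smrs = [110.0, 125.0, 98.0]
lower_limits = [95.0, 100.5, 80.0]
upper_limits = [125.0, 149.5, 116.0]

minus, plus = error_bar_increments(smrs, lower_limits, upper_limits)
for year, smr, down, up in zip(years, smrs, minus, plus):
    print(f"{year}: SMR {smr:.1f}, error bar -{down:.1f} / +{up:.1f}")
```

If you happen to be plotting with matplotlib rather than Excel, the two lists can be passed together as `yerr=[minus, plus]` to the `errorbar` plotting function, which plays the same role as the Positive and Negative Error Value boxes.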

. Q. 11. I am interested in representing the prevalence of type 2 diabetes mellitus in Lothian. My intention is to obtain an overall proportion using the direct method which is corrected for age and stratified according to gender and SIMD quintile. I also wish to present my proportion as a proportion adjusted for the actual distribution of frequencies of persons within Scotland.

How may I proceed?

A. Firstly, to assess the appropriateness of choosing the direct method rather than the indirect method of standardisation, you may read the section 14. Issues in the use of standardisation within the resource

Epidemiology for specialists: standardisation

and the advice provided within the resource

Indirect age standardisation – ScotPho, especially in sections 3.3, 3.4, 4.2, 4.3 and 5.

In terms of explanation of methodology for the direct method, you should refer to the Excel template Prevalence Through Direct Standardisation.

This template illustrates how to lay out your data for obtaining prevalence in Lothian before and after adjusting for the Scottish population frequencies (see crude and adjusted prevalences, respectively).

You will see that in the top-right-hand corner of some of the cells, there is a red flag. Hover over this flag to read the comments, which are intended to provide further explanation.

Once you have grasped the methodology, you can use this template in a variety of settings. For example, you may be interested in obtaining prevalence of high level of HbA1c (over 7.5%) among those with type 2 diabetes mellitus in Lothian. Your intention in this case may be to obtain an overall proportion which is corrected for age and stratified according to gender and ethnicity. In turn you may wish to present your proportion as a proportion adjusted for the actual distribution of frequencies of persons within Scotland with diabetes mellitus. The analogy with the earlier example should be clear.
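
To make the direct method concrete, here is a minimal sketch of the underlying arithmetic, assuming just two strata and entirely hypothetical counts (the template itself uses more strata and may be laid out differently). The crude prevalence is total cases over total local population; the adjusted prevalence weights each stratum-specific local prevalence by the size of that stratum in the standard population.

```python
def crude_and_adjusted_prevalence(cases, local_population, standard_population):
    """Return (crude prevalence, directly standardised prevalence) as proportions.

    All three arguments are dictionaries keyed by the same strata.
    """
    crude = sum(cases.values()) / sum(local_population.values())
    standard_total = sum(standard_population.values())
    adjusted = sum(
        (cases[s] / local_population[s]) * standard_population[s] for s in cases
    ) / standard_total
    return crude, adjusted


# Hypothetical strata and counts, for illustration only.
cases = {"under 50": 20, "50 and over": 160}
local_population = {"under 50": 2_000, "50 and over": 2_000}
standard_population = {"under 50": 60_000, "50 and over": 40_000}

crude, adjusted = crude_and_adjusted_prevalence(
    cases, local_population, standard_population
)
print(f"Crude prevalence: {crude * 100:.1f}%")
print(f"Adjusted prevalence: {adjusted * 100:.1f}%")
```

In this made-up example the adjusted figure is lower than the crude one because the local population is older than the standard population, and prevalence rises with age.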

. Q. 12. I am conducting a study aimed at representing prevalence of angioplasties performed within Lothian Community Health Partnerships. I wish to express prevalence (using the direct method) as a proportion relative to the European Standard Population and to standardise my calculations according to age and gender. Can you point me to some relevant resources?

A. Yes; in the first instance, please consult the resource

Epidemiology for specialists: standardisation

and the advice provided within the resource

Indirect age standardisation – ScotPho, especially in sections 3.3, 3.4, 4.2, 4.3 and 5.

Through consulting the latter resource, you should also be aware of the advice provided on the use of the 2013 European Standard Population for your calculations.

Within the resource Information on Public Health Observatory recommended methods, there are also some useful cautionary notes on the use of the European Standard Population as the choice of standard population. You should be careful to choose your standard population wisely.

In terms of explanation for the methodology for the direct method, you may find it helpful to consult the article Standardisation of rates using logistic regression: a comparison with the direct method. You should focus on the details about the direct method, including the illustrative example. The logistic regression method is not relevant to what you are aiming to do. For your convenience, however, an Excel template is provided below so as to allow you to perform the necessary calculations to obtain a proportion or a rate expressed in the form per 100,000 population. The sheet Original Data with Formulae contains comments to help you understand what is contained in individual columns. Just hover over the column headers with a red flag to learn more. A chart (see the sheet Prevalence chart) is also provided with standardised rates and corresponding confidence intervals. All of the data used for this chart (see sheet Datasheetforchart) have been extracted from the sheet Original Data with Formulae. To learn more about creating charts in Excel with corresponding confidence intervals, refer to Q. 10 on this page.

Excel template illustrating prevalence calculations using the direct method
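
As a hedged sketch of how a directly standardised rate per 100,000 and an approximate 95% CI might be computed, the code below weights stratum-specific rates by a standard population and uses a Normal approximation with Poisson variances for the event counts. The strata, counts and weights are hypothetical (they are not the European Standard Population figures), and the CI method is my assumption; it may differ from the one used in the template.

```python
import math


def directly_standardised_rate(events, population, standard_weights,
                               per=100_000, z=1.96):
    """Directly standardised rate with an approximate 95% CI, per `per` population.

    events, population and standard_weights are dictionaries keyed by stratum
    (for example, age band within gender). Assumes Poisson-distributed counts.
    """
    total_weight = sum(standard_weights.values())
    rate = 0.0
    variance = 0.0
    for stratum, weight in standard_weights.items():
        share = weight / total_weight
        rate += share * events[stratum] / population[stratum]
        variance += share ** 2 * events[stratum] / population[stratum] ** 2
    half_width = z * math.sqrt(variance)
    return rate * per, (rate - half_width) * per, (rate + half_width) * per


# Hypothetical strata, counts and standard weights, for illustration only.
events = {"stratum A": 25, "stratum B": 100}
population = {"stratum A": 50_000, "stratum B": 25_000}
standard_weights = {"stratum A": 70_000, "stratum B": 30_000}

dsr, dsr_lower, dsr_upper = directly_standardised_rate(
    events, population, standard_weights
)
print(f"Standardised rate: {dsr:.1f} per 100,000 "
      f"(95% CI: {dsr_lower:.1f} to {dsr_upper:.1f})")
```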

Epidemiology for the Rusty and Comparing a Study Cohort with a General Population by Margaret MacDougall is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.