Arranging Data in Spreadsheets for Statistical Analysis

· Q 1. I have inherited a CSV data file containing semi-colons to separate data. On attempting to open the file using Excel I found that the data did not separate as hoped into meaningful columns. Can you recommend a way of instructing Excel to display the data in the originally intended layout?

A. Yes; a suitable way forward is recommended in the resource

How to open CSV files with the correct delimiter/separator .

· Q 2. Is it possible to export data from MS Excel into SPSS?

A. Yes; however, it is important to do this properly. I recommend that you read the Data Preparation Tutorial carefully.

· Q 3. My data are currently stored in MS Access. Is it possible to export data from MS Access into SPSS?

A. Yes, but I recommend that you do transfer your data to MS Excel in the first instance (see Exporting Access Tables to Excel for instructions). Then use the solution to Q. 1, above to progress to SPSS.

·Q 4. Are there any restrictions on the number of variables and cases I can assume in an Excel spreasheet?

A. Yes, you can check this using your keyboard control key and arrow keys. Currently, you use 256 columns and 65536 rows in any Excel spreadsheet.

However, if you are importing your data into SPSS (see previous question), you will have the opportunity to create larger spreadhseets. The best approach here is to use the file merging facilities within SPSS once you have imported your Excel data into individual SPSS files. Have a look at the site Merging Similar Datasets for details of how to do this.

For your information, an SPSS spreadsheet can assume up to 2.15 billion cases (represented by rows) but in theory, there is no limitation on how many variables (represented by columns) which it could assume. In practice, for a short project, it wouldn’t usually be sensible to utilize even hundreds of columns in a single SPSS spreadsheet (particularly due to human memory problems in surveying the spreadsheet) and the limitations on your computer memory would problem restrict your spreadsheet capacity anyway.

·Q 5. I am preparing a target list of people with contact details for a survey. I wish to ensure that I have no duplicate cases in my list. My list is currently presented in an Excel spreadsheet. How can I conveniently identify duplicates?

A. Learn how to request Excel to highlight duplicate cases through font and cell colouring by consulting the guide Excel 2010 Duplicate & Unique Values.

· Q 6. I have patient notes for 150 subjects but have my doubts that I will be able to compile a database for all of these patients within the time-frame of my project. In turn, I am uncertain at this stage as to how far I shall get and therefore I cannot predict a reasonable sample size to work with. Is there a convenient procedure to hand which will allow me to assign random numbers to all of the patients in the population in order that those patients I do include in my study have truly been selected in a random manner?

A. What you want to do is to number your patients from 1 to 150 and then generate a random permutation of numbers between 1 and 150. The order of the numbers in the permutation will be indicative of the order in which you should select your patients and you can continue like this until the deadline for data collection has been reached.

Please be careful, however, in selecting a good quality random number generator. Excel’s function RANDBETWEEN is not ideal. Also, in your write-up, always state clearly which generator you used (including the author, if known) in order that future researchers are aware of how you have proceeded. If you have used an online calculator, include the URL and the date of use in your reference list. To access a calculator I would recommend, please select third generator at:

Random permutation calculator

When, in turn, you select ‘Help’ you will see that the calculator is validated by reliable references. This is really important!

Tips: when using this calculator, opt for ‘Single Column’ so as to allow you to generate a single permutation of numbers which you can then copy-paste into Excel for future use. Also, as indicated under ‘Help’, the calculator generates its own value for the seed. The value of the seed is a useful means of identifying the particular permutation you have obtained but you don’t need to dream one up. Just leave the space for this blank, unless you want to retrieve a previous permutations which you have recorded the random number seed for (please see bottom of page when you generate your permutation).

For your referencing, the author of this calculator is Gerard E. Dallal.

· Q 7. I wish to use a random number sequence to allocate 2 treatments (A to B) and one control (C) to 150 patients in such a way that there are 50 different patients in each group and to carry out the randomisation by allocating sealed envelopes to patients. How should I proceed?

A. In order to ensure that you are fully clued up on good research practise with regards to concealment of treatment and control allocations when using the ‘envelope procedure’, please consider carefully reading the article Allocation concealment in randomised trials: defending against deciphering.

You should be aware therefore that there is a need to generate a sequence of random numbers and allocate the random numbers to different treatments. The number allocated to each patient should be included in a concealed envelope and the treatment which that number represents should be stored confidentially within the computer system used to assign numbers to treatments. It is recommended that those estimating distress are blinded to which of the three groups the patients were allocated to and to the mapping between random numbers and treatments generated by the computer system. It is also recommended that no patient is aware of what group they have been allocated to in advance of implementation of the treatment or control procedure for that patient.

You ought to find the random assignment calculator very useful for your purposes in that it generates the random numbers you require in order to know what treatment to allocate to each patient. You simply require to provide the number of treatments, the total number of participants and the number of times you wish to carry out the experiment. If you only wish to carry out the experiment once choose ‘1’ under the drop-down menu for ‘Repeat’.

Please also note that it you are doubtful that you will reach the target of 50 patients in each group you should either consider decreasing the group sizes and decreasing the length required for your random number sequence accordingly in advance of generating that sequence.

· Q 8. I have a spreadsheet with kidney transplant dates included and wish to extract the year from these dates in order to compare transplant or patient survival times for different years. How can I avoid doing this manually?

A. Firstly, if you have your data in Excel, it would be a good idea to import it into SPSS for this process (see the first FAQ in the current section). Then follow the instructions in the presentation:

Extracting part of a date in SPSS when there are many patients in your sample. PowerPoint | PDF

· Q 9. Where can I find a really clear and useful comprehensive guide to using the Variable View window in SPSS?

A. You should find the variable view tutorial The Data Editor very appropriate for your needs. Among other things, this tutorial demonstrates how to assign labels to exisiting categories within a categorical variable and how to use the column Label in Variable View to overwrite the abbreviated titles you use for your column headers, should you wish to use more meaningful titles within tables and graphs.

Handy tip: It is helpful at times to have access to a summary of the properties you have assigned to your individual variables in SPSS, including labels you have assigned to individual categories. For this purpose select the path File –> Display Data File Information –> Working file in SPSS. You will then be able to create tables of output which are handy for you to refer to in a meeting, for example, when discussing your work with others.

· Q 10. Now that I have entered my data into SPSS, I would like to recode my ages in order that they appear as categories for different age ranges. Is there a quick way to do this using SPSS?

A. Yes, the first section in the MS Word file available at

Recoding Data, Selecting Cases, Sampling, and Sorting in SPSS

explains the necessary procedure extremely well. However, please also have a look at the following tips before using the file:

1. When reading the file, do make use of the html zoom-in facility by increasing the % on the upper horizontal bar to 200%, say.

2. Make sure that you identify this column as ‘nominal’ or (if your categories have a natural ordering) ‘ordinal’ under the column ‘Measure’ within the window Varaible View’. You should also leave the variable in Numeric form (see the column ‘Type’ in the SPSS window Variable View).

Usually, you will ultimately wish to use word labels such as ‘normal’, ‘below normal’ and ‘above normal’ for your new categories rather than simply number codes. (This looks more helpful in graphs for example.) You should do this as an extra step, however, using the window Variable View in SPSS (see Q. 3, above and work your way through the recommended tutorial until you reach the explanation of the column ‘Values’ within ‘Variable View’).

· Q 11. I have scores from the 36-Item Short Form Survey (SF-36) which I would like to transform into a new column of ranks to support further testing of model assumptions. Can this be achieved easily in SPSS?

A. Yes, indeed there is. Please see the worked example under Ranking your Cases: IBM SPSS Statistics.

· Q 12. Unfortunately, a few of my categorical variables are appearing under Variable View as string rather than numeric variables. This is creating an impediment to use of functionality in SPSS which assumes that my category names are mapped to integers. Can you suggest a quick way of resolving this issue.

A. Yes; fortunately, there is a useful SPSS procedure, known as Automatic recode, which can assist you. Note that this procedure assigns consecutive strictly positive integers to your category labels in the alphabetical order of these labels. This can be in either forward or reverse alphabetical order (depending on your choice). Please refer to the resource Recoding String Variables to Numeric Codes using Automatic Recode.

· Q. 13. I have noted the solution to Q.10, above. Is it also possible to recode data in Excel?

A. Yes, it is; but it is a little more tricky in that you need to learn to use the inbuilt logical function entitled “IF”. This is a powerful function and you should find the tutorial How to use the Excel function IF helpful in defining your syntax. As a word of caution, if you are preparing your data for analysis in SPSS, you should use the above function to create numerical codes, not labels in words; otherwise, the examples provided still apply.

· Q 14. I require to classify age at diagnosis according to a categorical variable for high and low risk but the threshold ages for these risk categories differ according to the gender of the patient (for females, low risk: < 30 and high risk 30 +; for males, low risk: < 35, high risk 35 +) . Is there a convenient way of using the above recoding facility to create a categorical column with 1’s for high risk and 0’s for low risk patients irrespective of their gender whilst ensuring that different thresholds have been used according to gender?

A. Yes. Let’s assume for the sake of argument that you originally coded females and males as ‘1’ and ‘2’, respectively and that you have named your gender column ‘gender’. Once inside the main dialogue box for recoding your data, select the ‘If…’ button and choose your variable ‘gender’ from the listed variables. Using the calculator buttons provided set gender = 1 and proceed to create a new column according to the risk categories for females. A new column will be created at the end of your spreadsheet for females only. Now repeat the process using the same column name with gender = 2 and using the risk categories for males. This will allow your categorical column to be completed. You should then use the Variable View facility within SPSS (see Data Preparation Tutorial) to ensure that your new risk variable is classified under ‘Measure’ as ‘ordinal’, not ‘scale’ and to allow you to apply the labels ‘low risk’ and ‘high risk’ to your number codes for the two risk groups.

· Q 15. I am not sure which categories to bin my raw data into. Have you any suggestions?

A. Often, the answer to this question can be explained in clinical terms. For example, if you are looking at crp levels in mg/L, you may already recognise the following categorization: 0 – 10: normal, 11-50: mildly raised, 51-100: moderately raised, >100 – markedly raised. In such cases, you should find the recoding facility in SPSS very useful as a means of efficiently creating a new column of data with categories. Please consider referring to Q. 9, above to learn how the above can be accomplished.

In other circumstances, however, there may be a lack of clinical theory to back up your choice of categories. In such circumstances, you may wish to seriously consider why you wish to obtain categories in the first place. For example, dichotomization of data (grouping your data into two categories), though a common practise, carries the disadvantage among others of limiting the statistical power for detecting true statistical associations in your data. On the other hand, where, for example, you have a mixture of continuous and categorical data (e.g. with values of eGFR greater than 60 only recorded as ‘> 60’ and your continuous data (e.g. creatinine levels) do not satisfy the model assumptions for analysis of continuous data, you may have no choice but to appeal to an approach using categorical data. In such cases, it still makes sense to seek a clinical justification for your choice of categories.

Based on the reserch findings of professional statisticians, there is a considerable amount of literature available backing up these recommendations, the content of which is far more advanced than that sensibly associated with undergraduate medical student learning in statistics. This is why the salient details have been extracted and presented above and elsewhere to help support your learning.

Nevertheless, you may find it of interest to consult the paper Consequences of dichotomization together with the articles by Senn which are included in the reference list to this paper.

· Q 16. On collecting my data, I recorded mother’s date of birth and the date at which she gave birth in two separate columns. Is there an efficient way in which I can use the dates data in Excel or SPSS to obtain a column of mother’ ages on giving birth? (I am assuming that I need to find a way of subtracting one column of dates from the other.)

A. Yes, I can recommend the instructions under the header Working out a person’s age. You should be able to easily copy your new age column into a SPSS spreadsheet for later use. Alternatively, you can do the age calculation directly in SPSS. Relevant instructions are provided in Table 5 of the document Generating new variables in SPSS: The Transform/Compute command. However, you may well also wish to read some of the material on date formatting in the same document.

· Q 17. I have encountered a problem with formatting of time in Excel. I have calculated time to delay to hospital presentation for patients with Acute Coronary Syndrome. However, the data are in the form hh:mm. I want to be able to analyse the data in SPSS and would therefore like to convert the data to time in minutes. Is there a convenient way of doing this in Excel?

A. Yes; it is helpful to understand that Excel inteprets these data as units of hours per day. You should therefore carry out the following steps to get the data in the correct format.

Step 1. Insert a new column NewDELAY_TIME,say, immediately to the right of your old one. If the first cell of time data in your old column is F2, then type ‘=F2*24*60’ in cell G2.

Step 2. Using the menu Format,opt for ‘Cells’ and then Choose the tab ‘Number’ followed by ‘Number’ from the category list. If, for example, your first delay time was expressed previously as 1:19, it should now read ’79’.

Step 3. Use the drag down facility on the bottom right-hand corner of cell G2 to populate all of the remaining cells of relevance in column G.

·Q 18. How can I access some more general advice on converting units in Excel?

A. In more recent versions of Excel, a new function, Convert, has been introduced, without the requirement that you should know what the scaling factor is that is needed to convert from one particular unit to another (cf. the solution to Q. 13, above). To use this function, you will need to know what syntax Excel uses to denote the units you wish to convert from and to. For example if you want to convert from miles to kilometres (as units of distance on land), you should note that Excel requires you to use the terms “mi” and “km”, respectively. A fairly comprehensive list of terms for other units is available at choose units. The actual conversion process involving the use of the above Excel function is easy. To see how to apply this process for a single number, refer to the steps at the guide Convert measurements. The example at the latter link is particularly handy, as the cell number rather than the actual value is used to perform the calculation. This approach is needed if you want to efficiently generate results for an entire column of numbers. For the example, here, all of these numbers (in miles) could be column A (starting at cell A2) and you may wish to present them (in km) in column B (starting at cell B2). After you have entered the correct syntax for cell B2, just use the trick recommended under Step 3 in the solution to Q. 13, above to populate all of column B with numbers in km!

.Q 19. Can you recommend resources for the conversion of a measurement
(height, weight, BMI, etc) in children to a standard deviation score (SDS) adjusted for age and sex?

A. You will need the following resources:
README.DOC
LMSDATA.XLS
LMSMACRO.XLA
LMSTEST.XLS

Essential point: please study the contents of the file README.DOC very carefully before attempting to do anything with the other files. Also, please note the licensing and copyright information below.
Please note that the above files README.DOC, LMSDATA.XLS, LMSMACRO.XLA and LMSTEST.XLS have been provided by the kind permission of Mr John Short of The Child Growth Foundation, London. The Child Growth Foundation is the owner of the copyright on the material on this disk. You may use this material on the basis that it is not distributed to a second party and that you use it for clinical or research purposes only. The Excel macro contained therein must not be utilised with any software product other than Excel and the materials on the disk must not be useful for any commercial purpose without the express prior written consent of the Child Growth Foundation. All rights not expressly granted are retained and not licensed to you as a user.

.Q 20. Can you recommend a method for categorizing BMI in adults so as to define ‘underweight’, ‘obese’, ‘very seriously obese’, etc?

A. Have a look at the section Categories in the Wikipedia article Body mass index.

.Q. 21. I currently hold multiple columns of data in my spreadsheet on an individual patient basis for type of analgesic and dosage of each analgesic administered. Across the different patients there is a considerable number of permutations of types and dosages of analgesic. Is there a way in which I can consolidate this information in order to create a single column representing potency of analgesic in order that I can use this variable as a proxy for pain score in my study?

A. This is at heart a clinical problem where the author of the query is seeking a means of comparing like with like across different patients. Based on their clinical expertise, do you know of any clinician who has successfully created a formula for listing all sets of equivalent permutations of dosages and types of analgesic? Probably not! I have discussed this problem in detail at a multidisciplinary level and
noted that this problem has surfaced on many occasions but attempts to address it have typically been unfruitful. Potential confounding factors include patient perception and tolerance of pain and the effectiveness of a given permutation across individual patients. To gain additional and more positive insight into this topic and to appreciate the scale of research and teamwork behind any relevant investigation, you may find it valuable to read the following paper, which applies specifically to the field of rheumatology:

ASAS recommendations for collecting, analysing and reporting NSAID intake in clinical trials/epidemiological studies in axial spondyloarthritis.

My thanks are expressed to:

Dr John Whittingdon, Owner of Mediscience Services, for sharing his personal observations on the difficulties associated with addressing the above problem;
and
Ms Monika Hebiesen, SCQM Foundation in Rheumatic diseases, for highlighting the above reference.

Arranging Data in Spreadsheets for Statistical Analysis by Margaret MacDougall is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

StatsforMedics

Arranging Data in Spreadsheets for Statistical Analysis

· Q 1. I have inherited a CSV data file containing semi-colons to separate data. On attempting to open the file using Excel I found that the data did not separate as hoped into meaningful columns. Can you recommend a way of instructing Excel to display the data in the originally intended layout?

· Q 2. Is it possible to export data from MS Excel into SPSS?

· Q 3. My data are currently stored in MS Access. Is it possible to export data from MS Access into SPSS?

·Q 4. Are there any restrictions on the number of variables and cases I can assume in an Excel spreasheet?

·Q 5. I am preparing a target list of people with contact details for a survey. I wish to ensure that I have no duplicate cases in my list. My list is currently presented in an Excel spreadsheet. How can I conveniently identify duplicates?

· Q 7. I wish to use a random number sequence to allocate 2 treatments (A to B) and one control (C) to 150 patients in such a way that there are 50 different patients in each group and to carry out the randomisation by allocating sealed envelopes to patients. How should I proceed?

· Q 8. I have a spreadsheet with kidney transplant dates included and wish to extract the year from these dates in order to compare transplant or patient survival times for different years. How can I avoid doing this manually?

· Q 9. Where can I find a really clear and useful comprehensive guide to using the Variable View window in SPSS?

· Q 10. Now that I have entered my data into SPSS, I would like to recode my ages in order that they appear as categories for different age ranges. Is there a quick way to do this using SPSS?

Recoding Data, Selecting Cases, Sampling, and Sorting in SPSS

· Q 11. I have scores from the 36-Item Short Form Survey (SF-36) which I would like to transform into a new column of ranks to support further testing of model assumptions. Can this be achieved easily in SPSS?

· Q. 13. I have noted the solution to Q.10, above. Is it also possible to recode data in Excel?

· Q 15. I am not sure which categories to bin my raw data into. Have you any suggestions?

·Q 18. How can I access some more general advice on converting units in Excel?

.Q 19. Can you recommend resources for the conversion of a measurement
(height, weight, BMI, etc) in children to a standard deviation score (SDS) adjusted for age and sex?

.Q 20. Can you recommend a method for categorizing BMI in adults so as to define ‘underweight’, ‘obese’, ‘very seriously obese’, etc?

The WordPress site for supporting undergraduate medical student learning in statistics for short research projects

· Q 1. I have inherited a CSV data file containing semi-colons to separate data. On attempting to open the file using Excel I found that the data did not separate as hoped into meaningful columns. Can you recommend a way of instructing Excel to display the data in the originally intended layout?

· Q 2. Is it possible to export data from MS Excel into SPSS?

· Q 3. My data are currently stored in MS Access. Is it possible to export data from MS Access into SPSS?

·Q 4. Are there any restrictions on the number of variables and cases I can assume in an Excel spreasheet?

·Q 5. I am preparing a target list of people with contact details for a survey. I wish to ensure that I have no duplicate cases in my list. My list is currently presented in an Excel spreadsheet. How can I conveniently identify duplicates?

· Q 7. I wish to use a random number sequence to allocate 2 treatments (A to B) and one control (C) to 150 patients in such a way that there are 50 different patients in each group and to carry out the randomisation by allocating sealed envelopes to patients. How should I proceed?

· Q 8. I have a spreadsheet with kidney transplant dates included and wish to extract the year from these dates in order to compare transplant or patient survival times for different years. How can I avoid doing this manually?

· Q 9. Where can I find a really clear and useful comprehensive guide to using the Variable View window in SPSS?

· Q 10. Now that I have entered my data into SPSS, I would like to recode my ages in order that they appear as categories for different age ranges. Is there a quick way to do this using SPSS?

Recoding Data, Selecting Cases, Sampling, and Sorting in SPSS

· Q 11. I have scores from the 36-Item Short Form Survey (SF-36) which I would like to transform into a new column of ranks to support further testing of model assumptions. Can this be achieved easily in SPSS?

· Q. 13. I have noted the solution to Q.10, above. Is it also possible to recode data in Excel?

· Q 15. I am not sure which categories to bin my raw data into. Have you any suggestions?

·Q 18. How can I access some more general advice on converting units in Excel?

.Q 19. Can you recommend resources for the conversion of a measurement (height, weight, BMI, etc) in children to a standard deviation score (SDS) adjusted for age and sex?

.Q 20. Can you recommend a method for categorizing BMI in adults so as to define ‘underweight’, ‘obese’, ‘very seriously obese’, etc?

The WordPress site for supporting undergraduate medical student learning in statistics for short research projects

.Q 19. Can you recommend resources for the conversion of a measurement
(height, weight, BMI, etc) in children to a standard deviation score (SDS) adjusted for age and sex?