Sunday 28 August 2011

Simulating correlated data

As a first exercise in running a script in R, we shall generate a simulated set of data for two variables, look at some basic statistics for the variables, plot them, and save the data. We will be using the data for more interesting purposes later on, but for the time being, the aim is to familiarise you with some key R commands. In addition, it is very useful to know how to simulate datasets with specific characteristics, as these can be used to check how various analyses work. Copy the script below into the R editor.
#  Simcorr
#  (based on p 11 of OpenMx manual )

# Modified version by: DVM Bishop, 3rd March 2010
# Simulates data on X and Y from 50 cases, with correlation of .5 between them
# (NB using smaller N than in original example, so user can see mismatch between
# obtained correlation and that specified by user).
require(MASS)  # MASS is R package used for generating multivariate normal data

set.seed(2)       # A seed is a value used in creating random numbers;
    # You don't need to understand how it works and you can use any number.
    # Keep the seed the same if you want the same random numbers every time you run
    # Change the seed to any other number to run the script and get different random numbers

rs=.5               # User-specified correlation between variables
# The next command has been broken up with comments to explain each part
# But normally would just be written in one line as:
# mydata=mvrnorm(50,c(0,0),matrix(c(1,rs,rs,1),2,2))

mydata=mvrnorm(50,    # Create matrix of multivariate random normal deviates, mydata,
                                   #  and specify number of XY pairs to generate
   c(0,0),      # User-specified mean values for X and Y; NB c denotes concatenation in R
  2,2))         # Covariance matrix to be simulated, 2 rows, 2 columns
                                # this will give a matrix as follows:  [1 .5
                                #                                                 .5  1]
mylabels=c('X','Y')      # Put labels (X and Y)  for the two variables in a vector
dimnames(mydata)=list(NULL,mylabels) # Allocate labels to mydata (our created dataset)
                         # A list contains a vector of row labels and a vector of column labels;
                         # We just want the second vector, and so NULL is the first entry
summary(mydata)          # Print means for mydata
cov(mydata)                   # Print covariance for mydata
cor(mydata)                    # Print correlation for mydata
print('Note that actual values may differ from specified value of .5, especially with small sample size')
sd(mydata)                     # Print SD for mydata
print('Note that the correlation = covariance/(SD_X * SD_Y)')

plot(mydata)                    # Plot a scatterplot of X vs Y
write.table(mydata,"myfile")   # Save a copy of mydata in your R directory under name "myfile"

# If you want to re-read your data another time, you can use a command such as
# newdata=read.table("myfile"); 
# This will create a matrix called newdata containing the data
Note that the first executable command (after comments) uses the 'require' statement. R has a great many functions, some of which are available from the main package; others are more specialised and so are not automatically available, but can be accessed by specifying the relevant package using the 'require' statement. When we come to use OpenMx, we will always need to start a script with require(OpenMx), in order to access the specialised functions of OpenMx.

You should now try to run the script, inspect the output, and check that you can understand it. The numerical output appears in the Console window, and a new window opens up for the Scatterplot. 

If you are new to R, I recommend you spend an hour or so working through this script, running one line at a time, until you are confident you understand it.  You can save the script from the Editor window with File|Save or File|Save As, and then try modifying it to check whether it does what you expect.  Can you work out how to change the correlation between X and Y?Or how to change the sample size of simulated points?  Note that you can run just part of a script by selecting that part and typing Ctrl+r
Experimenting with a script is one of the best ways of learning how it works.
You will see that the program allows you to save your simulated data; you should be able to find the data file in the working directory that you set up earlier.

A useful feature of R is the edit command:
This opens your data in a window and allows you to edit values from your original matrix('mydata') and save as a new matrix called 'nudata'. Or you can just use edit to update an existing matrix by a command such as:
If you want to save the updated values, you need to rerun the write.table command.
Note that the program encourages you to inspect the correlation and covariance matrices for  X and Y. Most of the analyses in OpenMx  are concerned with relationships between variables as expressed in covariances, more than with absolute values (as, e.g. reflected in the means). You may be more familiar with correlations than covariances. It is important that you understand how one is derived from the other, as explained in the script comments above. In essence, the correlation and covariance express the same information, but the correlation is scaled relative to the standard deviations of the variables, so that its values lie within a range from -1 and 1. The covariance is scaled in the original units. This is just like the variance, which is closely related: the variance is equivalent to the covariance of a variable with itself.

Exercise to consolidate your knowledge
Note that the obtained correlation between X and Y is not exactly .5.  With the seed value given in the script it is .458. Change the seed and rerun the script a few times, noting the values of the correlation between X and Y.
Now change the sample size: experiment with bigger and smaller numbers of XY pairs, say 10, 500, or 5000. Again, you can re-run the script with different seeds to get different estimates of the X-Y correlation. Do you notice anything systematic about how the correlation changes with the sample size?

No comments: