Sunday, 28 August 2011

ACE by path specification


Before we introduce an alternative way of specifying the ACE model, let us start with a clean slate. This is a good idea to avoid accidentally overwriting an existing script or using a variable from a prior session. Close all open scripts, and clear all variables with

rm(list = ls(all = TRUE))

You can also clear the console (see the 'edit' menu on the RGui).

We are going to work with a demonstration script this time. There are many of these in the Demo directory of OpenMx, which should be in the library of your R directory. If you used defaults to load OpenMx, it is probably at an address like this:
C:\Program Files\R\R-2.13.1\library\OpenMx\demo
Now open the demonstration script UnivariateTwinAnalysis_PathRaw. We will step through this program running subsections of the script to clarify how it works. The path diagram for the ACE model, including means (shown in triangle) is given below:

Figure 9: ACE path diagram
This demonstration uses real data on body mass index, which are saved as a prepackaged dataset as part of OpenMx.  Remember you can inspect the list of prepackaged datasets by typing:
          data()

Run the first few lines of the script, up to and including 'summary(twinData)'.
This shows us summary data for all the variables in the dataset. To find out how many cases there are, you can type
nrow(twinData)
which tells us there are 3808 cases.

Take a look at the dataset using edit(twinData). You will see that there are missing values for some variables. We are going to use the variables bmi1 and bmi2, which correspond to body mass index measures for twin 1 and twin 2 in a pair.  Before proceeding, it is useful to look at the histogram for these variables, to get a sense of whether they are reasonably normal. We can do this with the hist command, using the $ symbol to denote the required column, i.e.
          hist(twinData$bmi1)
            hist(twinData$bmi2)
Another command useful for checking the distribution of a variable is boxplot, which you may like to try.

Now run the next four commands. These first of all specify the two variables to be analysed, and then select just these variables for a subset of the rows, corresponding to those with zygosity codes of 1 or 3 (corresponding to female MZ and DZ twins). Use edit again to look at the data in the two new datasets, mzfData and dzfData.

Now run the next four lines of the script, which print out the means for each twin of each zygosity, and the covariances. Remember that we use na.rm=TRUE when computing means to exclude any cases with missing data, and use="complete" for a covariance matrix, to ensure that we exclude cases that don't have data on all variables.  Check the covariance matrices. Does the pattern of covariances give any clue as to whether you will see effects of shared environment or genes in these data?

We are now ready to specify the path structure of the model. The next three lines give a name to the model, specify that it is RAM path type, and list the manifest variables (i.e. our two observed variables, bmi1 and bmi2), and the latent variables, which is a set of A, C and E variables for each twin .

ACEModel <- mxModel("ACE", # Twin ACE Model -- Path Specification
          type="RAM",
    manifestVars=selVars,
    latentVars=aceVars,

 The first mxPath statement is for two-headed arrows going from aceVars, which are fixed (free= FALSE) with start value of 1. These indicate that the variance of the latent variables will be fixed at one.
  
We will model means as well as covariances. In the path diagram, means are specified by a triangle which as a fixed value of one, reflected in the 'from="one"' argument. The 'to=' argument refers to the variable whose mean is estimated. You should be able to deduce from the path specification (shown in compact form here) that we are fixing the mean value for latent variables at zero.
    mxPath(from="one",to=aceVars, arrows=1, free=FALSE, values=0 )

Take a look at the next path statement:
    mxPath(from="one", to=selVars, arrows=1, free=TRUE, values=20, labels= "mean" )

This indicates that we want to estimate means because free = TRUE. Why do you think the start values have been specified as 20?  You should get a clue from looking at the column means for bmi1 and bmi2.

Next we start specifying paths from latent to manifest variables:
    mxPath(  from=c("A1","C1","E1"),   to="bmi1",   arrows=1,   free=TRUE,   values=.6,   label=c("a","c","e")     )
    mxPath(  from=c("A2","C2","E2"),   to="bmi2",   arrows=1,   free=TRUE,   values=.6,   label=c("a","c","e")     )

Note that the same labels are used for paths from latent variables to bmi1 and to bmi2. This means that we equate the paths, fulfilling an assumption of the ACE model, which is that the size of genetic and environmental effects will be the same for twin 1 and twin 2 in a pair.

Next we fix the path between C1 and C2 to one, because, by definition , shared environmental influences are the same for twin 1 and twin 2.
    mxPath( from="C1",  to="C2",  arrows=2,  free=FALSE,  values=1  )
The last sections of path specification involve creating zygosity-specific models to handle the paths that differ for MZ and DZ twins, as well as data specification for these two groups:

mzModel <- mxModel(ACEModel, name="MZ",
          mxPath( from="A1",  to="A2",  arrows=2,  free=FALSE,  values=1),
          mxData( observed=mzfData,  type="raw"))
dzModel <- mxModel(ACEModel, name="DZ",   
 mxPath( from="A1",  to="A2",  arrows=2,  free=FALSE,  values=.5),   
mxData( observed=dzfData,  type="raw"    ))

Compare the two specifications. You should be able to relate these to the figure depicting the ACE model shown earlier.

The next section of code creates 'twinACEModel', which includes all three of the previously specified models, and then gives Algebra commands which will perform optimization on the MZ and DZ models simulataneously and compute -2LL.

The command
twinACEFit <- mxRun(twinACEModel)
then runs the model.

Highlight all of the script up to and including the twinACEFit command, and run it.
Type
          summary(twinACEFit)
to see the output.

Note that , because we have equated the paths for twin 1 and twin 2, we get a single estimate of a, c and e. We also equated the means, and so the value given as MZ.M is in fact an estimate for both MZ and DZ twins.

The next few lines of code calculate unstandardized and standardized squared paths; note that the standardized squared paths (a2, c2 and e2) represent the percentage of variance accounted for by that path.

The next sections entitled '#Mx answers hard-coded' and '#Compare OpenMx results to Mx results' can be ignored: they are concerned with comparing output from OpenMx to that obtained with the original version of Mx.

The program then moves on to specify the AE model. You should be able to anticipate how the script will be modified to fix the C path; if unclear, compare the commands for this model with the previous ACE model.

As an exercise, you should now try dropping both A and C terms, to get a pure E model, and compare its likelihood with the ACE and CE models.  Also, you should write a script to compute the likelihood of the saturated model for this dataset. You should be able to do this either using one of the models specified above, or by modifying the script  UnivariateSaturated_PathRaw, which you can find in the demo folder.

Some notes on debugging.
A common frustration for beginners is that you modify an existing script only to find it does not work. Here are a few hints of what to do if this happens:
  • Check first of all that the original script works properly: OpenMx is under development and just because a script is in a manual or the 'demos' folder does not guarantee it is error free. If you get a crash from a script provided with OpenMx, communicate this to the OpenMx community.
  • Second, check that you have the necessary libraries open. OpenMx commands need a require(OpenMx) statement at the start of the script, but this may be omitted, because the script writer assumes you will already have OpenMx active.
  • You need then to study the error messages that you get, which will mean scrolling back through the output on the console.  The error message may, for instance, tell you that a file you are trying to read does not exist. If so, check you have spelt the name correctly. It is possible that the script is looking for the file in the wrong place. With the console window on top, select File/Change Dir from the menu bar, which will allow you to identify, and if necessary reset, the default directory. This is the place where your scripts will default to when reading or writing files.
  • Error messages are often helpful in clarifying what is wrong, but can sometimes be misleading. Always identify the first error message – an early error often causes knock-on effects, and so the later errors may disappear once you fix the primary error.
  • Common reasons for scripts failing are (a) mistyping, including getting lower and upper case letters wrong; (b) missing brackets; (c) missing comma;  (d) missing or misplaced quotes; (e) using regular brackets where square brackets are required, i.e. to refer to parts of matrices. Note too that if you paste a script from a Word or .pdf document, then it may give 'smart quotes' (i.e. angled quotes) which can cause an error.
  • It can be helpful to go through the script section by section, pasting each section into the console to check it. You can type names of variables as you go to see if they are correctly computed. Because an mxModel does not run until you give a mxRun command, you can't evaluate results of the model this way, but you can still type in each section of the model (omitting the comma, after the final close bracket), and this will alert you to typographical errors.
Brackets can be particularly confusing, and you may need to resort to printing out the script and marking in pen each open/close bracket pairing. A quick check of whether you have the same number of open and close brackets can be done by pasting your script into Word, and doing a search and replace, first for '(' and then for ')'. Use the same symbol for the search and the replace term, so the text is unchanged, but you get a count of the number of replacements for each bracket type.

A related point to note is that because R preserves variables in your workspace, even after you close and re-open the program, you may have an error in your script that you are unaware of. Suppose, for instance, your script does not define 'mylabels', but a previous script that you ran did so. You won't be aware of the problem until you clear the workspace. For this reason, it is a good idea to clear the workspace before testing a new script, with
rm(list = ls(all = TRUE))

It is also possible to get confused between scripts if you have several open at the same time, and to find that you have been modifying the wrong script. Two lessons follow from this. If you are planning to modify a script, always save it under a new name before making any changes. Second, only have open those scripts you need to work on.

Persistence is an essential requirement for anyone attempting to learn OpenMx. It is highly logical and powerful, but also very complex. Take courage from the fact that it is only by making lots of mistakes, and learning to recognise how to avoid them, that you will develop expertise. My strongest advice is to play around with OpenMx - get a script, read the manuals, and try changing things to see their effects. Simulate datasets with particular features and check whether you get the results you anticipate. If you want to do something in R, try Googling it - there are enormous resources out there. This way you will gain a deep understanding of what you are doing, rather than just running scripts blindly.

No comments:

Post a Comment