|Figure 7: Standard ACE model: DZ twins|
Each latent factor has its associated path, and it is the goal of a behavior genetic analysis to estimate these paths. Note that the same letters are used for paths from latent factors to twin 1 and twin 2, indicating that in our model we assume no differences in the size of genetic and environmental influences for the two twins. Provided twins are assigned as twin 1 and twin 2 at random, this is a valid assumption.
Note that the path between the two latent C factors (common environment) is set to 1, as this is, by definition, identical for twin 1 and twin 2. The path between the latent A factors is set to .5, reflecting the fact that DZ twins share 50% of their segregating alleles on average.
Path-tracing rules can be applied to estimate the expected covariance between twin 1 and 2:
CovDZ = .5 * a2 + c2
Also, the variance for scores of either set of twins can be estimated:
VarDZ = a2 + c2+ e2
For MZ twins, the model is virtually identical, except that the correlation between the A terms is set to 1.0, because it is assumed that MZ twins have the same alleles for all genes. Therefore, the estimates of variance for MZ twins is the same as for DZ twins, but the estimates of covariance between twin 1 and twin 2 is now:
CovMZ = a2 + c2
If we used standardized variables, then the total variance for each twin will be one, and a2 will be a direct measure of the proportion of variance that can be accounted for in terms of additive genetic influences, i.e. heritability. The formulae from path-tracing indicate that we can estimate the size of a2, c2 and e2 just from a knowledge of the correlations between twin 1 and twin 2 in DZ and MZ pairs. Because correlations are standardized, we will get the estimates in standard units representing proportion of variance accounted for.
Thus simple algebra allows us to estimate:
(CorMZ-CorDZ) = .5 * a2
a2 = 2 *(CorMZ-CorDZ)
c2 = CorMZ - a2
e2 = 1 - a2 - c2
You might start to wonder why we should do complicated model-fitting when it is easy to get these estimates just from a couple of correlations! There are three reasons. First, model-fitting provides a more stringent test, because it will be sensitive to any violations of the assumptions of the model. Model-fitting usually estimates covariances rather than correlations, so that we can confirm, for instance, that the variances are comparable for MZ and DZ twin pairs. The model assumes they are the same, and so the fit of the model will be reduced if this assumption is not met. Second, with model-fitting we can not only estimate the paths, but also obtain confidence intervals around our estimates, and compare the fit of nested models, as will be illustrated more fully below. Third, with a model-fitting approach, we can easily scale up models to account for multivariate data, as well as formulating more complex models that include, for instance, other variables, measured environmental influences, or specific allelic effects.
As we shall see, there are two approaches to modeling twin data in OpenMx. The first uses a matrix algebra representation; the second uses the RAM path method that was illustrated above. Both will be illustrated here and readers can decide which approach they prefer. For beginners, it can also be useful to specify a model using both methods to check that they give the same results. Note that the omxGraphviz command for creating a diagram works only with RAM type models.
For the next example, we will use simulated data on twins. Instructions for how to read in your own data will be given after we have gone through the ACE model, so you can try it out with real life data.
The following script simulates data on 1000 pairs of MZ and DZ twins, assuming a2 of .5, c2 of .3 and e2 of .2.
Notice that although we have no missing data in our simulated file, the commands above are specified to allow for missing data. Create a copy of myDataDZ with the command:
and in the editor, type NA over one of the entries in each column. NA is the R code for a missing value. Now type:
You will see that the results just give NA values. However, the missing values are ignored if you type:
We start by fitting a saturated model, estimating means, variances and covariances separately for each group of twins (twin 1 vs twin 2) and by zygosity (MZ vs DZ pairs), so we can get a baseline likelihood estimate against which to compare a genetic model . The saturated model will have two matrices for expected means (one each for MZs and DZs), and two for the expected covariances, generated from multiplying a lower triangular matrix with its transpose (i.e. the Cholesky decomposition). The raw data to be used are specified using the mxData command. The function for optimization is mxFIMLObjective applied. We start with the submodel for MZ twins, as follows:
You may encounter scripts that use a more compact form, as illustrated below for the DZ group. Note that this script is formally identical to that for MZ twins, except for the 'name' specifications. If the arguments to the OpenMx command are given in the default order ( see help(mxMatrix) to go to the help/reference page which lists the arguments) then it is not necessary to include the name of the argument. The script has skipped a few optional arguments, so the argument "name=" has to be explicitly provided to refer to the last argument. In general, we would not recommend this compact format, as it is hard for other users to understand. The formatting used for the MZ group above, with soft tabs and each argument on a separate line is much clearer.
The MZ and DZ models are now combined in a 'super' model, which includes them as arguments. To obtain the overall likelihood of data for MZ and DZ twins, we simply add together the log likelihoods for MZ and DZ submodels. The super-model includes an mxAlgebra statement to do this. This is given the name 'twin'.
Before we fit the ACE model, we will fit the 'saturated' model, which ignores the relationships between twin pairs. In the script below, the models for MZ and DZ pairs are identical. This acts as a baseline against which the ACE model can be assessed. It has no latent variables, just observed means and covariances.
- Can you draw a path diagram of the saturated model? (NB we can't generate a diagram with omxGraphviz because we are not using a RAM model).
- What are the ten estimated parameters?
- Can you work out how you would compute the estimated DZ covariance matrix from the three DZ.CholDZ values given in rows 8-10 of the output, just using R commands?
- Do you understand why chi square is designated as NA, rather than given a numerical value?
Figure 8 shows you the path diagram for the saturated model. The four paths from 'one' to the four boxes correspond to the four estimated means, the looped two-headed arrows attached to each box correspond to variances, and the two-headed arrows linking twins of the same zygosity correspond to covariances. Note that there is one arrow for each observed covariance and variance, so the degrees of freedom are zero. There are no latent variables, just observed variables (in boxes) and a constant (in triangle).
|Figure 8: Path Diagram of Saturated Model|
Before we move on to fit the ACE model to the same data, we will test some of the assumptions of the twin model, i.e. that the means and variances are the same for twin 1 and twin 2, and that they are the same for MZ and DZ twins.
A key feature of OpenMx is that if two estimated parameters have the same label, then they will be treated as identical. To demonstrate this point, return to the previous script, and insert the following line in the sections where the matrices for expMeanMZ and expMeanDZ are defined, immediately above the 'name' statement for each matrix: