Sunday, 28 August 2011

Introduction to OpenMx for twin analysis

Preamble
Twin data can be used to estimate the relative contribution of genetic and environmental influences to quantitative traits. For many years, the key textbook on methods has been Neale & Cardon (1992): Methodology for genetic studies of twins and families. This goes hand in hand with annual workshops on Twin Methodology hosted by the Institute of Behavioral Genetics at Boulder, CO. When I first became interested in this topic, the preferred software for running twin analyses was Mx, which was freely available from the University of Virginia. Scripts for running different types of analysis were available from the Mx website, and also from a helpful site at the Free University of Amsterdam. Nevertheless, even with these tools, twin analysis was technical, complicated, and often frustrating. Mx had a tendency to crash unexpectedly without giving you much help in understanding why. Its idiosyncratic error messages were at first amusing (e.g., "Well there I was, all ready to equate all the matrices in this group to those of a previous group, and then you didn’t put *which* group on the same line") , but after a bit, you wanted to hurl your computer through the nearest window, as you were admonished for doing something completely incomprehensible, e.g. "Your observed covariance matrix is not positive-definite".
Around 2009 (I can't find a precise date on the web), the OpenMx project was started, with the goal of creating a new software package, OpenMx, written in the R programming language. As the authors state: "In some ways, Mx is barely recognizeable in OpenMx, since the interface is completely different and the software has been rewritten top to bottom using modern programming techniques and languages. But deep within OpenMx still beats the ancient heart of Mx: a general purpose matrix optimization package."
Last year when I needed to train a couple of my graduate students to do twin analyses, I was ready to embrace OpenMx and turned with alacrity to the OpenMx website. But alas, the age old problem arose: the very very clever people who had written OpenMx had no idea how to communicate with lesser mortals. I already knew a fair bit about twin analysis and Mx, but I struggled with the manuals and examples. And at several points, the general recommendation seemed to be 'Go away and learn R first'. But I knew that if you want to learn a programming language, you only really do so if you have a problem you want to tackle. And so I decided I would write an elementary introduction for my students that did not assume that you knew anything about R, Mx, structural equation modeling or behaviour genetics.  There was, of course, a problem. I didn't know anything myself about R, and my mastery of the other topics was pretty amateurish. But in some ways, an amateur has an advantage for manual-writing: if you've had to work it out for yourself, you know what needs explaining.
Anyhow, having created this introductory document, I felt it was worth sharing in case it could be useful to others. Although I have tried to simplify and explain, this material is not easy to master, especially if you have no background in programming. But for what it is worth, here is my best attempt at explanation of the basics, for those rare creatures who want to analyse twin data and haven't already mastered Mx or OpenMx.
I should add that OpenMx is appropriate for much broader applications of structural equation modeling, but the focus here is just on what you can do with twin data.
This manual assumes you will be working on a PC in a Windows-based environment.  Please note, the manual was written by someone who was teaching themselves both R and OpenMx, and so some of the example scripts are cumbersome.


A note on formatting
I apologise for the erratic and inconsistent formatting in these blogposts. Google's Blogger has many fine features, but control of formatting is not one of them. It has a mind of its own and will decide which font and spacing to use, despite instructions to the contrary. I don't want to spend hours delving in the html version to sort this out, so decided we'd have to live with it.

The scripts
You may find it easier to copy the scripts from a Word document, rather than from this blog, in which case you can download the original document on which this blog is based from this site.

Feedback
Comments are enabled on this blog, and I'd welcome feedback and suggestions for improvement. I will try to incorporate any good ideas, but I can't provide technical support, as (a) I am not very expert myself, and (b) have a busy job that is only occasionally concerned with twin analysis.
I hope somebody out there finds this useful!




Getting started in R

OpenMx is written in a programming language called R, which is free to download.  R is a powerful language for statistical computing, but much of the documentation is written for experts, and so it can be daunting for beginners.  If you go to this website you will see instructions for how to download R.  On the left panel, select CRAN from the Download section. Do not be put off by incomprehensible terms such as "CRAN mirror": just select a download site from the list provided that is geographically close to where you are.
You may then be offered further options that you may not fully understand. Just persevere by selecting the 'Windows' option from the 'Download and install R' section, and then select 'base', which at last takes you to a page with straightforward download instructions, and a link: 'Download R 2.13.1 for Windows', which you should click (the version number of R may be different).This will download an .exe file, and if you double click that, it will start an installer for R, offering the usual options about where to install, etc. I recommend accepting defaults.
Installation of R will create a Start Menu item and an icon for R on your desktop. Double clicking on the R icon starts the program. Whew!

Starting R opens a window called R Console, in which you can type commands.
You will see a > cursor.
This cursor will not be shown in the examples below, but it indicates that the console is awaiting input from you.
At the > cursor, type:
    help.start()
This starts a web-based interface to on-line help. You may want to briefly explore this before going further.  As with other programming languages, you hit Enter at the end of each command.
Just to familiarise yourself with the console, type:
    1+2
R evaluates the expression and you see output:
   [1]  3
The [1] at the beginning of the output line indicates that the answer is the first element of a vector. If you find this puzzling, don't worry about it for now.
Now type:
   x = 1+2
Nothing happens.  But the variable x has been assigned,  and if you now type x on the console, you will again see the output
   [1] 3
In R, the results of variable assignments are not shown automatically, but you can see them at any time by just typing the name of the variable.
The assigned variable x will remain assigned unless you explicitly remove it using the 'rm' command. Type:
   rm(x)
And then type x  again
You now see
  Error: object 'x' not found
You can repeat an earlier command by pressing the up arrow until it reappears.  Use this method to redo the assignment x=1+2, and then type X. Again you get the error message, because R is case-sensitive, and so X and x are different variables. 

Now type:
  y = c (1, 3, 6, 7) 
and then inspect the variable y (by typing y at the console).
You will see that it is a vector of numbers [1 3 6 7].  The 'c' in the previous command is not a variable name, but rather denotes the operation of concatenation. It just instructs R to create a variable consisting of the sequence of material that follows in brackets.
Now type:
 x =
 and hit Enter.
The cursor changs to +
This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter, your regular cursor returns, because the command is completed.
It can happen that you start typing a command and think better of it. To escape from an incomplete command, and restore the > cursor, just hit Escape.
Also, if a program is running and you want to stop it, you can escape with Ctrl+c.

You have been working so far in the Console window. This is used for online computations and to hold results of program output. 
Now select New Script from the File option on the menu bar. A new window opens titled R editor. This is used to hold a script, i.e. a list of R commands that don't execute until you instruct R to run them.
If you type in the Editor
  1+2
  3+4
  5+6
and then select Run All from the Edit option on the Menu Bar, the script is executed, and you see the results of the computations in the Console window.

You can write or edit scripts directly in the Editor, but you may find it easier to use a word processor to do this and then paste the script into the Editor, as this allows for easier editing.
A note on quotes:  If you paste a script into your R console or browser,  quotes may get reformatted, causing an error.  Always check: for R, single quotes should be straight quotes, not 'smart quotes' (i.e. quotes that slant or curl in a different direction at the start and end of a quoted section). You may need to retype them if your system has reformatted them.

Important: Traditionally, R scripts use <- instead of =. The arrow is made by typing a pointed bracket and then a dash.
So, you will see instances of scripts which have commands such as
a <- 1+3
This is equivalent to
 a = 1+3
It is also possible to have the arrow going the other way , i.e., 1+3 -> a, which means the same thing.
My preference is to use '=', as it involves just one keystroke, but you will find both methods for denoting an assignment in this manual, so you become familiar with interchanging them.


Loops: A loop is a way of repeatedly executing the same code.  Suppose we wanted to print out the ten times table, we could type 1*10; 2*10, 3*10, and so on. But a simpler method is to use a loop, where we multiply 10 times a variable, myx, and specify the range of values that myx will take at the start of the loop.   
Try typing this list of commands in the editor, and then Run all:
                for (myx in c(1:10))
                          {
                                print(10*myx)
                                } 
The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The program executes all the commands between curly brackets repeatedly, incrementing the value of myx each time it does so, until it gets to the final value, whereupon it exits the loop.

Commenting: A good script will contain many lines preceded by #.
This indicates that the line is a comment – it does not contain commands to be executed, but provides explanation of how the script works.  The demonstration scripts in this manual are heavily commented so that you can understand how they work.
Before you start running the scripts in this manual, create a project directory that will contain all of your scripts, data, and workspace for a project.  Then go to the menu and select Change dir from the File menu, and navigate to your new directory.  This means that all your work will be saved in one place.  Whenever you start up R from a file in that directory, it will continue as your working directory.

As you work through this manual you will become more familiar with R, but it is likely that you will also want to extend your knowledge beyond the very basic details given here.
There is a good introductory manual that can be downloaded from the web-page that you saw when you typed help.start(), and a short introduction on:
http://openmx.psyc.virginia.edu/sites/default/files/TW-IntroToR-20100210_0.R

One last point I will put at the end of this post, so you can find it again easily, because it is a crazy but important feature of R. Other programming languages have a command you can use to clear all variables easily - such as clear(all). It is important to do that sometimes, because otherwise your variables hang around and can cause confusion, especially if you run scripts that use the same variable names. In R, you can clear all variables, but not in an intuitive way. You have to type:
   rm(list=ls(all=TRUE))
I can never remember this, and have to look it up every time. It goes to show that the guys who developed R may be very smart, but they certainly aren't psychologists.

Finally, the following books are recommended:
Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R. Cambridge: Cambridge University Press.
Crawley, M. J. (2007). The R Book. Chichester, UK: Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S, 4th edition. New York: Springer. (Do not be put off by the title: it really should be entitled 'with S and R')

PS 3rd September 2011
I've just download RStudio. It is free and it looks very promising as an interface to make R much easier to work with.