Sunday 28 August 2011

Using OpenMx with your own data


There is a good guide to importing and exporting data into R on:

If your data is in SPSS or Excel format, you should be able to save it as comma-separated text with a .csv extension. If possible, save the variable names in the first row.
A csv file can be read with a command such as:
          mydata=read.csv("fromspss.csv")
This command should automatically read blank entries in numeric columns as NA, and will take the column names from the first row. This is the simplest method for importing data.
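If you want to check how blanks and missing-value codes are handled before reading your real file, a small test along these lines may help. This is only a sketch: the column names (id, zyg, bmi1, bmi2), the values and the missing code 999 are made up for illustration. The na.strings argument is a standard option of read.csv.

```r
# Write a tiny made-up CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,zyg,bmi1,bmi2",
             "1,1,21.5,22.0",
             "2,1,,23.1",       # blank entry for bmi1
             "3,2,999,24.8"),   # 999 used as a missing-value code (assumed)
           tmp)

# na.strings lists the text values that should be read as NA;
# here blanks and the assumed code 999 are both treated as missing
demo <- read.csv(tmp, na.strings = c("NA", "", "999"))

print(demo)
colSums(is.na(demo))  # number of missing values in each column
```

Be careful with numeric codes in na.strings: any column that legitimately contains the value 999 would have it turned into NA as well.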

If your file is very large, you may find that some rows are omitted by read.csv. You should probably consider just deleting unwanted variables from the file and then resaving in .csv format. An alternative is to save in .dat format (tab-separated text), and read it using read.table. The downside of this is that, with the default settings of read.table (where any run of white space counts as a single separator), blank entries will not be treated as missing, and so you need to do some extra processing to avoid a file read error.

There are probably better ways of doing this, but this is what worked for me!
Open the .dat file in an application such as Word so you can do search and replace for missing values, and put NA instead.
How you do this depends on how missing data are represented in your file. You may have used code numbers such as 999, or the missing data may be represented by blanks.
If you are uncertain, it is a good idea to make a little demonstration file by copying a small section of your real data, saving it as a .dat file, and then opening in Word. If you then select the show/hide paragraph marks option on the menu bar, you will be able to see tabs (as arrows), spaces (as dots) and paragraph markers.
For example, we have a dataset saved in SPSS, where missing data are represented as a blank field. When this is exported as a .dat file, the blank field is translated to a space. To change to NA, we therefore need to do a search and replace, where the search term is two tabs separated by a blank (indicated by ^t ^t), which is replaced by two tabs separated by NA (^tNA^t).  Run on the whole file, and then repeat this operation, because adjacent blanks can be missed otherwise.
To allow for the possibility that the blank entry could be in the first or last column, you need to also do a search with:
          search term: ^t ^p,   replace with ^tNA^p
          search term: ^p ^t,   replace with ^pNA^t
(in each case there is a blank between the tab and paragraph codes)

You should then be able to open your .dat file in R, with a command such as:
mydata=read.table('fromSPSS.dat', header = TRUE)
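As a possible alternative to the search and replace in Word, you can tell read.table that the fields are separated by tabs, so that an empty or blank field keeps its place in the row. The sketch below is not the method described above, just one option that may work for your file. It uses a made-up file, and the column names and values are assumptions for illustration. sep, na.strings and strip.white are standard read.table arguments.

```r
# Write a tiny made-up tab-separated file, with a single space
# standing in for each missing value (as in the SPSS export described above)
tmp <- tempfile(fileext = ".dat")
writeLines(c("id\tzyg\tbmi1\tbmi2",
             "1\t1\t21.5\t22.0",
             "2\t1\t \t23.1",    # blank in a middle column
             "3\t2\t24.0\t "),   # blank in the last column
           tmp)

# sep = "\t" splits on tabs only, so blank fields are not skipped;
# strip.white = TRUE trims the lone space, leaving an empty field,
# and na.strings tells R to read empty fields as NA
demo <- read.table(tmp, header = TRUE, sep = "\t",
                   na.strings = c("NA", ""), strip.white = TRUE)

print(demo)
colSums(is.na(demo))  # number of missing values in each column
```

It is still worth comparing the number of rows and a few values against the original file, because how missing data are exported can differ between applications.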

If the column names have read in correctly, you should be able to see them if you type:
colnames(mydata)
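A few other built-in commands can help confirm that the file has read in as expected. In this sketch a small made-up data frame stands in for mydata, so the values are purely illustrative.

```r
# Toy stand-in for the imported data (made-up values)
mydata <- data.frame(zyg  = c(1, 1, 2),
                     bmi1 = c(21.5, NA, 24.0),
                     bmi2 = c(22.0, 23.1, NA))

colnames(mydata)         # column names
dim(mydata)              # number of rows and columns
str(mydata)              # type of each column (numeric, character, ...)
colSums(is.na(mydata))   # count of missing values per column
```

A column that you expected to be numeric but that shows up as character in str is often a sign that a missing-value code or stray text was not converted to NA.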

If you have a very large file, then it may be better to pre-process it in another application. However, it can help avoid errors if you minimize the number of steps involved in copying data between applications. Furthermore, R can be used for data processing, e.g. standardization of variables. If you do read in a large file, you can then select the rows and columns that you want in an initial step, and then remove the original data frame from the workspace to free up memory, e.g.:

mymzData <- as.matrix(subset(mydata, zyg==1, c(bmi1,bmi2)))
mydzData <- as.matrix(subset(mydata, zyg==2, c(bmi1,bmi2)))

Note the double equals sign for specifying the zygosity group you want. This in effect tells R you want to select cases where it is true that zyg = 1. Note also that the column names at the end of each statement are not in quotes.
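To try these steps out without your real data, here is a small self-contained sketch. The data frame is made up, and the new column name bmi1z is an assumption for illustration. scale, subset and rm are standard R functions. Whether and how to standardize your variables is a modelling decision, so the scale step is only there to show the mechanics.

```r
# Toy stand-in for a large imported data frame (made-up values)
mydata <- data.frame(zyg  = c(1, 1, 2, 2, 1),
                     bmi1 = c(21.5, 23.0, 24.0, 19.8, 22.2),
                     bmi2 = c(22.0, 23.1, 25.2, 20.5, 21.9))

# Standardize one variable to mean 0 and SD 1 (illustration only)
mydata$bmi1z <- as.numeric(scale(mydata$bmi1))
zcheck <- c(mean = mean(mydata$bmi1z), sd = sd(mydata$bmi1z))

# Select rows by zygosity and keep only the columns needed
mymzData <- as.matrix(subset(mydata, zyg == 1, c(bmi1, bmi2)))
mydzData <- as.matrix(subset(mydata, zyg == 2, c(bmi1, bmi2)))

# Remove the full data frame from the workspace to free up memory
rm(mydata)

dim(mymzData)  # rows and columns in the MZ matrix
dim(mydzData)  # rows and columns in the DZ matrix
```

Only remove the full data frame once you are sure you have extracted everything you need from it, as you would otherwise have to read the file in again.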
