More Organs → More Human: May 2006

As promised, here's another entry in my ongoing quest to document all of the stupid stuff that seems like it ought to be really easy but turns out to be kind of tricky. Today, we're going to learn how to load SPSS data into R, and then how to do cross-tabulations on that data. In SPSS, this is insanely easy. It turns out not to be all that bad in R, but the whole thing depends on a particularly poorly-documented aspect of the program. We'll start with the easy part: the data import.

Before we get any further, please note that Blogger has decided to screw up my formatting. There's way more whitespace than I'd wanted, and I can't seem to get rid of it. Grrr. Anyway, let's move forward.

Consider an SPSS file called "cvd.sav". It contains stratified data representing aggregated cardiovascular disease figures for some population. The data set contains five variables:

A dichotomous gender variable ("male"/"female")

A categorical "age group" variable with five levels ("25-34","35-44", etc. up to "65-74")

A dichotomous variable representing blood factor IX levels ("low"/"high")

A dichotomous outcome variable ("CVD"/"No CVD")

A "count" variable

Note that the data are aggregated. If we want to do anything with them in SPSS, we'll have to use the "Weight Cases" function, and weight by "count". We can read this file into R quite easily, using the "read.spss" function:



> library(foreign)
> cvd <- read.spss("~/cvd.sav", to.data.frame=TRUE)

Note that if we omit the "to.data.frame" argument, our variable will contain an object, which will make the next step a bit trickier. So, now we have a variable called "cvd". Let's see what it looks like:



> cvd
   GENDER AGEGRP IXSTATUS OUTCOME COUNT
1  Male    25-34     Low   CVD        6
2  Male    25-34     Low   No CVD    75
3  Male    25-34     High  CVD        4
4  Male    25-34     High  No CVD    20
5  Male    35-44     Low   CVD       12
......................
34 Female  55-64     Low   No CVD    37
35 Female  55-64     High  CVD       39
36 Female  55-64     High  No CVD    59
37 Female  65-74     Low   CVD       27
38 Female  65-74     Low   No CVD    24
39 Female  65-74     High  CVD       57
40 Female  65-74     High  No CVD    62

>

Now that we've got the data, let's do something useful with it. We'll start with a quick crosstabulation of, say, gender vs cvd. What we'd ideally like to end up with is a 2x2 table looking like this:

	CVD	No CVD	Total
Male	a	b	n₁
Female	c	d	n₂
	m₁	m₂	sum

Like I said earlier, to get this in SPSS, we'd simply use the "Crosstabs" function. In R, we need to use the "xtabs" function. This function is somewhat tricky in that it takes a "formula" as one of its arguments. It took me quite a while to figure out what to do with this, since the documentation isn't really as straightforward as it should be. Eventually, however, I figured it out. To get our table, we enter the following command:



> xtabs(count ~ gender + outcome, cvd)
        outcome
gender   CVD No CVD
  Female 260 524   
  Male   233 477   
>

See what happened there? Our "formula" should look familiar if you've ever entered a linear model in SPSS using the raw syntax format, and I think SAS's format is pretty similar. Essentially, you just put the "count" column of your data frame on the left hand side of the "~", and then put whichever columns you want to crosstabulate by on the right-hand side, separated by "+" signs. The order that they appear definitely affects how the table comes out, so it's worth experimenting a bit to see what you get.

What we're really doing here is creating a "Formula" object, and once you know what to look for in the docs, they're actually pretty helpful (try "help(formula)"). The formula object is also found in the linear and regression modeling equations, so they're handy to learn how to use.

So, that gets us a simple 2x2 table. What if we want to stratify further, and compare the counts by age group? We can simply tack on one more variable to the end of our formula, like so:



> xtabs(count ~ gender + outcome + agegrp, cvd)
, , agegrp = 25-34

        outcome
gender   CVD No CVD
  Female  29 101   
  Male    10  95   

, , agegrp = 35-44

        outcome
gender   CVD No CVD
  Female  32 121   
  Male    27 112   

, , agegrp = 45-54

        outcome
gender   CVD No CVD
  Female  57 120   
  Male    53  92   

, , agegrp = 55-64

        outcome
gender   CVD No CVD
  Female  58  96   
  Male    71  93   

, , agegrp = 65-74

        outcome
gender   CVD No CVD
  Female  84  86   
  Male    72  85   

>

Note that if we change the order of our formula arguments, we can get something rather different:



> xtabs(count ~ agegrp + gender + outcome, cvd)
, , outcome = CVD

       gender
agegrp  Female Male
  25-34  29     10 
  35-44  32     27 
  45-54  57     53 
  55-64  58     71 
  65-74  84     72 

, , outcome = No CVD

       gender
agegrp  Female Male
  25-34 101     95 
  35-44 121    112 
  45-54 120     92 
  55-64  96     93 
  65-74  86     85 

>

Possibly useful for other reasons (some sort of Cochran test of trend, perhaps), but not what we were after here. I heartily suggest experimentation: if the xtabs function isn't giving you what you want, it might be a simple matter of re-arranging the terms in your function.

Of course, as with everything else in R, we can save the output of xtabs() to a variable, which we can then use for hypothesis testing or other analysis. The return value from xtabs() is essentially a special type of matrix, so we can do all of the usual fun stuff one can usually do with matrices. Also, many functions (prop.test, chisq.test, etc.) will take a matrix as direct input.

There's lots more you can do with xtabs in R, but this should be enough to get you started. I had to screw around with it for a couple of afternoons before I finally figured out what was going on; hopefully, someday this'll help keep somebody else— or myself, once I forget how to do all of this— from banging their heads against the wall for as long.

More Organs → More Human

About Me

Monday, May 22, 2006

Java Question

Tuesday, May 09, 2006

R Clinic: xtabs()