More Organs → More Human

Stupid things I've figured out so that you don't have to.


Site Feed

Monday, May 22, 2006

Java Question

Okay, is there anybody out there who can tell me what the heck is up with Java's Dimension object? Its accessors and mutators give and take double-precision numbers, which makes sense. What doesn't make sense is its constructor— which only takes integers! WTF?

Tuesday, May 09, 2006

R Clinic: xtabs()

As promised, here's another entry in my ongoing quest to document all of the stupid stuff that seems like it ought to be really easy but turns out to be kind of tricky. Today, we're going to learn how to load SPSS data into R, and then how to do cross-tabulations on that data. In SPSS, this is insanely easy. It turns out not to be all that bad in R, but the whole thing depends on a particularly poorly-documented aspect of the program. We'll start with the easy part: the data import.

Before we get any further, please note that Blogger has decided to screw up my formatting. There's way more whitespace than I'd wanted, and I can't seem to get rid of it. Grrr. Anyway, let's move forward.

Consider an SPSS file called "cvd.sav". It contains stratified data representing aggregated cardiovascular disease figures for some population. The data set contains five variables:

  1. A dichotomous gender variable ("male"/"female")

  2. A categorical "age group" variable with five levels ("25-34","35-44", etc. up to "65-74")

  3. A dichotomous variable representing blood factor IX levels ("low"/"high")

  4. A dichotomous outcome variable ("CVD"/"No CVD")

  5. A "count" variable



Note that the data are aggregated. If we want to do anything with them in SPSS, we'll have to use the "Weight Cases" function, and weight by "count". We can read this file into R quite easily, using the "read.spss" function:



> library(foreign)
> cvd <- read.spss("~/cvd.sav", to.data.frame=TRUE)



Note that if we omit the "to.data.frame" argument, our variable will contain an object, which will make the next step a bit trickier. So, now we have a variable called "cvd". Let's see what it looks like:



> cvd
GENDER AGEGRP IXSTATUS OUTCOME COUNT
1 Male 25-34 Low CVD 6
2 Male 25-34 Low No CVD 75
3 Male 25-34 High CVD 4
4 Male 25-34 High No CVD 20
5 Male 35-44 Low CVD 12
......................
34 Female 55-64 Low No CVD 37
35 Female 55-64 High CVD 39
36 Female 55-64 High No CVD 59
37 Female 65-74 Low CVD 27
38 Female 65-74 Low No CVD 24
39 Female 65-74 High CVD 57
40 Female 65-74 High No CVD 62

>



Now that we've got the data, let's do something useful with it. We'll start with a quick crosstabulation of, say, gender vs cvd. What we'd ideally like to end up with is a 2x2 table looking like this:


























  CVD No CVD Total
Male a b n1
Female c d n2
  m1 m2 sum


Like I said earlier, to get this in SPSS, we'd simply use the "Crosstabs" function. In R, we need to use the "xtabs" function. This function is somewhat tricky in that it takes a "formula" as one of its arguments. It took me quite a while to figure out what to do with this, since the documentation isn't really as straightforward as it should be. Eventually, however, I figured it out. To get our table, we enter the following command:



> xtabs(count ~ gender + outcome, cvd)
outcome
gender CVD No CVD
Female 260 524
Male 233 477
>



See what happened there? Our "formula" should look familiar if you've ever entered a linear model in SPSS using the raw syntax format, and I think SAS's format is pretty similar. Essentially, you just put the "count" column of your data frame on the left hand side of the "~", and then put whichever columns you want to crosstabulate by on the right-hand side, separated by "+" signs. The order that they appear definitely affects how the table comes out, so it's worth experimenting a bit to see what you get.

What we're really doing here is creating a "Formula" object, and once you know what to look for in the docs, they're actually pretty helpful (try "help(formula)"). The formula object is also found in the linear and regression modeling equations, so they're handy to learn how to use.

So, that gets us a simple 2x2 table. What if we want to stratify further, and compare the counts by age group? We can simply tack on one more variable to the end of our formula, like so:



> xtabs(count ~ gender + outcome + agegrp, cvd)
, , agegrp = 25-34

outcome
gender CVD No CVD
Female 29 101
Male 10 95

, , agegrp = 35-44

outcome
gender CVD No CVD
Female 32 121
Male 27 112

, , agegrp = 45-54

outcome
gender CVD No CVD
Female 57 120
Male 53 92

, , agegrp = 55-64

outcome
gender CVD No CVD
Female 58 96
Male 71 93

, , agegrp = 65-74

outcome
gender CVD No CVD
Female 84 86
Male 72 85

>



Note that if we change the order of our formula arguments, we can get something rather different:



> xtabs(count ~ agegrp + gender + outcome, cvd)
, , outcome = CVD

gender
agegrp Female Male
25-34 29 10
35-44 32 27
45-54 57 53
55-64 58 71
65-74 84 72

, , outcome = No CVD

gender
agegrp Female Male
25-34 101 95
35-44 121 112
45-54 120 92
55-64 96 93
65-74 86 85

>



Possibly useful for other reasons (some sort of Cochran test of trend, perhaps), but not what we were after here. I heartily suggest experimentation: if the xtabs function isn't giving you what you want, it might be a simple matter of re-arranging the terms in your function.

Of course, as with everything else in R, we can save the output of xtabs() to a variable, which we can then use for hypothesis testing or other analysis. The return value from xtabs() is essentially a special type of matrix, so we can do all of the usual fun stuff one can usually do with matrices. Also, many functions (prop.test, chisq.test, etc.) will take a matrix as direct input.

There's lots more you can do with xtabs in R, but this should be enough to get you started. I had to screw around with it for a couple of afternoons before I finally figured out what was going on; hopefully, someday this'll help keep somebody else— or myself, once I forget how to do all of this— from banging their heads against the wall for as long.