More Organs → More Human

Stupid things I've figured out so that you don't have to.



Tuesday, April 11, 2006

For the record...

I would like it noted that I just wrote the following sentence for my statistics homework:

For simplicity's sake, I used R to calculate...


I could be wrong here, but I think this may be the first time in recorded history that R has ever turned out to be the simplest way to do something. Usually, when people talk about R, they use phrases like "not as bad as I'd heard it would be", or "after a few hours of screwing around with R, I finally...", or "I gave up on R because...".

For the uninitiated, R is an open-source statistical computing program, in the same general family of application as Matlab. It descends from S, a statistics language that dates back to the 1970s, so its syntax is kind of... odd. Once you figure it out, it's not so bad, but getting to that point can take an awfully long time, even for programmers or people who are familiar with programs like Matlab. The upside is that it can do basically any kind of math you'll ever need, is free, and is easy-ish to integrate with other programs.

Until today, I'd never found anything I needed that it could do faster or more easily than SPSS. It turns out, though, that getting SPSS to deal with, say, a 2x2 table when you've already got the data in aggregate form is, while entirely possible, pretty dang unintuitive. So unintuitive, in fact, that it was actually easier to get R to do what I needed than it was to figure out (from the SPSS documentation) how to get SPSS to do it. In this case, I needed to do what may well be the simplest statistical task involving a 2x2 table: a normal (z) test of the equality of two proportions. In R, it was about two lines of input. In SPSS, it's a pretty convoluted process involving dummy variables, case weighting, etc.

Part of this has to do with the way that the two programs accept data input. R assumes that you'll be dumping data directly in from some sort of input file (or pipe, or whatever). As a result, its manual interface for entering data is ABSOLUTELY HORRID, and this is what usually trips new users up. The "tutorials" that come with R spend a lot of time talking about some pretty abstract aspects of its data model, and when they finally get around to showing you how to type your data in, it looks way more complex and tedious than it would be in practice. 2x2 tables, however, are one of the few cases where the usually annoying manual input process is exactly what is needed. R's proportion-comparison functions (prop.test, for one) take their input in the form of two vectors: one for the "successes", and one for the total "tries" (recall that the statistical basis for proportion comparison is usually derived from the binomial distribution). This means that, if you already have the aggregated counts, you can just type the darn numbers more or less straight into R, and it will politely give you your results.
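To make the "two vectors" business concrete, here's a minimal sketch of the arithmetic behind a normal test of two proportions. It's plain Python rather than R, so none of this is R's actual code: the function name `two_proportion_z_test` and the counts are made up for illustration, and I'm assuming the pooled, two-sided version of the test with no continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes, tries):
    """Pooled, two-sided z-test for the equality of two proportions.

    `successes` and `tries` are two-element sequences, mirroring the
    "one vector of successes, one vector of tries" input described above.
    """
    x1, x2 = successes
    n1, n2 = tries
    p1, p2 = x1 / n1, x2 / n2

    # Under the null hypothesis both groups share one proportion,
    # so estimate it from the pooled counts.
    pooled = (x1 + x2) / (n1 + n2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

    z = (p1 - p2) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts, purely for illustration.
z, p_value = two_proportion_z_test(successes=[15, 25], tries=[50, 50])
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

In R itself, the analogous call would be something like `prop.test(c(15, 25), c(50, 50))`. By default that applies a continuity correction and reports a chi-squared statistic rather than z, so its numbers won't exactly match this sketch.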

SPSS, on the other hand, assumes that the user will be manually entering their data in an unaggregated way. It basically pretends to be a giant spreadsheet, and does all of the data aggregation and calculation needed for proportion comparisons "behind the scenes". Unfortunately, when your data is already aggregated, there's really no immediately obvious way to get it to do anything useful. The solution actually makes a little bit of sense, but is pretty non-obvious. For the interested reader, here's how to do it:

  1. Open a new data editor.

  2. Make three numeric integer variables. The first one will be a dummy variable for your risk factor, the second will be a dummy variable for your disease, and the third will be a count.

  3. For each square in your 2x2 table, enter a new case. For the square containing the count of subjects with the risk factor and the disease, enter "1" for the risk factor dummy variable, "1" for the disease dummy variable, and the value of that square for the count. For the square containing the count of subjects with the risk factor and without the disease, enter "1", "2", and the contents of the square. Basically, the first dummy variable is the row number, and the second is the column number. Do this for each cell in your 2x2 table, so that you end up with four cases.

  4. Once you've entered your data, go to the "Data" menu and select "Weight Cases". Tell it to weight cases by your "count" variable.

  5. Now, you can go to "Analyze->Descriptive Statistics->Crosstabs" and proceed as you would if your data were entered normally.



This method should work for arbitrarily sized tables, but I know that many of the analyses that SPSS performs will only really work with a 2x2 table.
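For the curious, the weighting trick from the steps above can be sketched in a few lines. This is plain Python, not SPSS syntax, and it only imitates what I assume the case weighting amounts to: each (row, column, count) case stands in for "count" identical subjects. The names `weighted_crosstab` and `pearson_chi_square` are hypothetical, and the counts are made up.

```python
from collections import defaultdict

# One case per cell of a made-up 2x2 table:
# (risk factor dummy, disease dummy, count), with 1 = yes and 2 = no.
cases = [
    (1, 1, 15),  # risk factor, disease
    (1, 2, 35),  # risk factor, no disease
    (2, 1, 25),  # no risk factor, disease
    (2, 2, 25),  # no risk factor, no disease
]


def weighted_crosstab(cases):
    """Rebuild the table, letting each case count `count` times."""
    table = defaultdict(int)
    for row, col, count in cases:
        table[(row, col)] += count
    return dict(table)


def pearson_chi_square(table):
    """Uncorrected Pearson chi-square statistic for a table of counts."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    total = sum(table.values())
    row_totals = {r: sum(table.get((r, c), 0) for c in cols) for r in rows}
    col_totals = {c: sum(table.get((r, c), 0) for r in rows) for c in cols}

    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Expected count if rows and columns were independent.
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (table.get((r, c), 0) - expected) ** 2 / expected
    return chi2


table = weighted_crosstab(cases)
chi2 = pearson_chi_square(table)
print(table)
print(f"chi-square = {chi2:.4f}")
```

Nothing in the sketch is specific to 2x2 tables: a bigger table just means more (row, column, count) cases. For a 2x2 table, the uncorrected chi-square statistic is the square of the pooled two-proportion z statistic, which is why the two approaches agree.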

You can clearly see that, in this particular case, R was totally the easier way to go. I don't want to go spreading false R-hope around, though. For most people, most of the time, if you've got access to SPSS or SAS and know how to use them, R is probably not a good first choice for your statistical computing. If, however, you find yourself needing something a bit burlier than Excel, and don't have anything else lying around, it's definitely worth a shot.
