Tuesday, September 16, 2014

R package to convert statistical analysis objects to tidy data frames

I talked a little bit about tidy data in my recent post about dplyr, but you should really go check out Hadley's paper on the subject.
R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren't always tidy. The reshape2, tidyr, and dplyr packages are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning it in a tidy data frame format.
(From the documentation): if you fit a linear model to the built-in mtcars dataset and view the object directly, this is what you'd see:
lmfit = lm(mpg ~ wt, mtcars)
lmfit
Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344
summary(lmfit)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
If you're just trying to read it, this is good enough, but if you're doing other follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quickly.
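For example, getting the p-value for the wt term out of the summary object means indexing into the coefficient matrix by position or by name, something like this:
coefs <- summary(lmfit)$coefficients   # a matrix, not a data frame
coefs[2, 4]                 # p-value for wt, by position -- brittle
coefs["wt", "Pr(>|t|)"]     # by name -- better, but still clumsy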
But running the tidy() function from the broom package on the fit object gives you what you were probably looking for, as a tidy data frame:
tidy(lmfit)
         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
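Because the result is a regular data frame, it drops straight into the rest of the tidy toolchain. As a quick sketch (using the stderror column shown above; newer broom versions may name it slightly differently), you could build a coefficient plot with ggplot2:
library(ggplot2)
td <- tidy(lmfit)
# point estimates with +/- 2 standard error bars
ggplot(td, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - 2 * stderror,
                    ymax = estimate + 2 * stderror), width = 0.1)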
The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().
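For instance (a minimal sketch; exact column names vary a bit across broom versions), tidying a t-test collapses the usual printed report into a one-row data frame with the estimates, test statistic, p-value, and confidence interval as columns:
tt <- t.test(mpg ~ am, data = mtcars)   # two-sample t-test: mpg by transmission
tidy(tt)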
View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.

Thursday, September 11, 2014

UVA / Charlottesville R Meetup

TL;DR? We started an R Users group, awesome community, huge turnout at first meeting, lots of potential.

---

I've sat through many hours of meetings where faculty lament the fact that their trainees (and the faculty themselves!) are woefully ill-prepared for our brave new world of computing- and data-intensive science. We've started to address this by running annual Software Carpentry bootcamps (March 2013 and March 2014). To make things more sustainable, we're running our own Software Carpentry instructor training here later this month, where we'll train scientists how to teach other scientists basic computing skills like using UNIX, programming in Python or R, version control, automation, and testing. I went through this training course online a few months ago, and it was an excellent introduction to pedagogy and the psychology of learning (let's face it, most research professors were never taught how to teach; it's a skill you learn, not one you inherit with the initials behind your name).

Something that constantly comes up in these conversations is how to promote continued learning and practice after the short bootcamp is over. Students get a whirlwind tour of scientific computing skills over two days, but there's generally very little of the follow-up that's necessary to encourage continued practice and learning.

At the same time, we've got a wide variety of scientists, spanning disciplines including the social sciences, humanities, medicine, physics, chemistry, math, and engineering, who do scientific computing and data analysis on a daily basis and could really benefit from learning from one another.

These things motivated us to start a local R Users Group. So far we have 118 people registered on Meetup.com, over 70 of whom RSVP'd for our first meeting this week, where we had an excellent turnout.


At this first meetup we kicked things off with an introduction to the group, why we started it, and our goals. I then launched into a quick demo of some of the finer features in the dplyr package for effortless advanced data manipulation and analysis. The meetup group has a GitHub repository where all the code from our meetups will be stored. Finally, we concluded with a discussion of topics the group would like to see presented in the future: ggplot2, R package creation, reproducible research, dynamic documentation, and web scraping were a few of the things mentioned. We collectively decided that talks could either be research talks highlighting how these tools were used in an actual research project, or demo/tutorial sessions like the dplyr talk I gave.

The rich discussion we had at the end of this session really highlighted the potential this community has. In addition to the diversity and relative gender balance we saw among our first meetup's participants, we had attendees from all over UVA as well as representation from local industry and non-profit organizations.

I, for one, am tremendously excited about this new community, and will post the occasional update from the group here on this blog.


Wednesday, September 10, 2014

GNU datamash

GNU datamash is a command-line utility that offers simple calculations (e.g. count, sum, min, max, mean, stdev, string coalescing) as well as a rich set of statistical functions, to quickly assess information in textual input files or from a UNIX pipe. Here are a few examples:
First, let's use the seq command to generate some data, then use datamash sum to add up all the values in the first column:
$ seq 5
1
2
3
4
5
$ seq 5 | datamash sum 1
15
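datamash can also group rows by a key column with -g (input should be sorted by that column, or pass -s to have datamash sort it first). A minimal sketch on toy whitespace-delimited data, where -W tells datamash to split fields on whitespace:
$ printf 'A 1\nA 2\nB 3\nB 4\n' | datamash -W -g 1 mean 2
A       1.5
B       3.5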
What else? Let's calculate the mean, 1st quartile, median, 3rd quartile, IQR, sample standard deviation, and the p-value of the Jarque-Bera test for normality, using some data in file.txt:
$ cat file.txt | datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1
mean(x)   q1(x)  median(x)  q3(x)   iqr(x)  sstdev(x)  jarque(x)
45.32     23     37         61.5    38.5    30.4487    8.0113e-09
Go take a look at more examples, specifically the examples using datamash to parse a GTF file. Datamash can very quickly do things like find the number of isoforms per gene by grouping (collapsing) over gene IDs and counting the number of transcripts, find the number of genes with more than 5 isoforms, find genes transcribed from multiple chromosomes, examine variability in exon counts per gene, etc.
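As a rough sketch of that isoforms-per-gene idea (assuming you've already extracted a two-column file, genes.txt, with the gene ID in column 1 and the transcript ID in column 2; pulling those columns out of the GTF is left to your favorite awk/sed incantation):
$ datamash -s -g 1 countunique 2 < genes.txt
Counting the genes with more than 5 isoforms is then just a matter of piping that through awk '$2 > 5' | wc -l.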