The Compiled Thesis: why computational biologists should use self-documented analysis

Most of us in the biotech space are fully on board with the concept of reproducible research. Who thinks it’s a bad idea to be able to trace back from our results to the data and methods that produced them?

The trick, of course, is in how to do it. In the lab, there’s a strong culture of recording our experiments and tracking samples in LIMS systems and lab notebooks. In computational research, however, the record-keeping is often improvised – or simply isn’t done at all.

And yet we’ve all experienced the confusion of looking back at a collection of analyses we’ve done, and wondering which was the “real” result, or why the results of one analysis differed from another. What about that slide deck we showed to management – which of the many versions of the RNA-Seq pipeline did it come from? Why does the code produce a different result than it did the last time I ran it?

I had this problem with my dissertation. My graduate work was done entirely in R, and included nearly 100 figures and tables. Some of these figures were very similar – the same analysis was done on several different datasets. Keeping track of the origin of each figure was extremely difficult; I couldn’t convince myself I had it right.

Two things helped with this: version control and self-documenting code. I’ve written about version control before. It solved the problem of figuring out why the results might change from one day to the next. Today I’ll tell you about why self-documenting code is a critical component of a data analyst’s workflow.

My entire thesis was self-documented. It was written in Sweave, a literate-programming tool that lets you embed R code directly in a LaTeX document. I embedded all of the code from my work into the text document that described it. When I made a change to the text, I would rebuild the whole thing and see a PDF emerge with all of my prose, figures, and tables, all beautifully formatted. Since the code that produced any given figure was located right next to the text that described it, I was fully confident I knew exactly which analysis produced which plot. Since the whole document was checked into version control (I used Subversion at the time), I knew I could trace any change to a plot back to the change in the analysis that produced it. It may sound like doing this sort of tracking and documentation is a lot of work, and I won’t pretend that there isn’t overhead. But this extra work paid off profoundly one day.
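
If you haven’t seen a literate analysis before, here’s a minimal sketch of what a chunk of a Sweave (.Rnw) file looks like. The gene name and the gene1_data object are made up for illustration; the point is that the prose and the code that backs it live side by side:

    Expression of GENE1 differs between the treatment groups,
    as the boxplot below shows.

    <<gene1-boxplot, fig=TRUE, echo=FALSE>>=
    # R code sits right next to the prose that describes its output;
    # rebuilding the document regenerates the figure from the current data
    boxplot(expression ~ group, data = gene1_data,
            main = "GENE1 expression by treatment group")
    @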

A good four months into the thesis-writing process, one of the public datasets I had analyzed was retracted; there were irregularities in the authors’ data curation. I was mortified – what did this mean for my work in integrating these datasets, and how was I going to extract all of the plots and tabular results that included the disgraced dataset?

Since my thesis was self-documenting, I didn’t need to worry about how removing one dataset would mangle the organization of the document. It was structured like a programmer had written it – with for loops over lists of datasets. I deleted the offending dataset and rebuilt the thesis. My figures were recreated, my tables recalculated. I did need to edit a few things by hand; for example, any mention of that disgraced dataset in the text needed to be changed. But overall, reproducible research had saved me. I could have spent a month rewriting and instead I spent a few days.
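
To make that structure concrete, here’s a rough sketch of the pattern. The dataset names and the helper functions are hypothetical stand-ins for what was in the real document:

    # Figures and tables are generated by looping over a named list of datasets,
    # so dropping one dataset from this list drops everything derived from it
    # the next time the document is rebuilt.
    datasets <- list(study_a = study_a, study_b = study_b, study_c = study_c)

    for (name in names(datasets)) {
      make_expression_figure(datasets[[name]], label = name)  # hypothetical helpers
      make_summary_table(datasets[[name]], label = name)
    }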

Not bad.

I can’t recommend self-documenting code highly enough. It’s helped me on many occasions since, in smaller analyses and in less-spectacular ways, but I no longer even think about doing work for myself or for a client without self-documentation.

Coda

When I wrote my thesis, Sweave was pretty state-of-the-art and Subversion was not a complete dinosaur. Now there are better options available, like R Markdown with knitr, or the open notebooks built by the Jupyter folks. Any one of them will make your analysis far more reliable. For version control, Git is the most common choice, but any modern tool will give you the confidence you need that your code is what you think it is.
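
For comparison with the Sweave sketch above, here’s roughly what the same chunk looks like in an R Markdown (.Rmd) file processed by knitr – again with a made-up gene1_data object:

    The groups differ in GENE1 expression, as the plot below shows.

    ```{r gene1-boxplot, echo=FALSE}
    boxplot(expression ~ group, data = gene1_data)
    ```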

–Eleanor

The Sketchy Data Collector and the Problem with Repeated Measurements

I’ve been asked before to explain how repeated measurements can impact statistical models. We often bake repeated measurements – more than one measurement taken on the same person, the same experimental animal, and so on – into our experiments. They can give us more confidence in noisy data. The downside is that they need to be accounted for properly when doing the analysis afterwards.

For example, an experiment with repeated measurements probably shouldn’t be analyzed with a simple t-test.

Let’s use an example to give us an intuition for why.

Imagine you wanted to know whether carb-loading resulted in faster race times for mid-distance runners. Say you hire someone to collect some data for you: they recruit runners and assign them to either carb-load or eat a balanced meal before running a timed 5k.

You see a dataset that looks something like this:

[Figure: running1]

Looks pretty good, right? You do a t-test (the data looks roughly normal) and get a p-value of 0.001. Great! This looks significant! You get ready to publish a paper (or a blog post).

But then imagine that your data collector neglected to mention to you that the measurements were all taken on the same person. The data collector is an avid runner themselves, and managed to run 100 5ks. Do you still believe your t-test?

Of course you don’t. Those measurements are all correlated; they’re quite a good measurement of that one person, but who can say how well that one person generalizes to all runners?

And this is the intuition you are looking for: the measurements collected on your sketchy 5k-running data collector are all correlated – they are not independent. A t-test assumes that measurements are independent, so if you used that test here you would get a p-value that is falsely small: the test treats 100 correlated races as if they were 100 independent pieces of evidence, and so it overstates your confidence in the result.
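
Here’s a quick simulated version of the scenario in R. The numbers are invented: suppose our one data collector happens to run about 20 seconds faster after carb-loading, and we mistakenly treat their 100 races as 100 independent runners:

    set.seed(42)
    # 100 races run by ONE person: 50 after carb-loading, 50 after a balanced meal
    carb     <- rnorm(50, mean = 1180, sd = 15)  # 5k times in seconds
    balanced <- rnorm(50, mean = 1200, sd = 15)

    # Treating these as 100 independent runners yields a tiny p-value...
    t.test(carb, balanced)
    # ...but it only tells us about this one runner, not about runners in general.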

A slightly better way to do this experiment would be to identify ten different runners, and have them run ten races each: five with each type of preparatory meal. In this case we would still be looking at correlated data, because the measurements taken on one person will be similar to each other. Even better would be fifty independent runners, each running two races.

What if the study budget only allows for ten runners? Or what if you expect very noisy measurements and so you want to collect more than one measurement from each subject? These experimental designs are still fine, and we can test our carb-loading hypothesis with something called a mixed-effects model. And that’s a topic for another post.
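
As a teaser, here’s roughly what that looks like in R with the lme4 package. The races data frame is hypothetical, with one row per race:

    library(lme4)

    # races: one row per race, with columns
    #   time   - 5k time in seconds
    #   meal   - "carb" or "balanced"
    #   runner - ID of the runner who ran that race

    # A fixed effect for meal plus a random intercept for each runner,
    # so the model knows which measurements came from the same person.
    fit <- lmer(time ~ meal + (1 | runner), data = races)
    summary(fit)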

–Eleanor

What does it mean to fit a model, anyway?

What on earth do people mean when they tell you they’re going to “fit a model”?

Let’s start with what a model is. A model is a description of a system, usually expressed as an equation of some kind. Let’s say we have some data – measurements of variables x and y. We think that in the future, we’ll have measurements of more x-like values, and we’d like to be able to predict those ys.
[Figure: point cloud. Some “data”.]

It looks like you could draw a nice straight line through this cloud of points. That means that a linear model might be a good choice for this data. We’ve just done the first step in the model-fitting process: we’ve decided to use a line – a simple linear model.

The process of picking the best line for this model is called “fitting”. There are different ways to do this – least squares is probably the most familiar one. You could also use the “wiggle a ruler around on paper” method, or the “draw lines in PowerPoint” method. We’ll skip the details of that step because the internet describes least squares fairly well.

[Figure: three fits. Two bad fits and one good one.]

I’m going to use an equation here, but it’s simple and it’s the only one I’ll use in this post!

That fitted line can be described with the equation y = mx + b. When we fit the model, what we’re really doing is choosing the values for m and b – the slope and the intercept. The point of fitting the model is to find this equation: to find the values of m and b such that y = mx + b describes a line that fits our observed data well. In the case of the best-fit model above, m is close to 1 and b is just a bit larger than 0.
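
In R, this fitting step is a single line. Here’s a hedged sketch with simulated data chosen to resemble the point cloud above (true slope of 1, intercept a bit above 0):

    set.seed(1)
    x <- runif(100, 0, 10)
    y <- 0.5 + 1 * x + rnorm(100, sd = 1)  # simulated data: m = 1, b = 0.5, plus noise

    fit <- lm(y ~ x)  # least-squares fit of the line y = m*x + b
    coef(fit)         # the fitted intercept (b) and slope (m)

The fitted values land close to 0.5 and 1, which is exactly the sense in which the line “fits” the cloud.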

And why do we care about this? Well, that value of m can be really informative. If m is very large and positive, then a small change in the value of x predicts a tremendous positive change in the value of y. If m is small, then changes in x predict small changes in y. A large m implies that x may have a large effect on y, which is why m is sometimes called the effect size; you will also see it called a coefficient.

The principles of model-fitting apply just as well to linear models built on data with many more parameters – many more dimensions – and those models are fit in similar ways. We may not be able to draw multidimensional datasets in neat graphs, but we can still apply least squares to them!

Testing for significance

Now that we’ve fit a model and found values for m and b, we’d like to know something: does m really matter?

Take these two sets of data:

[Figure: two datasets. Both could be fit with a model that is a straight line with slope of 1. Even though that line might be the “best” fit for both sets of points, it would be a far better fit for the first dataset.]

You got me; the first example is just a copy of the one above. But the second is another dataset that we could imagine fits the *same* linear model – that is, the best-fit linear model could have the exact same values for m and b. But really, does the value of x predict the value of y well here? We can tell by looking that it doesn’t.

That’s why many model-fitting tools return not only a coefficient for each parameter, but also a p-value. This p-value is an indicator of whether that predictor (x) is actually useful in informing you about the state of the response variable (y).

To assess whether a parameter is predictive, we remove the variable (x) and its coefficient (m) from the model. Then we see how good we are at predicting y with a model that doesn’t include them.

In this case, a model with no x means that we always make the same guess: the prediction for y is simply the mean of all of the observed values of y.

[Figure: compare models. Predicting y with our linear model including x, and predicting y without x. Sum the lengths of the yellow lines to get a good idea of how good a prediction we’re making.]

We compare the predictions between these two models. If our model that includes x is much better at prediction, we assign a low p-value to that coefficient.
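
Continuing the sketch from the fitting section above, this comparison is what an F-test between nested models does. In R:

    full    <- lm(y ~ x)  # the model that uses x
    reduced <- lm(y ~ 1)  # the model that ignores x and always predicts mean(y)

    # Compare how well the two models explain y; a small p-value means
    # adding x genuinely improves the predictions.
    anova(reduced, full)

    # With a single predictor, this is the same test reported next to the
    # x coefficient by summary(full).
    summary(full)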

We do this kind of testing for significance in many statistical settings, including one of my favorites: testing for differential expression of genes in RNA-Seq experiments. If a linear model that includes the expression level of gene A is better at predicting which group a sample comes from than a model without A, we decide that gene A is significantly differentially expressed.

–Eleanor

BioIT World 2017

Last week was the annual BioIT World Convention and Expo. It’s three days of talks, workshops, and row upon row of technology solutions vendors putting their best foot forward. Also, biotech folks from all over the country converge upon Boston, so I get to catch up with many colleagues I see but once a year.

As usual, the conference was a firehose of information about new technology, new methods, and new software. Also as usual, participating in the Best in Show judging was the highlight of the event. It’s a real pleasure to see the best new technologies our community has to offer. Congratulations to the winners!

–Eleanor

Probability vs Likelihood: What’s the Difference?

The terms probability and likelihood are often used interchangeably in day-to-day conversation. They have specific meanings in the world of statistics, however, and understanding the difference is helpful in understanding statistical methods.
We’ll use examples to start. Take a coin flip: you flip a coin, and you know it’s fair – a lifetime of experience has given you a model describing the behavior of the coin: half the time a flip will result in heads. You use this probability of 0.5 to decide whether you want to take a bet on the outcome of that flip.
If, on the other hand, you wanted to test whether that coin was fair, you might flip it many times. Say you flip it 1000 times, and you observe 505 heads and 495 tails. Now you want to know: is this coin fair? Is my model of the coin’s behavior correct? This time you are talking about a likelihood: how likely is it that this is a fair coin, given what you observed?
In short, a probability quantifies how often you expect to observe a certain outcome, given a model of the underlying process. A likelihood quantifies how well a model explains data that has already been observed. Probabilities describe outcomes, while likelihoods describe models.
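
To make the distinction concrete, here’s a small sketch in R using the binomial distribution:

    # Probability: given a model (a fair coin, p = 0.5), how probable is a
    # particular outcome, such as seeing exactly 505 heads in 1000 flips?
    dbinom(505, size = 1000, prob = 0.5)

    # Likelihood: given the observed data (505 heads in 1000 flips), how well
    # do different candidate values of p explain what we saw?
    p   <- seq(0.40, 0.60, by = 0.001)
    lik <- dbinom(505, size = 1000, prob = p)
    p[which.max(lik)]  # the value of p that best explains the data: 0.505

Note that the same function plays both roles: held at p = 0.5 and viewed as a function of the outcome, it’s a probability; viewed as a function of p with the outcome fixed, it’s a likelihood.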

–Eleanor

Three reasons computational scientists should use version control

Source control is not just for software engineers. Using the tools that coders have written to support their work can make a computational biologist’s life massively easier. You’ll find that having a versioned, trackable backup of your analytic scripts is a lifesaver, over and over again.
There are several version control tools out there – we find that distributed source-control tools like Git work best for teams.
  1. Backups. If you are pushing your code to a git repository on another machine, you have at least one other copy of it. Dropping your laptop into the river doesn’t *have* to be a disaster.
  2. Collaboration. Working with other people is hard enough without stepping on each other’s toes while making edits. Git’s merging capabilities are excellent, and will help you figure out who did what, when, and whose changes should remain in the final document.
  3. Reproducibility. When you look back at that analysis you did last year, do you know what code you used to run it? Git does. Just ask it (see the sketch below)! Then you can tell your team why your results are different from last time.
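
For example, asking git about a script from within R might look something like this. It assumes git is installed and your analysis lives in a git repository; analysis.R is a stand-in for your own script:

    # Every commit that touched the script, one line each
    system("git log --oneline -- analysis.R")

    # What changed in the script since the previous commit
    system("git diff HEAD~1 -- analysis.R")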

Try it with your writing too – you’ll find that having your manuscript draft safely stashed away in a repository adds a lot of peace of mind.

There is plenty of great documentation out there on using these tools. Here are a few to begin with.

Getting Started with Version Control

Git Best Practices

–Eleanor