We’re presenting!

Want to learn more about computational biology in a therapeutics context? I’ll be co-presenting with Huseyin Mehmet of Zafgen next Friday, December 7th at the Woburn Trade Center. We’ll be talking about making effective use of computational biology data for drug discovery and development, and also describing how the collaboration between Diamond Age and Zafgen works.

This event is the North Shore Technology Council’s usual First Friday gathering. Register here to attend and come say hello. Several Diamond Age folks will be there.

Hope to see you there!

–Eleanor


Thanksgiving 2018

It’s Thanksgiving Day, and I’d like to take a minute to say “Thank you” to the people who have made Diamond Age possible, starting with our customers. We’ll have a proper ‘Customers’ page up here soon enough, but in the meantime, thank you to Decibel Therapeutics, Voyager Therapeutics, 1CellBio, Aquinnah Pharmaceuticals, and Zafgen, Inc. As for the rest of you who will not be named, you know who you are, and we appreciate the trust you’ve placed in us.

Also, thank you to Chris and Somdutta, who took the risk of throwing in with me by joining Diamond Age full-time this year. And thank you to Dave, Max, Nick, Mike, Zarko, Bruce, and Chris for making the time to work with us. Diamond Age would not exist without you.

Happy Thanksgiving!

Three reasons to use Git for bioinformatics projects

Source control is not just for software engineers. Using the tools that coders have written to support their work can make a computational biologist’s life massively easier. You’ll find that having a versioned, trackable backup of your analytic scripts is a lifesaver, over and over again.

  1. Backups. If you are checking code into a Git repository, you have at least one other location for your code. Dropping your laptop into the river doesn’t *have* to be a disaster.
  2. Collaboration. Working with other people is hard enough without stepping on each other’s toes while making edits. Git’s merging capabilities are excellent, and will help you figure out who did what, when, and whose changes should remain in the final document.
  3. Reproducibility. When you look back at that analysis you did last year, do you know what code you used to run it? Git does. Just ask it (see the sketch just after this list)! Then you can tell your boss why your results are different from last time.
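
To make “just ask it” concrete, here is a minimal sketch run from within R (the same commands work just as well in a terminal). The script name and commit hash are hypothetical placeholders:

```r
# A sketch only: assumes you are inside the Git repository that holds your scripts.
# "analysis.R" and the hash "a1b2c3d" are invented for illustration.

# Which versions of this script exist, and when did they change?
system("git log --oneline -- analysis.R")

# Restore the exact version of the script that produced last year's results
system("git checkout a1b2c3d -- analysis.R")
```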

We’re probably preaching to the choir here, but I wanted to make sure everyone had at least heard the gospel.

–Eleanor

BioIT World 2018

It’s that time of year again, folks: Bioinformatics Christmas. Or is it rather Thanksgiving, where the whole family gets together whether we like it or not?

Either way, I’ll be there next week, bells on and business cards in hand. If you’re attending, I’d love to catch up. Send me an email or find me via the contact page if you want to get a coffee and chat.

I’m very much looking forward to Carl Zimmer’s talk, as well as some talks by colleagues and friends of mine: John Keilty and Karina Chmielewski from Third Rock, Mike Dinsmore from Editas, and Iain McFaydden from Moderna.

Who do I really want to meet this year? Tanya Cashorali, one of the plenary keynote panelists and another woman who has started her own data science company. There aren’t many of those. We have to stick together.

After all, we’re family.

–Eleanor

Bioinformatics vs Computational Biology

The world of quantitative biology is large, diffuse, and sometimes overwhelming. It can be hard even to figure out what someone means when they say “bioinformatics”, let alone which part of the field they work in.

One way to break it down is to describe bioinformatics as the building of tools and methods for the processing and management of biological data, and computational biology as the pursuit of biological sciences using computational methods. In that framing, bioinformatics is more of an engineering discipline and computational biology more of a scientific one.

It’s helpful to think about these distinctions, subtle as they seem. It takes a certain mindset and skillset to build a robust sequencing analysis pipeline that will serve the needs of a large group of scientists for years. That mindset and skillset may be very different from the one required to do a deep investigation of the variants that impact risk of heart disease.

We can argue about the naming conventions all we want, but the label we apply to these two types of specialist doesn’t really matter. What matters is what they do; the person I would call a computational biologist writes code, yes, but does it in pursuit of a particular biological problem, and they would love to write less code and more manuscripts. The bioinformatician, on the other hand, wants to spend their time writing robust, high-quality code that does interesting and powerful computations. Papers are more of a nice side-effect.

The truth of the matter is that most programming biologists are a mix of the two disciplines.

When hiring for a small department or a startup, the distinction between these two caricatures becomes very important. Some people will be in the field for the biology specifically, and will choke when pressed to develop a tool for use by a team. Others will jump at the chance to write such a thing. Every group needs both. Consider your current needs: will this person be building a pipeline that will be reused again and again, or will they be investigating particular variants or particular compound response profiles? Fitting the right person to the job will ensure a happy employee and high productivity.

Figuring out what kind of background and preferences someone has can be as simple as asking them. Their resume or LinkedIn profile can also give clues. A software-focused person will tend to have one or more large, open-source bioinformatics software tools prominently listed. Their reference list may include a few papers describing this project and others (potentially many others) that use that tool. A manuscript-focused person will not be as likely to have a major tool-building segment of their resume. Instead, they will list a series of biology or dataset-focused projects, with manuscripts describing each.

Data Science

But where does data science fit into all this? That, at least, is simple; bioinformatics/computational biology is data science with a biology application, just as computational chemistry is data science for chemistry. Physicists have figured out that they’re all data scientists already, so there is no need for a name for them beyond “physicist”. I hope in the future we’ll do the same and just call ourselves “biologists”.

–Eleanor

The Compiled Thesis: why computational biologists should use self-documented analysis

Most of us in the biotech space are fully on board with the concept of reproducible research. Who thinks it’s a bad idea to be able to trace back from our results to the data and methods that produced them?

The trick, of course, is in how to do it. In the lab, there’s a strong culture of recording our experiments, tracking samples in a LIMS and in lab notebooks. In computational research, however, the equivalent processes are often ad hoc and personalized, or simply aren’t done at all.

And yet we’ve all experienced the confusion of looking back at a collection of analyses we’ve done, and wondering which was the “real” result, or why the results of one analysis differed from another. What about that slide deck we showed to management – which of the many versions of the RNA-Seq pipeline did it come from? Why does the code produce a different result than it did the last time I ran it?

I had this problem with my dissertation. My graduate work was done entirely in R, and included nearly 100 figures and tables. Some of these figures were very similar – the same analysis was done on several different datasets. Keeping track of the origin of each figure was extremely difficult; I couldn’t convince myself I had it right.

Two things helped with this: version control and self-documenting code. I’ve written about version control before. It solved the problem of figuring out why the results might change from one day to the next. Today I’ll tell you about why self-documenting code is a critical component of a data analyst’s workflow.

My entire thesis was self-documented. It was written in Sweave, a literate-programming tool that embeds R code in LaTeX documents. I embedded all of the code from my work into the text document that described it. When I made a change to the text, I would rebuild the whole thing and watch a PDF emerge with all of my prose, figures, and tables, beautifully formatted. Since the code that produced any given figure sat right next to the text that described it, I was fully confident I knew exactly which analysis produced which plot. And since the whole document was checked into version control (I used Subversion at the time), I could trace any change in a plot back to the change in the analysis that produced it.

It may sound like this sort of tracking and documentation is a lot of work, and I won’t pretend that there isn’t overhead. But the extra work paid off profoundly one day.
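
To make the idea concrete, here is a small, hypothetical Sweave fragment in the spirit of my thesis chapters. The section title, chunk name, and R objects are invented for illustration:

```
% A fragment of a hypothetical .Rnw file: LaTeX prose with an embedded R chunk
\section{Comparing expression across datasets}

The figure below is regenerated from the underlying data every time the
document is compiled, so the prose and the plot cannot drift apart.

<<expression-by-dataset, fig=TRUE, echo=FALSE>>=
# 'expr' and 'dataset' stand in for objects built earlier in the document
boxplot(expr ~ dataset, main = "Expression by dataset")
@
```

Running Sweave and then LaTeX on a file like this produces the formatted PDF, with every figure rebuilt from whatever the data and code say today.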

A good four months into the thesis-writing process one of the public datasets I had analyzed was retracted; there were irregularities in the authors’ data curation process. I was mortified – what did this mean for my work in integrating these datasets, and how was I going to extract all of the plots and tabular results that included the disgraced dataset?

Since my thesis was self-documenting, I didn’t need to worry about how the extraction of one dataset would mangle the organization of the document. It was structured like a programmer had written it – with for loops and lists of datasets. I deleted the offending dataset and rebuilt the thesis. My figures were recreated, my tables recalculated. I did need to edit a few things by hand; for example, any mention of that disgraced dataset in the text needed to be changed. But overall, reproducible research had saved me. I could have spent a month rewriting and instead I spent a few days.

Not bad.

I can’t recommend self-documenting code highly enough. It’s helped me on many occasions since, in smaller analyses and in less-spectacular ways, but I no longer even think about doing work for myself or for a client without self-documentation.

Coda

When I wrote my thesis, Sweave was pretty state-of-the-art and Subversion was not a complete dinosaur. Now there are better options available, like Markdown and knitr, or the open lab notebooks put together by the Jupyter folks. Any one of them will make your analysis infinitely more reliable. For version control, Git is the most common choice, but any modern system will give you the confidence you need that your code is what you think it is.
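
For the curious, the knitr/R Markdown version of the same idea looks something like this; the file name, chunk, and the results data frame are again invented for illustration:

````
<!-- report.Rmd: a hypothetical R Markdown fragment (Markdown prose plus R) -->
## Top differentially expressed genes

The table below is recomputed from the data every time the report is knit.

```{r de-summary, echo=FALSE}
# 'results' is a placeholder data frame assembled earlier in the report
knitr::kable(head(results[order(results$p.value), ], 10))
```
````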

–Eleanor

The Sketchy Data Collector and the Problem with Repeated Measurements

I’ve been asked before to explain how repeated measurements can impact statistical models. We often bake repeated measurements – more than one measurement taken on the same person, the same experimental animal, and so on – into our experiments. They can give us more confidence in noisy data. The downside is that they need to be accounted for properly when doing the analysis afterwards.

For example, an experiment with repeated measurements probably shouldn’t be analyzed with a simple t-test.

Let’s use an example to give us an intuition for why.

Imagine you wanted to know whether carb-loading resulted in faster race times for mid-distance runners. Say you hire someone to collect some data for you: they recruit runners and assign them to either carb-load or eat a balanced meal before running a timed 5k.

You see a dataset that looks something like this:

[Figure: 5k finish times for the carb-loading and balanced-meal groups]

Looks pretty good, right? You run a t-test (the data are roughly normal) and get a p-value of 0.001. Great! The difference looks significant, and you get ready to publish a paper (or a blog post).

But then imagine that your data collector neglected to mention to you that the measurements were all taken on the same person. The data collector is an avid runner themselves, and managed to run 100 5ks. Do you still believe your t-test?

Of course you don’t. Those measurements are all correlated; they’re quite a good measurement of that one person, but who can say how well that one person generalizes to all runners?

And this is the intuition you are looking for: the measurements collected on your sketchy 5k-running data collector are all correlated; they are not independent. A t-test assumes that the measurements are independent, so if you use it here you will get a misleadingly small p-value and an inflated sense of significance.

A slightly better way to do this experiment would be to identify ten different runners, and have them run ten races each: five with each type of preparatory meal. In this case we would still be looking at correlated data, because the measurements taken on one person will be similar to each other. Even better would be fifty independent runners, each running two races.

What if the study budget only allows for ten runners? Or what if you expect very noisy measurements and so you want to collect more than one measurement from each subject? These experimental designs are still fine, and we can test our carb-loading hypothesis with something called a mixed-effects model. And that’s a topic for another post.
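
As a preview, and purely as a sketch: suppose the race results live in a data frame called races, with columns time (the 5k finish time), meal (carb-load or balanced), and runner (an ID for each runner). Those names are invented here, but the comparison would look roughly like this:

```r
# Hypothetical data frame 'races': one row per race, with columns
#   time   - 5k finish time in minutes
#   meal   - "carb" or "balanced"
#   runner - an ID for the runner (the repeated-measures grouping factor)
library(lme4)

# The naive analysis: treats every race as independent, which it is not
t.test(time ~ meal, data = races)

# A mixed-effects model: a random intercept for each runner accounts for
# the fact that races run by the same person are correlated
fit <- lmer(time ~ meal + (1 | runner), data = races)
summary(fit)
```

The fixed effect for meal is the carb-loading effect we actually care about; the random intercept soaks up the runner-to-runner differences.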

–Eleanor