We’re so excited to announce the result of our work with the AWS Industry Solutions Team today: a “one-stop” approach that lets GATK users take advantage of more AWS features and optimize for low cost, high speed — or both! Check out our open-source solution and an accompanying writeup at AWS for Industries.

So you want to start a Biotech

Michele Busby wrote a blog post we wish we’d written. (Then again, if we had been the ones to write it, the shout out to DA at the end might have sounded a bit shameless….) We do have one change we’d make to her biotech stack: Python is great and we agree it’s the place to start for most coding. But we find R works just fine for non-production systems.

Announcement: 10X Validated Partner

For the last four years, we’ve had the privilege of partnering with leading biotech and pharma companies in Boston and beyond to take on some of the toughest challenges in computational biology. In that time, we’ve analyzed data from hundreds of single-cell RNAseq samples, representing millions of cells, generated on 10x Genomics machines — and having seen 10x’s work up-close, we’re very proud today to be the first data analysis company named as a 10x validated partner. You can learn more about the services we’ll be offering in partnership with 10x, including acceleration of single cell gene expression data analysis, here.

Using single-cell RNAseq to profile organoids for kidney therapy development

Michael DeRan

Goldfinch Bio is an exciting startup working on precision therapies for chronic kidney disease — a condition that affects 26 million people but hasn’t had a major advance in treatment in over 20 years. I recently had the pleasure of working with Goldfinch to help characterize its high-throughput organoid platform, which it developed for preclinical validation of potential therapies for kidney disease. The Goldfinch team had already performed single-cell RNA sequencing (scRNAseq) on organoids from eight different time points in vitro and six other time points after transplantation in rats. This produced a massive dataset on >300,000 cells from 42 different organoids. I worked with the team at Goldfinch to analyze this data, with a particular focus on understanding the maturation state of the organoids and the reproducibility of the model.

Kidney organoids grown in vitro for 10 days before transplantation in rats for 2 weeks, showing large populations of podocytes, tubular cells, and some off-target cell types (melanocytes).

scRNAseq is well-suited to studying heterogeneous populations of cells such as those from organoids. We used the expression of cell-type marker genes to identify more than a dozen cell types in the organoids without prior sorting. Having identified the cells, we were then able to compare the proportions of cell types present in the organoids to assess their maturation as well as variability between replicates. We found that the organoids differed significantly between time points but showed little variation between replicates. We also observed that organoids that had been transplanted in rats had a greater proportion of mature cell types than those cultured in vitro. Finally, we compared the transcriptional profiles of two types of podocytes (epithelial cells that help the kidneys filter blood) — cells from each organoid, and cells from adult and fetal kidney samples. We found that organoids cultured in vivo were closely correlated with the tissue samples. 
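The proportion-comparison step in that workflow is simple once every cell carries a sample label and a cell-type call. Here’s a minimal sketch of the idea in pandas — the labels and counts are made up for illustration (the real analysis covered >300,000 cells and more than a dozen cell types):

```python
import pandas as pd

# One row per cell: which organoid it came from and its annotated type
# (assigned earlier from marker-gene expression).
cells = pd.DataFrame({
    "organoid": ["day10_r1"] * 4 + ["day10_r2"] * 4,
    "cell_type": ["podocyte", "podocyte", "tubular", "melanocyte",
                  "podocyte", "tubular", "tubular", "melanocyte"],
})

# Cell-type proportions per organoid: each row sums to 1, so replicates
# and time points can be compared directly despite different cell counts.
proportions = pd.crosstab(cells["organoid"], cells["cell_type"],
                          normalize="index")
print(proportions)
```

With proportions in hand, assessing reproducibility is a matter of comparing rows between replicates, and maturation a matter of comparing rows between time points.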

All told, our findings demonstrated that transplantation of organoids in rats consistently promoted organoid maturation and vascularization. As a computational biologist, I also gained some field-specific insights from working with the Goldfinch team. Although I have analyzed scRNAseq data from a variety of tissues, the complexity of the kidney organoids was unlike any other dataset I have worked with — this was a large dataset with numerous samples and many different cell types. Our ability to identify and characterize so many distinct cell populations within these samples speaks to the inherent strength of scRNAseq in deconvoluting complex populations. 

I would like to thank the Goldfinch Bio team for the opportunity to work on this project — in particular, Amy Westerling-Bui, Tom Soare, Eva Fast, and Srinivasan Venkatachalan, for plenty of great discussions and teamwork. This work was recently published on

Working from Home

Lots of folks in the life sciences are finding themselves suddenly thrown into an involuntary work-from-home situation. Here at Diamond Age we’ve been working remotely from our homes for some time, so we’d like to say “Welcome!” and offer a few hard-won tips to make the transition (hopefully) easier. A few of these are common to all of us:

  • Put together an ergonomic desk setup. It’s entirely too easy to damage yourself, sometimes permanently, hunching over a laptop keyboard. Get a real office chair, a monitor at the correct height, an ergonomic keyboard if that suits you, etc. This work-from-home situation isn’t temporary, and even a week with bad posture can hurt you.
  • Don’t rely on email alone to communicate with your coworkers. If you find yourself trying to write an email of more than three sentences, it probably needs to move to a phone call or a video meeting. That goes double for an email sent to more than one person.
  • Keep in contact with your professional network. You’re not meeting in the halls or at the coffee shop anymore, so you need to take action to keep up those contacts. Schedule them explicitly for video and bring your own coffee. You need to keep your social and professional network alive. 
  • Take breaks during the day. Walk around your neighborhood if you can. It’s important to get away from your desk and reset your brain. Sitting in one room all day is suffocating.

And then we each cope with the home work environment a little differently. I asked around the team for their favorite less-common tips:

  • Katie: “I don’t like to sit in one chair all day so I have four or five options to rotate through. Also when I work from home I tend to endlessly snack, but if I force myself to drink water I eat less of my kids’ Valentine’s day candy.”
  • Mike: “Check what’s in view of your camera during video calls. You may need to move your cat’s litterbox.”
  • Eleanor: “Don’t keep unhealthy snacks in the house – you will eat them all. Stock up on carrot sticks instead. And be sure to feed the cats before your 4pm meeting: they will harass you on video if you don’t.”
  • Erica: “Sometimes I plan calls with friends for my breaks ahead of time — if I know I’ll take lunch around noon, I’ll reach out to a few people in the morning before I start working to see if they want to connect around that time.”

Hang in there, everyone. We’re all adjusting to this new normal.

Reach out if you’d like to talk to us about either the joys of home office work or bioinformatics. We’re always happy to chat.


Tutorial: Scientific reporting from Jupyterlab with R Markdown

Chris Friedline

At Diamond Age Data Science, we make extensive use of RMarkdown and RStudio. These tools let us merge our analyses and reporting into a single framework and give our customers a way to inspect our methodology and reproduce our analyses on their own (if they wish). They also enable our customers to customize their views of our work (e.g., hiding or showing all code along with a narrative) and choose how they interact with the data (e.g., downloading tables directly to Excel, viewing dynamic plots).

But what about Python-oriented developers and data scientists who don’t want to use R? There’s a way they can also get the benefits of the RMarkdown reporting system — but it requires a bit of trickery, which I’ll show you in this tutorial.


Alright, I admit it. I’m one of those people who dislike R. I work in Python 99.9% of the time, I’ve been programming in Python for a long time now, and I spend a lot of development time working interactively with JupyterLab. I’m delighted and productive. Could I get this productive with R? Yes. Am I motivated to do so? Not at the moment.

I am aware of the many improvements that have been made to the Python tooling in RStudio, but to me it still feels unnatural — and, though it’s much improved, it still has a way to go (in my opinion) before it is appropriate for a full-time, Python-centric, computational biology workload.

What you need to know already

To proceed through this document with ease, you should be familiar with:

  • JupyterLab and Jupyter notebooks
  • Python (and a little R)
  • R Markdown and knitr
  • Docker
  • Make

This is a lot, though, and hopefully those without the full suite of knowledge above can still gain some appreciation of the system I’m going to describe.

R and Python

For many years now, the Python community has had really nice integration with R through rpy2 and the %%R cell magic in Jupyter/IPython. I used it in both my PhD and postdoc work, as well as professionally, and it’s great. However, it has always been a pain point to create the (admittedly) beautiful reports that come almost stock with RMarkdown. Pair that narrative style with the analysis possibilities of the R ecosystem, and one may wonder why more professionals doing this kind of work don’t switch wholesale from Python to R.

Recent TIOBE metrics may tell part of the story. R dropped 0.52% between 2019 and 2020 and was overtaken by (gasp) Visual Basic. (I have some cred here — the first application I wrote professionally was in Visual Basic, over 20 years ago.)

For those of you who may not have seen this R-from-Python scenario in action before, here’s a taste using IPython %%R magic in a JupyterLab notebook backed by an IPython kernel.


Briefly, you load the extension from rpy2 into the notebook, annotate a code cell with %%R, and use R as you normally would. This example uses R from Python. Functionality also exists to use a native R kernel in JupyterLab (i.e., no Python interface), but that is outside the scope of this post. More info on this type of setup can be found here.
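Concretely, the two cells look something like this (a sketch; these only run inside a JupyterLab/IPython session with rpy2 installed):

```
# Cell 1 (run once per session): load the rpy2 extension
%load_ext rpy2.ipython
```

```
%%R
# Cell 2: everything below the %%R marker is evaluated by R
x <- rnorm(100)
summary(x)
```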

From .ipynb to .Rmd (pain)

So, let’s set the stage: Here I am, happily developing code in JupyterLab (sometimes in %%R cells, but mostly not), and I finish my analysis. Things look great – like ROC AUC = 0.999 great – and now I need to write up the report of how all of this magic happened and why it matters to our clients. So, I begin…

  1. Open RStudio
  2. Load up our Diamond Age-branded RMarkdown (Rmd) template
  3. Sigh
  4. Begin the report

I have a choice to make. (It is one I have to make a lot.) For the sake of example, let’s pretend I did a large machine learning project on some RNA-Seq data, and analyzed the data in a JupyterLab notebook on an EC2 instance with a lot of memory and many CPUs (more than a laptop or desktop). The choice: do I put the code in the Rmd, or do I just report intermediate results? More often than not, the latter makes the most sense. Our clients often don’t have the infrastructure, expertise, or desire to duplicate the environment I’ve used to analyze their data, nor do they have the desire to take delivery of and deploy a docker image with everything installed. (There are advancements happening in this space that are exciting, however. Gigantum and CodeOcean, I’m looking at you!)

Back to the intermediate data: From the JupyterLab notebook, I write out the necessary text files generated by my analysis, and then code them up in my Rmd. I write R to pull them in, massage them, and display them nicely with our downloadableDT function. All of the text is written in RStudio as well, as the analysis is recapitulated, cell-by-Jupyter-cell, to weave and tell the analysis story. Back and forth I go, JupyterLab -> RStudio -> JupyterLab -> RStudio… until my head (and whatever is left of my soul, if such a thing even exists) hurts. There must be a better way, I tell myself.

Until about a year or so ago, I was without one. Until Jupytext.

The magic of Jupytext

I’ve been a fan of Jupytext since the moment I first saw it, using it mostly in conjunction with the percent output format to commit my otherwise unwieldy .ipynb files to git repositories as .py files with %% cell markup. Jupytext would then faithfully recreate the notebook files when cloned to a new system, and things would be back up and running quickly. Additionally, cell-level changes would be sanely represented in diffs. Sharing was also a breeze, and as long as my collaborator had JupyterLab and Jupytext set up, things would just work. Someone should give Marc Wouts 1000 stroopwafels.
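In the percent format, a notebook round-trips to a plain .py file that diffs cleanly. A minimal sketch of what such a paired file looks like:

```python
# %% [markdown]
# # My analysis
# Narrative text lives in Markdown cells like this one.

# %%
# An ordinary code cell begins at each bare "# %%" marker.
counts = {"gene_a": 10, "gene_b": 20}
total = sum(counts.values())

# %%
print(total)
```

Because it’s just Python with comment markers, the file is also directly runnable outside Jupyter.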

Here’s something even more amazing: with a change to just one lone setting, Jupytext can write Rmd files from Jupyter notebooks running Python kernels! One day, this just clicked for me. “Hey, maybe I can get rid of all of this Rmd double-work.”

Indeed. Here’s how.

Environment setup

In order to facilitate writing this post, I chose to create a fresh (and even usable) docker image that contained what I needed to get started with this system. I leveraged the Jupyter Docker Stacks to build a base image and added some of the R packages I need to complete the demo. Feel free to take and use.

The Dockerfile

FROM jupyter/scipy-notebook:7a0c7325e470
USER root
RUN apt-get update && \
    apt-get install -y libreadline-dev libicu-dev libicu60
USER jovyan
RUN conda install -y -c conda-forge jupytext \
    R \
    r-tidyverse \
    r-dt \
    r-reticulate \
    rpy2 && \
    jupyter lab build

The Makefile

Building and running the environment is controlled using a Makefile, as below.


# The tag and template name below are placeholders; use whatever you like.
DOCKER_TAG = jupytext-rmd-demo
TEMPLATE_NAME = template.Rmd

build:
	docker build -t $(DOCKER_TAG) .

run:
	docker run -p 10000:8888 \
		-v $(PWD):/home/jovyan/work $(DOCKER_TAG) \
		jupyter lab

knit:
	Rscript -e "rmarkdown::render('$(TEMPLATE_NAME)')"

I like a Makefile for this kind of thing. It’s a useful way to get things to work forever without having to memorize commands. I do this a lot, and along with my beard and colorful shoelaces, it increases my hipster cred.

Setting up your JupyterLab environment

After building and launching your environment with the Dockerfile and the Makefile, using make build and make run, you’ll have a JupyterLab instance running on port 10000. Go there: http://localhost:10000. You’ll see a screen like the one below, asking for a token or a password. In this demo, you’ll use a token.

From your command line where you ran make run, you’ll find the token looking something like this:

From here, you would copy everything after token= into the JupyterLab text box. Don’t get confused about port 8888 here. This is the port inside the docker container. To get to it, you’ll use the mapped port of 10000.

Go ahead and log in with the token at http://localhost:10000

Creating your notebook, template.ipynb.

After logging into the JupyterLab Interface, create a new notebook by clicking on the Python 3 icon under the Notebook section. Rename Untitled.ipynb to template.ipynb.

Using the command palette, associate template.ipynb with “R Markdown” as shown below.

Now, after saving the file, you should see a template.Rmd file in the File Browser.

Adding the R Markdown YAML header

Now that you have the ipynb–Rmd pair of files, we’ll make our changes only to the ipynb file. This is, after all, what we want to do in the first place: work in our beloved JupyterLab environment, but come out the other side with a nicely formatted html document processed from R Markdown.

Our standard Diamond Age YAML header looks like this.

---
title: "Favorite Therapeutics<br>Demo Report"
author: "Chris Friedline ("
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    code_folding: hide
    collapsed: true
    fig_width: 8
    highlight: haddock
    keep_md: false
    theme: cerulean
    toc: yes
    toc_float:
      collapsed: no
    css: DAstyle.css
---

There are some nice things in here. We’ve designed our own CSS to make things look snappy and we make sure our date is always correct with a bit of R code. However, if we put that into a regular cell in the notebook, as below:

What we see are two bad things:

  1. I can’t actually run the cell since it’s definitely not valid Python
  2. The paired R Markdown looks like this:

This is not what we want. What we want is for the R Markdown header YAML to be merged with the Jupytext header YAML. To do this we use a Raw Cell.

Back in the notebook, change the cell to Raw (using either the command mode keyboard shortcut, r, or using the menu above).

Now, after saving, you can see that the header has been merged. Is this magic? You bet it is.

Fixing up the R Markdown indenting

We discovered, as scientists who sometimes (always) write highly-indented documents, that this doesn’t play nicely with R Markdown after maybe three levels. To get around this, we use some code like the below:

<!-- script for proper TOC indenting -->
$(document).ready(function() {
  $items = $('div#TOC li');
  $items.each(function(idx) {
    num_ul = $(this).parentsUntil('#TOC').length;
    $(this).css({'text-indent': num_ul * 10, 'padding-left': 0});
  });
});


So, how the heck do we get this into the notebook (and subsequently into our Rmd)? We know we can’t use a Code cell because of the same issue we had with the header. Can we use Raw? It turns out that we can’t, because the code gets wrapped in a Python block.

```{python active="", eval=FALSE}
<!-- script for proper TOC indenting -->
$(document).ready(function() {
  $items = $('div#TOC li');
  $items.each(function(idx) {
    num_ul = $(this).parentsUntil('#TOC').length;
    $(this).css({'text-indent': num_ul * 10, 'padding-left': 0});
  });
});
```


If, however, we use a Markdown cell type, things work properly. This makes sense: Markdown cells pass raw HTML (and JavaScript) through to the output untouched. Change the cell type to Markdown and save. Now you see this properly:

This translates to:

---
title: "Favorite Therapeutics<br>Demo Report"
author: "Chris Friedline ("
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    code_folding: hide
    collapsed: true
    fig_width: 8
    highlight: haddock
    keep_md: false
    theme: cerulean
    toc: yes
    toc_float:
      collapsed: no
    css: DAstyle.css
jupytext:
  formats: ipynb,Rmd
  text_representation:
    extension: .Rmd
    format_name: rmarkdown
    format_version: '1.2'
    jupytext_version: 1.3.1
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

So far, so good. But we also need to set global knitr options in our document as per our convention. How do we do this in the notebook?

Adding R cells

As Markdown

Setting values for R cells can be done in two ways. First, we can use Markdown cells with code fencing (```), as below:
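In the notebook itself, the Markdown cell contains the fenced chunk as literal text — for example, a knitr setup chunk following our convention:

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = F, fig.height=8, fig.width=8, message=F, warning=F, cache=F)
```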

This pairs with a proper r chunk in R Markdown:

<!-- #region -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = F, fig.height=8, fig.width=8, message=F, warning=F, cache=F)
```
<!-- #endregion -->

As Code

If we don’t need other header information, we can also use a Code cell with %%R, which does the same thing.

This pairs with:

```{r}
knitr::opts_chunk$set(echo = F, fig.height=8, fig.width=8, message=F, warning=F, cache=F)
```

Adding a header image

We also include a header image on our reports, as would any self-respecting report writer, and we have self-respect in spades. We do this with a markdown cell, as the code is technically html.

And we get that line in our R Markdown, as expected.

Let’s inspect

At this point, we have the basics in place to render the Jupytext R Markdown sibling document to html. We can do this with the Makefile target knit. Since we are working in the dockerized JupyterLab environment, we don’t need to drop to a local command line; we can use the built-in terminal instead. Click the + in the upper left to launch one.

Change to the work directory (e.g., cd work), and run the target

make knit

If you happened to not use the naming convention above, feel free to use the Rscript command directly, with whatever name you choose.

Rscript -e "rmarkdown::render('YourSillyName.Rmd')"

Assuming everything works correctly, and it should, you should be left with a template.html (or a YourSillyName.html) file. Open that up in a browser and marvel at your accomplishment.

You’ve just written and rendered an R Markdown document in JupyterLab using Jupytext and a Python kernel. Hooray!

Finishing your report

As it turns out, our clients want slightly more information than what’s in that report, so we can go ahead and create a basic structure and write it. I’d be a terrible consultant if I gave away our secrets for free, so, please forgive this trite example.

Let’s knit it again. make knit

Better, right?

What about cell metadata?

In your R Markdown, what if you wanted to leave a code cell out of the report? If you look at the previous report, you’ll see that pesky Code button in there.

The markdown you want looks like this:

```{python echo=FALSE, include=FALSE}
import pandas as pd
```

But how do you get your code cell to have that information? The answer: Cell Metadata.

Each cell has editable metadata. By default, it’s an empty JSON dictionary. Reminder, this is JSON, not Python.

To get those echo and include options in there, select the cell on the right and edit its Cell Metadata dictionary, as shown.
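The dictionary you type is just the chunk options expressed as JSON — note the lowercase booleans, which Jupytext translates to R’s FALSE on the Rmd side:

```
{
    "echo": false,
    "include": false
}
```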

Make sure to click the check mark to embed the metadata into the cell. Once you do that, it will format the JSON. Now, when you save, you should see the translated options in the Rmd.

is now:

```{python echo=FALSE, include=FALSE}
import pandas as pd
```

This also works for those %%R magic cells.


```{r echo=FALSE}
```

After adding those options to the code cell, make knit again and verify that your code cell button is now gone.

What about running code?

So, the whole goal of this exercise is reproducibility, right? Which means we have to run code at some point, right?

Here’s an example of how this might be done.

Consider these cells:

We created four new cells. The first is a Markdown cell with the header # Extra fun. The next two are standard Python Code cells. Finally, there’s a Code cell with R magic. Note that this last cell doesn’t need to be executed in the notebook; it’s there only for the rendering in the report. If you don’t want to see the code cells in the report, don’t forget to set the Cell Metadata. Now, save and make knit.
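As a sketch, the four cells might contain something like the following (hypothetical contents; the `# ---` markers just label where each notebook cell begins):

```
# --- Markdown cell ---
# Extra fun

# --- Code cell 1 (Python, executed in the notebook) ---
import pandas as pd
df = pd.DataFrame({"gene": ["A", "B"], "score": [0.9, 0.1]})

# --- Code cell 2 (Python, executed in the notebook) ---
df.to_csv("scores.csv", index=False)

# --- Code cell with R magic (unexecuted; knitr runs it at render time) ---
%%R
scores <- read.csv("scores.csv")
scores
```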

Awesome, right?


Hopefully, after slogging through this post (which was totally worth it, no?), you have a better understanding of the problem I wanted to try to solve: living in JupyterLab but wanting R Markdown reports.

However, this solution is not perfect. What I’ve shown in this tutorial involves a lot of nuance (and some amount of trial and error). Some cells can be executed (like the Python analysis code cells) and some can’t, and the difference between the two is not always clear. Then there’s the fact that hooking all of this up involves a lot of moving parts. It’s complex, with a high bar to entry. I’m looking forward to the day when the Jupyter publishing ecosystem catches up natively with the R publishing ecosystem — maybe then I’ll be writing shorter tutorials.


Here are the files I mention in this document that I can share:

Come visit us December 10th

We’ll be hosting the venerable Boston Computational Biology and Bioinformatics Meetup in Cambridge, at the Asgard, December 10th. Please come join us for food and drinks and much discussion of all things bioinformatics.

This meetup is a fantastic place to meet folks who are in the field, whether they’re just starting out or long-time veterans. I’ve made tremendously valuable connections there, and I hope to keep doing so. I’ll be there on the 10th, as will several of my Diamond Age colleagues. Please come find us, and let us talk your ear off about what consulting is like.


The challenge of single-cell RNA-seq and differential expression

One of the common analysis tasks we have at Diamond Age is to analyze single cell RNA-seq data. Our customers are largely therapeutics-development biotechs who use this new technology to assess the impact of their development candidates on gene expression in selected cell types. scRNA-seq is a very different beast than its apparent predecessor, bulk RNA-seq. There are gotchas in both the experimental design and analysis of this data that simply didn’t apply to the older technology. One of them relates to appropriate experimental design for differential expression studies.

The problem

Folks often think that when designing a scRNA-seq experiment, they need only collect data from one sample per treatment group to reliably find differences in gene expression. They are then surprised when we tell them that they need multiple biological replicates, even though each replicate provides them with measurements from 1000+ cells of their cell type of interest. A recent Twitter thread started by John Hogenesch (@jbhclock) makes it clear that this misconception is widespread.

Vito Zanotelli (@ZanotelliVRT) summed up the problem rather succinctly:

Vito Zanotelli (@ZanotelliVRT) tweets: People tend to forget that the statistically independent entity of single cell experiments is mostly still the biological sample and not the cells. Distributions/features of cells can be used to calculate properties of that sample that need to be confirmed by replication.

He’s right: the gene expression profiles of the individual cells in a sample aren’t independent measurements. They are more accurately described as repeated measurements of the same sample.

Consider a single patient; we expect that any B cells collected from one patient would have more-similar expression profiles to each other than to B cells collected from another patient. If we dose one patient with drug and the other with vehicle, how do we know that differences in expression between those two patients’ B cells aren’t driven by biological differences between the patients? Short answer: we don’t. We *must* collect data from more than one patient (or animal, or dish of cells) in each treatment group, no matter how many cells we collect from each.

Experimental design

To drive home the point: imagine if we measured blood glucose from a mouse 1000 times. Exsanguination aside, all 1000 of those replicated measurements give us a very good idea of what is happening in that one animal, but they don’t tell us much about the rest of all mouse-kind. In single-cell RNA-seq, the gene expression profiles collected from 1000 different B cells from that mouse are analogous to those glucose measurements.

If we want to figure out how a drug affects B-cells generally across all mice, we must treat multiple mice with the drug, and compare the gene expression of, say, 1000 B-cells from each animal in one treatment group against the profiles of the other group. We treat those 1000 cells as repeated measurements of one animal, or one biological replicate. That means that our N is still counted in animals: three animals means we have three replicates, not 3000.
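The consequence of getting this wrong is easy to demonstrate with a small simulation — pure Python, made-up numbers, one gene, three control mice. Pooling the cells as if they were independent makes us look far more certain about the group mean than the animal-to-animal variation justifies:

```python
import random
import statistics

random.seed(42)

# Three mice, 1000 cells each. Each mouse has its own baseline expression
# (biological variation, sd = 1.0); its cells scatter around that baseline
# (cell-level variation, sd = 1.0).
mice = []
for _ in range(3):
    mouse_mean = random.gauss(5.0, 1.0)
    mice.append([random.gauss(mouse_mean, 1.0) for _ in range(1000)])

# Wrong framing: pool all 3000 cells as if each were an independent sample.
pooled = [x for mouse in mice for x in mouse]
sem_cells = statistics.stdev(pooled) / len(pooled) ** 0.5

# Right framing: one summary value per animal; n is counted in mice.
animal_means = [statistics.mean(mouse) for mouse in mice]
sem_animals = statistics.stdev(animal_means) / len(animal_means) ** 0.5

# The pooled standard error is tiny, because dividing by sqrt(3000)
# ignores the mouse-level variation that actually dominates.
print(f"SEM treating cells as replicates:   {sem_cells:.3f}")
print(f"SEM treating animals as replicates: {sem_animals:.3f}")
```

The pooled standard error shrinks with the number of cells, but the real uncertainty about the treatment-group mean is set by the number of animals.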

The upshot of this is that a properly-powered single-cell RNA-seq experiment can get quite expensive. As of this writing, the total cost of a scRNA-seq experiment is in the thousands of dollars *per sample*. If we need a minimum of three samples per group (and we do), that’s a hefty price tag. But it’s worth it to get real data.

Analyzing the data

Once we have a well-designed experiment with biological replicates, how do we handle the analysis? Most of the differential expression methods for single-cell analysis are only suitable for within-sample analysis: they treat each cell as an independent measurement, so they can only reliably tell you how one group of cells from a sample compares to another group in that same sample. Applied naively across samples, they count thousands of cells as thousands of independent replicates, and the differential expression tests return improbably low p-values.

One tool that does handle differential expression across multiple samples properly is an R package called MAST. It does this by essentially grouping expression profiles from each sample together, and comparing those groups rather than comparing individual cells, using what’s called a mixed-effects model. It’s quite computationally intensive, but the results are solid. I’d love to hear from folks who have found other tools that do good work on these experimental designs.

Getting the most from the data

One of the hardest things to do in this business is to tell clients that the experiment they ran – the one that cost so much – isn’t going to give them the answers they need. Single-cell RNA-seq experiments are repeat offenders in this space because they are expensive and very new; despite the somewhat familiar name, they are very different beasts than good old bulk-RNAseq.

We hate seeing good mice (or even cell lines) go to waste. Reach out if you’d like to chat about experimental design and making sure your investment pays off.