Making Better Metaplots With ggplot, Part 2

Posted on Fri 28 June 2019 in how-to

Last time we prepared our data using Deeptools.

Now we're going to do something kind of scandalous. R and python, living together in peace. What is this madness? I like R's ecosystem for manipulating data and plotting with the tidyverse. It still requires some tweaking, but with a bit of a time investment you can have publication-ready vector images in only a few lines of code. It's great for genomics data as well. Some out there may prefer matplotlib in python, and it is powerful, but I find it kind of tedious to use without adding another package on-top like Seaborn.

Genomics and data science belong together just like python and R!

1. Investigate Deeptools' Metaplot Output

Your first task with any new data is just to see what it looks like. In a terminal my initial instinct is always to call:

head ${filename}

But you can also just open deeptools metaplot table in your text editor of choice. What you'll find is:

bin labels    -1.0Kb    ...    start    ...    end    ...    1.0Kb
bins        1.0 2.0 3.0 ...
sample_name genes score score score ...
sample_name genes score score score ...

A tab delimited table of bin labels, bin numbers, and scores (data to plot) for each of those bins. This is a rather odd format because it's horizontal, rather than the long format that would be more convenient. We also have a label "genes" in position 2 of the same row as the score data. The bin labels only have 4 values in the whole row. The --upstream, --startLabel, --endLabel, and --downstream values from the computeMatrix step. We can work with this but it's a bit unwieldly.

2. Load Data Into R

Before getting started here make sure you have the tidyverse packages installed:

install.packages('tidyverse')

There are no built-in functions to read a "transposed tsv" file like this, but with a little googling this turned out to not be so bad. My original thought was to read it as a standard .tsv file with read_tsv() or base read.csv() and transpose with t() but these didn't like the data. This is because of the need to keep that first row exactly as it appears, despite most of it technically being empty, we'll need the blank labels later. So, from that stackoverflow post I was able to edit a few things:

read_deeptools_table <- function(file) {

  n <- max(count.fields(file, sep = '\t'), na.rm = TRUE)
  x <- readLines(file)

  .splitvar <- function(x, sep, n) {
    var <- unlist(strsplit(x, split = sep))
    length(var) <- n
    return(var)
  }

  x <- do.call(cbind, lapply(x, .splitvar, sep = '\t', n = n))
  x <- apply(x, 1, paste, collapse = '\t')
  plot_table <- na.omit(read.csv(text = x, sep = '\t')[-1,])  # Remove first row with "gene" label

  return(plot_table)
}

Essentially, this function is finding the length of the lines, reading the lines as character vectors, splitting the vectors by the tab character, and creating a new table from the vectors. Reading the data with this gives us a nice dataframe. From here on I will be using the tidyverse packages so feel free to load them with library('tidyverse').

> table_test <- read_deeptools_table('metaplot.tab')
> as_tibble(table_test)
# A tibble: 600 x 4
   bin.labels  bins sample_1            sample_2
   <fct>      <dbl> <fct>               <fct>
 1 -1.0Kb         1 0.7382198952879583  0.008900523560209424
 2 ""             2 0.9565445026178011  0.007329842931937172
 3 ""             3 0.9879581151832458  0.008376963350785341
 4 ""             4 0.8026178010471204  0.005235602094240838
 5 ""             5 0.7968586387434556  0.0031413612565445023
 6 ""             6 0.593717277486911   0.005235602094240838
 7 ""             7 0.36230366492146604 0.004712041884816754
 8 ""             8 0.5392670157068064  0.0
 9 ""             9 0.9617801047120418  0.0020942408376963353
10 ""            10 1.403664921465969   0.01099476439790576

3. Convert the Data to Long Format

A quirk of ggplot is that it really likes long format data. Where, instead of separate columns for the different samples you end up with a column of "scores" and another of "sample_id." This means that your sample ID is actually a variable and can be plotted. This results in a new data frame which concatenates the current sample columns into one, replicates bin.labels and bins as needed, and creates a new column with the sample ID for each row. The easiest way to do this is with the gather() function in tidyr:

long_table <- gather(plot_table, 'sample', 'score', -bin.labels, -bins)

You can check out what this looks like as follows:

> head(long_table)
  bin.labels bins   sample              score
1     -1.0Kb    1 sample_1 0.7382198952879583
2               2 sample_1 0.9565445026178011
3               3 sample_1 0.9879581151832458
4               4 sample_1 0.8026178010471204
5               5 sample_1 0.7968586387434556
6               6 sample_1  0.593717277486911
> tail(long_table)
     bin.labels bins     sample                score
1195             595 sample_2 0.005759162303664922
1196             596 sample_2                  0.0
1197             597 sample_2 0.006282722513089005
1198             598 sample_2 0.017277486910994764
1199             599 sample_2 0.020942408376963352
1200      1.0Kb  600 sample_2 0.012565445026178013

This would be annoying to work with by hand, but ggplot2 understands it just fine.

4. Build the ggplot Command

Now that our data is in the right format it's time to get plotting! We'll start simple, and make it more complex from there:

plot <- ggplot(long_table, aes(x = bins, y = as.numeric(score), color = sample)) +
  geom_line() +
  scale_x_continuous(breaks = long_table$bins,
                     labels = long_table$bin.labels)

This will get us started with a simple line plot. The key here is that the x axis breaks are the bin numbers, but are labeled as the bounds, start, and end of the features. However, this also creates a major gridline at each break. Not ideal. I have some sensible plot defaults that I use often, which I have saved to a code snippet on my github. I'll use these as a starting point for the theming. Another feature we may want to add is the ability to smooth the line. This can be accomplished by using:

geom_smooth(method = 'loess', se = FALSE)

This will smooth the data with loess regression. The amount of smoothing can be configured with the span = ... parameter in geom_smooth(). You'll also want to control the size of the plot when it's saved, and perhaps stretch or shrink its aspect ratio. This can also be controlled by ggplot2 using ggsave() at the end of our plotting command. We also will want to add the ability to specify the colors rather than just using the defaults. It's best to use a colorblind-friendly palette when possible. Putting this all together, our plot command becomes:

metaplot <- function(long_table, start_label, end_label, y_axis_label, span,
                     out_prefix, format, smooth, line, aspect, width, height,
                     colors) {

  start_bin <- subset(long_table, bin.labels == start_label)$bins
  end_bin <- subset(long_table, bin.labels == end_label)$bins

  plot <- ggplot(long_table, aes(x = bins, y = as.numeric(score), color = sample))
  if (smooth == TRUE) plot <- plot + geom_smooth(method = 'loess',
                                                 span = span,
                                                 se = FALSE)
  if (line == TRUE) plot <- plot + geom_line()
  plot <- plot + scale_color_manual(values = unlist(strsplit(colors, ','))) +
    scale_x_continuous(breaks = long_table$bins,
                       labels = long_table$bin.labels) +
    geom_vline(xintercept = c(start_bin, end_bin), linetype = 'dotted') +
    ylab(y_axis_label) +
    xlab('Position') +
    theme_bw(base_size = 22) +
    theme(legend.title = element_blank(),
          legend.position = 'bottom',
          legend.direction = 'horizontal',
          legend.margin = margin(0,0,0,0),
          legend.box.margin = margin(-10,-10,-10,-10),
          axis.text = element_text(color = 'black'),
          axis.ticks.x = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          aspect.ratio = aspect) +
    ggsave(paste0(out_prefix, '.', format),
           width = width,
           height = height)
}

With a few if statements we actually can plot both the smoothed and line plots on the same coordinate system if we want. Let's test it with some data small RNA expression data from my 2018 paper over a set of genomic features:

I think we can agree that this is an improvement. This could still be improved by showing error when replicates are plotted, but it's pretty good for now.

Wrapping up

While this required a little patience, I think the results are worth it. Creating clean visualizations is necessary to get your point across. I've tidied this up a bit and pushed the full code to github. Bonus: it runs as a standalone script and works with any number of input samples! Check it out!