Moving to the Dark Side

Leaving the Pipette for a Keyboard.

My visual CV

TLDR; I made a visual timeline of my career (CV) completely in R. Here I will show how it was done, and in the process how to hack ggplot objects, add images to plots, and use pretty much any font in your ggplots.

This is the final result, and if you are interested on how it was generated, read on.

Motivation

Recently I gave a presentation with an overview of my career, focusing on data analysis / bioinformatics. As my path to data analysis has not been straightforward, I wanted walk the audience through my career as we went along the several stages. Because I am someone who generally takes pride on giving good presentations, and this was about me after all, and I wanted it to be:

  • Clean and with an elegant design
  • Rich in features
  • Playful
  • With a wink and a nod to data analysis and visualization

Basically me in a plot.

On top of that, I wanted to showcase my coding skills, so whatever I presented had to be done 100% with code. No post-production in Inkscape. In keeping with this mantra, the slide-deck itself was done with LaTeX.

Because the plot is so intrinsically linked with the presentation itself, as I explain the how the plot was done, I will include my though process on certain choices and how that is linked with the presentation [1].

Framework

Important for me, due to the way my career as paned out so far, and the goal of the presentation was to [2]:

  • depict my journey from wet lab to data analysis, with increasing amounts of time spent analyzing data.
  • include a rough timeline
  • highlight milestones such as academic publications and degrees
  • demonstrate how international that path has been

Initially I thought of using a Gantt chart as those are the go to visual representations for timelines and there are several packages available [3]. However, after prototyping with pen and paper, I decided it would be too hard to incorporate all the features I wanted without cluttering the plot. Basically, it didn’t feel right.

Luckily inspiration was at hand in one of my favorite blogs, by an excellent data scientist and communicator, which just happened to have a similar path to data science as my own. Her visual CV is hand drawn, and looks great, but I made it a challenge to re-create that playfulness, along with the richness of information, with code (also I am terrible at drawing/painting).

Set-up

First step is to load the libraries I’ll be using:

  • tidyverse, for general data wrangling
  • lubridate, fancy functions to work with dates
  • data.table, rolling joins and general awesomeness
  • ggplot2, plots et al.
  • ggthemes, I don’t think this was actually used.
  • scales, functions to add some spice to ggplot axis
  • cowplot, add images to plots
  • extrafont, use of system fonts
library("tidyverse")
library("lubridate")
library("data.table")
library("ggplot2")
library("ggthemes")
library("cowplot")
library("extrafont")
extrafont::loadfonts(quiet = TRUE)
library("bib2df")
library("rcrossref")

My CV in a table

The data itself is structured in a straightforward way with variables for my roles, dates, location, and a mock data analysis ratio (how much data analysis was part of my role). Because this is supposed to a cartoonish overview of my CV, not a rigorous representation, the data analysis ratio is close to reality but I won’t swear it is strictly correct.

antonio <- fread("/home/adomingu/Documents/personal/CV/visual_cv/timeline_cv.tsv") %>%
  melt(measure.vars = c("Start", "End"), variable.name = "Timpepoint", value.name = "Date") %>%
  .[, Date := as.IDate(Date)]

antonio %>%
  head() %>%
  knitr::kable()
Role Place Type Country da_ratio Year Timpepoint Date
BSc UA Student Portugal 0.2 2001 Start 1997-09-01
MSc UA Student Portugal 0.3 2005 Start 2002-09-01
Technician CNC Service Portugal 0.1 2006 Start 2005-05-01
PhD University of Leicester Student UK 0.4 2009 Start 2006-02-01
Postdoc MPI-CBG Researcher Germany 0.6 2013 Start 2009-10-01
Outreach Internet Package World 1.0 2020 Start 2020-07-15

The Canvas

With the events of my CV in tabular format I could get started with plotting. Key features of the plot:

  1. A theme with as few visual elements as possible, and attention to detail
  2. The font carefully selected (luckily I have dozens of fonts installed in my system)
  3. Addition of small pictograms

The first and second items were achieved with a custom theme and the package extrafont. extrafont allows us to use pretty much any system font in R plots. There a couple of steps that need to be done before system font can be used:

  • The first is extrafont::font_import() to detect and import system fonts. This needs to be done only once after installing the package, or any time a new font installed. This can take several minutes depending on how many system fonts your system has. Mine has a LOT.
  • The second step is to load the system fonts in the current R session to make them available, extrafont::loadfonts(quiet = TRUE)
extrafont::font_import()
extrafont::loadfonts(quiet = TRUE)

fonttable() %>%
  filter(stringr::str_detect(FullName, "Annie"))  

You can also use the third snippet of code make sure the desired font is indeed available in R.

The theme is a variation of theme_light to which I removed as many elements as I thought necessary give it that clean aspect. Mind you this is not for data visualization but rather to be aesthetically pleasing (ideally one should be able to combine both). The is also at this stage that the custom font was defined. By creating a theme in this manner we don’t need to add these lines to every plot.

theme_set(theme_light())
theme_update(
    text = element_text(family = "Annie Use Your Telescope", size = 12),
    legend.title=element_text(size=6), 
    legend.text=element_text(size=6),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.line.x = element_line(size = 1, linetype = "solid", arrow = arrow(length = unit(0.1, "inches"), type = "closed")),
    panel.grid = element_line(colour = "black", linetype = "dashed", size = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_line(colour = "gray", linetype = "dotted", size = 0.5),
    panel.border = element_blank()
  )

With the theme set I could produce the first, and very basic, plot:

p <- ggplot(antonio, aes(x = Date, y = da_ratio)) +
  geom_smooth(method = "loess", color = "black", size = 1.5, span = 1, se = FALSE) +
  xlim(min(antonio$Date), as.IDate("2020-12-01")) +
  ylim(0, 1) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "#c72e29", size = 1) +
  annotate(
    geom = "text", x = min(antonio$Date) %m+% months(6), y = 0.35, label = "Wet lab",
    color = "red", size = 5, family = "Annie Use Your Telescope"
  ) +
  annotate(
    geom = "text", x = as.IDate("2012-12-01"), y = 0.9, label = "Data analysis",
    color = "red", size = 5, family = "Annie Use Your Telescope"
  ) 

p

The data science trajectory line is a smooth line, result of a linear model fitted to the time points in the CV. The model was chosen based on the very accurate statistical measure of “being easy on the eye” - smooth and in an upward trajectory after a particular time point.

To add the small icons, I ended up using magick (via cowplot) which I could only install with conda (well, mamba), due to missing dependencies. I did try the R way with install.packages("magick") but gave up after the second error. Life is too short.

The icons themselves were downloaded from the Noun Project which has tons of free to use icons (with attribution). I ended up paying for these because I liked them so much, but nevertheless I will credit the author, Tippawan Sookruay.

As these images will be included in several versions of the plot, I wrapped the code in a function to save typing:

add_img <- function(gg){
  ggdraw(gg) +
  draw_image("/home/adomingu/Documents/personal/CV/visual_cv/figure/noun_pipette_3384602.svg.png", scale = 0.1, x = 0.09, y = 0.001, hjust = 0.52, vjust = 0.25) +
  draw_image("/home/adomingu/Documents/personal/CV/visual_cv/figure/noun_chart_3384933.svg.png", scale = 0.1, x = 0.4, y = 0.25)
}

p_save <- add_img(p)

ggsave("../assets/img/timeline_icons.png", p_save, dpi = 600, width = 10, height = 6, units = "cm")

Timeline with icons

So now the plots has the basics:

  • A timeline on the x-axis
  • rough % of data analysis as part of my job
  • and a couple of cool looking icons to show how it all started and it is going

In technical terms, the most difficult part was to add the icons, and in particular placing them in the right location. Not only did it take a lot of trial an error, it was only just right once saved as a file (pdf). I wanted to have a wide plot to fit with the slides, and the plotting window is square, so some eye-balling was required. I tried a few other packages (patchwork for example), but in the end magick + cowplot was the best option for my purposes.

But we are not done yet: career milestones and cities I lived in are still missing.

Publications

Until now I have worked in fairly academic milieu where one’s achievements are measured mostly by publications. whilst I don’t particularly agree with this[4], it is a decent proxy for output which is what I was looking to represent. Other fields will have their own metrics of productivity (number of sales, features shipped, etc).

For full reproducibility, I tried to get my publications list from pubmed or ORCID but ran into problems. The issue with pubmed queries is that my name follows Portuguese naming conventions, and each publisher seems to ignore what I write in the provided forms and makes up a different version of my name. Not helping matters is that my first and last names are fairly common, so there were a lot of false positives in the pubmed queries. I could retrieve some of the variables I needed from ORCID but publications dates were missing for some paper.

Luckily I keep all my publications in a bibtex file exported from zotero and ended up resorting to that. The publication data was extracted with the help of bib2df to read in the bibtex file, and rcrossref to retrieve publications dates which were incomplete or missing in the bibtex file.

# ==========================================================================
# Get publication list
# ==========================================================================
path <- "/home/adomingu/Documents/personal/CV/Awesome-CV/examples/resume/AD_papers.bib"

df <- bib2df(path)
setDT(df)
my_dois <- df[!is.na(DOI)]$DOI

# my_dois <- rorcid::identifiers(rorcid::works("0000-0002-1803-1863"))
pubs <- rcrossref::cr_works(dois = my_dois)$data %>%
  setDT() %>%
  .[, c("title", "type", "container.title", "published.print", "published.online", "deposited")] %>%
  .[, Date := ifelse(!is.na(published.print), published.print, published.online)] %>%
  .[, Date := ifelse(!is.na(Date), Date, deposited)] %>%
  .[, Date := ifelse(nchar(Date) == 4, paste(Date, "03-01", sep = "-"), Date)] %>%
  .[, Date := ifelse(nchar(Date) == 7, paste(Date, "01", sep = "-"), Date)] %>%
  .[, Date := as.IDate(Date)]

pubs %>%
  head() %>%
  knitr::kable()
title type container.title published.print published.online deposited Date
Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production journal-article Cell 2020-05 NA 2020-05-14 2020-05-01
The IDR-containing protein PID-2 affects Z granules and is required for piRNA-induced silencing in the embryo posted-content NA NA NA 2020-04-18 2020-04-18
Extensive nuclear gyration and pervasive non-genic transcription during primordial germ cell development in zebrafish posted-content NA NA NA 2020-01-13 2020-01-13
Glia as transmitter sources and sensors in health and disease journal-article Neurochemistry International 2010-11 NA 2019-05-28 2010-11-01
FK506 prevents mitochondrial-dependent apoptotic cell death induced by 3-nitropropionic acid in rat primary cortical cultures journal-article Neurobiology of Disease 2004-12 NA 2019-02-03 2004-12-01
White matter synapses: Form, function, and dysfunction journal-article Neurology 2011-01-25 2011-01-24 2018-02-09 2011-01-25

Once that was done, I could proceed to add the publications has milestones in the plot - one point per paper.

Not so fast.

Turns out that when that smooth line was created, a linear model fit on the original time points, I lost the reference points, ratio of data analysis, on the y-axis. So now I also need to estimate the Y positions in the plot.

ggplot hacking

I have done enough custom plots, and spent enough hours googling how to do it, to know that ggplot objects store a ton of information which can be accessed and modified (at your own risk). So what I did was to extract the x and y coordinates of the fitted line from the ggplot object.

This was done in several steps:

  • re-run the linear model (spline)
  • plot the fitted line
  • extract the coordinates from the plot (ggplot_build(p_smooth)$data[[1]])
#
# Retrieve the smooth line (dummy timeline) to add points of interest
# --------------------------------------------------------------------------
spline_int <- as.data.frame(spline(antonio$Date, antonio$da_ratio))

p_smooth <- ggplot(antonio, aes(x = Date, y = da_ratio)) +
  geom_smooth(method = "loess", color = "black", size = 1.5, span = 1, se = FALSE) +
  xlim(min(antonio$Date), as.IDate("2020-12-01")) +
  ylim(0, 1)

smooth_df <- ggplot_build(p_smooth)$data[[1]] %>% 
  setDT() %>%
  .[, Date := as.IDate(x)] %>%
  .[]

smooth_df %>%
  head() %>%
  knitr::kable()
x y flipped_aes PANEL group colour fill size linetype weight alpha Date
10105.00 0.2179275 FALSE 1 -1 black grey60 1.5 1 1 0.4 1997-09-01
10211.34 0.2120627 FALSE 1 -1 black grey60 1.5 1 1 0.4 1997-12-16
10317.68 0.2067649 FALSE 1 -1 black grey60 1.5 1 1 0.4 1998-04-01
10424.03 0.2020354 FALSE 1 -1 black grey60 1.5 1 1 0.4 1998-07-17
10530.37 0.1978752 FALSE 1 -1 black grey60 1.5 1 1 0.4 1998-10-31
10636.71 0.1942856 FALSE 1 -1 black grey60 1.5 1 1 0.4 1999-02-14

Now that I have the coordinates, these need to be matched with the publication dates. However, the x-axis values (dates) of the fitted plot are intervals which do not match the CV dates. To solve that problem, as nearly any data manipulation problem, data.table and rolling joins came to the rescue. The goal was to find the closest time point of my CV in the fitted plot.

## find the closest date of publications in the dummy timeline
pubs_smooth <- smooth_df[pubs, roll = "nearest", on = "Date"] %>%
  .[, new_date := as.IDate(x)] %>%
  .[, Date:= NULL] %>%
  setnames("new_date", "Date") %>%
  .[, c("Date", "y", "title")] %>%
  .[Date <= as.IDate("2012-01-01"), Publications := "Experimental"] %>%
  .[Date >= as.IDate("2012-01-01"), Publications := "Analyst"] %>%
  .[, Publications := ifelse(str_detect(title, "NXF1"), "Both", Publications)] %>%
  .[]

pubs_smooth %>%
  head() %>%
  knitr::kable()
Date y title Publications
2020-05-17 0.9417646 Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production Analyst
2020-05-17 0.9417646 The IDR-containing protein PID-2 affects Z granules and is required for piRNA-induced silencing in the embryo Analyst
2020-02-01 0.9358999 Extensive nuclear gyration and pervasive non-genic transcription during primordial germ cell development in zebrafish Analyst
2010-10-08 0.5528694 Glia as transmitter sources and sensors in health and disease Experimental
2004-12-11 0.2448564 FK506 prevents mitochondrial-dependent apoptotic cell death induced by 3-nitropropionic acid in rat primary cortical cultures Experimental
2011-01-22 0.5719478 White matter synapses: Form, function, and dysfunction Experimental

Some of that information is not necessary but I kept it anyway to show how it looks.

A similar join was done for some key outreach dates that will also be highlighted in the final plot.

## same for outreach activities
outreach <- antonio[Role == "Outreach" & Timpepoint == "Start"]

outreach_smooth <- smooth_df[outreach, roll = "nearest", on = "Date"] %>%
  .[, new_date := as.IDate(x)] %>%
  .[, Date:= NULL] %>%
  setnames("new_date", "Date") %>% 
  .[, c("Date", "y", "Type")] %>%
  .[]

Annotate with academic degrees

I could have used points (geom_point) with different color/shape to highlight my academic degrees, but these belong to a different category of events than publications or other more technical milestones / achievements, and would add too much information to the lines anyway. As an alternative I decided to use arrows and text to indicate when those degrees where obtained.

Again there is nothing magic or novel about this, in fact I took inspiration from Cedric Scherer, except the need for trial and error to get the locations correct. The positions of the arrows were put on a tibble to make it easier to plot later on:

#
# plot annotations - academic points of interest
# --------------------------------------------------------------------------
arrows <- tibble(
    x1 = c(as.IDate("2000-09-01"), as.IDate("2004-12-01"), as.IDate("2010-09-01")),
    x2 = c(as.IDate("2001-09-01"), as.IDate("2005-05-01"), as.IDate("2009-08-01")),
    y1 = c(0.08, 0.15, 0.35),
    y2 = c(0.13, 0.20, 0.40)
  )
arrows
## # A tibble: 3 x 4
##   x1         x2            y1    y2
##   <date>     <date>     <dbl> <dbl>
## 1 2000-09-01 2001-09-01  0.08  0.13
## 2 2004-12-01 2005-05-01  0.15  0.2 
## 3 2010-09-01 2009-08-01  0.35  0.4

Remember that in the plots show in the X-window, the arrows, text annotations, and icons where not in the correct locations. I kept saving the plots with the final dimensions to see if everything is in it’s right place. Saving as pdf gave me the best results vs saving as png. This is the same reason why I am saving the plots and linking to them in the Rmd rather than letting rmarkdown do all the work for me.

This is the final version:

p_complete <- p +
  geom_jitter(data = pubs_smooth %>% arrange(Publications), aes(x = Date, y = y, fill = Publications), size = 2, alpha = 0.8, stroke = 1, shape=21) +
  geom_point(data = outreach_smooth, aes(x = Date, y = y), fill = "#e37e00", size = 2, alpha = 0.7, stroke = 1, shape=23) +
  scale_colour_stata() +
  scale_fill_stata() +
  guides(
    fill = guide_legend(reverse=TRUE)
  ) +
  labs(
    fill = "Publications:"
  ) +
  annotate(
    "text", x = as.IDate("1999-09-01"), y = 0.08, color = "gray20", lineheight = .9, 
    label = "BSc", size = 4, family = "Annie Use Your Telescope") + 
  annotate(
    "text", x = as.IDate("2004-11-01"), y = 0.12, color = "gray20", lineheight = .9, 
    label = "MSc", size = 4, family = "Annie Use Your Telescope") + 
  annotate(
    "text", x = as.IDate("2010-10-01"), y = 0.3, color = "gray20", lineheight = .9, 
    label = "PhD", size = 4, family = "Annie Use Your Telescope") + 
  geom_curve(
    data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2),
    arrow = arrow(length = unit(0.07, "inch")), size = 0.4,
    color = "gray20", curvature = 0.3
  ) +
  theme(
    legend.direction = "horizontal",
    legend.position = c(1, 0.05),
    legend.justification = c(1, 0))

p_save <- add_img(p_complete)
ggsave("../assets/img/timeline_complete.png", p_save, dpi = 300, width = 10, height = 6, units = "cm")

timeline complete

What is the information contained in the figure:

  • icons depicting a start as a bench scientist and the current status as a data analyst
  • a line showing a smooth upward trajectory towards data science, but
  • data science only took over bench work (red line) at some point after my PhD
  • milestones (papers) are also depicted as round points, and in keeping with my trajectory, there are some to which I contributed only lab work, others with data analysis (one with both!).
  • the milestones, the diamond shaped points, are to be described in the presentation (blog, R package maintainer).

I hope you can appreciate that this plot contains a lot of information, but doesn’t feel cluttered. Importantly, this is not meant to be a self-contained, self-explanatory piece of data visualization as I will be guiding the audience through the pieces of information and revealing them only when needed.

Getting there

Well, nearly done. There are still a couple of modifications to be done for the presentation:

  1. No legend because I wanted to talk about the milestones during the presentation without giving way what the points meant too early;
  2. Indication of the places of work is still missing.

The latter is crucial to the presentation because I will be highlighting each place of work, and follow that up with slides describing my role, tasks, and accomplishments. So in effect highlighting the places of work in a stepwise manner is critical to create breaks and setting the tone for each part of the presentation.

Removing the legend is fairly simple (said he after googling how to do for the 1000th time).

## no legend
p_no_leg <- p_complete +
  guides(
    fill = "none"
  )
p_save <- add_img(p_no_leg)
ggsave("../assets/img/timeline_no_legend.png", p_save, dpi = 300, width = 10, height = 6, units = "cm")

timeline complete

To show how I moved around I made used of geom_annotation and shaded boxes to highlight each job, one at the time. Each one of these was then shown as bridge between parts of the talk.

For simplicity I am plotting only one here:

## add countries

p_no_leg <- p_complete +
  guides(
    fill = "none"
  )
add_img(p_no_leg)
# ggsave("timeline_no_legend.pdf", dpi = 600, width = 10, height = 6, units = "cm")

p_pt <- p_no_leg +
  annotate("rect",
    xmin = min(antonio$Date), xmax = as.Date("2006-02-01"), 
    ymin = -Inf, ymax = Inf,  fill = "#d9e6eb", alpha=0.35)
p_save <- add_img(p_pt)

ggsave("../assets/img/timeline_no_legend.pt.png", p_save, dpi = 600, width = 10, height = 6, units = "cm")

timeline complete

Final word

And that was it. It took me quite some time to get it done, including testing packages which were not used, but in the end I am quite happy with the result. It was a lot of work, but I wanted to impress and I did.

Footnotes

[1] Before any presentation I always think about who my audience will be and what is the goal. Only then do I start sketching the slides with pen and paper to organize the ideas into slides. One key message per slide.

[2] For example https://github.com/laresbernardo/lares and https://github.com/daattali/timevis.

[3] It’s fine to use the publication record, but someone career should not be just about that. There’s also teaching, mentoring, outreach, software, etc, which I am not sure it’s always taken into account when evaluating academics. Not that it affects me because I am not looking to climb the academic ladder.

[4] The slide-deck is not available, but if a few people show interest I could put it online.