Sometimes I am working on some data and notice certain biases, say differentially expressed genes appearing to originate more often from a chromosome. Or a factor binding more often to a class of transcripts. In these situations I tend to turn to Fisher’s exact test. Here I will put an example of what I do.
Get some data
For the sake of simplicity I will use data from the Pasilla1 dataset and run differential gene expression analysis with DESeq22 following the vignette’s instructions. The data.table package is used because I like it’s speed and syntax, specially for sub-setting. Since I am still learning all the ins and outs of it, I mix data.frame code with it. Whatever works.
While loading the pasillaGenes R was throwing an error at me:
Warning: namespace ‘DESeq’ is not available and has been replaced
It turns out that DESeq needed to be installed to load the data. No idea why. So let’s get us some results:
Add chromosome information
Now the fun starts. I created a data.table with the results plus gene IDs, And will now added some extra information with biomaRt. I have only added chromosome and biotype, but the amount of information one can add is large. I usually also add gene symbol and description. Very useful to have a quick idea of the function of a particular gene. Adding the full gene location might be useful to later on subset data and easily create a bed file.
Now let’s have a look at down-regulated genes with padj < 0.1.
Most of these appear to be coding genes. Not a great surprise. But… Oh! There are a lot of genes located in chromosome 2L. Is this a coincidence? Let us test it.
Test for over-representation
Chromosome 3L
Expected number of hits in chromosome 3L would be 71.8440912 and we have 61. Is this significant? I will use Fisher’s exact test, as applied here and here for very similar problems.
Firstly I construct a table/matrix with the events and from that calculated Fisher’s Exact test, or the probability of having more genes in chromosome 3L than expected by chance. There are more details here.
So there is a p-value of 0.9349979 and we can reject the hypothesis that there is enrichment for or genes in 3L. My eyes were seeing patterns where this is none. Also, since I was testing only for over-representation, or enrichment, the option alternative="greater" was used in the test. Other options are available.
For all chromosomes
Is this this the case for any of the other chromosome? I will construct a function, and then loop over the chromosomes.
The function chrEnrichment is minimal and it only tests for down-regulated gens, but could be easily extended to add other arguments. It could also be used to test for biases in gene biotype.
Since I am testing for all chromosomes, I have also calculated the adjusted p-value (bonferroni) to be on the safe side.
And what does Fisher say?
That we have a pretty much random distribution of down-regulated genes in the fly chromosomes.