Markers Annotation¶

Each cluster is associated with a list of marker genes that we now need to annotate in order to biologically identify the cell clusters. We will start by annotating the gene identifiers with Biomart and then use functional enrichment methods to find the functions shared by the marker genes.

Via Biomart¶

We will use the dataframe we generated at the beginning of the analysis to allow us to add the gene name and a description for each gene ID set.

## Merge markers results with biomart annotation
pbmc_markers_annotated <- merge(x = pbmc_markers,         #First df to merge
                                y = annotated_hg19,       #Second df to merge
                                by.x = "gene",            #Column name of first df used for matching lines
                                by.y = "ensembl_gene_id", #Column name of second df used for matching lines
                                all.x = TRUE)             #Keep all lines from first df even if there is no match with second df

We will now remove all markers where the adjusted p-value is greater than a 5% threshold and with the absolute value of the mean log(Fold Change) less than 0.25 in order to obtain the list of markers with a significant expression differential.

pbmc_markers_signif <- subset(pbmc_markers_annotated,
                              p_val_adj < 0.05 &
                                abs(avg_log2FC) >= 0.25)       #Filter dataframe based on p_val_adj column

## Number of significative DEG per cluster
table(pbmc_markers_signif$cluster)

## 
##    0    1    2    3    4    5    6    7    8 
##  688 1000  403  554  280  975  453  430  296

## Sorting results by cluster and by average log2(Fold Change)
pbmc_markers_signif <- pbmc_markers_signif %>%                 #Rearrange df with dplyr package
  group_by(cluster) %>%                                        #Group df based on cluster column
  arrange(desc(pct.1), .by_group = TRUE)                  #Sort lines by descending the column avg_log2FC and by group

## Filter to retrieve meaningful genes
markers_display <- pbmc_markers_signif %>% 
  filter((pct.1 - pct.2) >= (0.25 * pct.1) & pct.1 >= 0.5) %>%
  slice_max(n = 3, order_by = avg_log2FC, with_ties = FALSE)

## Most DE gene marker for each cluster
paged_table(markers_display)

First annotated markers for each cluster

gene	avg_log2FC	pct.1	pct.2	cluster	external_gene_name	description	gene_biotype	chromosome_name
ENSG00000142546	1.261953	0.663	0.354	0	NOSIP	nitric oxide synthase interacting protein [Source:HGNC Symbol;Acc:17946]	protein_coding	19
ENSG00000111716	1.179308	0.929	0.592	0	LDHB	lactate dehydrogenase B [Source:HGNC Symbol;Acc:6541]	protein_coding	12
ENSG00000167286	1.061714	0.853	0.415	0	CD3D	CD3d molecule, delta (CD3-TCR complex) [Source:HGNC Symbol;Acc:1673]	protein_coding	11
ENSG00000143546	6.683076	0.971	0.119	1	S100A8	S100 calcium binding protein A8 [Source:HGNC Symbol;Acc:10498]	protein_coding	1
ENSG00000163220	6.215798	0.998	0.212	1	S100A9	S100 calcium binding protein A9 [Source:HGNC Symbol;Acc:10499]	protein_coding	1
ENSG00000170458	5.984897	0.666	0.027	1	CD14	CD14 molecule [Source:HGNC Symbol;Acc:1628]	protein_coding	5
ENSG00000116824	1.563939	0.635	0.254	2	CD2	CD2 molecule [Source:HGNC Symbol;Acc:1639]	protein_coding	1
ENSG00000168685	1.408361	0.733	0.336	2	IL7R	interleukin 7 receptor [Source:HGNC Symbol;Acc:6024]	protein_coding	5
ENSG00000008517	1.384736	0.949	0.473	2	IL32	interleukin 32 [Source:HGNC Symbol;Acc:16830]	protein_coding	16
ENSG00000247982	7.168087	0.561	0.010	3	LINC00926	long intergenic non-protein coding RNA 926 [Source:HGNC Symbol;Acc:27514]	lincRNA	15
ENSG00000105369	6.865916	0.939	0.041	3	CD79A	CD79a molecule, immunoglobulin-associated alpha [Source:HGNC Symbol;Acc:1698]	protein_coding	19
ENSG00000100721	6.590071	0.625	0.022	3	TCL1A	T-cell leukemia/lymphoma 1A [Source:HGNC Symbol;Acc:11648]	protein_coding	14
ENSG00000113088	4.668454	0.582	0.050	4	GZMK	granzyme K (granzyme 3; tryptase II) [Source:HGNC Symbol;Acc:4711]	protein_coding	5
ENSG00000161570	3.478015	0.968	0.223	4	CCL5	chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]	protein_coding	17
ENSG00000145220	2.608064	0.576	0.127	4	LYAR	Ly1 antibody reactive [Source:HGNC Symbol;Acc:26021]	protein_coding	4
ENSG00000129757	5.643867	0.519	0.009	5	CDKN1C	cyclin-dependent kinase inhibitor 1C (p57, Kip2) [Source:HGNC Symbol;Acc:1786]	protein_coding	11
ENSG00000188290	5.060087	0.589	0.020	5	HES4	hes family bHLH transcription factor 4 [Source:HGNC Symbol;Acc:24149]	protein_coding	1
ENSG00000166927	4.377772	0.810	0.074	5	MS4A7	membrane-spanning 4-domains, subfamily A, member 7 [Source:HGNC Symbol;Acc:13378]	protein_coding	11
ENSG00000115523	6.159853	0.968	0.131	6	GNLY	granulysin [Source:HGNC Symbol;Acc:4414]	protein_coding	2
ENSG00000100453	5.924214	0.955	0.068	6	GZMB	granzyme B (granzyme 2, cytotoxic T-lymphocyte-associated serine esterase 1) [Source:HGNC Symbol;Acc:4709]	protein_coding	14
ENSG00000143185	5.605621	0.551	0.021	6	XCL2	chemokine (C motif) ligand 2 [Source:HGNC Symbol;Acc:10646]	protein_coding	1
ENSG00000132386	7.901006	0.500	0.002	7	SERPINF1	serpin peptidase inhibitor, clade F (alpha-2 antiplasmin, pigment epithelium derived factor), member 1 [Source:HGNC Symbol;Acc:8824]	protein_coding	17
ENSG00000179639	7.630987	0.812	0.011	7	FCER1A	Fc fragment of IgE, high affinity I, receptor for; alpha polypeptide [Source:HGNC Symbol;Acc:3609]	protein_coding	1
ENSG00000132514	6.084949	0.688	0.015	7	CLEC10A	C-type lectin domain family 10, member A [Source:HGNC Symbol;Acc:16916]	protein_coding	17
ENSG00000204424	12.573630	0.714	0.000	8	LY6G6F	lymphocyte antigen 6 complex, locus G6F [Source:HGNC Symbol;Acc:13933]	protein_coding	6
ENSG00000005961	11.632991	0.857	0.002	8	ITGA2B	integrin, alpha 2b (platelet glycoprotein IIb of IIb/IIIa complex, antigen CD41) [Source:HGNC Symbol;Acc:6138]	protein_coding	17
ENSG00000127920	11.064293	1.000	0.010	8	GNG11	guanine nucleotide binding protein (G protein), gamma 11 [Source:HGNC Symbol;Acc:4403]	protein_coding	7

## Generate feature plots and stock them into variable
plots <- FeaturePlot(pbmc_small,                                                                #SeuratObject
                     features = top_n(x= pbmc_markers_signif, n = 1, wt = avg_log2FC)$gene,     #Vector of genes to plot
                     cols = c("yellow", "red"),                                                 #Change color
                     label = TRUE,                                                              #Plot ident position
                     combine = FALSE,                                                           #Return list of plot objets instead of a combined plot (easier to process)
                     repel = TRUE)                                                              #Avoid label overlap

## Add gene name as title
plots <- lapply(plots,                                                                          #List of plots
                add_title_gene_name,                                                            #Function to apply to the list
                gene_format = annotated_hg19)                                                   #Fill in the dataframe parameter

## Plot list of plots
grid.arrange(grobs = plots)

## Generate violin plots and stock them into variable
vln_plots <- VlnPlot(pbmc_small,                                                                #SeuratObject
                     features = top_n(x= pbmc_markers_signif, n = 1, wt = avg_log2FC)$gene,     #Vector of genes to plot
                     combine = FALSE)                                                           #Return list of plot objets instead of a combined plot (easier to process)

## Add gene name as title
vln_plots <- lapply(vln_plots,                                                                  #List of plots
                    add_title_gene_name,                                                        #Function to apply to the list
                    gene_format = annotated_hg19)                                               #Fill in the dataframe parameter

## Remove unecessary legend
vln_plots <- lapply(vln_plots, function(plot){
  plot <- plot + theme(legend.position = "none")
  return(plot)
})

## Plot list of plots
grid.arrange(grobs = vln_plots)

We now have each Ensembl gene ID set associated with a gene name and a description to help us identify the gene lists. This is easy if you know the theory of the biology of your system but if you don't know enough about the genes identified as markers, the enrichment methods will help you.

Via ClusterProfiler¶

To understand the relationship between genes specific to our clusters we can use functional enrichment methods. There are two types of functional enrichment methods:

Over-Representation Analysis methods which are based on a ratio between the number of marker genes present in a functional gene set and the total number of genes present in this gene set.
Gene Set Enrichment Analysis (GSEA) methods which calculate an enrichment rate from a ranking of genes.

An R package will allow us to perform these different analyses using several databases. It is quite complete and I advise you to take the time to look at the documentation because here we will only see a small overview.