Skip to content

Comparing two populations

One last thing we might ask is what makes two populations distinct. This could be two conditions or even two clusters. This is possible with the FindMarkers function which works almost like FindAllMarkers since it calls the former. Many of the parameters are equivalent, however FindMarkers allows you to perform a differential expression analysis between two populations which are defined with the ident.1 and ident.2 parameters. By default, these are cell identities present in the active.ident but we can select another variable contained in the metadata using the group.by parameter. So if we want to study the impact of gender in the cells of the platelet then we would run:

FindMarkers(objectName, ident.1 = "female", ident.2 = "male", group.by = "sex", subset.ident = "Platelet")

(if we had a "sex" column in the metadata slot). If we want to study the impact of sex in all cells then we leave the subset.ident parameter as default (i.e. NULL).

Here we will test the difference between our NK and CD8+ T clusters which were very difficult to differentiate. The results are similar to FindAllMarkers with the difference that there is no gene column because as a differential analysis it will not be possible to have a gene overexpressed in both NK and CD8+ T cells.

NK_CD8_diff_markers <- FindMarkers(pbmc_small,
                           ident.1 = "NK",
                           ident.2 = "CD8+ T")

## Merge markers results with biomart annotation
NK_CD8_diff_markers_annotated <- merge(x = NK_CD8_diff_markers,  #First df to merge
                                       y = annotated_hg19,       #Second df to merge
                                       by.x = 0,                 #Column name of first df used for matching lines, 0 for rownames
                                       by.y = "ensembl_gene_id", #Column name of second df used for matching lines
                                       all.x = TRUE)             #Keep all lines from first df even if there is no match with second df

## Filter dataset based on Fold change and p-value adjusted
NK_CD8_diff_markers_annotated_signif <- subset(NK_CD8_diff_markers_annotated,
                                               p_val_adj < 0.05 &
                                                 abs(avg_log2FC) >= 0.25)       #Filter dataframe based on p_val_adj column

## Sorting results by average log2(Fold Change)
NK_CD8_diff_markers_annotated_signif <- NK_CD8_diff_markers_annotated_signif %>%                 #Rearrange df with dplyr package
  arrange(desc(avg_log2FC))                  #Sort lines by descending the column avg_log2FC and by group

## Most DE gene marker for each cluster
kable(NK_CD8_diff_markers_annotated_signif[(c(1:3, (nrow(NK_CD8_diff_markers_annotated_signif)-2):nrow(NK_CD8_diff_markers_annotated_signif))),])
Row.names p_val avg_log2FC pct.1 pct.2 p_val_adj external_gene_name description gene_biotype chromosome_name
1 ENSG00000163453 0 7.673621 0.551 0.006 0.00e+00 IGFBP7 insulin-like growth factor binding protein 7 [Source:HGNC Symbol;Acc:5476] protein_coding 4
2 ENSG00000135077 0 5.656534 0.314 0.009 0.00e+00 HAVCR2 hepatitis A virus cellular receptor 2 [Source:HGNC Symbol;Acc:18437] protein_coding 5
3 ENSG00000175294 0 5.513915 0.109 0.000 2.03e-05 CATSPER1 cation channel, sperm associated 1 [Source:HGNC Symbol;Acc:17116] protein_coding 11
224 ENSG00000167286 0 -3.971611 0.083 0.885 0.00e+00 CD3D CD3d molecule, delta (CD3-TCR complex) [Source:HGNC Symbol;Acc:1673] protein_coding 11
225 ENSG00000137078 0 -4.379914 0.006 0.232 1.01e-05 SIT1 signaling threshold regulating transmembrane adaptor 1 [Source:HGNC Symbol;Acc:17710] protein_coding 9
226 ENSG00000172116 0 -5.098008 0.019 0.368 0.00e+00 CD8B CD8b molecule [Source:HGNC Symbol;Acc:1707] protein_coding 2

Here are the results for the three most over-expressed genes in NK cells (avg_log2FC positive) and the 3 most over-expressed genes in CD8+ T cells (avg_log2FC negative).

You can totally use the different methods of gene cluster analysis on this one.