Skip to content

Over-representation analysis

Gene Ontology

We will start by analysing our gene lists with the Gene Ontology annotation which classifies genes into gene sets according to three main types of information:

  • Molecular Function (MF): protein activity of the gene product
  • Biological Process (BP): set of protein activities leading to a common task
  • Cellular Component (CC): location of the gene product

Each set of genes is called a gene set and is grouped according to Gene Ontology terms (referred to here as GO terms). The GO terms are classified in a tree structure with general gene sets that become more specific as we go along. The relationships between GO terms can reflect different cases:

  • is a : A GO term is a subtype of the GO term B. For example mitochondrion is a organelle.
  • part of : the term GO A is part of the term GO B, so if the term GO A is present then so is the term GO B. For example mitochondrion is a part of cytoplasm.
  • has part : the term GO A necessarily contains the term GO B, but if there is the term GO B there is not necessarily the term GO A. For example, the receptor tyrosine kinase activity has part ATP hydrolysis activity.
  • regulates : the term GO A necessarily impacts the term GO B, but the latter is not necessarily impacted by A.

In the ClusterProfiler package we will use the enrichGO function which calculates for each GO term the over-representation of the genes in the analysed cluster among those in the term.

To use the Gene Ontology database, we use an R package containing all the annotation of the desired organism. These packages are called OrgDb which can be found in the form org.[Genome Initials].eg.db (for the human annotation, the OrgDb is org.Hs.eg.db). They contain gene identifiers from different resources (NCBI, Ensembl, RefSeq, etc...) with annotation from different databases (GO, OMIM, PMID, Uniprot, etc...). The default gene identifiers are in the format entrez (which is a sequence of numbers). Since we only have the Ensembl identifiers or the gene name available to us, we will use the keyType parameter of the enrichGO function to use the gene names instead of the entrez identifiers and thus use the different information contained in the Org.Db.

## What's inside an organism database ?
ls("package:org.Hs.eg.db")

## You must see the help section of category you want to know about and don't hesite to test the examples to understand the architecture
?org.Hs.egENSEMBL    #Link between ensembl ID and entrez ID
?org.Hs.egSYMBOL2EG  #Link between entrez ID and gene name
?org.Hs.egSYMBOL     #Link between gene name and entrez ID
?org.Hs.egGENENAME   #Beware ! It concern gene description and not gene name as we know

## Different available mapping variable name
columns(org.Hs.eg.db)

We will therefore apply the enrichGO function for each clusters through a lapply. For each cluster :

  • We filter the marker result dataframe to retrieve only the rows concerning the genes overexpressed by the cluster cells.
  • We run the enrichGO function to obtain the GO terms (BP, CC and MF) enriched in the marker gene set.
  • Add the cluster name in a new column to the resulting dataframe
  • We visualize the results with the dotplot function of the enrichR package which allows to visualize the first 3 GO terms for each ontology type

The result of the lapply is a list where each element of the list is a resultant dataframe for each cluster. We then assemble the results with the do.call function which applies a function to the whole list. The rbind will concatenate the rows of all the elements in the list so that there is only one dataframe with the results for each cluster.

## GO enrichment for all clusters
enrich_go_list <- lapply(levels(pbmc_markers_signif$cluster), function(cluster_name){

  res_markers <- subset(pbmc_markers_signif, cluster == cluster_name & avg_log2FC > 0) #Filter markers to retrieve only positive markers for the specific cluster

  ego <- enrichGO(gene = res_markers$external_gene_name,                 #Vector of target genes
                  universe = unique(annotated_hg19$external_gene_name),  #Vector of reference genes (all genes from the differential analysis)
                  OrgDb = "org.Hs.eg.db",                                #Organisme database
                  keyType = "SYMBOL",                                    #Column name of the OrgDB that convert gene format in `gene`parameter to entrez ID
                  ont = "ALL")                                           #What category of GO you want to analyse (BP, CC, MF or ALL)

  ego@result$cluster <- cluster_name                                     #Add cluster name in a new column named "cluster"

  ## visualisation
  ### don't forget to add print when inside a function/loop/lapply
  print(dotplot(ego,                                                     #enrichResult object
                split = "ONTOLOGY",                                      #Do separated plot for each ontology type (only valable fo GO results)
                showCategory = 3,                                        #Only show first three categories
                label_format = 20,
                title = paste("Cluster", cluster_name)) +                #Add title
    facet_grid(ONTOLOGY~., scales = "free_y") +                          #Create subplot according to type used with `split = "ONTOLOGY"`
    theme(axis.text.y = element_text(size = 10),
                legend.key.size = unit(0.2, 'cm')))                      #Reduce ontology labels names
  return(ego@result)
})

enrich_go <- do.call("rbind", enrich_go_list)                            #Bind all results together
enrich_go <- enrich_go %>%
  group_by(cluster, ONTOLOGY)                                            #Rearrange df according to cluster and ontology type


## show first results
kable(slice_max(
  enrich_go, 
  n = 3, 
  order_by = Count, 
  with_ties = FALSE)[,-9]
)  #Only remove list of genes for the visualisation
First Enriched GO terms for each cluster
ONTOLOGY ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count cluster
BP cytoplasmic translation 70/185 150/16049 0.0000000 0.0000000 0.0000000 70 0
BP ribonucleoprotein complex biogenesis 40/185 438/16049 0.0000000 0.0000000 0.0000000 40 0
BP ribosome biogenesis 36/185 297/16049 0.0000000 0.0000000 0.0000000 36 0
CC ribosome 72/189 224/16891 0.0000000 0.0000000 0.0000000 72 0
CC cytosolic ribosome 70/189 110/16891 0.0000000 0.0000000 0.0000000 70 0
CC ribosomal subunit 70/189 178/16891 0.0000000 0.0000000 0.0000000 70 0
MF structural constituent of ribosome 68/188 159/16507 0.0000000 0.0000000 0.0000000 68 0
MF mRNA binding 17/188 318/16507 0.0000001 0.0000111 0.0000099 17 0
MF rRNA binding 16/188 59/16507 0.0000000 0.0000000 0.0000000 16 0
BP positive regulation of defense response 52/539 437/16049 0.0000000 0.0000000 0.0000000 52 1
BP generation of precursor metabolites and energy 52/539 447/16049 0.0000000 0.0000000 0.0000000 52 1
BP immune response-regulating signaling pathway 49/539 440/16049 0.0000000 0.0000000 0.0000000 49 1
CC secretory granule lumen 52/550 314/16891 0.0000000 0.0000000 0.0000000 52 1
CC cytoplasmic vesicle lumen 52/550 317/16891 0.0000000 0.0000000 0.0000000 52 1
CC vesicle lumen 52/550 318/16891 0.0000000 0.0000000 0.0000000 52 1
MF actin binding 32/540 424/16507 0.0000105 0.0006036 0.0005298 32 1
MF molecular function inhibitor activity 32/540 474/16507 0.0000921 0.0031247 0.0027424 32 1
MF amide binding 28/540 368/16507 0.0000323 0.0014190 0.0012454 28 1
BP regulation of lymphocyte activation 36/213 484/16049 0.0000000 0.0000000 0.0000000 36 2
BP mononuclear cell differentiation 32/213 471/16049 0.0000000 0.0000000 0.0000000 32 2
BP lymphocyte differentiation 31/213 421/16049 0.0000000 0.0000000 0.0000000 31 2
CC ribosome 27/224 224/16891 0.0000000 0.0000000 0.0000000 27 2
CC focal adhesion 26/224 411/16891 0.0000000 0.0000000 0.0000000 26 2
CC cell-substrate junction 26/224 420/16891 0.0000000 0.0000000 0.0000000 26 2
MF structural constituent of ribosome 24/220 159/16507 0.0000000 0.0000000 0.0000000 24 2
MF molecular function inhibitor activity 16/220 474/16507 0.0006378 0.0293565 0.0253015 16 2
MF cadherin binding 13/220 309/16507 0.0002675 0.0237934 0.0205069 13 2
BP cytoplasmic translation 31/171 150/16049 0.0000000 0.0000000 0.0000000 31 3
BP adaptive immune response 29/171 449/16049 0.0000000 0.0000000 0.0000000 29 3
BP B cell activation 27/171 265/16049 0.0000000 0.0000000 0.0000000 27 3
CC ribosome 29/178 224/16891 0.0000000 0.0000000 0.0000000 29 3
CC cytosolic ribosome 28/178 110/16891 0.0000000 0.0000000 0.0000000 28 3
CC ribosomal subunit 28/178 178/16891 0.0000000 0.0000000 0.0000000 28 3
MF structural constituent of ribosome 29/170 159/16507 0.0000000 0.0000000 0.0000000 29 3
MF peptide binding 16/170 295/16507 0.0000001 0.0000029 0.0000027 16 3
MF amide binding 16/170 368/16507 0.0000013 0.0000496 0.0000464 16 3
BP adaptive immune response 27/103 449/16049 0.0000000 0.0000000 0.0000000 27 4
BP leukocyte mediated immunity 23/103 377/16049 0.0000000 0.0000000 0.0000000 23 4
BP regulation of lymphocyte activation 23/103 484/16049 0.0000000 0.0000000 0.0000000 23 4
CC external side of plasma membrane 16/106 372/16891 0.0000000 0.0000001 0.0000001 16 4
CC early endosome 13/106 399/16891 0.0000013 0.0000229 0.0000179 13 4
CC endocytic vesicle membrane 12/106 194/16891 0.0000000 0.0000002 0.0000001 12 4
MF antigen binding 9/106 62/16507 0.0000000 0.0000001 0.0000000 9 4
MF peptide binding 8/106 295/16507 0.0006277 0.0146499 0.0128414 8 4
MF amide binding 8/106 368/16507 0.0025574 0.0374167 0.0327978 8 4
BP immune response-regulating signaling pathway 68/757 440/16049 0.0000000 0.0000000 0.0000000 68 5
BP activation of immune response 68/757 482/16049 0.0000000 0.0000000 0.0000000 68 5
BP immune response-activating signaling pathway 63/757 415/16049 0.0000000 0.0000000 0.0000000 63 5
CC vacuolar membrane 65/781 448/16891 0.0000000 0.0000000 0.0000000 65 5
CC secretory granule membrane 63/781 299/16891 0.0000000 0.0000000 0.0000000 63 5
CC cytoplasmic vesicle lumen 61/781 317/16891 0.0000000 0.0000000 0.0000000 61 5
MF actin binding 54/770 424/16507 0.0000000 0.0000000 0.0000000 54 5
MF actin filament binding 32/770 202/16507 0.0000000 0.0000005 0.0000004 32 5
MF cadherin binding 32/770 309/16507 0.0000211 0.0028534 0.0025270 32 5
BP adaptive immune response 43/262 449/16049 0.0000000 0.0000000 0.0000000 43 6
BP leukocyte mediated immunity 41/262 377/16049 0.0000000 0.0000000 0.0000000 41 6
BP regulation of lymphocyte activation 37/262 484/16049 0.0000000 0.0000000 0.0000000 37 6
CC focal adhesion 29/269 411/16891 0.0000000 0.0000000 0.0000000 29 6
CC cell-substrate junction 29/269 420/16891 0.0000000 0.0000000 0.0000000 29 6
CC external side of plasma membrane 25/269 372/16891 0.0000000 0.0000001 0.0000001 25 6
MF actin binding 20/268 424/16507 0.0000210 0.0021638 0.0019775 20 6
MF actin filament binding 15/268 202/16507 0.0000011 0.0001900 0.0001736 15 6
MF immune receptor activity 10/268 141/16507 0.0001028 0.0067171 0.0061389 10 6
BP regulation of lymphocyte activation 43/373 484/16049 0.0000000 0.0000000 0.0000000 43 7
BP adaptive immune response 40/373 449/16049 0.0000000 0.0000000 0.0000000 40 7
BP positive regulation of leukocyte activation 37/373 359/16049 0.0000000 0.0000000 0.0000000 37 7
CC vacuolar membrane 30/388 448/16891 0.0000002 0.0000041 0.0000035 30 7
CC lysosomal membrane 29/388 407/16891 0.0000001 0.0000024 0.0000020 29 7
CC lytic vacuole membrane 29/388 407/16891 0.0000001 0.0000024 0.0000020 29 7
MF amide binding 23/382 368/16507 0.0000168 0.0016856 0.0015624 23 7
MF immune receptor activity 20/382 141/16507 0.0000000 0.0000000 0.0000000 20 7
MF peptide binding 20/382 295/16507 0.0000188 0.0016856 0.0015624 20 7
BP wound healing 29/218 403/16049 0.0000000 0.0000000 0.0000000 29 8
BP blood coagulation 21/218 221/16049 0.0000000 0.0000000 0.0000000 21 8
BP coagulation 21/218 225/16049 0.0000000 0.0000000 0.0000000 21 8
CC actin cytoskeleton 26/228 475/16891 0.0000000 0.0000001 0.0000001 26 8
CC cell-substrate junction 22/228 420/16891 0.0000001 0.0000022 0.0000019 22 8
CC focal adhesion 21/228 411/16891 0.0000002 0.0000062 0.0000053 21 8
MF actin binding 21/227 424/16507 0.0000004 0.0002070 0.0001890 21 8
MF integrin binding 10/227 144/16507 0.0000305 0.0073768 0.0067362 10 8
MF cytoskeletal anchor activity 4/227 24/16507 0.0002981 0.0479966 0.0438282 4 8

The result is a dataframe where the GO terms have been considered as enriched:

  • ONTOLOGY : ontology type (BP, CC, or MF)
  • ID : Unique identifier of the GO term
  • Description : Description of the GO term
  • GeneRatio : Fraction representing the number of marker genes present in the GO term, \(GeneRatio = \frac{nbrMarkerGeneInKEGGcat}{nbrMarkerGene}\).
  • BgRatio : Fraction representing the number of reference genes present in the GO term, \(BgRatio = \frac{nbrTotalGeneInKEGGcat}{nbrTotalGene}\).
  • pvalue : p-value of the enrichment test
  • p.adjust : adjusted p-value of the Benjamini Hochberg test
  • qvalue : q-value after FDR (False Discovery Rate) check
  • geneID : String containing the list of marker genes present in the GO term (separated by /)
  • count : Number of marker genes present in the GO term
  • cluster : Column added before the do.call("rbind", list) to identify in which cluster the GO term has been considered as enriched.

A GO term has been considered as enriched if :

  • the p-value \(\lt\) 0.05
  • the adjusted p-value \(\lt\) 0.05
  • the q-value \(\lt\) 0.2

Kyoto Encyclopedia of Genes and Genomes (KEGG)

KEGG is a database that focuses on the molecular annotation of the different metabolic pathways. It allows the description of the biochemical reactions that compose the pathway. KEGG also allows the visualization of these pathways through hand-drawn maps representing the different reactions and the relationship between the genes. There are several main categories of KEGG pathways:

  • Metabolism
  • Genetic information processing
  • Environmental information processing
  • Cellular processes
  • Organ systems
  • Human diseases
  • Drug development

We will use the enrichKEGG function from the ClusterProfiler package. This function doesn't work exactly like enrichGO because it calls directly on the database and doesn't use an Orgdb package but it does require our genes to be annotated with an entrez id. We will have to find another way to convert our genes into the correct format.

To do this, we will use the Orgdb or org.Hs.egSYMBOL object to list each enter id as its "gene symbol" (gene name). When we convert this object to a dataframe we get a table with two columns (gene_id and symbol). We now have the possibility to switch from a gene name to an entrez id.

We will therefore apply the enrichKEGG function for each clusters through a lapply. For each cluster :

  • We filter the marker result dataframe to retrieve only the rows concerning the genes overexpressed by the cluster cells.
  • Run the enrichGO function to get the enriched KEGG terms in the marker gene set, filter our enter/symbol mapping table for the gene and universe parameters with the gene names contained in the marker and biomart annotation tables.
  • Add the cluster name in a new column to the resulting dataframe
  • We visualize the results with the dotplot function of the enrichR package which allows to visualize the first 5 KEGG terms
## Retrieve a corresponding table between entrez id and gene name (called gene symbol in org.db)
corresp_entrez <- as.data.frame(org.Hs.egSYMBOL)  #Change format to df

## Apply enrichKEGG for each cluster
enrich_kegg_list <- lapply(levels(pbmc_markers_signif$cluster), function(cluster_name){

  res_markers <- subset(pbmc_markers_signif, cluster == cluster_name & avg_log2FC > 0) #Filter markers dataframe based on cluster

  ## Perform enrichKEGG analysis
  ekegg <- enrichKEGG(gene = subset(corresp_entrez, symbol %in% res_markers$external_gene_name)$gene_id,                #Genes to analyse
                      universe = subset(corresp_entrez, symbol %in% unique(annotated_hg19$external_gene_name))$gene_id, #Background genes, here we take all genes from our expression matrix
                      organism = "hsa")

  ekegg@result$cluster <- cluster_name            #Add cluster name as column

  ## Add plot
  print(dotplot(ekegg,
                label_format = 30,
                font.size = 10,
                showCategory = 5,
                title = paste("Cluster", cluster_name)) +
    theme(axis.text.y = element_text(size = 10),
                legend.key.size = unit(0.2, 'cm')))

  return(ekegg@result)                            #Return dataframe result
})

## Concatenate all results in one dataframe
enrich_kegg <- do.call("rbind", enrich_kegg_list)

## Group result by cluster (easier to manipulate with dplyr)
enrich_kegg <- enrich_kegg %>%
  group_by(cluster)

## Visualise first 3 KEGG categories for each cluster (removing the vector of genes just for the visualisation)
kable(slice_max(
  enrich_kegg, 
  n = 3, 
  order_by = Count, 
  with_ties = FALSE)[,-8]
)
First Enriched KEGG categories for each cluster
category subcategory ID Description GeneRatio BgRatio pvalue qvalue geneID Count cluster
Human Diseases Infectious disease: viral hsa05171 Coronavirus disease - COVID-19 69/131 227/7983 0.0000000 0.0000000 3725/3921/4736/6122/6124/6125/6128/6129/6130/6133/6134/6135/6136/6137/6138/6139/6141/6142/6143/6144/6146/6147/6152/6155/6156/6157/6160/6161/6164/6165/6167/6168/6169/6173/6175/6176/6181/6187/6188/6189/6191/6192/6193/6194/6202/6203/6204/6206/6207/6208/6209/6210/6217/6222/6223/6224/6227/6228/6230/6231/6232/6233/6234/6235/9045/9349/11224/23521/25873 69 0
Genetic Information Processing Translation hsa03010 Ribosome 68/131 134/7983 0.0000000 0.0000000 3921/4736/6122/6124/6125/6128/6129/6130/6133/6134/6135/6136/6137/6138/6139/6141/6142/6143/6144/6146/6147/6152/6155/6156/6157/6160/6161/6164/6165/6167/6168/6169/6173/6175/6176/6181/6187/6188/6189/6191/6192/6193/6194/6202/6203/6204/6206/6207/6208/6209/6210/6217/6222/6223/6224/6227/6228/6230/6231/6232/6233/6234/6235/9045/9349/11224/23521/25873 68 0
Organismal Systems Immune system hsa04640 Hematopoietic cell lineage 9/131 94/7983 0.0000215 0.0009710 914/915/916/917/921/924/926/2323/3575 9 0
Human Diseases Cancer: overview hsa05208 Chemical carcinogenesis - reactive oxygen species 36/341 196/7983 0.0000000 0.0000000 292/293/873/1327/1329/1340/1351/1535/1537/2885/3162/4257/4688/4695/4707/4716/4726/4731/4792/5580/5879/6390/6648/7384/7386/9377/9446/10105/10327/10975/29796/51079/54539/126328/374291/653361 36 1
Human Diseases Neurodegenerative disease hsa05022 Pathways of neurodegeneration - multiple diseases 36/341 443/7983 0.0001337 0.0008218 292/293/637/1327/1329/1340/1351/1460/1536/1537/2876/3553/4695/4707/4716/4726/4731/5688/5691/5879/6390/7132/7133/7384/7386/8878/9246/9377/10105/10975/23401/29796/51079/54539/126328/374291 36 1
Human Diseases Cardiovascular disease hsa05415 Diabetic cardiomyopathy 34/341 175/7983 0.0000000 0.0000000 292/293/948/1327/1329/1340/1351/1509/1535/1536/1537/2597/4688/4689/4695/4707/4716/4726/4731/5580/5879/6390/7040/7384/7386/9377/10105/10975/29796/51079/54539/126328/374291/653361 34 1
Human Diseases Infectious disease: viral hsa05171 Coronavirus disease - COVID-19 26/131 227/7983 0.0000000 0.0000000 2353/3725/3921/6122/6139/6147/6156/6157/6160/6166/6173/6175/6188/6201/6204/6206/6210/6222/6224/6227/6230/6233/6235/6772/9045/25873 26 2
Genetic Information Processing Translation hsa03010 Ribosome 23/131 134/7983 0.0000000 0.0000000 3921/6122/6139/6147/6156/6157/6160/6166/6173/6175/6188/6201/6204/6206/6210/6222/6224/6227/6230/6233/6235/9045/25873 23 2
Human Diseases Infectious disease: bacterial hsa05132 Salmonella infection 17/131 244/7983 0.0000004 0.0000154 71/330/399/2353/3320/3725/6188/6281/6500/6932/6990/7277/7295/8717/10627/51176/103910 17 2
Human Diseases Infectious disease: viral hsa05171 Coronavirus disease - COVID-19 30/112 227/7983 0.0000000 0.0000000 2197/3921/4736/5579/6125/6132/6133/6136/6138/6141/6142/6144/6147/6155/6159/6167/6168/6189/6193/6202/6205/6207/6222/6223/6227/6228/6232/11224/23521/200916 30 3
Genetic Information Processing Translation hsa03010 Ribosome 29/112 134/7983 0.0000000 0.0000000 2197/3921/4736/6125/6132/6133/6136/6138/6141/6142/6144/6147/6155/6159/6167/6168/6189/6193/6202/6205/6207/6222/6223/6227/6228/6232/11224/23521/200916 29 3
Organismal Systems Immune system hsa04640 Hematopoietic cell lineage 17/112 94/7983 0.0000000 0.0000000 930/931/933/951/2208/3108/3109/3112/3113/3115/3117/3118/3119/3122/3123/3127/3566 17 3
Organismal Systems Immune system hsa04650 Natural killer cell mediated cytotoxicity 15/73 125/7983 0.0000000 0.0000000 919/2534/3002/3105/3106/3107/3133/3824/3932/4068/5551/7535/10870/27040/259197 15 4
Human Diseases Infectious disease: viral hsa05170 Human immunodeficiency virus 1 infection 12/73 204/7983 0.0000003 0.0000032 567/801/915/916/917/919/3105/3106/3107/3133/3134/60489 12 4
Organismal Systems Immune system hsa04660 T cell receptor signaling pathway 11/73 117/7983 0.0000000 0.0000001 915/916/917/919/925/926/2534/3932/5527/7535/27040 11 4
Human Diseases Neurodegenerative disease hsa05022 Pathways of neurodegeneration - multiple diseases 49/481 443/7983 0.0000208 0.0002333 292/293/581/637/805/808/847/1020/1329/1337/1340/1347/1351/1460/1536/4218/4701/4709/4711/4717/5579/5594/5606/5663/5685/5688/5691/5692/5701/5702/5710/5715/5879/7132/7133/7384/7386/7416/7846/9246/9377/10010/10121/10376/29796/51079/51465/55062/55255 49 5
Human Diseases Infectious disease: bacterial hsa05132 Salmonella infection 42/481 244/7983 0.0000000 0.0000000 60/302/387/388/391/581/834/837/2316/3071/4074/4615/5058/5216/5585/5594/5606/5878/5879/6237/6281/6934/6993/7132/7846/8655/8677/8743/10092/10094/10095/10097/10109/10121/10376/10552/25828/27128/29108/51143/55845/112574 42 5
Human Diseases Neurodegenerative disease hsa05014 Amyotrophic lateral sclerosis 42/481 332/7983 0.0000031 0.0000601 60/581/637/834/847/1329/1337/1340/1347/1351/4218/4701/4709/4711/4717/5216/5606/5685/5688/5691/5692/5701/5702/5710/5715/5879/7132/7133/7384/7386/7416/7846/8021/9377/10010/10121/10376/29796/51079/55062/55255/400916 42 5
Organismal Systems Immune system hsa04650 Natural killer cell mediated cytotoxicity 26/156 125/7983 0.0000000 0.0000000 356/919/2207/2214/3002/3105/3106/3107/3133/3458/3683/3689/3804/3812/3821/3824/4068/5551/5880/7305/7462/7535/10870/51744/117157/259197 26 6
Human Diseases Infectious disease: viral hsa05163 Human cytomegalovirus infection 17/156 222/7983 0.0000014 0.0000328 356/567/801/811/2923/3105/3106/3107/3133/3716/5732/5880/6348/6351/6352/6890/54331 17 6
Cellular Processes Cell motility hsa04810 Regulation of actin cytoskeleton 17/156 226/7983 0.0000018 0.0000329 60/71/1072/2934/3683/3689/3695/5216/5499/5880/10093/10096/10109/10627/10788/81873/103910 17 6
Human Diseases Infectious disease: bacterial hsa05152 Tuberculosis 22/217 176/7983 0.0000000 0.0000000 637/972/2207/3108/3109/3111/3113/3115/3117/3118/3119/3122/3123/3127/3553/3606/4046/4261/5534/6850/7879/8844 22 7
Organismal Systems Immune system hsa04640 Hematopoietic cell lineage 20/217 94/7983 0.0000000 0.0000000 911/912/945/1436/1438/2322/3108/3109/3111/3113/3115/3117/3118/3119/3122/3123/3127/3553/3563/3570 20 7
Cellular Processes Transport and catabolism hsa04145 Phagosome 20/217 146/7983 0.0000000 0.0000001 60/71/3108/3109/3111/3113/3115/3117/3118/3119/3122/3123/3127/4688/5879/7879/10376/53407/83547/84617 20 7
NA NA hsa04820 Cytoskeleton in muscle cells 12/125 230/7983 0.0002438 0.0100247 2026/2273/3674/3688/3690/3693/7125/7168/7171/9124/10398/23002 12 8
Organismal Systems Immune system hsa04611 Platelet activation 11/125 122/7983 0.0000029 0.0005543 2811/2815/3674/3688/3690/4638/5742/6915/7450/51206/83706 11 8
Cellular Processes Cellular community - eukaryotes hsa04510 Focal adhesion 11/125 198/7983 0.0002602 0.0100247 87/3611/3674/3688/3690/3693/4638/7450/10398/29780/56034 11 8

The result is a dataframe where KEGG categories have been considered as enriched with the following columns :

  • ID : Unique identifier of the KEGG category
  • Description : Description of the KEGG category
  • GeneRatio : Fraction representing the number of marker genes present in the KEGG category, \(GeneRatio = \frac{nbrMarkerGeneInKEGGcat}{nbrMarkerGene}\)
  • BgRatio : Fraction representing the number of genes of the reference present in the KEGG category, \(BgRatio = \frac{nbrTotalGeneInKEGGcat}{nbrTotalGene}\)
  • pvalue : p-value of the enrichment test
  • p.adjust : adjusted p-value of the Benjamini Hochberg test
  • qvalue : q-value after FDR (False Discovery Rate) check
  • geneID : String containing the list of marker genes present in the KEGG category (separated by /)
  • Count : Number of marker genes present in the KEGG category
  • cluster : Column added before the do.call("rbind", list) in order to identify in which cluster the KEGG category has been considered as enriched.

A KEGG category has been considered enriched if :

  • the p-value \(\lt\) 0.05
  • the adjusted p-value \(\lt\) 0.05
  • q-value \(\lt\) 0.2

These over-representation methods are highly dependent on the database containing fairly generalized groups of genes. However, we can see that the results of enrichGO, enrichKEGG and Biomart intersect for some clusters where :

  • Cluster 6 would represent the Natural Killer cells as well as cluster 4: knowing that they are very close on the UMAP it reflects their proximity of the transcriptomes of the cells that compose these two clusters. The GO analysis would however lean more towards T cells for cluster 4.
  • Cluster 8 would be composed of platelet cells

Knowing when taking only the first 3 or 5 results (so very restricted), we are extremely stringent in identifying clusters.