GO

Prepare the datasets for GOSeq


  1. Select Compute an expression on every row tool with
    • Add expression: bool(c7<0.05)
    • as a new column to: the DESeq2 result file

  2. Cut tool with
    • Cut columns: c1,c8
    • Delimited by: Tab
    • From: the output of the Compute tool

  3. Change Case tool with
    • From: the output of the previous Cut tool
    • Change case of columns: c1
    • Delimited by: Tab
    • To: Upper case
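
For readers who want to see what these three steps do to the data, here is a minimal Python sketch of the same transformation outside Galaxy. It assumes a headerless, tab-separated DESeq2 result in which column 7 holds the adjusted p-value. The function name `prepare_de_table` and the handling of `NA` values (treated as not significant) are illustrative assumptions, not the behaviour of the Galaxy tools.

```python
import csv
import io


def prepare_de_table(deseq2_tsv, alpha=0.05):
    """Return a two-column TSV: upper-cased gene ID and a True/False flag.

    Hypothetical helper mirroring Compute -> Cut -> Change Case.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(deseq2_tsv), delimiter="\t"):
        if not row:
            continue  # skip empty lines
        try:
            # c7 in Galaxy's 1-based numbering is index 6 here
            is_de = float(row[6]) < alpha
        except ValueError:
            # Assumption: non-numeric values such as "NA" count as not significant
            is_de = False
        # Keep only the gene ID (c1, upper-cased) and the new flag (c8)
        writer.writerow([row[0].upper(), is_de])
    return out.getvalue()
```

Calling `prepare_de_table(text)` on the contents of the DESeq2 result file would give one line per gene, such as `FBGN0000001`, a tab, and `True` or `False`.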

This generates the first input for goseq. The second input goseq needs is the gene lengths. We can take the gene lengths generated by the featureCounts tool and reformat them slightly.


  1. Copy one output of type ...: Feature lengths from the 7 featureCounts runs in the STAR/HISAT2 history
  2. Rename it Lengths
  3. Change Case tool with
    • From: the feature lengths (output of featureCounts tool)
    • Change case of columns: c1
    • Delimited by: Tab
    • To: Upper case

We now have the two input files required by goseq.
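
Because both files are upper-cased so that their gene identifiers match, a quick consistency check can be useful before running goseq. The sketch below assumes both files are tab-separated with the gene ID in the first column. The function name `missing_lengths` is hypothetical and not part of any Galaxy tool.

```python
def missing_lengths(de_tsv, lengths_tsv):
    """Return gene IDs found in the DE table but absent from the length table."""

    def first_column(text):
        # Gene ID is assumed to be the first tab-separated field on each line
        return [line.split("\t")[0] for line in text.splitlines() if line.strip()]

    known = set(first_column(lengths_tsv))
    return [gene for gene in first_column(de_tsv) if gene not in known]
```

An empty list means every gene in the first file has a length in the second. A non-empty list often points to a case mismatch in the identifiers.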

Perform GO analysis


  1. Select goseq tool with
    • Differentially expressed genes file : first file generated by Change Case tool on previous step
    • Gene lengths file : second file generated by Change Case tool on previous step
    • Gene categories : Get categories
    • Select a genome to use : Fruit fly (dm6)
    • Select Gene ID format : Ensembl Gene ID
    • Select one or more categories : GO: Cellular Component, GO: Biological Process, GO: Molecular Function

goseq generates a big table with the following columns for each GO term:

Column Description
category GO category
over_rep_pval p-value for over representation of the term in the differentially expressed genes
under_rep_pval p-value for under representation of the term in the differentially expressed genes
numDEInCat number of differentially expressed genes in this category
numInCat number of genes in this category
term detail of the term
ontology MF (Molecular Function - molecular activities of gene products), CC (Cellular Component - where gene products are active), BP (Biological Process - pathways and larger processes made up of the activities of multiple gene products)
p.adjust.over_represented p-value for over representation of the term in the differentially expressed genes, adjusted for multiple testing with the Benjamini-Hochberg procedure
p.adjust.under_represented p-value for under representation of the term in the differentially expressed genes, adjusted for multiple testing with the Benjamini-Hochberg procedure

To identify categories that are significantly over- or under-represented below a given p-value cutoff, use the adjusted p-value rather than the raw p-value.
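
To illustrate what the Benjamini-Hochberg adjustment does to a list of raw p-values, here is a minimal Python sketch of the standard procedure. It is not the code goseq or Galaxy runs internally, and the function name `benjamini_hochberg` is chosen here for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that
    # adjusted values never increase as the rank decreases
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of tests divided by its rank, so a term that looks significant on its raw p-value may no longer pass the cutoff once many GO terms are tested together.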

How many GO terms are over-represented at an adjusted p-value < 0.05?

Under-represented?

How are the over-represented GO terms divided between MF, CC and BP?

And for under-represented GO terms?
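
One way to answer these questions outside Galaxy is to count the significant terms per ontology. The sketch below assumes the goseq output is tab-separated with a header line using the column names from the table above. The function name `count_significant` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_significant(goseq_tsv, column, alpha=0.05):
    """Count GO terms per ontology whose value in `column` is below alpha."""
    reader = csv.DictReader(io.StringIO(goseq_tsv), delimiter="\t")
    counts = Counter()
    for row in reader:
        # `column` is e.g. "p.adjust.over_represented" (assumed header name)
        if float(row[column]) < alpha:
            counts[row["ontology"]] += 1
    return counts
```

Passing `"p.adjust.over_represented"` as `column` gives the split of over-represented terms between MF, CC and BP. Passing `"p.adjust.under_represented"` gives the same for under-represented terms, and summing the counts gives the totals.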