Manipulation of DESeq2 data for visualisation and comparisons¶

Now we would like to extract the most differentially expressed genes in the various conditions, and then visualize them using a heatmap of the normalized counts for each sample.

We will proceed in several steps:

For each package, extract the normalized counts of genes for each sample (all three packages, DESeq2, edgeR and limma, provide this functionality).
For each package, extract the most differentially expressed genes at a given log2FC threshold (let's say |log2FC| > 2, corresponding to at least a 4-fold increase or decrease in expression) and at a given adjusted p-value (let's say p-adj < 0.01). We will keep these gene lists aside to build a Venn diagram later for comparison of the three tools.
Plot heatmaps of normalized counts
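
As a quick check of the threshold arithmetic mentioned above, the following minimal Python sketch converts the log2FC threshold into linear fold changes. The variable names are illustrative and not part of any tool.

```python
# log2FC threshold proposed in the text
LOG2FC_THRESHOLD = 2

# A log2 fold change of +2 means 2**2 = 4 times more expression,
# and -2 means 2**-2 = 0.25 times (one quarter of) the expression.
fold_up = 2 ** LOG2FC_THRESHOLD
fold_down = 2 ** -LOG2FC_THRESHOLD

print(fold_up, fold_down)
```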

Extract the most differentially expressed genes (PRJNA630433 / DESeq2)¶

Basically, we navigate in the DESeq history of the PRJNA630433 use-case and we repeat a DESeq2 run, asking in addition for a rLog-Normalized counts output.

DESeq2 settings

Basically, the same as before, except that we ask for a normalized counts file.

how

→ Select datasets per levels
1: Factor

→ Tissue
1: Factor level

Note that there will be three factor levels in this analysis: Dc, Mo and Oc.

→ Oc
Counts file(s)

→ select the data collection icon, then 15: Oc FeatureCounts counts
2: Factor level

→ Mo
Counts file(s)

→ select the data collection icon, then 10: Mo FeatureCounts counts
3: Factor level (you must click on Insert Factor level)

→ Dc
Counts file(s)

→ select the data collection icon, then 5: Dc FeatureCounts counts
(Optional) provide a tabular file with additional batch factors to include in the model.

→ Leave to Nothing selected
Files have header?

→ Yes
Choice of Input data

→ Count data
Advanced options

→ No, leave folded
Output options

→ This time, check the Output rLog normalized table box !

→ Unfold and check Output all levels vs all levels of primary factor (use when you have >2 levels for primary factor) in addition to the already checked Generate plots for visualizing the analysis results

→ Leave Alpha value for MA-plot at 0.1: note that this option is only used for plots and does not impact DESeq2 results
Run Tool

This time you can trash the DESeq2 plots and result files which we have already generated.

Keep the rLog-Normalized counts output for later: we will use it for a clustered heatmap.

Generate top lists of DE genes¶

We will do that with the help of the tool Filter data on any column using simple expressions. We will also use 3 other tools: Compute on rows, Column Regex Find And Replace and Select lines that match an expression.

Select genes with |log2FC| > 2 and p-adj < 0.01 with `Filter data on any column using simple expressions`¶

Filter data on any column... settings

Filter

→ DESeq2 Results Tables
With following condition

→ abs(c3) > 2 and c7 < 0.01
Number of header lines to skip

→ 1 (these tables have an added header !)
Click the Run Tool button

Rename the "filter on..." collection to Top gene lists
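
To make the filter condition concrete, here is a minimal Python sketch of what `abs(c3) > 2 and c7 < 0.01` selects, where c3 is the third column (log2FC) and c7 the seventh (p-adj). The table content, the column names and the function name `filter_de` are hypothetical, and the handling of `NA` values is an assumption of this sketch, which simply skips them.

```python
import csv
import io

# Hypothetical excerpt of a DESeq2 results table (values are made up).
TABLE = (
    "GeneID\tBase mean\tlog2(FC)\tStdErr\tWald-Stats\tP-value\tP-adj\n"
    "geneA\t1200.5\t3.1\t0.4\t7.75\t1e-14\t2e-12\n"
    "geneB\t300.2\t-2.6\t0.5\t-5.2\t2e-07\t5e-06\n"
    "geneC\t80.0\t0.4\t0.3\t1.33\t0.18\t0.35\n"
    "geneD\t15.0\t2.8\t1.2\t2.33\t0.02\tNA\n"
)


def filter_de(text, lfc=2.0, padj=0.01, header_lines=1):
    """Keep the header plus rows where abs(c3) > lfc and c7 < padj."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    kept = rows[:header_lines]
    for row in rows[header_lines:]:
        try:
            c3, c7 = float(row[2]), float(row[6])
        except (ValueError, IndexError):
            # Non-numeric cells such as "NA" cannot be compared: skip the row.
            continue
        if abs(c3) > lfc and c7 < padj:
            kept.append(row)
    return kept


top = filter_de(TABLE)
print([row[0] for row in top[1:]])
```

Note that the absolute value keeps both strongly up-regulated (geneA) and strongly down-regulated (geneB) genes.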

Compute a boolean value by row¶

This is to determine whether genes in the lists are up or down-regulated

Compute on rows settings

Input file

→ Top gene lists ( collection !)
Input has a header line with column names?

→ Yes
1: Expressions
Add expression

→ c3 > 0
Mode of the operation

→ Append
The new column name

→ Regulation
Avoid scientific notation in any newly computed columns

→ No
Click the Run Tool button

Look at the result of evaluating the expression c3 > 0 in the new Regulation column of the output datasets.
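
The effect of this step can be sketched in a few lines of Python: each data row gets an extra cell holding the result of `c3 > 0`, and the header gets the new column name. The rows and the function name `add_regulation` are hypothetical; this is an illustration, not the tool's implementation.

```python
# Hypothetical filtered table: header row followed by data rows (made-up values).
rows = [
    ["GeneID", "Base mean", "log2(FC)", "StdErr", "Wald-Stats", "P-value", "P-adj"],
    ["geneA", "1200.5", "3.1", "0.4", "7.75", "1e-14", "2e-12"],
    ["geneB", "300.2", "-2.6", "0.5", "-5.2", "2e-07", "5e-06"],
]


def add_regulation(table, column_name="Regulation"):
    """Append a column holding the boolean result of c3 > 0 for each row."""
    header, body = table[0], table[1:]
    out = [header + [column_name]]
    for row in body:
        out.append(row + [str(float(row[2]) > 0)])
    return out


with_regulation = add_regulation(rows)
for row in with_regulation:
    print(row[0], row[-1])
```

The appended column becomes column 8, which is the column targeted in the next step.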

Transform `True` and `False` values to `up` and `down`, respectively¶

Column Regex Find And Replace settings

Select cells from

→ Compute on collection 36 (or so)
using column

→ 8
Check

→ click the button Insert Check
Find Regex

→ False
Replacement

→ down
Check

→ click another time the button Insert Check
Find Regex

→ True
Replacement

→ up
Click the Run Tool button

Rename the collection Column Regex Find And Replace on collection 40 to top gene lists - oriented
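
The two checks configured above amount to two successive regex substitutions restricted to column 8. A minimal Python sketch, with hypothetical rows and a hypothetical helper `replace_in_column`:

```python
import re

# The two checks from the settings above: (Find Regex, Replacement).
CHECKS = [(r"False", "down"), (r"True", "up")]

# Hypothetical rows, reduced to 8 columns with made-up values.
rows = [
    ["GeneID", "Base mean", "log2(FC)", "StdErr", "Wald-Stats", "P-value", "P-adj", "Regulation"],
    ["geneA", "1200.5", "3.1", "0.4", "7.75", "1e-14", "2e-12", "True"],
    ["geneB", "300.2", "-2.6", "0.5", "-5.2", "2e-07", "5e-06", "False"],
]


def replace_in_column(table, column, checks):
    """Apply each (pattern, replacement) pair to one 1-based column only."""
    idx = column - 1
    out = []
    for row in table:
        new_row = list(row)
        for pattern, replacement in checks:
            new_row[idx] = re.sub(pattern, replacement, new_row[idx])
        out.append(new_row)
    return out


oriented = replace_in_column(rows, 8, CHECKS)
print([row[7] for row in oriented])
```

The header cell `Regulation` matches neither pattern, so it is left unchanged.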

Split the lists in `up` and `down` regulated lists¶

This will be performed through 2 successive runs of the tool Select lines that match an expression

Select lines that match an expression settings

Select lines from

→ top gene lists - oriented
that

→ matching
the pattern

→ \tup (a tabulation immediately followed by the string up)
Keep header line

→ Yes
Click the Run Tool button

Immediately rename the collection Select on collection... to top up-regulated gene lists

Redo exactly the same operation with a single change in the setting of the tool Select lines that match an expression

Select lines that match an expression settings

Select lines from

→ top gene lists - oriented
that

→ matching
the pattern

→ \tdown (a tabulation immediately followed by the string down)
Keep header line

→ Yes
Click the Run Tool button

Rename the collection Select on collection... to top down-regulated gene lists

Keep the last three generated collections for later comparison with the edgeR and limma tools.
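
The two selection runs above can be sketched as follows. The pattern `\tup` (or `\tdown`) is searched in each tab-separated line, and the header line is always kept. The lines and the function name `select_lines` are hypothetical, and the tables are shortened to three columns for readability.

```python
import re

# Hypothetical tab-separated lines of an "oriented" top gene list.
lines = [
    "GeneID\tlog2(FC)\tRegulation",
    "geneA\t3.1\tup",
    "geneB\t-2.6\tdown",
    "geneE\t2.2\tup",
]


def select_lines(table_lines, pattern, keep_header=True):
    """Return lines matching the regex pattern, optionally keeping line 1."""
    regex = re.compile(pattern)
    header = table_lines[:1] if keep_header else []
    start = 1 if keep_header else 0
    return header + [ln for ln in table_lines[start:] if regex.search(ln)]


up_list = select_lines(lines, r"\tup")
down_list = select_lines(lines, r"\tdown")
print(up_list)
print(down_list)
```

Requiring the tab before `up` or `down` ties the match to the start of a cell, so a gene name that merely contains those letters is not selected by mistake.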

Plotting a heatmap of the most significantly de-regulated genes¶

For this, we are going to collect and gather all significantly de-regulated genes in any of the 3 comparisons, and to intersect (join operation) this list with the rLog normalized count table previously generated.

Use `Advanced cut` to select the list of deregulated genes in all three comparisons¶

Advanced cut settings

File to cut

→ Top gene lists (this is a collection)
Operation

→ Keep
Delimited by

→ Tab
Cut by

→ fields
List of Fields

→ Column 1
First line is a header line
Click the Run Tool button

Rename this collection of single column datasets top genes names

Next we concatenate the three datasets of the previous collection in a single dataset¶

We do that using the Concatenate multiple datasets tail-to-head while specifying how tool

Concatenate multiple datasets tail-to-head while specifying how settings

What type of data do you wish to concatenate?

→ Single datasets
Concatenate Datasets

→ Click on the collection icon and select top genes names
Include dataset names?

→ No
Number of lines to skip at the beginning of each concatenation:

→ 1
Click the Run Tool button

Rename the returned single dataset as Pooled top genes

Next we extract unique gene names from the `Pooled top genes` dataset¶

You probably agree that the same gene may be deregulated in the three pair-wise comparisons which we have performed with DESeq2.

Thus we need to eliminate the redundancy, using the tool Unique occurrences of each record.

Unique occurrences of each record settings

File to scan for unique values

→ Pooled top genes
Ignore differences in case when comparing

→ No
Column only contains numeric values

→ No
Advanced Options

→ Leave as Hide Advanced Options
Click the Run Tool button
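
The cut, concatenate and unique steps above can be summarised in a short Python sketch. The comparison names, gene names and function names are hypothetical, and sorting the unique names is an assumption of this sketch rather than a statement about the tool's output order.

```python
# Hypothetical top gene lists, one per pair-wise comparison (made-up values).
HEADER = ["GeneID", "log2(FC)"]
top_lists = {
    "comparison_1": [HEADER, ["geneA", "3.1"], ["geneB", "-2.6"]],
    "comparison_2": [HEADER, ["geneB", "-3.0"], ["geneE", "2.2"]],
    "comparison_3": [HEADER, ["geneA", "2.4"]],
}


def first_column(table):
    """Advanced cut: keep column 1 only (header cell included)."""
    return [row[0] for row in table]


def pool(columns, skip=1):
    """Concatenate the columns, skipping the first `skip` lines of each."""
    pooled = []
    for column in columns:
        pooled.extend(column[skip:])
    return pooled


def unique(names):
    """Remove duplicated gene names (sorted here for a stable order)."""
    return sorted(set(names))


name_columns = [first_column(table) for table in top_lists.values()]
pooled_top_genes = pool(name_columns)
unique_genes = unique(pooled_top_genes)
print(pooled_top_genes)
print(unique_genes)
```

Skipping one line per dataset drops the three header lines, which is why a header has to be added back afterwards.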

Add a header to the list of unique gene names associated with significant DE in any of the comparisons¶

We do this with the tool Add Header

Add Header settings

List of Column headers (comma delimited, e.g. C1,C2,...)

→ DESeq_All_DE_genes
Data File (tab-delimted)

→ Unique on data 1xx...
Click the Run Tool button

Rename the generated dataset DESeq_All_DE_genes

Intersection (join operation) between the list of unique gene names associated with DE and the rLog-Normalized counts file.¶

We now use the rLog-Normalized counts file on data... and intersect it (join operation) with the list of genes that are DE in any of the three comparisons.

To do this, we are going to use the tool Join two files

Join two files settings

1st file

→ rLog-Normalized counts file on data...
Column to use from 1st file

→ 1
2nd File

→ DESeq_All_DE_genes
Column to use from 2nd file

→ 1

→ 1
Output lines appearing in

→ Both 1st and 2nd files
First line is a header line

→ Yes
Ignore case

→ No
Value to put in unpaired (empty) fields

→ NA
Click the Run Tool button

Rename the output dataset rLog-Normalized counts of DE genes
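
Conceptually, this join keeps the rows of the normalized counts table whose gene name (column 1) also appears in the DE gene list. A minimal Python sketch with hypothetical sample names, values and function name follows. It is a simplification: the second file has a single column, so it contributes no extra columns to the output.

```python
# Hypothetical rLog-normalized counts table (made-up sample names and values).
counts = [
    ["GeneID", "Oc_1", "Mo_1", "Dc_1"],
    ["geneA", "10.2", "7.9", "8.1"],
    ["geneB", "5.5", "8.3", "8.0"],
    ["geneC", "6.1", "6.0", "6.2"],
]

# Hypothetical single-column DE gene list with its header.
de_genes = [["DESeq_All_DE_genes"], ["geneA"], ["geneB"], ["geneE"]]


def inner_join(first, second):
    """Keep the header of `first` and its rows whose key is also in `second`."""
    wanted = {row[0] for row in second[1:]}
    return [first[0]] + [row for row in first[1:] if row[0] in wanted]


de_counts = inner_join(counts, de_genes)
print([row[0] for row in de_counts])
```

With "Both 1st and 2nd files" selected, genes present in only one file (geneC and geneE here) are dropped, so no unpaired field needs to be filled.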

Plot a heatmap of the rLog-Normalized counts of DE genes in all three conditions¶

We do this using the Plot heatmap with high number of rows tool

Plot heatmap with high number of rows settings

Input should have column headers - these will be the columns that are plotted

→ rLog-Normalized counts of DE genes
Data transformation

→ Plot the data as it is
Enable data clustering

→ Yes
Clustering columns and rows

→ Cluster rows and not columns
Distance method

→ Euclidean
Clustering method

→ Complete
Labeling columns and rows

→ Label columns and not rows
Coloring groups

→ Blue to white to red
Data scaling

→ Scale my data by row
tweak plot height

→ 35
tweak row label size

→ 1
tweak line height

→ 24
Run Tool
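
To give an idea of what "Scale my data by row" does, here is a minimal Python sketch of row-wise scaling, assuming the conventional z-score (subtract the row mean, divide by the row sample standard deviation). The function name and values are hypothetical, and the tool's exact computation is not confirmed here.

```python
import statistics


def scale_row(values):
    """Return the z-scores of one row: (value - mean) / sample std dev."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # A flat row carries no contrast: return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(value - mean) / sd for value in values]


# Hypothetical rLog-normalized counts of one gene across three samples.
row = [1.0, 2.0, 3.0]
scaled = scale_row(row)
print(scaled)
```

After this scaling every gene is centred on 0, so the blue to white to red colours reflect how each sample deviates from that gene's own average rather than its absolute expression level.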