Estimation of the strandness
In practice, with Illumina RNA-seq protocols you will most likely deal with either:
- Unstranded RNA-seq data
- Stranded RNA-seq data produced with kits based on dUTP tagging (ISR)
This information should be provided with your FASTQ files. If it is not, ask your sequencing facility, or look for it on the site where you downloaded the data or in the corresponding publication.
Another option is to estimate these parameters with a tool called Infer Experiment from the RSeQC tool suite. This tool takes the output of your mappings (BAM files), selects
a subsample of your reads and compares their genome coordinates and strands with those of
the reference gene model (from an annotation file).
Based on the strand of the genes, it can gauge whether sequencing is strand-specific, and if so, how reads are stranded.
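The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the RSeQC implementation: the function name `strand_fractions` and the tuple layout are hypothetical, and the sketch assumes that each sampled read has already been matched to the strand of the gene it overlaps (or to `None` when no single strand applies).

```python
from collections import Counter

# Strand combinations in Infer Experiment notation: read strand, then gene
# strand, prefixed with the mate number for paired-end reads.
SF_KEYS = {"++", "--", "1++", "1--", "2+-", "2-+"}
SR_KEYS = {"+-", "-+", "1+-", "1-+", "2++", "2--"}


def strand_fractions(reads):
    """Return the fraction of reads counted as SF, SR, or undetermined.

    `reads` is an iterable of (mate, read_strand, gene_strand) tuples
    (hypothetical layout): mate is 1 or 2 for paired-end reads and None for
    single-end reads; gene_strand is "+", "-", or None when the read cannot
    be assigned to a single gene strand.
    """
    counts = Counter()
    for mate, read_strand, gene_strand in reads:
        if gene_strand is None:
            counts["failed"] += 1
            continue
        key = read_strand + gene_strand
        if mate is not None:
            key = str(mate) + key
        if key in SF_KEYS:
            counts["SF"] += 1
        elif key in SR_KEYS:
            counts["SR"] += 1
        else:
            counts["failed"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"SF": 0.0, "SR": 0.0, "failed": 0.0}
    return {name: counts[name] / total for name in ("SF", "SR", "failed")}


# Toy paired-end example: four reads consistent with SF, one undetermined.
example_reads = [
    (1, "+", "+"),
    (2, "-", "+"),
    (1, "-", "-"),
    (2, "+", "-"),
    (1, "+", None),
]
example_fractions = strand_fractions(example_reads)
```

A library where almost all reads fall in one category is stranded in that orientation; a roughly even split between SF and SR points to an unstranded library.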
Use of the Infer Experiment tool
Convert GTF to BED12 tool: convert the GTF file to BED12
- Go to your history (STAR or HISAT2)
- Select the tool Convert GTF to BED12
  - GTF File to convert: Drosophila_melanogaster.BDGP6.95.gtf
- Execute
Infer Experiment tool: determine the library strandness
- In the same history (STAR or HISAT2)
- Select the tool Infer Experiment
  - Input .bam file: mapped.bam files (outputs of RNA STAR or HISAT2 tools)
  - Reference gene model: BED12 file (output of Convert GTF to BED12 tool)
  - Number of reads sampled from SAM/BAM file (default = 200000): 200000
- Execute
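Outside Galaxy, the same step corresponds to a call to the RSeQC script `infer_experiment.py`. The sketch below only builds that command line in Python; it assumes RSeQC is installed and on the `PATH`, the helper name `infer_experiment_command` is hypothetical, and `annotation.bed12` is a placeholder for the BED12 file produced in the previous step.

```python
import shlex


def infer_experiment_command(bam_path, bed_path, sample_size=200000):
    """Build the infer_experiment.py command for one BAM file.

    -i: input BAM/SAM file, -r: reference gene model in BED format,
    -s: number of reads sampled from the alignment file.
    """
    return [
        "infer_experiment.py",
        "-i", bam_path,
        "-r", bed_path,
        "-s", str(sample_size),
    ]


# Placeholder file names; replace them with your own files.
command = infer_experiment_command("mapped.bam", "annotation.bed12")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once per BAM file, redirecting standard output to a text file so that the reports can be summarized later.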
Summarize results with the MultiQC tool
- Select the tool MultiQC
  - Which tool was used generate logs?: RSeQC
  - RSeQC output (Type of RSeQC output?): infer_experiment
  - Select the 7 datasets of type Infer Experiment on ...
- Execute
The Infer Experiment tool generates one file with information on:
- Paired-end or single-end library
- Fraction of reads failed to determine
- 2 lines:
  - For single-end
    - Fraction of reads explained by “++,--” (SF in previous figure)
    - Fraction of reads explained by “+-,-+” (SR in previous figure)
  - For paired-end
    - Fraction of reads explained by “1++,1--,2+-,2-+” (SF in previous figure)
    - Fraction of reads explained by “1+-,1-+,2++,2--” (SR in previous figure)
If the two “Fraction of reads explained by” numbers are close to each other (i.e. a mix of SF and SR), we conclude that the library is not a strand-specific dataset (U in previous figure).
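As a rough illustration of this decision rule, the following Python sketch parses the text report and assigns a library type. The function names and the two thresholds (`dominant=0.8`, `balanced=0.1`) are assumptions chosen for the example, not values defined by RSeQC, and the numbers in the sample report are illustrative rather than results from this tutorial's data.

```python
import re

# Matches lines such as: Fraction of reads explained by "++,--": 0.9840
FRACTION_RE = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')

SF_PATTERNS = {"++,--", "1++,1--,2+-,2-+"}
SR_PATTERNS = {"+-,-+", "1+-,1-+,2++,2--"}


def parse_infer_experiment(text):
    """Return (sf, sr) fractions read from an Infer Experiment report."""
    sf = sr = 0.0
    for pattern, value in FRACTION_RE.findall(text):
        if pattern in SF_PATTERNS:
            sf = float(value)
        elif pattern in SR_PATTERNS:
            sr = float(value)
    return sf, sr


def call_library_type(sf, sr, dominant=0.8, balanced=0.1):
    """Classify a library as SF, SR, U, or unclear (thresholds are assumptions)."""
    if sf >= dominant and sf > sr:
        return "SF"
    if sr >= dominant and sr > sf:
        return "SR"
    if abs(sf - sr) <= balanced:
        return "U"
    return "unclear"


# Illustrative paired-end report with made-up fractions.
report = (
    "This is PairEnd Data\n"
    "Fraction of reads failed to determine: 0.0172\n"
    'Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903\n'
    'Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925\n'
)
sf, sr = parse_infer_experiment(report)
library_type = call_library_type(sf, sr)
```

The "unclear" outcome is kept separate on purpose: an intermediate split (for example 0.7 versus 0.3) is better inspected by hand than forced into one of the three categories.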
As it is sometimes quite difficult to find out which settings correspond to those of other programs, the following table might be helpful to identify the library type:
Library type | Infer Experiment | TopHat | HISAT | htseq-count | featureCounts |
---|---|---|---|---|---|
Paired-End (PE) - SF | 1++,1--,2+-,2-+ | FR Second Strand | Second Strand F/FR | yes | Forward (1) |
PE - SR | 1+-,1-+,2++,2-- | FR First Strand | First Strand R/RF | reverse | Reverse (2) |
Single-End (SE) - SF | ++,-- | FR Second Strand | Second Strand F/FR | yes | Forward (1) |
SE - SR | +-,-+ | FR First Strand | First Strand R/RF | reverse | Reverse (2) |
PE, SE - U | undecided | FR Unstranded | default | no | Unstranded (0) |
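
When the same library type has to be passed to several tools, the table can be kept as a small lookup structure. The sketch below simply transcribes the rows above into a Python dictionary; the names `STRAND_SETTINGS` and `settings_for` are hypothetical, and the values are the labels from the table, not validated command-line flags.

```python
# Settings per library type, transcribed from the table above.
STRAND_SETTINGS = {
    "SF": {
        "tophat": "FR Second Strand",
        "hisat": "Second Strand F/FR",
        "htseq_count": "yes",
        "featurecounts": 1,
    },
    "SR": {
        "tophat": "FR First Strand",
        "hisat": "First Strand R/RF",
        "htseq_count": "reverse",
        "featurecounts": 2,
    },
    "U": {
        "tophat": "FR Unstranded",
        "hisat": "default",
        "htseq_count": "no",
        "featurecounts": 0,
    },
}


def settings_for(library_type):
    """Return the per-tool strandness settings for SF, SR, or U."""
    try:
        return STRAND_SETTINGS[library_type]
    except KeyError:
        raise ValueError(f"Unknown library type: {library_type!r}") from None


reverse_settings = settings_for("SR")
```

For example, `settings_for("SR")["htseq_count"]` gives `"reverse"`, the value to select for a reverse-stranded library in htseq-count.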