Reference genomes
Reference Genomes¶
-
Fasta format
-
Assembly version, generally, associated to a number and a date of assembly
-
A same assembly may be provided by various organisation (Genome Resource Consortium, Ensembl, NCBI, UCSC, etc)
This will be the same DNA sequence but formats may differ:
- by the name of the chromosomes (chr1, 1, NC_000001.11, ...)
- by the presence (or the absence) of unmapped contigs and haplotypes
exemple 1: human genome¶
- GRCh37/hg19 - juil 2007
- GRCh38/hg38 - déc 2011
- GRCh39/hg39 - juin 2020 (repeat ++)
This various versions (or "releases") may in addition contain
- chromosomal regions "Aplotypes" (HLA, HBV inserts, etc…)
- unmapped contigs (regions which are significant assembly of reads, but are not assigned to a specific chromosome)
exemple 2: mouse genome¶
Release name | Date of release | Equivalent UCSC version |
GRCm39 | June 2020 | mm39 |
GRCm38 | Dec 2011 | mm10 |
NCBI Build 37 | Jul 2007 | mm9 |
NCBI Build 36 | Feb 2006 | mm8 |
NCBI Build 35 | Aug 2005 | mm7 |
NCBI Build 34 | Mar 2005 | mm6 |
Annotations¶
It is important to note that annotations of genomes (GTF, GFF, etc.) although generally equivalent, are strictly linked to their genome version because they refere to the DNA sequences using the format of the release. This is why a GTF annotation file downloaded from Ensembl is not interchangeable with a GTF annotation file from the UCSC or from another organisation.
Moreover, since genome annotations may be considered as genome metadata (data on data), it is normal and expected that genome annotation versions are different from genome versions and that they are released at a faster pace.