SARS-CoV-2 (COVID-19) bioinformatics resources


Here we provide resources useful for SARS-CoV-2 (COVID-19) bioinformatics & genomics. Over time, and depending on the community's response, we will publish additional resources. Please cross-check the material and send us feedback if something is missing or if you have suggestions for improvement.

For feedback, questions or requests regarding specific bioinformatics services: michael.schmid@genexa.ch

If you want to cite this resource, a short preprint will be available on bioRxiv soon. Please check back later.

Centrifuge indexes including SARS-CoV-2 (Updated March 27th)

Here we provide Centrifuge indexes. Building of indexes was performed as described here: Database download and index building (see also methods below).

🔗 Centrifuge "human and virus" (h+v) index based on NCBI RefSeq and NCBI Virus genomes (VERSION 2, including 138 SARS-CoV-2 complete genomes from NCBI Virus in addition to RefSeq): Download (Data fetched March 27th, 2020 from NCBI RefSeq + Virus)
Because this index contains many SARS-CoV-2 genomes, you may want to add --host-taxids 2697049 to the Centrifuge run. Otherwise Centrifuge will report the species (taxid: 694009) rather than the leaf taxid for SARS-CoV-2 (2697049). If a single SARS-CoV-2 genome in the index (the one from RefSeq) is sufficient for you, use VERSION 1 linked below, which does not need this option. If you want to use this index but are not interested in SARS-CoV-2 in particular, you can also omit the option.
🔗 md5 Centrifuge h+v index | Methods describing build of Centrifuge h+v index
🔗 OLD VERSION 1: h+v index | h+v md5
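
As a rough illustration of the option discussed above, the following Python sketch assembles a single-end Centrifuge command line that passes --host-taxids 2697049. The index prefix, read file and output file names are hypothetical placeholders, and the thread count is an arbitrary assumption; adapt them to your own setup.

```python
import shlex

SARS_COV_2_TAXID = 2697049  # leaf taxid for SARS-CoV-2, as given in the text


def centrifuge_command(index_prefix, reads, output, report,
                       host_taxids=(SARS_COV_2_TAXID,), threads=4):
    """Build the argument list for a single-end Centrifuge run.

    All paths are placeholders chosen for this example.
    """
    cmd = [
        "centrifuge",
        "-x", index_prefix,        # basename of the downloaded index
        "-U", reads,               # unpaired reads
        "-S", output,              # per-read classification output
        "--report-file", report,   # summary report
        "-p", str(threads),
    ]
    if host_taxids:
        # Comma-separated taxids that Centrifuge should prefer.
        cmd += ["--host-taxids", ",".join(str(t) for t in host_taxids)]
    return cmd


cmd = centrifuge_command("hv_index/hv", "reads.fastq",
                         "classification.tsv", "report.tsv")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once Centrifuge is installed. Passing host_taxids=() omits the option, for example when using VERSION 1 of the index.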

🔗 Centrifuge "human, prokaryotes and virus" (h+p+v) index based on RefSeq genomes: Download (Data fetched March 18th, 2020 from NCBI RefSeq)
🔗 md5 Centrifuge h+p+v index | Methods describing build of Centrifuge h+p+v index
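
The md5 files linked above can be used to check that a large index download arrived intact. A minimal Python sketch follows; it assumes the md5 file uses the common md5sum layout (hex digest first, then the file name), which may not match the files provided here, and all file names are placeholders.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5(md5_path):
    """Return the first whitespace-separated token of the md5 file.

    Assumption: the file follows the md5sum layout '<digest>  <filename>'.
    """
    with open(md5_path) as handle:
        return handle.read().split()[0].lower()


def download_is_intact(data_path, md5_path):
    """True if the computed digest matches the published one."""
    return md5_of_file(data_path) == read_expected_md5(md5_path)
```

For example, download_is_intact("hv_index.tar.gz", "hv_index.tar.gz.md5") returns True only when the two digests agree; both names are hypothetical.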


Kraken2 index including SARS-CoV-2

Here we provide a Kraken2 index. It was compiled on March 19th 2020 and hence contains SARS-CoV-2. Building of the index was performed as described here: Custom Databases (see also methods below).

🔗 Kraken2 "human and virus" (h+v) database based on RefSeq genomes: Download (Data fetched March 19th, 2020 from NCBI RefSeq)
🔗 md5 Kraken2 h+v index | Methods describing build of Kraken2 h+v index


MetaMaps index including SARS-CoV-2

Here we provide a MetaMaps database. It was compiled on March 26th 2020 and hence contains SARS-CoV-2 (1 from RefSeq and 87 from GenBank). Building of the database was performed as described here: Custom Databases (see also methods below).

🔗 MetaMaps "human and virus" (h+v) database based on RefSeq genomes: Download (Data fetched March 26th, 2020 from NCBI RefSeq & NCBI Virus)
🔗 md5 MetaMaps h+v index | Methods describing build of MetaMaps database | List of genomes included in MetaMaps database


Kmers unique to SARS-CoV-2 (Version 3, Updated March 29th)

Kmers unique to SARS-CoV-2 might be helpful to design novel diagnostic tests, for screening of high throughput sequencing data and similar applications. We provide the unique kmers for lengths 19-25 bp.

Kmers unique to public SARS-CoV-2 genomes (March 29th 2020, based on NCBI's Virus & Assembly databases and CNGBdb)
These kmers occur in all examined SARS-CoV-2 genomes BUT are not contained in the off-target genomes we tested. Off-target hits are defined as either exact matches or matches with up to an edit distance of 1-4 (depending on the kmer length). For more information and methods, please check README.txt.
Be aware that we only tested against non-SARS2 viral genomes as well as against the human genome (GRCh38) for off-target hits. So there might be (very rare) off-target hits against prokaryotes/eukaryotes (not so much a concern for the longer kmers).
🔗 Kmers unique to SARS-CoV-2: As *.tar.gz (version 3) or as *.zip (version 3)
🔗 List of examined SARS-CoV-2 genome assemblies for the analysis (n=153, 146 from NCBI, 7 from CNGBdb): Download (version 3)
🔗 List of contigs for ~8.8k riboviria genomes for detection of exact off-target hits: Download (version 3)
🔗 List of contigs for 116 cornidovirineae genomes for detection of off-target hits with edit distance up to 4: Download (version 3)
🔗 README.txt including methods: Download (version 3)
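
To make the idea of "unique kmers" concrete, here is a simplified Python sketch: it keeps kmers that occur in every target genome and removes those found in any off-target genome. This is not the pipeline described in README.txt; it only handles exact forward-strand matches and ignores the edit-distance criterion and reverse complements. The function names are made up for this example.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(genomes, k):
    """Kmers present in every genome of the list."""
    sets = [kmers(genome, k) for genome in genomes]
    return set.intersection(*sets) if sets else set()


def unique_kmers(targets, off_targets, k):
    """Kmers shared by all targets with no exact hit in any off-target.

    Simplification: exact matches only, forward strand only.
    """
    candidates = shared_kmers(targets, k)
    for genome in off_targets:
        candidates -= kmers(genome, k)
    return candidates


# Toy example with very short sequences and k=4.
targets = ["ACGTACGT", "TACGTAC"]
off_targets = ["GGACGTGG"]
print(sorted(unique_kmers(targets, off_targets, 4)))
```

Holding one set per genome in memory is fine for ~30 kb viral genomes, but an off-target as large as the human genome would need an indexed or streaming approach instead.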

Additional - Non-unique kmers occurring in all public SARS-CoV-2 genomes (March 29th 2020, based on NCBI's Virus database & CNGBdb)
Those kmers are not necessarily unique to SARS-CoV-2 genomes but are contained in all examined SARS2 genomes (no check for off-targets).
🔗 Kmers common to all SARS-CoV-2 genomes: As *.tar.gz (version 3) or as *.zip (version 3)
🔗 List of examined SARS-CoV-2 genome assemblies for the analysis: Download (version 3)


Complete and public SARS-CoV-2 genome assemblies as a batch download (currently n=153)

Yes, it is a simple task to download complete SARS-CoV-2 genomes from NCBI's or CNGBdb's databases. However, we provide the following FASTA files for people who are not that familiar with NCBI's and CNGBdb's interfaces. Also, it is painful to (1) get access to GISAID and then (2) batch download data from there. So the 153 genomes as one package might be a good starting point for many to conduct an analysis.

🔗 Complete genome assemblies from NCBI Virus database as one FASTA file (n=146, genomes with more than 5 "N" (gaps) in the assembly are excluded): Download (downloaded March 29th 2020)
You should be able to query this yourself via this link.

🔗 Seven more (quality checked and sanitized) genome assemblies from CNGBdb database as one FASTA file (n=7): Download (downloaded March 21st 2020).
You can get them via CNGB's COV-19 database

🔗 Complete genomes from above combined as one FASTA file (n=153): Download (compiled March 29th 2020 from the files above).
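
If you download genomes yourself and want to apply a gap filter similar to the one mentioned above (assemblies with more than 5 "N" are excluded), a small Python sketch could look like this. It is an illustration under the assumption that the filter simply counts "N" characters per record; the helper names are invented for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_gaps(records, max_n=5):
    """Keep records whose sequence contains at most max_n 'N' characters."""
    for header, seq in records:
        if seq.upper().count("N") <= max_n:
            yield header, seq


# Toy input: record "b" has 6 N and is dropped.
example = [">a", "ACGT", "NN", ">b", "NNNNNN", "ACGT", ">c", "ACGTNNNNN"]
for header, seq in filter_by_gaps(read_fasta(example)):
    print(header, len(seq))
```

With a real file, pass an open file handle instead of the list, e.g. read_fasta(open("genomes.fasta")), where the file name is a placeholder.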


Multiple sequence alignment (MSA) of complete SARS-CoV-2 genome assemblies (currently n=153)

Multiple sequence alignment on nucleotide level of 153 complete SARS-CoV-2 genome assemblies, generated using MUSCLE. For the source of the genomes, see above.
🔗 Multiple sequence alignment in Clustal format: Download (generated with genomes available March 29th 2020)
🔗 Multiple sequence alignment in FASTA format: Download (generated with genomes available March 29th 2020)
🔗 Consensus sequence based on the MSA. Every position with at least one ambiguous base in the alignment (Indel/SNP) is shown as "N": Download (generated with genomes available March 29th 2020)
🔗 Consensus regions on the MSA split up by ambiguous regions. Regions shorter than 16bp were discarded: Download (generated with genomes available March 29th 2020)
🔗 Methods for MSA and consensus regions
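
The consensus and consensus-region files described above can be approximated with a few lines of Python. The sketch below is not the linked method; it assumes that a column counts as ambiguous whenever the aligned sequences do not all carry the same A/C/G/T base (so gaps and IUPAC codes also yield "N"), and the function names are hypothetical.

```python
import re


def strict_consensus(aligned):
    """Column-wise consensus of equal-length aligned sequences.

    A column becomes 'N' unless every sequence has the same A/C/G/T base.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        bases = {base.upper() for base in column}
        if len(bases) == 1 and next(iter(bases)) in "ACGT":
            out.append(bases.pop())
        else:
            out.append("N")
    return "".join(out)


def consensus_regions(consensus, min_length=16):
    """Split the consensus at 'N' and keep runs of at least min_length.

    Returns (0-based start, sequence) tuples.
    """
    return [
        (match.start(), match.group())
        for match in re.finditer(r"[^N]+", consensus)
        if len(match.group()) >= min_length
    ]


# Toy alignment: one variable column, followed by a run that is too short.
aligned = ["A" * 16 + "C" + "GGG",
           "A" * 16 + "T" + "GGG",
           "A" * 16 + "-" + "GGG"]
consensus = strict_consensus(aligned)
print(consensus)
print(consensus_regions(consensus))
```

The min_length default of 16 mirrors the 16bp cutoff stated above; the start coordinates refer to alignment columns, not to positions in any single genome.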