🔗 Centrifuge "human and virus" (h+v) index based on NCBI RefSeq and NCBI Virus genomes (VERSION 2, including 138 SARS-CoV-2 complete genomes from NCBI Virus in addition to RefSeq): Download (Data fetched March 27th, 2020 from NCBI RefSeq + Virus)
Because this index contains many SARS-CoV-2 genomes, you might want to add --host-taxids 2697049 to the Centrifuge run. Otherwise Centrifuge will report the species taxid (694009) rather than the leaf taxid for SARS-CoV-2 (2697049). If a single SARS-CoV-2 genome in the index (the one from RefSeq) is sufficient for you, you can use VERSION 1 linked below, where this does not matter. If you want to use this index but are not interested in SARS-CoV-2 in particular, you can also omit the option.
🔗 md5 Centrifuge h+v index | Methods describing build of Centrifuge h+v index
🔗 OLD VERSION 1: h+v index | h+v md5
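As a minimal sketch of the --host-taxids note for the VERSION 2 h+v index, the following Python snippet assembles a single-end Centrifuge command line. The index prefix and file names are hypothetical placeholders, not the names of the files offered here; only the taxid 2697049 comes from the note above.

```python
import shlex

SARS_COV_2_TAXID = 2697049  # leaf taxid named in the note above


def centrifuge_command(index_prefix, reads_fastq, out_tsv, report_tsv,
                       prefer_sars_cov_2=True):
    """Build an argument list for a single-end Centrifuge run."""
    cmd = [
        "centrifuge",
        "-x", index_prefix,           # index basename (hypothetical name below)
        "-U", reads_fastq,            # unpaired reads
        "-S", out_tsv,                # per-read classification output
        "--report-file", report_tsv,  # per-taxon summary
    ]
    if prefer_sars_cov_2:
        # Prefer the SARS-CoV-2 leaf taxid over the species-level taxid.
        cmd += ["--host-taxids", str(SARS_COV_2_TAXID)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = centrifuge_command("hv_index", "reads.fastq", "classification.tsv", "report.tsv")
print(shlex.join(cmd))
```

Once Centrifuge is installed and the index is unpacked, the list can be passed to subprocess.run(cmd, check=True). Setting prefer_sars_cov_2=False corresponds to omitting the option.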
🔗 Centrifuge "human, prokaryotes and virus" (h+p+v) index based on RefSeq genomes: Download (Data fetched March 18th, 2020 from NCBI RefSeq)
🔗 md5 Centrifuge h+p+v index | Methods describing build of Centrifuge h+p+v index
🔗 Kraken2 "human and virus" (h+v) database based on RefSeq genomes: Download (Data fetched March 19th, 2020 from NCBI RefSeq)
🔗 md5 Kraken2 h+v index | Methods describing build of Kraken2 h+v index
Here we provide a MetaMaps database. It was compiled on March 26th 2020 and hence contains SARS-CoV-2 genomes (1 from RefSeq and 87 from GenBank). The database was built as described here: Custom Databases (see also methods below).
🔗 MetaMaps "human and virus" (h+v) database based on RefSeq genomes: Download (Data fetched March 26th, 2020 from NCBI RefSeq & NCBI Virus)
🔗 md5 MetaMaps h+v index | Methods describing build of MetaMaps database | List of genomes included in MetaMaps database
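Each index and database above comes with an md5 file. A minimal sketch for checking a download against it follows, assuming the md5 file uses the common md5sum layout (hex digest as the first whitespace-separated token); that layout is an assumption, as are the function names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path, md5_path):
    """Compare the archive digest with the first token of the md5 file.

    Assumes md5sum-style content: "<hex digest>  <file name>".
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte index archives.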
Kmers unique to SARS-CoV-2 might be helpful for designing novel diagnostic tests, for screening high-throughput sequencing data, and for similar applications. We provide the unique kmers for lengths 19-25 bp.
Kmers unique to public SARS-CoV-2 genomes (March 29th 2020, based on NCBI's Virus & Assembly databases and CNGBdb)
These kmers occur in all examined SARS-CoV-2 genomes but are not contained in the off-target genomes we tested. Off-target hits are defined as either exact matches or matches within an edit distance of up to 1-4 (depending on the kmer length). For more information and methods, please check README.txt.
Be aware that we only tested against non-SARS2 viral genomes and against the human genome (GRCh38) for off-target hits. There might therefore be (very rare) off-target hits against prokaryotes or other eukaryotes (less of a concern for the longer kmers).
🔗 Kmers unique to SARS-CoV-2: As *.tar.gz (version 3) or as *.zip (version 3)
🔗 List of examined SARS-CoV-2 genome assemblies for the analysis (n=153, 146 from NCBI, 7 from CNGBdb): Download (version 3)
🔗 List of contigs for ~8.8k riboviria genomes for detection of exact off-target hits: Download (version 3)
🔗 List of contigs for 116 cornidovirineae genomes for detection of off-target hits with edit distance up to 4: Download (version 3)
🔗 README.txt including methods: Download (version 3)
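The definition of a unique kmer given above (present in every target genome, absent from the off-target genomes both exactly and within a small edit distance) can be sketched on toy sequences as follows. This is not the pipeline described in README.txt: the function names are hypothetical, the comparison only considers off-target windows of the same length k, and the all-against-all loop would not scale to thousands of genomes.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def unique_kmers(target_genomes, off_target_genomes, k, max_dist=0):
    """Kmers found in every target genome and not near any off-target kmer.

    Requires at least one target genome. max_dist=0 means exact
    off-target matches only.
    """
    common = set.intersection(*(kmers(g, k) for g in target_genomes))
    off = set().union(*(kmers(g, k) for g in off_target_genomes))
    return {
        km for km in common
        if km not in off and all(edit_distance(km, o) > max_dist for o in off)
    }
```

For example, unique_kmers(["AAACGTTT", "CCACGTGG"], ["TTTTACGATTTT"], k=4) keeps "ACGT", while raising max_dist to 1 drops it because the off-target window "ACGA" is one substitution away.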
Additional - Non-unique kmers occurring in all public SARS-CoV-2 genomes (March 29th 2020, based on NCBI's Virus database & CNGBdb)
Those kmers are not necessarily unique to SARS-CoV-2 genomes but are contained in all examined SARS2 genomes (no check for off-targets).
🔗 Kmers common to all SARS-CoV-2 genomes: As *.tar.gz (version 3) or as *.zip (version 3)
🔗 List of examined SARS-CoV-2 genome assemblies for the analysis: Download (version 3)
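One application mentioned above is screening sequencing reads against a kmer set. A minimal sketch, assuming the kmers have already been loaded into a Python set (the layout of the files inside the archives is not assumed here) and using hypothetical function names:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (other symbols kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmer_hits(read, kmer_set, k):
    """Count windows of the read, on either strand, that occur in kmer_set."""
    read = read.upper()
    hits = 0
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if window in kmer_set or reverse_complement(window) in kmer_set:
            hits += 1
    return hits
```

Both strands are checked because a read may originate from either strand of the sequenced molecule. How many hits should count as evidence is left open here; with the non-unique kmers, hits may also come from related viruses.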
Yes, it is a simple task to download complete SARS-CoV-2 genomes from NCBI's or CNGBdb's databases. However, we provide the following FASTA files for people who are not that familiar with NCBI's and CNGBdb's interfaces. Also, it is painful to (1) get access to GISAID and then (2) batch download data from there. So the 153 genomes as one package might be a good starting point for many to conduct an analysis.
🔗 Complete genome assemblies from NCBI Virus database as one FASTA file (n=146, genomes with more than 5 "N" (gaps) in the assembly are excluded): Download (downloaded March 29th 2020)
You should be able to query this yourself via this link.
🔗 Complete genomes from above combined as one FASTA file (n=153): Download (compiled March 29th 2020 from the files above).
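The exclusion rule mentioned above (genomes with more than 5 "N" are dropped) can be reproduced on any FASTA file with a few lines of Python. This is a minimal sketch with hypothetical function names, not the script used to build the files offered here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_n_count(records, max_n=5):
    """Keep records with at most max_n 'N' characters (case-insensitive)."""
    return [(h, s) for h, s in records if s.upper().count("N") <= max_n]
```

An open file handle can be passed directly, for example filter_by_n_count(read_fasta(open(path))), where path points to a downloaded FASTA file.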
Multiple sequence alignment at the nucleotide level of 153 complete SARS-CoV-2 genome assemblies, generated using MUSCLE. For the source of the genomes, see above.
🔗 Multiple sequence alignment in Clustal format: Download (generated with genomes available March 29th 2020)
🔗 Multiple sequence alignment in FASTA format: Download (generated with genomes available March 29th 2020)
🔗 Consensus sequence based on the MSA. Every position with at least one ambiguous base in the alignment (Indel/SNP) is shown as "N": Download (generated with genomes available March 29th 2020)
🔗 Consensus regions on the MSA split up by ambiguous regions. Regions shorter than 16 bp were discarded: Download (generated with genomes available March 29th 2020)
🔗 Methods for MSA and consensus regions
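The two consensus files can be pictured with a small sketch: a column becomes "N" unless every aligned sequence shows the same base, and the consensus is then cut at each "N", keeping only stretches of at least 16 bp. The function names are hypothetical, and treating gap-only or non-ACGT columns as "N" is an assumption; the linked methods describe the procedure actually used.

```python
def strict_consensus(aligned):
    """Return the shared base per column, or 'N' where rows disagree."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    out = []
    for column in zip(*aligned):
        bases = {c.upper() for c in column}
        only = next(iter(bases))
        # Assumption: gaps and ambiguity codes never count as agreement.
        out.append(only if len(bases) == 1 and only in "ACGT" else "N")
    return "".join(out)


def consensus_regions(consensus, min_len=16):
    """Return (0-based start, sequence) for each N-free run >= min_len."""
    regions, start = [], 0
    for run in consensus.split("N"):
        if len(run) >= min_len:
            regions.append((start, run))
        start += len(run) + 1  # skip the run and the 'N' that ended it
    return regions
```

For instance, the rows "ACGTACGT", "ACGTACGT" and "ACCTAC-T" give the consensus "ACNTACNT"; with min_len=3 the only region kept is "TAC", starting at position 3.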