NCBI News, March 2011

NCBI Responds to a Report of Contamination in the Sequence Databases

A recent publication in PLoS One by Longo et al., entitled 'Abundant Human DNA Contamination Identified in Non-Primate Genome Databases', highlights the issue of cross-species contamination in DNA sequence data submitted to the archival databases of the International Nucleotide Sequence Database Collaboration (INSDC) and presented in various genome browsers. The authors point out that this is worrisome for downstream users of the data, who make assumptions based on the sequences they retrieve. While we understand that there is some contamination in sequences submitted to the public databases, we do not believe that the problem is as widespread as the paper suggests. Most of the examples provided by the authors were of preliminary data submitted to the archives under rapid data release policies. Database users must understand that preliminary data is just that: preliminary, because it has not been screened and vetted the way finished data has.

Data in the Trace Archive is the primary sequence read data that was deposited, in many cases, directly off the sequencing machines without analysis or trimming for contamination. These reads are the building blocks for the assemblies that are subsequently submitted to the database. As the authors point out in Figure 2, the trace may be contaminated, but the WGS genome assembly built from that trace read was trimmed so that only the Pseudomonas sequence was included in the assembly. Trace data is known to be preliminary and should always be screened prior to use in assemblies or other downstream analyses. Contamination in trace data was therefore expected, and it is not surprising that the authors found it.

Indeed, other data types that are submitted to the INSDC archival databases may also be considered preliminary data that should be used with greater caution. These include ESTs, high-throughput genomic or transcript data (HTG and HTC, respectively), and some whole genome shotgun (WGS) genome assemblies, most notably those of known low overall coverage, as well as small contigs that could not be assembled into larger scaffolds or placed on a chromosome. EST records are single-pass mRNA sequences that are used to identify and map transcripts within a genome. They are also preliminary data and are widely known not to be as clean as assembled cDNA sequences. HTG sequence records generally progress from phase 1 (preliminary) to phase 3 (finished) as a project progresses and more sequencing is done.

As the authors point out, the issue of sequence contamination has been raised before. To address this issue, NCBI developed a contamination screening pipeline, and every eukaryotic assembly that has been processed in GenBank since March 2008 has been screened for contamination. We initiated screening of the prokaryotic assemblies in November 2009. Our foreign contamination screen (FCS) uses BLAST to compare the submitted sequences against the chromosomes of unrelated organisms, against primate- or rodent-specific repeats (as appropriate) and against an enhanced common contaminants database that contains vector sequences, bacterial insertion sequences and E. coli. The FCS results are filtered for hits over a cut-off of 100 nt and 98% identity. While we routinely screen assemblies when they are submitted to GenBank, the submitter may choose not to remove the offending sequence. Additionally, if the submitter distributed the assembly to the other browser portals prior to submission to GenBank, the contaminated sequences would be in those distributions even if they had been removed from the GenBank submissions. Assemblies submitted to our INSDC partner databases (ENA and DDBJ) are not screened for contaminants. When we generate RefSeq versions of these GenBank assemblies, we endeavor to remove all contamination in a way that preserves the chromosome coordinate system.
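
The cut-off described above can be illustrated with a minimal sketch. It is not NCBI's FCS pipeline; it assumes the hits are available in BLAST's default tabular layout (query ID, subject ID, percent identity and alignment length as the first four tab-separated columns) and that "over a cut-off" means at or above both thresholds. The names Hit, parse_tabular_hits, passes_cutoff and filter_hits, and the sample rows, are hypothetical and exist only for illustration.

```python
from typing import Iterable, List, NamedTuple

# Thresholds quoted in the text; treating them as inclusive is an assumption.
MIN_LENGTH_NT = 100
MIN_IDENTITY_PCT = 98.0


class Hit(NamedTuple):
    """The subset of a tabular BLAST row needed for the cut-off (hypothetical type)."""
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int


def parse_tabular_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated BLAST rows, skipping blank lines and '#' comments."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), int(fields[3])))
    return hits


def passes_cutoff(hit: Hit) -> bool:
    """True when a hit is long enough and similar enough to be reported."""
    return (hit.alignment_length >= MIN_LENGTH_NT
            and hit.percent_identity >= MIN_IDENTITY_PCT)


def filter_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the hits that meet both thresholds."""
    return [hit for hit in hits if passes_cutoff(hit)]


# Made-up rows for illustration only; they are not real screening results.
sample_rows = [
    "contig_A\tsubject_1\t99.10\t250\t2\t0\t1\t250\t1001\t1250\t1e-120\t440",
    "contig_B\tsubject_1\t93.00\t400\t28\t0\t1\t400\t5001\t5400\t1e-150\t560",
    "contig_C\tsubject_2\t99.50\t60\t0\t0\t1\t60\t301\t360\t1e-20\t110",
]
reported = filter_hits(parse_tabular_hits(sample_rows))
print([hit.query_id for hit in reported])
```

Under these assumptions only the first sample row is reported: the second is long enough but falls below 98% identity, and the third is nearly identical but shorter than 100 nt. This is why a short or diverged stretch of foreign sequence can pass through a screen of this kind.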

We have reviewed the contamination cited for the NCBI assemblies reported in supplementary Table S1, where the authors cited a mixture of HTG, EST, RefSeq and GenBank assemblies.

The Bos taurus RefSeq accessions NW_001499666.1, NW_001499717.1, NW_001499173.1, NW_001500638.1, NW_001500984.1, NW_001499575.1, NW_001500541.1: These accessions are all small (1.5-3 kb) unplaced singleton scaffolds from the Btau_4.2 assembly dated June 2010. The reported contamination falls below the thresholds established in our FCS process and so would not be filtered out. We note that the reported contamination represents a tiny fraction of the assembly: the authors report contamination in only 7 of 131,728 contigs.
The Caenorhabditis elegans (C. elegans) INSDC accessions AC006760.2, AC006895.2, AC006911.1: These accessions are HTG Phase 1 sequence records from the early days of genome sequencing and were submitted in 1999. These C. elegans accessions were never updated or carried forward into the final completed genome assembly.
The Dictyostelium discoideum RefSeq accessions NZ_AAFI02000195.1 and NW_001263718.1: The WGS contig, AAFI02000195.1, used as the component in the suppressed RefSeq accessions cited in the table, currently appears as a component in the RefSeq accession NC_007092.3. Our review of this WGS contig finds apparent contamination from human at the end of the contig. The best matching human sequence has 93% identity to this part of the WGS contig, so it is below our contamination screen cut-off.
The Danio rerio RefSeq accessions NW_001883888.1, NW_001878841.1, NW_001877121.1: The cited accessions all represent an older assembly of D. rerio (Zv7). NW_001883888.1 was an unplaced and unlocalized contig which was suppressed in July 2010, and the other two were updated in March 2010 to a newer version. This genome assembly was submitted to the EBI, which does not have a contamination screen service. The RefSeq assembly is a copy of the INSDC data, kept to maintain the coordinate space, with identified contamination masked to minimize the impact on gene annotation.
The fourteen Felis catus INSDC accessions including ACBE01410795.1 and AANG01328690.1: The cited accessions are from two different submitted assemblies (WGS project prefixes AANG and ACBE). The article indicates that 9 of 604,000 contigs are contaminated for the ACBE project and 5 of 818,000 for the AANG project. Both assemblies are low-coverage draft assemblies. As such, it is not surprising that not all contaminants have been removed from them.
The Manihot esculenta INSDC accessions DB950689.1, DB949974.1, DB942673.1, DB942376.1, FG806179.1, DB942079.1, FG805263.1: The accessions cited are all EST sequences which are considered to be draft primary data and are not screened for contamination.
The Magnaporthe grisea RefSeq accessions NW_001798830.1, NW_001798760.1, NW_001798706.1, NW_001798801.1, NW_001798834.1, NW_001798772.1, NW_001798835.1: These records were made public in 2007 and thus pre-date the addition of our FCS process. Note that the organism name for these records was updated to Magnaporthe oryzae by May 2010.
The Ornithorhynchus anatinus RefSeq accessions NW_001720372.1, NW_001722118.1, NW_001607597.1, NW_001654751.1, NW_001718259.1, and NW_001794301.1: Five of these accessions represent scaffolds of less than 1 kb. The O. anatinus assembly was submitted in 2006, which pre-dates the availability of our FCS process. The contaminated sequences reported in the table represent a small fraction of the 1.8 gigabases of sequence in the genome assembly.
The Paramecium tetraurelia RefSeq accessions NZ_CAAL01001273.1 and NZ_CAAL01001272.1: These small contigs were suppressed in October 2009 after an assembly update became available from our INSDC partner ENA database. Note that suppressed records are clearly indicated as such in the NCBI web display and can still be retrieved by accession or gi number since they are part of the historical data archive; however, they are not included in BLAST databases or the GenBank or RefSeq bi-monthly releases for ftp.
The thirteen Populus trichocarpa RefSeq accessions including NW_001492486.1 and NW_001492402.1: This assembly was made public in October 2006, which pre-dates the addition of our FCS process. These RefSeq accessions were suppressed in December 2009 due to identified contamination; however, the equivalent INSDC records remain public.
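
The contig counts quoted in the entries above can be put on a common scale with a few lines of arithmetic. The sketch below only restates the figures given in the text (7 of 131,728 for Bos taurus, 9 of 604,000 for the Felis catus ACBE project and 5 of 818,000 for the AANG project); the function name contaminated_percent and the dictionary labels are hypothetical.

```python
# Contig counts as quoted in the text: (contigs reported as contaminated, total contigs).
reported_counts = {
    "Bos taurus Btau_4.2": (7, 131_728),
    "Felis catus ACBE": (9, 604_000),
    "Felis catus AANG": (5, 818_000),
}


def contaminated_percent(flagged: int, total: int) -> float:
    """Return the share of contigs reported as contaminated, in percent."""
    return 100.0 * flagged / total


for assembly, (flagged, total) in reported_counts.items():
    share = contaminated_percent(flagged, total)
    print(f"{assembly}: {flagged} of {total} contigs = {share:.4f}%")
```

In each of the three cases the share is well under 0.01% of the contigs in the assembly.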

The most serious consequence of contamination in an assembly is the generation of gene models from the contaminated sequence. While it is not ideal to have contaminated sequences in an assembly, if they are sufficiently small then they are unlikely to contribute to a gene model and the adverse effects on annotation are minimal. Contamination of any size may confound evolutionary and comparative studies, and thus the possibility of contamination should be considered in interpreting results. Additionally, groups who work with genome assemblies should understand that in most cases (human and mouse aside) the unplaced and unlocalized contigs are the least reliable, and most of the contaminants found are expected to fall into this category.

Longo et al. article

W.B.Langdon 10 Sep 2014