NCBI News, March 2011
NCBI Responds to a Report of Contamination in the Sequence Databases
A recent publication in PLoS One by Longo et al. entitled 'Abundant
Human DNA Contamination Identified in Non-Primate Genome Databases'
highlights the issue of cross-species contamination in DNA sequence
data submitted to the archival databases of the International
Nucleotide Sequence Database Collaboration (INSDC) as well as
presented in various genome browsers. The authors point out that this
is worrisome for downstream users of the data who make assumptions
based on the sequences they retrieve. While we understand that there
is some contamination in sequences submitted to the public databases,
we do not believe that the problem is as widespread as cited in the
paper. Most of the examples provided by the authors were of
preliminary data submitted to the archives under rapid data release
policies. Database users must understand that preliminary data is
just that: preliminary, because it has not been screened and vetted
the way finished data has.
Data in the Trace Archive is the primary sequence read data that was
deposited, in many cases, directly off of the sequencing machines
without analysis or trimming for contamination. This data forms the
building blocks for the assemblies that are subsequently submitted to
the database. As the authors point out in figure 2, the trace may be
contaminated but the WGS genome assembly built from that trace read
was trimmed so that only the Pseudomonas sequence was included in the
assembly. Trace data is known to be preliminary and should always be
screened prior to use in assemblies or other downstream analyses.
This result was expected and it is not surprising that the authors
found it.
Indeed, other data types that are submitted to the INSDC archival
databases may also be considered preliminary data that should be used
with greater caution. These include ESTs, high-throughput genomic or
transcript data (HTG and HTC, respectively), and some whole genome
shotgun (WGS) genome assemblies, most notably those that are of known
overall low coverage, and small contigs that could not be assembled
into larger scaffolds or placed on a chromosome. EST records are
single-pass mRNA sequences that are used to identify and map
transcripts within a genome. They are also preliminary data and are
widely known not to be as clean as assembled cDNA sequences. HTG
sequence records generally progress from phase 1 (preliminary) to
phase 3 (finished) as a project progresses and more sequencing is
done.
As the authors point out, the issue of sequence contamination has been
raised before. To address this issue, NCBI developed a contamination
screening pipeline and every eukaryotic assembly that has been
processed in GenBank since March 2008 has been screened for
contamination. We initiated screening of the prokaryotic assemblies
in November 2009. Our foreign contamination screen (FCS) blasts the
submitted sequences against the chromosomes of unrelated organisms,
against primate- or rodent-specific repeats (as appropriate) and
against an enhanced common contaminants database that contains vector
sequences, bacterial insertion sequences and E. coli. The FCS results
are filtered for hits over a cut-off of 100 nt and 98% identity.
While we routinely screen assemblies when they are submitted to
GenBank, the submitter may choose not to remove the offending sequence.
Additionally, if the submitter distributed the assembly to the other
browser portals prior to submission to GenBank, the contaminated
sequences would be in those distributions even if they had been
removed from the GenBank submissions. Assemblies submitted to our
INSDC partner databases (ENA and DDBJ) are not screened for
contaminants. When we generate RefSeq versions of these GenBank
assemblies we endeavor to remove all contamination in a way that
preserves the chromosome coordinate system.
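
The 100 nt and 98% identity cut-off described above can be illustrated
with a minimal sketch. This is not NCBI's FCS code; the Hit record, the
passes_cutoff function, the example hits, and the choice to treat both
thresholds as inclusive are assumptions made purely for illustration.

```python
from typing import NamedTuple

# Thresholds quoted in the text; treating them as inclusive is an assumption.
MIN_LENGTH_NT = 100
MIN_IDENTITY_PCT = 98.0


class Hit(NamedTuple):
    """Hypothetical record for one alignment hit (not an NCBI data structure)."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int


def passes_cutoff(hit: Hit) -> bool:
    """Return True when a hit is long and similar enough to be reported."""
    return (
        hit.alignment_length >= MIN_LENGTH_NT
        and hit.percent_identity >= MIN_IDENTITY_PCT
    )


# Made-up hits for illustration only; names and lengths are hypothetical.
hits = [
    Hit("contig_a", "human_chr", 99.2, 250),  # long, near-identical match
    Hit("contig_b", "human_chr", 93.0, 250),  # identity below the cut-off
    Hit("contig_c", "human_chr", 99.5, 60),   # match shorter than 100 nt
]

reported = [hit for hit in hits if passes_cutoff(hit)]
for hit in reported:
    print(hit.query, hit.percent_identity, hit.alignment_length)
```

Under these assumptions only the first hit would be reported. A match
with 93% identity, or one shorter than 100 nt, passes through the
filter unreported regardless of its other properties.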
We have reviewed the cited contamination for the NCBI assemblies
reported in the supplementary table S1 where the authors cited a
mixture of HTG, EST, RefSeq and GenBank assemblies.
-
The Bos taurus RefSeq accessions NW_001499666.1, NW_001499717.1,
NW_001499173.1, NW_001500638.1, NW_001500984.1, NW_001499575.1,
NW_001500541.1: These accessions are all small (1.5-3kb) unplaced
singleton scaffolds from the Btau_4.2 assembly dated June 2010. The
reported contamination falls below the thresholds established in our
FCS process and would not be filtered out. We note that the reported
contamination represents a tiny fraction of the assembly with only
7/131728 contigs reported by the authors to include contamination.
-
The Caenorhabditis elegans (C. elegans) INSDC accessions AC006760.2,
AC006895.2, AC006911.1: These accessions are HTG Phase 1 sequence
records from the early days of genome sequencing and were submitted in
1999. These C. elegans accessions were never updated nor carried
forth into the final completed genome assembly.
-
The Dictyostelium discoideum RefSeq accessions NZ_AAFI02000195.1 and
NW_001263718.1: The WGS contig, AAFI02000195.1, used as the component
in the suppressed RefSeq accessions cited in the table, currently
appears as a component in the RefSeq accession NC_007092.3. Our
review of this WGS contig finds apparent contamination from human at
the end of the contig. The best matching human sequence has 93%
identity to this part of the WGS contig, so is below our contamination
screen cut-off.
-
The Danio rerio RefSeq accessions NW_001883888.1, NW_001878841.1,
NW_001877121.1: The cited accessions all represent an older assembly
of D. rerio (Zv7). NW_001883888.1 was an unplaced and unlocalized
contig which was suppressed in July 2010, and the other two were
updated in March 2010 to a newer version. This genome assembly was
submitted to the EBI, which does not have a contamination screening
service. To maintain the coordinate space, the RefSeq assembly is a
copy of the INSDC data, with identified contamination masked to
minimize the impact on gene annotation.
-
The fourteen Felis catus INSDC accessions including ACBE01410795.1 and
AANG01328690.1: The cited accessions are from two different submitted
assemblies (WGS project prefixes AANG and ACBE). The article
indicates that 9/604,000 contigs are contaminated for the ACBE project
and 5/818,000 are cited for the AANG project. Both assemblies are
low-coverage draft assemblies. As such, it is not surprising that
contaminants have not been completely removed from them.
-
The Manihot esculenta INSDC accessions DB950689.1, DB949974.1,
DB942673.1, DB942376.1, FG806179.1, DB942079.1, FG805263.1: The
accessions cited are all EST sequences which are considered to be
draft primary data and are not screened for contamination.
-
The Magnaporthe grisea RefSeq accessions NW_001798830.1,
NW_001798760.1, NW_001798706.1, NW_001798801.1, NW_001798834.1,
NW_001798772.1, NW_001798835.1: These records were made public in
2007 and thus pre-date the addition of our FCS process. Note that the
organism name for these records was updated to Magnaporthe oryzae by
May 2010.
-
The Ornithorhynchus anatinus RefSeq accessions NW_001720372.1,
NW_001722118.1, NW_001607597.1, NW_001654751.1, NW_001718259.1, and
NW_001794301.1: Five of these accessions represent scaffolds of less
than 1 Kb. The O. anatinus assembly was submitted in 2006 which
pre-dates the availability of our FCS process. The contaminated
sequences reported in the table represent a small fraction of the 1.8
gigabases of sequence in the genome assembly.
-
The Paramecium tetraurelia RefSeq accessions NZ_CAAL01001273.1 and
NZ_CAAL01001272.1: These small contigs were suppressed in October
2009 after an assembly update became available from our INSDC partner
ENA database. Note that suppressed records are clearly indicated as
such in the NCBI web display and can still be retrieved by accession
or gi number since they are part of the historical data archive;
however, they are not included in BLAST databases or the GenBank or
RefSeq bi-monthly releases for ftp.
-
The thirteen Populus trichocarpa RefSeq accessions including
NW_001492486.1 and NW_001492402.1: This assembly was made public in
October 2006, which pre-dates the addition of our FCS process. These
RefSeq accessions were suppressed due to identified contamination in
December 2009; however, the equivalent INSDC records remain public.
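
To put the contig counts quoted in the list above into perspective,
the short sketch below converts them to percentages. The counts are
taken from the text (the Felis catus totals are the rounded figures
given there); the function name contaminated_percent and the dictionary
layout are illustrative choices only.

```python
# Contig counts as quoted in the text: (reported contaminated, total contigs).
# The Felis catus totals are the rounded values given in the text.
reported_counts = {
    "Bos taurus Btau_4.2": (7, 131_728),
    "Felis catus ACBE": (9, 604_000),
    "Felis catus AANG": (5, 818_000),
}


def contaminated_percent(flagged: int, total: int) -> float:
    """Percentage of contigs reported as contaminated."""
    return 100.0 * flagged / total


for assembly, (flagged, total) in reported_counts.items():
    percent = contaminated_percent(flagged, total)
    print(f"{assembly}: {flagged}/{total} contigs = {percent:.4f}%")
```

In each case the reported contigs amount to well under one hundredth
of one percent of the contigs in the assembly.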
The most serious consequence of contamination in an assembly is that
gene models may be generated from the contaminated sequence. While it
is not ideal to have contaminated sequences in an assembly, if they
are sufficiently small then they are unlikely to contribute toward a
gene model and the adverse annotation effects are minimal. Contamination
of any size may confound evolutionary and comparative studies and thus
the possibility of contamination should be considered in interpreting
results. Additionally, groups who work with genome assemblies should
understand that in most cases (human and mouse aside) the unplaced and
unlocalized contigs are the most unreliable and most of the
contaminants found are expected to fall into this category.
Longo et al. article
W.B.Langdon
10 Sep 2014