Read Ancient DNA: Methods and Protocols Online
Authors: Beth Shapiro
cause them to miss alignments when the number of allowed edits is set too low. Green et al. (8) presented an aligner with sensitivity comparable to MegaBLAST (79, 80) that incorporates base misincorporation patterns typical of aDNA extracts.
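The effect of a low edit limit on damaged reads can be illustrated with a small sketch. It is not the aligner of Green et al. (8). It assumes a read already placed on the reference without gaps, and it assumes the commonly reported damage pattern of C-to-T substitutions near the 5' end and G-to-A substitutions near the 3' end. The function name `count_edits`, the `end_window` parameter, and the example sequences are hypothetical.

```python
def count_edits(read, ref, damage_aware=False, end_window=3):
    """Count mismatches between a read and an equally long reference segment.

    With damage_aware=True, substitutions that match the assumed aDNA damage
    pattern (C->T within end_window bases of the 5' end, G->A within
    end_window bases of the 3' end) are not counted as edits.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference must be pre-aligned to equal length")
    n = len(read)
    edits = 0
    for i, (r, g) in enumerate(zip(read, ref)):
        if r == g:
            continue
        if damage_aware:
            # Likely deamination at the 5' end: reference C read as T.
            if i < end_window and g == "C" and r == "T":
                continue
            # Likely deamination at the 3' end: reference G read as A.
            if i >= n - end_window and g == "G" and r == "A":
                continue
        edits += 1
    return edits


# Hypothetical example: four damage-type substitutions plus one true difference.
ref = "CCAGTTACGGATCAGG"
read = "TTAGTTAAGGATCAAA"
max_edits = 2  # assumed aligner setting

naive = count_edits(read, ref)
aware = count_edits(read, ref, damage_aware=True)
print("naive edits:", naive, "accepted:", naive <= max_edits)
print("damage-aware edits:", aware, "accepted:", aware <= max_edits)
```

With the limit of two edits, the naive count rejects the read, while the damage-aware count retains it. A real aligner would weight such substitutions by position rather than ignore them outright.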
12. If nonidentical sequences originating from different DNA molecules are clustered together, a consensus approach will average them. This may result in incorrect haplotype calls and low quality scores at sites where variation is present. A consensus approach should therefore be applied only if it is very unlikely that two different template molecules end up in the same cluster. For aDNA samples with a few million endogenous molecules, large megabase-sized genomes, and random fragment ends, the assumption that PCR duplicates are the only source of clustered sequences is probably valid. Large amounts of endogenous DNA, small genomes, or protocols that generate nonrandom fragment ends (such as the use of restriction enzymes or multiplex PCR) may, however, conflict with this assumption.
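The averaging effect described in this note can be sketched as follows. This is a minimal illustration rather than the method used in the chapter. The scoring scheme is an assumption: the PHRED scores supporting each base are summed, and the consensus quality is the difference between the best and second-best base, capped at `MAX_QUAL`. The function name `call_consensus` and the example reads are hypothetical.

```python
from collections import defaultdict

MAX_QUAL = 60  # assumed cap for consensus quality scores


def call_consensus(reads):
    """Build a consensus from a cluster of equally long reads.

    reads: list of (sequence, list_of_phred_scores) tuples.
    Returns (consensus_sequence, consensus_qualities).
    """
    length = len(reads[0][0])
    if any(len(s) != length or len(q) != length for s, q in reads):
        raise ValueError("all reads and quality lists must have the same length")
    seq, quals = [], []
    for pos in range(length):
        # Sum the quality scores supporting each observed base.
        support = defaultdict(int)
        for s, q in reads:
            support[s[pos]] += q[pos]
        ranked = sorted(support.items(), key=lambda kv: kv[1], reverse=True)
        best_base, best = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        seq.append(best_base)
        # Conflicting evidence lowers the consensus quality.
        quals.append(min(best - runner_up, MAX_QUAL))
    return "".join(seq), quals


# True PCR duplicates: one low-quality sequencing error is outvoted.
duplicates = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACGT", [30, 30, 30, 30]),
    ("ACTT", [30, 30, 10, 30]),
]
# Two different template molecules clustered by mistake.
mixed = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACTT", [30, 30, 30, 30]),
]

print(call_consensus(duplicates))
print(call_consensus(mixed))
```

In the duplicate cluster the error is corrected with only a small loss of quality. In the mixed cluster the variable site receives a quality of zero and an arbitrary base, which is the failure mode the note warns about.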
Acknowledgments
I thank all current and previous members of the Department of Evolutionary Genetics at the Max Planck Institute for Evolutionary Anthropology, and particularly members of the aDNA and sequencing group, for interesting discussions and useful insights as well as for providing their sequencing data for analysis (especially Knut Finstermeier for providing the example data set). I also thank Knut Finstermeier and Beth Shapiro for critical reading and revisions. This work was supported by a grant from the Max Planck Society.
226 M. Kircher
References
1. Margulies M et al (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376–380
2. Bentley DR et al (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218):53–59
3. Shendure J et al (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309(5741):1728–1732
4. Harris TD et al (2008) Single-molecule DNA sequencing of a viral genome. Science 320(5872):106–109
5. Drmanac R et al (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327(5961):78–81
6. Korlach J et al (2008) Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci U S A 105(4):1176–1181
7. Miller W et al (2008) Sequencing the nuclear genome of the extinct woolly mammoth. Nature 456(7220):387–390
8. Green RE et al (2010) A draft sequence of the Neandertal genome. Science 328(5979):710–722
9. Rasmussen M et al (2010) Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463(7282):757–762
10. Krause J et al (2006) Multiplex amplification of the mammoth mitochondrial genome and the evolution of Elephantidae. Nature 439(7077):724–727
11. Krause J et al (2010) The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464(7290):894–897
12. Briggs AW et al (2009) Targeted retrieval and analysis of five Neandertal mtDNA genomes. Science 325(5938):318–321
13. Burbano HA et al (2010) Targeted investigation of the Neandertal genome by array-based sequence capture. Science 328(5979):723–725
14. Poinar HN et al (2006) Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311(5759):392–394
15. Green RE et al (2008) A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134(3):416–426
16. Gilbert MT et al (2008) Intraspecific phylogenetic analysis of Siberian woolly mammoths using complete mitochondrial genomes. Proc Natl Acad Sci U S A 105(24):8327–8332
17. Briggs AW et al (2007) Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci U S A 104(37):14616–14621
18. Heyn P et al (2010) Road blocks on paleogenomes – polymerase extension profiling reveals the frequency of blocking lesions in ancient DNA. Nucleic Acids Res 38(16):e161
19. Hofreiter M et al (2001) DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Res 29(23):4793–4799
20. Kircher M, Kelso J (2010) High-throughput DNA sequencing – concepts and limitations. Bioessays 32(6):524–536
21. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26(10):1135–1145
22. Reich D et al (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468(7327):1053–1060
23. Prüfer K et al (2010) Computational challenges in the analysis of ancient DNA. Genome Biol 11(5):R47
24. Dohm JC et al (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36(16):e105
25. Lassmann T, Hayashizaki Y, Daub CO (2009) TagDust – a program to eliminate artifacts from next generation sequencing data. Bioinformatics 25(21):2839–2840
26. Briggs AW, Stenzel U, Meyer M, Krause J, Kircher M, Paabo S (2009) Removal of deaminated cytosines and detection of in vivo methylation in ancient DNA. Nucleic Acids Res 38(6):e87
23 Analysis of High-Throughput Ancient DNA Sequencing Data 227
27. Krause J et al (2010) A complete mtDNA genome of an early modern human from Kostenki, Russia. Curr Biol 20(3):231–236
28. Quinlan AR et al (2008) Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 5(2):179–181
29. Erlich Y et al (2008) AltaCyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods 5(8):679–682
30. Kao WC, Stevens K, Song YS (2009) BayesCall: a model-based basecalling algorithm for high-throughput short-read sequencing. Genome Res 19(10):1884–1895
31. Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10(8):R83
32. Whiteford N et al (2009) Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics 25(17):2194–2199
33. Noer GJ (1998) Cygwin: A free win32 porting layer for UNIX Applications. In: 2nd USENIX NT Symposium, Seattle, WA
34. Stajich JE et al (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12(10):1611
35. Cock PJA et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11):1422
36. Mason CE et al (2010) Standardizing the next generation of bioinformatics software development with BioHDF (HDF5). Adv Exp Med Biol 680:693–700
37. Chang F et al (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):1–26
38. Venner J (2009) Pro Hadoop. In: Moodie M (ed) Apress. Springer, New York
39. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010(6):pdb.prot5448. doi: 10.1101/pdb.prot5448
40. Meyer M, Stenzel U, Hofreiter M (2008) Parallel tagged sequencing on the 454 platform. Nat Protoc 3(2):267–278
41. Illumina Inc. (2008) Multiplexed sequencing with the Illumina Genome Analyzer System [PDF] [cited; 770-2008-011]. Available from: http://www.illumina.com/Documents/products/datasheets/datasheet_sequencing_multi-
42. Stiller M et al (2009) Direct multiplex sequencing (DMPS) – a novel method for targeted high-throughput sequencing of ancient and highly degraded DNA. Genome Res 19(10):1843–1848
43. Paabo S, Irwin DM, Wilson AC (1990) DNA damage promotes jumping between templates during enzymatic amplification. J Biol Chem 265(8):4718–4721
44. Lahr DJ, Katz LA (2009) Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase. Biotechniques 47(4):857–866
45. Meyerhans A, Vartanian JP, Wain-Hobson S (1990) DNA recombination during PCR. Nucleic Acids Res 18(7):1687–1691
46. Odelberg SJ et al (1995) Template-switching during DNA synthesis by Thermus aquaticus DNA polymerase I. Nucleic Acids Res 23(11):2049–2057
47. Mamanova L et al (2010) Target-enrichment strategies for next-generation sequencing. Nat Methods 7(2):111–118
48. R Development Core Team (2010) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
49. Ewing B, Green P (1998) Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8(3):186–194
50. Dolan PC, Denver DR (2008) TileQC: a system for tile-based quality control of Solexa data. BMC Bioinformatics 9:250
51. Andrews S (2010) FastQC: a quality control tool for high throughput sequence data
52. McKenna A et al (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297–1303
53. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760
54. Palmer LE et al (2010) Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction. BMC Bioinformatics 11:33
55. Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5):821–829
56. Birol I et al (2009) De novo transcriptome assembly with ABySS. Bioinformatics 25(21):2872–2877
57. Chaisson MJ, Brinza D, Pevzner PA (2009) De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res 19(2):336–346
58. Jeck WR et al (2007) Extending assembly of short DNA sequences to handle error. Bioinformatics 23(21):2942–2944
59. Li H et al (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078–2079
60. Creighton CJ, Reid JG, Gunaratne PH (2009) Expression profiling of microRNAs by deep sequencing. Brief Bioinform 10(5):490–497
61. Green RE et al (2009) The Neandertal genome and ancient DNA authenticity. EMBO J 28(17):2494–2502
62. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460–2461
63. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
64. Niu B et al (2010) Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 11:187
65. Blanca J, Chevreux B (2010) sff_extract. http://bioinf.comav.upv.es/sff_extract/index
66. Langmead B et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25
67. Applied Biosystems (2008) A theoretical
68. Altschul SF et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
69. Kent WJ (2002) BLAT – the BLAST-like alignment tool. Genome Res 12(4):656–664
70. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–4680
71. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217
72. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797
73. Trapnell C, Salzberg SL (2009) How to map billions of short reads onto genomes. Nat Biotechnol 27(5):455–457
74. Li R et al (2008) SOAP: short oligonucleotide alignment program. Bioinformatics 24(5):713–714
75. Smith AD, Xuan Z, Zhang MQ (2008) Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9:128
76. Li R et al (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966–1967
77. Zhang Z et al (2000) A greedy algorithm for