Assembly of Long Error-prone Reads Using De Bruijn Graphs
Proc Natl Acad Sci U S A. 2016 Dec 27;113(52):E8396-E8405.
doi: 10.1073/pnas.1604560113. Epub 2016 December 12.
- PMID: 27956617
- PMCID: PMC5206522
- DOI: 10.1073/pnas.1604560113
Free PMC article
Proc Natl Acad Sci U S A.
Abstract
The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions.
Keywords: de Bruijn graph; genome assembly; single-molecule sequencing.
Conflict of interest statement
The authors declare no conflict of interest.
Figures
Constructing the de Bruijn graph (Left) and the A-Bruijn graph (Right) for the circular string CATCAGATAGGA. (Left) From the path to the de Bruijn graph. (Right) From the path to the A-Bruijn graph for the set of strings {CA, AT, TC, AGA, TA, AC}. The figure illustrates the process of bringing the vertices with the same label closer to each other (middle row) to eventually glue them into a single vertex (bottom row). Note that some symbols of the circular string are not covered by strings in this set. We assign an integer to each edge in this path to denote the difference between the positions of the edge's two endpoint strings in the circular string (i.e., the number of symbols between the start of the first and the start of the second).
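The construction in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the ABruijn implementation: the function names `circular_occurrences` and `a_bruijn_graph` are hypothetical, vertices are glued simply by using the string label as a dictionary key, and every pattern is assumed to be no longer than the circular string.

```python
from collections import defaultdict

def circular_occurrences(text, patterns):
    """Return sorted (start, pattern) pairs for every occurrence of a
    pattern in the circular string `text` (hypothetical helper)."""
    n = len(text)
    doubled = text + text  # lets an occurrence wrap around the end
    hits = []
    for p in patterns:
        for start in range(n):
            if doubled[start:start + len(p)] == p:
                hits.append((start, p))
    return sorted(hits)

def a_bruijn_graph(text, patterns):
    """Glue the path of pattern occurrences into a multigraph.

    Returns {label: [(next_label, shift), ...]}, where shift is the
    number of symbols between the starts of consecutive occurrences.
    """
    n = len(text)
    hits = circular_occurrences(text, patterns)
    graph = defaultdict(list)
    for i, (start, label) in enumerate(hits):
        next_start, next_label = hits[(i + 1) % len(hits)]
        # A single occurrence loops back to itself after n symbols.
        shift = (next_start - start) % n or n
        graph[label].append((next_label, shift))
    return dict(graph)

graph = a_bruijn_graph("CATCAGATAGGA", {"CA", "AT", "TC", "AGA", "TA", "AC"})
for label, edges in sorted(graph.items()):
    print(label, edges)
```

For the string and set in the caption, the edge from TA to AC carries the largest shift because the two G symbols between them are not covered by any string in the set.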
The histograms of the number of 15-mers with given frequencies for the ECOLI dataset from Escherichia coli. The bars for unique/repeated/nongenomic 15-mers for the E. coli genome are stacked and shown in green/red/blue according to their fractions. ABruijn automatically selects a frequency threshold and defines solid strings as all 15-mers with frequencies at least this threshold for the ECOLI dataset. We found that increasing the automatically selected threshold by 1 results in equally accurate assemblies. There are 4.1, 0.1, and 0.5 million (3.9, 0.1, and 0.3 million) unique, repeated, and nongenomic 15-mers, respectively, for ECOLI at […] ([…]). Although larger values of […] (e.g., […]) also produce high-quality SMS assemblies, we found that selecting smaller rather than larger […] results in slightly better performance.
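The notion of solid strings used in this caption can be illustrated with a short sketch. It assumes the frequency threshold is supplied by the caller (the parameter name `t` is hypothetical, and ABruijn's automatic selection is not reproduced here), and it ignores reverse complements for simplicity.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(reads, k, t):
    """Return the k-mers whose frequency is at least t.
    `t` is an illustrative, caller-supplied threshold."""
    return {kmer for kmer, c in kmer_counts(reads, k).items() if c >= t}

def frequency_histogram(counts):
    """Map each frequency to the number of distinct k-mers having it,
    i.e. the quantity plotted in a k-mer frequency histogram."""
    return Counter(counts.values())

# Toy reads, not real sequencing data.
reads = ["ACGTACGT", "CGTACG", "TTTT"]
print(solid_kmers(reads, k=3, t=3))
print(frequency_histogram(kmer_counts(reads, k=3)))
```

On real data the histogram would be built over 15-mers, and the threshold would separate the low-frequency, mostly erroneous k-mers from the rest.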
Two overlapping reads from the ECOLI dataset and their common […]-subpath with maximum span that contains 50 vertices and has span 6,714 with respect to the bottom read (for […] = 1,000). The left and right overhangs for these reads are 425 and 434. The weights of the edges in the A-Bruijn graph are shown only if they exceed 400 bp.
(Top) A growing path (shown in green) and a set of five paths above it (extending this path). The gray path with […] is the most-consistent path in this set. (Middle) A growing path (shown in green) ending in a repeat (represented by the internal edge in the graph), and eight read-paths that extend this growing path (five correct extensions shown in blue and three incorrect extensions shown in red). (Bottom) A support graph for the above eight read-paths. Note that the blue read-path 1 is connected by edges with all red read-paths because it is supported by all red paths even though these paths do not contain any short suffix of read-path 1 (the ABruijn graph framework is less sensitive than the de Bruijn graph framework with respect to overlap detection).
(Left) Support graph for a read in the BLS dataset (Results, Datasets) that does not end in a long repeat. Reads in the BLS dataset are numbered in order of their appearance along the genome. The green vertex represents a chimeric read. The blue vertex has maximum degree in the support graph and reveals a single cluster consisting of all vertices but the green one. Vertex 281, with large indegree (5) and large outdegree (3) in the support graph, is a most-consistent read-path, and it is selected for path extension (unless it ends in a repeat). (Right) Support graph for a read in the BLS dataset that ends in a long repeat. The green vertex represents a chimeric read. The blue vertex has maximum degree in the support graph and reveals a cluster consisting of nine blue vertices. Vertex 4901, with large indegree (4) and large outdegree (4) in the support graph, is a most-consistent read-path, and it is selected for path extension if it does not start in a repeat. The red vertex reveals another cluster consisting of five red vertices. In general, we expect that a read ending in a long repeat of multiplicity m will result in m clusters because reads originating from different instances of this repeat are not expected to support each other and, thus, are not connected by edges in the support graph.
(Top Left) The pairwise alignments between a reference region in the draft genome and 5 reads. All inserted symbols in these reads with respect to the region are colored in blue. (Bottom Left) The multiple alignment constructed from the above pairwise alignments along with the values of […], […], […], […], and […]. The last row shows the set of […]-solid 4-mers. The nonreference columns in the alignment are not numbered. (Right) Constructing […], that is, combining all paths […] into […]. Note that the 4-mer ATGA corresponds to two different nodes with labels 1 and 13. The three boundaries of the mini-alignments are between positions 2 and 3, 7 and 8, and 14 and 15. The two resulting necklaces are formed by segments […] and […].
Source: https://pubmed.ncbi.nlm.nih.gov/27956617/