Assembly of Long Error-prone Reads Using De Bruijn Graphs

Proc Natl Acad Sci U S A. 2016 Dec 27;113(52):E8396-E8405.

doi: 10.1073/pnas.1604560113. Epub 2016 December 12.

  • PMID: 27956617
  • PMCID: PMC5206522
  • DOI: 10.1073/pnas.1604560113

Free PMC article

Yu Lin et al. Proc Natl Acad Sci U S A. 2016.

Abstract

The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions.

Keywords: de Bruijn graph; genome assembly; single-molecule sequencing.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

Constructing the de Bruijn graph (Left) and the A-Bruijn graph (Right) for a circular String = CATCAGATAGGA. (Left) From Path(String, 3) to DB(String, 3). (Right) From Path(String, V) to AB(String, V) for V = {CA, AT, TC, AGA, TA, AC}. The figure illustrates the process of bringing the vertices with the same label closer to each other (middle row) to eventually glue them into a single vertex (bottom row). Note that some symbols of String are not covered by strings in V. We assign integer shift(v, w) to the edge (v, w) in this path to denote the difference between the positions of v and w in String (i.e., the number of symbols between the start of v and the start of w in String).
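The construction in the Fig. 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the ABruijn implementation: the function names (`circular_kmers`, `occurrences`, `glue`) are hypothetical, and treating DB(String, 3) as the glued path over all 2-mers is an assumption based on the standard convention that de Bruijn graph vertices are (k-1)-mers. The sketch also assumes that V has at least two occurrences in String and that no two words of V start at the same position.

```python
from collections import defaultdict

def circular_kmers(string, k):
    """Return the set of distinct k-mers of a circular string."""
    doubled = string + string[:k - 1]  # lets k-mers wrap past the end
    return {doubled[i:i + k] for i in range(len(string))}

def occurrences(string, words):
    """Path(String, V): sorted (position, word) pairs for every occurrence
    of a word from `words` in the circular string."""
    n = len(string)
    doubled = string + string
    return sorted((pos, word) for pos in range(n) for word in words
                  if doubled.startswith(word, pos))

def glue(string, words):
    """Glue vertices of the path that share a label.
    Returns {(v, w): [shift, ...]} with one shift per original path edge."""
    n = len(string)
    path = occurrences(string, words)
    edges = defaultdict(list)
    for i, (pos, v) in enumerate(path):
        next_pos, w = path[(i + 1) % len(path)]  # circular successor
        edges[(v, w)].append((next_pos - pos) % n)
    return dict(edges)

string = "CATCAGATAGGA"
ab = glue(string, {"CA", "AT", "TC", "AGA", "TA", "AC"})  # AB(String, V)
db = glue(string, circular_kmers(string, 2))              # DB(String, 3), by assumption

for (v, w), shifts in sorted(ab.items()):
    print(f"{v} -> {w}: shift {shifts}")
```

In this toy run the shifts along the glued A-Bruijn path add up to the length of String, and every edge of the de Bruijn variant has shift 1, which is what the caption's definition of shift(v, w) implies.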

Fig. 2.

The histograms of the number of 15-mers with given frequencies for the ECOLI dataset from Escherichia coli. The bars for unique/repeated/nongenomic 15-mers for the E. coli genome are stacked and shown in green/red/blue according to their fractions. ABruijn automatically selects the parameter t and defines solid strings as all 15-mers with frequencies at least t = 7 for the ECOLI dataset. We found that increasing the automatically selected values of t by 1 results in equally accurate assemblies. There are 4.1, 0.1, and 0.5 million (3.9, 0.1, and 0.3 million) unique, repeated, and nongenomic 15-mers, respectively, for ECOLI at t = 7 (t = 8). Although larger values of k (e.g., k = 25) also produce high-quality SMS assemblies, we found that selecting smaller rather than larger k results in slightly better performance.
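The idea of selecting solid strings by a frequency threshold can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the helper names are hypothetical, the threshold t is passed in by hand (the caption does not say how ABruijn chooses it automatically), reverse complements are ignored, and the toy reads and k = 3 stand in for real long reads and k = 15.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(reads, k, t):
    """k-mers whose frequency in the reads is at least t."""
    return {kmer for kmer, c in kmer_counts(reads, k).items() if c >= t}

def frequency_histogram(reads, k):
    """Map each frequency to the number of distinct k-mers having it."""
    return Counter(kmer_counts(reads, k).values())

# Toy data: three consistent reads plus one read with an error-like tail.
reads = ["ACGTAC", "CGTACG", "GTACGT", "ACGTTT"]
solid = solid_kmers(reads, k=3, t=3)
histogram = frequency_histogram(reads, k=3)
print(sorted(solid))
print(sorted(histogram.items()))
```

Raising t trades sensitivity for specificity: the two 3-mers that occur only once in the toy reads are dropped at t = 3, in the same spirit as the low-frequency nongenomic 15-mers in the histogram.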

Fig. 3.

Two overlapping reads from the ECOLI dataset and their common jump-subpath with maximum span that contains 50 vertices and has span 6,714 with respect to the bottom read (for jump = 1,000). The left and right overhangs for these reads are 425 and 434. The weights of the edges in the A-Bruijn graph are shown only if they exceed 400 bp.

Fig. 4.

(Top) A growing path (shown in green) and a set of five paths Paths above it (extending this path). The gray path with Support_Paths(P) = 2 is the most-consistent path in the set Paths. (Middle) A growing path (shown in green) ending in a repeat (represented by the internal edge in the graph), and eight read-paths that extend this growing path (five correct extensions shown in blue and three incorrect extensions shown in red). (Bottom) A support graph for the above eight read-paths. Note that the blue read-path 1 is connected by edges with all red read-paths because it is supported by all red paths even though these paths do not contain any short suffix of read-path 1 (the ABruijn graph framework is less sensitive than the de Bruijn graph framework with respect to overlap detection).

Fig. 5.

(Left) Support graph G(Reads) for a read in the BLS dataset (Results, Datasets) that does not end in a long repeat. Reads in the BLS dataset are numbered in order of their appearance along the genome. The green vertex represents a chimeric read. The blue vertex has maximum degree in G(Reads) and reveals a single cluster consisting of all vertices but the green one. A vertex 281 with large indegree (5) and large outdegree (3) in G(Reads) is a most-consistent read-path, and it is selected for path extension (unless it ends in a repeat). (Right) Support graph G(Reads) for a read in the BLS dataset that ends in a long repeat. The green vertex represents a chimeric read. The blue vertex has maximum degree in G(Reads) and reveals a cluster consisting of 9 blue vertices. The vertex 4901 with large indegree (4) and large outdegree (4) in G(Reads) is a most-consistent read-path, and it is selected for path extension if it does not start in a repeat. The red vertex reveals another cluster consisting of five red vertices. Generally, we expect that a read ending in a long repeat of multiplicity m will result in m clusters because reads originating from different instances of this repeat are not expected to support each other and, thus, are not connected by edges in G(Reads).
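One way to read the Fig. 5 caption is as a greedy procedure on the support graph: take the vertex of maximum degree, call it and its neighbors a cluster, and repeat on what is left. The Python sketch below follows that reading. It is an assumption made for illustration, not the procedure stated in the paper; the function names, the toy graph, and the rule of ranking read-paths by min(indegree, outdegree) are all hypothetical choices.

```python
def greedy_clusters(edges, vertices):
    """Repeatedly take the remaining vertex of maximum degree together with
    its remaining neighbors as one cluster (edge direction is ignored here)."""
    neighbors = {v: set() for v in vertices}
    for u, w in edges:
        neighbors[u].add(w)
        neighbors[w].add(u)
    remaining = set(vertices)
    clusters = []
    while remaining:
        # Ties are broken in favor of the smallest label.
        hub = max(sorted(remaining), key=lambda v: len(neighbors[v] & remaining))
        cluster = {hub} | (neighbors[hub] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def most_consistent(edges, cluster):
    """Pick the vertex with both large indegree and large outdegree inside
    the cluster, scored here (by assumption) as min(indegree, outdegree)."""
    indeg = {v: 0 for v in cluster}
    outdeg = {v: 0 for v in cluster}
    for u, w in edges:
        if u in cluster and w in cluster:
            outdeg[u] += 1
            indeg[w] += 1
    return max(sorted(cluster), key=lambda v: min(indeg[v], outdeg[v]))

# Toy support graph: two groups of mutually supporting reads and one
# isolated read (standing in for a chimeric read).
vertices = range(1, 10)
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8)]
clusters = greedy_clusters(edges, vertices)
best = most_consistent(edges, clusters[0])
print(clusters, best)
```

In the toy graph the two groups share no edges, so they come out as separate clusters, mirroring the caption's expectation that reads from different instances of a repeat do not support each other.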

Fig. 6.

(Top Left) The pairwise alignments between a reference region ref in the draft genome and 5 reads Reads = {read1, read2, read3, read4, read5}. All inserted symbols in these reads with respect to the region ref are colored in blue. (Bottom Left) The multiple alignment Alignment constructed from the above pairwise alignments along with the values of Cov(i), Match(i), Del(i), Sub(i), and Ins(i). The last row shows the set V of (0.8, 0.2)-solid 4-mers. The nonreference columns in the alignment are not numbered. (Right) Constructing AB̄(Alignment), that is, combining all paths Path(read_j, V) into AB̄(Alignment). Note that the 4-mer ATGA corresponds to two different nodes with labels 1 and 13. The three boundaries of the mini-alignments are between positions 2 and 3, 7 and 8, and 14 and 15. The two resulting necklaces are formed by segments {GAATCA, GATTCA, GAAACA, GAAACA, GAGGTA} and {GTCAT, GTTCA, TCCTCGAT, GTATTACAT, GTCTTAAT}.



Source: https://pubmed.ncbi.nlm.nih.gov/27956617/
