De Novo Sequencing Data Analysis

Introduction


De Novo Sequencing refers to sequencing and assembling an unknown genome or transcriptome. It is also used to characterize genomes whose sequence is known but which have undergone major changes. Next-generation sequencing provides cost-effective read depth and paired-end/mate-pair configurations for effective de novo assembly of sequences ranging from BACs to mammalian genomes.

The general strategy of De Novo Sequencing analysis is to align and merge short fragments derived from a much longer DNA sequence in order to reconstruct the original sequence. De Novo Sequencing projects usually take multiple libraries and multiple rounds of finishing to get a complete genome sequence.
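As a rough illustration of the align-and-merge idea, the sketch below greedily joins the pair of reads with the longest suffix-prefix overlap until no overlaps remain. It is a toy example only: the function names and the `min_len` threshold are assumptions made for illustration, it ignores sequencing errors, reverse complements, and repeats, and it is not the method used by any particular production assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge reads greedily by longest overlap; return the resulting contigs."""
    # Drop exact duplicates (keeping order) and reads contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]

    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            # No remaining overlaps: whatever is left stays as separate contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Three overlapping toy reads collapse into one contig.
    print(greedy_assemble(["AGCTTAGG", "TAGGCATC", "GCATCGA"]))
    # Reads with no overlap stay as two contigs, i.e. a gap remains.
    print(greedy_assemble(["AAAACCCC", "GGGGTTTT"]))
```

When reads do not overlap, the function returns more than one contig, which is the toy equivalent of an assembly gap that additional libraries or finishing work would have to close.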

There are critical differences between De Novo Genome Sequencing and De Novo Transcriptome Sequencing. The composition and sequence of genomic DNA are quite stable across developmental stages and tissues in an organism. The same cannot be said of the transcriptome, because gene expression is regulated temporally and spatially and transcript copy numbers span a wide range. Therefore, De Novo Genome Sequencing and De Novo Transcriptome Sequencing use distinct experimental and informatics strategies.


The following is a list of common analysis items for De Novo Sequencing. One of our expert bioinformaticians will work closely with you to identify the custom analysis workflow most appropriate for your project.

1) Experiment design consultation
2) Data QC and clean up
3) Contig assembly
4) Scaffolding and gap closure
5) Gene/ORF prediction
6) Gene annotation and classification via database search and comparison
7) Gene Ontology and pathway analysis
8) SNP discovery and analysis
9) Comparative genome analysis
10) Written project report with analysis methods, publication-ready graphics, and references

Turn-around Time

Upon data receipt, we usually finish a typical De Novo Sequencing analysis project (e.g., microbial genome) in 3-5 days. The actual turn-around time, however, is highly dependent on genome size, sequencing strategy, and project complexity.


Publications below are representative research or review papers that will help you understand how De Novo Sequencing is employed in biomedical research.

  • Li, D. et al. (2012) De novo assembly and characterization of bark transcriptome using Illumina sequencing and development of EST-SSR markers in rubber tree (Hevea brasiliensis Muell. Arg.). BMC Genomics. 13(1):192.
  • Martin, JA. and Wang, Z. (2011) Next-generation transcriptome assembly. Nat Rev Genet. 12(10):671-82.
  • Li, R. et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature. 463(7279):311-7.


I am new to De Novo Sequencing. What do you recommend as the first step?
De Novo sequencing and assembly projects are among the most complex next-generation sequencing projects. The most challenging parts are assembly and finishing, which heavily rely on trial and error. We highly recommend you contact one of our expert bioinformaticians before initiating any sequencing operation. Our expert bioinformatician will work closely with you to identify a working solution most appropriate for your project.
Why are there still gaps in the assembled sequences when I have 100x (or 1000x, or even higher) coverage?
No sequencing process, next-generation sequencing included, generates fragments that represent the entire genome evenly. Fragmentation, cloning (if applicable), PCR, purification/cleanup, and the sequencing technology itself can each lose some sequence information. In addition, repetitive sequences, sequencing errors, and other factors complicate computational assembly and cause misassembled sequences and unresolved gaps.
Sequencing coverage alone, therefore, cannot close all the gaps. In general, additional coverage above 100x contributes little to further gap closing.
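A back-of-the-envelope calculation helps show why the remaining gaps are not a matter of depth. The sketch below uses the idealized Lander-Waterman assumption (not stated in the text above) that reads land uniformly at random, so the chance that a given base is missed at coverage c is about e^(-c). The function names and example numbers are hypothetical.

```python
import math


def coverage(num_reads: int, read_len: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * read_len / genome_size


def expected_uncovered_bases(genome_size: int, cov: float) -> float:
    """Expected number of bases hit by no read, assuming uniform random sampling.

    Under that idealized model, a base is missed with probability exp(-cov).
    """
    return genome_size * math.exp(-cov)


if __name__ == "__main__":
    genome = 5_000_000  # hypothetical microbial genome size in bases
    for cov in (10, 30, 100):
        missed = expected_uncovered_bases(genome, cov)
        print(f"{cov:>4}x coverage: ~{missed:.3g} bases expected uncovered")
```

Under purely random sampling, the expected number of uncovered bases is already far below one base well before 100x. Gaps that persist at that depth therefore point to the systematic causes described above (library and sequencing bias, repeats), which more reads of the same kind do not remove.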
Why are sequencing libraries of different insert size needed for De Novo assembly?
Computational strategies based on a single library cannot resolve gaps larger than the read length (single-end) or the library insert size (paired-end and mate-pair). Sequencing libraries of different insert sizes, ranging from <1kb to >10kb, are needed to deduce the relative positions of the assembled sequence fragments for further gap closing efforts. For short sequences such as microbial genomes, it is possible to obtain ~99% of the sequence using a single library.
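To make the role of insert size concrete, the sketch below estimates the gap between two contigs from read pairs that span it. It assumes a simplified forward-reverse layout: read 1 maps to contig A, read 2 maps to contig B, and the known insert size covers the tail of contig A, the gap, and the head of contig B. All names and numbers are hypothetical, and real scaffolders also model insert-size variance and mapping uncertainty.

```python
from statistics import median


def estimate_gap(insert_size: int, len_a: int,
                 read1_start: int, read2_end: int) -> int:
    """Estimate the gap between contig A and contig B from one spanning pair.

    insert_size : expected outer distance between the two reads of a pair
    len_a       : length of contig A
    read1_start : 0-based start of read 1 on contig A
    read2_end   : 0-based exclusive end of read 2 on contig B
    """
    tail_on_a = len_a - read1_start   # bases from read 1 start to the end of A
    head_on_b = read2_end             # bases from the start of B to read 2 end
    return insert_size - tail_on_a - head_on_b


def estimate_gap_from_pairs(insert_size: int, len_a: int, pairs) -> float:
    """Median gap estimate over several (read1_start, read2_end) pairs."""
    return median(estimate_gap(insert_size, len_a, s, e) for s, e in pairs)


if __name__ == "__main__":
    # Hypothetical 3 kb library, contig A of 10 kb, three spanning pairs.
    pairs = [(9000, 500), (9100, 420), (8900, 640)]
    print(estimate_gap_from_pairs(3000, 10_000, pairs))
```

A pair can only span a gap smaller than its insert size, which is why a library with a short insert cannot order contigs separated by longer gaps and larger-insert libraries are added.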