Yi Yang

Date of Award

January 2014

Document Type


Degree Name

Master of Science (MS)


Computer Science

First Advisor

Ronald Marsh


Next-generation sequencing (NGS) technology has shown unprecedented promise for accurately identifying and quantifying genomic variants in living organisms. For species whose genome sequences are unknown, the first step of RNA sequencing (RNA-Seq) data analysis is to assemble all of the short reads. De Bruijn graph-based algorithms, such as Oases, are commonly used for short-read assembly because they keep the computational complexity manageable. However, de Bruijn graph-based assemblers often produce high error rates when assembling RNA-Seq data. We have developed a novel assembly algorithm that can be used jointly with any other assembly method for RNA-Seq short reads. The proposed method, clustering-based assembly (CBA), aims not only to maintain computational and memory efficiency but also to improve assembly accuracy, which we evaluated in a simulation study. We tested CBA using ERCC RNA-Seq data, simulated data from Chromosome 22, and real human RNA-Seq data. The results showed that our algorithm was more accurate than other de novo methods in terms of short-read mapping rate, recover rate, and contig mapping rate.
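
For readers unfamiliar with the de Bruijn graph approach mentioned in the abstract, the following is a minimal, generic sketch of how such a graph can be built from short reads: each distinct k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. It is an illustration only and does not reproduce Oases or the CBA method; the function name `build_de_bruijn`, the toy reads, and the choice of k = 3 are assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads (hypothetical helper, illustration only).

    Nodes are (k-1)-mers; every distinct k-mer found in the reads adds one
    edge from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Returns a dict mapping each prefix node to the set of its successors.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (assumed data, not from the thesis).
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because overlapping reads share k-mers, they collapse onto the same edges, which is why this representation avoids comparing every pair of reads. Sequencing errors introduce spurious k-mers, and therefore spurious edges, in such a graph.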