All events are in Central time unless specified.

Ph.D. Defense: Sairam Behera

Date: Wednesday, November 18, 2020
Time: 3:00 pm – 4:00 pm
Location: Zoom

Wednesday, November 18, 2020
3 p.m. via Zoom

Join Zoom Meeting
https://unl.zoom.us/j/95105255518?pwd=ZDYxMVVBWG5zVk5WNVV6L29idHFUZz09
Meeting ID: 951 0525 5518
Passcode: 111820

“Application of Suffix Tree, Minwise Hashing and Streaming Algorithm for Bioinformatics Problems”

In this dissertation, we address several algorithmic problems in bioinformatics using three main approaches: (a) a streaming model for large genomic datasets, (b) suffix tree-based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). Streaming models are useful for big data problems where a good approximation can be achieved with very limited space. In these models, the input arrives one item at a time, and only the current item needs to be stored and processed. For the first problem, we developed a streaming approximation algorithm to estimate the frequency counts of k-mers, i.e., strings of length k, in genomic sequences. For the second problem, we used a suffix tree, a trie-based data structure, to develop an alignment-free, non-pairwise algorithm for identifying conserved non-coding sequences (CNSs). We give two algorithms: one identifies exactly matching CNSs, and the other identifies CNSs with a limited number of mismatches. These algorithms have been useful in comparative genomics research and were used to identify CNSs in various grass species. For larger CNSs, i.e., ≥ 100 bp, we used minhash- and LSH-based techniques. Minhash is used to estimate Jaccard similarity: our algorithm uses minhash to create signatures for the sequences and an LSH-based approach to cluster the sequences without pairwise comparisons. For the third problem, we again used minhash and LSH to address the challenges of isoform clustering. Isoforms are generated from different combinations of exons of the same gene through alternative splicing. Because isoform sequences share the same exon regions, our algorithm clusters them based on their shared minhash signatures. Finally, we discuss an ensemble approach to the de novo transcriptome assembly problem. We first performed a comprehensive performance analysis of different transcriptome assemblers using a simulated dataset. Our new ensemble approach also uses minhash to identify potential transcripts from the combined list of contigs produced by different de novo transcriptome assemblers.
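
For readers unfamiliar with minhash, the general idea of estimating Jaccard similarity from signatures can be sketched as follows. This is a minimal illustration only, not the dissertation's implementation: the function names, the use of seeded BLAKE2b hashes, the k-mer length, and the signature size of 128 are all assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(item, seed):
    """Deterministic 64-bit hash of item, salted by seed (hypothetical helper)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set.

    Assumes items is non-empty (the sequence must be at least k long).
    """
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy sequences, invented for illustration.
    s1 = "ACGTACGTGACCTGATTGCA"
    s2 = "ACGTACGTGACCTGAAAGCA"
    a, b = kmers(s1, 4), kmers(s2, 4)
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Because each signature has a fixed length regardless of sequence length, sequences can be compared (or bucketed with LSH) through their signatures rather than through their full k-mer sets. The estimate is approximate, and its accuracy improves as the number of hash functions grows.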
