[EBOOK] Efficient Algorithms For Identification And Analysis Of Repetitive Patterns In Biological Sequences PDF Download

Computational biology

Efficient Algorithms for Identification and Analysis of Repetitive Patterns in Biological Sequences

Book Details:

Author : Jie Zheng
Publisher :
Release : 2006
ISBN :
Pages : 224 pages


Identification and Application of Repetitive Biological Sequences

Book Details:

Author : Xuehui Li
Publisher :
Release : 2007
ISBN :
Pages : pages

Book excerpt: Biological sequences are rich in repeats. For example, more than 50% of the human genome consists of repeats and approximately one-quarter of the amino acids are in repeats. Repeats are subsequences of biased composition. They vary in size from less than a hundred bases to tens of kilobases. They are found either as tandem arrays or dispersed throughout the genome. Repeats can generate insertions, deletions, and unequal crossing-over within genomes and affect protein functions. Hence, repeats play important roles in genome evolution. Repeat identification is normally the first step in studying repeats and a critical part of sequence analysis. For protein sequences, some repeats are commonly referred to as low complexity regions (LCRs). Although some computational tools have been developed to identify genomic repeats or LCRs, they are all geared toward specific situations and suffer from different problems. We develop novel methods to identify genomic repeats and LCRs, respectively. Genomic repeats and LCRs present difficulties in genome annotation and analysis. Local alignments between repeats cause many false positives in sequence similarity searches. These false positives can cause misassembly of genome sequences or misidentification of repeats as gene/protein sequences. Existing sequence similarity search algorithms either ignore the existence of these repeats or completely remove them. The first strategy produces false positives. The second strategy is not desirable, since no LCR-identification tool is 100% accurate. We develop new algorithms that use LCR information wisely to improve the accuracy and efficiency of sequence search.
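
As a rough illustration of what "biased composition" can mean in practice, the sketch below flags windows whose Shannon entropy falls at or below a cutoff. This is a generic textbook heuristic and not the method developed in the dissertation; the function names, the window size, and the threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the symbol composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 1.5) -> list[int]:
    """Start positions of windows whose entropy is at or below the threshold.

    `window` and `threshold` are illustrative defaults, not tuned values.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) <= threshold
    ]
```

A window made of a single repeated symbol has entropy 0, while a DNA window that uses all four bases equally has entropy 2 bits, so lower values indicate more biased composition.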

Repetitive Structures in Biological Sequences Algorithms and Applications

Book Details:

Author : Marco Pellegrini
Publisher : Frontiers Media SA
Release : 2016-10-27
ISBN : 288945018X
Pages : 95 pages

Book excerpt: Repetitive structures in biological sequences are emerging as an active focus of research and the unifying concept of "repeatome" (the ensemble of knowledge associated with repeating structures in genomic/proteomic sequences) has been recently proposed in order to highlight several converging trends. One main trend is the ongoing discovery that genomic repetitions are linked to many biologically significant events and functions. Diseases (e.g. Huntington's disease) have been causally linked with abnormal expansion of certain repeating sequences in the human genome. Deletions or multiple copy duplications of genes (Copy Number Variations) are important in the aetiology of cancer, Alzheimer's, and Parkinson's diseases. A second converging trend has been the emergence of many different models and algorithms for detecting non-obvious repeating patterns in strings with applications to genomic data. Borrowing methodologies from combinatorial pattern matching, string algorithms, data structures, data mining, and machine learning, these new approaches break the limitations of the current approaches and offer a new way to design better trans-disciplinary research. The articles collected in this book provide a glance into the rich emerging area of repeatome research, addressing some of its pressing challenges. We believe that these contributions are valuable resources for repeatome research and will stimulate further research from bioinformatic, statistical, and biological points of view.

Bioinformatics

Pattern Discovery in Bioinformatics

Book Details:

Author : Laxmi Parida
Publisher : CRC Press
Release : 2019-12-20
ISBN : 9780367388898
Pages : 512 pages

Book excerpt: The computational methods of bioinformatics are being used more and more to process the large volume of current biological data. Promoting an understanding of the underlying biology that produces this data, Pattern Discovery in Bioinformatics: Theory and Algorithms provides the tools to study regularities in biological data. Taking a systematic approach to pattern discovery, the book supplies sound mathematical definitions and efficient algorithms to explain vital information about biological data. It explores various data patterns, including strings, clusters, permutations, topology, partial orders, and boolean expressions. Each of these classes captures a different form of regularity in the data, providing possible answers to a wide range of questions. The book also reviews basic statistics, including probability, information theory, and the central limit theorem. This self-contained book provides a solid foundation in computational methods, enabling the solution of difficult biological questions.

Algorithms for Analysis of Multiple Biological Sequences

Book Details:

Author : Eugene V. Davydov
Publisher :
Release : 2009
ISBN :
Pages : pages

Book excerpt: Availability of massive amounts of genomic data from hundreds of species has introduced many challenging computational problems as well as the need for efficient algorithmic tools that leverage multiple species information to facilitate biological analysis. This dissertation discusses two such problems: noncoding RNA multiple structural alignment and constrained element detection. Noncoding RNA genes (ncRNAs) are regions of the genome that are transcribed but not translated into protein, and fold directly into secondary and tertiary structures which can have a variety of important biological functions. Because their function depends closely on the secondary structure, ncRNAs often do not exhibit enough primary sequence conservation to be properly aligned using standard sequence-based methods. I therefore consider the problem of RNA multiple structural alignment, i.e., performing sequence alignment and secondary structure prediction simultaneously. In the first part of this dissertation I introduce a novel graph theoretic framework for analyzing this problem and prove that when the number of sequences is not fixed it is NP-complete. I also provide a polynomial time algorithm that approximates the optimal solution to within a factor of O(log^2 n). Constrained elements are regions of the human genome exhibiting evidence of purifying selection and therefore biological function. Computational identification of such elements is one of the major goals of comparative genomics. In the second part of this dissertation I present GERP++, a new tool for efficient constrained element detection that significantly improves on one of the current leading methods, GERP.
While retaining GERP's biological transparency and metric for quantifying position-specific constraint, GERP++ uses a more rigorous method for computing evolutionary rates and a novel algorithm for element identification that uses statistical significance directly to evaluate and rank candidate elements. These algorithmic improvements decrease the running time by several orders of magnitude in practice, enabling high-throughput analysis of large data sets. Furthermore, I present analysis and biological interpretation of constrained elements identified by GERP++ in the human genome from recently available multiple species alignments.

Science

Biological Sequence Analysis

Book Details:

Author : Richard Durbin
Publisher : Cambridge University Press
Release : 1998-04-23
ISBN : 113945739X
Pages : 372 pages

Book excerpt: Probabilistic models are becoming increasingly important in analysing the huge amount of data being produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analysing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally of probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it aims to be accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, and at the same time present the state-of-the-art in this new and highly important field.

Science

Biological Sequence Analysis Using the SeqAn C++ Library

Book Details:

Author : Andreas Gogol-Döring
Publisher : CRC Press
Release : 2009-11-11
ISBN : 9781420076233
Pages : 0 pages

Book excerpt: An Easy-to-Use Research Tool for Algorithm Testing and Development. Before the SeqAn project, there was clearly a lack of available implementations in sequence analysis, even for standard tasks. Implementations of needed algorithmic components were either unavailable or hard to access in third-party monolithic software products. Addressing these concerns, the developers of SeqAn created a comprehensive, easy-to-use, open source C++ library of efficient algorithms and data structures for the analysis of biological sequences. Written by the founders of this project, Biological Sequence Analysis Using the SeqAn C++ Library covers the SeqAn library, its documentation, and the supporting infrastructure. The first part of the book describes the general library design. It introduces biological sequence analysis problems, discusses the benefit of using software libraries, summarizes the design principles and goals of SeqAn, details the main programming techniques used in SeqAn, and demonstrates the application of these techniques in various examples. Focusing on the components provided by SeqAn, the second part explores basic functionality, sequence data structures, alignments, pattern and motif searching, string indices, and graphs. The last part illustrates applications of SeqAn to genome alignment, consensus sequence in assembly projects, suffix array construction, and more. This handy book describes a user-friendly library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn enables not only the implementation of new algorithms, but also the sound analysis and comparison of existing algorithms.

Algorithms

Function based Algorithms for Biological Sequences

Book Details:

Author : Pragyan Sheela P. Mohanty
Publisher :
Release : 2015
ISBN :
Pages : 202 pages

Book excerpt: Two problems at two different abstraction levels of computational biology are studied. At the molecular level, efficient pattern matching algorithms in DNA sequences are presented. For gene order data, an efficient data structure is presented capable of storing all gene re-orderings in a systematic manner. A common characteristic of the presented methods is the use of binary decision diagrams that store and manipulate binary functions. Searching for a particular pattern in a very large DNA database is a fundamental and essential component in computational biology. In the biological world, pattern matching is required for finding repeats in a particular DNA sequence, finding motifs, aligning sequences, and so on. Due to the immense amount and continuous increase of biological data, the searching process requires very fast algorithms. This also requires encoding schemes for efficient storage of these search processes to operate on. Due to continuous progress in genome sequencing, genome rearrangements and construction of evolutionary genome graphs, which represent the relationships between genomes, become challenging tasks. Previous approaches are largely based on distance measures, so that relationships between more phylogenetic species can be established with some specifically required rearrangement operations and hence within certain computational time. However, because of the large volume of the available data, storage space and construction time for this evolutionary graph are still a problem. In addition, it is important to keep track of all possible rearrangement operations for a particular genome, as biological processes are uncertain. This study presents a binary function-based tool set for efficient DNA sequence storage.
A novel scalable method is also developed for fast offline pattern searches in large DNA sequences. This study also presents a method which efficiently stores all the gene sequences associated with all possible genome rearrangements such as transpositions and constructs the evolutionary genome structure much faster for multiple species. The developed methods benefit from the use of Boolean functions, their compact storage using a canonical data structure, and the existence of built-in operators for these data structures. The time complexities depend on the size of the data structures used for storing the functions that represent the DNA sequences and/or gene sequences. It is shown that the presented approaches exhibit sublinear time complexity in the sequence size. The number of nodes present in the DNA data structure, string search time on these data structures, depths of the genome graph structure, and the time of the rearrangement operations are reported. Experiments on DNA sequences from the NCBI database are conducted for DNA sequence storage and search process. Experiments on large gene order data sets such as human mitochondrial data and plant chloroplast data are conducted, and the depth of this structure was studied for evolutionary processes on gene sequences. The results show that the developed approaches are scalable.

Finding Conserved Patterns in Biological Sequences Networks and Genomes

Book Details:

Author : Qingwu Yang
Publisher :
Release : 2010
ISBN :
Pages : pages

Book excerpt: Biological patterns are widely used for identifying biologically interesting regions within macromolecules, classifying biological objects, predicting functions and studying evolution. Good pattern finding algorithms will help biologists to formulate and validate hypotheses in an attempt to obtain important insights into the complex mechanisms of living things. In this dissertation, we aim to improve and develop algorithms for five biological pattern finding problems. For the multiple sequence alignment problem, we propose an alternative formulation in which a final alignment is obtained by preserving pairwise alignments specified by edges of a given tree. In contrast with traditional NP-hard formulations, our preserving alignment formulation can be solved in polynomial time without using a heuristic, while having very good accuracy. For the path matching problem, we take advantage of the linearity of the query path to reduce the problem to finding a longest weighted path in a directed acyclic graph. We can find k paths with top scores in a network from the query path in polynomial time. As many biological pathways are not linear, our graph matching approach allows a non-linear graph query to be given. Our graph matching formulation overcomes the common weakness of previous approaches that there is no guarantee on the quality of the results. For the gene cluster finding problem, we investigate a formulation based on constraining the overall size of a cluster and develop statistical significance estimates that allow direct comparisons of clusters of different sizes.
We explore both a restricted version which requires that orthologous genes are strictly ordered within each cluster, and the unrestricted problem that allows paralogous genes within a genome and clusters that may not appear in every genome. We solve the first problem in polynomial time and develop practical exact algorithms for the second one. In the gene cluster querying problem, based on a querying strategy, we propose an efficient approach for investigating clustering of related genes across multiple genomes for a given gene cluster. By analyzing gene clustering in 400 bacterial genomes, we show that our algorithm is efficient enough to study gene clusters across hundreds of genomes.
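
The abstract mentions reducing path matching to a longest weighted path in a directed acyclic graph. The sketch below shows only that generic subproblem, solved by dynamic programming over a topological order. It is not the dissertation's algorithm: the edge-dictionary input format and the function name are assumptions made for the example, and it returns a single best path rather than the top k.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def longest_weighted_path(edges: dict[tuple[str, str], float]) -> tuple[float, list[str]]:
    """Best-scoring path in a DAG given as {(source, target): weight}.

    A path may start at any node, so every node begins with score 0.
    """
    # Map each node to the set of its predecessors.
    preds: dict[str, set[str]] = {}
    for u, v in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)

    best: dict[str, float] = {}
    back: dict[str, str | None] = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        best[node], back[node] = 0.0, None
        for p in preds[node]:
            candidate = best[p] + edges[(p, node)]
            if candidate > best[node]:
                best[node], back[node] = candidate, p

    if not best:
        return 0.0, []

    # Walk the back-pointers from the best end node to recover the path.
    node = max(best, key=best.get)
    score, path = best[node], []
    while node is not None:
        path.append(node)
        node = back[node]
    return score, path[::-1]
```

Because each edge is examined once after the topological sort, the running time is linear in the number of nodes and edges, which is what makes the acyclic case tractable.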

Mathematics

Introduction to Computational Biology

Book Details:

Author : Michael S. Waterman
Publisher : CRC Press
Release : 2018-05-02
ISBN : 1351437089
Pages : 248 pages

Book excerpt: Biology is in the midst of an era yielding many significant discoveries and promising many more. Unique to this era is the exponential growth in the size of information-packed databases. Inspired by a pressing need to analyze that data, Introduction to Computational Biology explores a new area of expertise that emerged from this fertile field: the combination of biological and information sciences. This introduction describes the mathematical structure of biological data, especially from sequences and chromosomes. After a brief survey of molecular biology, it studies restriction maps of DNA, rough landmark maps of the underlying sequences, and clones and clone maps. It examines problems associated with reading DNA sequences and comparing sequences to find common patterns. The author then considers the statistics of pattern counts in sequences, RNA secondary structure, and the inference of evolutionary history of related sequences. Introduction to Computational Biology exposes the reader to the fascinating structure of biological data and explains how to treat related combinatorial and statistical problems. Written to describe mathematical formulation and development, this book helps set the stage for even more, truly interdisciplinary work in biology.

Efficient Large Scale Machine Learning Algorithms for Genomic Sequences

Book Details:

Author : Daniel Quang
Publisher :
Release : 2017
ISBN : 9780355309577
Pages : 114 pages

Book excerpt: High-throughput sequencing (HTS) has led to many breakthroughs in basic and translational biology research. With this technology, researchers can interrogate whole genomes at single-nucleotide resolution. The large volume of data generated by HTS experiments necessitates the development of novel algorithms that can efficiently process these data. At the advent of HTS, several rudimentary methods were proposed. Often, these methods applied compromising strategies such as discarding a majority of the data or reducing the complexity of the models. This thesis focuses on the development of machine learning methods for efficiently capturing complex patterns from high volumes of HTS data. First, we focus on de novo motif discovery, a popular sequence analysis method that predates HTS. Given multiple input sequences, the goal of motif discovery is to identify one or more candidate motifs, which are biopolymer sequence patterns that are conjectured to have biological significance. In the context of transcription factor (TF) binding, motifs may represent the sequence binding preference of proteins. Traditional motif discovery algorithms do not scale well with the number of input sequences, which can make motif discovery intractable for the volume of data generated by HTS experiments. One common solution is to only perform motif discovery on a small fraction of the sequences. Scalable algorithms that simplify the motif models are popular alternatives. Our approach is a stochastic method that is scalable and retains the modeling power of past methods. Second, we leverage deep learning methods to annotate the pathogenicity of genetic variants.
Deep learning is a class of machine learning algorithms concerned with deep neural networks (DNNs). DNNs use a cascade of layers of nonlinear processing units for feature extraction and transformation. Each layer uses the output from the previous layer as its input. Similar to our novel motif discovery algorithm, artificial neural networks can be efficiently trained in a stochastic manner. Using a large labeled dataset comprised of tens of millions of pathogenic and benign genetic variants, we trained a deep neural network to discriminate between the two categories. Previous methods either focused only on variants lying in protein coding regions, which cover less than 2% of the human genome, or applied simpler models such as linear support vector machines, which cannot usually capture non-linear patterns as deep neural networks can. Finally, we discuss convolutional (CNN) and recurrent (RNN) neural networks, variations of DNNs that are especially well-suited for studying sequential data. Specifically, we stacked a bidirectional recurrent layer on top of a convolutional layer to form a hybrid model. The model accepts raw DNA sequences as inputs and predicts chromatin markers, including histone modifications, open chromatin, and transcription factor binding. In this specific application, the convolutional kernels are analogous to motifs, hence the model learning is essentially also performing motif discovery. Compared to a pure convolutional model, the hybrid model requires fewer free parameters to achieve superior performance. We conjecture that the recurrent layer allows our model to capture spatial and orientation dependencies among motifs better than a pure convolutional model can. With some modifications to this framework, the model can accept cell type-specific features, such as gene expression and open chromatin DNase I cleavage, to accurately predict transcription factor binding across cell types.
We submitted our model to the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge, where it was among the top performing models. We implemented several novel heuristics, which significantly reduced the training time and the computational overhead. These heuristics were instrumental to meet the Challenge deadlines and to make the method more accessible for the research community. HTS has already transformed the landscape of basic and translational research, proving itself as a mainstay of modern biological research. As more data are generated and new assays are developed, there will be an increasing need for computational methods to integrate the data to yield new biological insights. We have only begun to scratch the surface of discovering what is possible from both an experimental and a computational perspective. Thus, further development of versatile and efficient statistical models is crucial to maintaining the momentum for new biological discoveries.

Computers

Handbook of Exact String Matching Algorithms

Book Details:

Author : Christian Charras
Publisher : College PressPub Company
Release : 2004
ISBN : 9780954300647
Pages : 238 pages

Book excerpt: String matching is a very important subject in the wider domain of text processing. It consists of finding one, or more generally, all the occurrences of a string (more generally called a pattern) in a text. The Handbook of Exact String Matching Algorithms presents 38 methods for solving this problem. For each, it gives the main features, a description, its C code, an example and references.
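
To make the problem statement concrete, here is a minimal brute-force sketch that reports every start position at which a pattern occurs in a text. It is written in Python for illustration and is not taken from the handbook, which gives C code; the function name is an assumption.

```python
def brute_force_search(pattern: str, text: str) -> list[int]:
    """Return every start index where `pattern` occurs in `text`.

    Overlapping occurrences are reported. An empty pattern yields no hits.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    hits = []
    # Try every alignment of the pattern against the text.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits
```

This baseline takes time proportional to the product of the two lengths in the worst case; faster exact matching methods mainly differ in how they skip alignments or reuse earlier comparisons.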

Computers

The Burrows Wheeler Transform

Book Details:

Author : Donald Adjeroh
Publisher : Springer Science & Business Media
Release : 2008-06-17
ISBN : 038778909X
Pages : 353 pages

Book excerpt: The Burrows-Wheeler Transform is one of the best lossless compression methods available. It is an intriguing, even puzzling, approach to squeezing redundancy out of data, it has an interesting history, and it has applications well beyond its original purpose as a compression method. It is a relatively late addition to the compression canon, and hence our motivation to write this book, looking at the method in detail, bringing together the threads that led to its discovery and development, and speculating on what future ideas might grow out of it. The book is aimed at a wide audience, ranging from those interested in learning a little more than the short descriptions of the BWT given in standard texts, through to those whose research is building on what we know about compression and pattern matching. The first few chapters are a careful description suitable for readers with an elementary computer science background (and these chapters have been used in undergraduate courses), but later chapters collect a wide range of detailed developments, some of which are built on advanced concepts from a range of computer science topics (for example, some of the advanced material has been used in a graduate computer science course in string algorithms). Some of the later explanations require some mathematical sophistication, but most should be accessible to those with a broad background in computer science.
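
For readers who have not seen the transform, the following naive sketch builds it by sorting all rotations of the input and then inverts it by repeated sorting. It is a teaching version with quadratic or worse cost, not the approach a practical compressor would use; the function names and the choice of "$" as an end marker are assumptions for the example, and the input must not already contain the marker.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Naive inverse: repeatedly prepend the last column and re-sort."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

The transformed string tends to group identical characters into runs, which is what later compression stages exploit, and the sentinel is what makes the transform invertible without storing an extra index.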

Computer algorithms

Scalable Kernel Methods and Algorithms for General Sequence Analysis

Book Details:

Author : Pavel Kuksa
Publisher :
Release : 2011
ISBN :
Pages : 114 pages

Book excerpt: Analysis of large-scale sequential data has become an important task in machine learning and pattern recognition, inspired in part by numerous scientific and technological applications such as document and text classification or the analysis of biological sequences. However, current computational methods for sequence comparison still lack the accuracy and scalability necessary for reliable analysis of large datasets. To this end, we develop a new framework (efficient algorithms and methods) that solves sequence matching, comparison, classification, and pattern extraction problems in linear time, with increased accuracy, improving over the prior art. In particular, we propose novel ways of modeling sequences under complex transformations (such as multiple insertions, deletions, mutations) and present a new family of similarity measures (kernels), the spatial string kernels (SSK). SSKs can be computed very efficiently and perform better than the best available methods on a variety of distinct classification tasks. We also present new algorithms for approximate (e.g., with mismatches) string comparison that improve currently known time complexity bounds for such tasks and show order-of-magnitude running time improvements. We then propose novel linear time algorithms for representative pattern extraction in sequence data sets that exploit the developed computational framework.
In an extensive set of experiments on many challenging classification problems, such as detecting homology (evolutionary similarity) of remotely related proteins, categorizing texts, and performing classification of music samples, our algorithms and similarity measures display state-of-the-art classification performance and run significantly faster than existing methods.

Computers

Computational Molecular Biology

Book Details:

Author : S. Istrail
Publisher : Gulf Professional Publishing
Release : 2003-04-02
ISBN : 9780444513847
Pages : 196 pages

Book excerpt: This volume contains papers demonstrating the variety and richness of computational problems motivated by molecular biology. The application areas within biology that give rise to the problems studied in these papers include solid molecular modeling, sequence comparison, phylogeny, evolution, mapping, DNA chips, protein folding and 2D gel technology. The mathematical techniques used are algorithmics, combinatorics, optimization, probability, graph theory, complexity and applied mathematics. This is the fourth volume in the Discrete Applied Mathematics series on computational molecular biology, which is devoted to combinatorial and algorithmic techniques in computational molecular biology. This series publishes novel research results on the mathematical and algorithmic foundations of the inherently discrete aspects of computational biology. Key features: protein folding; phylogenetic inference; 2-dimensional gel analysis; graphical models for sequencing by hybridisation; dynamic visualization of molecular surfaces; problems and algorithms in sequence alignment. This book is a reprint of Discrete Applied Mathematics Volume 127, Number 1.

Dissertations, Academic

Dissertation Abstracts International

Book Details:

Author :
Publisher :
Release : 2008
ISBN :
Pages : 1006 pages


Genomes

Human Genome

Book Details:

Author :
Publisher :
Release : 1992
ISBN :
Pages : 264 pages
