- KonferenzbeitragA method for fast approximate searching of polypeptide structures in the PDB(German Conference on Bioinformatics 2004, GCB 2004, 2004) Täubig, Hanjo; Buchner, Arno; Griebsch, Jan; Giegerich, Robert; Stoye, JensThe main contribution of this paper is a novel approach for fast searching in huge structural databases like the PDB. The data structure is based on an adaption of the generalized suffix tree and relies on an translationand rotation-invariant representation of the protein backbone. The method was evaluated by applying structural queries to the PDB and comparing the results to the established tool SPASM. Our experiments show that the new method reduces the query time by orders of magnitude while producing comparable results.
- KonferenzbeitragPoSSuMsearch: Fast and sensitive matching of position specific scoring matrices using enhanced suffix arrays(German Conference on Bioinformatics 2004, GCB 2004, 2004) Beckstette, Michael; Strothmann, Dirk; Homann, Romann; Giegerich, Robert; Kurtz, Stefan; Giegerich, Robert; Stoye, JensIn biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs. In this paper, we present a new nonheuristic algorithm, called ESAsearch, to efficiently find matches of such matrices in large databases. Our approach preprocesses the search space, e.g. a complete genome or a set of protein sequences, and builds an enhanced suffix array which is stored on file. The enhanced suffix array only requires 9 bytes per input symbol, and allows to search a database with a PSSM in sublinear expected time. We also address the problem of non-comparable PSSM-scores by developing a method which allows to efficiently compute a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming. In contrast to other methods it employs lazy evaluation of the dynamic programming matrix: it only evaluates those matrix entries that are necessary to derive the sought similarity threshold. We tested algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch show speedups of a factor between 4 and 50 for nucleotide PSSMs, and speedups up to a factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 10. The lazy evaluation method is also much faster than previous methods, with speedups by a factor of at least 10.
- KonferenzbeitragKleene's theorem and the solution of metabolic carbon labeling systems(German Conference on Bioinformatics 2004, GCB 2004, 2004) Isermann, Nicole; Weitzel, Michael; Wiechert, Wolfgang; Giegerich, Robert; Stoye, JensCarbon Labeling Systems (CLS) are large equation systems that describe the dynamics of labeled carbon atoms in a metabolic network. The rapid solution of these systems is the algorithmic backbone of 13C Metabolic Flux Analysis (MFA) which has become one of the most widely used tools in Metabolic Engineering. A new algorithm is presented for the solution of CLS which is not based on iteration schemes or numerical linear algebra methods but on path tracing of labeled particles. It is shown that the set of all paths from the system input to the internal network nodes directly gives the clue to an explicit solution of CLS. The promising potential of this new solution algorithm are outlined.
- KonferenzbeitragMultiple sequence alignment with user-defined constraints(German Conference on Bioinformatics 2004, GCB 2004, 2004) Morgenstern, Burkhard; Prohaska, Sonja J.; Werner, Nadine; Weyer-Menkhoff, Jan; Schneider, Isabelle; Subramanian, Amarendran R.; Stadler, Peter F.; Giegerich, Robert; Stoye, JensIn many situations, automated multi-alignment programs are not able to correctly align families of nucleic acid or protein sequences. Distantly related sequences are generally hard to align, and sequence duplications may present additional challenges to standard alignment algorithms. In the present paper, we describe a semiautomatic approach to multiple sequence alignment. The user can specify parts of the sequences that are thought to be related to each other; our software program will use these sites as anchor points and create a multiple alignment respecting these userdefined constraints. By using functionally, structurally or evolutionarily related positions of the input sequences as anchor points, the proposed method can produce alignments that are biologically more meaningful than alignments produced by fully automated procedures. We apply our approach to genomic sequences around the Hox gene cluster. As a by-product, we obtain useful insights for the further development of alignment algorithms. The described alignment approach has been integrated into the tracker software system.
- KonferenzbeitragSequenceJuxtaposer: Fluid navigation for large-scale sequence comparison in context(German Conference on Bioinformatics 2004, GCB 2004, 2004) Slack, James; Hildebrand, Kristian; Munzner, Tamara; John, Katherine St.; Giegerich, Robert; Stoye, JensSequenceJuxtaposer is a sequence visualization tool for the exploration and comparison of biomolecular sequences. We use an information visualization technique called “accordion drawing” that guarantees three key properties: context, visibility, and frame rate. We provide context through the navigation metaphor of a rubber sheet that can be smoothly stretched to show more details in the areas of focus, while the surrounding regions of context are correspondingly shrunk. Landmarks, such as motifs or differences between aligned base pairs, are guaranteed to be visible even if located in the shrunken areas of context. Our graphics infrastructure for progressive rendering provides immediate responsiveness to user interaction by guaranteeing that we redraw the scene at a target frame rate. SequenceJuxtaposer supports interaction at 20 frames per second when browsing collections of several hundred sequences that comprise over 1.7 million total base pairs.
- KonferenzbeitragIdentification and measurement of neigbor dependent nucleotide substitution processes(German Conference on Bioinformatics 2004, GCB 2004, 2004) Arndt, Peter F.; Giegerich, Robert; Stoye, JensThe presence of different neighbor dependent substitution processes generates specific patterns of dinucleotide frequencies in all organisms. Based on a general framework of how to include such processes into more realistic models of nucleotide substitutions we develop a method that is able to identify such processes, measure their strength, and judge their importance to be included into the modeling. Starting from a model for neighbor independent nucleotide substitution we successively add neighbor dependent substitution processes in the order of their ability to increase the likelihood of the model describing the data. The analysis of neighbor dependent nucleotide substitutions in human, zebrafish and fruit fly is presented.
- KonferenzbeitragWeighted sequencing from compomers: DNA de-novo sequencing from mass spectrometry data in the presence of false negative peaks(German Conference on Bioinformatics 2004, GCB 2004, 2004) Böcker, Sebastian; Giegerich, Robert; Stoye, JensOne of the main endeavors in today's Life Science remains the efficient sequencing of long DNA molecules. Today, most de-novo sequencing of DNA is still performed using electrophoresis-based Sanger Sequencing introduced in 1977, in spite of certain restrictions of this method. Recently, we proposed a new method for DNA sequencing using base-specific cleavage and mass spectrometry, that appears to be a promising alternative to classical DNA sequencing approaches: Among its benefits is the extremely fast data acquisition of mass spectrometry. This leads to the combinatorial problem of Sequencing From Compomers (SFC), and to the definition of sequencing graphs. Simulations indicate that this method may allow for de-novo sequencing of DNA molecules with 200+ nt. An open problem in the context of SFC is that it does not take into account false negative peaks (missing peaks) that are common for real-world mass spectra. Here, we present a natural generalization of SFC, the Weighted Sequencing from Compomers (WSC) Problem, that allows us to cope with false negative peaks. We also show that the family of graphs introduced to solve SFC, can be generalized to capture the new aspects of WSC. Finally, we present a branch-and-bound algorithm to find all sequences that agree with the sample mass spectra with the exception of some missing peaks.
- Editiertes BuchGerman Conference on Bioinformatics 2004, GCB 2004(2004) Giegerich, Robert; Stoye, Jens
- KonferenzbeitragFeature based representation and detection of transcription factor binding sites(German Conference on Bioinformatics 2004, GCB 2004, 2004) Pudimat, Rainer; Schukat-Talamazzini, Ernst-Günter; Backofen, Rolf; Giegerich, Robert; Stoye, JensThe prediction of transcription factor binding sites is an important problem, since it reveals information about the transcriptional regulation of genes. A commonly used representation of these sites are position specific weight matrices which show weak predictive power. We introduce a feature-based modelling approach, which is able to deal with various kind of biological properties of binding sites and models them via Bayesian belief networks. The presented results imply higher model accuracy in contrast to the PSSM approach.
- KonferenzbeitragFast track to disease-specific drugs? The impact of in-situ proteomics imaging (Toponomics)(German Conference on Bioinformatics 2004, GCB 2004, 2004) Schubert, Walter; Giegerich, Robert; Stoye, Jens