- Conference paper: Quantitative comparison of genomic-wide protein domain distributions (German Conference on Bioinformatics 2010, 2010) Parikesit, Arli A.; Stadler, Peter F.; Prohaska, Sonja J.; Schomburg, Dietmar; Grote, Andreas. Investigations into the origins and evolution of regulatory mechanisms require quantitative estimates of the abundance and co-occurrence of functional protein domains among distantly related genomes. Currently available databases, such as SUPERFAMILY, are not designed for quantitative comparisons, since they are built upon transcript and protein annotations provided by the various genome annotation projects. Large biases are introduced by differences in genome annotation protocols, which strongly depend on the availability of transcript information and of well-annotated, closely related organisms. Here we show that combining de novo gene predictors with subsequent HMM-based annotation of SCOP domains in the predicted peptides leads to consistent estimates with acceptable accuracy, which in particular can be used for systematic studies of the evolution of protein domain occurrences and co-occurrences. As an application, we considered four major classes of DNA binding domains: zinc-finger, leucine-zipper, winged-helix, and HMG-box. We found that different types of DNA binding domains systematically avoid each other throughout the evolution of Eukarya. In contrast, DNA binding domains belonging to the same superfamily readily co-occur in the same protein.
- Conference paper: Predicting miRNA targets utilizing an extended profile HMM (German Conference on Bioinformatics 2010, 2010) Grau, Jan; Arend, Daniel; Grosse, Ivo; Hatzigeorgiou, Artemis G.; Keilwagen, Jens; Maragkakis, Manolis; Weinholdt, Claus; Posch, Stefan; Schomburg, Dietmar; Grote, Andreas. The regulation of many cellular processes is influenced by miRNAs, and bioinformatics approaches for predicting miRNA targets evolve rapidly. Here, we propose conditional profile HMMs that learn rules of miRNA-target site interaction automatically from data. We demonstrate that conditional profile HMMs detect the rules implemented in existing approaches from their predictions. We also show that a simple UTR model utilizing conditional profile HMMs predicts target genes of miRNAs with a precision that is competitive with leading approaches, although it does not exploit cross-species conservation.
- Edited book: German Conference on Bioinformatics 2010 (2010) Schomburg, Dietmar; Grote, Andreas
- Conference paper: CASOP GS: computing intervention strategies targeted at production improvement in genome-scale metabolic networks (German Conference on Bioinformatics 2010, 2010) Bohl, Katrin; Figueiredo, Luís F. de; Hädicke, Oliver; Klamt, Steffen; Kost, Christian; Schuster, Stefan; Kaleta, Christoph; Schomburg, Dietmar; Grote, Andreas. Metabolic engineering aims to improve the production of desired biochemicals and proteins in organisms and therefore plays a central role in biotechnology. However, the design of overproducing strains is not straightforward due to the complexity of metabolic and regulatory networks. Thus, theoretical tools supporting the design of such strains have been developed. One particular method, CASOP, uses the set of elementary flux modes (EFMs) of a reaction network to propose strategies for the overproduction of a target compound. The advantage of CASOP over other approaches is that it does not consider a single specific flux distribution within the network but the whole set of possible flux distributions represented by the EFMs of the network. Moreover, its application not only identifies candidate loci that can be knocked out, but additionally proposes overexpression candidates. However, the use of CASOP has so far been restricted to small and medium-scale metabolic networks, since the entire set of EFMs cannot be enumerated in larger networks. This work presents an approach that allows CASOP to be used even in genome-scale networks. The approach is based on estimating the score used in CASOP from a sample of EFMs within a genome-scale network. Using EFMs from the genome-scale metabolic network gives a more reliable picture of the metabolic capabilities of an organism required for the design of overproducing strains. We applied our new method to identify strategies for the overproduction of succinate and histidine in Escherichia coli. The succinate case study, in particular, proposes engineering targets which resemble known strategies already applied in E. coli. Availability: Source code and an executable are available upon request.
- Conference paper: Uncovering the structure of heterogenous biological data: fuzzy graph partitioning in the k-partite setting (German Conference on Bioinformatics 2010, 2010) Blöchl, Florian; Hartsperger, Maria L.; Stümpflen, Volker; Theis, Fabian J.; Schomburg, Dietmar; Grote, Andreas. With the increasing availability of large-scale interaction networks derived either from experimental data or from text mining, we face the challenge of interpreting and analyzing these data sets in a comprehensive fashion. A particularity of these networks, which sets them apart from other examples in various scientific fields, lies in their k-partiteness. Whereas graph partitioning has received considerable attention, only a few researchers have focused on this generalized situation. Recently, Long et al. proposed a method for jointly clustering such a network and at the same time estimating a weighted graph connecting the clusters, thereby allowing simple interpretation of the resulting clustering structure. In this contribution, we extend this work by allowing fuzzy clusters for each node type. We propose an extended cost function for partitioning that allows for overlapping clusters. Our main contribution lies in the novel, efficient minimization procedure, which mimics the multiplicative update rules employed in algorithms for non-negative matrix factorization. Results on clustering a manually annotated bipartite gene-complex graph show significantly higher homogeneity between gene and corresponding complex clusters than expected by chance. The algorithm is freely available at http://cmb.helmholtz-muenchen.de/fuzzyclustering.
- Conference paper: Repeat-aware comparative genome assembly (German Conference on Bioinformatics 2010, 2010) Husemann, Peter; Stoye, Jens; Schomburg, Dietmar; Grote, Andreas. Current high-throughput sequencing technologies produce gigabytes of data even when prokaryotic genomes are processed. In a subsequent assembly phase, the generated overlapping reads are merged, ideally into one contiguous sequence. Often, however, the assembly results in a set of contigs which need to be stitched together with additional lab work. One reason why the assembly produces several distinct contigs is the presence of repetitive elements in the newly sequenced genome. While knowing the order and orientation of a set of non-repetitive contigs helps to close the gaps between them, special care has to be taken with repetitive contigs. Here we propose an algorithm that orders a set of contigs with respect to a related reference genome while treating the repetitive contigs in an appropriate way.
- Conference paper: Shape-based barrier estimation for RNAs (German Conference on Bioinformatics 2010, 2010) Bogomolov, Sergiy; Mann, Martin; Voß, Björn; Podelski, Andreas; Backofen, Rolf; Schomburg, Dietmar; Grote, Andreas. The ability of some RNA molecules to switch between different metastable conformations plays an important role in cellular processes. In order to identify such molecules and to predict their conformational changes, one has to investigate the refolding pathways. As a qualitative measure of these transitions, the barrier height marks the energy peak along such refolding paths. We introduce a meta-heuristic to estimate such barriers, which is an NP-complete problem. To guide an arbitrary path heuristic, the method uses RNA shape representative structures as intermediate checkpoints for detours. This enables a broad but efficient search for refolding pathways. The resulting Shape Triples meta-heuristic enables a close-to-optimal estimation of the barrier height that outperforms the precision of the employed path heuristic.
- Conference paper: Efficient similarity retrieval of protein binding sites based on histogram comparison (German Conference on Bioinformatics 2010, 2010) Fober, Thomas; Mernberger, Marco; Klebe, Gerhard; Hüllermeier, Eyke; Schomburg, Dietmar; Grote, Andreas. We propose a method for comparing protein structures or, more specifically, protein binding sites using a histogram-based representation that captures important geometrical and physico-chemical properties. In comparison to existing approaches in structural bioinformatics, especially methods from graph theory and computational geometry, our approach is computationally much more efficient. Moreover, despite its simplicity, it appears to capture and recover functional similarities surprisingly well.
- Conference paper: Learning pathway-based decision rules to classify microarray cancer samples (German Conference on Bioinformatics 2010, 2010) Glaab, Enrico; Garibaldi, Jonathan M.; Krasnogor, Natalio; Schomburg, Dietmar; Grote, Andreas. Despite recent advances in DNA chip technology, current microarray gene expression studies are still affected by high noise levels, small sample sizes and large numbers of uninformative genes. Combining microarray data with cellular pathway data by using new integrative analysis methods could help to alleviate some of these problems and provide new biological insights. We present a method for learning simple decision rules for class prediction from pairwise comparisons of cellular pathways in terms of gene set expression levels representing the up- and downregulation of pathway members. The procedure generates compact and comprehensible sets of rules, describing changes in the relative ranks of gene expression levels in pairs of pathways across different biological conditions. Results for two large-scale microarray studies, containing samples from prostate cancer and B-cell lymphoma patients, show that the method provides robust and accurate rule sets and new insights into differentially regulated pathway pairs. However, the main benefit of these predictive models in comparison to other classification methods, such as support vector machines, lies not in the attained accuracy levels but in the ease of interpretation and the insights they provide into the relative regulation of cellular pathways in the biological conditions under consideration.
- Conference paper: Efficient sequence clustering for RNA-seq data without a reference genome (German Conference on Bioinformatics 2010, 2010) Battke, Florian; Körner, Stephan; Hüttner, Steffen; Nieselt, Kay; Schomburg, Dietmar; Grote, Andreas. New deep-sequencing technologies are applied to transcript sequencing (RNA-seq) for transcriptomic studies. However, current approaches depend on the availability of a reference genome sequence for read mapping. We present Passage, a method for efficient read clustering in the absence of a reference genome that allows sequencing-based comparative transcriptomic studies for currently unsequenced organisms. If a reference genome is available, our method can be used to reduce the computational effort involved in read mapping. Comparisons to microarray data show a correlation of 0.69, proving the validity of our approach.
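
The entry "Shape-based barrier estimation for RNAs" above describes the barrier height as the energy peak along a refolding path. The following is only a minimal sketch of that notion, not the authors' Shape Triples method. It assumes that the barrier of a path is measured relative to the energy of its start structure, and that a search over several candidate paths (for example, direct paths and detours through checkpoints) keeps the lowest barrier found. The function names `path_barrier` and `best_barrier` and the energy values are hypothetical.

```python
from typing import Sequence


def path_barrier(energies: Sequence[float]) -> float:
    """Barrier of one path: the highest energy along the path,
    measured relative to the start structure (an assumed convention)."""
    if not energies:
        raise ValueError("a path must contain at least one structure")
    return max(energies) - energies[0]


def best_barrier(paths: Sequence[Sequence[float]]) -> float:
    """Lowest barrier among several candidate paths."""
    if not paths:
        raise ValueError("at least one candidate path is required")
    return min(path_barrier(p) for p in paths)


# Hypothetical energies (kcal/mol) of the structures visited along two
# candidate refolding paths between the same start and target structure.
direct = [-10.0, -6.5, -3.0, -8.0, -12.0]
detour = [-10.0, -7.0, -9.0, -5.5, -12.0]

print(path_barrier(direct))            # barrier of the direct path
print(best_barrier([direct, detour]))  # lowest barrier over both paths
```

In this toy input the detour has the lower peak, which mirrors the idea in the abstract that intermediate checkpoints can lead a path heuristic to a better barrier estimate than the direct path alone.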