- Conference paper: Characterizing metagenomic novelty with unexplained protein domain hits (German conference on bioinformatics 2014, 2014) Lingner, Thomas; Meinicke, Peter; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. In metagenomics, the discovery of functional novelty has always been pursued in a gene-centered manner. As a result, sequence-based analysis has been restricted to particular features and to sequences of sufficient length. We propose a statistical approach that is independent of the identification of single sequences and instead yields an overall characterization of a metagenome. Our method is based on the analysis of significant differences between the functional profile of a metagenome and its reconstruction from a combination of genomic profiles using the Taxy-Pro mixture model. Here, protein families with a large proportion of domain hits that cannot be explained by the model are interesting candidates for the exploration of metagenomic novelty. The results of three case studies indicate that our method is able to characterize metagenomic novelty in terms of the protein families that significantly contribute to unexplained domain counts. We found a good correspondence between our predictions and the discoveries in the original studies as well as specific indicators of functional novelty that have not yet been described.
- Conference paper: Large-scale bicluster editing (German conference on bioinformatics 2014, 2014) Sun, Peng; Guo, Jiong; Baumbach, Jan; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as simultaneous clustering or co-clustering, has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new approach, Bi-Force. It is based on the weighted bicluster editing model and performs biclustering on arbitrary sets of biological entities, given any kind of similarity function. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol from a recent review paper by Eren et al. and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, Bimax, Spectral, xMOTIFS and ISA. To this end, a suite of synthetic data sets as well as nine large gene expression data sets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. Bi-Force thus outperformed the existing tools, at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used data sets are publicly available at
- Conference paper: Interactive and dynamic web-based visual exploration of high dimensional bioimages with real time clustering (German conference on bioinformatics 2014, 2014) Rathke, Magnus; Kölling, Jan; Nattkemper, Tim W.; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. Web browsers and web applications have become common tools in bioinformatics over the past decades. Many existing web applications revolve around server-client interaction, where heavy computational tasks are often outsourced to the server and the presentation is handled on the client side. However, more recent additions to web browser technology make it possible to handle more complex operations on the client side itself, cutting out most of the server-client interaction except for data loading. This paper explores the potential of approaches to implement and speed up computationally expensive tasks, such as image cluster analysis, within a client-side web browser environment. The experimental results, which use the well-known k-means algorithm as a platform for various parallelization approaches, indicate that real-time image clustering is achievable. The results look especially promising for the available MALDI-MSI data set. Although multithreading approaches give good results, algorithmic approaches appear to be relevant too. Therefore, advancements in accelerating the k-means algorithm itself are considered.
- Conference paper: RNA-seq driven gene identification (German conference on bioinformatics 2014, 2014) Zickmann, Franziska; Lindner, Martin S.; Renard, Bernhard Y.; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. The reliable identification of genes is a challenging and crucial part of genome research. Various methods aiming at accurate predictions have evolved that predict genes either ab initio on reference sequences or evidence-based with the help of additional information. With high-throughput RNA-Seq data reflecting currently expressed genes, a particularly meaningful source of information has become commonly available. However, a particular challenge in including RNA-Seq data is the difficult handling of ambiguously mapped reads. Therefore, we developed GIIRA, a novel gene finder that is exclusively based on RNA-Seq data and inherently includes ambiguously mapped reads. Evaluation on simulated and real data and comparison with existing methods incorporating RNA-Seq information highlight the accuracy of GIIRA in identifying the expressed genes. Further, we developed a framework to integrate GIIRA and other gene finders to obtain a verified and accurate set of gene predictions.
- Conference paper: Flexible database-assisted graphical representation of metabolic networks for model comparison and the display of experimental data (German conference on bioinformatics 2014, 2014) Tillack, Jana; Bende, Melanie; Rother, Michael; Scheer, Maurice; Ulas, Susanne; Schomburg, Dietmar; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. Intracellular processes in living organisms are described by metabolic models. A visualization of metabolic models assists the interpretation of data or analysis results. We introduce the visualization tool DaViMM, which creates personalized graphical representations of metabolic networks for model comparison or for the display of measurements or analysis results. The tool is coupled to a relational database containing graphical network properties such as coordinates, which ensure an intuitive network layout. A combination of DaViMM, the graphical database, and available biochemical databases enables the automated creation of metabolic network maps. The flexibility of this combination is demonstrated with some application examples.
- Conference paper: A general approach for discriminative de novo motif discovery from high-throughput data (German conference on bioinformatics 2014, 2014) Grau, Jan; Posch, Stefan; Grosse, Ivo; Keilwagen, Jens; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. High-throughput techniques like ChIP-seq, ChIP-exo, and protein binding microarrays (PBMs) demand novel de novo motif discovery approaches that focus on accuracy and runtime on large data sets. While specialized algorithms have been designed for discovering motifs in in-vivo ChIP-seq/ChIP-exo data or in in-vitro PBM data, none of these work equally well for all of these high-throughput techniques. Here, we present Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data, which achieves competitive performance on both ChIP-seq and PBM data compared to recent approaches specifically designed for either technique. Hence, Dimont allows for investigating differences between in-vitro and in-vivo binding in an unbiased manner using a unified approach. For most transcription factors, Dimont discovers similar motifs from in-vivo and in-vitro data, but we also find notable exceptions. Scrutinizing the benefit of modeling dependencies between binding site positions, we find that more complex motif models often increase prediction performance and, hence, are a worthwhile field of research. Original paper: doi: 10.1093/nar/gkt831
- Conference paper: Blockclust: efficient clustering and classification of non-coding RNAs from short read RNA-seq profiles (German conference on bioinformatics 2014, 2014) Videm, Pavankumar; Rose, Dominic; Costa, Fabrizio; Backofen, Rolf; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. Sequence and secondary structure analysis can be used to assign putative functions to non-coding RNAs. However, sequence information is changed by post-transcriptional modifications, and secondary structure is only a proxy for the true 3D conformation of the RNA polymer. To tackle these issues, we can extract a different type of description from the pattern of processing, which can be observed through the traces left in small RNA-seq read data. To obtain an efficient and scalable procedure, we propose to encode expression profiles in discrete structures and to process them using fast graph-kernel techniques.
- Conference paper: Towards accurate transcription start site prediction: a modelling approach (German conference on bioinformatics 2014, 2014) Djordjevic, Marko; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. Promoter prediction in bacteria is a classical bioinformatics problem, where available methods for regulatory element detection exhibit a very high number of false positives. We argue here that accurate transcription start site (TSS) prediction is a complex problem, for which available methods for sequence motif discovery are not in themselves well adapted. We propose instead that the problem requires integrating a quantitative understanding of transcription initiation with a careful description of promoter sequence specificity. We review evidence for this viewpoint based on our recent work, and discuss current progress on accurate TSS detection using the example of sigma70 transcription start sites in E. coli.
- Conference paper: A pipeline for insertion sequence detection and study for bacterial genome (German conference on bioinformatics 2014, 2014) Al-Nayyef, Huda; Guyeux, Christophe; Bahi, Jacques M.; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. Insertion Sequences (ISs) are small DNA segments that are able to move themselves within genomes. These mobile genetic elements (MGEs) seem to play an essential role in genome rearrangements and in the evolution of prokaryotic genomes, but the tools for discovering ISs efficiently and accurately are still too few and not fully precise. Two main factors strongly affect IS discovery, namely gene annotation and functionality prediction. Indeed, some specific genes encode enzymes called “transposases”, which are responsible for the production and catalysis of such transposition, but there is currently no fully accurate method for deciding whether a given predicted gene is a real transposase or not. The authors of this article therefore aim to design a novel pipeline for IS detection and classification, which embeds the most recent tools available in this field of research, namely OASIS (Optimized Annotation System for Insertion Sequence) and the ISFinder database (an up-to-date and accurate repository of known insertion sequences). As the latter depends on predicted coding sequences, the proposed pipeline also encompasses various bacterial gene annotation tools (namely Prokka, BASys, and Prodigal). A complete IS detection and classification pipeline is then proposed and tested on a set of 23 complete genomes of Pseudomonas aeruginosa. This pipeline can also be used to investigate the performance of annotation tools, which has led us to conclude that Prodigal is the best software for IS prediction. An in-depth study of IS elements in P. aeruginosa has then been conducted, leading to the conclusion that closely related genomes within this species also have similar numbers of IS families and groups.
- Conference paper: Protein family analysis at the domain-level (German conference on bioinformatics 2014, 2014) Terrapon, Nicolas; Moore, Andrew; Bornberg-Bauer, Erich; Giegerich, Robert; Hofestädt, Ralf; Nattkemper, Tim W. The analysis of protein domains has gained considerable attention in recent years. Many new insights into protein modular evolution, combined with improved domain detection, have paved the way for an integrated analysis of protein families from a domain-centric perspective. We recently released DoMosaics, a Java application that facilitates the interactive analysis of protein domain arrangements. DoMosaics combines guided domain annotation, a highly customisable visualization of arrangements, and a number of analysis tools. It also integrates domain-centric algorithms such as CODD, which is used for the detection of divergent domain occurrences that have escaped Pfam thresholds, as well as RADS/RAMPAGE, which provides means to search for proteins with a domain arrangement similar to a given query. RADS provides an alignment of domain strings as opposed to amino-acid sequences, while RAMPAGE produces an amino-acid alignment guided by RADS results. Hence, RADS/RAMPAGE produces fast yet accurate alignments, and an associated ranking, of proteins with similar domain arrangements. Together, these tools greatly simplify the domain-centric analysis of protein function, structure and evolution.
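
The last abstract describes RADS as aligning domain strings rather than amino-acid sequences. As a rough illustration of that general idea only, the sketch below runs a textbook Needleman-Wunsch global alignment over lists of domain names. It is not the RADS algorithm: the function name `align_domains`, the scoring values, and the example domain names are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_domains(a: List[str], b: List[str],
                  match: int = 2, mismatch: int = -1,
                  gap: int = -1) -> Tuple[int, List[Pair]]:
    """Globally align two domain arrangements (hypothetical scoring values).

    Each list element is a whole domain name, so one alignment column
    compares two domains instead of two amino acids.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # domain in a aligned to a gap
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # domain in b aligned to a gap
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    query = ["SH3", "SH2", "Pkinase"]   # illustrative arrangement
    target = ["SH2", "Pkinase"]
    total, columns = align_domains(query, target)
    print(total)
    for left, right in columns:
        print(left or "-", right or "-")
```

Because each arrangement usually holds only a handful of domains, the dynamic-programming table stays tiny compared with a residue-level alignment, which is consistent with the speed claimed for domain-string alignment in the abstract.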