- Conference paper: Integration and visualisation of multimodal biological data (German conference on bioinformatics 2009, 2009) Rohn, Hendrik; Klukas, Christian; Schreiber, Falk; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Understanding complex biological systems requires data from many biological levels. Often these data are analysed in some meaningful context, for example by integrating them into biological networks. However, spatial data given as 2D images or 3D volumes are commonly not taken into consideration and are analysed separately. Here we present a new approach to integrate and analyse complex multimodal biological data in space and time. We present a data structure to manage this kind of data and discuss application examples for different data integration scenarios.
- Conference paper: Converting DNA to music: COMPOSALIGN (German conference on bioinformatics 2009, 2009) Ingalls, Todd; Martius, Georg; Hellmuth, Marc; Marz, Manja; Prohaska, Sonja J.; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Alignments are among the most important data types in the field of comparative genomics. They can be abstracted to a character matrix derived from aligned sequences. A variety of biological questions force the researcher to inspect these alignments. Our tool, called COMPOSALIGN, was developed to sonify large-scale genomic data. The resulting musical composition is based on COMMON MUSIC and allows the mapping of genes to motifs and species to instruments. It enables the researcher to listen to the musical representation of the genome-wide alignment and contrasts with a bioinformatician's sight-oriented work at the computer.
- Conference paper: Semi-supervised learning for improving prediction of HIV drug resistance (German conference on bioinformatics 2009, 2009) Perner, Juliane; Altmann, André; Lengauer, Thomas; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Resistance testing is an important tool in today's anti-HIV therapy management for improving the success of antiretroviral therapy. Routinely, the genetic sequence of viral target proteins is obtained. These sequences are then inspected for mutations that might confer resistance to antiretroviral drugs. However, interpretation of the genomic data is challenging. In recent years, approaches that employ supervised statistical learning methods have been made available to assist the interpretation of the complex genetic information (e.g. geno2pheno and VircoTYPE). However, these methods rely on large amounts of labeled training data, which are expensive and labor-intensive to obtain. This work evaluates the application of semi-supervised learning (SSL) for improving the prediction of resistance from the viral genome.
- Conference paper: Self-taught learning for classification of mass spectrometry data: a case study of colorectal cancer (German conference on bioinformatics 2009, 2009) Alexandrov, Theodore; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Mass spectrometry is an important technique for chemical profiling and is a major tool in proteomics, a discipline interested in large-scale studies of proteins expressed by an organism. In this paper we propose using a sparse coding algorithm for classification of mass spectrometry serum protein profiles of colorectal cancer patients and healthy individuals, following the so-called self-taught learning approach. Applied to a dataset of 112 spectra, each 4731 bins long, the sparse coding algorithm represents each spectrum by means of fewer than ten prototype spectra. The classification of spectra is done as in our previous study on the same dataset [ADM+09], using Support Vector Machines evaluated by means of double cross-validation. However, the classifiers take as input not discrete wavelet coefficients but the sparse coding coefficients. Comparing the classification results with reference results, we show that, while providing the same total recognition rate, the sparse coding-based procedure leads to higher generalization performance. Moreover, we propose using the sparse coding coefficients for clustering of mass spectra and demonstrate that this approach allows one to highlight differences between the cancer spectra.
- Conference paper: CUDA-based multi-core implementation of MDS-based bioinformatics algorithms (German conference on bioinformatics 2009, 2009) Fester, Thilo; Schreiber, Falk; Strickert, Marc; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Solving problems in bioinformatics often requires extensive computational power. Current trends in processor architecture, especially massively multi-core processors for graphics cards, combine a large number of cores into a single chip to improve the overall performance. The Compute Unified Device Architecture (CUDA) provides programming interfaces to make full use of the computing power of graphics processing units. We present a way to use CUDA for substantial performance improvement of methods based on multi-dimensional scaling (MDS). The suitability of the CUDA architecture as a high-performance computing platform is studied by adapting an MDS algorithm to specific hardware properties. We show how typical bioinformatics problems related to dimension reduction and network layout benefit from the multi-core implementation of the MDS algorithm. CUDA-based methods are introduced and compared to standard solutions, demonstrating accelerations of 50-fold and above.
- Conference paper: Identification of cancer and cell-cycle genes with protein interactions and literature mining (German conference on bioinformatics 2009, 2009) Royer, Loic; Plake, Conrad; Schroeder, Michael; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Gene prioritization based on background knowledge mined from literature has become an important method for the analysis of results from high-throughput experimental assays such as gene expression microarrays, RNAi screens and genome-wide association studies. We apply our gene mention identifier, which achieved the best result of over 80% in the BioCreative II text-mining challenge [HPR+08], and show how text-mined associations can be complemented using guilt-by-association on high-confidence protein interaction networks. First, we predict hand-curated gene-disease relationships in the OMIM database, Entrez Gene summaries and GeneRIFs with a 37% success rate. Second, we confirm 24% of novel cell-cycle genes identified in a recent RNAi screen [KPH+07] by using text mining and high-confidence protein interactions. Moreover, we show how 71% of GOA cell-cycle annotations can be automatically recovered. Third, we devise a method to rank genes based on novelty, increasing interest, impact, and popularity.
- Conference paper: Maximum likelihood estimation of weight matrices for targeted homology search (German conference on bioinformatics 2009, 2009) Menzel, Peter; Gorodkin, Jan; Stadler, Peter F.; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Genome annotation relies to a large extent on the recognition of homologs to already known genes. The starting point for such protocols is a collection of known sequences from one or more species, from which a model is constructed, either automatically or manually, that encodes the defining features of a single gene or a gene family. The quality of these models ultimately determines the success rate of the homology search. We propose here a novel approach to model construction that not only captures the characteristic motifs of a gene, but also adjusts the search pattern by including phylogenetic information. Computational tests demonstrate that this can lead to a substantial improvement of homology search models.
- Edited book: German conference on bioinformatics 2009 (2009) Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter
- Conference paper: Automated bond order assignment as an optimization problem (German conference on bioinformatics 2009, 2009) Dehof, Anna Katharina; Rurainski, Alexander; Lenhof, Hans-Peter; Hildebrandt, Andreas; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Numerous applications in Computational Biology process molecular structures and hence require not only reliable atomic coordinates, but also correct bond order information. Regrettably, this information is not always provided in molecular databases like the Cambridge Structural Database or the Protein Data Bank. Very different strategies have been applied to derive bond order information, most of them relying on the correctness of the atom coordinates. We extended a different ansatz proposed by Wang et al. that assigns heuristic molecular penalty scores based solely on connectivity information and tries to heuristically approximate its optimum. In this work, we present two efficient and exact solvers for the problem that replace the heuristic approximation scheme of the original approach: an ILP formulation and an A* approach. Both are integrated into the upcoming version of the Biochemical Algorithms Library BALL and have been successfully validated on the MMFF94 validation suite.
- Conference paper: Graph-kernels for the comparative analysis of protein active sites (German conference on bioinformatics 2009, 2009) Fober, Thomas; Mernberger, Marco; Moritz, Ralph; Hüllermeier, Eyke; Grosse, Ivo; Neumann, Steffen; Posch, Stefan; Schreiber, Falk; Stadler, Peter. Graphs are often used to describe and analyze the geometry and physicochemical composition of biomolecular structures, such as chemical compounds and protein active sites. A key problem in graph-based structure analysis is to define a measure of similarity that enables a meaningful comparison of such structures. In this regard, so-called kernel functions have recently attracted a lot of attention, especially since they allow for the application of a rich repertoire of methods from the field of kernel-based machine learning. Most of the existing kernel functions on graph structures, however, have been designed for the case of unlabeled and/or unweighted graphs. Since proteins are often more naturally and more exactly represented in terms of node-labeled and edge-weighted graphs, we propose corresponding extensions of existing graph kernels. Moreover, we propose an instance of the substructure fingerprint kernel suitable for the analysis of protein binding sites. The performance of these kernels is investigated by means of an experimental study in which graph kernels are used as similarity measures in the context of classification.