Bioinformatics
MiPanda: A Resource for Analyzing and Visualizing Next-Generation Sequencing Transcriptomics Data
We created the Michigan Portal for the Analysis of NGS data (MiPanda; http://mipanda.org), an open-access online resource that provides the scientific community with access to the results of a large-scale computational analysis of thousands of high-throughput RNA sequencing (RNA-seq) samples. The portal provides access to gene expression profiles, enabling users to interrogate expression of genes across myriad normal and cancer tissues and cell lines. From these data, tissue- and cancer-specific expression patterns can be identified. Gene-gene co-expression profiles can also be interrogated. The current portal contains data for over 20,000 RNA-seq samples and will be continually updated. (Neoplasia. 2018 Nov;20(11):1144-1149)
Two-pass alignment improves novel splice junction quantification
Discovery of novel splicing from RNA sequence data remains a critical and exciting focus of transcriptomics, but reduced alignment power impedes expression quantification of novel splice junctions. Here, we profiled performance characteristics of two-pass alignment, which separates splice junction discovery from quantification. Per sample, across a variety of transcriptome sequencing datasets, two-pass alignment improved quantification of at least 94% of simulated novel splice junctions, and provided as much as 1.7-fold deeper median read depth over those splice junctions. We further demonstrated that two-pass alignment works by increasing alignment of reads to splice junctions by short lengths, and those potential alignment errors are readily identifiable by simple classification. Taken together, two-pass alignment promises to advance quantification and discovery of novel splicing events (Bioinformatics. 2016 Jan 1;32(1):43-9).
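The separation of discovery from quantification can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not the aligner or the pipeline used in the paper: the function names, the read representation (a junction plus an overhang length), and the two overhang thresholds are all hypothetical, chosen to mimic the idea that an aligner demands a longer overhang to call an unknown junction than to use a junction it has already been given.

```python
from collections import Counter

def toy_align(read, junctions, novel_min=12, known_min=3):
    """Toy aligner (hypothetical): a read is (junction, overhang).

    A junction already present in `junctions` is accepted with a short
    overhang; an unknown junction needs a longer one. Both thresholds
    are illustrative assumptions, not values from the paper.
    """
    junction, overhang = read
    needed = known_min if junction in junctions else novel_min
    return [junction] if overhang >= needed else []

def count_junctions(reads, junctions):
    """Single alignment pass: count the reads assigned to each junction."""
    counts = Counter()
    for read in reads:
        counts.update(toy_align(read, junctions))
    return counts

def two_pass_counts(reads, annotated):
    """Pass 1 discovers junctions; pass 2 re-aligns with them supplied."""
    first = count_junctions(reads, annotated)
    discovered = set(first) - set(annotated)
    return count_junctions(reads, set(annotated) | discovered)

# Hypothetical junctions given as (chromosome, donor, acceptor).
J1 = ("chr1", 100, 200)
J2 = ("chr1", 300, 400)
reads = [(J1, 20), (J1, 5), (J1, 4), (J2, 2)]

one_pass = count_junctions(reads, set())
two_pass = two_pass_counts(reads, set())
print(one_pass[J1], two_pass[J1])
```

In this toy model the second pass recovers the two short-overhang reads across J1 that the first pass rejected, while J2, which was never discovered in the first pass, stays uncounted.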
Reconstructing targetable pathways in lung cancer by integrating diverse omics data
Global 'multi-omics' profiling of cancer cells harbors the potential for characterizing the signalling networks associated with specific oncogenes. We profiled the transcriptome, proteome and phosphoproteome in a panel of non-small cell lung cancer (NSCLC) cell lines in order to reconstruct targetable networks associated with KRAS dependency. We developed a two-step bioinformatics strategy addressing the challenge of integrating these disparate data sets. We first defined an 'abundance-score' combining transcript, protein and phospho-protein abundances to nominate differentially abundant proteins and then used the Prize Collecting Steiner Tree algorithm to identify functional sub-networks. We identified three modules centered on KRAS and MET, LCK and PAK1, and β-Catenin. We validated the activation of these proteins in KRAS-dependent (KRAS-Dep) cells and performed functional studies defining LCK as a critical gene for cell proliferation in KRAS-Dep but not KRAS-independent NSCLCs. These results suggest that LCK is a potential druggable target protein in KRAS-Dep lung cancers. (Nat Commun. 2013 Oct 18;4:2617)
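For readers unfamiliar with the Prize Collecting Steiner Tree problem, the Python sketch below evaluates its standard textbook objective for a candidate tree: the total cost of the edges used plus the total prize of the nodes left out. It is not the solver or the scoring used in the study. The function name, the node labels, and all prize and cost values are hypothetical; in the strategy described above the prizes would presumably derive from the abundance-score, whose formula is not given here.

```python
def pcst_objective(prizes, edge_costs, tree_edges):
    """Textbook PCST objective: edge costs paid plus prizes forgone.

    prizes:     dict mapping node -> prize (0 for nodes with no prize)
    edge_costs: dict mapping (u, v) -> cost of that edge
    tree_edges: list of (u, v) keys from edge_costs forming the candidate
                tree (connectivity is assumed here, not checked)
    """
    covered = {node for edge in tree_edges for node in edge}
    paid = sum(edge_costs[edge] for edge in tree_edges)
    forgone = sum(p for node, p in prizes.items() if node not in covered)
    return paid + forgone

# Hypothetical network: a, b, c carry prizes; s is a prize-free connector.
prizes = {"a": 5.0, "b": 3.0, "c": 4.0, "s": 0.0}
edge_costs = {
    ("a", "b"): 1.0,
    ("b", "c"): 6.0,
    ("a", "s"): 1.0,
    ("s", "c"): 1.0,
}

skip_c = pcst_objective(prizes, edge_costs, [("a", "b")])
direct = pcst_objective(prizes, edge_costs, [("a", "b"), ("b", "c")])
via_s = pcst_objective(prizes, edge_costs, [("a", "b"), ("a", "s"), ("s", "c")])
print(skip_c, direct, via_s)
```

In this made-up example the cheapest tree reaches c through the prize-free node s, which shows how the formulation can pull in connecting nodes that were not themselves nominated.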
Oculus: faster sequence alignment by streaming read compression
Despite significant advancement in alignment algorithms, the exponential growth of nucleotide sequencing throughput threatens to outpace bioinformatic analysis. Computation may become the bottleneck of genome analysis if growing alignment costs are not mitigated by further improvement in algorithms. Much gain has been gleaned from indexing and compressing alignment databases, but many widely used alignment tools process input reads sequentially and are oblivious to any underlying redundancy in the reads themselves. We developed Oculus, a software package that attaches to standard aligners and exploits read redundancy by performing streaming compression, alignment, and decompression of input sequences. This nearly loss-less process (> 99.9%) led to alignment speedups of up to 270% across a variety of data sets, while requiring a modest amount of memory. We expect that streaming read compressors such as Oculus could become a standard addition to existing RNA-Seq and ChIP-Seq alignment pipelines, and potentially other applications in the future as throughput increases. Oculus efficiently condenses redundant input reads and wraps existing aligners to provide nearly identical SAM output in a fraction of the aligner runtime. It includes a number of useful features, such as tunable performance and fidelity options, compatibility with FASTA or FASTQ files, and adherence to the SAM format (BMC Bioinformatics. 2012 Nov 13;13:297; http://www.ncbi.nlm.nih.gov/pubmed/23148484).
The platform-independent C++ source code is freely available online:
http://code.google.com/p/oculus-bio
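The compress, align, decompress idea can be sketched in a few lines of Python. This is a simplified illustration and not the Oculus C++ implementation: it collapses only exact duplicate sequences, and the names `compress_reads`, `align_with_compression`, and the string-search `toy_aligner` are hypothetical stand-ins introduced here.

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable, List, Tuple

def compress_reads(reads: Iterable[Tuple[str, str]]) -> "OrderedDict[str, List[str]]":
    """Map each distinct sequence to the names of all reads that carry it."""
    unique: "OrderedDict[str, List[str]]" = OrderedDict()
    for name, seq in reads:
        unique.setdefault(seq, []).append(name)
    return unique

def align_with_compression(
    reads: Iterable[Tuple[str, str]],
    align: Callable[[str], int],
) -> Dict[str, int]:
    """Align each distinct sequence once, then copy the hit to every duplicate."""
    results: Dict[str, int] = {}
    for seq, names in compress_reads(reads).items():
        hit = align(seq)          # one aligner call per distinct sequence
        for name in names:        # "decompression": expand back to all reads
            results[name] = hit
    return results

# Toy stand-in for a real aligner: exact substring search in a tiny reference.
reference = "TTACGTGGCA"
calls: List[str] = []

def toy_aligner(seq: str) -> int:
    calls.append(seq)             # record how often the aligner is invoked
    return reference.find(seq)

reads = [("r1", "ACGT"), ("r2", "GGCA"), ("r3", "ACGT")]
results = align_with_compression(reads, toy_aligner)
print(results, len(calls))
```

Here three reads cost only two aligner calls, because r1 and r3 share a sequence. The saving grows with the redundancy of the input, which is why highly redundant data such as RNA-Seq and ChIP-Seq reads are the natural use case.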
ChimeraScan: a tool for identifying chimeric transcription in sequencing data
We previously used high-throughput paired-end transcriptome sequencing (RNA-Seq) to detect aberrant, chimeric RNAs and uncovered recurrent classes of clinically relevant gene fusions such as those found in breast cancer described below. This discovery was facilitated by the development of an open-source software package, ChimeraScan, for the discovery of chimeric transcription between two independent transcripts in high-throughput transcriptome sequencing data (schematic shown in Fig. 1). ChimeraScan includes features such as the ability to process long (>75 bp) paired-end reads, processing of ambiguously mapping reads, detection of reads spanning a fusion junction, integration with the popular Bowtie aligner, support for the standardized SAM format, and generation of HTML reports for easy investigation of results. Overall, we believe that ChimeraScan will facilitate the discovery of additional gene fusions that may serve as clinically relevant targets in cancer. (Bioinformatics. 2011 Oct 15;27(20):2903-4)