Sunday, May 28, 2017

RNAseq CourseSource materials


Course Source, undergraduate

NIBLSE, bioinformatics core competencies

This set of bioinformatics core competencies for undergraduate life scientists is informed by the survey results of more than 1,200 people, analysis of 90 syllabi addressing bioinformatics across institutions and diverse departments, and discussion among experts across academia and industry. The bulleted lists contain examples illustrating the competencies.

  1. Explain the role of computation and data mining in addressing hypothesis-driven and hypothesis-generating questions within the life sciences: It is crucial for students to have a clear understanding of the role computing and data mining play in the modern life sciences. Given a traditional hypothesis-driven research question, students should have ideas about what types of data and software exist that could help them answer the question quickly and efficiently. They should also appreciate that mining large datasets can generate novel hypotheses to be tested in the lab or field.
    • What hypotheses can one generate based on the biometric data being compiled (Fitbit, Google, etc.)?
    • Understand the role of various databases in identifying potential gene targets for drug development
  2. Summarize key computational concepts, such as algorithms and relational databases, and their applications in the life sciences: In order to make use of sophisticated software and database tools, students must have a basic understanding of the underlying principles that these tools are based upon. Students are not expected to be experts in multiple algorithms or sophisticated databases, but currently the vast majority of life sciences majors never take a programming or database course, and have essentially zero exposure to how these tools work. This must change.
    • Be exposed to how data is organized in relational databases
    • Be able to modify the search parameters to achieve biologically meaningful results
    • Understand underlying algorithm(s) employed in sequence alignment (e.g. BLAST)
  3. Apply statistical concepts used in bioinformatics: Many biology curricula contain statistics, either as a standalone biostatistics course or as part of other courses such as capstone research courses. The primary distinction with regard to bioinformatics has to do with the statistics of large datasets and multiple comparisons.
    • Drug trials: Interpretation of well designed drug trial data
    • Transcriptomics: Understand the statistical modelling used to identify differentially expressed genes; Understand how genes implicated in cancer are identified using panels of sequenced tumor and WT cell lines or biopsies
    • Sequence similarity searching: Understand that there is a probability of finding a given sequence similarity score by chance (the p-value); The size of the database searched affects the probability that they would see that particular score in a particular search (the expectation, or e-value).
  4. Use bioinformatics tools to examine complex biological problems in evolution, information flow, and other important areas of biology: This competency is written broadly so as to encompass a variety of problems addressed using bioinformatics tools, from understanding the evolutionary underpinnings of sequence comparison and homology detection, to the distinctions between genomic sequences, RNA sequences, and protein sequences, to the interpretation of phylogenetic trees. We want to emphasize that bioinformatics tools can be used to teach existing parts of the curriculum such as the central dogma or phylogenetic relationships, thus integrating the bioinformatics into the curriculum as opposed to adding it on as an addition to an already overfull curriculum (and thus forcing decisions about what topic to remove to make room). The point of saying “complex” biological problems is that students should be able to work through a problem with multiple steps, not just perform isolated tasks.
    • Employ gene ontology tools (e.g., Mapman, GO, KEGG).
    • Understand protein sequence, structure, and function, using a variety of tools
    • Understand gene structure, genomic context, alternative splicing using genome browsers
    • Understand concept of homology
  5. Find, retrieve, and organize various types of biological data: Given the numerous and varied datasets currently being generated from all of the ‘omics fields, students should develop the facility to: identify appropriate data repositories; navigate and retrieve data from these databases; and organize data relevant to their area of study (in flat files or small local stand-alone databases).
    • Store and interrogate small datasets using spreadsheets or delimited text files.
    • Navigate and retrieve data from genome browsers
    • Retrieve data from protein and genome databases (PDB, UniProt, NCBI)
  6. Explore and/or model biological interactions, networks and data integration using bioinformatics: Modeling of biological systems at all levels, from cellular to ecological, is being facilitated by technological (e.g., sequencing, biochemical, genetics) and algorithmic advances. These models provide novel insights into the perturbations in systems causative of disease, interactions of microbes with various eukaryotic systems, and how metabolic networks respond to environmental stresses. Students should be familiar with the techniques used to generate these analyses, have the ability to interpret the outputs, and use the data to generate novel hypotheses.
    • Cell Biology: predict impact of gene knockout on cell-signaling pathway
    • Transcriptome: Analysis of transcriptomic data (RNA-Seq) available from SRS using Galaxy
    • Ecological: Analysis of microbial sequence data using QIIME on Galaxy
  7. Use command-line bioinformatics tools and write simple computer scripts: The majority of the datasets students should be familiar with and be able to interact with (e.g., genomic and proteomic sequences, BLAST results, RNASeq and resulting differential expression data) are text files. The most powerful and dynamic way to interact with these datasets is through the command line or shell scripting, both of which are readily acquired skills. Students need to have the flexibility to manipulate their own data, and to create and modify complex data processing and analysis workflows.
    • Write simple unix shell scripts to manipulate files
    • Apply RNASeq analyses using R (STAR, Tophat, DESeq2) to open source data sets (SRS)
    • Build and run statistical analyses using R or Python scripts
    • Run BLAST using command line options
  8. Describe and manage biological data types, structure, and reproducibility: This competency addresses two distinct concerns: 1) each of the varied ‘omics fields produce data in formats particular to its needs, and these formats evolve with changes in technologies and refinements in downstream software; and 2) all experimental data is subject to error and the user must be cognizant of the need to verify the reproducibility of their data. The first concern highlights the requirement for students to develop an awareness of and ability to manipulate different data types given the versioning of formats. The second points to the need for caution, to carry out appropriate statistical analyses on their data as part of normal operating procedures and report the uncertainty of their results, and to provide the relevant information to enable reproduction of their results. Sometimes students have the tendency to assume that anything they retrieve from an online database must be correct; they need to be taught that this is not always the case.
    • Reproducibility: Compare reproducibility of biological replicate data (e.g., transcriptomic data) using statistical tests (Spearman).
    • Formats: Understand the various sequence formats used to store DNA and protein sequences (FASTA, FASTQ); Understand the representation of gene features using Gene Feature Format (GFF) files; Mass-Spec
  9. Interpret the ethical, legal, medical, and social implications of biological data: The increasing scale and penetrance of human genetic and genomic data has greatly enhanced our ability to identify disease-related loci, druggable targets, and potential for gene replacements with developing techniques. However, with this information also comes many ethical, legal, and social questions which are often outpaced by the technological advances. As part of their scientific training, students should debate the medicinal, societal and ethical implications of these information sets and techniques.
    • How does the scientific community protect against the falsification or manipulation of large datasets?
    • Who should have access to this data, and how should it be protected?
    • What are the implications, good and bad, of being able to walk into a doctor’s office and have your genome sequenced and analyzed in minutes?
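
Note on competency 3 (sequence similarity searching): a minimal Python sketch of how database size changes the expectation for the same alignment score, assuming the Karlin-Altschul form E = K·m·n·exp(-λS) and P = 1 - exp(-E). The K and lambda values, sequence lengths, and function names are illustrative placeholders, not parameters from a real BLAST run.

```python
import math

def expect_value(score, query_len, db_len, k=0.13, lam=0.32):
    """Expected number of chance hits scoring >= score.

    E = K * m * n * exp(-lambda * S); k and lam are placeholder values.
    """
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

# Same score and query, two hypothetical database sizes (in residues).
small_db = expect_value(score=80, query_len=300, db_len=1_000_000)
large_db = expect_value(score=80, query_len=300, db_len=100_000_000)

print(f"E (small db) = {small_db:.3g}, P = {p_value(small_db):.3g}")
print(f"E (large db) = {large_db:.3g}, P = {p_value(large_db):.3g}")
```

The E-value scales linearly with database size, while for small E the p-value is nearly equal to E.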

Friday, May 26, 2017


DataCamp, $150 per year for access to advanced courses

Raspberry Pi can be hooked up to a monitor

Friday, May 19, 2017

day5, jackson lab

Mark Adams
microbial genomics service

mock microbial community => assess DNA extraction method, or other procedures

Aditya Srikanth Kovuri
Sandeep Namburi

NIST, cloud characteristics,
on-demand self-service
broad network access
resource pooling
rapid elasticity,
measured service

Amazon S3

Galaxy CloudMan

Google cloud is cheaper than GoogleCloud.
GoogleGenomics API.


Microsoft Azure Research awards

Google Research award

Thursday, May 18, 2017

day4, jackson lab,

=> Krish Karuturi
big data genomics, computational and informatics challenges
TORQUE resource manager

benchmarking pipelines

GSA, Efron & Tibshirani



Peter Robinson, Ph.D., The Jackson Laboratory for Genomic Medicine
Phenotype driven genome analysis

Ontology, disambiguates terms.

human phenotype ontology

information content (IC) of concept.

semantically similar diseases scores
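
A minimal Python sketch of the information content idea, assuming IC(term) = -log(fraction of diseases annotated to the term or any of its descendants). The similarity function uses the IC of the most informative common ancestor (Resnik similarity), which is one common choice and an assumption here, not something these notes confirm. The toy ontology, term names, and counts are all made up.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {
    "abnormality": [],
    "heart_abnormality": ["abnormality"],
    "limb_abnormality": ["abnormality"],
    "septal_defect": ["heart_abnormality"],
    "arrhythmia": ["heart_abnormality"],
}

# Made-up number of diseases annotated directly to each term.
# Simplification: each disease is assumed to be annotated to exactly one term.
DIRECT_COUNTS = {
    "abnormality": 0,
    "heart_abnormality": 2,
    "limb_abnormality": 5,
    "septal_defect": 1,
    "arrhythmia": 2,
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC = -log(p), where annotations propagate up to every ancestor."""
    total = sum(DIRECT_COUNTS.values())
    count = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(count / total)

def resnik_similarity(term_a, term_b):
    """IC of the most informative common ancestor of two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)

for term in PARENTS:
    print(term, round(information_content(term), 3))
```

Specific terms (few annotated diseases) get a high IC, while the root term has IC 0, so two phenotypes that share only the root score as dissimilar.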

Washington NL 2009, PLoS Biology

Y Ada Zhan, ChIP-seq

bd2kuser@ip-172-31-73-47:~/ChIPseq$ cat readme.txt
# ChIP-seq module #

# ChIP-seq data
In the directory ChIPseq/

# Genome
In the directory ChIPseq/hg38/

# Tools
 fastqc (quality check)
 bowtie (sequence mapping or alignments)
 samtools (manipulating alignments in SAM format. BAM format is a compressed version of SAM file)
 macs2 (peak calling)
 bedtools (to handle sequence coordinate files in BED format)

bd2kuser@ip-172-31-73-47:~/ChIPseq$ cat
# quality check
fastqc GM12878_control_chr1.fastq
fastqc GM12878_CTCF_chr1.fastq

# Prepare genome
bowtie-build hg38/GRCh38.chr1.fa hg38/GRCh38.chr1

# Mapping
bowtie -m 1 -S ./hg38/GRCh38.chr1 GM12878_control_chr1.fastq > GM12878_control_chr1.sam
bowtie -m 1 -S ./hg38/GRCh38.chr1 GM12878_CTCF_chr1.fastq > GM12878_CTCF_chr1.sam

# Further processing
## compress to BAM
samtools view -bSo GM12878_control_chr1.bam GM12878_control_chr1.sam
samtools view -bSo GM12878_CTCF_chr1.bam GM12878_CTCF_chr1.sam
## sort (legacy samtools 0.1.x syntax; samtools >= 1.3 expects: samtools sort -o out.sorted.bam in.bam)
samtools sort GM12878_control_chr1.bam GM12878_control_chr1.sorted
samtools sort GM12878_CTCF_chr1.bam GM12878_CTCF_chr1.sorted
## index
samtools index GM12878_control_chr1.sorted.bam
samtools index GM12878_CTCF_chr1.sorted.bam

# Peak calling
macs2 callpeak -t GM12878_CTCF_chr1.sorted.bam -c GM12878_control_chr1.sorted.bam -f BAM -g 175000000 -n GM12878_CTCF_chr1 -B -q 0.01

# Check the peak model
Rscript GM12878_CTCF_chr1_model.r

# Motif analysis
## extend summits 100bp on both directions
bedtools slop -i GM12878_CTCF_chr1_summits.bed -g hg38/GRCh38.chr1.size -b 100 > GM12878_CTCF_chr1_summits_ext.bed
## get sequence file (i.e. fasta)
bedtools getfasta -fi hg38/GRCh38.chr1.fa -bed  GM12878_CTCF_chr1_summits_ext.bed -fo GM12878_CTCF_chr1_summits_ext.fa
## The .fa file will be uploaded to MEME online server for motif discovery (

BED file format
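
A small Python sketch of what the summit-extension step does to BED coordinates, assuming the usual BED convention (tab-separated, 0-based start, end-exclusive) and clipping at the chromosome ends. The function name, the toy chromosome size, and the example lines are hypothetical; this is not the bedtools implementation.

```python
def slop(bed_line, chrom_sizes, flank=100):
    """Extend a BED interval by `flank` bp on both sides, clipped to the chromosome."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    new_start = max(0, start - flank)
    new_end = min(chrom_sizes[chrom], end + flank)
    # Extra columns (name, score, ...) are passed through unchanged.
    return "\t".join([chrom, str(new_start), str(new_end)] + fields[3:])

# Hypothetical toy chromosome size so that clipping is visible.
chrom_sizes = {"chr1": 1000}

summits = [
    "chr1\t50\t51\tpeak_1\t12.3",   # near the start: clipped to 0
    "chr1\t500\t501\tpeak_2\t8.1",  # in the middle: full 100 bp on each side
    "chr1\t950\t951\tpeak_3\t5.0",  # near the end: clipped to the chromosome size
]
extended = [slop(line, chrom_sizes) for line in summits]
for line in extended:
    print(line)
```

A 1 bp summit becomes a 201 bp window unless it sits within 100 bp of a chromosome end.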

MEME motif discovery

ChIPseek website for interactive data analysis,

Wednesday, May 17, 2017

day3, 20170517Wed Jackson Lab, Galaxy, IGV,

=> Paola Vera-Licona
gene network

time series gene expression data -> network

structure-based control of signaling networks (optimization of interaction? )

HER2-positive breast cancer

BiNoM           -> geneXplain --> OCSANA
gene expression -> list TFs ---> mapping pathways + master regulator --> identify optimal combination of intervention from network analysis

candidate genes with p-values
pick largest connected component
using random sampling permutation to evaluate the choice of p-value cutoff.
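
A rough Python sketch of the three steps above, under my own assumptions about the details: genes passing a p-value cutoff are mapped onto an interaction network, the largest connected component is measured, and random gene sets of the same size give an empirical p-value for that component size. The network, gene names, and p-values are made up, and the speaker's actual procedure may differ.

```python
import random
from collections import deque

def largest_component_size(genes, edges):
    """Size of the largest connected component among `genes` (BFS on the induced subgraph)."""
    genes = set(genes)
    neighbors = {g: set() for g in genes}
    for a, b in edges:
        if a in genes and b in genes:
            neighbors[a].add(b)
            neighbors[b].add(a)
    best, seen = 0, set()
    for gene in genes:
        if gene in seen:
            continue
        seen.add(gene)
        queue, size = deque([gene]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best

def permutation_p(candidates, all_genes, edges, n_perm=1000, seed=0):
    """Empirical p-value: how often a random gene set gives a component at least as large."""
    rng = random.Random(seed)
    observed = largest_component_size(candidates, edges)
    pool = sorted(all_genes)
    hits = sum(
        largest_component_size(rng.sample(pool, len(candidates)), edges) >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy network and per-gene p-values.
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6"), ("g7", "g8")]
pvals = {"g1": 0.001, "g2": 0.01, "g3": 0.02, "g4": 0.04, "g5": 0.3,
         "g6": 0.5, "g7": 0.2, "g8": 0.8, "g9": 0.6, "g10": 0.9}

cutoff = 0.05
candidates = [g for g, p in pvals.items() if p < cutoff]
observed, empirical_p = permutation_p(candidates, pvals.keys(), edges)
print(observed, empirical_p)
```

Repeating this for several cutoffs shows which cutoff gives a component that is larger than expected by chance.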

Using annotated pathways to build a directed network for intervention analysis and prediction.

How druggable? Drug repositioning?


=> Reinhard Laubenbacher

Karl Broman, Reproducible research (should be added to my REU bootcamp training).

biostatistics and medical informatics

IGV: needs a *.bam file for the alignments and a *.bai file for the index.

vcf file can be visualized in IGV or Ensembl Variant Effect Predictor.

Usually, large genes tend to have more mutations than small genes. Genes with repetitive elements tend to have more mutations.

network software

=> RTN, bioconductor


=> Cytoscape

=> KENev

=> MARINa (MATLAB)

=> ingenuity

=> geneXplain

bioconductor KEGG.db

KEGG.db contains mappings based on older data because the original
  resource was removed from the public domain before the most
  recent update was produced. This package should now be considered
  deprecated and future versions of Bioconductor may not have it
  available.  Users who want more current data are encouraged to
  look at the KEGGREST or reactome.db packages

KEGG ftp price list

personal account $2000 per year
organization account $5000 per year

Tuesday, May 16, 2017

day 2, afternoon, 20170516 jackson lab

genome data sources

Genome in a Bottle

Carl Zimmer

George Church

JAX HPC 256G RAM per node, 20 cores per node,

day2, morning, 20170516

=> Sheng Li, RNAseq
RNAseq library construction

Kukurba KR, Montgomery SB, Cold Spring Harb Protoc, 2015,

For microRNA, ~20nt, special protocol is required.

stranded and non-stranded library (to distinguish overlapping exons or genes on opposite DNA strands)

minimum reads: 20-25 million reads for a mammalian transcriptome

Illumina HiSeq 4000, ~4000 million reads per lane. 4-8 libraries per lane. Dual indexing is often used when multiplexing a large number of libraries.

2nd step, Gene annotation: GenCode
GTF format
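
A minimal Python sketch of reading one GTF record, assuming the standard nine tab-separated columns with 1-based inclusive coordinates and a `key "value";` attribute column. The example line, gene ID, and gene name are made up, and the parser ignores edge cases such as repeated keys or semicolons inside quoted values.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line):
    """Parse one GTF line into a dict; attributes become a nested dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    record = dict(zip(GTF_COLUMNS, cols))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attrs = {}
    for item in cols[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')  # a repeated key keeps only the last value
    record["attributes"] = attrs
    return record

# Hypothetical example record (IDs and names are illustrative only).
line = ('chr1\tEXAMPLE\tgene\t11869\t14409\t.\t+\t.\t'
        'gene_id "GENE0001"; gene_type "pseudogene"; gene_name "EXAMPLE1";')
rec = parse_gtf_line(line)
gene_length = rec["end"] - rec["start"] + 1  # 1-based inclusive coordinates
print(rec["attributes"]["gene_name"], rec["strand"], gene_length)
```

The feature length computed this way is what the length normalization in the quantification step would rely on.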

3rd step, gene expression quantification

RNAseq metric,
single-end RPKM, reads per kilobase per million reads
paired-end, FPKM, fragments per kilobase per million reads
normalize read counts for sequencing depth and gene length
TPM, transcripts per million
 pro: the sum of normalized values is the same for all samples (not true for R/FPKM)
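
A short Python sketch of the two normalizations, to make the "same sum for all samples" point concrete. The gene names, lengths, and counts are made-up toy values, and the functions take raw per-gene counts (reads for RPKM, fragments for FPKM).

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    total_millions = sum(counts.values()) / 1e6
    return {g: counts[g] / (lengths_bp[g] / 1e3) / total_millions for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / (lengths_bp[g] / 1e3) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: r / scale for g, r in rate.items()}

# Toy data: two samples with the same total read count but different composition.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 7000}
sample_1 = {"geneA": 100, "geneB": 200, "geneC": 700}
sample_2 = {"geneA": 800, "geneB": 100, "geneC": 100}

for name, sample in [("sample_1", sample_1), ("sample_2", sample_2)]:
    print(name,
          "sum RPKM =", round(sum(rpkm(sample, lengths_bp).values())),
          "sum TPM =", round(sum(tpm(sample, lengths_bp).values())))
```

The TPM values always sum to one million, while the RPKM sums depend on which genes the reads fall in, so RPKM values are harder to compare across samples.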

before 1st step, Quality check step.
 genebody coverage, (with genes)
 insert sizes
 GC content
 reads distribution
 adaptor enrichment (contamination or PCR amplification bias?)
 read quality
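
For the read quality item in the list above, a minimal Python sketch of decoding FASTQ quality strings, assuming the Phred+33 encoding (Sanger / Illumina 1.8+) and the standard relation Q = -10·log10(p). The function names and the example quality string are made up.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality_string]

def error_probability(q):
    """Base-call error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def mean_error(quality_string):
    """Average per-base error probability across a read."""
    probs = [error_probability(q) for q in phred_scores(quality_string)]
    return sum(probs) / len(probs)

example_quality = "IIII5555!!"  # hypothetical quality string for a 10 bp read
print(phred_scores(example_quality))
print(round(mean_error(example_quality), 4))
```

Averaging error probabilities rather than Phred scores avoids understating the effect of a few very poor bases.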

RSeQC, Liguo Wang, Bioinformatics 2012
  polyA selected 3' UTR, so 5'UTR degradation can be a problem.

Public data:
RNA-seq blog

ComBat (R), correction of batch effects

biological degradation of mRNA during aging, using sva latent variable, to distinguish biological degradation from non-biological degradation.

Single cell RNAseq, Ion Mandoiu

pseudotemporal order of cells

single cell mutational profiling and clonal phylogeny in cancer
Potter, Genome Re

cell type identification in primary visual cortex


challenges in single-cell RNAseq: low RT and sequencing depth, "zero inflated" data


Matching clusters to cell types or organism parts

10X genomics . .
neuron cortex

yeast GEO, aging and large scale

Dang lab
methylation and chip-seq

287 samples, Holstege

  • Sameith K, Amini S, Groot Koerkamp MJ, van Leenen D et al. 
    A high-resolution gene expression atlas of epistasis between gene-specific transcription factors exposes potential mechanisms for genetic interactions. BMC Biol 2015 Dec 23;13:112. PMID: 26700642

Aging, single cell expression data set

physiologically aged hematopoietic stem cells

To uphold appropriate homeostasis of short-lived blood cells, immature blood cells need to proliferate vigorously. Here, using a conditional H2B-mCherry labeling mouse-model, we characterize hematopoietic stem cell (HSC) and progenitor proliferation dynamics in steady state, upon physiological aging and following several types of induced stress. Following transplantation, HSCs shifted towards higher degrees of proliferation that was sustained long-term. HSCs were, by contrast, poorly recruited into proliferation following cytokine-induced mobilization and after acute depletions of selected blood cell lineages. Using indexed single cell sorting coupled to multiplex gene expression analyses, proliferation history separated candidate HSCs into units with distinct molecular and functional attributes. Our data thereby highlight that HSC proliferation following transplantation is fundamentally different not only from native hematopoiesis but also from other stress contexts, and demonstrate the power of divisional history as a functional criterion to resolve HSC heterogeneity
About 1000 genes are measured in GSE77477