Life Sciences Workflow 2

Course objectives

The main objective of this course is to demonstrate how a bioinformatics workflow can be seamlessly run using TACC resources from raw input to biological interpretation. There are many different types of workflows you can run using the systems we have available here at TACC. In particular, transcriptome analysis will be demoed as it is a familiar concept among wide range of disciplines.

Understand Transcriptome analysis
Run through a RNA-seq data analysis pipeline using Stampede

1. Transcriptome analysis

Alt text

Transcriptome is simply defined as all of RNA molecules in a single cell.

Represents expression levels of all genes for a single condition of cell for a specie at given time.
Also Rereferred as expression profiling.
Unlike Genome analysis can show differences in the exact same cell that are in different external environment.

Only a decade ago, the study of gene expression was limited to human genetics for medical purposes or model organisms such as mouse, fruit fly and nematodes. Then, microarrays and serial analyses of gene expression were the only available tools for examining transcriptome. With the recent advances of next-generation sequencing technologies, the cost effectiveness of sequencing and maturation of analytical tools, the transcriptome analysis has become more a realistic option for genetic nonmodel organisms, even for individual laboratories.

a. Examples from high-impact journal

Alt text

Transcriptome studies often give high-impact results as it describes status of cells in different conditions in context of expression levels. Here are some examples of these in Science, Nature and Cell (Impact Factor 34.661, 38.138, 28.710 respectively)

b. Scientific background

The pipeline uses Next-generation sequencing RNA-seq data to as a raw input. These are prepared by (experimental workflow).

https://www.nature.com/article-assets/npg/nrg/journal/v12/n10/images/nrg3068-f1.jpg

2. Transcriptome analysis workflow using TopHat suite

Alt text

3. Where to find required test-dataset

Sequence Read Archive

https://www.ncbi.nlm.nih.gov/sra/SRX2771645 https://www.ncbi.nlm.nih.gov/sra/SRX2771647

Genome index and annotation https://ccb.jhu.edu/software/tophat/igenomes.shtml

Hands-on: Try downloading and extracting genome index and annotation files to a directory

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Arabidopsis_thaliana/Ensembl/TAIR10/Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz
tar -zxvf Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz

4. Tools used in this session

SRAtoolKit

Tuxedo suite: includes: TopHat, Cufflinks, Cuffmerge, Cuffdiff (cite) https://www.nature.com/article-assets/npg/nprot/journal/v7/n3/images_article/nprot.2012.016-F1.jpg

Samtools

5. Benchmarking analysis (TopHat)

What is the advantage of using TACC vs. workstation on transcriptome analysis? Testing number of threads and CPU time for an alignment to complete

Alt text

6. Information on test dataset (Scientific background)

The two sequencing files are of a plant model organism, Arabidopsis Thaliana.

A member of the Brassica family
A favorite research subject for plant molecular geneticists
Due to its remarkably small genome size (about 100,000 kB, with less than 25% repetitive sequences)
And ease of culture in the lab
Short stature and life cycle
Ameanable to genetic mapping studies and other techniques (such as transformation and gene cloning).

Alt text

Col-0 strain, which is columbia seed.
A widely used wild type seed, selected for its high fertility, vigor, and responsiveness to changes in photoperiod.
Contains no visible genetic markers.
Second is of jazQ mutant, which has enhanced JA-regulated defense against insect herbivory without an associated reduction in leaf growth.
Is a JA signalling mutant (jazQ), in which the removal of multiple JAZ repressors causes hyperactivation of JA responses
As a consequence, jazQ plants exhibit both enhanced resistance to insect herbivory and diminished growth of leaves and roots.

First hands on:

Convert SRA file to FASTQ

module load sratoolkit
scratch_cache
prefetch SRR5488800 ; fastq-dump SRR5488800 
prefetch SRR5488802 ; fastq-dump SRR5488802

cp /scratch/02114/wonaya/SSI/SRR5488800.fastq . ; cp /scratch/02114/wonaya/SSI/SRR5488802.fastq .

(2 minutes)

7. TopHat and Cufflink demo/hands-on

module load perl bowtie tophat

module load boost on ls5

What happens if you do just module load tophat without prerequisite modules?

TopHat run: Aligning sequences on arabidopsis genome guided with gene annotations

Remember Do NOT run long processes (more than 5 minutes) on compute node

tophat2 -p 4 -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf -o test_1 --no-novel-juncs Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/genome SRR5488800.fastq
tophat2 -p 4 -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf -o test_2 --no-novel-juncs Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/genome SRR5488802.fastq

cp -r /scratch/02114/wonaya/SSI/test_1/ . ; cp -r /scratch/02114/wonaya/SSI/test_2/ . (15*2 minutes)

Cufflinks:

cufflinks -o wt_cuff -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf test_1/accepted_hits.bam
cufflinks -o sa_cuff -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf test_2/accepted_hits.bam

cp -r /scratch/02114/wonaya/SSI/wt_cuff/ . ; cp -r /scratch/02114/wonaya/SSI/sa_cuff/ . (7*2 minutes)

Cuffmerge:

cuffmerge -g Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf -s Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa -p 16 cuffmerge.txt

in which the content of cuffmerge.txt should be (using text editor such as vim):

wt_cuff/transcripts.gtf
sa_cuff/transcripts.gtf 

cp /scratch/02114/wonaya/SSI/cuffmerge.txt . (2 minutes)

Cuffdiff:

cuffdiff -L WT,SA -p 16 merged_asm/merged.gtf test_1/accepted_hits.bam test_2/accepted_hits.bam

cp -r /scratch/02114/wonaya/SSI/merged_asm . (11 minutes)

8. Alignment statistics

cat test_1/align_summary.txt

How efficient was the alignment? TopHat reports alignment statistics internally, but other tools may not. In these cases, use samtools flagstat

9. Functional analysis with BARtools

So with this data, what can you deduce?

http://bar.utoronto.ca/ntools/cgi-bin/ntools_classification_superviewer.cgi

Using the dataset you got from running cuffdiff, you should be able to extract top 100 genes with highest log-fold changes in expression. Save the names of these genes as a separate text file, and copy the names of genes to the tool above. What functional enrichment do you see in mutant strain?

sort -nrk 10 gene_exp.diff > gene_exp.diff.sort
head -100 gene_exp.diff.sort > gene_exp.diff.sort.top100
cut -f 3 gene_exp.diff.sort.top100 > gene_exp.diff.sort.top100.names

or Can you pipe this into one process?

sort -nrk 10 gene_exp.diff | head -100 - | cut -f 3 - > gene_exp.diff.sort.nrk.top100.names

Results: Alt text

Removal of multiple JAZ genes in this jaz quintuple (jazQ) mutant led to constitutive activation of JA responses, causing hypersensitivity to exogenous JA treatment, upregulation of defense-related genes, increased production of secondary metabolites and higher resistance to insect herbivory attack Campos et al. Nat. Comms. 2016 https://www.ncbi.nlm.nih.gov/pubmed/27573094

Congratulations, You can now do Transcriptome analysis!

Hands-on

Try submitting to TACC system using a job script for tophat to cuffdiff workflow
Explore different RNAseq dataset

https://www.ncbi.nlm.nih.gov/sra/SRX2834995 https://www.ncbi.nlm.nih.gov/sra/SRX2834994

Back to Agenda

Computational Techniques for Life Sciences

Part of the TACC Institute Series, Immersive Training in Advanced Computation

Life Sciences Workflow 2

Course objectives

1. Transcriptome analysis

a. Examples from high-impact journal

b. Scientific background

2. Transcriptome analysis workflow using TopHat suite

3. Where to find required test-dataset

4. Tools used in this session

5. Benchmarking analysis (TopHat)

6. Information on test dataset (Scientific background)

7. TopHat and Cufflink demo/hands-on

Remember Do NOT run long processes (more than 5 minutes) on compute node

8. Alignment statistics

9. Functional analysis with BARtools

Congratulations, You can now do Transcriptome analysis!

Hands-on