View on GitHub

Computational Techniques for Life Sciences

Part of the TACC Institute Series, Immersive Training in Advanced Computation

Life Sciences Workflow 2

Course objectives

The main objective of this course is to demonstrate how a bioinformatics workflow can be seamlessly run using TACC resources from raw input to biological interpretation. There are many different types of workflows you can run using the systems we have available here at TACC. In particular, transcriptome analysis will be demoed as it is a familiar concept among wide range of disciplines.

1. Transcriptome analysis

Alt text

Transcriptome is simply defined as all of RNA molecules in a single cell.

Only a decade ago, the study of gene expression was limited to human genetics for medical purposes or model organisms such as mouse, fruit fly and nematodes. Then, microarrays and serial analyses of gene expression were the only available tools for examining transcriptome. With the recent advances of next-generation sequencing technologies, the cost effectiveness of sequencing and maturation of analytical tools, the transcriptome analysis has become more a realistic option for genetic nonmodel organisms, even for individual laboratories.

Alt text

a. Examples from high-impact journal

Alt text Alt text Alt text

Transcriptome studies often give high-impact results as it describes status of cells in different conditions in context of expression levels. Here are some examples of these in Science, Nature and Cell (Impact Factor 34.661, 38.138, 28.710 respectively)

b. Scientific background

The pipeline uses Next-generation sequencing RNA-seq data to as a raw input. These are prepared by (experimental workflow).

https://www.nature.com/article-assets/npg/nrg/journal/v12/n10/images/nrg3068-f1.jpg

2. Transcriptome analysis workflow using TopHat suite

Alt text

Alt text

3. Where to find required test-dataset

Sequence Read Archive

https://www.ncbi.nlm.nih.gov/sra/SRX2771645 https://www.ncbi.nlm.nih.gov/sra/SRX2771647

Genome index and annotation https://ccb.jhu.edu/software/tophat/igenomes.shtml

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Arabidopsis_thaliana/Ensembl/TAIR10/Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz
tar -zxvf Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz

4. Tools used in this session

SRAtoolKit

Tuxedo suite: includes: TopHat, Cufflinks, Cuffmerge, Cuffdiff (cite) https://www.nature.com/article-assets/npg/nprot/journal/v7/n3/images_article/nprot.2012.016-F1.jpg

Samtools

5. Benchmarking analysis (TopHat)

What is the advantage of using TACC vs. workstation on transcriptome analysis? Testing number of threads and CPU time for an alignment to complete

Alt text

6. Information on test dataset (Scientific background)

The two sequencing files are of a plant model organism, Arabidopsis Thaliana.

Alt text

First hands on:

(2 minutes)

module load perl bowtie tophat

module load boost on ls5

What happens if you do just module load tophat without prerequisite modules?

TopHat run: Aligning sequences on arabidopsis genome guided with gene annotations

Remember Do NOT run long processes (more than 5 minutes) on compute node

tophat2 -p 4 -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf -o test_1 --no-novel-juncs Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/genome SRR5488800.fastq
tophat2 -p 4 -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf -o test_2 --no-novel-juncs Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/genome SRR5488802.fastq

cp -r /scratch/02114/wonaya/SSI/test_1/ . ; cp -r /scratch/02114/wonaya/SSI/test_2/ . (15*2 minutes)

Cufflinks:

cufflinks -o wt_cuff -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf test_1/accepted_hits.bam
cufflinks -o sa_cuff -G Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf test_2/accepted_hits.bam

cp -r /scratch/02114/wonaya/SSI/wt_cuff/ . ; cp -r /scratch/02114/wonaya/SSI/sa_cuff/ . (7*2 minutes)

Cuffmerge:

cuffmerge -g Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf -s Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa -p 16 cuffmerge.txt

in which the content of cuffmerge.txt should be (using text editor such as vim):

wt_cuff/transcripts.gtf
sa_cuff/transcripts.gtf 

cp /scratch/02114/wonaya/SSI/cuffmerge.txt . (2 minutes)

Cuffdiff:

cuffdiff -L WT,SA -p 16 merged_asm/merged.gtf test_1/accepted_hits.bam test_2/accepted_hits.bam

cp -r /scratch/02114/wonaya/SSI/merged_asm . (11 minutes)

8. Alignment statistics

cat test_1/align_summary.txt

How efficient was the alignment? TopHat reports alignment statistics internally, but other tools may not. In these cases, use samtools flagstat

9. Functional analysis with BARtools

So with this data, what can you deduce?

http://bar.utoronto.ca/ntools/cgi-bin/ntools_classification_superviewer.cgi

Using the dataset you got from running cuffdiff, you should be able to extract top 100 genes with highest log-fold changes in expression. Save the names of these genes as a separate text file, and copy the names of genes to the tool above. What functional enrichment do you see in mutant strain?

sort -nrk 10 gene_exp.diff > gene_exp.diff.sort
head -100 gene_exp.diff.sort > gene_exp.diff.sort.top100
cut -f 3 gene_exp.diff.sort.top100 > gene_exp.diff.sort.top100.names

or Can you pipe this into one process?

sort -nrk 10 gene_exp.diff | head -100 - | cut -f 3 - > gene_exp.diff.sort.nrk.top100.names

Results: Alt text

Removal of multiple JAZ genes in this jaz quintuple (jazQ) mutant led to constitutive activation of JA responses, causing hypersensitivity to exogenous JA treatment, upregulation of defense-related genes, increased production of secondary metabolites and higher resistance to insect herbivory attack Campos et al. Nat. Comms. 2016 https://www.ncbi.nlm.nih.gov/pubmed/27573094

Congratulations, You can now do Transcriptome analysis!

Hands-on

https://www.ncbi.nlm.nih.gov/sra/SRX2834995 https://www.ncbi.nlm.nih.gov/sra/SRX2834994

Back to Agenda