Ribosome profiling data analysis workflow
The workflow assumes that the reads look like:
NNN <RPF> NNN NNN ADAPTOR
Where NNN are random nucleotides around the ribosome protected fragment (RPF).
Preparing genomes and annotations
“Undesired RNA” annotation
Example:
Organism of interest: Homo sapiens (Human)
A FASTA file containing rRNA, snRNA and snoRNA sequences can be generated from Ensembl biomart by clicking-on/selecting the following options in order:
- Database: Ensembl Genes (Version)
- Dataset: Human genes (Version)
- Click “Filters”
- Expand the “Gene” tab
- Select Transcript type: rRNA, snRNA, snoRNA
- Click “Attributes”
- Select “Sequences” (Radio button) and expand the “Sequences” tab
- Select “Unspliced (gene)”
- Expand the “Header information” tab
- Select “Gene stable ID”
- Click “Count”
- Click “Results”
- Download the FASTA file
tRNA FASTA sequences can be obtained from Genomic tRNA database
- Under “Links to Most Viewed Genomes”, select “Homo sapiens (GRCh38/hg38)”
- On the left hand side, select “FASTA Seqs”
- Download “High confidence tRNA sequences: hg38-tRNAs.fa”
Use NCBI Nucleotide module to procure 45S, 28S, 18S, 5.8S, 5S rRNA sequences.
- Search: biomol_rRNA[prop] AND “Homo sapiens”[Organism].
- On the top right, click “Send to”
- Select “File”
- Select Format: FASTA
- Click “Create File”
The above procedure will download all rRNA sequences (806 at the time of writing) that were shown in the result.
Once you have the above three FASTA files (for sno-, sn- and r-RNA; tRNA; and rRNA), concatenate them into one single FASTA file. This is your annotation for the “undesired” RNA.
# File from ENSEMBL: GRCh38_r_sno_sn_RNA_ENSEMBL.txt
# File from NCBI: GRCh38_rRNA_NCBI.txt
# File from tRNAdb: GRCh38_tRNA_tRNADB.txt
# Catenate the files
cat *.txt > non_target_rna.fa
Transcriptome annotation
A FASTA file containing sequences of principle splice isoforms that are extended by 18 nt in 5’- and 3’- ends can be generated from Ensembl biomart by clicking-on/selecting the following options in order: by clicking-on/selecting the following options in order:
- Database: Ensembl Genes (Version)
- Dataset: Human genes (Version)
- Click “Filters”
- Expand the “Gene” tab
- Select Transcript type: lncRNA (optional) , protein_coding
- Check Source (transcript): ensembl_havana
- Check Ensembl Canonical: Only
- Check APPRIS annotation
- Click “Attributes”
- Select “Sequences” (Radio button) and expand the “Sequences” tab
- Select “Coding sequence”
- Check Upstream flank and provide the value “18”
- Check the Downstream flank and provide the value “18”
- Expand the “Header information” tab
- Select “Transcript stable ID”
- Click “Count”
- Click “Results”
- Download the FASTA file (called GRCh38_ensembl_cds_plus18_annot.fa in this document)
Indexing
dir_in="/dir/path"
gen_in="/dir/path"
THREADS=128
# Safety first
set -e
date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Indexing non-target RNA fasta using Bowtie"
echo "-------------------------"
bowtie-build \
--threads $THREADS \
"$gen_in"/non_target_rna.fa \
"$gen_in"/non_target_rna
date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Indexing CDS fasta using Bowtie"
echo "-------------------------"
bowtie-build \
--threads $THREADS \
"$gen_in"/GRCh38_ensembl_cds_plus18_annot.fa \
"$gen_in"/GRCh38_ensembl_cds_plus18_annot
date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Indexing DONE!"
echo "-------------------------"
Removing 5' random nt and 3' adaptors from fastq files
dir_in="/dir/path"
gen_in="/dir/path"
THREADS=128
for file in "$dir_in"/*.fastq.gz; do
base=$(basename "$file" .fastq.gz)
date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Cutadapt working on "${base}""
echo "-------------------------"
# Remove 5' random nt and adaptor
cutadapt \
-j $THREADS \
--quality-cutoff 30 \
--minimum-length 25 \
--cut 3 \
--no-indels \
--discard-untrimmed \
--adapter CTGTAGGCACCATCAAT \
--output "$dir_in"/${base}_no_adapter.fastq.gz \
"$file" \
1> "$dir_in"/${base}_adapter_removal_cutadapt_log.txt
# Remove 3' random nt
cutadapt \
-j $THREADS \
--quality-cutoff 30 \
--minimum-length 25 \
--cut -6 \
--output "$dir_in"/${base}_clean.fastq.gz \
"$dir_in"/${base}_no_adapter.fastq.gz \
1> "$dir_in"/${base}_trim_cutadapt_log.txt
# remove temporary file
rm "$dir_in"/${base}_no_adapter.fastq.gz
date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Cutadapt done processing "${base}""
echo "-------------------------"
done
Mapping
dir_in="/dir/path"
gen_in="/dir/path"
THREADS=128
# Map to non-target
for file in "$dir_in"/*_clean.fastq.gz; do
base=$(basename "$file" _clean.fastq.gz)
echo "-------------------------"
echo "Bowtie aligning "$base" to non-target RNA"
echo "-------------------------"
bowtie \
--threads $THREADS \
--chunkmbs 320 \
--time \
--best \
--sam \
--un "$dir_in"/${base}_no_ncrna.fastq \
-x "$gen_in"/non_target_rna \
-q "$file" \
2> "$dir_in"/${base}_non_target_rna_map_log.txt \
1> /dev/null
echo "-------------------------"
echo "Compressing fastq"
echo "-------------------------"
pigz -p64 "$dir_in"/${base}_no_ncrna.fastq
echo "-------------------------"
echo "Bowtie DONE aligning "$base" to non-target RNA"
echo "-------------------------"
done
# Map to CDS
for file in "$dir_in"/*_no_ncrna.fastq.gz; do
base=$(basename "$file" _no_ncrna.fastq.gz)
echo "-------------------------"
echo "Bowtie aligning "$base" to CDS"
echo "-------------------------"
bowtie \
--threads "$THREADS" \
--chunkmbs 320 \
--time \
--best \
-v 1 \
-m 1 \
--norc \
--strata \
--sam \
-x "$gen_in"/GRCh38_ensembl_cds_plus18_annot \
-q "$file" \
2> "$dir_in"/${base}_cds_map_log.txt | \
samtools view \
-@ 10 \
-h -bS -F 4 > "$dir_in"/${base}_cds_unsorted.bam
echo "-------------------------"
echo "Samtool sorting: "${base}_cds_unsorted.bam""
echo "-------------------------"
# Sort the BAM file
samtools sort \
-@ $THREADS \
-o "$dir_in"/${base}_cds_sorted.bam "$dir_in"/${base}_cds_unsorted.bam
echo "-------------------------"
echo "Samtool indexing: "${base}_cds_sorted.bam""
echo "-------------------------"
# Index the BAM file
samtools index \
-@ "$THREADS" \
"$dir_in"/${base}_cds_sorted.bam
echo "-------------------------"
echo "Cleaning up: Removing unsorted BAM files"
echo "-------------------------"
rm "$dir_in"/${base}_cds_unsorted.bam
echo "-------------------------"
echo "Bowtie DONE aligning "$base""
echo "-------------------------"
done
Ribomala usage
Indexing
The above code will generate GRCh38_ensembl_cds_plus18_annot.csv file. This file will be required by the analysis module.
QC
where samples.txt contains:
Analysis
where sample_sheet.csv contains: