Ribosome profiling data analysis workflow

The workflow assumes that the reads look like:

NNN <RPF> NNN NNN ADAPTOR

Where NNN are random nucleotides around the ribosome protected fragment (RPF).

Preparing genomes and annotations

“Undesired RNA” annotation

Example:

Organism of interest: Homo sapiens (Human)

A FASTA file containing rRNA, snRNA and snoRNA sequences can be generated from Ensembl biomart by clicking-on/selecting the following options in order:

Database: Ensembl Genes (Version)
Dataset: Human genes (Version)
Click “Filters”
Expand the “Gene” tab
Select Transcript type: rRNA, snRNA, snoRNA
Click “Attributes”
Select “Sequences” (Radio button) and expand the “Sequences” tab
Select “Unspliced (gene)”
Expand the “Header information” tab
Select “Gene stable ID”
Click “Count”
Click “Results”
Download the FASTA file

tRNA FASTA sequences can be obtained from Genomic tRNA database

Under “Links to Most Viewed Genomes”, select “Homo sapiens (GRCh38/hg38)”
On the left hand side, select “FASTA Seqs”
Download “High confidence tRNA sequences: hg38-tRNAs.fa”

Use NCBI Nucleotide module to procure 45S, 28S, 18S, 5.8S, 5S rRNA sequences.

Search: biomol_rRNA[prop] AND “Homo sapiens”[Organism].
On the top right, click “Send to”
Select “File”
Select Format: FASTA
Click “Create File”

The above procedure will download all rRNA sequences (806 at the time of writing) that were shown in the result.

Once you have the above three FASTA files (for sno-, sn- and r-RNA; tRNA; and rRNA), concatenate them into one single FASTA file. This is your annotation for the “undesired” RNA.

# File from ENSEMBL: GRCh38_r_sno_sn_RNA_ENSEMBL.txt
# File from NCBI: GRCh38_rRNA_NCBI.txt
# File from tRNAdb: GRCh38_tRNA_tRNADB.txt

# Catenate the files
cat *.txt > non_target_rna.fa

Transcriptome annotation

A FASTA file containing sequences of principle splice isoforms that are extended by 18 nt in 5’- and 3’- ends can be generated from Ensembl biomart by clicking-on/selecting the following options in order: by clicking-on/selecting the following options in order:

Database: Ensembl Genes (Version)
Dataset: Human genes (Version)
Click “Filters”
Expand the “Gene” tab
Select Transcript type: lncRNA (optional) , protein_coding
Check Source (transcript): ensembl_havana
Check Ensembl Canonical: Only
Check APPRIS annotation
Click “Attributes”
Select “Sequences” (Radio button) and expand the “Sequences” tab
Select “Coding sequence”
Check Upstream flank and provide the value “18”
Check the Downstream flank and provide the value “18”
Expand the “Header information” tab
Select “Transcript stable ID”
Click “Count”
Click “Results”
Download the FASTA file (called GRCh38_ensembl_cds_plus18_annot.fa in this document)

Indexing

dir_in="/dir/path"
gen_in="/dir/path"
THREADS=128

# Safety first
set -e

date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Indexing non-target RNA fasta using Bowtie"
echo "-------------------------"

bowtie-build \
--threads $THREADS \
"$gen_in"/non_target_rna.fa \
"$gen_in"/non_target_rna


date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Indexing CDS fasta using Bowtie"
echo "-------------------------"

bowtie-build \
--threads $THREADS \
"$gen_in"/GRCh38_ensembl_cds_plus18_annot.fa \
"$gen_in"/GRCh38_ensembl_cds_plus18_annot

date +"%d-%m-%Y %T"
echo "-------------------------"
echo "Indexing DONE!"
echo "-------------------------"

Removing 5' random nt and 3' adaptors from fastq files

dir_in="/dir/path"
gen_in="/dir/path"
THREADS=128

for file in "$dir_in"/*.fastq.gz; do

    base=$(basename "$file" .fastq.gz)

    date +"%d-%m-%Y %T"
    echo "-------------------------"
    echo "Cutadapt working on "${base}""
    echo "-------------------------"

    # Remove 5' random nt and adaptor
    cutadapt \
    -j $THREADS \
    --quality-cutoff 30 \
    --minimum-length 25 \
    --cut 3 \
    --no-indels \
    --discard-untrimmed \
    --adapter CTGTAGGCACCATCAAT \
    --output "$dir_in"/${base}_no_adapter.fastq.gz \
    "$file" \
    1> "$dir_in"/${base}_adapter_removal_cutadapt_log.txt

    # Remove 3' random nt
    cutadapt \
    -j $THREADS \
    --quality-cutoff 30 \
    --minimum-length 25 \
    --cut -6 \
    --output "$dir_in"/${base}_clean.fastq.gz \
    "$dir_in"/${base}_no_adapter.fastq.gz \
    1> "$dir_in"/${base}_trim_cutadapt_log.txt

    # remove temporary file
    rm "$dir_in"/${base}_no_adapter.fastq.gz

    date +"%d-%m-%Y %T"
    echo "-------------------------"
    echo "Cutadapt done processing "${base}""
    echo "-------------------------"

done

Mapping

dir_in="/dir/path"
gen_in="/dir/path"
THREADS=128

# Map to non-target
for file in "$dir_in"/*_clean.fastq.gz; do

    base=$(basename "$file" _clean.fastq.gz)

    echo "-------------------------"
    echo "Bowtie aligning "$base" to non-target RNA"
    echo "-------------------------"

    bowtie \
    --threads $THREADS \
    --chunkmbs 320 \
    --time \
    --best \
    --sam \
    --un "$dir_in"/${base}_no_ncrna.fastq \
    -x "$gen_in"/non_target_rna \
    -q "$file" \
    2> "$dir_in"/${base}_non_target_rna_map_log.txt \
    1> /dev/null

    echo "-------------------------"
    echo "Compressing fastq"
    echo "-------------------------"

    pigz -p64 "$dir_in"/${base}_no_ncrna.fastq

    echo "-------------------------"
    echo "Bowtie DONE aligning "$base" to non-target RNA"
    echo "-------------------------"


done

# Map to CDS
for file in "$dir_in"/*_no_ncrna.fastq.gz; do

    base=$(basename "$file" _no_ncrna.fastq.gz)

    echo "-------------------------"
    echo "Bowtie aligning "$base" to CDS"
    echo "-------------------------"

    bowtie \
    --threads "$THREADS" \
    --chunkmbs 320 \
    --time \
    --best \
    -v 1 \
    -m 1  \
    --norc \
    --strata \
    --sam \
    -x "$gen_in"/GRCh38_ensembl_cds_plus18_annot \
    -q "$file" \
    2> "$dir_in"/${base}_cds_map_log.txt | \
    samtools view \
    -@ 10 \
    -h -bS -F 4 > "$dir_in"/${base}_cds_unsorted.bam

    echo "-------------------------"
    echo "Samtool sorting: "${base}_cds_unsorted.bam""
    echo "-------------------------"

    # Sort the BAM file
    samtools sort \
    -@ $THREADS \
    -o "$dir_in"/${base}_cds_sorted.bam "$dir_in"/${base}_cds_unsorted.bam

    echo "-------------------------"
    echo "Samtool indexing: "${base}_cds_sorted.bam""
    echo "-------------------------"

    # Index the BAM file
    samtools index \
    -@ "$THREADS" \
    "$dir_in"/${base}_cds_sorted.bam

    echo "-------------------------"
    echo "Cleaning up: Removing unsorted BAM files"
    echo "-------------------------"

    rm "$dir_in"/${base}_cds_unsorted.bam


    echo "-------------------------"
    echo "Bowtie DONE aligning "$base""
    echo "-------------------------"

done

Ribomala usage

Indexing

ribomala index \
--fasta /path/GRCh38_ensembl_cds_plus18_annot.fa

The above code will generate GRCh38_ensembl_cds_plus18_annot.csv file. This file will be required by the analysis module.

QC

ribomala qc \
--samples /path/samples.txt \
--min 27 \
--max 34 \
--input /path/ \
--output /path/output/

where samples.txt contains:

> head sample.txt
sample_1.bam
sample_2.bam
sample_3.bam

Analysis

ribomala analysis \
--txcsv /path/GRCh38_ensembl_cds_plus18_annot.csv \
--samples /path/sample_sheet.txt \
--input /path/bams/ \
--output /path/output/ \
--codon CAT,CAC

where sample_sheet.csv contains:

file_name   condition   read_length frame   offset  rep
sample_1.bam    X   30  2   16  A
sample_1.bam    X   31  2   16  A
sample_2.bam    X   29  0   15  B
sample_2.bam    X   30  0   15  B
sample_3.bam    Y   29  0   15  A
sample_3.bam    Y   30  0   15  A
sample_3.bam    Y   30  2   16  A
sample_3.bam    Y   31  2   16  A
sample_4.bam    Y   29  0   15  B
sample_4.bam    Y   30  0   15  B