API Reference
Index module
Indexing functionality for Ribomala. This module handles indexing of the transcriptome fasta file.
index_fasta(fasta_file)
Indexes a transcriptome FASTA file if it is not already indexed and returns a pysam.FastaFile object.
Parameters
fasta_file : str Path to the transcriptome FASTA file.
Returns
pysam.FastaFile
An indexed pysam.FastaFile object for the given transcriptome.
Notes
This function checks if an index file (.fai) exists for the given FASTA file.
If the index file is missing, it generates one using pysam.faidx.
The function also creates a CSV file with transcript information.
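The index check and the transcript CSV described above can be sketched with the standard library alone. This is a minimal illustration, not Ribomala's implementation (which builds the index with pysam.faidx): the helper names needs_index, read_fasta and write_transcript_csv are hypothetical, and the CSV columns are assumed to match the transcript_id, sequence and length columns that the Analysis module expects from the FASTA index CSV.

```python
import csv
from pathlib import Path


def needs_index(fasta_file):
    """Return True when no .fai index file sits next to the FASTA file."""
    return not Path(str(fasta_file) + ".fai").exists()


def read_fasta(fasta_file):
    """Parse a FASTA file into {transcript_id: sequence} (hypothetical helper)."""
    sequences = {}
    current = None
    with open(fasta_file) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the header text up to the first whitespace as the ID.
                current = line[1:].split()[0]
                sequences[current] = []
            elif line and current is not None:
                sequences[current].append(line)
    return {tid: "".join(parts) for tid, parts in sequences.items()}


def write_transcript_csv(fasta_file, csv_file):
    """Write one row per transcript: transcript_id, sequence, length (assumed columns)."""
    with open(csv_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["transcript_id", "sequence", "length"])
        for tid, seq in read_fasta(fasta_file).items():
            writer.writerow([tid, seq, len(seq)])
```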
Source code in ribomala/index.py
run(args)
Run the index mode with the given arguments.
Parameters
args : argparse.Namespace Command-line arguments for the index mode. Should contain a 'fasta' attribute with the path to the FASTA file.
Returns
None
Notes
This function configures logging and executes the indexing process. If indexing fails, the program will exit with code 1.
Source code in ribomala/index.py
QC module
Ribosome profiling data quality control functionality for Ribomala. The module performs the following tasks:
- Processes BAM files listed in a sample list.
- Extracts the transcript ID, 5' position, and read length of each read.
- Calculates reading frames.
- Checks periodicity and the ribosome reading frame distribution.
- Plots the data and saves the plots and CSV data to the output directory.
calculate_frame(df)
Calculate the reading frame as pos modulo 3.
Parameters
df : pl.DataFrame Input DataFrame with a 'pos' column.
Returns
pl.DataFrame A DataFrame with an added 'frame' column (pos % 3).
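The frame arithmetic is simple enough to show on plain Python dictionaries. This sketch only mirrors the pos % 3 rule stated above; the function name add_frame and the list-of-dicts layout are assumptions for illustration, whereas the real function operates on a Polars DataFrame.

```python
def add_frame(rows):
    """Return copies of the rows with a 'frame' key equal to pos modulo 3."""
    return [{**row, "frame": row["pos"] % 3} for row in rows]


reads = [{"pos": 12}, {"pos": 13}, {"pos": 17}]
framed = add_frame(reads)
# Frames for positions 12, 13 and 17 are 0, 1 and 2.
```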
Source code in ribomala/qc.py
check_frame_dist(df, output_dir, sample_name, min_read_length=28, max_read_length=33)
Check ribosome reading frame distribution.
Parameters
df : pl.DataFrame Input DataFrame with 'read_length' and 'frame' columns.
output_dir : Path Path to the output directory.
sample_name : str Name of the sample (used in output filenames).
min_read_length : int, optional Minimum read length to consider, by default 28.
max_read_length : int, optional Maximum read length to consider, by default 33.
Notes
This function groups data by read_length and frame, then generates and exports a Plotly bar plot and CSV data to the output directory.
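The grouping step can be illustrated without Polars or Plotly. The sketch below counts reads per (read_length, frame) pair; the name frame_distribution is hypothetical, and treating both length bounds as inclusive is an assumption, since the documentation does not say how the limits are applied.

```python
from collections import Counter


def frame_distribution(rows, min_read_length=28, max_read_length=33):
    """Count reads per (read_length, frame), keeping lengths within the bounds."""
    counts = Counter(
        (row["read_length"], row["frame"])
        for row in rows
        # Assumption: both bounds are inclusive.
        if min_read_length <= row["read_length"] <= max_read_length
    )
    return dict(sorted(counts.items()))


rows = [
    {"read_length": 28, "frame": 0},
    {"read_length": 28, "frame": 0},
    {"read_length": 28, "frame": 1},
    {"read_length": 30, "frame": 2},
    {"read_length": 25, "frame": 0},  # shorter than the minimum, dropped
    {"read_length": 34, "frame": 1},  # longer than the maximum, dropped
]
distribution = frame_distribution(rows)
```

In well-behaved ribosome profiling data, one frame is expected to dominate for each read length, which is what the bar plot makes visible.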
Source code in ribomala/qc.py
check_periodicity(df, output_dir, sample_name, min_read_length=28, max_read_length=33)
Check periodicity in ribosome profiling data.
Parameters
df : pl.DataFrame Input DataFrame with 'pos', 'read_length', and 'frame' columns.
output_dir : Path Path to the output directory.
sample_name : str Name of the sample (used in output filenames).
min_read_length : int, optional Minimum read length to consider, by default 28.
max_read_length : int, optional Maximum read length to consider, by default 33.
Notes
This function filters reads, groups by position, read_length, and frame, then generates and exports a Plotly bar plot and CSV data to the output directory.
Source code in ribomala/qc.py
extract_bam_info(bam_file)
Extract transcript ID, 5' position, and read length from a BAM file.
Parameters
bam_file : str Path to the BAM file.
Returns
pl.DataFrame A Polars DataFrame with columns: transcript_id, pos, and read_length.
Raises
Exception If the BAM file cannot be processed.
Source code in ribomala/qc.py
read_samples_file(sample_list_file)
Read the sample list file or a single sample and return a list of sample filenames.
Parameters
sample_list_file : Path Path to the sample list file or a single sample name.
Returns
list A list of sample file names.
Notes
If sample_list_file is an existing file, read sample names from it; otherwise, treat the provided value as a single sample name.
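The file-or-name fallback described in the Notes might look like the following. The name read_samples is hypothetical, and the one-sample-per-line layout with blank lines skipped is an assumption about the sample list format.

```python
from pathlib import Path


def read_samples(sample_list_file):
    """Return sample names from a list file, or a one-element list otherwise."""
    path = Path(sample_list_file)
    if path.is_file():
        # Assumption: one sample name per line, blank lines ignored.
        return [
            line.strip()
            for line in path.read_text().splitlines()
            if line.strip()
        ]
    # Not an existing file: treat the value itself as a single sample name.
    return [str(sample_list_file)]
```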
Source code in ribomala/qc.py
run(args)
Run the QC mode with the given arguments.
Parameters
args : argparse.Namespace Parsed command-line arguments with the following attributes:
- input: Path to the input directory containing BAM files
- output: Path to the output directory for results
- samples: Path to the samples list file or a single sample name
- min: Minimum read length to consider
- max: Maximum read length to consider
Returns
None
Notes
This function configures logging and executes the QC process. If QC fails, the program will exit with code 1.
Source code in ribomala/qc.py
Analysis module
Ribosome profiling data analysis functionality for Ribomala. This module processes BAM files listed in a sample sheet and calculates E-, P- and A-site ribosome occupancy.
asite_pos(df, offset_file)
Join the DataFrame with an offset file and compute the shifted position (pos + offset).
Parameters
df : pl.DataFrame Input DataFrame that must include 'read_length', 'frame', 'pos', and 'transcript_id' columns.
offset_file : str Path to a tab-delimited offset file with columns: read_length, frame, offset.
Returns
pl.DataFrame A DataFrame with "transcript_id" and 'a_site_pos' columns.
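The join-and-shift step can be sketched with a dictionary keyed by (read_length, frame). The names load_offsets and shift_to_a_site are hypothetical, the offset values in the example are invented for illustration, and two behaviors are assumptions: that the offset file has a header row, and that reads without a matching offset are dropped (inner-join behavior).

```python
import csv
import io


def load_offsets(handle):
    """Read a tab-delimited table with a read_length, frame, offset header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (int(row["read_length"]), int(row["frame"])): int(row["offset"])
        for row in reader
    }


def shift_to_a_site(reads, offsets):
    """Compute a_site_pos = pos + offset for reads with a known offset."""
    shifted = []
    for read in reads:
        key = (read["read_length"], read["frame"])
        if key not in offsets:
            continue  # assumption: unmatched reads are dropped
        shifted.append(
            {
                "transcript_id": read["transcript_id"],
                "a_site_pos": read["pos"] + offsets[key],
            }
        )
    return shifted


# Illustrative offset table; the values are made up for this example.
offset_table = io.StringIO("read_length\tframe\toffset\n28\t0\t15\n29\t1\t16\n")
offsets = load_offsets(offset_table)
reads = [
    {"transcript_id": "t1", "pos": 30, "read_length": 28, "frame": 0},
    {"transcript_id": "t1", "pos": 31, "read_length": 29, "frame": 1},
    {"transcript_id": "t2", "pos": 12, "read_length": 35, "frame": 0},
]
a_sites = shift_to_a_site(reads, offsets)
```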
Source code in ribomala/analysis.py
calc_enrichment_offset(df, fasta_index, codon, excl_start, excl_end, offset_upstream, offset_downstream)
Calculate median enrichment scores at codon-specific offsets for a given codon.
This function takes a DataFrame of per-codon enrichment scores and a FASTA index file,
and computes the median enrichment score for each codon offset (in codon units)
relative to a reference codon (codon) at the A-site position. It filters out
positions too close to transcript ends according to exclusion windows.
Parameters
df : pl.DataFrame Input DataFrame containing, at minimum, the following columns:
- "transcript_id" (str): Transcript identifier.
- "a_site_pos" (int): Nucleotide position of the A-site codon.
- "a_codon" (str): Codon at the A-site.
- "enrichment_score" (float): Enrichment score for that codon position.
fasta_index : str Path to a CSV file containing a FASTA index with at least:
- "transcript_id" (str): Transcript identifier.
- "sequence" (str): Full transcript nucleotide sequence.
- "length" (int): Length of the transcript sequence.
codon : str The reference codon at the A-site to center the offset analysis around.
excl_start : int Number of nucleotides upstream of the A-site to exclude from analysis.
excl_end : int Number of nucleotides downstream of the A-site to exclude from analysis.
offset_upstream : int Maximum codon offset (in codons) upstream (negative direction) to include.
offset_downstream : int Maximum codon offset (in codons) downstream (positive direction) to include.
Returns
pl.DataFrame
A DataFrame with one row per codon offset, containing:
- "offset" (int): Codon offset relative to the reference A-site codon.
- "med_enrichment_score" (float): Median enrichment score across all transcripts
at the given offset.
- "pos0_codon" (str): The reference codon (same for all rows, equal to codon).
Notes
- Positions are measured in nucleotides; codon offsets are multiplied by 3 to translate to nucleotide offsets.
- Transcripts for which the calculated offset position would fall within excl_start nt of the 5′ end or within excl_end nt of the 3′ end (after trimming) are automatically filtered out.
- The FASTA index CSV is read into memory; ensure it contains valid sequences matching the transcript IDs in df.
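The core of the calculation, taking a median over all occurrences of the reference codon at each codon offset, can be sketched as below. This is a simplified illustration rather than the library's code: the function name median_enrichment_by_offset is hypothetical, the input is a plain list of dictionaries instead of a Polars DataFrame, and the exclusion windows and FASTA index lookup are deliberately left out.

```python
from statistics import median


def median_enrichment_by_offset(scores, codon, offset_upstream, offset_downstream):
    """Median enrichment score at each codon offset around A-site `codon`."""
    # Index every scored position by (transcript, nucleotide position).
    lookup = {
        (row["transcript_id"], row["a_site_pos"]): row["enrichment_score"]
        for row in scores
    }
    # Positions where the reference codon sits at the A-site (offset 0).
    anchors = [
        (row["transcript_id"], row["a_site_pos"])
        for row in scores
        if row["a_codon"] == codon
    ]
    result = []
    for offset in range(-offset_upstream, offset_downstream + 1):
        # Codon offsets are multiplied by 3 to get nucleotide offsets.
        values = [
            lookup[(tid, pos + 3 * offset)]
            for tid, pos in anchors
            if (tid, pos + 3 * offset) in lookup
        ]
        if values:
            result.append(
                {
                    "offset": offset,
                    "med_enrichment_score": median(values),
                    "pos0_codon": codon,
                }
            )
    return result


scores = [
    {"transcript_id": "t1", "a_site_pos": 0, "a_codon": "AAA", "enrichment_score": 1.0},
    {"transcript_id": "t1", "a_site_pos": 3, "a_codon": "GCT", "enrichment_score": 2.0},
    {"transcript_id": "t1", "a_site_pos": 6, "a_codon": "AAA", "enrichment_score": 3.0},
    {"transcript_id": "t1", "a_site_pos": 9, "a_codon": "TTT", "enrichment_score": 4.0},
]
profile = median_enrichment_by_offset(scores, "AAA", 1, 1)
```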
Source code in ribomala/analysis.py
calc_enrichment_score(df)
Calculate enrichment score for each position in the transcripts.
Implements the enrichment score calculation as described in Hussmann J et al, PLOS Genetics, 2015 (supplementary data equations 1 and 2). The enrichment score is calculated as the ratio of the observed read count to the mean read count per position for that transcript.
Parameters
df : pl.DataFrame Input DataFrame containing codon positions and read count information. Must contain at minimum 'transcript_id', 'e_site_pos', 'e_codon', 'p_site_pos', 'p_codon', 'a_site_pos', 'a_codon', 'reads', and 'length' columns.
Returns
pl.DataFrame DataFrame with "transcript_id", "e_site_pos", "e_codon", "p_site_pos", "p_codon", "a_site_pos", "a_codon", "mean_read_count", "reads", "length", and "enrichment_score" columns, where enrichment_score is calculated as reads/mean_read_count.
Notes
The function first calculates the mean read count per position for each transcript by dividing the total read count by the transcript length, then calculates the enrichment score for each position.
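The two-step calculation in the Notes can be written out on plain dictionaries. The name enrichment_scores is hypothetical and the list-of-dicts input stands in for the Polars DataFrame; the arithmetic follows the description above (mean read count = total reads / transcript length, enrichment score = reads / mean read count).

```python
from collections import defaultdict


def enrichment_scores(rows):
    """Add mean_read_count and enrichment_score to each position row."""
    # Step 1: total reads per transcript.
    totals = defaultdict(int)
    for row in rows:
        totals[row["transcript_id"]] += row["reads"]
    # Step 2: per-position score relative to the transcript mean.
    scored = []
    for row in rows:
        mean_read_count = totals[row["transcript_id"]] / row["length"]
        scored.append(
            {
                **row,
                "mean_read_count": mean_read_count,
                "enrichment_score": row["reads"] / mean_read_count,
            }
        )
    return scored


rows = [
    {"transcript_id": "t1", "a_site_pos": 3, "reads": 2, "length": 10},
    {"transcript_id": "t1", "a_site_pos": 6, "reads": 3, "length": 10},
    {"transcript_id": "t1", "a_site_pos": 9, "reads": 5, "length": 10},
]
scored = enrichment_scores(rows)
```

Here the transcript has 10 reads over a length of 10, so the mean read count is 1.0 and each enrichment score equals the read count at that position.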
Source code in ribomala/analysis.py
count_cds_reads(df, fasta_index)
Aggregate reads over the CDS of each transcript.
Parameters
df : pl.DataFrame Input DataFrame containing 'transcript_id', 'pos', and 'reads' columns.
fasta_index : pl.DataFrame or str Path to a CSV file or DataFrame containing transcript sequences with columns 'transcript_id', 'sequence' and 'length'.
Returns
pl.DataFrame DataFrame with 'transcript_id', 'length', 'reads' and 'tpm' columns
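The 'tpm' column is not defined further here. The sketch below uses the standard transcripts-per-million definition (length-normalized read rate, scaled so all transcripts sum to one million); whether Ribomala computes it exactly this way is an assumption, and the function name tpm is hypothetical.

```python
def tpm(counts):
    """Standard TPM: reads/length per transcript, scaled to sum to 1e6."""
    rates = {c["transcript_id"]: c["reads"] / c["length"] for c in counts}
    total = sum(rates.values())
    return {tid: rate / total * 1e6 for tid, rate in rates.items()}


counts = [
    {"transcript_id": "t1", "length": 100, "reads": 10},
    {"transcript_id": "t2", "length": 200, "reads": 60},
]
values = tpm(counts)
# t1 carries a quarter of the length-normalized signal, t2 three quarters.
```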
Source code in ribomala/analysis.py
count_reads_on_pos(df)
Count reads on A-site positions by grouping by transcript_id and a_site_pos.
Parameters
df : pl.DataFrame Input DataFrame that must include 'transcript_id' and 'a_site_pos' columns.
Returns
pl.DataFrame A DataFrame with "transcript_id", 'a_site_pos' and "reads" columns.
Source code in ribomala/analysis.py
filter_and_comp_ep_pos(df, fasta_index, excl_start=60, excl_end=60)
Filter transcripts based on length criteria and calculate ribosome site positions.
This function expects CDS to be extended by 18 nt on both sides. It excludes the specified number of nucleotides from the start and end of transcripts, and keeps only positions that lie within the remaining CDS region. It also calculates E-site and P-site positions from the A-site.
Parameters
df : pl.DataFrame Input DataFrame that must include 'transcript_id', 'a_site_pos', and 'reads' columns.
fasta_index : pl.DataFrame or str Path to a CSV file or DataFrame containing transcript information with columns 'transcript_id', 'length', and other metadata.
excl_start : int, default 60 Number of nucleotides to exclude from the start of the CDS (after the initial 18 nt extension).
excl_end : int, default 60 Number of nucleotides to exclude from the end of the CDS (before the final 18 nt extension).
Returns
pl.DataFrame DataFrame with 'transcript_id', 'length', 'e_site_pos', 'p_site_pos', 'a_site_pos', and 'reads' columns. Only includes transcripts with at least 3 nucleotides remaining after trimming and only positions within the valid range.
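A rough sketch of the filtering and site arithmetic follows. Several details are assumptions not confirmed by this reference: that 'length' includes both 18 nt extensions, that the valid window is half-open, and that the P-site and E-site lie 3 and 6 nt upstream of the A-site (one and two codons, the usual ribosome geometry). The name site_positions is hypothetical.

```python
def site_positions(a_site_pos, length, excl_start=60, excl_end=60, extension=18):
    """Return E/P/A positions, or None if the position or transcript is excluded."""
    # Assumption: `length` covers the CDS plus an 18 nt extension on each side.
    first = extension + excl_start          # first allowed nucleotide
    last = length - extension - excl_end    # one past the last allowed nucleotide
    if last - first < 3:
        return None  # fewer than 3 nt remain after trimming
    if not first <= a_site_pos < last:
        return None  # position lies in an excluded region
    return {
        "e_site_pos": a_site_pos - 6,  # assumption: two codons upstream
        "p_site_pos": a_site_pos - 3,  # assumption: one codon upstream
        "a_site_pos": a_site_pos,
    }
```

With the defaults, a transcript of length 300 keeps A-site positions 78 to 221, while a transcript of length 150 is dropped entirely.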
Source code in ribomala/analysis.py
identify_codons(df, fasta_index)
Identify codons at E, P, and A-sites for each position in the transcripts.
Parameters
df : pl.DataFrame Input DataFrame containing 'transcript_id', 'length', 'e_site_pos', 'p_site_pos', 'a_site_pos', and 'reads' columns.
fasta_index : pl.DataFrame or str Path to a CSV file or DataFrame containing transcript sequences with columns 'transcript_id' and 'sequence'.
Returns
pl.DataFrame DataFrame with 'transcript_id', 'length', 'e_site_pos', 'e_codon', 'p_site_pos', 'p_codon', 'a_site_pos', 'a_codon', and 'reads' columns, where the codon columns contain the 3-nucleotide sequences at each ribosome site.
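Codon lookup amounts to slicing three nucleotides out of the transcript sequence. The sketch assumes 0-based positions that mark the first nucleotide of each codon, which this reference does not state; the name codon_at is hypothetical.

```python
def codon_at(sequence, pos):
    """Return the 3-nt codon starting at 0-based `pos`, or None if out of range."""
    if pos < 0:
        return None
    codon = sequence[pos:pos + 3]
    return codon if len(codon) == 3 else None


sequence = "ATGGCTAAATGA"
sites = {"e_site_pos": 0, "p_site_pos": 3, "a_site_pos": 6}
codons = {name: codon_at(sequence, pos) for name, pos in sites.items()}
```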
Source code in ribomala/analysis.py
parse_sample_sheet(sample_sheet_path)
Parse the sample sheet and extract file names and their corresponding read length, frame, and offset information.
Parameters
sample_sheet_path: Path to the sample sheet CSV file
Returns
A tuple containing:
- List of unique file names
- Dictionary mapping file names to their read length, frame, and offset information as a polars DataFrame
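The return shape can be illustrated with the csv module. The column names file_name, read_length, frame and offset are assumptions, since the sample sheet layout is not specified here, and plain lists of dictionaries stand in for the Polars DataFrames; parse_sheet is a hypothetical name.

```python
import csv
import io


def parse_sheet(handle):
    """Return (unique file names in order, {file name: list of offset rows})."""
    file_names = []
    offsets = {}
    for row in csv.DictReader(handle):
        name = row["file_name"]  # assumed column name
        if name not in offsets:
            file_names.append(name)
            offsets[name] = []
        offsets[name].append(
            {
                "read_length": int(row["read_length"]),
                "frame": int(row["frame"]),
                "offset": int(row["offset"]),
            }
        )
    return file_names, offsets


# Illustrative sample sheet; file names and values are made up.
sheet = io.StringIO(
    "file_name,read_length,frame,offset\n"
    "a.bam,28,0,15\n"
    "a.bam,29,1,16\n"
    "b.bam,28,0,15\n"
)
names, per_file = parse_sheet(sheet)
```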
Source code in ribomala/analysis.py
run(args)
Execute the main analysis pipeline for Ribomala.
Parameters
args : argparse.Namespace Parsed command-line arguments containing:
- input (str): Path to input directory containing BAM files.
- output (str): Path to output directory.
- exclstart (int): Start position for exclusion in enrichment analysis.
- exclend (int): End position for exclusion in enrichment analysis.
- samples (str): Path to sample sheet CSV file.
- txcsv (str): Path to transcriptome FASTA index CSV file.
- upstream (int): Number of upstream positions for enrichment calculation.
- downstream (int): Number of downstream positions for enrichment calculation.
- codon (str): Comma-separated list of codons to analyze.
Notes
This function validates inputs, processes each BAM file for read alignment and enrichment, and writes the results to the output directory. Intermediate steps include read shifting, codon identification, and enrichment score calculations.
Source code in ribomala/analysis.py
validate_input_files(input_dir, file_names)
Validate that all files in the sample sheet exist in the input directory.
Parameters
input_dir: Directory containing input BAM files
file_names: List of file names from the sample sheet
Returns
A tuple containing:
- List of valid file paths
- List of missing files
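The valid/missing split can be sketched with pathlib. The name split_existing is hypothetical, and returning Path objects for valid files and the original names for missing ones is an assumption about the exact return types.

```python
from pathlib import Path


def split_existing(input_dir, file_names):
    """Partition file names into existing paths and missing names."""
    valid, missing = [], []
    for name in file_names:
        path = Path(input_dir) / name
        if path.is_file():
            valid.append(path)
        else:
            missing.append(name)
    return valid, missing
```

A caller could then stop early with a clear error message when the missing list is not empty, before any BAM file is opened.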