riboraptor package¶

Submodules¶

riboraptor.cli module¶

Command line interface for riboraptor

riboraptor.coherence module¶

riboraptor.coherence.coherence(original_values, frames=[0])[source]¶

Calculate coherence and an idea ribo-seq signal

Parameters:	values : array like List of values
Returns:	periodicity : float Periodicity score calculated as coherence between input and idea 1-0-0 signal f: array like List of frequencies Cxy: array like List of coherence at the above frequencies

riboraptor.coherence.coherence_ft(values, nperseg=30, noverlap=15, window='flattop')[source]¶

Calculate coherence and an idea ribo-seq signal based on Fourier transform

Parameters:	values : array like List of values
Returns:	periodicity : float Periodicity score calculated as coherence between input and idea 1-0-0 signal window: str or tuple See scipy.signal.get_window f: array like List of frequencies Cxy: array like List of coherence at the above frequencies

riboraptor.coherence.get_periodicity(values, input_is_stream=False)[source]¶

Calculate periodicty wrt 1-0-0 signal.

Parameters:	values : array like List of values
Returns:	periodicity : float Periodicity calculated as cross correlation between input and idea 1-0-0 signal

riboraptor.coherence.naive_periodicity(values, identify_peak=False)[source]¶

Calculate periodicity in a naive manner

Take ratio of frame1 over avg(frame2+frame3) counts. By default the first value is treated as the first frame as well

Parameters:	values : Series Metagene profile
Returns:	periodicity : float Periodicity

riboraptor.count module¶

Utilities for read counting operations.

class riboraptor.count.OrderedCounter(**kwds)[source]¶

Bases: collections.Counter, collections.OrderedDict

Counter that remembers the order elements are first encountered

riboraptor.count.bam_to_bedgraph(bam, strand='both', end_type='5prime', saveto=None)[source]¶

Create bigwig from bam.

Parameters:	bam : str Path to bam file strand : str, optional Use reads mapping to ‘+/-/both’ strands end_type : str Use only end_type=5prime(5’) or “3prime(3’)” saveto : str, optional Path to write bedgraph
Returns:	genome_cov : str Bedgraph output

riboraptor.count.bedgraph_to_bigwig(bedgraph, sizes, saveto, input_is_stream=False)[source]¶

Convert bedgraph to bigwig.

Parameters:	bedgraph : str Path to bedgraph file sizes : str Path to genome chromosome sizes file or genome name saveto : str Path to write bigwig file input_is_stream : bool True if input is sent through stdin

riboraptor.count.count_uniq_mapping_reads(bam)[source]¶

Count number of mapped reads.

Parameters:	bam: str Path to bam file
Returns:	n_mapped: int Count of uniquely mapped reads

riboraptor.count.export_gene_coverages(bed, bw, saveto, offset_5p=0, offset_3p=0)[source]¶

Export all gene coverages.

Parameters:	bed: str Path to CDS or 5’UTR or 3’UTR bed bw: str Path to bigwig to fetch the scores from saveto: str Path to write output tsv file offset_5p: int (positive) Number of bases to count upstream (5’) offset_3p: int (positive) Number of bases to count downstream (3’)
Returns:	gene_profiles: file with the following format: gene1 5poffset1 3poffset1 cnt1_1 cnt1_2 cnt1_3 … gene2 5poffset2 3poffset2 cnt2_1 cnt2_2 cnt2_3 …

riboraptor.count.export_metagene_coverage(bed, bw, max_positions=None, saveto=None, offset_5p=0, offset_3p=0, orientation='5prime', n_jobs=16)[source]¶

Export metagene coverage.

Parameters:

bed: str: Path to CDS or 5’UTR or 3’UTR bed
bw: str: Path to bigwig to fetch the scores from
max_positions: int: Number of positions to consider while calculating the normalized coverage Higher values lead to slower implementation
saveto: str: Path to write output tsv file
offset_5p: int (positive): Number of bases to count upstream (5’)
offset_3p: int (positive): Number of bases to count downstream (3’)
orientation: string: [‘5prime’, ‘3prime’] indicating the end of read being tracked
n_jobs: int: Number of paralle threads to open Better to do on a multi-cpu machine, but also works decently on a single core machine

Returns:

metagene_coverage: series: Metagene coverage

riboraptor.count.export_read_counts(gene_coverages, saveto, keep_offsets=True)[source]¶

export read counts from gene coverages file.

Parameters:	gene_coverages: string Path to gene coverages.tsv saveto: str Path to save output tsv gene_name count length keep_offsets: bool whether to keep the 5’ and 3’ offsets in gene coverages default is False

riboraptor.count.export_read_length(bam, saveto=None)[source]¶

Count read lengths.

Parameters:	bam: str Path to bam file saveto: str Path to write output tsv file Returns ——- lengths: counter Counter of read length and counts

riboraptor.count.extract_uniq_mapping_reads(inbam, outbam)[source]¶

Extract only uniquely mapping reads from a bam.

Parameters:	inbam: string Path to input bam file outbam: string Path to write unique reads bam to

riboraptor.count.gene_coverage(gene_group, bw, offset_5p=0, offset_3p=0)[source]¶

Get gene coverage.

Parameters:	gene_group: DataFrame gene group from bed file bw: str Path to bigwig to fetch the scores from offset_5p: int (positive) Number of bases to count upstream (5’) offset_3p: int (positive) Number of bases to count downstream (3’)
Returns:	coverage_combined: series Series with index as position and value as coverage gene_offset_5p: Gene wise 5 prime offset This might be different from offset_5p in cases where offset_5p leads to a negative coordinate gene_offset_3p: Gene wise 3 prime offset This might be different from offset_3p in cases where offset_3p leads to position beyond chromsome length

riboraptor.count.get_bam_coverage(bam, bed, outprefix=None)[source]¶

Get coverage from bam

Parameters:	bam: string Path to bam file bed: string Path to genes.bed or cds.bed file required for inferring protocol of bam
Returns:	coverage: dict(dict) dict with keys as chrom:position with query lengths as keys

riboraptor.count.get_bam_coverage_on_bed(bam, bed, protocol='forward', orientation='5prime', max_positions=1000, offset=60, saveto=None)[source]¶: Get bam coverage over start_codon/stop_codon coordinates

riboraptor.count.get_region_sizes(bed)[source]¶

Get collapsed lengths of gene in bed.

Parameters:	bed: str Path to bed file
Returns:	region_sizes: dict Region sizes with gene names as key and value as size of this named region

riboraptor.count.mapping_reads_summary(bam, saveto=None)[source]¶

Count number of mapped reads.

Parameters:	bam: str Path to bam file saveto: str Path to write output tsv
Returns:	counts : counter Counter with keys as number of times read maps and values as number of reads of that type

riboraptor.count.merge_gene_coverages(gene_coverages, max_positions=None, saveto=None)[source]¶

merge gene coverages to generate metagene coverage.

Parameters:	gene_coverages: string Path to gene coverages.tsv max_positions: int Number of positions to consider while calculating the normalized coverage saveto: str Path to save output tsv
Returns:	metagene_coverage : Series Metagene coverage

riboraptor.count.merge_read_counts(read_counts, saveto)[source]¶

merge read counts tsv files to one count table

Parameters:	read_counts: str Path to file which contains paths of all read counts tsv file you want to merge format: sample1 path1 sample2 path2 saveto: str Path to save output tsv

riboraptor.count.read_enrichment(read_lengths, min_length=28, max_length=32)[source]¶

Calculate read enrichment for a certain range of lengths

Parameters:	read_lengths: str Path to read length tsv min_length: int The low end of the range max_length: int The high end of the range
Returns:	ratio: float Enrichment in this range (Scale 0-1)

riboraptor.download module¶

Utilities to download data from NCBI SRA

riboraptor.download.run_download_sra_script(download_root_location=None, ascp_key_path=None, srp_id_file=None, srp_id_list=None)[source]¶

Download data from SRA.

Parameters:	download_root_location : string Path to download SRA files ascp_key_path : string Location for aspera private keypp srp_id_list : list List of SRP ids for download srp_id_file : string File containing list of SRP Ids, one per line

riboraptor.dtw module¶

riboraptor.dtw.dtw(X, Y, metric='euclidean', ddtw=False, ddtw_order=1)[source]¶

Parameters:

X : array_like: M x D matrix
Y : array_like: N x D matrix
metric : string: The distance metric to use. Can be : ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’. See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
ddtw : bool: Should use derivative DTW where the distance matrix is created using the derivate values at each point rather than the point themselves
ddtw_order : int [1,2]: First order uses one difference method Second order uses np.gradient for an approximation upto second order
Returns
——-
total_cost : float: Total (minimum) cost of warping
pointwise_cost : array_like: M x N matrix with cost at each (i, j)
accumulated_cost : array_like: M x N matrix with (minimum) cost accumulated till (i,j) having started from (0, 0)

riboraptor.dtw.get_path(D)[source]¶

Traceback path of minimum cost

Given accumulated cost matrix D, trace back the minimum cost path

Parameters:	D : array_like M x N matrix as obtained from accumulated_cost using: total_cost, pointwise_cost, accumulated_cost = dtw(X, Y, metric=’euclidean’)
Returns:	traceback_x, traceback_x : array_like M x 1 and N x 1 array containing indices of movement starting from (0, 0) going to (M-1, N-1)

riboraptor.dtw.plot_warped_timeseries(x, y, pointwise_cost, accumulated_cost, path, colormap=<Mock name='mock.pyplot.cm.Blues' id='140237965673304'>, linecolor='#D55E00')[source]¶

riboraptor.fasta module¶

class riboraptor.fasta.FastaReader(fasta_location)[source]¶

Bases: object

Class for reading and querying fasta file.

chromosomes¶

Return list of chromsome and their sizes as in the fasta file.

Returns:	chroms : dict Dictionary with {“chr”: “Length”} format .. currentmodule:: .FastaReader .. autosummary:: .FastaReader

complement(seq)[source]¶

query(intervals)[source]¶

Query regions for sequence.

Parameters:	intervals: list of Interval The intervals for fasta is one-based and full-closed
Returns:	sequences: list(str) An array containing scores for each Interval This function is agnostic of the strand information, the position in the scores is corresponding to the interval .. currentmodule:: .FastaReader .. autosummary:: .FastaReader

reverse_complement(seq)[source]¶

riboraptor.genome module¶

riboraptor.hdf_parser module¶

class riboraptor.hdf_parser.HDFParser(filepath)[source]¶

Bases: object

close()[source]¶

get_coverage(region=(None, None, None, None), fragment_length=None, orientation='5prime', outprefix=None)[source]¶

Get coverage for a selected region.

Since the region is user provided, we assume the user is aware if it is strand specfic or not

get_query_alignment_length_dist()[source]¶

get_read_length_dist()[source]¶

riboraptor.hdf_parser.create_metagene_from_multi_bigwig(bed, bigwigs, max_positions=1000, offset_5p=0, offset_3p=0, n_jobs=16, saveto=None)[source]¶

Collapse multiple bigwigs to get bigwig.

Test case -> we should be able to get the same coverage file as from export_metagene coverage if everything above this works okay.

riboraptor.hdf_parser.hdf_to_bigwig(hdf, prefixdir, read_lengths_to_use='all', output_normalized=True)[source]¶

Create fragment and strand specific bigwigs from hdf

Parameters:	hdf: string Path to hdf file prefix: string Prefix to store output files read_lengths_to_use: list Create bigwig only from these fragment lengths TODO: one fix for this would be to create separate nonoverlapping bigwigs and then merge them with bigWigCat (one more dependency)

riboraptor.hdf_parser.merge_bigwigs(bigwigs, chrom_sizes, saveto, scale=False)[source]¶

Merge multiple bigwigs into one.

Note: This seems impossible doing it through pybigWig way and hence the dependency on ucsc-bigWigMerge

Parameters:	bigwigs: list List of path to bigwigs chrom_size: string Path to chromsome.sizes file saveto: string Path for outputting bigwig

riboraptor.hdf_parser.normalize_bw_hdf(bw, hdf, read_length, outbw)[source]¶

Normalize a fragment specific bigwig to RPM for that fragment length

Parameters:	bw: string Path to bigwig file hdf: string Path to hdf file (used for fetching total counts for that read type) read_length: int Fragment length outbw: string Path to output bigwig

riboraptor.hdf_parser.tsv_to_bigwig(df, chrom_lengths, prefix)[source]¶

Convert tsv (created by bam-coverage) to bigwig

We create multiple bigwigs, each separately for fragment length, and the strand.

This will output the following files, where N represents the fragment length

N.5prime.pos.bw N.3prime.pos.bw

N.5prime.neg.bw N.3prime.neg.bw

Parameters:	df: string or pd.DataFrame Path to bam-coverage outputted tsvi chrom_lengths: list of tuples A list of tuples with chromsome name and size [(‘chr1’, 10000), [‘chr2’, 5000]] prefix: string Path to store the final tsv

riboraptor.helpers module¶

All functions that are not so useful, but still useful.

riboraptor.helpers.bwshift(bw, shift_by, out_bw, chunk_size=20000)[source]¶

Given a bigwig shift all the values by this Shifting by 10: variableStep chrom=chr span=1 1 1 2 2 3 5 4 6 5 5 6 3 7 3 8 5 9 5 10 5 11 6 12 6 13 0 14 2 15 3 16 3 17 10 18 4 19 4 20 2 21 2 22 2 23 1

shifted by 10 variableStep chrom=chr span=1 1 6 2 6 3 0 4 2 5 3 6 3 7 10 8 4 9 4 10 2 11 2 12 2 13 1

riboraptor.helpers.bwsum(bw, chunk_size=5000, scale_to=1000000.0)[source]¶

riboraptor.helpers.check_file_exists(filepath)[source]¶

Check if file exists.

Parameters:	filepath : str Path to file

riboraptor.helpers.codon_to_anticodon(codon)[source]¶

Codon to anticodon.

Parameters:	codon : string Input codon

riboraptor.helpers.complementary_strand(strand)[source]¶

Get complementary strand

Parameters:	strand: string +/-
Returns:	rs: string -/+

riboraptor.helpers.counts_to_tpm(counts, sizes)[source]¶

Counts to TPM

Parameters:	counts: array like Series/array of counts sizes: array like Series/array of region sizes

riboraptor.helpers.create_bam_index(bam)[source]¶

Create bam index.

Parameters:	bam : str Path to bam file

riboraptor.helpers.create_ideal_periodic_signal(signal_length)[source]¶

Create ideal ribo-seq signal.

Parameters:	signal_length : int Length of signal to create
Returns:	signal : array_like 1-0-0 signal

riboraptor.helpers.featurecounts_to_tpm(fc_f, outfile)[source]¶

Convert htseq-counts file to tpm

Parameters:	fc_f: string Path to htseq-count output outfile: string Path to output file with tpm values

riboraptor.helpers.find_first_non_none(positions)[source]¶

Given a list of positions, find the index and value of first non-none element.

This method is specifically designed for pysam, which has a weird way of returning the reference positions. If they are mismatched/softmasked it returns None when fetched using get_reference_positions.

query_alignment_start and query_alignment_end give you indexes of position in the read which technically align, but are not softmasked i.e. it is set to None even if the position does not align

Parameters:	positions: list of int Positions as returned by pysam.fetch.get_reference_positions

riboraptor.helpers.find_last_non_none(positions)[source]¶

Given a list of positions, find the index and value of last non-none element.

This function is similar to the find_first_non_none function, but does it for the reversed list. It is specifically useful for reverse strand cases

Parameters:	positions: list of int Positions as returned by pysam.fetch.get_reference_positions

riboraptor.helpers.get_region_sizes(region_bed)[source]¶

Get summed up size of a CDS/UTR region from bed file

Parameters:	region_bed: string Input bed file
Returns:	region_sizes: pd.Series Series with region name as index and size as key

riboraptor.helpers.get_strandedness(filepath)[source]¶

Parse output of infer_experiment.py from RSeqC to get strandedness.

Parameters:	filepath : str Path to infer_experiment.py output
Returns:	strandedness : str reverse or forward or none

riboraptor.helpers.htseq_to_tpm(htseq_f, outfile, cds_bed_f)[source]¶

Convert htseq-counts file to tpm

Parameters:	htseq_f: string Path to htseq-count output outfile: string Path to output file with tpm values cds_bed_f: string Path to CDS/genesize bed file

riboraptor.helpers.identify_peaks(coverage)[source]¶: Given coverage array, find the site of maximum density

riboraptor.helpers.is_read_uniq_mapping(read)[source]¶

Check if read is uniquely mappable.

Parameters:	read : pysam.Alignment.fetch object Most reliable: [‘NH’] tag

riboraptor.helpers.list_to_ranges(list_of_int)[source]¶

Convert a list to a list of range object

Parameters:	list_of_int: list List of integers to be squeezed into range
Returns:	list_of_range: list List of range objects

riboraptor.helpers.load_pickle(filepath)[source]¶: Read pickled files easy in Python 2/3

riboraptor.helpers.merge_intervals(intervals, chromosome_lengths=None, offset_5p=0, offset_3p=0, zero_based=True)[source]¶

Collapse intervals into non overlapping manner

Parameters:

intervals : list of Interval
chromosome_lengths : dict: A map of each chromosome’e length Only used with offset_3p, offset_5p>0
offset_5p : int (positive): Number of bases to count upstream (5’)
offset_3p : int (positive): Number of bases to count downstream (3’)
zero_based: bool: Indicate if the intervals are zero-based True means zero-based half open False means one-based full closed

Returns:

interval_combined : list of Interval sorted by the start: A merged version of intervals This is useful when the annotations are overlapping. Example: chr1 310 320 gene1 + chr1 319 324 gene1 + Returns: chr1 310 324 gene1 +
gene_offset_5p: Gene wise 5 prime offset: This might be different from offset_5p in cases where offset_5p leads to a negative coordinate
gene_offset_3p: Gene wise 3 prime offset: This might be different from offset_3p in cases where offset_3p leads to position beyond chromsome length

riboraptor.helpers.millify(n)[source]¶

Convert integer to human readable format.

Parameters:	n : int
Returns:	millidx : str Formatted integer

riboraptor.helpers.mkdir_p(path)[source]¶

Python version mkdir -p

Parameters:	path : str

riboraptor.helpers.order_dataframe(df, columns)[source]¶

Order a dataframe

Order a dataframe by moving the columns in the front

Parameters:	df: Dataframe Dataframe columns: list List of columns that need to be put in front

riboraptor.helpers.pad_five_prime_or_truncate(some_list, offset_5p, target_len)[source]¶

Pad first the 5prime end and then the 3prime end or truncate

Parameters:	some_list : list Input list offset_5p : int 5’ offset target_length : int Final length of list If being extended, returns list padded with NAs.

riboraptor.helpers.pad_or_truncate(some_list, target_len)[source]¶

Pad or truncate a list upto given target length

Parameters:	some_list : list Input list target_length : int Final length of list If being extended, returns list padded with NAs.

riboraptor.helpers.parse_star_logs(infile, outfile=None)[source]¶

Parse star logs into a dict

Parameters:	infile : str Path to starlogs.final.out file
Returns:	star_info : dict Dict with necessary records parsed

riboraptor.helpers.path_leaf(path)[source]¶: Get path’s tail from a filepath

riboraptor.helpers.r2(x, y)[source]¶

Calculate pearson correlation between two vectors.

Parameters:	x : array_like Input y : array_like Input

riboraptor.helpers.read_bed_as_intervaltree(filepath)[source]¶

Read bed as interval tree

Useful for reading start/stop codon beds

Parameters:	filepath: string Location to bed
Returns:	bedint_tree: dict dict with keys as gene name and strand as intervaltree

riboraptor.helpers.read_chrom_sizes(filepath)[source]¶

Read chr.sizes file sorted by chromosome name

Parameters:	filepath: string Location to chr.sizes
Returns:	chrom_lengths: list of tuple A list of tuples with chromsome name and their size

riboraptor.helpers.read_enrichment(read_lengths, enrichment_range=range(28, 33), input_is_stream=False, input_is_file=True)[source]¶

Calculate read enrichment for a certain range of lengths

Parameters:	read_lengths: Counter A counter with read lengths and their counts enrichment_range: range or str Range of reads to concentrate upon (28-32 or range(28,33)) input_is_stream: bool True if input is sent through stdin
Returns:	ratio: float Enrichment in this range (Scale 0-1)

riboraptor.helpers.read_htseq(htseq_f)[source]¶

Read HTSeq file.

Parameters:	htseq_f: str Path to htseq counts file
Returns:	htseq_df: dataframe HTseq counts as in a dataframe

riboraptor.helpers.read_refseq_bed(filepath)[source]¶

Read refseq bed12 from UCSC.

Parameters:	filepath: string Location to bed12
Returns:	refseq: dict dict with keys as gene name and values as intervaltree

riboraptor.helpers.round_to_nearest(x, base=5)[source]¶

Round to nearest base.

Parameters:	x : float Input
Returns:	v : int Output

riboraptor.helpers.scale_bigwig(inbigwig, chrom_sizes, outbigwig, scale_factor=1)[source]¶

Scale a bigwig by certain factor.

Parameters:	inbigwig: string Path to input bigwig chrom_sizes: string Path to chrom.sizes file outbigwig: string Path to output bigwig scale_factor: float Scale by value

riboraptor.helpers.set_xrotation(ax, degrees)[source]¶

Rotate labels on x-axis.

Parameters:	ax : matplotlib.Axes Axes object degrees : int Rotation degrees

riboraptor.helpers.summarize_counters(samplewise_dict)[source]¶

Summarize gene counts for a collection of samples.

Parameters:	samplewise_dict : dict A dictionary with key as sample name and value as another dictionary of counts for each gene
Returns:	totals : dict A dictionary with key as sample name and value as total gene count

riboraptor.helpers.summary_stats_two_arrays_welch(old_mean_array, new_array, old_var_array=None, old_n_counter=None, carried_forward_observations=None)[source]¶

Average two arrays using welch’s method

Parameters:

old_mean_array : Series: Series of previous means with index as positions
old_var_array : Series: Series of previous variances with index as positions
new_array : array like: Series of new observations (Does noes Ciunts of number of positions at a certain index

Returns:

m : array like: Column wise Mean array
var : array like: Column wise variance
Consider an example: [1,2,3], [1,2,3,4], [1,2,3,4,5]
old = [1,2,3]
new = [1,2,3,4]
counter = [1,1,1]
mean = [1,2,3,4] Var =[na, na, na, na], carried_fowrad = [[1,1], [2,2], [3,3], [4]]
old = [1,2,3,4]
new = [1,2,3,4,5]
couter = [2,2,2,1]
mean = [1,2,3,4,5]
var = [0,0,0, na, na]
carried_forward = [[], [], [], [4,4], [5]]

riboraptor.helpers.symlink_force(source, destination)[source]¶

Create forcelink forcefully

Parameters:	source: string Location to source file destination: string Location to target

riboraptor.helpers.yield_intervals(chrom_size, chunk_size=20000)[source]¶

riboraptor.infer_protocol module¶

riboraptor.infer_protocol.infer_protocol(bam, bed, n_reads=500000, drop_probability=0.2)[source]¶

Infer strandedness protocol given a bam file

Parameters:	bam: string Path to bam file bed: string Path to gene bed file n_reads: int Number of reads to use (downsampled) drop_probability: float [0-1] Reads are randomly droppped with this probability to simulate sampling
Returns:	protocol: string unstranded/forward/reverse forward_mapped_reads: float Proportion of reads of type + mapping to + (++) or - mapping to - (–) reverse_mapped_reads: float Proportion of reads of type + mapping to - (+-) or - mapping to + (-+)

riboraptor.interval module¶

class riboraptor.interval.Interval(chrom=None, start=0, end=0, strand=None)[source]¶

Bases: object

Class for bed interval

riboraptor.kmer module¶

riboraptor.kmer.fastq_kmer_histogram(fastq_file, kmer_length=range(5, 31), five_prime=False, max_seq=1000000, offset=0, drop_probability=0)[source]¶

Get a histogram of kmers from a fastq file

Parameters:

fastq_file: string: Location of .fastq or .fastq.gz
kmer_length: range: Range of kmer to consider
five_prime: bool: Should consider sequences from 5’ end? Default: False (uses sequence from 3’ end)
max_seq: int: Maximum number of sequences to consider
offset: int: Offset to ignore at the 5’ end or 3’end Example: If the offset is 3, the first 3 bases will be skipped and kmers will start only from the 4th position For 3’end if the offset is 3, the last 3 bases will be skipped
drop_probability: float [0-1]: Reads are randomly droppped with this probability to simulate sampling
Returns
——-
kmers: Series: Sorted series with most frequent kmer

riboraptor.normalization module¶

riboraptor.normalization.deseq2_normalization(list_of_profiles)[source]¶

Perform DESeq2 like normalization position specific scores

Parameters:	list_of_profiles: array-like array of profiles across samples for one gene
Returns:	normalized_profiles: array-like array of profiles across samples

riboraptor.parallel module¶

riboraptor.parallel.ParallelExecutor(use_bar='tqdm', **joblib_args)[source]¶

riboraptor.parallel.func(x)[source]¶

riboraptor.parallel.text_progessbar(seq, total=None)[source]¶

riboraptor.plotting module¶

Plotting methods.

riboraptor.plotting.create_wavelet(data, ax)[source]¶

riboraptor.plotting.plot_featurewise_barplot(utr5_counts, cds_counts, utr3_counts, ax=None, saveto=None, **kwargs)[source]¶

Plot barplots for 5’UTR/CDS/3’UTR counts.

Parameters:

utr5_counts : int or dict: Total number of reads in 5’UTR region or alternatively a dictionary/series with genes as key and 5’UTR counts as values
cds_counts : int or dict: Total number of reads in CDs region or alternatively a dictionary/series with genes as key and CDS counts as values
utr3_counts : int or dict: Total number of reads in 3’UTR region or alternatively a dictionary/series with genes as key and 3’UTR counts as values
saveto : str: Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_framewise_counts(counts, frames_to_plot='all', ax=None, title=None, millify_labels=False, position_range=None, saveto=None, ascii=False, input_is_stream=False, **kwargs)[source]¶

Plot framewise distribution of reads.

Parameters:	counts : Series A series with position as index and value as counts frames_to_plot : str or range A comma seaprated list of frames to highlight or a range ax : matplotlib.Axes Default none saveto : str Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_periodicity_df(df, saveto, cbar=False, figsize=(8, 8))[source]¶

Plot periodicty values across fragment lengths as a matrix.

Parameters:	df: string Path to dataframe containing fragment length specific periodicities saveto: string Path to output plot file cbar: bool Whether to plot cbar or not

riboraptor.plotting.plot_read_counts(counts, ax=None, marker=None, color='royalblue', title=None, label=None, millify_labels=False, identify_peak=True, saveto=None, position_range=None, ascii=False, input_is_stream=False, ylabel='Normalized RPF density', **kwargs)[source]¶

Plot RPF density aro und start/stop codons.

Parameters:

counts : Series/Counter: A series with coordinates as index and counts as values
ax : matplotlib.Axes: Axis to create object on
marker : string: ‘o’/’x’
color : string: Line color
label : string: Label (useful only if plotting multiple objects on same axes)
millify_labels : bool: True if labels should be formatted to read millions/trillions etc
saveto : str: Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_read_length_dist(read_lengths, ax=None, millify_labels=True, input_is_stream=False, title=None, saveto=None, ascii=False, **kwargs)[source]¶

Plot read length distribution.

Parameters:	read_lengths : array_like Array of read lengths ax : matplotlib.Axes Axis object millify_labels : bool True if labels should be formatted to read millions/trillions etc input_is_stream : bool True if input is sent through stdin saveto : str Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.setup_axis(ax, axis='x', majorticks=5, minorticks=1, xrotation=45, yrotation=0)[source]¶

Setup axes defaults

Parameters:	ax : matplotlib.Axes axis : str Setup ‘x’ or ‘y’ axis majorticks : int Length of interval between two major ticks minorticks : int Length of interval between two major ticks xrotation : int Rotate x axis labels by xrotation degrees yrotation : int Rotate x axis labels by xrotation degrees

riboraptor.plotting.setup_plot()[source]¶: Setup plotting defaults

riboraptor.sequence module¶

Utilities for extracting sequence from fasta.

riboraptor.sequence.export_gene_sequences(bed, fasta, saveto=None, offset_5p=0, offset_3p=0)[source]¶

Export all gene sequences.

Parameters:	bed: str Path to CDS or 5’UTR or 3’UTR bed fasta: str Path to fasta to fetch the sequences from saveto: str Path to write output tsv file offset_5p: int (positive) Number of bases to count upstream (5’) offset_3p: int (positive) Number of bases to count downstream (3’)
Returns:	gene_profiles: file with the following format: gene1 5poffset1 3poffset1 sequence gene2 5poffset2 3poffset2 sequence

riboraptor.sequence.gene_sequence(gene_group, fasta, offset_5p=0, offset_3p=0)[source]¶

Extract seq genewise given coordinates in bed file

Parameters:	gene_group: DataFrame gene group from bed file fasta str Path to fasta file offset_5p: int (positive) Number of bases to count upstream (5’) offset_3p: int (positive) Number of bases to count downstream (3’)
Returns:	sequence: str sequence of the gene gene_offset_5p: Gene wise 5 prime offset This might be different from offset_5p in cases where offset_5p leads to a negative coordinate gene_offset_3p: Gene wise 3 prime offset This might be different from offset_3p in cases where offset_3p leads to position beyond chromsome length

riboraptor.sradb module¶

Helper functions for parsing SRAmetadb.sqlite file

class riboraptor.sradb.SRAdb(sqlite_file)[source]¶

Bases: object

close()[source]¶

convert_gse_to_srp(gse)[source]¶: Convert GSE to SRP id. Requires input db to be GEOmetadb.sqlite

desc_table(table)[source]¶

get_query(query)[source]¶

get_row_count(table)[source]¶: Get row counts for a table

get_table_counts()[source]¶

list_fields(table)[source]¶: List all fields in a given table

list_tables()[source]¶: List all tables in the sqlite

open()[source]¶

search_experiment(srx)[source]¶: Search for a SRX/GSM id in the experiments

sra_convert(acc, out_type=['study_accession', 'experiment_accession', 'experiment_title', 'run_accession', 'taxon_id', 'library_selection', 'library_layout', 'library_strategy', 'library_source', 'library_name', 'bases', 'spots', 'adapter_spec'])[source]¶

riboraptor.statistics module¶

riboraptor.statistics.KDE(values)[source]¶

Perform Univariate Kernel Density Estimation.

Wrapper utility around statsmodels for quick KDE TODO: scikit-learn has a faster implementation (?)

Parameters:	values : array like
Returns:	support : array_like cdf : array_like

riboraptor.statistics.KS_test(a, b)[source]¶

Perform KS test between a and b values

Parameters:	a, b : array-like Input
Returns:	D : int KS D statistic effect_size : float maximum difference at point of D-statistic cdf_a, cdf_b : float CDF of a, b Note: By default this method does testing for alternative=lesser implying that the test will reject H0 when the CDf of b is ‘above’ a

riboraptor.statistics.calculate_cdf(data)[source]¶

Calculate CDF given data points

Parameters:	data : array-like Input values
Returns:	cdf : series Cumulative distribution funvtion calculated at indexed points

riboraptor.statistics.coherence_pvalue(x, N)[source]¶

Calculate p-value for coherence score

Parameters:	x: float [0-1] Coherence value N: int Total number of windows
Returns:	pval: float p-value

riboraptor.statistics.series_cdf(series)[source]¶

Calculate cdf of series preserving the index

Parameters:	series : series like
Returns:	cdf : series

riboraptor.tracks module¶

riboraptor.tracks.create_trackdb(bwdir, srp, orientation='5prime')[source]¶: Create track file

riboraptor.tracks.get_bigwigtrack_text(track_name, parent, big_data_url, negate_values)[source]¶: Create bigwig track text

riboraptor.tracks.get_compositetrack_text(track_name, parent)[source]¶: Create composite track text

riboraptor.tracks.get_multiwigtrack_text(track_name, parent)[source]¶

Create a multiWig track.

Example track myMultiWig container multiWig aggregate transparentOverlay showSubtrackColorOnUi on type bigWig 0 1000 viewLimits 0:10 maxHeighPixels 100:32:8

track myFirstOverlaySig parent myMultiWig color 255,128,128 type bigWig 0 1139

track myFirstBigWig parent myMultiWig color 120,235,204

riboraptor.tracks.get_supertrack_text(track_name)[source]¶

riboraptor.utils module¶

riboraptor.utils.copy_sra_data(df, sra_source_dir='/staging/as/skchoudh/SRA_datasets/', sra_dest_dir='/staging/as/skchoudh/re-ribo-datasets/')[source]¶: Copy SRA data to a new location retaining only single ended samples.

riboraptor.utils.create_config_file(df)[source]¶

riboraptor.utils.determine_cell_type(sample_attribute)[source]¶

riboraptor.utils.filter_single_end_samples(df)[source]¶

Filter single end samples from a dataframe

Parameters:	df: DataFrame Dataframe as obtained from SRAb.sra_convert()
Returns:	df: DataFrame DataFrame with only single end samples

riboraptor.utils.get_cell_line_or_tissue(row)[source]¶

riboraptor.utils.get_enrichment_cds_stats(pickle_file)[source]¶

riboraptor.utils.get_fragment_enrichment_score(txt_file)[source]¶

riboraptor.utils.get_strain_type(sample_attribute)[source]¶

riboraptor.utils.get_tissue_type(sample_attribute)[source]¶

riboraptor.utils.load_tpm(path)[source]¶

riboraptor.utils.summary_starlogs_over_runs(directory, list_of_srr)[source]¶

riboraptor.utils.write_config(species, srp)[source]¶

riboraptor.wig module¶

class riboraptor.wig.WigReader(wig_location)[source]¶

Bases: object

Class for reading and querying wigfiles.

chromosomes¶

Return list of chromsome and their sizes as in the wig file.

Returns:	chroms : dict Dictionary with {“chr”: “Length”} format .. currentmodule:: .WigReader .. autosummary:: .WigReader

close()[source]¶

query(intervals)[source]¶

Query regions for scores.

Parameters:	intervals : list of Interval
Returns:	scores : array of array A numpy array containing scores for each Interval This function is agnostic of the strand information, the position in the scores is corresponding to the interval .. currentmodule:: .WigReader .. autosummary:: .WigReader

Contents

Previous topic

Next topic

This Page

riboraptor package¶

Submodules¶

riboraptor.cli module¶

riboraptor.coherence module¶

riboraptor.count module¶

riboraptor.download module¶

riboraptor.dtw module¶

riboraptor.fasta module¶

riboraptor.genome module¶

riboraptor.hdf_parser module¶

riboraptor.helpers module¶

riboraptor.infer_protocol module¶

riboraptor.interval module¶

riboraptor.kmer module¶

riboraptor.normalization module¶

riboraptor.parallel module¶

riboraptor.plotting module¶

riboraptor.sequence module¶

riboraptor.sradb module¶

riboraptor.statistics module¶

riboraptor.tracks module¶

riboraptor.utils module¶

riboraptor.wig module¶

Module contents¶