riboraptor package

Submodules

riboraptor.cli module

Command line interface for riboraptor

riboraptor.coherence module

riboraptor.coherence.coherence(original_values, frames=[0])[source]

Calculate coherence and an idea ribo-seq signal

Parameters:
values : array like

List of values

Returns:
periodicity : float

Periodicity score calculated as coherence between input and idea 1-0-0 signal

f: array like

List of frequencies

Cxy: array like

List of coherence at the above frequencies

riboraptor.coherence.coherence_ft(values, nperseg=30, noverlap=15, window='flattop')[source]

Calculate coherence and an idea ribo-seq signal based on Fourier transform

Parameters:
values : array like

List of values

Returns:
periodicity : float

Periodicity score calculated as coherence between input and idea 1-0-0 signal

window: str or tuple

See scipy.signal.get_window

f: array like

List of frequencies

Cxy: array like

List of coherence at the above frequencies

riboraptor.coherence.get_periodicity(values, input_is_stream=False)[source]

Calculate periodicty wrt 1-0-0 signal.

Parameters:
values : array like

List of values

Returns:
periodicity : float

Periodicity calculated as cross correlation between input and idea 1-0-0 signal

riboraptor.coherence.naive_periodicity(values, identify_peak=False)[source]

Calculate periodicity in a naive manner

Take ratio of frame1 over avg(frame2+frame3) counts. By default the first value is treated as the first frame as well

Parameters:
values : Series

Metagene profile

Returns:
periodicity : float

Periodicity

riboraptor.count module

Utilities for read counting operations.

class riboraptor.count.OrderedCounter(**kwds)[source]

Bases: collections.Counter, collections.OrderedDict

Counter that remembers the order elements are first encountered

riboraptor.count.bam_to_bedgraph(bam, strand='both', end_type='5prime', saveto=None)[source]

Create bigwig from bam.

Parameters:
bam : str

Path to bam file

strand : str, optional

Use reads mapping to ‘+/-/both’ strands

end_type : str

Use only end_type=5prime(5’) or “3prime(3’)”

saveto : str, optional

Path to write bedgraph

Returns:
genome_cov : str

Bedgraph output

riboraptor.count.bedgraph_to_bigwig(bedgraph, sizes, saveto, input_is_stream=False)[source]

Convert bedgraph to bigwig.

Parameters:
bedgraph : str

Path to bedgraph file

sizes : str

Path to genome chromosome sizes file or genome name

saveto : str

Path to write bigwig file

input_is_stream : bool

True if input is sent through stdin

riboraptor.count.count_uniq_mapping_reads(bam)[source]

Count number of mapped reads.

Parameters:
bam: str

Path to bam file

Returns:
n_mapped: int

Count of uniquely mapped reads

riboraptor.count.export_gene_coverages(bed, bw, saveto, offset_5p=0, offset_3p=0)[source]

Export all gene coverages.

Parameters:
bed: str

Path to CDS or 5’UTR or 3’UTR bed

bw: str

Path to bigwig to fetch the scores from

saveto: str

Path to write output tsv file

offset_5p: int (positive)

Number of bases to count upstream (5’)

offset_3p: int (positive)

Number of bases to count downstream (3’)

Returns:
gene_profiles: file

with the following format: gene1 5poffset1 3poffset1 cnt1_1 cnt1_2 cnt1_3 …

gene2 5poffset2 3poffset2 cnt2_1 cnt2_2 cnt2_3 …

riboraptor.count.export_metagene_coverage(bed, bw, max_positions=None, saveto=None, offset_5p=0, offset_3p=0, orientation='5prime', n_jobs=16)[source]

Export metagene coverage.

Parameters:
bed: str

Path to CDS or 5’UTR or 3’UTR bed

bw: str

Path to bigwig to fetch the scores from

max_positions: int

Number of positions to consider while calculating the normalized coverage Higher values lead to slower implementation

saveto: str

Path to write output tsv file

offset_5p: int (positive)

Number of bases to count upstream (5’)

offset_3p: int (positive)

Number of bases to count downstream (3’)

orientation: string

[‘5prime’, ‘3prime’] indicating the end of read being tracked

n_jobs: int

Number of paralle threads to open Better to do on a multi-cpu machine, but also works decently on a single core machine

Returns:
metagene_coverage: series

Metagene coverage

riboraptor.count.export_read_counts(gene_coverages, saveto, keep_offsets=True)[source]

export read counts from gene coverages file.

Parameters:
gene_coverages: string

Path to gene coverages.tsv

saveto: str

Path to save output tsv gene_name count length

keep_offsets: bool

whether to keep the 5’ and 3’ offsets in gene coverages default is False

riboraptor.count.export_read_length(bam, saveto=None)[source]

Count read lengths.

Parameters:
bam: str

Path to bam file

saveto: str

Path to write output tsv file

Returns
——-
lengths: counter

Counter of read length and counts

riboraptor.count.extract_uniq_mapping_reads(inbam, outbam)[source]

Extract only uniquely mapping reads from a bam.

Parameters:
inbam: string

Path to input bam file

outbam: string

Path to write unique reads bam to

riboraptor.count.gene_coverage(gene_group, bw, offset_5p=0, offset_3p=0)[source]

Get gene coverage.

Parameters:
gene_group: DataFrame

gene group from bed file

bw: str

Path to bigwig to fetch the scores from

offset_5p: int (positive)

Number of bases to count upstream (5’)

offset_3p: int (positive)

Number of bases to count downstream (3’)

Returns:
coverage_combined: series

Series with index as position and value as coverage

gene_offset_5p: Gene wise 5 prime offset

This might be different from offset_5p in cases where offset_5p leads to a negative coordinate

gene_offset_3p: Gene wise 3 prime offset

This might be different from offset_3p in cases where offset_3p leads to position beyond chromsome length

riboraptor.count.get_bam_coverage(bam, bed, outprefix=None)[source]

Get coverage from bam

Parameters:
bam: string

Path to bam file

bed: string

Path to genes.bed or cds.bed file required for inferring protocol of bam

Returns:
coverage: dict(dict)

dict with keys as chrom:position with query lengths as keys

riboraptor.count.get_bam_coverage_on_bed(bam, bed, protocol='forward', orientation='5prime', max_positions=1000, offset=60, saveto=None)[source]

Get bam coverage over start_codon/stop_codon coordinates

riboraptor.count.get_region_sizes(bed)[source]

Get collapsed lengths of gene in bed.

Parameters:
bed: str

Path to bed file

Returns:
region_sizes: dict

Region sizes with gene names as key and value as size of this named region

riboraptor.count.mapping_reads_summary(bam, saveto=None)[source]

Count number of mapped reads.

Parameters:
bam: str

Path to bam file

saveto: str

Path to write output tsv

Returns:
counts : counter

Counter with keys as number of times read maps and values as number of reads of that type

riboraptor.count.merge_gene_coverages(gene_coverages, max_positions=None, saveto=None)[source]

merge gene coverages to generate metagene coverage.

Parameters:
gene_coverages: string

Path to gene coverages.tsv

max_positions: int

Number of positions to consider while calculating the normalized coverage

saveto: str

Path to save output tsv

Returns:
metagene_coverage : Series

Metagene coverage

riboraptor.count.merge_read_counts(read_counts, saveto)[source]

merge read counts tsv files to one count table

Parameters:
read_counts: str

Path to file which contains paths of all read counts tsv file you want to merge format: sample1 path1 sample2 path2

saveto: str

Path to save output tsv

riboraptor.count.read_enrichment(read_lengths, min_length=28, max_length=32)[source]

Calculate read enrichment for a certain range of lengths

Parameters:
read_lengths: str

Path to read length tsv

min_length: int

The low end of the range

max_length: int

The high end of the range

Returns:
ratio: float

Enrichment in this range (Scale 0-1)

riboraptor.download module

Utilities to download data from NCBI SRA

riboraptor.download.run_download_sra_script(download_root_location=None, ascp_key_path=None, srp_id_file=None, srp_id_list=None)[source]

Download data from SRA.

Parameters:
download_root_location : string

Path to download SRA files

ascp_key_path : string

Location for aspera private keypp

srp_id_list : list

List of SRP ids for download

srp_id_file : string

File containing list of SRP Ids, one per line

riboraptor.dtw module

riboraptor.dtw.dtw(X, Y, metric='euclidean', ddtw=False, ddtw_order=1)[source]
Parameters:
X : array_like

M x D matrix

Y : array_like

N x D matrix

metric : string

The distance metric to use. Can be : ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’. See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

ddtw : bool

Should use derivative DTW where the distance matrix is created using the derivate values at each point rather than the point themselves

ddtw_order : int [1,2]

First order uses one difference method Second order uses np.gradient for an approximation upto second order

Returns
——-
total_cost : float

Total (minimum) cost of warping

pointwise_cost : array_like

M x N matrix with cost at each (i, j)

accumulated_cost : array_like

M x N matrix with (minimum) cost accumulated till (i,j) having started from (0, 0)

riboraptor.dtw.get_path(D)[source]

Traceback path of minimum cost

Given accumulated cost matrix D, trace back the minimum cost path

Parameters:
D : array_like

M x N matrix as obtained from accumulated_cost using: total_cost, pointwise_cost, accumulated_cost = dtw(X, Y, metric=’euclidean’)

Returns:
traceback_x, traceback_x : array_like

M x 1 and N x 1 array containing indices of movement starting from (0, 0) going to (M-1, N-1)

riboraptor.dtw.plot_warped_timeseries(x, y, pointwise_cost, accumulated_cost, path, colormap=<Mock name='mock.pyplot.cm.Blues' id='140237965673304'>, linecolor='#D55E00')[source]

riboraptor.fasta module

class riboraptor.fasta.FastaReader(fasta_location)[source]

Bases: object

Class for reading and querying fasta file.

chromosomes

Return list of chromsome and their sizes as in the fasta file.

Returns:
chroms : dict

Dictionary with {“chr”: “Length”} format

.. currentmodule:: .FastaReader
.. autosummary::

.FastaReader

complement(seq)[source]
query(intervals)[source]

Query regions for sequence.

Parameters:
intervals: list of Interval

The intervals for fasta is one-based and full-closed

Returns:
sequences: list(str)

An array containing scores for each Interval This function is agnostic of the strand information, the position in the scores is corresponding to the interval

.. currentmodule:: .FastaReader
.. autosummary::

.FastaReader

reverse_complement(seq)[source]

riboraptor.genome module

riboraptor.hdf_parser module

class riboraptor.hdf_parser.HDFParser(filepath)[source]

Bases: object

close()[source]
get_coverage(region=(None, None, None, None), fragment_length=None, orientation='5prime', outprefix=None)[source]

Get coverage for a selected region.

Since the region is user provided, we assume the user is aware if it is strand specfic or not

get_query_alignment_length_dist()[source]
get_read_length_dist()[source]
riboraptor.hdf_parser.create_metagene_from_multi_bigwig(bed, bigwigs, max_positions=1000, offset_5p=0, offset_3p=0, n_jobs=16, saveto=None)[source]

Collapse multiple bigwigs to get bigwig.

Test case -> we should be able to get the same coverage file as from export_metagene coverage if everything above this works okay.

riboraptor.hdf_parser.hdf_to_bigwig(hdf, prefixdir, read_lengths_to_use='all', output_normalized=True)[source]

Create fragment and strand specific bigwigs from hdf

Parameters:
hdf: string

Path to hdf file

prefix: string

Prefix to store output files

read_lengths_to_use: list

Create bigwig only from these fragment lengths

TODO: one fix for this would be to create separate nonoverlapping bigwigs
and then merge them with bigWigCat (one more dependency)
riboraptor.hdf_parser.merge_bigwigs(bigwigs, chrom_sizes, saveto, scale=False)[source]

Merge multiple bigwigs into one.

Note: This seems impossible doing it through pybigWig way and hence the dependency on ucsc-bigWigMerge

Parameters:
bigwigs: list

List of path to bigwigs

chrom_size: string

Path to chromsome.sizes file

saveto: string

Path for outputting bigwig

riboraptor.hdf_parser.normalize_bw_hdf(bw, hdf, read_length, outbw)[source]

Normalize a fragment specific bigwig to RPM for that fragment length

Parameters:
bw: string

Path to bigwig file

hdf: string

Path to hdf file (used for fetching total counts for that read type)

read_length: int

Fragment length

outbw: string

Path to output bigwig

riboraptor.hdf_parser.tsv_to_bigwig(df, chrom_lengths, prefix)[source]

Convert tsv (created by bam-coverage) to bigwig

We create multiple bigwigs, each separately for fragment length, and the strand.

This will output the following files, where N represents the fragment length

N.5prime.pos.bw N.3prime.pos.bw

N.5prime.neg.bw N.3prime.neg.bw

Parameters:
df: string or pd.DataFrame

Path to bam-coverage outputted tsvi

chrom_lengths: list of tuples
A list of tuples with chromsome name and size

[(‘chr1’, 10000), [‘chr2’, 5000]]

prefix: string

Path to store the final tsv

riboraptor.helpers module

All functions that are not so useful, but still useful.

riboraptor.helpers.bwshift(bw, shift_by, out_bw, chunk_size=20000)[source]

Given a bigwig shift all the values by this Shifting by 10: variableStep chrom=chr span=1 1 1 2 2 3 5 4 6 5 5 6 3 7 3 8 5 9 5 10 5 11 6 12 6 13 0 14 2 15 3 16 3 17 10 18 4 19 4 20 2 21 2 22 2 23 1

shifted by 10 variableStep chrom=chr span=1 1 6 2 6 3 0 4 2 5 3 6 3 7 10 8 4 9 4 10 2 11 2 12 2 13 1

riboraptor.helpers.bwsum(bw, chunk_size=5000, scale_to=1000000.0)[source]
riboraptor.helpers.check_file_exists(filepath)[source]

Check if file exists.

Parameters:
filepath : str

Path to file

riboraptor.helpers.codon_to_anticodon(codon)[source]

Codon to anticodon.

Parameters:
codon : string

Input codon

riboraptor.helpers.complementary_strand(strand)[source]

Get complementary strand

Parameters:
strand: string

+/-

Returns:
rs: string

-/+

riboraptor.helpers.counts_to_tpm(counts, sizes)[source]

Counts to TPM

Parameters:
counts: array like

Series/array of counts

sizes: array like

Series/array of region sizes

riboraptor.helpers.create_bam_index(bam)[source]

Create bam index.

Parameters:
bam : str

Path to bam file

riboraptor.helpers.create_ideal_periodic_signal(signal_length)[source]

Create ideal ribo-seq signal.

Parameters:
signal_length : int

Length of signal to create

Returns:
signal : array_like

1-0-0 signal

riboraptor.helpers.featurecounts_to_tpm(fc_f, outfile)[source]

Convert htseq-counts file to tpm

Parameters:
fc_f: string

Path to htseq-count output

outfile: string

Path to output file with tpm values

riboraptor.helpers.find_first_non_none(positions)[source]

Given a list of positions, find the index and value of first non-none element.

This method is specifically designed for pysam, which has a weird way of returning the reference positions. If they are mismatched/softmasked it returns None when fetched using get_reference_positions.

query_alignment_start and query_alignment_end give you indexes of position in the read which technically align, but are not softmasked i.e. it is set to None even if the position does not align

Parameters:
positions: list of int

Positions as returned by pysam.fetch.get_reference_positions

riboraptor.helpers.find_last_non_none(positions)[source]

Given a list of positions, find the index and value of last non-none element.

This function is similar to the find_first_non_none function, but does it for the reversed list. It is specifically useful for reverse strand cases

Parameters:
positions: list of int

Positions as returned by pysam.fetch.get_reference_positions

riboraptor.helpers.get_region_sizes(region_bed)[source]

Get summed up size of a CDS/UTR region from bed file

Parameters:
region_bed: string

Input bed file

Returns:
region_sizes: pd.Series

Series with region name as index and size as key

riboraptor.helpers.get_strandedness(filepath)[source]

Parse output of infer_experiment.py from RSeqC to get strandedness.

Parameters:
filepath : str

Path to infer_experiment.py output

Returns:
strandedness : str

reverse or forward or none

riboraptor.helpers.htseq_to_tpm(htseq_f, outfile, cds_bed_f)[source]

Convert htseq-counts file to tpm

Parameters:
htseq_f: string

Path to htseq-count output

outfile: string

Path to output file with tpm values

cds_bed_f: string

Path to CDS/genesize bed file

riboraptor.helpers.identify_peaks(coverage)[source]

Given coverage array, find the site of maximum density

riboraptor.helpers.is_read_uniq_mapping(read)[source]

Check if read is uniquely mappable.

Parameters:
read : pysam.Alignment.fetch object
Most reliable: [‘NH’] tag
riboraptor.helpers.list_to_ranges(list_of_int)[source]

Convert a list to a list of range object

Parameters:
list_of_int: list

List of integers to be squeezed into range

Returns:
list_of_range: list

List of range objects

riboraptor.helpers.load_pickle(filepath)[source]

Read pickled files easy in Python 2/3

riboraptor.helpers.merge_intervals(intervals, chromosome_lengths=None, offset_5p=0, offset_3p=0, zero_based=True)[source]

Collapse intervals into non overlapping manner

Parameters:
intervals : list of Interval
chromosome_lengths : dict

A map of each chromosome’e length Only used with offset_3p, offset_5p>0

offset_5p : int (positive)

Number of bases to count upstream (5’)

offset_3p : int (positive)

Number of bases to count downstream (3’)

zero_based: bool

Indicate if the intervals are zero-based True means zero-based half open False means one-based full closed

Returns:
interval_combined : list of Interval sorted by the start

A merged version of intervals This is useful when the annotations are overlapping. Example: chr1 310 320 gene1 + chr1 319 324 gene1 + Returns: chr1 310 324 gene1 +

gene_offset_5p: Gene wise 5 prime offset

This might be different from offset_5p in cases where offset_5p leads to a negative coordinate

gene_offset_3p: Gene wise 3 prime offset

This might be different from offset_3p in cases where offset_3p leads to position beyond chromsome length

riboraptor.helpers.millify(n)[source]

Convert integer to human readable format.

Parameters:
n : int
Returns:
millidx : str

Formatted integer

riboraptor.helpers.mkdir_p(path)[source]

Python version mkdir -p

Parameters:
path : str
riboraptor.helpers.order_dataframe(df, columns)[source]

Order a dataframe

Order a dataframe by moving the columns in the front

Parameters:
df: Dataframe

Dataframe

columns: list

List of columns that need to be put in front

riboraptor.helpers.pad_five_prime_or_truncate(some_list, offset_5p, target_len)[source]

Pad first the 5prime end and then the 3prime end or truncate

Parameters:
some_list : list

Input list

offset_5p : int

5’ offset

target_length : int

Final length of list

If being extended, returns list padded with NAs.
riboraptor.helpers.pad_or_truncate(some_list, target_len)[source]

Pad or truncate a list upto given target length

Parameters:
some_list : list

Input list

target_length : int

Final length of list

If being extended, returns list padded with NAs.
riboraptor.helpers.parse_star_logs(infile, outfile=None)[source]

Parse star logs into a dict

Parameters:
infile : str

Path to starlogs.final.out file

Returns:
star_info : dict

Dict with necessary records parsed

riboraptor.helpers.path_leaf(path)[source]

Get path’s tail from a filepath

riboraptor.helpers.r2(x, y)[source]

Calculate pearson correlation between two vectors.

Parameters:
x : array_like

Input

y : array_like

Input

riboraptor.helpers.read_bed_as_intervaltree(filepath)[source]

Read bed as interval tree

Useful for reading start/stop codon beds

Parameters:
filepath: string

Location to bed

Returns:
bedint_tree: dict

dict with keys as gene name and strand as intervaltree

riboraptor.helpers.read_chrom_sizes(filepath)[source]

Read chr.sizes file sorted by chromosome name

Parameters:
filepath: string

Location to chr.sizes

Returns:
chrom_lengths: list of tuple

A list of tuples with chromsome name and their size

riboraptor.helpers.read_enrichment(read_lengths, enrichment_range=range(28, 33), input_is_stream=False, input_is_file=True)[source]

Calculate read enrichment for a certain range of lengths

Parameters:
read_lengths: Counter

A counter with read lengths and their counts

enrichment_range: range or str

Range of reads to concentrate upon (28-32 or range(28,33))

input_is_stream: bool

True if input is sent through stdin

Returns:
ratio: float

Enrichment in this range (Scale 0-1)

riboraptor.helpers.read_htseq(htseq_f)[source]

Read HTSeq file.

Parameters:
htseq_f: str

Path to htseq counts file

Returns:
htseq_df: dataframe

HTseq counts as in a dataframe

riboraptor.helpers.read_refseq_bed(filepath)[source]

Read refseq bed12 from UCSC.

Parameters:
filepath: string

Location to bed12

Returns:
refseq: dict

dict with keys as gene name and values as intervaltree

riboraptor.helpers.round_to_nearest(x, base=5)[source]

Round to nearest base.

Parameters:
x : float

Input

Returns:
v : int

Output

riboraptor.helpers.scale_bigwig(inbigwig, chrom_sizes, outbigwig, scale_factor=1)[source]

Scale a bigwig by certain factor.

Parameters:
inbigwig: string

Path to input bigwig

chrom_sizes: string

Path to chrom.sizes file

outbigwig: string

Path to output bigwig

scale_factor: float

Scale by value

riboraptor.helpers.set_xrotation(ax, degrees)[source]

Rotate labels on x-axis.

Parameters:
ax : matplotlib.Axes

Axes object

degrees : int

Rotation degrees

riboraptor.helpers.summarize_counters(samplewise_dict)[source]

Summarize gene counts for a collection of samples.

Parameters:
samplewise_dict : dict

A dictionary with key as sample name and value as another dictionary of counts for each gene

Returns:
totals : dict

A dictionary with key as sample name and value as total gene count

riboraptor.helpers.summary_stats_two_arrays_welch(old_mean_array, new_array, old_var_array=None, old_n_counter=None, carried_forward_observations=None)[source]

Average two arrays using welch’s method

Parameters:
old_mean_array : Series

Series of previous means with index as positions

old_var_array : Series

Series of previous variances with index as positions

new_array : array like

Series of new observations (Does noes Ciunts of number of positions at a certain index

Returns:
m : array like

Column wise Mean array

var : array like

Column wise variance

Consider an example: [1,2,3], [1,2,3,4], [1,2,3,4,5]
old = [1,2,3]
new = [1,2,3,4]
counter = [1,1,1]
mean = [1,2,3,4] Var =[na, na, na, na], carried_fowrad = [[1,1], [2,2], [3,3], [4]]
old = [1,2,3,4]
new = [1,2,3,4,5]
couter = [2,2,2,1]
mean = [1,2,3,4,5]
var = [0,0,0, na, na]
carried_forward = [[], [], [], [4,4], [5]]

Create forcelink forcefully

Parameters:
source: string

Location to source file

destination: string

Location to target

riboraptor.helpers.yield_intervals(chrom_size, chunk_size=20000)[source]

riboraptor.infer_protocol module

riboraptor.infer_protocol.infer_protocol(bam, bed, n_reads=500000, drop_probability=0.2)[source]

Infer strandedness protocol given a bam file

Parameters:
bam: string

Path to bam file

bed: string

Path to gene bed file

n_reads: int

Number of reads to use (downsampled)

drop_probability: float [0-1]

Reads are randomly droppped with this probability to simulate sampling

Returns:
protocol: string

unstranded/forward/reverse

forward_mapped_reads: float

Proportion of reads of type + mapping to + (++) or - mapping to - (–)

reverse_mapped_reads: float

Proportion of reads of type + mapping to - (+-) or - mapping to + (-+)

riboraptor.interval module

class riboraptor.interval.Interval(chrom=None, start=0, end=0, strand=None)[source]

Bases: object

Class for bed interval

riboraptor.kmer module

riboraptor.kmer.fastq_kmer_histogram(fastq_file, kmer_length=range(5, 31), five_prime=False, max_seq=1000000, offset=0, drop_probability=0)[source]

Get a histogram of kmers from a fastq file

Parameters:
fastq_file: string

Location of .fastq or .fastq.gz

kmer_length: range

Range of kmer to consider

five_prime: bool

Should consider sequences from 5’ end? Default: False (uses sequence from 3’ end)

max_seq: int

Maximum number of sequences to consider

offset: int

Offset to ignore at the 5’ end or 3’end Example: If the offset is 3, the first 3 bases will be skipped and kmers will start only from the 4th position For 3’end if the offset is 3, the last 3 bases will be skipped

drop_probability: float [0-1]

Reads are randomly droppped with this probability to simulate sampling

Returns
——-
kmers: Series

Sorted series with most frequent kmer

riboraptor.normalization module

riboraptor.normalization.deseq2_normalization(list_of_profiles)[source]

Perform DESeq2 like normalization position specific scores

Parameters:
list_of_profiles: array-like

array of profiles across samples for one gene

Returns:
normalized_profiles: array-like

array of profiles across samples

riboraptor.parallel module

riboraptor.parallel.ParallelExecutor(use_bar='tqdm', **joblib_args)[source]
riboraptor.parallel.func(x)[source]
riboraptor.parallel.text_progessbar(seq, total=None)[source]

riboraptor.plotting module

Plotting methods.

riboraptor.plotting.create_wavelet(data, ax)[source]
riboraptor.plotting.plot_featurewise_barplot(utr5_counts, cds_counts, utr3_counts, ax=None, saveto=None, **kwargs)[source]

Plot barplots for 5’UTR/CDS/3’UTR counts.

Parameters:
utr5_counts : int or dict

Total number of reads in 5’UTR region or alternatively a dictionary/series with genes as key and 5’UTR counts as values

cds_counts : int or dict

Total number of reads in CDs region or alternatively a dictionary/series with genes as key and CDS counts as values

utr3_counts : int or dict

Total number of reads in 3’UTR region or alternatively a dictionary/series with genes as key and 3’UTR counts as values

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_framewise_counts(counts, frames_to_plot='all', ax=None, title=None, millify_labels=False, position_range=None, saveto=None, ascii=False, input_is_stream=False, **kwargs)[source]

Plot framewise distribution of reads.

Parameters:
counts : Series

A series with position as index and value as counts

frames_to_plot : str or range

A comma seaprated list of frames to highlight or a range

ax : matplotlib.Axes

Default none

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_periodicity_df(df, saveto, cbar=False, figsize=(8, 8))[source]

Plot periodicty values across fragment lengths as a matrix.

Parameters:
df: string

Path to dataframe containing fragment length specific periodicities

saveto: string

Path to output plot file

cbar: bool

Whether to plot cbar or not

riboraptor.plotting.plot_read_counts(counts, ax=None, marker=None, color='royalblue', title=None, label=None, millify_labels=False, identify_peak=True, saveto=None, position_range=None, ascii=False, input_is_stream=False, ylabel='Normalized RPF density', **kwargs)[source]

Plot RPF density aro und start/stop codons.

Parameters:
counts : Series/Counter

A series with coordinates as index and counts as values

ax : matplotlib.Axes

Axis to create object on

marker : string

‘o’/’x’

color : string

Line color

label : string

Label (useful only if plotting multiple objects on same axes)

millify_labels : bool

True if labels should be formatted to read millions/trillions etc

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_read_length_dist(read_lengths, ax=None, millify_labels=True, input_is_stream=False, title=None, saveto=None, ascii=False, **kwargs)[source]

Plot read length distribution.

Parameters:
read_lengths : array_like

Array of read lengths

ax : matplotlib.Axes

Axis object

millify_labels : bool

True if labels should be formatted to read millions/trillions etc

input_is_stream : bool

True if input is sent through stdin

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.setup_axis(ax, axis='x', majorticks=5, minorticks=1, xrotation=45, yrotation=0)[source]

Setup axes defaults

Parameters:
ax : matplotlib.Axes
axis : str

Setup ‘x’ or ‘y’ axis

majorticks : int

Length of interval between two major ticks

minorticks : int

Length of interval between two major ticks

xrotation : int

Rotate x axis labels by xrotation degrees

yrotation : int

Rotate x axis labels by xrotation degrees

riboraptor.plotting.setup_plot()[source]

Setup plotting defaults

riboraptor.sequence module

Utilities for extracting sequence from fasta.

riboraptor.sequence.export_gene_sequences(bed, fasta, saveto=None, offset_5p=0, offset_3p=0)[source]

Export all gene sequences.

Parameters:
bed: str

Path to CDS or 5’UTR or 3’UTR bed

fasta: str

Path to fasta to fetch the sequences from

saveto: str

Path to write output tsv file

offset_5p: int (positive)

Number of bases to count upstream (5’)

offset_3p: int (positive)

Number of bases to count downstream (3’)

Returns:
gene_profiles: file

with the following format: gene1 5poffset1 3poffset1 sequence

gene2 5poffset2 3poffset2 sequence

riboraptor.sequence.gene_sequence(gene_group, fasta, offset_5p=0, offset_3p=0)[source]

Extract seq genewise given coordinates in bed file

Parameters:
gene_group: DataFrame

gene group from bed file

fasta str

Path to fasta file

offset_5p: int (positive)

Number of bases to count upstream (5’)

offset_3p: int (positive)

Number of bases to count downstream (3’)

Returns:
sequence: str

sequence of the gene

gene_offset_5p: Gene wise 5 prime offset

This might be different from offset_5p in cases where offset_5p leads to a negative coordinate

gene_offset_3p: Gene wise 3 prime offset

This might be different from offset_3p in cases where offset_3p leads to position beyond chromsome length

riboraptor.sradb module

Helper functions for parsing SRAmetadb.sqlite file

class riboraptor.sradb.SRAdb(sqlite_file)[source]

Bases: object

close()[source]
convert_gse_to_srp(gse)[source]

Convert GSE to SRP id. Requires input db to be GEOmetadb.sqlite

desc_table(table)[source]
get_query(query)[source]
get_row_count(table)[source]

Get row counts for a table

get_table_counts()[source]
list_fields(table)[source]

List all fields in a given table

list_tables()[source]

List all tables in the sqlite

open()[source]
search_experiment(srx)[source]

Search for a SRX/GSM id in the experiments

sra_convert(acc, out_type=['study_accession', 'experiment_accession', 'experiment_title', 'run_accession', 'taxon_id', 'library_selection', 'library_layout', 'library_strategy', 'library_source', 'library_name', 'bases', 'spots', 'adapter_spec'])[source]

riboraptor.statistics module

riboraptor.statistics.KDE(values)[source]

Perform Univariate Kernel Density Estimation.

Wrapper utility around statsmodels for quick KDE TODO: scikit-learn has a faster implementation (?)

Parameters:
values : array like
Returns:
support : array_like
cdf : array_like
riboraptor.statistics.KS_test(a, b)[source]

Perform KS test between a and b values

Parameters:
a, b : array-like

Input

Returns:
D : int

KS D statistic

effect_size : float

maximum difference at point of D-statistic

cdf_a, cdf_b : float

CDF of a, b

Note: By default this method does testing for alternative=lesser implying
that the test will reject H0 when the CDf of b is ‘above’ a
riboraptor.statistics.calculate_cdf(data)[source]

Calculate CDF given data points

Parameters:
data : array-like

Input values

Returns:
cdf : series

Cumulative distribution funvtion calculated at indexed points

riboraptor.statistics.coherence_pvalue(x, N)[source]

Calculate p-value for coherence score

Parameters:
x: float [0-1]

Coherence value

N: int

Total number of windows

Returns:
pval: float

p-value

riboraptor.statistics.series_cdf(series)[source]

Calculate cdf of series preserving the index

Parameters:
series : series like
Returns:
cdf : series

riboraptor.tracks module

riboraptor.tracks.create_trackdb(bwdir, srp, orientation='5prime')[source]

Create track file

riboraptor.tracks.get_bigwigtrack_text(track_name, parent, big_data_url, negate_values)[source]

Create bigwig track text

riboraptor.tracks.get_compositetrack_text(track_name, parent)[source]

Create composite track text

riboraptor.tracks.get_multiwigtrack_text(track_name, parent)[source]

Create a multiWig track.

Example track myMultiWig container multiWig aggregate transparentOverlay showSubtrackColorOnUi on type bigWig 0 1000 viewLimits 0:10 maxHeighPixels 100:32:8

track myFirstOverlaySig parent myMultiWig color 255,128,128 type bigWig 0 1139

track myFirstBigWig parent myMultiWig color 120,235,204

riboraptor.tracks.get_supertrack_text(track_name)[source]

riboraptor.utils module

riboraptor.utils.copy_sra_data(df, sra_source_dir='/staging/as/skchoudh/SRA_datasets/', sra_dest_dir='/staging/as/skchoudh/re-ribo-datasets/')[source]

Copy SRA data to a new location retaining only single ended samples.

riboraptor.utils.create_config_file(df)[source]
riboraptor.utils.determine_cell_type(sample_attribute)[source]
riboraptor.utils.filter_single_end_samples(df)[source]

Filter single end samples from a dataframe

Parameters:
df: DataFrame

Dataframe as obtained from SRAb.sra_convert()

Returns:
df: DataFrame

DataFrame with only single end samples

riboraptor.utils.get_cell_line_or_tissue(row)[source]
riboraptor.utils.get_enrichment_cds_stats(pickle_file)[source]
riboraptor.utils.get_fragment_enrichment_score(txt_file)[source]
riboraptor.utils.get_strain_type(sample_attribute)[source]
riboraptor.utils.get_tissue_type(sample_attribute)[source]
riboraptor.utils.load_tpm(path)[source]
riboraptor.utils.summary_starlogs_over_runs(directory, list_of_srr)[source]
riboraptor.utils.write_config(species, srp)[source]

riboraptor.wig module

class riboraptor.wig.WigReader(wig_location)[source]

Bases: object

Class for reading and querying wigfiles.

chromosomes

Return list of chromsome and their sizes as in the wig file.

Returns:
chroms : dict

Dictionary with {“chr”: “Length”} format

.. currentmodule:: .WigReader
.. autosummary::

.WigReader

close()[source]
query(intervals)[source]

Query regions for scores.

Parameters:
intervals : list of Interval
Returns:
scores : array of array

A numpy array containing scores for each Interval This function is agnostic of the strand information, the position in the scores is corresponding to the interval

.. currentmodule:: .WigReader
.. autosummary::

.WigReader

Module contents