============================================================================ Manual ============================================================================ Contents ======== roboraptor is a comprehensive pipeline and set of tools for analyzing Ribo-seq data. This manual explains the stages in our pipeline, how to use the analysis tools and how to modify the pipeline for your specific context. Assumptions =========== Our pipeline was designed to run in a cluster computing context, with many processing nodes available, and a job submission system like PBS or SGE. Much of this analysis is computationally intensive. We assume that individual nodes will have several GB of memory available for processing. Translatome construction ================================ Ribo-seq experiments are always single-end sequenced. Ribosome protected fragments range from 28-32 nucleotides and hence most experiments involve 50 bp single end reads. Before mapping, we need to get rid of the adapters ligated at the 3' end of the fragments as part of the library protocol. Trimming Reads -------------- We use trim_galore_ for trimming. It automates adapter trimming: .. code-block:: console $ trim_galore -o -q -o out_dir Output directory -q min_quality Trim low-quality ends from reads in addition to adapter removal Mapping Reads ------------- We use STAR_ to map reads, though other splice-aware aligners can also be used. The first step is to create an index, preferably using a GTF file. If the index step is run without a GTF file (which is optional), STAR_ will not be splice-aware. Creating Index ~~~~~~~~~~~~~~ .. code-block:: console $ STAR --runThreadN \ --runMode genomeGenerate\ --genomeDir \ --genomeSAindexNbases \ --genomeFastaFiles \ --sjdbGTFfile --runThreadN Number of threads to use --runMode Flag to set for index mode --genomeDir Directory to write index files to --genomeSAindexNbases min(14, log2(GenomeLength)/2 - 1), this **must** be scaled down for small genomes --genomeFastaFiles Path to reference fasta --sjdbGTFfile Path to GTF file Mapping ~~~~~~~ Often Ribo-seq experiments have a matching RNA-seq library. Ideally, the RNA-seq library should be as similar to Ribo-seq library and hence will often be single-ended. We recommend both RNA-seq and Ribo-seq samples be mapped using the following parameters: .. code-block:: console $ STAR --runThreadN \ --genomeDir \ --outFilterMismatchNmax 2\ --alignIntronMin \ --alignIntronMax \ --outFileNamePrefix --readFilesIn \ --outSAMtype BAM Unsorted\ --readFilesCommand zcat\ --quantMode TranscriptomeSAM\ --outTmpDir /tmp/_tmp\ --outReadsUnmapped Fastx\ --runThreadN Number of threads to use --genomeDir Path to index directory --outFilterMismatchNmax Allow a maximum of mismatches=2 --alignIntronMin Minimum intron size. Any genomic gap is considered intron if its length >= alignIntronMin. (Default = 20) --alignIntronMax Maximum intron size (Default = 1000000) --outFileNamePrefix Prefix for output files --readFilesIn Path to input fastq.gz --outSAMtype Output an unsorted BAM file (outtype=BAM Unsorted) --readFilesCommand cat/zcat depending on input is fq/fq.gz --quantMode Also output BAM aligned to the transcriptome --outTmpDir Directory to use for writing temporary files --outReadsUnmapped Write unmapped reads to separate fastq file Sorting and Indexing ~~~~~~~~~~~~~~~~~~~~ STAR outputted BAM files are not sorted. We need a BAM file sorted by coordinates. .. code-block:: console $ samtools sort Aligned.out.bam -o -T _sort &&\ $ samtools index Aligned.out.bam Additionaly, we also need BAM file sorted by name, since htseq-counts_ (and featureCounts_) prefer a BAM sorted by name in their default mode. .. code-block:: console $ samtools sort -on -T -o &&\ $ samtools index Translatome analysis ============================ Once we have the bams, we are ready for downstream analysis. The downstream step often involves a number of steps. The following list summarises these steps along with their recommended values (wherever applicable): * Quality Control - Number of uniquely mapped reads : >=5M - Periodicity : TODO - Ratio of CDS/(5'UTR+3'UTR) : >1 after length normalization - Fragment length distribution : Peak around 28-32 nt * Metagene analysis - P-site offsets : Around 12-14 nt upstream of the start codon when counting based on 5'end Counting uniquely mapped reads ------------------------------ The first step is to simply caculate the number of uniquely mapped reads. We recommend a minimum of 5 million reads for any downstream analysis. TODO: list different recommendation for different species .. code-block:: console $ riboraptor uniq-mapping-count --bam --bam input.bam Path to bam file Read length distribution ------------------------ An ideal Ribo-seq library is expected to have 28-32 nt long fragments most enriched. We can calculate enrichment and plot the fragment size distribution using riboraptor. Readd length distribution can be calculated using the `read-length-dist` subcommand: .. code-block:: console $ riboraptor read-length-dist --bam This will print out the read length and associated counts on the console. In order to visualize thhese counts as a barplot, we can use the `plot-read-dist` subcommand: .. code-block:: console $ riboraptor read-length-dist --bam \ | riboraptor plot-read-dist --saveto Metagene Analysis ----------------- A metagene plot is used as a summary statistic to visualize the distribution of ribosome protected fragments along the positions of a gene often starting (ending) at the start (stop) codon. This is useful for estimating P-site offsets. The ribosome subunuits are known to protect 28-32 nt and hence the P-site is often located 12 nt downstream the 5' position of the mapped read. Creating bigWig file ~~~~~~~~~~~~~~~~~~~~~ To perform metagene analysis, we will work with bigWig_ format. in order to do that, we need an intermediate bedGraph_ file. This can be done using `bam-to-bedgraph` subcommand: .. code-block:: console $ riboraptor bam-to-bedgraph --bam This will print the bedGraph to console. this cna be piped to `bedgraph-to-bigwig` subcommand: .. code-block:: console $ riboraptor bam-to-bedgraph --bam \ | riboraptor bedgraph-to-bigwig --sizes --saveto We now have `` ready for further downstream analysis. Distribution in 5'UTR/3'UTR/CDS regions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TODO (See Example) Metagene plot ~~~~~~~~~~~~~ TODO (See Example) Example ======= We will use two samples from GSE94454_ , one RNA-seq sample (SRR5227310) and one Ribo-seq sample (SRR5227306) as examples for examples that follow. .. code-block:: console $ riboraptor uniq-mapping-count --bam data/SRR5227310.bam 28637667 This is a pretty deep library. .. code-block:: console $ riboraptor read-length-dist --bam data/SRR5227310.bam\ | riboraptor plot-read-dist --saveto SRR5227310.png .. figure:: images/SRR5227310.png :align: center :width: 100% :alt: Fragment length distribution SRR5227310 :figclass: align center Fragment length distribution for SRR5227310 How enriched is it in 27-32 nt fragment range? .. code-block:: console $ riboraptor read-length-dist --bam data/SRR5227310.bam\ | riboraptor read-enrichment (Enrichment: 1.52768004237, pval: 0.458943823895) So the fragment length distribution doesn't seem to be enriched. We next perform metagene analysis. Ribo-seq data is expected to have an inherent periodicity of 3, since ribosomes move one codon at a time during active translation. .. code-block:: console $ riboraptor bedgraph-to-bigwig -bg data/SRR5227310.bg -s hg38 -o data/SRR5227310.bw .. code-block:: console $ riboraptor metagene-coverage -bw data/SRR5227310.bw \ --region_bed hg38_cds --max-positions 500 \ --prefix data/SRR5227310.metagene --offset 60 --ignore_tx_version .. code-block:: console $ riboraptor plot-read-counts \ --counts data/SRR5227310.metagene_metagene_normalized.pickle\ --saveto data/SRR5227310.metagene.png .. figure:: images/SRR5227310.metagene.png :align: center :width: 100% :alt: Metagene distribution for SRR5227310 :figclass: align center Metagene distribution for SRR5227310 Since metagene gives a summary statistic, we can also look at the abolute counts distribution per frame: .. code-block:: console $ riboraptor plot-framewise-counts --counts data/SRR5227310.metagene_metagene_raw.pickle\ --saveto data/SRR5227310.framewise.png .. figure:: images/SRR5227310.framewise.png :align: center :width: 100% :alt: Fragment length distribution SRR5227310 :figclass: align center Framewise distribution for SRR5227310 Let's try another sample: SRR5227306 and compare it with SRR5227310 with respect to distribution of reads. .. code-block:: console $ riboraptor uniq-mapping-count --bam data/SRR5227306.bam 10658208 .. code-block:: console $ riboraptor read-length-dist --bam data/SRR5227306.bam | riboraptor plot-read-dist --saveto SRR5227306.png .. figure:: images/SRR5227306.png :align: center :width: 100% :alt: Fragment length distribution SRR5227306 :figclass: align center Fragment length distribution for SRR5227306 .. code-block:: console $ riboraptor read-length-dist --bam data/SRR5227306.bam | riboraptor read-enrichment (Enrichment: 14.0292145986, pval: 0.135220082438) As compared to SRR5227310, the enrichment in this case is almost 10 times higher. .. code-block:: console $ riboraptor plot-framewise-counts --counts data/SRR5227306.metagene_metagene_raw.pickle\ --saveto data/SRR5227306.framewise.png .. figure:: images/SRR5227306.framewise.png :align: center :width: 100% :alt: Fragment length distribution SRR5227306 :figclass: align center Framewise distribution for SRR5227306 We can see the framewise distribution of reads in SRR5227310 is more or less uniform, but not so in SRR5227306. .. code-block:: console $ riboraptor bedgraph-to-bigwig -bg data/SRR5227306.bg -s hg38 -o data/SRR5227306.bw .. code-block:: console $ riboraptor metagene-coverage -bw data/SRR5227306.bw \ --region_bed hg38_cds --max-positions 500 \ --prefix data/SRR5227306.metagene --offset 60 --ignore_tx_version .. code-block:: console $ riboraptor plot-read-counts \ --counts data/SRR5227306.metagene_metagene_normalized.pickle\ --saveto data/SRR5227306.metagene.png .. figure:: images/SRR5227306.metagene.png :align: center :width: 100% :alt: Metagene distribution for SRR5227306 :figclass: align center Metagene distribution for SRR5227306 The metagene of a Ribo-seq sample will show periodicity as in the case of SRR5227306 sample. On the other hand a RNA-seq sample like SRR5227310 will tend to have a flat profile. Distribution of 5'UTR/CDS/3'UTR counts -------------------------------------- TODO .. _trim_galore: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ .. _STAR: https://github.com/alexdobin/STAR .. _riboraptor: https://github.com/saketkc/riboraptor .. _GSE94454: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94454 .. _htseq-counts: https://htseq.readthedocs.io/ .. _featureCounts: http://bioinf.wehi.edu.au/featureCounts/ .. _bigWig: https://genome.ucsc.edu/goldenpath/help/bigWig.html .. _bedGraph: https://genome.ucsc.edu/goldenpath/help/bedgraph.html