On Github craigsketchley / INFO5010_Presentation
By Craig Sketchley
# The Paper I am going to be discussing the paper shown here: - STAR: ultrafast universal RNA-seq aligner Note: Do I need an overview slide of the talk?RNA-seq: sequencing the mRNA within a cell at a given point in time.
mRNA is not transcribed from a single contiguous section in DNA.
How do you accurately identify splice sites?
High-throughput sequencing also makes it challenging to detect & characterise splice sites.
Two key tasks:
Extended DNA mapping solutions.
Offer compromises in either accuracy or resources required.
Computational component becoming bottleneck.
Mostly designed for short reads (≤ 200 bases).
Not great for "Third Generation Sequencing" (potentially full length reads).
# TODO:A tool specifically designed to align non-contiguous sequences to a reference genome.
"Spliced Transcripts Alignment to a Reference" (STAR)
It stands for ... It's specifically designed... Computational bottleneck - too slow or run out of RAM.The STAR algorithm consists of 2 main steps:
Searching for seeds involves a sequential search for the Maximum Mappable Prefix ($MMP$).
The $MMP$ is calculated as follows:
Given a read sequence $R$, read location $i$ and a reference genome sequence $G$, the $MMP(R,i,G)$ is defined as the longest substring $(R_i,R_{i+1},\dots,R_{i+MML-1})$ that matches exactly one or more substrings of $G$, where $MML$ is the Maximum Mappable Length. The main idea behind the seed search phase of STAR is to sequentially search through looking for the Maximum Mappable Prefix. [read the algo] That didn't make a whole lot of sense to me straight away so I'll demonstrate this with a diagram...
Your browser does not support the video tag.
The search is implemented using a suffix array for the reference genome; the read sequence is then threaded through.
## Basic Idea - Once the MMP is found, this is a splice location, rerun this on the remainder of the read sequence that is not mapped [show diagram]. - Search is performed forwards and backwards on the read, and also can be started from a user defined position to facilitate finding anchors with errors at the ends. ## Figure - This approach can also help finding mismatches/indels. Run a sequence, if it doesn't reach the end, expand the "anchor" allowing for mismatches (what about indels?). - Poor alignment from this procedure can help identify sequence traits such as poly-A tails (+AAAAAAA), library adapters (???) or just poor sequencing. ## SA - The uncompressed SA offer a speed advantage with a trade off for space. Which we'll come to in the results section. - Build a suffix array of the reference genome and thread the read sequence through the array for a given read position.Cluster seeds around a selected anchor seed.
Anchor seeds are selected by minimising the number of genomic mappings.
Seeds are then stitched together using a local-linear transcription model.
Your browser does not support the video tag. STAR next builds the entire read sequence... Clustering the seeds found in the first stage by proximity to a selected number of anchor seeds. Anchors are selected from the seeds that have the minimum number of genomic mappings. This prioritises seeds that are not as "multi-mapped". The seeds that map within a user-defined genomic window are stitched together using the... local-linear transcription model (genomic ordering of local seeds maintained, with no overlaps). Seeds are stitchedStitching is guide by a local alignment score.
$S = +\sum_{m}P_m - \sum_{mm}P_{mm} - \sum_{ins}P_{ins} - \sum_{del}P_{del} - \sum_{gap}P_{gap}$
$P_{ins/del} = P_{ins/del}^{open} + P_{ins/del}^{extend} . L_{ins/del}$
If one genomic window is not enough to map the entire read sequence, another anchor is chosen and clustering applied again.
This results in a chimeric read; where the mRNA is spliced from 2 distal parts of the genome.
If one window is not enough to cover the read sequence, then the alignment is expanded to include another window - resulting in a chimeric read. A join of 2 separate exons from potentially different parts of the genome (diff. chromosome?), joining to form at RNA. The stitching is guided by a local alignment scoring scheme that includes user-defined scores for mis/matches and indels. It uses 'Affine Gap' for short insertions and deletions. User defined minimum, otherwise a splice.The paper compared STAR 2.1.3 results with 4 other popular RNA-seq mappers:
Simulated data allows for accurate expected results.
All aligners were run in de novo mode with default parameters.
ROC curves plot true positive vs. false positive.
Varying $N$, the number of reads required across a splice junction for it to be recognised as a splice, from 1 to 100.
All ROC curves exhibit desirable results.
README!!! Initially, they used simulated data to accurately benchmark all the mappers; they know exactly what mapping to expect. Data included: genomic variations and sequencing errors. __de novo__ - from the beginning - without any knowledge of the genome/transcript. -- String comparison. __default parameters__ - commonly accepted practice, since all aligners should have been optimised for mammalian genomes and RNA-seq data. ### ROC curve: - Plots the true positive (a classification hit), vs. a false positive (a misclassification, or false alarm). - Varies with a detection/discrimination threshold. - In this case, the plot is varying the number of reads required across a splice junction for it to be recognised as a splice, from 1 to 100. HOW MUCH EVIDENCE REQUIRED FOR A SPLICE. "All aligners exhibit desirable steep ROC curves at high values of detection threshold." "At the lowest detection threshold of 1 read per junction, STAR exhibits the lowest false-positive rate while achieving high sensitivity."All mappers were run on an ENCODE (Encyclopaedia of DNA Elements) long RNA-seq dataset.
Percentage of reads aligned:
To measure accuracy, the plots included a pseudo-ROC.
It plotted the follow against each other:
The Idea:
All mappers were run with default parameters on the ~40 million 2 x 76 Illumina human RNA-seq dataset.
Close to linear scaling of the throughput rate with the number of threads.
STAR with 12 threads ~= 45 million reads per thread per hour.
RAM usage more than most, ~27GB RAM for human genome.
STAR has a sparse options to reduce RAM usage, for less speed.
## Setup - Tests completed on a "two 6-core Intel Xeon CPUs X5680@ 3.33GHz and 148GB of RAM" - Mappers ran with default parameters again on the same experimental dataset as before. ## Observations: - Close to linear scaling of STAR; losing ~10% off per thread mapping speed. - Pretty impressive; so even 1 thread outperforms next closest mapper. - ~27GB RAM for human genome. Still within a reasonable level. - Sparse decreases speed by ~25% within sacrificing accuracy.STAR was validated on data as part of the ENCODE and compared against BLAT (a popular mRNA aligner).
Similar or higher accuracy to BLAT.
2 x faster than BLAT, important for high-throughput sequencing.
No NOtes.Aligning non-contiguous RNA-seq data to a reference genome is hard. It remains unsolved.
STAR is: