Prediction Model
General
Our prediction model scans the potential off-target sites on the genome and identifies those that are most likely to be confirmed in experiments.
In our pipeline, we first construct a training set from high-throughput sequencing results and the output of bioinformatics tools. We then extract features from the sequence information and use a machine learning model to carry out the classification task.
Dataset
CIRCLE-seq[1] is an NGS-based method for detecting off-target cleavage by CRISPR/Cas9. Genomic DNA extracted from samples is sheared and circularized, and the residual linear DNA is then degraded. The circular DNA molecules are cleaved by Cas9 nuclease, and the cleavage sites are detected by high-throughput sequencing. We selected Supplementary Table 2 (List of all CIRCLE-seq detected off-target sites) of "CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR–Cas9 nuclease off-targets" as our positive dataset because CIRCLE-seq provides an accessible, rapid, and comprehensive method for identifying genome-wide off-target mutations.
Meanwhile, Cas-OFFinder[2] is a fast and versatile algorithm that searches for potential off-target sites of Cas9 RNA-guided endonucleases. It scans the whole genome for all sequences matching the PAM pattern and, after comparing them with the targets, outputs those within a mismatch threshold. Experiments have shown that when the Cas9-sgRNA complex attempts to bind the target DNA, mismatches at different positions do not affect R-loop formation equally[3]. Therefore, a large proportion of the potential off-target sites produced by Cas-OFFinder, which considers only the number of mismatches, may not be detected in experiments. We assign the sequences that appear only in Cas-OFFinder's output to our negative dataset.
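The labeling rule described above can be sketched as a set difference. This is only an illustration under simplifying assumptions: the function names are hypothetical, sites are keyed by target name and sequence (a real pipeline would more likely key them by genomic coordinates), and the example sequences in the usage note are made up rather than taken from either data source.

```python
def normalize(seq):
    """Upper-case a site sequence so that comparisons ignore letter case."""
    return seq.strip().upper()


def build_dataset(circle_seq_sites, cas_offinder_sites):
    """Label candidate sites: 1 if detected by CIRCLE-seq, 0 if only predicted.

    Both arguments are iterables of (target_name, site_sequence) pairs.
    Returns a list of (target_name, site_sequence, label) tuples.
    """
    positives = {(t, normalize(s)) for t, s in circle_seq_sites}
    candidates = {(t, normalize(s)) for t, s in cas_offinder_sites}
    # Negative samples: sites that appear only in the Cas-OFFinder output.
    negatives = candidates - positives
    labeled = [(t, s, 1) for t, s in sorted(positives)]
    labeled += [(t, s, 0) for t, s in sorted(negatives)]
    return labeled
```

For example, `build_dataset([("T1", "ACGT")], [("T1", "ACGT"), ("T1", "acga")])` keeps `ACGT` as a positive sample and adds `ACGA` as a negative one.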
Algorithm
Researchers have released many machine learning and deep learning models for recognizing CRISPR/Cas9 off-target sites. Hui Peng et al. (Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions)[4] converted a sequence pair into a feature vector using nucleotide composition change features and position-specific binary mismatch features, and then used an ensemble SVM classifier to predict off-target sites. Jiecong Lin et al. (Off-target predictions in CRISPR-Cas9 gene editing using deep learning)[5] represented the sequence pair by an OR operation on two one-hot encodings and performed the off-target prediction with a CNN.
We build on these models: we generate the dataset in a similar way but improve the feature extraction method. We replace their simple binary mismatch features with PMFM (positional mismatch frequency matrix) features and substitute a faster GBDT algorithm[6] for the SVM or CNN, so that thousands of potential off-target sequences can be predicted rapidly.
At each position of the sequence pair, the mismatch condition is one of three states (Match, Transition, Transversion), and these states occur with different frequencies in the positive dataset. Our classification task can be transformed into a problem of estimating the posterior probability $P(C_+|M)$, where $M$ represents the mismatch information obtained by comparing the sequence pair position by position. If the 20 positions are treated as independent, the mismatch frequencies give an approximate posterior probability:
$$P(C_+|M) = \frac{P(M|C_+)P(C_+)}{P(M)} = P(C_+)\prod_{i=1}^{20}\frac{P(m_i|C_+)}{P(m_i)}$$
$$\log P(C_+|M) \propto \sum_{i=1}^{20}w_i\log P(m_i|C_+)$$
where $m_i$ is the mismatch condition at position $i$, $P(m_i|C_+)$ is its positional frequency in the positive dataset, and the $w_i$ weight the individual positions.
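As a rough illustration of the encoding, the sketch below builds a positional mismatch frequency matrix from positive pairs and then encodes a sequence pair as the frequencies of its observed mismatch conditions. All function names, the pseudocount, and the choice $w_i = 1$ in `log_score` are assumptions made for the example, not details taken from our implementation.

```python
import math

PURINES = {"A", "G"}
STATES = ("match", "transition", "transversion")


def mismatch_state(a, b):
    """Classify one aligned base pair as match, transition, or transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "match"
    # Transition: purine<->purine or pyrimidine<->pyrimidine substitution.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def pair_states(target, site):
    """Mismatch condition at every aligned position of a sequence pair."""
    return [mismatch_state(a, b) for a, b in zip(target, site)]


def build_pmfm(pairs, length=20, pseudocount=1.0):
    """Estimate P(m_i | C+) per position from positive (target, site) pairs."""
    counts = [{s: pseudocount for s in STATES} for _ in range(length)]
    for target, site in pairs:
        for i, state in enumerate(pair_states(target[:length], site[:length])):
            counts[i][state] += 1
    return [{s: c[s] / sum(c.values()) for s in STATES} for c in counts]


def encode(pmfm, target, site):
    """Feature vector: frequency of the observed condition at each position."""
    n = len(pmfm)
    return [pmfm[i][s] for i, s in enumerate(pair_states(target[:n], site[:n]))]


def log_score(pmfm, target, site):
    """Sum of log P(m_i | C+) with all weights w_i set to 1 (an assumption)."""
    return sum(math.log(p) for p in encode(pmfm, target, site))
```

The pseudocount keeps every frequency strictly positive, so `log_score` never takes the logarithm of zero for a condition that was not seen at some position.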
PMFM is therefore a well-motivated feature for encoding the sequence pair. Gradient boosting decision trees (GBDTs) use boosting to combine individual decision trees: each tree attempts to reduce the errors left by the trees before it. Each tree is a weak learner, but adding many trees in series, each focusing on the remaining errors, makes boosting an efficient and accurate model.
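The boosting idea can be shown with a deliberately small sketch in which each weak learner is a one-split decision stump fitted to the residuals of the current ensemble under squared error. This is not the histogram-based algorithm of [6] and not our production model; the function names, the learning rate, and the number of trees are illustrative assumptions.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, lv, rv)
    return best[1:] if best else None


def stump_predict(stump, row):
    """Return the leaf value of the stump for one sample."""
    j, threshold, lv, rv = stump
    return lv if row[j] <= threshold else rv


def fit_boosting(X, y, n_trees=50, learning_rate=0.3):
    """Add stumps in series; each one is fitted to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no valid split exists
            break
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model, row):
    """Score a sample as the base value plus the scaled sum of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

On a toy problem such as `X = [[0.1], [0.2], [0.8], [0.9]]` with `y = [0, 0, 1, 1]`, every added stump shrinks the remaining residual, so the scores move steadily toward the labels.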
Result
In the first step, we inspected the mismatch frequencies at each position in the positive and negative datasets. For negative samples, mismatches are distributed almost uniformly across all 20 positions, whereas for positive samples the frequencies toward the 3' end are clearly lower. The mismatch frequencies at the 20 positions differ significantly between the positive and negative samples (P-value = 0.033).
We use 5-fold cross-validation to evaluate the robustness of our model after feature extraction. The common metrics for assessing the performance of our machine learning model are listed below, together with the receiver operating characteristic (ROC) curve.
Fold | Accuracy | Sensitivity | Specificity | Precision | F-measure | MCC |
---|---|---|---|---|---|---|
Fold 1 | 0.983 | 0.664 | 0.995 | 0.832 | 0.739 | 0.735 |
Fold 2 | 0.982 | 0.651 | 0.995 | 0.817 | 0.725 | 0.720 |
Fold 3 | 0.982 | 0.663 | 0.994 | 0.814 | 0.731 | 0.726 |
Fold 4 | 0.982 | 0.654 | 0.994 | 0.800 | 0.720 | 0.714 |
Fold 5 | 0.982 | 0.644 | 0.996 | 0.803 | 0.715 | 0.710 |
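For reference, the metrics reported in the table can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions for the example.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the six reported metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)  # recall on positive samples
    specificity = _ratio(tn, tn + fp)  # recall on negative samples
    precision = _ratio(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

When negative samples far outnumber positive ones, accuracy is dominated by the negative class, which is why sensitivity, F-measure, and MCC are worth reading alongside it.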
The prediction results are convincing: given the high accuracy and specificity, off-target effects are very likely to be detected experimentally at the off-target sites that the model finds. We also analyze the influence of each mismatch position on the off-target decision using permutation importance, which measures the increase in the model's prediction error after a feature's values are shuffled. The importance across mismatch positions follows a trend similar to the reduction in mismatch frequency from negative to positive samples. For instance, a mismatch at any of the four positions farthest from the PAM sequence has very little effect on the off-target outcome.
To verify that our machine learning model can accurately predict off-target sites for most targets, all potential off-target sites of 10 different target sequences were each evaluated with a model trained on the other targets. The evaluation metrics provide strong evidence of the model's robust performance: it excludes nearly all negative samples.
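Permutation importance can be sketched in a few lines: shuffle one feature column at a time and record how much the error rate rises. The code below is a generic illustration, not our analysis script; `predict` stands for any function that maps a feature row (a list) to a class label, and the number of repeats and the seed are arbitrary choices.

```python
import random


def error_rate(predict, X, y):
    """Fraction of samples whose predicted label differs from the true one."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in error rate after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by shuffled values.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(error_rate(predict, shuffled, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature that the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction.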
Metric | Site1 | Site2 | Site3 | Site4 | EMX1 | FANCF | HBB | RNF2 | VEGFA1 | VEGFA2 | VEGFA3 |
---|---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 0.983 | 0.978 | 0.981 | 0.969 | 0.985 | 0.975 | 0.980 | 0.988 | 0.957 | 0.904 | 0.982 |
Specificity | 0.997 | 0.998 | 0.999 | 0.987 | 0.995 | 0.990 | 0.993 | 0.991 | 0.994 | 0.995 | 0.994 |
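The per-target validation above amounts to a leave-one-target-out split, which can be sketched as follows. The generator name and the `(target, features, label)` tuple layout are assumptions for the example.

```python
def leave_one_target_out(samples):
    """Yield (held_out_target, train, test) splits.

    `samples` is a list of (target_name, features, label) tuples. For each
    target, the model would be trained on `train` (all other targets) and
    evaluated on `test` (only the held-out target).
    """
    targets = sorted({target for target, _, _ in samples})
    for held_out in targets:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```

Splitting by target rather than by random sample keeps every site of the held-out target out of training, so the reported metrics reflect performance on a target the model has not seen.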
Reference
[1] Tsai, S., Nguyen, N., Malagón-López, J., Topkar, V.V., Aryee, M., & Joung, J.K. (2017). CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR-Cas9 nuclease off-targets. Nature Methods, 14, 607–614.
[2] Bae, S., Park, J., & Kim, J. (2014). Cas-OFFinder: a fast and versatile algorithm that searches for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics, 30, 1473–1475.
[3] Jones, S.K., Hawkins, J.A., Johnson, N.V., Jung, C., Hu, K., Rybarski, J.R., Chen, J.S., Doudna, J., Press, W., & Finkelstein, I.J. (2020). Massively parallel kinetic profiling of natural and engineered CRISPR nucleases. Nature Biotechnology, 1–10.
[4] Peng, H., Zheng, Y., Zhao, Z., Liu, T., & Li, J. (2018). Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions. Bioinformatics, 34, i757–i765.
[5] Lin, J., & Wong, K. (2018). Off-target predictions in CRISPR-Cas9 gene editing using deep learning. Bioinformatics, 34, i656–i663.
[6] Guryanov, A. (2019). Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees. AIST.