Training and test datasets

The training data set was randomly divided into 17 gene folds where each fold has a further distribution of 94% and 6% for training and test data sets, respectively.

y = w

_{0}x

_{0}+ w

_{1}x

_{1}+ w

_{2}x

_{2}+ w

_{3}x

_{3}+ w

_{4}x

_{4}+ ................. +w

_{n}x

_{n}

y -- Score of the Guide

w

_{0}... w

_{n}-- Weights of features contributing to score

x

_{0}... x

_{n}-- Features contributing to score as predicted by the algorithm.

Features

Table-1: List of variables/features

Transformation process[1]

Parameters were binned by a fixed width for each feature and applied over the range of values, with the upper-most and lower-most bins collapsed with the neighboring bins if the number of data points at each edge were sparse.

For binning parameters, a fixed-width was chosen for each feature/parameter and applied over the range of values, with the upper-most and lower-most bins collapsed with the neighboring bins if the number of data points at each edge were sparse. Each feature was then split into individual parameters for each bin and sgRNAs were assigned a 1 for the bin if the value fell within the bin or 0 if not. For linearizing sgRNA positioning parameters with continuous curves, sgRNA positions were fit to the activity score (individually for the distance to each TSS coordinate) using SVR with a radial basis function kernel and hyperparameters C and gamma determined using a grid search approach cross-validated within the training set. The fit score at each position was then used as the transformed linear parameter. Binary parameters were assigned a 1 if true or a 0 if false. All linearized parameters were z-standardized and fit with elastic net linear regression, with different L1/L2 ratio values [.5, .75, .9, .99, 1] as suggestions.

Calculating features[2]

ViennaRNA package was used with default parameters to calculate RNA folding metrics. The signal at each base of the target site including the PAM was averaged to calculate the chromatin features. Bx-python module was used to extract the processed continuous signal from the following BigWig files obtained from the ENCODE Consortium: MNase-seq https://www.encodeproject.org/files/ENCFF000VNN/ (Michael Snyder lab, Stanford University), DNase-seq https://www.encodeproject.org/files/ENCFF000SVY/ (Gregory Crawford lab, Duke University), and FAIRE-seq https://www.encodeproject.org/files/ENCFF000TLU/ (Jason Lieb lab, University of North Carolina) (ENCODE Project Consortium, 2012).

Method

^{[1]}translates the top-level features into low-level variables/features which were calculated

^{[2]}, linearized and z-standardized.

Fig-1: The top-level feature “distance” was translated into 8 low-level features: primary and secondary upstream/downstream normal distances & squares of distances from TSS

Note: ElasticNetCV is a combination of Lasso and Ridge regressions. Lasso takes care of data set’s collinearity issues and ridge makes sure the regression model is not overfitted.

Findings

Note: The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.

[https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html]

As per the linear model that was generated, the following features (Table-1) contributed to the target activity score with corresponding coefficients (Fig-2).

Table-2: Low-level variables/features of regression model with highest R 2 value.

Fig-2: Coefficients of low-level variables/features of regression model with highest R 2 value.

Discussion

Click here for the results.

Scatter plots

Fig-3: Scatter plots of primary/secondary upstream/downstream distances from TSS vs our algorithm score.

However, when we used squares of primary/secondary upstream/downstream distances from TSS to guide, we identified a strong & clear linear relationship.

Fig-4: Scatter plots of squares of primary/secondary upstream/downstream distances from TSS vs our algorithm score.

Model Generator Flowchart

Limitations

- Availability of large quantity of qRT-PCR training data is difficult to obtain than phenotype based data
- Model is trained based on human genome hg19

Acknowledgment: Our model was inspired from https://elifesciences.org/articles/19760 research paper. We extend our sincere gratitude to all the authors & contributors of the paper.