Predictions of Diversity of Tags
Introduction
In our project, we replaced sgRNA with hgRNA to increase barcode diversity. Previous literature has shown that the diversity of barcodes generated by hgRNA is more than eight times that of sgRNA, but the total number of variants, which reflects diversity directly, is still unknown. To demonstrate the advantage of using hgRNA, we developed models to estimate the total number of variants.
Models and Results
1. Selection of hgRNA
We consulted the literature to obtain relevant data on hgRNAs. In [1], the researchers listed various properties of 60 hgRNAs expressed in mice when combined with CRISPR. They extensively characterized the activity profile of each hgRNA, which we decided to use as a screening index. They classified the hgRNAs into four categories:
hgRNAs suitable for lineage tracing should produce new variants continuously during the organism's development, and the number of variants produced at each stage should be sufficient to label different cells. Considering these two points, we chose the "mid" hgRNAs. We selected #22 hgRNA as an example and analyzed all of its variant and frequency data:
2. The number of barcodes generated by a single hgRNA
We assume that the $i$th mutant of #22 hgRNA contributes a random number $X_i$ of individuals to the sample, where $X_i = 0, 1, 2, \ldots$. If $X_i = 0$, the $i$th mutant is unobserved. We further assume that $X_1, \ldots, X_N$ are i.i.d. (independent and identically distributed) with distribution $P = P(\cdot, \theta)$, where $\theta$ denotes the parameters to be estimated. The total number of variants $N$ produced by this hgRNA is then the sum of the number of variants observed in the sample, $m$, and the estimated number of unobserved variants, written $P(X = 0, \theta)$.
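The estimator described above can be written compactly in terms of frequency counts. The symbol $f_j$ (the number of mutants seen exactly $j$ times) is notation introduced here for illustration, following the usual convention in richness estimation; $\hat{f}_0$ is the fitted value of the unobserved term that the text adds to $m$.

```latex
\begin{align}
f_j &= \#\{\, i : X_i = j \,\}, \qquad j = 0, 1, 2, \ldots \\
m &= \sum_{j \ge 1} f_j \\
N &= m + f_0 \\
\hat{N} &= m + \hat{f}_0
\end{align}
```

Only $f_1, f_2, \ldots$ are observable, so $f_0$ has to be extrapolated from a model fitted to the observed counts.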
The frequency distribution of the hgRNA data above matches the typical shape described in the literature on species richness estimation [2]: a steep rise on the left and a long tail on the right, representing many rare mutants and a few abundant ones. We therefore expected that some of the common distribution functions mentioned in that literature could fit the data. Other functional forms were also considered.
First, we used MATLAB's Curve Fitting Toolbox to fit the frequency-count data.
Fitting function 1: $f(x) = a e^{bx}$ (to prevent overfitting, only the first-order exponential is used here).
The fitting result is as follows:
Figure 1: Frequency and count data of the #22 hgRNA mutants
In this model, $a = 1100$ and $b = -1.736$, with RMSE (root mean squared error) $= 2.32$ and $R^2 = 0.9975$, indicating a good fit. We obtain $P(X = 0) = 1100$, so $N = m + P(X = 0) = 263 + 1100 = 1363$.
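The arithmetic behind this estimate can be spelled out as follows, assuming the fitted curve has the first-order exponential form $f(x) = a e^{bx}$ (the form of MATLAB's `exp1` library model). Under that assumption, extrapolating to frequency zero returns the coefficient $a$ itself, and $b$ does not enter the unobserved term.

```latex
\begin{align}
f(0) &= a\,e^{b \cdot 0} = a = 1100 \\
N &= m + f(0) = 263 + 1100 = 1363
\end{align}
```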
To test robustness, we further introduced a right-truncation point $\sigma$: the model is fitted only to the frequency counts up to $\sigma$, and the final estimate is obtained by adding the number of mutants with abundances greater than $\sigma$. This approach is motivated by the fact that the estimate of $P(X = 0)$ depends largely on the low-order frequency counts. For $\sigma \in \{21, 24, 26, 29\}$, we obtained $N = 1363$ in every case, indicating that fitting the data with an exponential function is robust.
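The right-truncation check can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used above: it swaps MATLAB's nonlinear least-squares fit for an ordinary least-squares fit on the logarithm of the counts, which can give different estimates on noisy data. The function names `fit_exponential` and `estimate_total` and the frequency counts are hypothetical; the counts are synthetic and are not the #22 hgRNA data.

```python
import math


def fit_exponential(freq_counts):
    """Fit f(x) = a * exp(b * x) by least squares on log(f).

    freq_counts maps a frequency j to the number of variants seen exactly
    j times. Zero counts are skipped because log(0) is undefined.
    Assumption: log-linear fit, not MATLAB's nonlinear least squares.
    """
    points = [(j, math.log(c)) for j, c in freq_counts.items() if c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def estimate_total(freq_counts, sigma):
    """Estimate N, fitting only the counts at frequencies <= sigma."""
    low = {j: c for j, c in freq_counts.items() if j <= sigma}
    a, _ = fit_exponential(low)
    unseen = a  # f(0) = a * exp(b * 0) = a
    observed = sum(freq_counts.values())  # m, including variants above sigma
    return observed + unseen


# Synthetic counts for illustration only: f_j = 1024 * 2**(-j) for j = 1..10,
# plus two hypothetical abundant variants seen 30 times each.
synthetic = {j: 1024 // 2 ** j for j in range(1, 11)}
synthetic[30] = 2

for sigma in (4, 6, 10):
    print(sigma, round(estimate_total(synthetic, sigma), 1))
```

With these synthetic counts the estimate is approximately 2049 for each truncation point shown, because the low-order counts lie exactly on one exponential curve. If $\sigma$ were raised to include the two abundant variants at frequency 30, they would enter the fit and shift the estimate, which is the effect the truncation is meant to avoid.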
When we tried to fit the data with a mixture of exponentials, the Curve Fitting Toolbox performed poorly because of the larger number of parameters, so we looked for other software to perform this part of the fitting. We found CatchAll, a program used to estimate the richness of macro- and microspecies. According to the literature [3], its fitting functions include the mixture of exponentials mentioned above, and its basic form is as follows:
The best model selected by this software was the two-mixed-exponentials function, which predicted a total of 1401 variants.
Discussion
The CatchAll result (1401) is close to the result of Fitting function 1 (1363); the two differ by less than 3%, so both fitting methods appear to be feasible. We also analyzed the possible reasons why the estimated value is much higher than the value observed in the literature:
1. The sample size in the literature experiment was insufficient, so many variants were not observed;
2. The total number estimated by this model includes variants with large deletions, which deactivate the hgRNA so that it cannot be cut again. Therefore, these variants will not be observed but will still be reflected in the estimated number of variants.
References
[1] Kalhor, R., Kalhor, K., Mejia, L., Leeper, K., Graveline, A., Mali, P., & Church, G. M. (2018). Developmental barcoding of whole mouse via homing CRISPR. Science (New York, N.Y.), 361(6405), eaat9804.
[2] Bunge, J., & Barger, K. (2008). Parametric models for estimating the number of classes. Biometrical journal. Biometrische Zeitschrift, 50(6), 971–982.
[3] Bunge, J., Woodard, L., Böhning, D., Foster, J. A., Connolly, S., & Allen, H. K. (2012). Estimating population diversity with CatchAll. Bioinformatics (Oxford, England), 28(7), 1045–1047.