

Excellence in Another Area: The Identification Of Viral Recombination Breakpoints

The William and Mary 2020 iGEM team sought to tackle the threat of viral pandemics on multiple fronts. In addition to designing a preventative treatment for future pandemics in the form of a broad-spectrum probiotic, our iGEM team studied one vehicle of viral evolution in the hope of improving our ability to predict future patterns of viral evolution. The main mechanisms of viral evolution include point mutations, gene duplication, and viral recombination (Nasir et al. 2012). We were inspired to focus on viral recombination because there is significant reason to suspect that recombination could have given rise to the current COVID-19 pandemic (Patiño-Galindo et al. 2019). Viral recombination can lead to vast phenotypic variation among viruses. In the presence of an intermediate host, viruses that have evolved to target different species can exchange genetic information, which produces phenotypic variation on a fast timescale (Bentley and Evans 2018). Many pressing questions about viral recombination also remain unanswered. Researchers are still unsure about the exact mechanisms that contribute to recombination in viruses (Bentley and Evans 2018). Even though these mechanistic questions remain open, many computer programs exist that aim to pinpoint the locations of recombination along viral genomes. The accurate identification of recombination breakpoints is critical to the study and characterization of viral recombination. We therefore found it essential to evaluate the capabilities and possible pitfalls of programs that aim to identify recombination events.

To study the current state of recombination detection, our iGEM team carried out a thorough literature review of the most widely used program for identifying recombination breakpoints in viruses, Recombination Detection Program (RDP), and we have highlighted inaccuracies in the most current version of RDP (RDPv4) that are often overlooked when it is used. We also investigated the methods currently used in the literature for simulating viral genomic datasets along an evolutionary lineage that incorporates recombination, and how these simulations are used to evaluate RDP. Our team then created its own simulated genomic datasets using the methods found in the literature and evaluated the performance of RDPv4 on these datasets.

The initial goal of our iGEM team was in fact something much more daunting. We hoped to create a computer program that could partition a viral genome into areas in which recombination is likely to occur, otherwise known as recombination hotspots. This would greatly enhance the current capabilities for predicting viral evolution. Programs of this kind already exist to classify DNA recombination hotspots in mitotic recombination, and our group took much inspiration from them (Zhanga et al. 2019). Creating a machine learning program of this sort requires a large dataset containing information on the location and characteristics of recombination breakpoints, so our group began to identify recombination breakpoints in coronavirus genomes. In keeping with the principles of the scientific method, we analyzed and tested each step of this process and discovered that identifying recombination breakpoints accurately, which at first appeared quite simple, raised many issues.


Problems In Identifying Recombination Breakpoints

Our team chose Recombination Detection Program version 4 (Martin et al. 2015) as our recombination identification tool, mainly because of its large presence in the field, with its most recent iteration having 1,300 citations, and its wide usage in identifying recombination among viruses. RDPv4 uses eight different programs to detect recombination breakpoints and then compares their results. These programs are RDP, GENECONV, BOOTSCAN, MAXCHI, CHIMAERA, SISCAN, LARD, and PHYLPRO (Martin et al. 2015). If the same breakpoint is identified by a certain number of these programs, RDP selects it as a breakpoint. To identify recombination breakpoints, RDP compares three viral strands at a time and searches for similarities along the genome. If a possible daughter recombinant contains one region that is very similar to one viral strand and another region that is similar to a different viral strand, it is marked as a possible recombinant and as a daughter of the two parental viruses. Possible recombinants are then validated by RDP using phylogenetic information. Each of the programs also incorporates additional methods that go beyond this basic approach to identifying recombination. The RDP instruction manual contains a brief description of each of its incorporated programs (Bentrand et al. 2016).
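The agreement rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RDP's actual code: the function name, the event tuples, and the threshold of three methods are assumptions made for the example, and real events would rarely match at exactly the same coordinates.

```python
from collections import defaultdict

def consensus_events(calls, min_methods=3):
    """Keep events reported by at least `min_methods` distinct methods.

    `calls` maps a method name to the set of events it reported. Each event is
    a hypothetical (recombinant, begin, end) tuple in alignment coordinates.
    """
    support = defaultdict(set)
    for method, events in calls.items():
        for event in events:
            support[event].add(method)
    # Report each accepted event with the sorted list of supporting methods.
    return {event: sorted(methods)
            for event, methods in support.items()
            if len(methods) >= min_methods}

# Hypothetical output of three of the detection methods.
calls = {
    "RDP": {("seqA", 1200, 2500)},
    "GENECONV": {("seqA", 1200, 2500), ("seqB", 300, 900)},
    "MAXCHI": {("seqA", 1200, 2500)},
}
accepted = consensus_events(calls)
```

Here only the event on seqA is kept, since the seqB event is supported by a single method. Raising or lowering `min_methods` trades false positives against missed events, which is why the choice of this threshold matters.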

RDPv4 requires an alignment to be performed before it can analyze the genomes. We began by attempting to analyze a set of Delta Coronaviruses for recombination breakpoints, a dataset containing 135 viral genomes. Having read the RDPv4 instruction manual thoroughly, we knew that the alignment strongly affects the results of RDPv4 and that a poor alignment can produce an overload of falsely identified recombination events (Bentrand et al. 2016). To avoid this problem our group used the alignment programs recommended in the instruction manual, ClustalX (Larkin et al. 2007) and Muscle (Edgar 2004). In addition, we tried Mafft (Katoh et al. 2013) and Blast (Johnson 2008) to perform our alignments. Our group was surprised by the large disparities in the results from RDP when different alignment programs were used. The number of recombination breakpoints differed greatly, for example 150 breakpoints when Muscle was used for the alignment and 85 when Mafft was used, and so did the distribution of the breakpoints. Pictured below is the distribution of recombination breakpoints along the genome together with the p-value of each recombination event. A p-value is assigned to each breakpoint to evaluate the likelihood that it is indeed a real recombination breakpoint.


Breakpoints Using Mafft Alignment


Breakpoints Using Muscle

Our group even attempted to work around this issue by directly comparing the breakpoints from different alignment programs, but found this to be nearly impossible because few, if any, breakpoints from different alignments fell close enough together in the genome to pinpoint the location of each recombination breakpoint. Alignment programs insert gaps between nucleotides in order to give a better overall alignment between sequences. As a result, genome lengths and positions vary among alignment programs, which makes direct comparison unreliable. The variation between alignment programs and their respective outputs from RDP was a major source of concern for our group. If alignments made with different programs resulted in different recombination breakpoints, how could we be sure that the breakpoints identified in any of them could be trusted?
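The coordinate problem described above can be illustrated with a short sketch. One way to make positions from different alignments comparable, at least for a single strand, is to convert each alignment column back to a position in that strand's ungapped sequence. The function name and the toy rows below are assumptions for illustration and are not taken from our actual alignments.

```python
def ungapped_position(aligned_seq, column, gap_chars="-."):
    """Map a 0-based alignment column to a 0-based position in the ungapped sequence.

    Returns None when the column falls on a gap in this sequence.
    """
    if aligned_seq[column] in gap_chars:
        return None
    # Count the real nucleotides that come before this column.
    return sum(1 for ch in aligned_seq[:column] if ch not in gap_chars)

# The same hypothetical strand as two aligners might output it.
muscle_row = "ATG--CCGTA"
mafft_row = "ATGCC---GTA"

pos_from_muscle = ungapped_position(muscle_row, 7)  # column 7 holds the first G after the gap
pos_from_mafft = ungapped_position(mafft_row, 8)    # column 8 holds the same G
```

Both calls refer to the same nucleotide even though the alignment columns differ (7 versus 8). This only fixes the bookkeeping: it does not address the deeper issue that different alignments lead RDP to report different events in the first place.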

Along with problems concerning the alignments, our group ran into issues when using RDPv4 itself. The largest of these was the choice of parameters in RDP, which fine-tune the sensitivity of RDP's performance. While the RDP instruction manual does an excellent job of explaining the significance of each parameter, we found that many researchers simply use the default parameters (Bentrand et al. 2016). Many papers that do vary the parameters offer little detail on how they arrived at the values they chose (Hon et al. 2019). For example, we found it difficult to decide on the "list events" parameter, which sets the number of programs inside RDP that must agree in order for RDP to identify a recombination breakpoint. There is also the option to exclude the results of any one of the programs incorporated into RDP, yet RDP gives no direct advice on which programs should be excluded or included.

To ensure the validity of the results from RDP we began to look for researchers who have used RDP and consciously taken steps to circumvent the issues we found.


Papers Which Use Methods To Improve The Performance Of RDPv4

In an attempt to resolve the problems we encountered with RDP, our team searched the literature to see whether others had encountered similar issues and how they dealt with them. Often we found that the methods proposed to correct RDP's mistakes led to more problems and questions. Below we summarize the extent to which the literature has attempted to maximize the performance of RDP.

Concerning alignments, it was a challenge to find a paper that explained why the authors chose one alignment program over another. In brief, we found that the most common choices were Mafft and Clustal, while programs like Muscle were also used. We also found papers which mentioned that the alignments were manually edited but did not explain how they were edited or what the purpose of the edits was (Patiño-Galindo et al. 2020).

Apart from working with the alignment beforehand, some researchers have attempted to retroactively evaluate the validity of the recombination breakpoints identified by RDP. In the paper "Evidence of ancient papillomavirus recombination" (Varsani et al. 2006), the authors checked the validity of each recombination breakpoint identified by RDP using two separate tests. Both tests involved isolating the two parental strands and the daughter recombinant strand that RDP identified and then re-evaluating them in isolation. In the first test, the researchers extracted the regions between the recombination breakpoints and realigned them in isolation. Using a chi-squared test, they compared the nucleotide matches and mismatches in the isolated alignment with those in the original alignment. If the chi-squared test gave p < 0.1 the recombination event was kept; otherwise it was discarded (Varsani et al. 2006). In the second test, the three strands were taken in their entirety, realigned using the same alignment program, and re-analyzed by RDP; if the same recombination breakpoint was identified, it was kept (Varsani et al. 2006). Of the 529 recombinants originally identified by RDP, only 10 passed both tests. The researchers opted not to comment on the performance of RDP and instead focused on what each recombination breakpoint revealed about recombination in papillomaviruses. In our group's view this method likely underestimates the number of real recombination events, because a few programs inside RDP draw on information from across the whole genome, and that information is unavailable when strands are analyzed in isolation. That being said, this method could be applied to a dataset with a very large number of breakpoints when it is not necessary to identify every breakpoint.
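To make the first test more concrete, here is a minimal sketch of a chi-squared test on match and mismatch counts. The exact contingency table used by Varsani et al. is not described here, so the 2x2 layout (original versus isolated alignment, matches versus mismatches), the function name, and the counts are all assumptions for illustration.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) on the table [[a, b], [c, d]].

    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one count")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: (matches, mismatches) in the original alignment,
# then (matches, mismatches) in the alignment done in isolation.
stat, p = chi_squared_2x2(180, 20, 150, 50)
keep_event = p < 0.1  # threshold reported for the first test
```

With these made-up counts the statistic is about 15.6 and the event would be kept. For small counts an exact test would be a safer choice than the chi-squared approximation.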

In the same vein, the researchers in the paper "Characterization of New Recombinant Forms of HIV-1 From the Comunitat Valenciana (Spain) by Phylogenetic Incongruence" identified nine putative recombinants in a sample of HIV genomes collected in Valencia, Spain, and assessed the confidence in each recombination event. To evaluate each event the researchers used phylogenetic information in what they termed a "modified phylogenetic congruence pipeline" (Bemund et al. 2019). It is worth noting that because a recombinant contains genetic information from both parentals, the recombinant strand often moves greatly across phylogenetic trees when multiple trees are created. In some iterations of the tree the recombinant clusters closer to the first (major) parental strand, and in others closer to the second (minor) parental (Bemund et al. 2019). In fact this information is used by some recombination identification programs to identify recombinants (citation needed). The researchers recreated this process on a more precise scale by isolating the genomic region between the recombination breakpoints, aligning it in isolation, and then testing for phylogenetic movement (Bemund et al. 2019). The authors used MAFFT for the alignment and IQ-TREE to create multiple phylogenetic trees in which to observe movement. They concluded that two of the nine recombinants originally identified by RDPv4 had in fact evolved from one of the other strains and were not new recombinants. This method therefore seems promising for filtering out false recombinants, although the process is relatively time consuming and has not been tested on datasets with large numbers of recombination events.

These two studies offer the most promising methods for filtering the results from RDPv4. When a small genomic dataset is to be analyzed, researchers can also compare the results from multiple recombination identification programs. This is possible because smaller datasets contain fewer breakpoints that overlap or lie close together. For example, in the paper "Molecular Evolution of Zika Virus during Its Emergence in the 20th Century" (Faye et al. 2014), the researchers studied recombination in Zika virus using RDP along with Rec-HMM. They found slight overlap in recombination breakpoints, in that some occurred in the same terminal but were identified at different nucleotide positions. They noted that RDP identified 13 possible recombinants while Rec-HMM identified only 8 (Faye et al. 2014). The comparison between the two programs served the authors well, as they were concerned with the approximate location of breakpoints and only needed to know general patterns of where recombination may occur. The comparison of multiple programs is therefore useful, but it remains inaccurate with respect to the exact location of the breakpoints and does not correct the mistakes of each individual program.


Papers Which Evaluate Programs Inside Of RDP

Seeing that many researchers have actively tried to circumvent the problems inside RDP, our team began to question the performance of RDP itself, and we looked for papers that focused on evaluating it. The main method of evaluation is simulation: a set of viral genomes is simulated with recombination incorporated, RDP is run on the simulated genomes, and it is evaluated on the number of breakpoints it correctly identifies, the number it fails to identify, and the number of falsely identified recombination events. Simulations are the most prevalent form of evaluation because the ground truth, including the exact location of each recombination breakpoint, is known. Because RDP consists of multiple programs, researchers have evaluated both RDP as a whole and each individual program inside RDP.

The researchers in the paper "Recombination Detection Under Evolutionary Scenarios Relevant to Functional Divergence" (Bay et al. 2011) studied the performance of each program inside RDP independently. Multiple genomes were simulated using varying tree topologies to quantify the effect of "asymmetric tree topology and sequence divergence, non-stationary codon bias and selection pressure, and positive selection" (Bay et al. 2011) on false positive rates in GENECONV, MaxChi, Chimaera, RDP, GARD-SBP, and GARD-MBP. All of these are incorporated into RDPv4 except GARD-SBP and GARD-MBP. The program INDELible (Fletcher and Yang 2009) was used to create the simulated sequences. The researchers counted a false positive solely as a strand that was identified as a recombinant but was not a true recombinant, and they were not concerned with the location of the breakpoint identified by the program. To test the effect of asymmetric tree topology, sequences 200 codons in length were evolved along a 16-taxon phylogeny that was either symmetric or asymmetric (Bay et al. 2011). The effect of sequence divergence was also measured by varying the substitution rate per nucleotide from 1 to 10.

They noted that for the programs incorporated into RDP there was generally a significant increase in false positive rate with increasing sequence divergence. MaxChi and Chimaera in particular had false positive rates ranging from 6% to 56% and from 8% to 60% respectively. RDP and GENECONV, by contrast, were relatively little affected by sequence divergence, with a maximum false positive rate of 20% for RDP and, for GENECONV, average false positive rates of 2.8% on asymmetric trees and 4.8% on symmetric trees. Asymmetric tree topologies had little to no effect on any of the programs incorporated into RDP (Bay et al. 2011). In their second set of simulations the authors investigated the effect of non-stationary evolution on recombination detection through codon bias, that is, when one nucleotide, for example C or G, is favored during point mutation (Bay et al. 2011). As in the first simulation, sequences 200 codons in length were simulated along a 16-taxon phylogeny, this time with a constant substitution rate per nucleotide of 4. The codon bias was introduced at the center of the phylogenetic tree, which in practice separated the sequences into two types, "type A" and "type B", of which type B experiences codon bias. Codon bias was modeled using the method of Aris-Brosou and Bielawski (Aris-Brosou and Bielawski 2006), which uses a parameter η to assign a varying GC substitution rate along the sequences. The authors concluded that a shift in codon bias does not affect false positive rates.

In their final set of analyses the researchers examined the power of recombination identification programs under different levels of sequence diversity. Simulated datasets were borrowed from a previous study in which they had only been analyzed with GARD-MBP (Kosakovsky Pond et al. 2006). The datasets consist of 8-taxon alignments of sequences 3000 bp in length with 0, 1, 2, 4, or 8 recombination events. For each number of recombination events, two sets of simulations were done, one with 5% nucleotide diversity and one with 25% nucleotide diversity. For each of these datasets 100 replicates were generated.

The authors generalized from the simulations that as genetic diversity increases, the programs gain power to identify recombination. Furthermore, they noted that as the number of recombination events increases, the programs are more likely to detect recombination. These results are not surprising, but they highlight the fact that these programs often underestimate the number of recombination events and are prone to false positives.

One of the original papers to use simulations to analyze recombination identification programs is "Evaluation of methods for detecting recombination from DNA sequences: Computer simulations" (Posada, Crandall 2001). Power and false positive rates were evaluated for 14 different methods of detecting recombination events: BOOTSCANNING (as implemented in SimPlot), GENECONV, Homoplasy Test, Informative Sites Test (PIST), Maximum χ2 (MAXCHI), Maximum mismatch χ2 (CHIMAERA), Phylogenetic Profiles (PHYLPRO), Partial Likelihood (PLATO), RDP, Recombination Parsimony (RECPARS), Reticulate, Runs Test, Sneath Test, and Triple. Of the 14 methods studied in this paper, RDP, MAXCHI, CHIMAERA, and GENECONV are implemented in RDPv4. Two simulated genomic datasets were created, one to measure the power of each recombination program and one to measure the false positive rate. The authors used four different scenarios of coalescent evolution that include recombination. The parameters they chose are in the table below, where s is the number of replicates, n is the sample size, l is the length of the sequences in bp, N is the effective population size, μ is the mutation rate in mutations per site per generation, Ө is the population mutation parameter where Ө = 4Nμl, r is the recombination rate in recombinations per site per generation, ρ is the average pairwise sequence divergence, R(n) is the expected number of recombination events, and α is the rate variation among sites.

It is important to note that the authors considered an identification of recombination successful if the respective detection method identified the recombinant strand correctly, and not necessarily the location of the recombination breakpoint. Posada and Crandall note that the different recombination detection methods have quite distinct performance. In terms of power, CHIMAERA, MAXCHI, and GENECONV had low power to identify recombination events at low sequence diversity and showed "little increase in detection even with increasing amounts of recombination" (Posada, Crandall 2001), PHYLPRO performed the best overall in detecting recombination, and RDP showed increasing power as the number of recombination events increased (Posada, Crandall 2001). With respect to false positives, all programs showed similar false positive rates of around 5% across all levels of sequence diversity and site variation. It is important to note, though, that the sequences used to calculate false positive rates contained no recombination events at all, so the effect of existing recombination breakpoints on false positive rates was not considered. Posada and Crandall conclude that several recombination events are likely necessary for any of these programs to identify recombination.

These papers show that, in isolation, the individual programs inside RDP are prone to false positives and often lack the power to identify recombination events. Furthermore, there is a need in the literature to evaluate the ability of the programs in RDP to accurately predict the location of recombination breakpoints. As most of this literature is from the early 2000s, these studies should be updated to reflect any changes in the programs. It is also important to note that RDP slightly changes the algorithm of each program to fit its implementation, so the performance of each method inside RDP may differ slightly from its performance in isolation.


Evaluating RDPv4 In Its Entirety

Along with papers that evaluate the performance of the programs inside RDP, there are a few newer papers which evaluate RDP in its entirety.

The paper "Revisiting Recombination Signal in the Tick-Borne Encephalitis Virus: A Simulation Approach" (Bertrand 2016) was one of our biggest warning signs concerning RDP. The authors evaluate the performance of RDPv4 on the tick-borne encephalitis virus (TBEV) in terms of true and false positive rates through the analysis of edited TBEV genomes, with recombination events simulated at several locations on the virus's phylogenetic tree (Bertrand 2016). The researchers created two simulated datasets, one without simulated recombination and one with it. First, sequenced TBEV genomes were collected from GenBank and separated into three groups based on the monophyletic subtypes: Western European (W), Far Eastern (FE), and Siberian (S). An alignment was made for each subtype using MAFFT v7.0. RDPv4 was then run on these three alignments and all sequences identified as recombinants were removed from the dataset. To incorporate lineage-specific evolution, "the gene was considered to be the unit of selection, thus each gene region was endowed with its own genealogy" (Bertrand 2016). The genome was therefore separated into genes, and a Bayesian phylogenetic tree was created for this dataset using BEAST, along with the mutation rates for each gene region. MODELTEST was then used to construct the ancestral sequence for each individual gene region. To create the dataset without recombination, SeqGen was used to simulate the evolution of the ancestral sequences along the Bayesian phylogeny, incorporating the mutation rates found for each region. FastTree was then used to analyze the phylogenetic tree resulting from the first round of SeqGen simulations, and weak branches along the tree were collapsed. SeqGen was then used again to simulate the evolution of the ancestral sequences along the phylogenetic tree modified by FastTree.

The genes were then concatenated to form five simulated full-length genomes. Finally, RDPv4 was used to analyze the simulated genomes, and any identified recombinants were recorded as false positives. To simulate genomes with recombination, exactly the same steps were taken except that recombination was incorporated into the SeqGen simulations. Three recombination events were incorporated for the S and W subtypes and four for the FE subtype. The recombination events were of various lengths and were placed at varied locations along the genome. RDPv4 was then used to analyze the simulated genomes in terms of false positives and true positives.

In the recombination-free dataset there was a large number of false positives: "False positive rates greatly exceeded the 5% expectation, with large variations depending on the method and the targeted subtype." The authors note that false positive rates dip under 5% only when the extremely low p-value cutoff of 1.0E-9 is used, which is much lower than the standard value of 0.05 that is commonly used in the literature and is the default setting in RDP. In the datasets incorporating recombination, the number of false positives clearly increased. Additionally, RDP showed weak power in detecting the recombination events, and this power decreased as the size of the events and their depth in the phylogenetic tree increased. [See supplemental file 1 of the paper for the exact results for each subtype, pone.0164435.s001.pdf.] Overall the authors concluded that the false positive rates of RDPv4 are greatly underestimated and that, while recombination is relevant in the evolution of TBEV, its effects are overestimated. This is the main paper in the literature which evaluates RDPv4 in its entirety via a simulation approach. More research should be done to validate RDPv4.


Our Simulations

In order to contribute to the literature on the evaluation of RDPv4, we at William and Mary iGEM performed our own simulations of viral evolution incorporating recombination. We used the program NetRecodon 5.0.6 (Arenas, Posada 2010), a coalescent simulator that incorporates recombination. Parameters for viral mutation rate, in substitutions per site per generation, and average genome size were sourced from the literature (Sanjuan et al. 2010). For each simulation we simulated 50 viral strands and set the number of recombination events to 10, 20, or 30. We performed multiple rounds of simulation to represent how RDP would perform on poliovirus, hepatitis C, influenza A, and HIV. Each of these viruses has a different mutation rate, which results in different levels of nucleotide similarity, since mutation is one of the main factors determining nucleotide similarity in simulation. For each simulation an outgroup was used to represent a virus that may have diverged from the rest of the family a long time ago. Below are the parameters used.


Here is an example of the command given to NetRecodon to create a simulated genome for poliovirus with a mutation rate of 9.0e-5, a length of 7400 bp, and 10 recombination breakpoints:

./NetRecodon5.0.6 -u9e-5 -n1 -s50 -l7440 -*2 -bPolio10 -r.00000003 -dBreakpoints -ktimes -o0.1 -w10

The parameters -b, -d, and -k simply name the sequence file, the file recording the location of the breakpoints, and the file recording the time of each breakpoint, respectively.

These simulated genomes were then analyzed in RDPv4. To count as a successful identification, RDP had to place the breakpoint within 20 bp of the real breakpoint; any breakpoint identified more than 20 bp away was considered a false identification. If a breakpoint was correctly identified multiple times, or the same breakpoint was misidentified multiple times, this counted as only one true or false positive respectively. Pictured below are our results along with the mutation rate and sequence length of each virus (Sanjuan et al. 2010).
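The scoring rule above can be written down as a short function. This is a sketch of the rule as stated, not the script we actually used: the function name and the example positions are illustrative assumptions, and "within 20 bp" is treated as inclusive.

```python
def score_breakpoints(true_bps, predicted_bps, tolerance=20):
    """Return (true_positives, false_positives, missed) for one simulation.

    A prediction within `tolerance` bp of a real breakpoint is a hit. Repeated
    hits on the same real breakpoint count once, and a repeated wrong position
    counts as one false positive.
    """
    found = set()        # real breakpoints matched at least once
    false_calls = set()  # distinct predicted positions with no real breakpoint nearby
    for pred in predicted_bps:
        nearest = min(true_bps, key=lambda t: abs(t - pred), default=None)
        if nearest is not None and abs(nearest - pred) <= tolerance:
            found.add(nearest)
        else:
            false_calls.add(pred)
    missed = len(set(true_bps)) - len(found)
    return len(found), len(false_calls), missed

# Hypothetical positions: 1010 and 995 both hit the breakpoint at 1000,
# 3100 is reported twice but is 100 bp from the nearest real breakpoint.
result = score_breakpoints([1000, 3000, 5200], [1010, 995, 3100, 3100, 6000])
```

For this example the function returns one true positive, two false positives, and two missed breakpoints. Changing `tolerance` shows how strongly the apparent accuracy depends on how close a call must be to count as correct.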

In our simulations, RDP consistently performed better as the number of recombination events increased. In every case with 30 breakpoints the true positives outnumbered the false positives, so there is reason to believe that RDP is slightly more accurate when there is a large number of possible recombination breakpoints. Note that RDP performed worst on influenza A, likely because of its lower mutation rate and large sequence length. RDP performed best on hepatitis C, likely because of its higher mutation rate. A higher mutation rate can lead to increased diversity among the sequences, which allows RDP to pinpoint recombinants more easily because the two parental strands are more genetically distinct. Even so, RDP did not perform particularly well on any of these simulated datasets, identifying more than 50% of the breakpoints correctly in only two instances: hepatitis C with 10 breakpoints and influenza A with 20 breakpoints. Often the number of false positives rivaled or even exceeded the number of correctly identified breakpoints, so that nearly half of all identified breakpoints were faulty. This is especially concerning because in a simulated dataset the exact parental strands are guaranteed to be present, which should make the recombinant easier to recognize. That is not the case for sequences collected from the environment, where often a close relative of the parental strand is collected rather than the parental strand itself.


Advice To Future Researchers Concerning RDP

RDP is a great tool for identifying possible recombination breakpoints, but the word "possible" should not be forgotten. Here are a few tactics our team recommends to ensure that RDP is used without overestimating the number of recombination breakpoints.

With regard to alignments, we recommend that researchers try multiple alignment programs before choosing which one to use for the final analysis, and that they choose the one which gives the fewest recombination breakpoints. The RDP handbook notes that errors in alignments can lead to the false identification of recombination breakpoints (citation), so whichever alignment gives the fewest recombination breakpoints is likely to have the smallest number of false positives due to alignment errors. Additionally, if there are areas of poor alignment, in which there is relatively little nucleotide overlap between strands, this should be kept in mind when viewing the results of RDP. Each recombinant region should be cross-referenced against the percent nucleotide similarity; if there is low similarity, this should call into question the validity of the identified recombination event.
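As a rough illustration of the similarity check suggested above, the following sketch computes the percent identity between two aligned strands over a reported recombinant region. The function name, the choice to skip gapped columns, and the toy sequences are assumptions. What counts as "low" similarity is left to the researcher.

```python
def percent_identity(seq_a, seq_b, start, end, gap_chars="-."):
    """Percent of identical bases between two aligned rows over columns [start, end).

    Columns where either row has a gap are skipped; returns None if none remain.
    """
    compared = matches = 0
    for a, b in zip(seq_a[start:end], seq_b[start:end]):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        matches += a.upper() == b.upper()
    return 100.0 * matches / compared if compared else None

# Hypothetical aligned rows for a putative recombinant and one parental strand.
recombinant = "ATGCCGTAAC"
parent = "ATGACG-AAC"
identity = percent_identity(recombinant, parent, 0, 10)
```

In this toy case nine columns are compared and eight match, giving roughly 89% identity. Running the same check on the region RDP reports, against each proposed parental strand, gives a quick sense of whether the alignment there is trustworthy.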

To maximize the performance of RDP, simulations should be used. A simulated dataset should be made that is as similar as possible to the genomes which are to be analyzed. This dataset should be evolved from a sample viral strand chosen from the dataset of interest, using mutation and recombination rates sourced from the literature. The simulated dataset can also be compared to the original in terms of nucleotide similarity among the strands of each dataset. The simulated dataset should then be run through RDP, and the false positive and false negative rates should be recorded. If RDP does not perform well on the simulated dataset, this is a good sign that great caution should be taken when interpreting the results of RDP on the real dataset. If RDP does perform well on the simulated dataset, the parameters inside RDP can then be adjusted using the simulated dataset to maximize performance.

If the exact location of breakpoints is not critical, other methods can be used to find general regions along the genome where recombination is common. In this case, the RDP results obtained from the outputs of multiple alignment programs can be compared, and only regions in which recombination breakpoints are identified in common should be considered. In addition to using multiple alignment programs, multiple recombination identification programs can be used. This would resemble the approach taken in the paper Molecular evolution of Zika virus during its emergence in the 20(th) century (Faye et al. 2014), where the researchers compared the results of both RDP and Rec-HMM.
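The comparison across runs can be sketched as follows. This is a simplified illustration under our own assumptions: each run (one alignment program or one detection program) is reduced to a list of breakpoint positions in a shared coordinate system, the genome is cut into fixed windows of a hypothetical size, and a window is kept only if every run places at least one breakpoint in it.

```python
def consensus_windows(runs, genome_length, window=200):
    """Return the (start, end) windows in which every run reports a breakpoint.

    `runs` maps a run label to its list of breakpoint positions. Positions
    are assumed to share one coordinate system across all runs.
    """
    supported = None
    for breakpoints in runs.values():
        # windows hit by this run, ignoring positions outside the genome
        hit = {bp // window for bp in breakpoints if 0 <= bp < genome_length}
        supported = hit if supported is None else supported & hit
    if not supported:
        return []
    return [
        (w * window, min((w + 1) * window, genome_length))
        for w in sorted(supported)
    ]


# Hypothetical breakpoint positions from three alignments of the same genomes.
example_runs = {
    "mafft": [150, 1210, 3000],
    "muscle": [180, 1390, 2500],
    "clustal": [120, 1250],
}
print(consensus_windows(example_runs, genome_length=4000))
```

Fixed windows are a blunt instrument: two breakpoints that sit close together on opposite sides of a window boundary will not be counted as agreeing, so the window size should be chosen with the expected positional error in mind.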

In the case where the absolute breakpoint location is of high importance, future researchers may combine the methods from the papers Characterization of New Recombinant Forms of HIV-1 From the Comunitat Valenciana (Spain) by Phylogenetic Incongruence (Beamud et al. 2019) and Evidence of ancient papillomavirus recombination (Varsani et al. 2006). That is, researchers can isolate the parental strands and the daughter strand, align them independently, and then rerun RDP to see if the same recombinant region is identified (Varsani et al. 2006). Phylogenetic testing can then be performed on the three strands in isolation to see if there is movement across the phylogenetic tree (Beamud et al. 2019). This approach is useful because both steps greatly reduce the errors contributed by alignments, since the strands are aligned in isolation. It should be noted that each of these methods on its own greatly decreases the number of recombination breakpoints identified and likely underestimates the total number of recombination events.


References

Arenas M, Posada D. Coalescent simulation of intracodon recombination. Genetics. 2010 Feb;184(2):429-37. doi: 10.1534/genetics.109.109736. Epub 2009 Nov 23. PMID: 19933876; PMCID: PMC2828723.
Aris-Brosou S, Bielawski JP (2006) Large-scale analyses of synonymous substitution rates can be sensitive to assumptions about the process of mutation. Gene 378:58–64
Bay, R.A., Bielawski, J.P. Recombination Detection Under Evolutionary Scenarios Relevant to Functional Divergence. J Mol Evol 73, 273–286 (2011). https://doi.org/10.1007/s00239-011-9473-0
Bentley K, Evans DJ. Mechanisms and consequences of positive-strand RNA virus recombination. J Gen Virol. 2018 Oct;99(10):1345-1356. doi: 10.1099/jgv.0.001142. Epub 2018 Aug 29. PMID: 30156526.
Beamud B, Bracho MA, González-Candelas F. Characterization of New Recombinant Forms of HIV-1 From the Comunitat Valenciana (Spain) by Phylogenetic Incongruence. Front Microbiol. 2019 May 22;10:1006. doi: 10.3389/fmicb.2019.01006. PMID: 31191463; PMCID: PMC6540936.
Bertrand YJK, Johansson M, Norberg P (2016) Revisiting Recombination Signal in the Tick-Borne Encephalitis Virus: A Simulation Approach. PLOS ONE 11(10): e0164435. https://doi.org/10.1371/journal.pone.0164435
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004 Mar 1;32(5):1792-1797. doi: 10.1093/nar/gkh340
Faye O, Freire CC, Iamarino A, et al. Molecular evolution of Zika virus during its emergence in the 20(th) century. PLoS Negl Trop Dis. 2014;8(1):e2636. Published 2014 Jan 9. doi:10.1371/journal.pntd.0002636
Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL. NCBI BLAST: a better web interface. Nucleic Acids Res. 2008 Jul 1;36(suppl_2):W5-W9. doi: 10.1093/nar/gkn201
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772-780. doi:10.1093/molbev/mst010
Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SDW (2006) Automated phylogenetic detection of recombination using a genetic algorithm. Mol Biol Evol 23:1891–1901
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. Clustal W and Clustal X version 2.0. Bioinformatics. 2007 Nov 1;23(21):2947-8. doi: 10.1093/bioinformatics/btm404. Epub 2007 Sep 10. PMID: 17846036.
Martin DP, Murrell B, Golden M, Khoosal A, & Muhire B (2015) RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution 1: vev003 doi: 10.1093/ve/vev003
Posada D, Crandall KA. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci U S A. 2001 Nov 20;98(24):13757-62. doi: 10.1073/pnas.241370698. PMID: 11717435; PMCID: PMC61114.
Patiño-Galindo JÁ, Filip I, AlQuraishi M, Rabadan R. Recombination and lineage-specific mutations led to the emergence of SARS-CoV-2. bioRxiv 2020.02.10.942748. doi: 10.1101/2020.02.10.942748
Nasir, A., Kim, K. M., & Caetano-Anollés, G. (2012). Viral evolution: Primordial cellular origins and late adaptation to parasitism. Mobile genetic elements, 2(5), 247–252. https://doi.org/10.4161/mge.22797
Sanjuán R, Nebot MR, Chirico N, Mansky LM, Belshaw R. Viral mutation rates. J Virol. 2010 Oct;84(19):9733-48. doi: 10.1128/JVI.00694-10. Epub 2010 Jul 21. PMID: 20660197; PMCID: PMC2937809.
Varsani A, van der Walt E, Heath L, Rybicki EP, Williamson AL, Martin DP. Evidence of ancient papillomavirus recombination. J Gen Virol. 2006 Sep;87(Pt 9):2527-2531. doi: 10.1099/vir.0.81917-0. PMID: 16894190.
Zhang S, Yang K, Lei Y, Song K. iRSpot-DTS: Predict recombination spots by incorporating the dinucleotide-based spare-cross covariance information into Chou's pseudo components. Genomics. 2019 Dec;111(6):1760-1770. doi: 10.1016/j.ygeno.2018.11.031. Epub 2018 Dec 6. PMID: 30529702.
