Team:Manchester/Wiki Study





Wiki Study


  • We carefully investigated what makes a good iGEM website
  • We collaborated with Team Heidelberg to decide on the variables of interest, and the Heidelberg iGEM team collected the data through web scraping
  • We made predictions about the influence of each variable and tested these hypotheses using a binomial logistic regression
  • Number of internal links, number of external links and (to a lesser extent) the number of titles showed a clear correlation with iGEM Wiki success, indicating that larger, more tightly integrated projects produce more successful Wiki pages
  • Surprisingly, the number of pictures and the complexity of the language (mean characters in a sentence and mean words per sentence) do not show the expected correlation

Why did we do a Wiki study?

Websites have become increasingly important for companies and entities since the creation of the World Wide Web. They are one of the main platforms to present a product, communicate ideas and advertise companies and their products. Competition for the attention of Internet users is fierce, and consequently it is essential to build websites in the most effective and memorable way possible.

The team Wikis are the most important and long-lasting part of every iGEM project. They are the place where the entire project is presented and where judges should be able to access all of the information in order to award medals and prizes. Consequently, we began our Wiki making process by researching what makes a good website; during this research we became interested in effective web design and began to question what specifically makes a good iGEM Wiki.

After researching general website etiquette and the characteristics of a successful website, we realised that the Wiki for iGEM is a very specific type of website with its own criteria, so we decided to investigate iGEM Wiki structure to see if we could find any recurring themes underlying successful communication.

We then realised that it would be very interesting to compare past years' Wikis to see whether any trends arise and how these trends match up with some of the “best practices” we found when researching website design.

Is there a set of components that when combined make a “winner” Wiki?

What is the wiki study?

After some consideration we selected several variables of interest which we felt could influence a Wiki’s success. We put these forward to the Heidelberg iGEM 2020 team, who collaborated with us by applying their programming expertise to data collection through web scraping. They gave us feedback on which variables could plausibly be investigated with this method, and we narrowed the list down to the variables below. Together with Team Heidelberg, we determined the following for each team:


  • Number of Titles;
  • Number of Subtitles;
  • Number of Sub-subtitles;
  • Number of Pictures;
  • Number of PDFs;
  • Number of Videos;
  • Mean Characters per Sentence;
  • Mean Words per Sentence;
  • Number of internal links;
  • Number of external links;
  • Team Size.
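
As a rough illustration of how counts like these could be gathered from a page's HTML, here is a minimal Python sketch. It is not Team Heidelberg's scraper (their code is linked under References). The class name `WikiPageCounter`, the mapping of h1/h2/h3 tags to titles/subtitles/sub-subtitles, and the rule that a link is internal when it is relative or points to an igem.org host are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed mapping from heading tags to the study's variables (hypothetical).
HEADING_TAGS = {"h1": "titles", "h2": "subtitles", "h3": "sub_subtitles"}


class WikiPageCounter(HTMLParser):
    """Counts structural features of one HTML page (illustrative only)."""

    def __init__(self, site_host="igem.org"):
        super().__init__()
        self.site_host = site_host
        self.counts = {
            "titles": 0, "subtitles": 0, "sub_subtitles": 0,
            "pictures": 0, "videos": 0, "pdfs": 0,
            "internal_links": 0, "external_links": 0,
        }

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.counts[HEADING_TAGS[tag]] += 1
        elif tag == "img":
            self.counts["pictures"] += 1
        elif tag == "video":
            self.counts["videos"] += 1
        elif tag == "a":
            self._count_link(dict(attrs).get("href"))

    def _count_link(self, href):
        if not href:
            return
        parsed = urlparse(href)
        if parsed.scheme not in ("", "http", "https"):
            return  # skip mailto:, javascript: and similar links
        if parsed.path.lower().endswith(".pdf"):
            self.counts["pdfs"] += 1
        host = parsed.netloc.lower()
        # Assumption: relative links and igem.org hosts count as internal.
        if host == "" or host == self.site_host or host.endswith("." + self.site_host):
            self.counts["internal_links"] += 1
        else:
            self.counts["external_links"] += 1


def count_features(html):
    """Return the feature counts for one page of HTML."""
    counter = WikiPageCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


sample_html = """
<h1>Project</h1><h2>Design</h2><h3>Parts</h3><h2>Results</h2>
<img src="fig1.png">
<a href="/Team:Manchester/Model">Model</a>
<a href="https://2020.igem.org/Team:Heidelberg">Heidelberg</a>
<a href="https://github.com/igemsoftware2020">Code</a>
<a href="https://example.org/protocol.pdf">Protocol</a>
"""
sample_counts = count_features(sample_html)
```

In this sketch a link to a PDF is counted both as a PDF and as a link; a real scraper would need to decide how to treat such overlaps, and would sum the counts over every page of a team's Wiki.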

We also differentiated the teams that won the Wiki prize from those that didn't, counting runners-up as “winners” in order to have a slightly more balanced number of winners and losers. Success with the judges is thus our main indicator of what makes a “winner” Wiki.

Mean word length was not obtained through web scraping; instead it is derived from mean characters per sentence and mean words per sentence. We divided the mean characters per sentence by the mean words per sentence to give a rough indication of the average length of the words used. This seemed an interesting way to gain some insight into the complexity of the language used.
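
The derived metric can be sketched in a few lines of Python. This is an illustration only: the sentence splitting rule (break on `.`, `!` or `?`) and the choice to count spaces as characters are assumptions, and the function name `sentence_metrics` is hypothetical rather than taken from the scraping code.

```python
import re


def sentence_metrics(text):
    """Return (mean chars per sentence, mean words per sentence, mean word length)."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    n = len(sentences)
    # Assumption: characters include the spaces inside a sentence.
    mean_chars = sum(len(s) for s in sentences) / n
    mean_words = sum(len(s.split()) for s in sentences) / n
    # Derived metric: mean characters divided by mean words.
    return mean_chars, mean_words, mean_chars / mean_words


mean_chars, mean_words, mean_word_length = sentence_metrics("We built a wiki. It won.")
```

Because spaces are included in the character count here, the ratio slightly overestimates the true average word length, which is one reason to treat it only as a rough indicator.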

Due to the two-phase design of our project, we are carrying out this study in two parts. This year we collaborated with Team Heidelberg on preliminary web scraping and analysed the results to identify some trends. Next year's team can take our analysis, test the model on next year's Wiki data, and expand on our conclusions or adjust the analysis. Furthermore, we can recruit team members with more programming expertise who can extend the web scraping and perhaps use more advanced methods of data collection.

Hypotheses


  • Number of Titles: We predict that title number may be positively correlated with being a winner Wiki as it may indicate more components to the project.
  • Number of Subtitles: We predict that although there may be some positive correlation with success, it likely will not be as notable as that of the Number of Titles.
  • Number of Sub-subtitles: As with the Number of Subtitles, but we expect any correlation to be even smaller.
  • Number of Pictures: We were unable to differentiate between images and figures, so we think there may be only a tenuous link between number of pictures and Wiki success.
  • Number of PDFs: PDFs would usually be used to hold detailed information that is summarized on the wiki pages. This was a potentially interesting variable as we were unsure if Wiki judges prefer for everything to be on the Wiki or if summarized outlines accompanied by more detailed PDFs were preferred.
  • Number of Videos: We predicted that more videos would potentially be correlated to winner Wikis because they are a more complicated way to present content and may be well received by judges. Furthermore, this is an interesting variable for Phase 2 next year to investigate as the iGEM deliverables have changed this year to include two videos.
  • Mean Characters per Sentence: Higher mean numbers may indicate more complex writing; we predicted that there might be a correlation that levels off, as overly complicated writing becomes unclear and would likely be detrimental to the success of a Wiki.
  • Mean Words per Sentence: Higher mean numbers may indicate more complex writing; we predicted that there might be a correlation that levels off, as overly complicated writing becomes unclear and would likely be detrimental to the success of a Wiki.
  • Number of internal links: We predicted that this variable would have the biggest correlation with winning the Wiki prize, as iGEM encourages interlinking project components and showing how one section may inform other parts of the project.
  • Number of external links: Although this may have some correlation to winning we did not expect it would be of much consequence.
  • Team Size: We predicted that a larger team may result in a larger collective skillset, as well as having more time, which might result in a generally better Wiki and thus a correlation to winning the Wiki prize.

Analysis

For our analysis we collected information about each variable for all teams from the 2014-2019 competitions, a total of 1727 teams. The data are summarised in Table 1.

Table 1. Summary of data collected during web scraping by Team Heidelberg


In order to investigate the relationships between these variables and whether a team had a winner Wiki, we carried out a regression analysis.

We chose a binomial logistic regression because our dependent variable, whether the Wiki was a winner, is a binary categorical variable. Logistic regression estimates the probability of a Wiki winning from the values of the independent variables.
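
To make the idea concrete, here is a minimal sketch of a binomial logistic regression fitted by plain gradient descent, using only the Python standard library. It is not the software or procedure used for our analysis; the function names, learning rate and toy data (a single made-up, standardised predictor) are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, x):
    """Estimated probability of being a winner; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights by gradient descent on the mean negative log-likelihood."""
    n, p = len(X), len(X[0])
    weights = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(weights, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Toy data (made up): one standardised predictor, 1 = winner, 0 = not a winner.
X = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5]]
y = [0, 0, 0, 1, 0, 1, 1]
weights = fit_logistic(X, y)
p_high = predict_proba(weights, [1.5])
p_low = predict_proba(weights, [-1.5])
```

A positive fitted coefficient means that higher values of the predictor are associated with a higher estimated probability of winning. In practice a statistics package would also report standard errors for each coefficient.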

Prior to running the regression analysis it is essential to check that the independent variables are not strongly correlated with one another (multicollinearity). We did this by creating a correlation matrix and validated the result by checking Variance Inflation Factors (VIF), as shown in Table 2.

Table 2. Variance Inflation Factors (VIF) values for our variables of interest


The correlation matrix shows the correlation coefficient between each pair of variables. As shown in Table 3, only a few of these coefficients exceed 0.5; those that do involve mean characters per sentence and mean words per sentence, which, as mentioned above, were expected to be related.

Table 3. Matrix of correlation values between each pair of variables


A Variance Inflation Factor is a value of 1 or greater that indicates how much the variance of a regression coefficient is inflated by multicollinearity with the other independent variables. Values below 5 are considered sufficiently low to indicate no problematic multicollinearity. The only variables that displayed values >5 were mean characters per sentence and mean words per sentence, as well as the variable calculated from these two (mean word length). This was to be expected: both are sentence-level variables and are naturally highly correlated, and mean word length is correlated with both because it is computed from them.
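
The two checks can be sketched together in Python. This is an illustrative sketch rather than the code behind Tables 2 and 3: it relies on the standard identity that the VIF of variable j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix, and the function names and toy columns are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Correlation coefficient between each pair of columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect multicollinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy columns (made up), e.g. two sentence-level metrics for five Wikis.
columns = [[1, 2, 3, 4, 5], [2, 4, 5, 4, 5]]
corr = correlation_matrix(columns)
vifs = variance_inflation_factors(columns)
```

With only two predictors the VIF reduces to 1 / (1 - r²), where r is their correlation, so a pair of strongly correlated variables inflates both VIFs at once.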

Table 4

Figure 1. Bar chart depicting Variable significance in relation to winning the Wiki prize

As seen in Figure 1, which depicts the standardized coefficients of the model, the only three variables with results significant at the 95% confidence level are the number of internal links, the number of external links and the number of titles. The others have too wide a spread for us to draw any conclusions from them.
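
One common way to read such a chart is to treat each error bar as a 95% confidence interval and call a coefficient significant when the interval excludes zero. The Python sketch below shows that rule under the assumption of a normal (Wald) interval; we do not claim this is exactly how the error bars in Figure 1 were computed, and the coefficient and standard error values in the example are made up.

```python
from statistics import NormalDist


def confidence_interval(coef, std_err, level=0.95):
    """Wald confidence interval for a coefficient (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * std_err, coef + z * std_err


def is_significant(coef, std_err, level=0.95):
    """True when the confidence interval does not contain zero."""
    low, high = confidence_interval(coef, std_err, level)
    return low > 0 or high < 0


# Made-up values: a tight estimate and a widely spread one.
tight = is_significant(0.8, 0.2)
wide = is_significant(0.3, 0.4)
```

A coefficient with a large standard error relative to its size gives an interval that spans zero, which is what "too wide a spread" means in practice.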

Predictions revisited

Number of Titles shows a significant positive correlation with being a winner Wiki. Additionally, the correlations for the number of external links and number of internal links agree with our predictions.

Quite surprisingly, the Number of pictures, mean characters in a sentence and mean words per sentence do not show the expected correlation with Wiki success.

Number of Sub-subtitles, Number of PDFs, Number of Videos, Mean Characters per Sentence, Mean Words per Sentence and Team Size all showed positive values with large error bars, so none of these results were significant.

Conclusion and Further Steps

We created a preliminary model which can predict the potential influence of a variable upon the team’s likelihood of having a winner Wiki, which may be useful to inform future iGEM teams building effective Wiki pages.

In future, this model can be further validated using data from upcoming years and used to predict Wiki prize outcomes based on Wiki characteristics. Additionally, next year’s team should look for more factors that may affect a Wiki's chance of winning. Although our analysis this year focused on numerical variables, next year's team could expand upon this and investigate different types of variables.

Future steps could involve increasing the number of years analysed and adding Grand Prize winners, as these teams also tend to have strong Wikis. It would also be interesting to see how the quality of Wiki pages has changed over time: is there evidence that iGEM teams are learning from the success of their predecessors? Another potential change would be to compare equal numbers of winners and losers by using a randomly selected pool of losers ("undersampling") and repeating the analysis with different loser pools to validate the results. Additionally, the model could potentially be used predictively, letting future teams test their Wiki as they build it and track their progress towards Wiki success. After all, the aim is not only to win the Wiki prize but, more importantly, to communicate the results of our projects effectively to a global audience.
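
The undersampling idea mentioned above could be sketched as follows in Python. This is a hypothetical outline for a future analysis, not something we ran this year; the function names, the use of integer seeds and the placeholder team labels are assumptions.

```python
import random


def undersample(winners, losers, seed):
    """Pair all winners with an equally sized random pool of losers."""
    if len(losers) < len(winners):
        raise ValueError("need at least as many losers as winners")
    rng = random.Random(seed)  # seeded so each pool is reproducible
    return list(winners) + rng.sample(list(losers), k=len(winners))


def repeated_pools(winners, losers, n_repeats=5):
    """Build several balanced datasets, each with a different loser pool."""
    return [undersample(winners, losers, seed) for seed in range(n_repeats)]


# Placeholder labels standing in for real team records.
winners = ["W1", "W2", "W3"]
losers = ["L" + str(i) for i in range(20)]
pools = repeated_pools(winners, losers)
```

The regression would then be refitted on each balanced pool, and a variable would be considered robust if its coefficient stays significant across most of the pools.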

References

Web Scraping Code:

https://github.com/igemsoftware2020/Heidelberg_2020/tree/main/Wiki_Study_Collaboration_Manchester


igem2020manchester@gmail.com

