Team:SCU-China/Software

RNAlphABA

Background

Late Spring Cold (LSC) refers to a weather phenomenon that commonly occurs in China, in which temperatures rise quickly in early spring (generally March in the Northern Hemisphere) and are lower in late spring (generally April or May) than in normal years. It is mainly caused by prolonged rainy weather, frequent intrusions of cold air, or strong radiative cooling on clear nights, often under the control of cold anticyclones. From our visits to farmers in Sichuan, China, we learned that severe LSC weather can seriously harm agricultural production; the damage is most serious when the early temperature is much higher than usual and the later temperature is much lower. In severe LSC weather, the rotted seedling rate reaches 30% to 40%, and the leaves of crops are damaged by freezing. Therefore, accurately predicting LSC weather and taking preventive measures in time is of great significance for ensuring the agricultural harvest.

SCU-China 2020 hopes to predict the occurrence, intensity, and timing of LSC in the following year by means of machine learning.

Purpose

Collect a large amount of meteorological data, extract features, and use a machine learning method to predict LSC.

Methodology

Since atmospheric motion is an extremely complex, time-varying dynamical system with nonlinear interactions and dissipative structures, we collected a large amount of meteorological data in order to make early, long-term predictions of LSC weather processes in major cities of China. By means of significance analysis and feature engineering, we screened the 11 features most favorable for LSC prediction. Based on the random forest algorithm, we predicted the occurrence, intensity, and timing of LSC and provided visualized prediction results for farmers.

What is the Random Forest Algorithm?

Random forest is an ensemble machine learning method built from nonlinear tree-based models. In the 1980s, Breiman et al. invented the classification and regression tree (CART) algorithm, which classifies or regresses data by repeatedly splitting it in two and greatly reduces the computational effort. In 2001, Breiman combined such trees into the random forest, which randomizes the use of variables (columns) and data (rows), grows many trees, and then aggregates their results. Random forest improves prediction accuracy without significantly increasing the computational cost, is not sensitive to multicollinearity, is robust to missing and unbalanced data, and can handle up to thousands of explanatory variables. It is regarded as one of the best machine learning algorithms currently available.

For regression problems, a random forest usually uses simple averaging: the final prediction is the arithmetic mean of the regression results obtained by the T weak learners (trees).
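As an illustration only (a minimal sketch with toy data, not our actual pipeline), the Python snippet below shows that a scikit-learn RandomForestRegressor prediction equals the arithmetic mean of the predictions of its individual trees:

```python
# Minimal sketch: a random forest regression prediction is the arithmetic
# mean of the predictions of its T individual trees (weak learners).
# The toy data and hyperparameters are placeholders, not our real settings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # 200 samples, 5 features (toy data)
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Prediction of the whole forest ...
forest_pred = forest.predict(X[:3])
# ... equals the simple average over the T fitted trees.
mean_of_trees = np.mean([tree.predict(X[:3]) for tree in forest.estimators_], axis=0)
print(forest_pred, mean_of_trees)
```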

1 Data collection and data processing

1.1 Data collection

Data set A: includes monthly maximum wind speed, minimum air temperature, maximum air pressure, minimum air pressure, average air pressure, average 2-minute wind speed, average air temperature, average vapor pressure, average relative humidity, average minimum air temperature, average maximum air temperature, monthly sunshine percentage, sunshine duration, maximum wind speed, maximum daily precipitation, and minimum relative humidity for each month from January to December. The data set includes meteorological data from 165 international ground exchange stations in China, covering the years 1970-2020. This data set is used to generate the prediction model of LSC.

Data set B: includes daily maximum temperature, daily minimum temperature, and daily average temperature data. The data set includes meteorological data from 165 international ground exchange stations in China, covering the years 1951-2019. This data set is used to calculate the intensity and the starting and ending dates of LSC in major cities of China from 1970 to 2019.

1.2 Data processing

(1) Some original records contain only the district station number, without the place and province. We therefore construct a hash table from the meteorological station number comparison table and use it to add geographic information to the original data (see the sketch after step (2)).

(2) Treatment of missing values: since the meteorological data contain a large number of missing values, we handled them differently for the two data sets. For data set A, rows with many missing values were deleted and the remaining missing entries were represented by null values; for data set B, missing values are coded as 99999 (monthly data) or 32766 (daily data) in the original files, and we replaced them with the average of the adjacent values.
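The following Python sketch illustrates both preprocessing steps. The file and column names (station_info.csv, station_id, daily_mean_temperature, etc.) are placeholders for illustration and are not the actual names used in our data:

```python
import numpy as np
import pandas as pd

# Step (1): hash table from station number to (city, province), built from the
# meteorological station number comparison table, then joined onto the raw data.
stations = pd.read_csv("station_info.csv")        # assumed comparison table
raw = pd.read_csv("raw_weather_daily.csv")        # assumed raw records with station numbers only
lookup = {row.station_id: (row.city, row.province) for row in stations.itertuples()}
raw["city"] = raw["station_id"].map(lambda s: lookup[s][0])
raw["province"] = raw["station_id"].map(lambda s: lookup[s][1])

# Step (2): replace the missing-value sentinels of data set B (99999 in monthly
# files, 32766 in daily files) with the average of the adjacent values; linear
# interpolation between the two neighbouring days does exactly that for a
# single missing entry.
col = "daily_mean_temperature"
raw[col] = raw[col].replace([99999, 32766], np.nan)
raw[col] = raw[col].interpolate(limit_direction="both")
```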

2 Calculation of cold weather indicators

We define a mathematical meteorological indicator for LSC. The indicator is composed of three features: the degree of warming in the early stage, the degree of coldness in the late stage (the daily temperature anomaly), and the duration of the LSC process.

The formula for calculating the meteorological indicator of LSC is as follows: $$K = \frac{\delta T_1}{a_1} - \frac{\delta T_2}{a_2} + \frac{L}{a_3}$$ In this formula, $K$ is the meteorological indicator of LSC; $\delta T_1$ is the degree of warming in the early stage; $\delta T_2$ is the degree of coldness in the late stage; $L$ is the duration of the LSC process; $a_1$, $a_2$, $a_3$ are the parameters, and their values are 4, 2, 10, respectively.

Some explanation of the formula above:

2.1 $\delta T_1$

When the temperature in a certain period of spring is lower than in normal years, $\delta T_1$ is the average temperature anomaly (departure from the normal value) of the previous 10 days. The value of $\delta T_1$ must be greater than or equal to 0, indicating that the earlier period is warmer than in normal years.

2.2 $\delta T_2$

When the average temperature from March to May is colder than in the same period of normal years, $\delta T_2$ is the largest cold anomaly over 10 consecutive days, i.e. the minimum value of the rolling 10-day mean temperature anomaly. The value of $\delta T_2$ must be less than 0, indicating that the later part of spring is relatively cold.
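As an illustration of how $\delta T_1$ and $\delta T_2$ could be computed (a sketch only: it assumes a pandas Series of daily temperature anomalies, i.e. daily mean temperature minus the climatological normal for that calendar day, indexed by date, and may differ from our actual program):

```python
import pandas as pd

def delta_t2(anomaly: pd.Series) -> float:
    """Minimum 10-day rolling mean anomaly within March-May (must be < 0)."""
    spring = anomaly[(anomaly.index.month >= 3) & (anomaly.index.month <= 5)]
    return spring.rolling(window=10).mean().min()

def delta_t1(anomaly: pd.Series, cold_start: pd.Timestamp) -> float:
    """Mean anomaly of the 10 days preceding the cold spell (must be >= 0)."""
    window = anomaly.loc[:cold_start].iloc[-11:-1]   # the 10 days before cold_start
    return window.mean()
```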

2.3 Determination of the start date

Under the condition that the earlier period is warmer, the start date of LSC is the first day on which the difference between the daily average spring temperature and the normal value of the 5-day average temperature centered on that day is less than or equal to -1, and this condition lasts for more than 3 days.

2.4 Determination of the end date

The end date of LSC is the first day on which the difference between the daily average temperature and the normal (standard) value of the 5-day average temperature centered on that day is greater than -1, and this condition lasts for more than 3 days.
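A hedged sketch of the start/end-date rule in 2.3 and 2.4 (it assumes two pandas Series aligned by date, the observed daily average temperature and its climatological normal; the persistence check here is a simplification of our actual program):

```python
import pandas as pd

def first_persistent_day(condition: pd.Series, min_days: int = 4):
    """First day of the first run where `condition` holds on more than 3
    consecutive days (i.e. at least `min_days` days)."""
    run_start, run_len = None, 0
    for day, ok in condition.items():
        if ok:
            run_start = day if run_len == 0 else run_start
            run_len += 1
            if run_len >= min_days:
                return run_start
        else:
            run_len = 0
    return None

def lsc_start_end(daily_temp: pd.Series, normal_temp: pd.Series):
    # Daily average temperature minus the standard (normal) value of the
    # 5-day mean temperature centred on that day, as in sections 2.3-2.4.
    diff = daily_temp - normal_temp.rolling(window=5, center=True).mean()
    start = first_persistent_day(diff <= -1.0)                  # start: <= -1 for > 3 days
    end = None
    if start is not None:
        end = first_persistent_day((diff > -1.0).loc[start:])   # end: > -1 for > 3 days
    return start, end
```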

2.5 Duration L

An $L$ value that meets the LSC standard must be greater than 3, indicating that the LSC weather lasts for more than 3 days. Referring to the damage that agrometeorological disasters cause to crops, the duration of LSC is divided into three grades: $3\,d < L \le 7\,d$; $7\,d < L \le 10\,d$; $L > 10\,d$.

2.6 LSC index: $K$

$K$ represents the grade of LSC, which is divided into three levels: mild LSC, $K \le 3$; moderate LSC, $3 < K < 5$; severe LSC, $K \ge 5$.
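Putting the formula in section 2 and the grading above together, a minimal Python sketch of the index calculation is:

```python
# Minimal sketch of the LSC index K (formula in section 2) and its grading
# (section 2.6). delta_t1 >= 0, delta_t2 < 0 and L > 3 are the validity
# conditions stated above; a1 = 4, a2 = 2, a3 = 10 are the given parameters.
def lsc_index(delta_t1: float, delta_t2: float, duration: float,
              a1: float = 4.0, a2: float = 2.0, a3: float = 10.0) -> float:
    return delta_t1 / a1 - delta_t2 / a2 + duration / a3

def lsc_grade(k: float) -> str:
    if k <= 3:
        return "mild LSC"
    elif k < 5:
        return "moderate LSC"
    return "severe LSC"

# Example: a 1.5 degree early warming, a -4 degree late cold anomaly and a
# 6-day cold spell give K = 1.5/4 + 4/2 + 6/10 = 2.975, i.e. a mild LSC.
print(lsc_grade(lsc_index(1.5, -4.0, 6)))
```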

You can experience the LSC index calculation program on our wiki.




3 Feature Engineering

Data and features determine the upper limit of machine learning, while models and algorithms simply approach that limit.

from Kaggle

Since atmospheric motion is an extremely complex, time-varying dynamical system with nonlinear interactions and dissipative structures, we collected a large amount of meteorological data containing 12*21 = 252 monthly features that jointly enter the construction of the LSC prediction model. Obviously, we cannot use all 252 features directly, which would lead to the curse of dimensionality. Therefore, we need to eliminate irrelevant features and select a relevant feature subset, which avoids the dimensionality problem, reduces the difficulty of the learning task, keeps the model simple, and lowers the computational cost.
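As an illustration of this screening step (a hedged sketch: we use scikit-learn's univariate F-test here, and the file and column names are assumptions; the significance analysis in our work may differ in detail):

```python
# Hedged sketch of the feature screening: rank the 252 monthly features by
# the significance of their linear relation with the LSC index K and keep
# the strongest ones. The F-test shown here is only an illustration.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

data = pd.read_csv("dataset_A_features.csv")   # assumed file with 252 feature columns + K
X = data.drop(columns=["K"])                   # monthly meteorological features
y = data["K"]                                  # LSC index computed in section 2

selector = SelectKBest(score_func=f_regression, k=11).fit(X, y)
selected = X.columns[selector.get_support()]
print(list(selected))                          # the 11 retained features
```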

By means of significance analysis and feature engineering, we screened out the 11 features most favorable for building the prediction model: the December maximum pressure, November maximum pressure, October maximum pressure, August maximum pressure, February maximum pressure, September maximum pressure, December minimum pressure, November minimum pressure, March maximum pressure, January maximum pressure, and July maximum pressure. We were surprised to find that all the selected features are barometric (pressure) features, and that the December and November data have the greatest impact on the prediction model. This coincides with the information we had collected beforehand and further confirms the effectiveness of the feature engineering:

(1) Spring is the transitional season from winter to summer and a period of adjustment of the atmospheric circulation. During this period, a great deal of heat is exchanged between high and low latitudes to achieve a meridional heat balance, which makes the north-south airflow quite active. For a given region, this activity appears as frequent and strong incursions of cold and warm air, which easily lead to LSC, and the evolution of the regional pressure, temperature, humidity, and other meteorological fields inevitably shows obvious abrupt changes. Pressure is the feature that best reflects this adjustment of the atmospheric circulation.

(2) Many Chinese proverbs related to the prediction of LSC embody the experience and wisdom accumulated by Chinese farmers over generations. For example, "进九冷,出九热;进九热,出九冷。" (a Chinese proverb: if it is cold when "entering the nines", it will be warm when "leaving the nines"; if it is warm when entering, it will be cold when leaving). This proverb means that if the weather is cold around the winter solstice (a Chinese solar term, usually in December), the following spring (March or April) will be warm; on the contrary, if the weather is warm around the winter solstice, the spring will be cold and an LSC may occur. Such sayings suggest that the meteorological characteristics of November and December of the previous year have important predictive significance for the occurrence of LSC in the next year.

4 Construction of a random forest regression model

According to the correlation table and the scatter plot matrix, the December maximum pressure and the November maximum pressure have the strongest linear correlation with LSC, with correlation coefficients exceeding 0.8, and the correlation coefficients of the October and August maximum pressures exceed 0.7. Meanwhile, the correlation coefficient between the December and November maximum pressures is above 0.9, and that between the August and October maximum pressures also exceeds 0.8. These observations indicate that there is multicollinearity among the features, i.e. they are not independent of each other, so linear regression cannot be applied directly in our model. Therefore, we chose the random forest algorithm to construct the LSC prediction model.
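A sketch of the multicollinearity check behind Fig 1 and Fig 2 (the file name selected_features.csv is an assumption for illustration):

```python
import numpy as np
import pandas as pd

features = pd.read_csv("selected_features.csv")   # assumed: the 11 screened pressure features
corr = features.corr()                            # Pearson correlation matrix (Fig 1)
# Flag feature pairs whose absolute correlation exceeds 0.8 (ignoring the diagonal).
high = (corr.abs() > 0.8) & ~np.eye(len(corr), dtype=bool)
print(corr.round(2))
print(corr.columns[high.any(axis=0)].tolist())    # features involved in strong collinearity
```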

Fig 1. Table of correlation coefficients between the 11 features.
Fig 2. Scatterplot matrix of the 11 features.
Fig 3. The steps of the model construction.

Model evaluation

In our model, the December and November maximum pressures remain the two indicators with the greatest impact on LSC, followed by the maximum pressures in October and August, which is basically consistent with the results of the correlation coefficient analysis. As shown in the model evaluation table, the coefficient of determination R^2 of the constructed model is 0.9945 on the training set and 0.9258 on the validation set, indicating that our model performs well and can be used to predict LSC.

Fig 4. Indicator importance of our model.
Table 1. Model evaluation of the LSC prediction model built using the random forest algorithm.

Evaluation    Training Set    Validation Set
R-squared     0.9945          0.9258
MSE           37.4867         82.1434
MAE           2.1413          4.6402
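A hedged sketch of how such a model can be trained and evaluated with scikit-learn (the file name, the train/validation split, and the hyperparameters shown are assumptions for illustration; the figures in Table 1 come from our actual run):

```python
# Fit a RandomForestRegressor on the 11 screened features, report R-squared,
# MSE and MAE on the training and validation sets (Table 1), and inspect
# feature_importances_ (Fig 4).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

data = pd.read_csv("lsc_training_table.csv")        # assumed: 11 features + K per station-year
X, y = data.drop(columns=["K"]), data["K"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

for name, Xs, ys in [("Training Set", X_train, y_train), ("Validation Set", X_val, y_val)]:
    pred = model.predict(Xs)
    print(name,
          "R2=%.4f" % r2_score(ys, pred),
          "MSE=%.4f" % mean_squared_error(ys, pred),
          "MAE=%.4f" % mean_absolute_error(ys, pred))

# Indicator importance (Fig 4), sorted from most to least important.
print(sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1]))
```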

LSC Prediction Software

Through a survey of rural areas in China, we found that many farmers do not know about late spring cold (LSC) or the serious agricultural losses it brings. Severe LSC usually results in a rotted seedling rate of up to 30% to 40%, and crop seedlings and the new shoots of some perennials are damaged by freezing. Our investigation suggests that the problem has two main causes: (a) farmers and local authorities do not receive early warnings that an LSC is approaching, and they cannot judge from the weather of the preceding days whether an LSC will occur; (b) at present, China's meteorological broadcasting systems, including China Central Television's national weather forecast, report only indicators such as temperature and precipitation, and there is no professional, unified standard or forecast for LSC.

Fig 5. Pictures from field visits.


Therefore, based on the LSC prediction model, SCU-China is trying to develop a handy LSC forecast software to help local governments and farmers know the occurrence, intensity, and timing of LSC in advance, so that farmers can prepare for the weather ahead of time.

So far, we have used the 2019 climate data to forecast the 2020 LSC for major cities in China, as shown in Figure 6. Next to each spot are the region's Chinese name, longitude and latitude, and the LSC intensity in 2020. The colors of the spots represent mild LSC (blue), moderate LSC (yellow), and severe LSC (red).

Fig 6. Visualization results of LSC prediction for key cities in China in 2020.
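A minimal matplotlib sketch of the kind of visualization shown in Fig 6 (the file predictions_2020.csv and its columns city, lon, lat, K are assumptions for illustration):

```python
# Plot each city at its longitude/latitude and colour the spot by predicted
# LSC grade: blue = mild (K <= 3), yellow = moderate (3 < K < 5), red = severe (K >= 5).
import pandas as pd
import matplotlib.pyplot as plt

pred = pd.read_csv("predictions_2020.csv")          # assumed columns: city, lon, lat, K
colors = pred["K"].apply(lambda k: "blue" if k <= 3 else ("yellow" if k < 5 else "red"))

plt.scatter(pred["lon"], pred["lat"], c=colors)
for _, row in pred.iterrows():
    plt.annotate("%s (%.1f, %.1f) K=%.1f" % (row["city"], row["lon"], row["lat"], row["K"]),
                 (row["lon"], row["lat"]), fontsize=6)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Predicted LSC intensity for 2020")
plt.show()
```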

We hope to improve our software in the future and provide more convenience for farmers by publishing a free LSC forecast app on the app store.

You can download our software here.

References

  • 1. Breiman, L. 2001a. Random forests. Machine Learning 45:5-32.
  • 2. Breiman, L. 2001b. Statistical modeling: The two cultures. Statistical Science 16:199-215.
  • 3. Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Chapman and Hall, New York.
  • 4. Iverson, L. R., A. M. Prasad, S. N. Matthews, and M. Peters. 2008. Estimating potential habitat for 134 eastern US tree species under six climate scenarios. Forest Ecology and Management 254:390-406.
  • 5. Wikipedia, the free encyclopedia: Random forest.
  • 6. Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29(5):1189-1232. https://statweb.stanford.edu/~jhf/ftp/trebst.pdf.
  • 7. scikit-learn documentation for RandomForestRegressor, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
  • 8. Breiman, L. Bagging predictors. http://statistics.berkeley.edu/sites/default/files/techreports/421.pdf.
  • 9. Li, Jiawen. Application of fuzzy processing of barometric characteristics in forecasting late spring cold weather.
  • 10. Gu, Shimin, and Peng Han. Climate change favours a destructive agricultural pest in temperate regions: late spring cold matters.