Introduction
Golf officials and fans alike follow the result of each golfing event, mostly through the press and other media. Broadcasters and commentators cautiously predict the winner and the winning factors, especially in the Ladies Professional Golf Association (LPGA) majors. Golf fans also judge the result based on the performance of each player. Performance-analysis scholars attempt to determine the winning factors and the performance factors that affect the money leader, based on LPGA longitudinal data accumulated over many years. Several research papers have reported that greens in regulation (GIR) and putting average (PA) contributed more, and were more important, than the other factors affecting average strokes, money leaders, or winning (Chae and Park, 2017; Dodson et al., 2008; Finley and Halsey, 2004; Park and Chae, 2016).
In most sports competitions, strategy analysts for each team invest efforts to analyze the records and data of the home and away teams to equip coaching staff with decisive factors that can affect the outcome of the game. These efforts are the same in the LPGA as in various other fields, and skill information such as the length of the game field, types or lay of the land, the level of difficulty of the course, the type of grass and green conditions, weather, and strategy for course targeting, is provided (McGarry et al., 2002). However, recently, prediction and description of the determinant of victory of the team and players, as well as the winner, have been required in sports competitions (Dorsel and Rotunda, 2001; Park and Chae, 2016).
This requirement has reached a level wherein scholars statistically provide winner and rank possibilities employing prediction models on accumulated data (Hayes et al., 2015; Jida and Jie, 2015; Neeley et al., 2009). Chae et al. (2018) used multiple regression analysis, which is a statistical analytical model, for the rank prediction of LPGA players based on the fact that the medal rank of the 2016 Rio Olympic female golf tournament was predicted by multiple regression analysis (Mercuri et al., 2017). The methods of analysis for this type of prediction are usually linear regression analysis, curve estimation, discriminant function analysis, logistic regression analysis, principal component regression analysis, classification tree analysis, and more recently, the frequently used artificial neural network analysis. Classification tree analysis, logistic regression analysis, discriminant analysis, and artificial neural network analysis, in particular, are generally used in quantitative prediction analysis (Agga and Scott, 2015; Cenker et al., 2009; Maszczyk et al., 2012, 2016; Neeley et al., 2009).
Discriminant function analysis is a statistical technique for predicting how an individual would behave under given circumstances, based on various characteristics of social phenomena. Several assumptions should be satisfied when using discriminant function analysis (Couceiro et al., 2013; Kuligowski et al., 2016; Mieke et al., 2014; Shehri and Soliman, 2015). Classification tree analysis segments individuals into small groups with similar behaviors, or stratifies them based on a certain standard, for example whether an LPGA player will win, not win, or lead in wins (Surucu et al., 2016). Logistic regression analysis is a generalized linear model in which the response variable is a binary, categorical variable. It has the advantage that there are few constraints on the discriminating variables; however, as a regression-based method, it is limited in handling interaction effects and large numbers of independent variables (Clark, 2001; Lu, 2017; Sperandei, 2014).
Artificial neural network analysis mimics the human neural–brain system. A typical neural network is composed of three layers, i.e., the input layer, hidden layer, and output layer, which include several neurons (Almassri et al., 2018). The neurons in the hidden layer conduct intermediate treatment if the input nodes receive stimulation, resulting in response from the output nodes. Thus, when using artificial neural network analysis, the predicting variable is applied to the input layer and the dependent variable to the output layer. The hidden layer oversees the intermediary management, and the researcher does not grant a role to a specific observed variable even though the researcher designates the number of hidden layers and neurons.
When the input variables are supplied to the neural network, the back-propagation algorithm is applied between the input and hidden layers and between the hidden and output layers. The connection weights are adjusted at each iteration to minimize the error between the observed value at each output unit and the value calculated by the artificial neural network. Starting from a random point with a large learning rate, the algorithm searches for the optimum by combining a step in the direction of the steepest slope, scaled by the learning rate (η > 0), with a step in the direction of the movement so far, scaled by the momentum (α > 0) (Chen and Liu, 2014; Smaoui et al., 2018). Artificial neural network analysis is relatively independent of statistical preconditions and can describe nonlinear relationships between variables. Therefore, it is often preferred over the traditional methods (Chen and Liu, 2014; Maszczyk et al., 2014; Sun and Lo, 2018).
Although many prediction analysis methods exist, this study aims to identify the performance variables that affect players' chances of winning, and the relative importance of these variables, from the annual data of 25 LPGA seasons (1993 to 2017). Moreover, it aims to select the most accurate of four prediction models (classification tree analysis, logistic regression analysis, discriminant function analysis, and artificial neural network analysis). This study presents a relative comparison of the influence of the predicting variables on victory across the four prediction models. That is, it identifies the performance variables that should be considered for winning, and it can predict an individual's probability of victory using the optimum prediction model. The results of this study are expected to show the effect of prior preparation on victory.
Methods
Participants and Data Collection
The data covered LPGA players ranked within the top 60 (money leaders) in each of the 25 seasons from 1993 to 2017, i.e., annual average values for 1,500 player-seasons (60 players multiplied by 25 years). The data were collected from the LPGA homepage (http://www.lpga.com/). Because the data on the LPGA homepage did not include private identifying information such as telephone numbers, home addresses, or social security numbers, ethical approval was not required for this study. The performance variables chosen were those measured and used in current LPGA analyses. The variables were reconstituted in this study as independent variables (predicting variables), which were continuous, and dependent variables (response variables), which were categorical (Table 1).
Table 1
Experimental Approach to the Problem
The data analysis aimed to determine key performance variables that affected the possibility of winning, the variable that was the most significant, and whether the player would win a game or be in the lead in wins. Four prediction models, i.e., classification tree analysis, logistic regression analysis, discriminant function analysis, and artificial neural network analysis, were employed. The most accurate model was selected, according to the purpose of the study.
Procedures
The players' accumulated raw data released by the LPGA were arranged using Microsoft Office Excel 2010 (Microsoft Corporation, Redmond, WA, USA), and the results were derived using the IBM SPSS 22.0 (IBM Corp., Armonk, NY, USA) statistical program. In the first round of analysis, classification accuracy was used as the basis for assessing whether a given player could win in the LPGA, using the four prediction models (discriminant function analysis, classification tree analysis, logistic regression analysis, and artificial neural network analysis (multilayer perceptron, MLP)). One-way analysis of variance (ANOVA; post-hoc: least significant difference [LSD] test) was used to test whether the mean classification accuracy differed among the four prediction models.
The input predicting variables of the four prediction models were divided into skill variables (driving accuracy [DA], driving distance [DD], sand saves [SS], GIR, and PA), skill result variables (birdies, eagles, par3 scoring average [P3A], par4 scoring average [P4A], and par5 scoring average [P5A]), and season outcome variables (official money [OM], scoring average [SA], top 10 finish% [T10], 60-strokes average [60SA], and rounds under par [RUP]). When these predicting variables were entered, the dependent variable was divided into either two groups (victory/no victory) or three groups (no victory/one victory/multiple victories). From the results of the four prediction models, the standardized discriminant function coefficient, normalization importance, or Wald value, which are importance indexes linking each independent variable to the dependent variable, could be obtained. Finally, one-way ANOVA and the post-hoc (LSD) test were conducted to examine the mean difference in the classification accuracy of the four models. Statistical significance was set at 0.05.
Statistical Analyses
In the discriminant function analysis, a function that maximizes the differences between groups was derived from continuous and discrete variables, and each participant (player) was classified using Fisher's linear discriminant functions (Mieke et al., 2014; Shehri and Soliman, 2015). To use this model, the group to which each object belongs must be known in advance. With the groups known, the category of each object was classified and predicted by calculating each individual's discriminant score from a linear discriminant function of the general form

$$D = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n,$$

where $D$ is the discriminant score, $X_1, \ldots, X_n$ are the measured variables, and $b_0, \ldots, b_n$ are the discriminant coefficients. This function could classify each group from the measured variables (Kuligowski et al., 2016; Novak, 2016; Schumm, 2006).
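As an illustration of how a discriminant score separates two groups, the following minimal sketch fits a two-group Fisher discriminant on two variables and classifies a case by comparing its score with the midpoint between the group centroids. The data rows, the function names (`fit_fisher_2d`, `score`, `classify`), and the midpoint cutoff are assumptions made for the example; they are not the SPSS procedure or the data used in this study.

```python
from statistics import mean


def score(w, x):
    """Discriminant score: weighted sum of the two measured variables."""
    return w[0] * x[0] + w[1] * x[1]


def fit_fisher_2d(group0, group1):
    """Fit a two-group, two-variable Fisher discriminant.

    Returns (weights, cutoff); a case is assigned to group 1 when its
    discriminant score exceeds the cutoff.
    """
    m0 = [mean(row[j] for row in group0) for j in range(2)]
    m1 = [mean(row[j] for row in group1) for j in range(2)]

    # Pooled within-group scatter matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((group0, m0), (group1, m1)):
        for row in rows:
            d = (row[0] - m[0], row[1] - m[1])
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]

    # Weights w = S^-1 (m1 - m0), using the closed-form 2 x 2 inverse.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    w = ((s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det)

    # Assumed cutoff: midpoint between the two group-centroid scores.
    cutoff = (score(w, m0) + score(w, m1)) / 2
    return w, cutoff


def classify(w, cutoff, x):
    """Return 1 (victory group) or 0 (no-victory group)."""
    return 1 if score(w, x) > cutoff else 0


# Made-up (GIR %, putting average) rows for illustration, NOT LPGA data.
no_win = [(66, 30.2), (67, 30.0), (65, 30.5), (68, 30.1)]
win = [(71, 29.5), (72, 29.8), (70, 29.4), (73, 29.6)]
w, cutoff = fit_fisher_2d(no_win, win)
print(classify(w, cutoff, (72, 29.5)), classify(w, cutoff, (66, 30.3)))
```

With more than two predictors the same idea applies, but the scatter matrix has to be inverted numerically rather than with the 2 x 2 closed form used here.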
Classification tree analysis was used for classification and prediction by representing the decision-making rules as a tree structure. A decision tree consists of nodes and the branches (stems) that connect them. The decision-making pattern emerges as the nodes are split repeatedly during the tree-forming process. Decision trees assume that, prior to analysis, the type of each variable is precisely specified according to its measurement level; that is, it should be checked whether the variables have been accurately designated for their measurement levels (Surucu et al., 2016).
Depending on the characteristics of the data and the purpose of decision making, the methods of growing the tree are classified into chi-squared automatic interaction detection (CHAID), exhaustive CHAID, classification and regression tree (CRT), and quick, unbiased, and efficient statistical tree (QUEST). Classification accuracy has been found to be high with CRT (Hayes et al., 2015). The tree structure was formed by designating the splitting standard and pattern according to the purpose of the analysis and the structure of the data. When forming lower (child) nodes from a single upper (parent) node, the decision tree selects the predicting variable and sets the category boundaries. Pure child nodes were formed by the split that most efficiently separated the distribution of the target (dependent) variable. Here, purity was defined as the degree to which a node contains individuals of a single category of the target (dependent) variable. The prediction model was set according to the analysis result and interpreted by examining the meaning of its parts, as the decision tree describes the relationships between variables as a tree structure (Linda et al., 2008; Neeley et al., 2007).
The merit of this method is that the process is simpler than that of the other methods (artificial neural network analysis, discriminant function analysis, regression analysis, and so on), as prediction or classification is described by the induction rules of the tree structure. In this study, CRT, one of the four tree-growing methods, was used. Homogeneity within nodes was maximized by splitting each parent node so that the dependent variable was as homogeneous as possible within the child nodes (Hayes et al., 2015). The splitting criterion of the classification tree governed which input variable was selected and how its categories were merged when a parent branch formed child branches; the process ran in sequence from selecting the input variable, to examining the distribution of the target variable, to forming the child branches. The degree to which a split separated the distribution of the target variable was measured in terms of purity or impurity. The purity of the child branches was much higher than that of the parent branch. Pruning removed branches that had a high risk of misclassification or inappropriate induction rules.
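The idea of choosing the split that makes child nodes as pure as possible can be sketched in a few lines. The example below assumes the Gini index as the impurity measure, a common choice for CRT; the text does not state which measure was used, and the function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 0.0 means every case is in one category."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one predictor that minimizes the weighted
    impurity of the two child nodes (cases with value <= threshold go left).

    Returns (threshold, weighted_child_impurity); the threshold is None
    when no split improves on the parent node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up GIR percentages and victory labels (1 = victory), NOT LPGA data.
gir = [64, 66, 67, 70, 71, 73]
won = [0, 0, 0, 1, 1, 1]
print(gini(won), best_split(gir, won))
```

A full tree repeats this search over every predictor at every node and then prunes branches whose splits do not hold up under validation.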
Validity was evaluated by cross-validation and split-sample validation. In cross-validation, the analytical sample was divided into m (= 2, 3, 4, ...) parts; a tree was built from m − 1 parts and evaluated on the remaining part, with each part serving once as the evaluation set. Split-sample validation divided the observations into training samples (70%) and test samples (30%), built the tree with the training samples, and assessed it with the test samples. This means that the resulting tree is not specific to the sample but can be extended to the population from which the analysis sample was drawn. Model assessment could be described with profit charts or risk charts. The decision tree thus found hidden patterns and useful correlations in the data and could serve as a reference for future decision making, as well as for finding associations between data that were difficult to quantify accurately (Duan et al., 2015).
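The two validation schemes differ only in how case indices are partitioned. The sketch below shows one way to generate both partitions; the function names, the shuffling seed, and the choice of m = 10 in the usage lines are assumptions for illustration and are not taken from the study.

```python
import random


def split_sample(n_cases, train_fraction=0.7, seed=0):
    """Split-sample validation: shuffle indices, then cut into train/test."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    cut = round(n_cases * train_fraction)
    return indices[:cut], indices[cut:]


def cross_validation_folds(n_cases, m, seed=0):
    """m-fold cross-validation: yield (train, test) index lists.

    Each of the m parts is held out once for evaluation while the
    remaining m - 1 parts are used to build the model.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::m] for i in range(m)]
    for held_out in range(m):
        test = parts[held_out]
        train = [j for k, part in enumerate(parts) if k != held_out for j in part]
        yield train, test


# 1,500 player-season records, as in the data description.
train_idx, test_idx = split_sample(1500)
folds = list(cross_validation_folds(1500, m=10))
print(len(train_idx), len(test_idx), len(folds))
```

Averaging the misclassification rate over the held-out parts gives the cross-validated risk estimate, which is usually slightly higher than the rate measured on the training data.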
In logistic regression analysis, variables measured on nominal, ordinal, interval, and ratio scales could be used as independent variables; however, the dependent variable had to be a categorical variable measured on a nominal scale, so as to analyze and predict whether an individual observation belonged to a certain group. The functional form of the logistic model was

$$p = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)},$$

which was expressed as

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n,$$

making the model amenable to linear regression analysis. The quantity in brackets on the left-hand side of this linear logit function is the odds, and its natural log is the logit; p, the numerator, is the probability that an individual belongs to a certain group, and 1 − p, the denominator, is the probability that the individual does not. Thus, when the right-hand side is calculated from the n predicting variables (X), the larger the logit value, the higher the probability that the individual belongs to the group (Curtis, 2019).
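The relationship between the logit and the group-membership probability p described above can be checked numerically. In this minimal sketch the intercept, coefficients, and predictor values are hypothetical numbers chosen for illustration, not estimates from the study.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def predicted_probability(intercept, coefficients, x):
    """Probability of group membership from the linear predictor."""
    linear = intercept + sum(b * v for b, v in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and standardized predictor values.
p_win = predicted_probability(-1.0, [0.5, 0.2], [2.0, 3.0])
print(p_win, logit(p_win))
```

Applying `logit` to the predicted probability recovers the linear predictor, which is why a larger logit always corresponds to a higher probability of belonging to the group.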
Artificial neural network analysis trains a computer on learning data to find the optimum result, applies what was learned to new data or conditions, and deduces an expected result, much as a human learns how to behave (Chen and Liu, 2014). The neural network used in this study was composed of three layers (input layer, hidden layer, and output layer), and each layer included several neurons (Chen and Liu, 2014; Nair et al., 2016). The neurons in the hidden layer received the stimulation (every type of information) from the neurons in the input layer as a weighted linear combination

$$S = w_0 + w_1 X_1 + w_2 X_2 + \cdots + w_n X_n,$$

where $w_0, \ldots, w_n$ are the connection weights. The larger this linear combination, the more strongly the neuron was activated; in the opposite case, it was deactivated (Almassri et al., 2018; Nair et al., 2016).

With this activation value denoted $S$, the logistic activation function was

$$f(S) = \frac{1}{1 + e^{-S}}.$$
The goodness-of-fit of the neural network was obtained by maximizing the corresponding likelihood function using the back-propagation algorithm. Conceptually, this algorithm seeks efficient computation by combining the learning rate (the step intensity in the direction of the steepest slope) and the momentum (the intensity carried over from the direction of movement so far) (Jida and Jie, 2015; Smaoui et al., 2018). Namely, the neural-network fitting algorithm started from a random location and explored actively toward the highest point using a high learning rate at the beginning, then gradually lowered the learning rate as it approached the highest point (Sun and Lo, 2018; Xi et al., 2013). This process was repeated from other starting locations. The point finally reached by repeating this process dozens of times was taken to be the global, rather than a local, highest point (Nair et al., 2016); it gave the weight parameters for which the likelihood was at its maximum. The predicting variables were the skill variables (DA, DD, SS, GIR, PA), skill result variables (birdies, eagles, P3A, P4A, P5A), and season outcome variables (OM, SA, T10, 60SA, RUP), and the dependent variable was categorized either as no victory/victory or as no victory/one victory/multiple victories.
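The interplay of learning rate and momentum can be illustrated on a one-parameter toy objective. The sketch below is a simplified assumption: it performs gradient ascent with momentum on a concave function with a single peak, with fixed learning rate and no random restarts, and is not the SPSS multilayer perceptron training routine. All names and constants are hypothetical.

```python
def maximize_with_momentum(gradient, start, learning_rate=0.1,
                           momentum=0.9, steps=500):
    """Gradient ascent with momentum on a single parameter.

    Each update combines the current slope (scaled by the learning rate)
    with the previous update (scaled by the momentum).
    """
    w = start
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + learning_rate * gradient(w)
        w += velocity
    return w


# Toy objective f(w) = -(w - 3)^2, whose highest point is at w = 3;
# its gradient is -2 (w - 3).
w_hat = maximize_with_momentum(lambda w: -2.0 * (w - 3.0), start=0.0)
print(w_hat)
```

On an objective with several peaks, running the same routine from different random starting points and keeping the best result is the usual way to reduce the risk of stopping at a local maximum.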
Results
Influence of Skill Variable on Achieving Victory
The group to which an athlete belongs can be predicted using different models. Table 2 addresses this question for the probability of victory of an LPGA player, whether a rookie or a veteran. It reports the results of the four prediction models with victory (Yes/No) as the dependent variable and the skill variables (DA, DD, SS, GIR, and PA) as the independent predicting variables.
Table 2
The discriminant function was significant, with a Wilks' λ test statistic of 0.883 (p < 0.001). The classification accuracy of this discriminant function was 74.1%, and the importance of the prediction variables was in the order of SS < DA < DD < PA < GIR. The validity of the second model, the classification tree, was described by risk estimates. The misclassification rate was 26.4% for the training data and 27.2% under cross-validation. The misclassification rate is the number of misclassified cases divided by the total number of cases, (59 + 335)/1500, and the total classification accuracy of this model was 73.7%. The importance of the prediction variables was in the order of DA < SS < DD < PA < GIR.
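The step from misclassified counts to classification accuracy is simple arithmetic, sketched here with the counts quoted above for the classification tree. The function name is hypothetical.

```python
def risk_and_accuracy(misclassified_counts, n_total):
    """Return (misclassification rate, classification accuracy)."""
    risk = sum(misclassified_counts) / n_total
    return risk, 1.0 - risk


# Misclassified counts and sample size quoted in the text for the tree model.
risk, accuracy = risk_and_accuracy([59, 335], 1500)
print(f"accuracy = {accuracy:.1%}")
```

The same function applies to the other two tree models by substituting their misclassified counts.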
In the goodness-of-fit test of the third model, the binomial logistic regression model, the model fit better than the base model, with a significant chi-square (χ²) of 186.83 (p < 0.001). The classification accuracy of this model was 74.2%, and the importance of the predicting variables was in the order of SS < DA < DD < PA < GIR.
The goodness-of-fit of the fourth model, the artificial neural network model, was determined by the area under the receiver operating characteristic (ROC) curve (AUC); the closer the AUC is to 1, the better the model. The AUC of this model was 0.736 for both categories, the group with winning experience and the group without it. The more accurate the prediction, the further the ROC curve rises above the 45° line, which corresponds to random classification and has an AUC of 0.5.
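One way to see why an AUC of 0.5 corresponds to random classification is to compute the AUC as the share of (winner, non-winner) pairs in which the winner receives the higher predicted probability, counting ties as one half. The sketch below uses this pairwise definition; the function name and the scores are made up for illustration.

```python
def auc(scores_positive, scores_negative):
    """AUC as the fraction of positive/negative pairs ranked correctly,
    with ties counted as half a correct ranking."""
    correct = 0.0
    for p in scores_positive:
        for q in scores_negative:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(scores_positive) * len(scores_negative))


# Made-up predicted probabilities of victory, NOT model output from the study.
print(auc([0.9, 0.8], [0.1, 0.2]))   # perfect separation
print(auc([0.5, 0.5], [0.5, 0.5]))   # no separation at all
print(auc([0.8, 0.4], [0.6, 0.2]))   # partial separation
```

Because the pairwise share is symmetric in the two groups for a two-category outcome, the same AUC is obtained whichever category is treated as the positive one.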
Thus, the AUC lies between 0.5 and 1.0 for a model superior to random classification, and approaches 1 for a more accurate model. Between the input and hidden layers, the weighted values of the predicting variables were passed through the hyperbolic tangent function. Once the hidden layer was formed, the weighted values of the hidden-layer units were passed through the softmax function between the hidden and output layers, so that the probability for each category (Yes/No) of the dependent variable ranged from 0 to 1 and the probabilities summed to 1.0; these probabilities served as the classification criterion for the groups.
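The forward pass through a three-layer network with a hyperbolic tangent hidden layer and a softmax output layer can be sketched as follows. The number of hidden units, all weights and biases, and the standardized input values are hypothetical and chosen only to make the example runnable.

```python
import math


def forward(x, hidden_weights, hidden_biases, output_weights, output_biases):
    """Return category probabilities for one case.

    Hidden units apply tanh to a weighted sum of the inputs; the output
    layer applies softmax to weighted sums of the hidden activations.
    """
    hidden = [math.tanh(b + sum(w * v for w, v in zip(ws, x)))
              for ws, b in zip(hidden_weights, hidden_biases)]
    raw = [b + sum(w * h for w, h in zip(ws, hidden))
           for ws, b in zip(output_weights, output_biases)]
    top = max(raw)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in raw]
    total = sum(exps)
    return [e / total for e in exps]


# Five standardized skill variables (DA, DD, SS, GIR, PA); made-up values.
x = [0.5, -1.2, 0.3, 1.1, -0.7]
hidden_weights = [[0.2, 0.1, -0.3, 0.8, -0.6],
                  [-0.4, 0.3, 0.1, -0.5, 0.7]]
hidden_biases = [0.1, -0.2]
output_weights = [[1.5, -1.0],    # category "victory"
                  [-1.5, 1.0]]    # category "no victory"
output_biases = [0.0, 0.0]

probabilities = forward(x, hidden_weights, hidden_biases,
                        output_weights, output_biases)
print(probabilities)
```

The case is assigned to whichever category has the larger probability, which for two categories is equivalent to a 0.5 cutoff.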
The classification accuracy rate from these repeated processes was 75.3%. The importance of predicting variables in this model was in the order of SS < DD < DA < PA < GIR. To sum up, artificial neural network analysis showed a higher prediction accuracy rate than the other three models; the prediction accuracy rates were in the order of classification tree model (73.7%) < discriminant model (74.1%) < binomial logistic regression model (74.2%) < artificial neural network model (75.3%). Moreover, the predicting variables that were most significant for determining victory were GIR and PA in all four prediction models (Table 2).
Influence of Skill Result on Victory
If an LPGA player needs to determine the possibility of victory in a tour, the results in Table 3 provide this information. Table 3 presents the results of the four prediction models, with victory (Yes/No) as the dependent variable and the skill result variables (birdies, eagles, P3A, P4A, and P5A) as the independent predicting variables. The discriminant model discriminated between the groups to which each participant belonged, using the coefficient values of the discriminant function. This discriminant function was significant, with a Wilks' λ test statistic of 0.879 (p < 0.001).
The classification accuracy of this discriminant function was 74.1%, and the importance of the predicting variables was in the order of eagles < P4A < P3A < P5A < birdies. The validity of the second model, the classification tree model, was described by the risk estimate. The misclassification rate of this model was 25.6% for the training data and 26.1% under cross-validation. The misclassification rate is the number of misclassified cases divided by the total number of cases, (112 + 272)/1500, and the total classification accuracy of this model was 74.4%. The importance of predicting variables was in the order of eagles < P5A < P4A < P3A < birdies. The binomial logistic regression model fit better than the base model, with a significant chi-square (χ²) of 188.04 (p < 0.001).
The classification accuracy of this model was 74.3%, and the importance of predicting variables was in the order of P4A < eagles < P3A < P5A < birdies. In the goodness-of-fit test of the artificial neural network analysis, the AUC, the area under the ROC curve, was 0.733 for both groups: the group with winning experience and the group without any experience of victory. An AUC of 0.5 corresponds to random classification; a model superior to random classification has an AUC between 0.5 and 1.0, and the model improves as the AUC approaches 1. Once the hidden layer was formed, the weighted values of the hidden-layer units were passed through the softmax function between the hidden and output layers, so that the probability for each category (Yes/No) of the dependent variable ranged from 0 to 1 and the probabilities summed to 1.0; these probabilities served as the classification criterion for the groups.
The classification accuracy rate from these repeated processes was 75.7%. The importance of predicting variables in this model was in the order of eagles < P3A < P5A < P4A < birdies. To sum up, artificial neural network analysis showed a higher prediction accuracy rate than the other three models, in the order of discriminant model (74.1%) < binomial logistic regression model (74.3%) < classification tree model (74.4%) < artificial neural network model (75.7%). Moreover, the predicting variable that was most important in determining victory was birdies in all four prediction models (Table 3).
Influence of the Season Outcome on Victory
The data in Table 4 help a player determine the possibility of victory during the LPGA tour. Table 4 presents the results of the four prediction models, with the season outcome variables (OM, SA, T10, 60SA, and RUP) as the predicting variables and victory (Yes/No) as the dependent variable.
The discriminant model discriminated between the groups to which each participant belonged, based on the coefficient values of the discriminant function. This discriminant function was significant, with a Wilks' λ test statistic of 0.717 (p < 0.001). The classification accuracy of this discriminant function was 78.5%, and the importance of the predicting variables was in the order of SA < RUP < 60SA < OM < T10. The validity of the second model, the classification tree model, was described by risk estimates. The misclassification rate of the model was 20.3% for the training data and 21.3% under cross-validation. The misclassification rate is the number of misclassified cases divided by the total number of cases, (137 + 167)/1500, and the total classification accuracy of this model was 79.7%. The importance of predicting variables was in the order of 60SA < RUP < SA < OM < T10. In the goodness-of-fit test of the binomial logistic regression model, the model fit better than the base model, with a significant chi-square (χ²) of 477.262 (p < 0.001).
The classification accuracy of this model was 78.7%, and the importance of the predicting variables was in the order of 60SA < SA < RUP < T10 < OM. In the goodness-of-fit test of the artificial neural network analysis, the AUC was 0.844 for both groups: the group with winning experience and the group without any winning experience. An AUC of 0.5 corresponds to random classification; a model superior to random classification has an AUC between 0.5 and 1.0, and the model improves as the AUC approaches 1. Once the hidden layer was formed, the weighted values of the hidden-layer units were passed through the softmax function between the hidden and output layers, so that the probability for each category (Yes/No) of the dependent variable ranged from 0 to 1 and the probabilities summed to 1.0. These probabilities served as the classification criterion for the groups.
The classification accuracy rate from these repeated processes was 80.2%. The importance of predicting variables in this model was in the order of 60SA < RUP < T10 < SA < OM. To sum up, the artificial neural network analysis showed a higher prediction accuracy rate than the other three models, in the order of discriminant model (78.5%) < binomial logistic regression model (78.7%) < classification tree model (79.7%) < artificial neural network model (80.2%). Moreover, the predicting variables that were most significant in determining victory were T10 and OM in the discriminant model and classification tree, and OM, T10, and SA in the binomial logistic regression model and artificial neural network model (Table 4).
Test of Mean Difference of Classification Accuracy of Prediction Models
Table 5 identifies the best of the four prediction models in terms of classification accuracy. It shows the mean difference in the classification accuracy of the statistical models arising from the change in the number of independent variables according to the change in the dependent variable level (2 or 3). The mean difference in classification accuracy was tested by one-way ANOVA and was significant (p < 0.05). A post-hoc test was therefore necessary to locate the differences between the prediction models. The LSD post-hoc test showed that the artificial neural network model had higher classification accuracy than the other three models.
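The test statistics behind a one-way ANOVA with an LSD post-hoc comparison can be sketched as follows. The groups are made-up numbers, not the classification accuracies from Table 5, and the function names are hypothetical; converting the statistics to p-values additionally requires the F and t distributions, which this standard-library sketch leaves out.

```python
import math
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, ms_within) for a one-way ANOVA."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    ms_within = ss_within / df_within
    f_stat = (ss_between / df_between) / ms_within
    return f_stat, df_between, df_within, ms_within


def lsd_t(group_a, group_b, ms_within):
    """LSD post-hoc t statistic: pairwise mean difference scaled by the
    pooled within-group mean square from the ANOVA."""
    standard_error = math.sqrt(ms_within * (1 / len(group_a) + 1 / len(group_b)))
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up groups for illustration only.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w, ms_w = one_way_anova(toy_groups)
t_stat = lsd_t(toy_groups[2], toy_groups[0], ms_w)
print(f_stat, df_b, df_w, t_stat)
```

The LSD comparison reuses the within-group mean square and its degrees of freedom from the overall ANOVA, which is why it is run only after the overall F test is significant.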
Table 3
Table 4
Note: *p < 0.05, **p < 0.01, ***p < 0.001. ROC: receiver operating characteristic; IV: independent variable; SDFC: standardized discriminant function coefficient; NI: normalization importance; T10: top 10 finish%; OM: official money; 60SA: 60-strokes average; RUP: rounds under par; SA: scoring average.
Table 5
Discussion
The purpose of this study was to find the best model, in terms of classification accuracy, among four prediction models using the annual average performance variable data of LPGA players ranked within the top 60 over 25 seasons, and to compare the importance of the predicting variables for victory across the four prediction models (Dodson et al., 2008; McGarry et al., 2002). We found that, first, the artificial neural network model showed a higher prediction rate than the other three models when the independent variable was a skill variable and the dependent variable was the achievement of victory (Almassri et al., 2018; Jida and Jie, 2015). The prediction rate was in the order of the classification tree (73.7%) < discriminant model (74.1%) < binomial logistic regression model (74.2%) < artificial neural network model (75.3%). The most important predicting variables for determining victory were GIR and PA in all four prediction models.
Second, the artificial neural network model showed a higher prediction rate than the other three models when the independent variable was the skill result and the dependent variable was victory. The prediction rate was in the order of the discriminant model (74.1%) < binomial logistic regression model (74.3%) < classification tree model (74.4%) < artificial neural network model (75.7%). Moreover, the most important predicting variable for determining victory was birdies in all four prediction models.
Third, the artificial neural network model showed a higher prediction rate than the other three models when the independent variable was the season outcome and the dependent variable was victory. The prediction rate was in the order of the discriminant model (78.5%) < binomial logistic regression model (78.7%) < classification tree model (79.7%) < artificial neural network model (80.2%). The most important predicting variables for determining victory were T10 and OM in the discriminant and classification tree models, and OM, T10, and SA in the binomial logistic regression and artificial neural network models. To sum up the above three results, a player who aims for victory in the LPGA should create birdie chances at each hole by improving GIR and PA, along with driving distance and driving accuracy among the skill variables, thereby lowering the average strokes. This will increase the probability of finishing in the top 10 as well as of winning each competition.
Fourth, one-way ANOVA was conducted to find the best model in terms of the classification accuracy of the four prediction models and to test the mean difference in classification accuracy arising from the change in the number of independent variables according to the change in the dependent variable level (2 or 3). The LSD post-hoc test showed that the artificial neural network model had higher classification accuracy than the other three models. We can conclude that the artificial neural network model was superior when comparing the classification accuracy rates of the prediction models. This is consistent with the results of another study using neural networks when the sports disciplines considered were basketball, soccer, and tennis (Chae et al., 2018). Future research can supplement the data for predicting variables and quantify factors such as mental strength and teamwork, which are difficult to measure, to achieve an optimum combination of predicting variables.
Conclusions
The first practical implication is that the artificial neural network model can be used to predict the probability of victory in the LPGA with more meaningful results. The second implication is that a player aiming at victory in the LPGA tour should arrange the training schedule around DD, DA, SS, GIR, and PA. Furthermore, birdies are the most important skill result variable affecting victory, as all four prediction models indicated birdies as the most important variable. Thus, more time can be spent establishing a strategy for improving birdie performance.