Optimal Soccer Position Predictor

Hyun Woo (Harry) Choi, Shoudao (Kenny) Gao, Timothy Ng, Danny Suradja

EA Sports FC 24 complete player dataset

Research Focus

  • Assumptions:
    • Statistics reflect real athlete performance.
    • Athletes play in their optimal positions.
  • Goal:
    • Predict the optimal position of a new player based on their statistics.

Data Cleaning

  • Removed duplicate players (the same player appears in multiple FIFA games)
  • Kept each player's primary position as the label
  • Removed positions with low frequencies (left/right wing back, center forward) and positions with missing data (goalkeepers)
  • Standardized the numerical data
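The standardization step can be sketched as a z-score transform. This is a minimal Python illustration, not the original code (the slides indicate the analysis was done in R). The function name `standardize` and the sample heights are hypothetical, and the slides do not say whether the sample or the population standard deviation was used; the population version is assumed here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical heights in cm, for illustration only.
heights_cm = [170, 180, 190]
z_heights = standardize(heights_cm)
```

After this transform every numerical column has mean 0 and standard deviation 1, so coefficients fitted on different statistics are on a comparable scale.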

Position Differences

From https://themastermindsite.com/2019/04/10/4-3-3-vs-4-1-4-1-tactical-flexibility/

Data Balance

Position Foot Preferences

Skills by Position

Correlations

Statistical Methods

  • Linear and Quadratic Discriminant Analysis
    • Supervised learning
    • Prediction based on lower-dimensional “scores”
  • Random Forest Model
    • Supervised learning, ensemble method (bagging)
    • Predictions based on splitting features
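The bagging idea behind the random forest can be sketched in two pieces: each tree is trained on a bootstrap sample of the rows, and the forest predicts by majority vote over the trees. The Python below is only an illustration of those two pieces. The helper names and the vote list are hypothetical, no actual trees are fitted, and the per-split feature subsampling that random forests also use is not shown.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine per-tree predictions into a single predicted label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for row indices of the training data
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions from five trees for one player.
tree_votes = ["RM", "LW", "RM", "LW", "RM"]
predicted = majority_vote(tree_votes)
```

Here three of the five hypothetical trees vote RM, so the aggregated prediction is RM.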

LDA Results

Our linear discriminant analysis model achieved an overall test accuracy of 0.6976479.

Linear Discriminants

                              LD1    LD2
preferred_footRight         0.052  0.171
height_cm                  -0.038  0.179
weight_kg                  -0.031  0.177
pace                        0.015 -0.130
defending                  -1.934 -0.412
shooting                    0.827  0.472
passing                    -0.113 -0.413
dribbling                   0.107  1.398
skill_ball_control          0.072 -0.222
skill_dribbling             0.107 -1.297
mentality_positioning       0.427 -0.294
skill_fk_accuracy          -0.105 -0.044
attacking_crossing         -0.058 -0.550
attacking_heading_accuracy  0.128  0.894

LDA Separation Example

Recall that the first linear discriminant (LD1) had a large negative coefficient for defending (-1.934) and a large positive coefficient for shooting (0.827).
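To see how these coefficients separate players, an LD1 score can be computed as a weighted sum of standardized statistics. The Python sketch below uses the LD1 column from the Linear Discriminants table. The two players and their standardized values are hypothetical, statistics not listed for a player are taken to be at the mean (0), and any centering offset applied by the fitted model is ignored.

```python
# LD1 coefficients copied from the Linear Discriminants table.
LD1 = {
    "preferred_footRight": 0.052,
    "height_cm": -0.038,
    "weight_kg": -0.031,
    "pace": 0.015,
    "defending": -1.934,
    "shooting": 0.827,
    "passing": -0.113,
    "dribbling": 0.107,
    "skill_ball_control": 0.072,
    "skill_dribbling": 0.107,
    "mentality_positioning": 0.427,
    "skill_fk_accuracy": -0.105,
    "attacking_crossing": -0.058,
    "attacking_heading_accuracy": 0.128,
}


def ld_score(coefficients, standardized_stats):
    """Weighted sum of standardized statistics; missing statistics count as 0."""
    return sum(
        coef * standardized_stats.get(name, 0.0)
        for name, coef in coefficients.items()
    )


# HYPOTHETICAL players, described only by standardized defending and shooting.
defender_like = {"defending": 1.5, "shooting": -1.0}
attacker_like = {"defending": -1.5, "shooting": 1.0}

ld1_defender = ld_score(LD1, defender_like)   # -1.934*1.5 + 0.827*(-1.0)
ld1_attacker = ld_score(LD1, attacker_like)   # -1.934*(-1.5) + 0.827*1.0
```

Under these assumptions the defender-like player lands at about -3.728 on LD1 and the attacker-like player at about +3.728, which is the direction of separation the coefficients suggest.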

QDA Results

Our Quadratic Discriminant Analysis model achieved an accuracy of 0.7024947.

Random Forest Results

Our random forest classifier achieved an accuracy of 0.7141126.

Grouped Positions

We grouped the different positions based on role similarity.
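Grouping changes what counts as a correct prediction: a position-level miss can still be a group-level hit. The Python sketch below illustrates this. The slides do not list the actual groups, so the mapping here is invented for illustration, as are the labels and predictions. The slides also do not say whether the models were refit on group labels; this sketch simply maps position labels to groups.

```python
# HYPOTHETICAL grouping; the slides do not state the actual groups.
POSITION_TO_GROUP = {
    "LW": "wide", "RW": "wide", "LM": "wide", "RM": "wide",
    "CB": "defense", "LB": "defense", "RB": "defense",
}


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(t == p for t, p in pairs) / len(pairs)


# HYPOTHETICAL true and predicted positions for four players.
true_pos = ["LW", "RM", "CB", "LB"]
pred_pos = ["LM", "RM", "CB", "CB"]

position_acc = accuracy(true_pos, pred_pos)
group_acc = accuracy(
    [POSITION_TO_GROUP[p] for p in true_pos],
    [POSITION_TO_GROUP[p] for p in pred_pos],
)
```

In this toy case two of four position predictions are correct, but all four fall in the correct group, which is why group accuracy is never lower than position accuracy under a fixed mapping.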

LDA for the Groups

Our linear discriminant analysis model achieved a group accuracy of 0.7933001.

QDA for the Groups

Our quadratic discriminant analysis model achieved a group accuracy of 0.8037063.

Random Forest for the Groups

Our random forest model achieved a group accuracy of 0.8137562.

Discussion: Model Performance

Table 1 contains the test accuracies presented in the previous slides.

The random forest model outperformed LDA and QDA in both position and group predictions.

Table 1: Test accuracies across the different models for the different predictions

            LDA        QDA        Random Forest
Position    0.6976479  0.7024947  0.7141126
Group       0.7933001  0.8037063  0.8137562
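The comparison in Table 1 can be checked with a few lines of arithmetic. The Python sketch below uses only the accuracies reported in Table 1; the variable names are illustrative.

```python
# Test accuracies copied from Table 1.
accuracies = {
    "LDA": {"Position": 0.6976479, "Group": 0.7933001},
    "QDA": {"Position": 0.7024947, "Group": 0.8037063},
    "Random Forest": {"Position": 0.7141126, "Group": 0.8137562},
}

# Best model for each prediction task.
best_position = max(accuracies, key=lambda m: accuracies[m]["Position"])
best_group = max(accuracies, key=lambda m: accuracies[m]["Group"])

# Accuracy gained by predicting groups instead of individual positions.
gains = {
    model: round(acc["Group"] - acc["Position"], 7)
    for model, acc in accuracies.items()
}
# gains: LDA 0.0956522, QDA 0.1012116, Random Forest 0.0996436
```

Every model gains roughly ten percentage points when predicting groups, and the spread between models is under two percentage points on either task.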

Discussion: Random Forest Features

The random forest implementation in R also reports which features are most important for distinguishing the positions.

Discussion: Random Forest Feature Importance by Position

We can also look at which features are important for distinguishing specific positions.

Conclusion

  1. The random forest model makes the best predictions for both classification tasks.
  2. Defending appears to be an important statistic for determining optimal position; it has the largest coefficient in magnitude on LD1 (-1.934).
  3. Every model seems to struggle to differentiate wingers (LW, RW) from side midfielders (LM, RM), often predicting the latter.
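The winger versus side-midfielder confusion in point 3 is the kind of pattern a confusion count makes visible. The Python sketch below tallies (true, predicted) pairs. The labels and predictions are hypothetical and chosen only to show the mechanics; they are not the models' actual outputs.

```python
from collections import Counter

WINGERS = {"LW", "RW"}
SIDE_MIDFIELDERS = {"LM", "RM"}

# HYPOTHETICAL true and predicted positions for six players.
true_pos = ["LW", "LW", "RW", "LM", "RM", "LW"]
pred_pos = ["LM", "LM", "RM", "LM", "RM", "LW"]

# Count how often each (true, predicted) pair occurs.
confusion = Counter(zip(true_pos, pred_pos))

# Number of true wingers that were predicted as side midfielders.
wingers_as_mids = sum(
    count
    for (true_label, pred_label), count in confusion.items()
    if true_label in WINGERS and pred_label in SIDE_MIDFIELDERS
)
```

In this toy data three of the four true wingers are predicted as side midfielders, mirroring the direction of the error described above.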