This exercise began playfully, to see how we could apply some simple R algorithms to predict the result of this Saturday's Champions League final.
Some basic data about player performance and historical team results was first gathered into Excel tables. Initially the data was fairly dirty, with duplicate teams caused by naming differences, and the dataset was relatively small considering the task at hand. After some cleaning the data is much easier to work with, but some useful information can be extracted even from the crude data, as the following images show. They simply represent the relationship between the number of fouls and the number of goals that individual players have accumulated in the championship so far.
Each outer ball represents a player: the size of the ball represents the number of goals scored, and the thickness of the bar connecting the player to the team represents the number of fouls that player has accumulated.
The next diagram shows you a little more detail:
Larger circles represent more goals, and the thickness of the connecting rods represents the number of fouls by that player, and thus a relative handicap reading.
Comparing Facebook fan activity can help to model the likely mood of supporters on the day of the match itself, and could therefore provide insights into attendance patterns.
The following diagram shows an interesting grouping of winning teams based on attendance levels in consecutive years. It also shows a trend line although team dynamics tend to keep these trends fairly short.
This visualization was achieved with very little data, using the kMeansClustering algorithm executed through R in a MicroStrategy metric expression.
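As a rough illustration of the idea, the grouping can be sketched in plain R with the built-in kmeans function. The attendance figures, the team labels and the choice of three clusters are made-up assumptions for this example, not the data or settings behind the diagram, and the MicroStrategy metric expression itself is not reproduced here.

```r
# Hypothetical attendance figures (NOT the article's data): average home
# attendance, in thousands, for six teams in two consecutive seasons.
attendance <- data.frame(
  team     = c("A", "B", "C", "D", "E", "F"),
  season_1 = c(71, 68, 45, 42, 20, 23),
  season_2 = c(73, 70, 44, 46, 22, 21)
)

# k-means starts from random centers, so fix the seed and try several starts.
set.seed(42)
fit <- kmeans(attendance[, c("season_1", "season_2")], centers = 3, nstart = 25)

# Attach the cluster label to each team; cluster numbers are arbitrary labels.
attendance$cluster <- fit$cluster
print(attendance)
```

Teams with similar attendance in both seasons end up sharing a label, which is the kind of grouping the diagram shows.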
There are many ways to visualize the goals and fouls of each player this season.
The following visualizations are designed to quickly show the relationship between fouls and goals, and give you some insight into the fine balancing act between the two variables.
We can then add an out-of-the-box trend line algorithm to provide further insights into foul/goal trends.
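For readers who want to see what such a trend line amounts to, here is a minimal sketch in base R. It assumes the trend is an ordinary least-squares line, which may differ from what the out-of-the-box algorithm actually fits, and the player totals are invented for illustration.

```r
# Hypothetical per-player totals (NOT the article's data).
players <- data.frame(
  fouls = c(2, 5, 8, 12, 15, 20),
  goals = c(9, 7, 6, 4, 3, 1)
)

# Least-squares fit: goals = intercept + slope * fouls
trend <- lm(goals ~ fouls, data = players)
slope <- coef(trend)[["fouls"]]

# Value of the trend line at each player's foul count.
players$trend <- fitted(trend)
print(players)
print(slope)
```

A negative slope would indicate that players who foul more tend to score less in this sample. Calling abline(trend) on a scatter plot of the same data draws the line.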
With the trend line in place, we can examine “what-if” scenarios by filtering out players like Ronaldo to see how this affects the team's capacity to score.
The graph below shows the average goals for each team so far this season, relative to the fouls accumulated to achieve these goals.
You will notice that Atlético's average goal capacity is 0.65 and Real Madrid's is 1.18, so Real Madrid is at an advantage, until we remove Ronaldo for injury or foul play and watch the average line plummet to 0.48!
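The same kind of what-if filter can be sketched in base R. The player names other than Ronaldo, and all the numbers, are placeholders chosen for the example, so the output will not reproduce the 1.18, 0.65 or 0.48 figures quoted above.

```r
# Hypothetical per-player stats (illustrative values only).
stats <- data.frame(
  team   = c("Real Madrid", "Real Madrid", "Real Madrid", "Atlético", "Atlético"),
  player = c("Ronaldo", "Player B", "Player C", "Player D", "Player E"),
  goals  = c(16, 5, 3, 6, 4),
  games  = c(12, 12, 12, 12, 12)
)
stats$goals_per_game <- stats$goals / stats$games

# Average goals per game across the players of each team.
avg_by_team <- function(d) aggregate(goals_per_game ~ team, data = d, FUN = mean)

baseline <- avg_by_team(stats)                               # everyone available
what_if  <- avg_by_team(subset(stats, player != "Ronaldo"))  # Ronaldo filtered out

print(baseline)
print(what_if)
```

Comparing the two tables shows how much of one team's average depends on a single player, while the other team's average is unchanged.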
The above visualizations provide great insights from relatively limited data, but we can take the exploration to another level by introducing advanced R analytics into the mix. The trend lines above use R as follows:
The first step is to ensure R is fully integrated with our MicroStrategy installation, as discussed in a previous blog post.
As you can see from the following screenshot, syntax is critical and the R console is unforgiving:
Survival Analysis using Cox Regression
“On a long enough timeline, the survival rate for everything drops to zero” –Tyler Durden
In the domain of Survival Analysis, the Cox Proportional Hazards model is a commonly used technique that calculates the relative risk of an event occurring as a function of any number of covariates. It’s called Survival Analysis because the “event” typically represents the end of something, such as a component failure, a customer being lost, or any other type of “end of life.” The Cox Regression model quantifies the effect that each independent variable has on the Hazard Rate, or the likelihood that an event, assuming it has not occurred yet, will occur at any point in time. For each record, the model outputs the Hazard Ratio, representing the Hazard Rate for that customer divided by the Hazard Rate for the average customer.
This R Script has two functional modes:
Training creates a model and persists it in an .Rdata file while returning its predictions.
Scoring uses the model created during training to make predictions on a new dataset.
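The two modes can be sketched as follows. This is not the script from the R Script Shelf: it is a minimal stand-in that assumes the survival package's coxph function, a single covariate, and a hypothetical helper named run_survival. The records are invented.

```r
library(survival)  # recommended package included with standard R installations

# Hypothetical records: Time until the event, Status (1 = event observed,
# 0 = censored) and one covariate called Vars.
train <- data.frame(
  Time   = c(5, 8, 12, 3, 9, 15, 7, 11, 4, 14),
  Status = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 1),
  Vars   = c(2.1, 0.8, 1.4, 1.9, 1.2, 0.5, 2.4, 0.9, 1.1, 1.6)
)

# run_survival is a hypothetical helper, not part of any library.
run_survival <- function(data, train_mode, file_name) {
  model_file <- paste0(file_name, ".Rdata")
  if (train_mode) {
    # Training: fit the Cox model and persist it.
    model <- coxph(Surv(Time, Status) ~ Vars, data = data)
    save(model, file = model_file)
  } else {
    # Scoring: restore the object `model` saved during training.
    load(model_file)
  }
  # Risk score of each record relative to the model's reference record.
  predict(model, newdata = data, type = "risk")
}

model_path <- file.path(tempdir(), "Survival")
risk_train <- run_survival(train, train_mode = TRUE, file_name = model_path)

new_records <- data.frame(Time = c(6, 10), Status = c(1, 0), Vars = c(2.0, 1.0))
risk_new <- run_survival(new_records, train_mode = FALSE, file_name = model_path)
print(risk_new)
```

The train_mode and file_name arguments mirror the TrainMode and FileName parameters used in the metric expressions, but the mapping is only an assumption made for this sketch.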
Based on the above theory, we can experiment by predicting future goals or fouls for specific players, or aggregated by team.
The metric expressions shown here assume that the SurvivalAnalysis.R file has been downloaded to the server. If using the URL-based approach where the SurvivalAnalysis.R file is accessed directly via URL, please consult the R Script Shelf.
1) Risk: For each record, returns the risk of an event occurring relative to the average. For instance, a value of 120% means that an event is 20% more likely to occur to this record than a record which had the average values for each independent variable.
If using R Integration Pack V 2.0 with named parameters:
For training, use this metric expression:
RScript<_RScriptFile="Survival.R", _InputNames="Time,Status,Vars", _Params="TrainMode=TRUE, FileName='Survival'">(Time, Status, Vars)
For scoring, use this metric expression:
RScript<_RScriptFile="Survival.R", _InputNames="Time, Status, Vars", _Params="TrainMode=FALSE, FileName='Survival'">(Time, Status, Vars)
If you have trouble deploying the R console extension, make sure you have loaded the MicroStrategy library first:
Then you are ready to deploy with the above command.
OK, let's get back to the predictions!
So far we are still working with dubious data, but some initial most-likely scorers start to emerge using a clustering algorithm known as k-Medoids:
k-Medoids clustering is a useful alternative to the popular k-Means clustering algorithm. Like k-Means, it groups items into k distinct clusters so that items within the same cluster are more similar to each other than items within different clusters. k-Medoids clustering has the advantage over k-Means in that it chooses a prototypical item for each cluster, rather than computing a theoretical mean for each cluster.
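To make the difference concrete, here is a small sketch using the pam function (Partitioning Around Medoids) from R's cluster package. The player totals and the choice of three clusters are assumptions made for the example, and this is not the metric expression used for the visualizations in this post.

```r
library(cluster)  # recommended package included with standard R installations

# Hypothetical per-player totals (NOT the article's data).
players <- data.frame(
  goals = c(16, 14, 17, 5, 6, 4, 1, 0, 2),
  fouls = c(10, 12, 11, 25, 22, 27, 5, 4, 6),
  row.names = paste("Player", 1:9)
)

fit <- pam(players, k = 3)

# Each medoid is an actual row of `players`: the prototypical member of its cluster.
print(fit$medoids)

# For comparison, the per-cluster means that k-Means would describe; these are
# computed averages and need not correspond to any real player.
labelled <- cbind(players, cluster = fit$clustering)
cluster_means <- aggregate(cbind(goals, fouls) ~ cluster, data = labelled, FUN = mean)
print(cluster_means)
```

Reading the medoid rows gives a named player to stand for each group, which is often easier to interpret than a pair of averages.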
k-Medoids clustering is particularly useful when there is a need to understand the nature of each cluster by identifying its prototypical member. Please keep in mind the following best practices for cluster analysis:
I then cleaned the data considerably, to provide cleaner fuel to the machine, before experimenting with further R algorithms. Another visualization was then compiled with the new dataset and a fresh k-Medoids metric expression:
This revealed some rather interesting insights which, I suspect, only football fans will comprehend. I will leave that to you, as I have no prior knowledge of football and am therefore completely free of emotion about the results. I do, however, understand that some of the above characters are not likely scorers.
The following visualization came about by chance and appears to identify the most popular Champions League games so far:
And again, using some simple R adaptations, the most likely contenders for the final:
Now, with the new dataset once more, to pin down the players that will take their team to the title: Real Madrid is not showing excessive strength relative to Atlético, and in fact I would say that Ronaldo is key to propping up their lead with that big green bag of goals in the sky!
This decisive k-Medoids visualization does point towards a Real Madrid win, just for the sheer number of goals conceded to them.
This final diagram is an attempt to shed some light on who will score those fateful goals.
Well... Ronaldo wasn't a surprise, I must say, but Toni Kroos?!
AND ATTENTION!!! … THE WINNER IS…
2-1 to Real Madrid
(unless Ronaldo gets sent off or injures himself…)
Written by Stephane Rodicq