Kaggle Christmas Challenge 2019
This year I participated in the annual Kaggle Christmas Challenge and will illustrate my approaches in this post since the submission deadline has passed. My final model reaches the optimal solution within 37 hours using a linear programming min-cost-flow formulation. A 3% optimality gap is reached within 5 hours.
Problem Introduction
In this year’s problem, we are asked to optimally schedule visitors for Santa’s Workshop in the 100 days before Christmas. Five thousand families have each submitted ten date preferences and must be scheduled so that the number of visitors for each day falls between 125 and 300. If we assign a family to their less desired preferences, we have to offer larger and larger gifts as a penalty. Additionally an accounting cost,
, based on the number of visitors at each day applies. A full problem description is available on the competition page.
The accounting cost is where the main complexity comes from as it is intentionally designed to be non-linear and non-convex which presents a challenge for optimization models.
Attempts
We will first try to ignore, then limit and finally linearize the accounting costs and show the gains in solution quality along the way.
1. Ignoring the Accounting Costs
As an initial attempt we will just ignore the accounting costs and find the assignment penalty optimal solution. Using the sets of families and days , we can formulate the following mixed integer optimization problem:
With decision variable when family is assigned to day , and otherwise. As parameters we are given as the size of family and can precompute as the penalty cost incurred when assigning family to day based on their preference list.
This can be visualized as the partial day-family matrix shown below, where for every column (=family) we need to pick precisely one day, and for every row (=day), we must select between and people while minimizing the sum of our selected assignment penalties .
This model can be easily solved to optimality within seconds using modern solver software. When we evaluate the solution using the full cost function, the costs are enormous, much higher than the provided sample solution: 15 583 511 274. It is evident that just optimizing for the assignment penalty costs is not enough.
In the optimal assignment chart, we can identify a preference for weekends and the last days before Christmas.
2. Limiting the Accounting Costs
For a single day we can reformulate the accounting cost function to , with as the number of visitors assigned to the day and as the difference in visitors from the previous day. When plotting the function we can see that it is particularly sensitive to a change in and is more forgiving at lower values:
As a first measure, we will now extend our first formulation by limiting the delta allowed between two days by a fixed value . We linearize the the absolute value function and add the following constraints:
By experimenting with different values, we determine to yield the best results. The optimal solution for this model is 77 347.70, which is found after approx. 200 seconds, showing a huge improvement over the previous cost of 15 583 511 274.
In the assignment profile we can see that the peaks have been smoothed out:
We can further improve the result by replacing the fixed limit using a function dependent on as a decrease to low visitor levels is hardly penalized. We revisit the cost function plot and draw an approximate piecewise linear function where the accounting costs exceed 120.
We can now limit the depending on in the linear program. Optimizing the model using the piecewise linear function for a time limit of one hour results in a further improvement to 73 338.75. In the assignment graph, we can see that the ramp-up then -down pattern in the peaks to the right has been replaced with going straight back to 125, a move for which we don’t pay any accounting costs.
Solving the computationally expensive piecewise linear problem formulation to optimality would likely improve the result slightly but wouldn’t lead to a global optimum when applying the full cost function. We need to incorporate the accounting cost function in the objective value instead of just limiting the solution space.
3. Linearizing the Accounting Costs
If we were able to linearize the accounting costs, the full problem could be approached using linear programming. Since the accounting cost is somewhat equivalent to a state transition cost, a graph representation of the problem looks suitable. Based on states we could be in for a single day (=the visitor levels), we can transition to a new state for the next day. The accounting costs can then be expressed as a transition cost.
To begin our linearization we first discretize the possible visitor levels for each day. As our visitor limits are between 125 and 300 we get a total of possible states over all dates with transition arcs between them. More formally we define the problem as directed graph , with set containing a vertex for every day and visitor level . The arc set contains an arc for every day , current visitor level and next visitor level . We add source and sink nodes and respectively. For each arc we define a cost parameter based on the accounting cost formula. A partial graph with a limited amount nodes is illustrated below:
The lowest accounting costs can now be found using the shortest path from the start to the finish node using the costs as distance measure. Arcs that exceed a large cost limit can be removed from the graph as they are doubtful to be used. There are no cycles, so this is possible in polynomial time using a dynamic programming approach. However, since we need to integrate the family assignment problem with the accounting cost, we have to choose a min-cost-flow formulation for linear programming instead. We start with the initial family assignment model containing the following constraints:
We now introduce a binary variable indicating if we traverse arc (with i and j each representing a day-level vertex). Sets and contain incoming and outgoing arcs of an vertex respectively. Using these sets, we can add flow conservation constraints to the model, which enforce that we only have an outflow from a node if we had an inflow and that we have outgoing and incoming flows for the source and sink.
Having both the family assignment and the accounting problem within the same model, the only missing step is connecting the two. We introduce linking constraints that require that we assign visitors on the day when there is an incoming flow to vertex .
Together with the sum of the family assignment penalties and the accounting cost the full model is then:
Solving this model requires roughly 12GB of memory due to a large number of arc variables but can be completed within 37 hours on a standard quad-core PC using Gurobi as a solver. A 3% optimality gap was reached after 5 hours and lowered to 1% after 12 hours.
The final (and optimal) total cost identified is 68 888.04343.
In the assignment chart, we see the same development as seen when changing to the piecewise linear model, just much more pronounced. We have several peaks with slow up-ramps and then a direct return to the minimum visitor-level 125.
Final Remarks
I was impressed by how well Gurobi was handling the sizeable min-cost-flow subproblem with 175 000 vertices and 3 million arcs. It likely has special optimizations that detect the min-cost-flow problem part and applies tailored heuristics to it. This is definitively something to remember when modeling complex problems.
Outside of the Kaggle Challenge, even the relatively simple piecewise linear function approach from attempt two would have been reasonably good, having just a 6% gap to the optimal solution. On Kaggle even the optimal solution was only good enough for a silver medal as I was quite late to the party. Definitely a fun challenge, though, and I’m looking forward to next year’s edition.