How would you characterize "optimization data?"

Question

We often hear that in practice, not enough data of sufficient quality, consistency, recency, etc. is available for feeding into mathematical optimization models. Example: my university wanted to plan/optimize their weekly timetable using an integer program, but they did not know the number of rooms (let alone capacities, availabilities, equipment, location, etc.), they did not know the preferences of professors, nor which courses they actually taught (the system listed them as "responsible" for a course which did not imply that they were actually teaching that course!); they didn't know the number of students to expect in a course. I could contribute a lot of such stories.
Now, many companies (truthfully) claim that they collect data, e.g., sensor data from production, temperatures, filling rates, number of faulty products per hour, web clicks, customer orders, energy prices, etc. I can't really grasp what makes us reject such data as not "suitable" for optimization, and I am looking for a definition of what "different kind of data" needs to be collected in order to feed a typical mathematical program for, e.g., timetabling, production planning, facility layout, or designing tariff zones. I thought for a while that the notion I am looking for is "actionable", but this doesn't capture it. Ideally, I would like to contrast this "optimization data" with data that is typically fed into machine learning algorithms (which extrapolate, cluster, predict, find trends, anomalies, patterns, etc.).
What would you call the number of students in a course, the availabilities of teachers, the capacities of rooms, or the fact that a course belongs to a certain curriculum?

tcokyasar · Answer

I think the answers provided so far are great. When talking to professionals in the field, I second Nikos and call them "parameters" and cross my fingers they know the difference between a parameter and a variable (which is a bleeding wound between the OR profession in Industrial Engineering and the OR profession in Business Administration). On the other hand, practitioners usually have a hugely different understanding of what "data" means. They call every number or descriptive text gathered "data," which is hard to argue with, as they are right by a definition of data that goes something like: "Come, come, whoever you are [Mevlana Jelaluddin Rumi]."
To kindly express the notion in my mind ("I cannot do this so-called optimization stuff without you giving me the right data, dude!"), I would just tell the practitioner "I need the problem data" and define what I mean by it. Have I been successful so far? I don't know, I am only 30... I need to collect more data to answer whether this was a successful approach :)

Nikos Kazazakis · Answer

Adhering to the rules of encapsulation, I would simply call it "parameters". If we're thinking of an optimisation model and, as you said, what changes is the number of things (number of students, number of classrooms, a table with the teachers' schedule, etc.), that's what we usually call parameters in optimisation modelling, so I don't see a reason to use a different term.
If we wanted to make the name more descriptive, I would attach a problem-specific prefix there, e.g., "planning parameters".
I like this term because it indicates that the math would be the same (assuming that's the case here) even if those numbers change.
I would avoid the word "data" because it's too broad - we also use "data" to formulate the math.
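
To make the "same math, different numbers" point concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class `PlanningParameters`, the function `assign_rooms`, the toy objective (fewest unused seats), and all room and course names are hypothetical and not taken from any real timetabling system. The model logic lives in one function, and the parameters are passed in as a separate object.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class PlanningParameters:
    """Instance data only. All names and numbers are illustrative."""
    room_capacity: dict      # room name -> number of seats
    expected_students: dict  # course name -> expected enrolment


def assign_rooms(params):
    """Give every course its own room so that each course fits,
    minimising the total number of unused seats.

    Brute force over all assignments, so only suitable for toy sizes.
    Returns a dict course -> room, or None if no feasible assignment exists.
    """
    courses = sorted(params.expected_students)
    rooms = sorted(params.room_capacity)
    best, best_waste = None, None
    for chosen in permutations(rooms, len(courses)):
        slack = [params.room_capacity[r] - params.expected_students[c]
                 for c, r in zip(courses, chosen)]
        if min(slack, default=0) < 0:
            continue  # at least one course does not fit in its room
        waste = sum(slack)
        if best_waste is None or waste < best_waste:
            best, best_waste = dict(zip(courses, chosen)), waste
    return best


rooms = {"A101": 30, "B202": 120, "C303": 60}

# Two instances of the same model: only the parameters differ.
spring = PlanningParameters(rooms, {"Algebra": 100, "Logic": 25})
autumn = PlanningParameters(rooms, {"Algebra": 100, "Logic": 45})

print(assign_rooms(spring))  # {'Algebra': 'B202', 'Logic': 'A101'}
print(assign_rooms(autumn))  # {'Algebra': 'B202', 'Logic': 'C303'}
```

Nothing in `assign_rooms` changes between the two calls. If the university cannot supply `room_capacity` or `expected_students`, the function simply has nothing to work with, which is the gap the question describes.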

prubin · Answer

I would just call it "planning data". I think it might be easier to convince an administrator that "planning data" needs to be recorded/captured than to sell them on "<insert techno-jargon phrase here> data". Administrators grasp what planning is (whether or not they are adept at doing it), and at some visceral level they probably realize that not planning is bad (which might make them a bit more inclined to make an effort to collect the data). If "planning data" does not do much to distinguish this sort of data from other sorts (salary data, student/faculty ratio, ...), perhaps that's a good thing. They already understand the relevance of the other data, and that it needs to be collected, so by association they might realize this data is also important.

Answered by prubin on August 19, 2021

user3680510 · Answer

I would call them decision-relevant data, because most optimization problems in practice help people make better decisions, which they already make in a heuristic fashion. This puts the focus on the decision and on what is needed to make it effectively.
Alternatives would be system-describing data or system-boundary data, because the data defines the boundaries of the feasible states of the system, i.e., the boundaries of the feasible decisions.
The data in machine learning, on the other hand, I would call historical observational data, because you often have observed states of the system from the past.
I find it difficult to draw a line between data for optimization and data for machine learning, because the same data can often be used for both.
In your example with the timetable for the university courses, you might not have the capacity of each room, but instead have the average number of students per room for each day of the last year. At first glance this looks like machine learning data, but you could use it to derive an estimate of the capacity and feed that estimate to your optimization model.
I agree that purely observational data is often useless for optimization problems, because you only observe feasible states but have no data on how far you can deviate from them or what the effects of such deviations are, and optimizing basically means putting the system in a state that has not been seen before.
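
As a rough illustration of the room-capacity idea above, the following Python sketch turns historical attendance into a capacity parameter. It rests on assumptions not stated in the answer: the function name `estimate_capacity_lower_bounds` and the sample history are made up, and the estimate used is simply the largest attendance ever observed in each room. Because only feasible states are observed, that figure is a lower bound on the true capacity, not the capacity itself.

```python
from collections import defaultdict


def estimate_capacity_lower_bounds(observations):
    """Map each room to the largest attendance ever observed in it.

    observations: iterable of (room, attendance) pairs, e.g. daily averages.
    The result never exceeds the true capacity, so it is a conservative
    stand-in for the missing capacity parameter.
    """
    bounds = defaultdict(int)
    for room, attendance in observations:
        bounds[room] = max(bounds[room], attendance)
    return dict(bounds)


# Hypothetical historical records: (room, observed attendance).
history = [
    ("A101", 22), ("A101", 27), ("A101", 25),
    ("B202", 96), ("B202", 104),
]

capacity_estimate = estimate_capacity_lower_bounds(history)
print(capacity_estimate)  # {'A101': 27, 'B202': 104}
```

A model fed with these numbers would never schedule 110 students in B202, even if the room actually seats 120. That is the limitation described above: the history says nothing about states beyond those already observed.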
