Pytups, functional programming & data wrangling in optimization¶
Every now and then, I get a sudden urge to learn another programming language. It goes away after about a week of reading and setting up a development environment on my computer. It happened last year with Kotlin and Scala. This time it happened with Rust.
Usually, I get really excited until I realize that whatever language I'm learning does not include the libraries I use and love in python. Often it's libraries to produce pretty charts (e.g., plotnine) or to call efficient graph algorithms (e.g., graph-tool, networkx). But, if I'm being honest, I think the one I would miss the most, especially as it relates to algorithms and mathematical models, is pytups.
It seems Rust has its own graph libraries and plotting libraries, even if the latter are not, at first look, nearly as easy to use as ggplot or plotnine.
What pytups is¶
Full disclosure
I'm the creator and maintainer of pytups.
Pytups is in essence a very lightweight Swiss-army knife for python array-like operations that provides most of the methods you need when working with lists and dictionaries. Its two main classes, TupList and SuperDict, can be used as safe replacements for python lists and dictionaries, respectively. They allow chaining select, map, filter, mutate, join, group_by and summarize operations, similar to what you would do with a dplyr table in R or a dataframe in polars.
Pytups is not more efficient than writing your own list/dict comprehensions in python. In fact, that's most of what happens behind the scenes anyway. But it does provide a clean syntax that covers 95% of vector operations while not adding any boilerplate.
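As a rough illustration in plain python (not pytups itself), the value-map and value-filter style that pytups wraps corresponds to comprehensions like these:

```python
data = {"a": 1, "b": 2, "c": 3}

# a value-map (what pytups calls vapply) is roughly a dict comprehension
doubled = {k: v * 2 for k, v in data.items()}

# a value-filter (vfilter) is roughly a filtering comprehension
big = {k: v for k, v in doubled.items() if v > 3}

print(big)  # {'b': 4, 'c': 6}
```

Pytups lets you chain these steps as methods instead of nesting or naming each intermediate comprehension.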
To install it, run:
pip install pytups
Why make another library when dataframes do the job?¶
Often, people use a dataframe-like library (like pandas, polars or even SQL) to prepare data to feed to their optimization problem. This has many limitations:
- These libraries are very verbose and not very functional.
- They incur overhead on each operation (especially pandas).
- They are large libraries or require large dependencies.
- They impose a somewhat strict API.
- They often require moving back and forth from dataframe objects to python objects.
The biggest benefit (performance) is not critical for the pre-processing or post-processing steps of most optimization models: usually the amount of data is not very big and most of the work will be done at a different time, when a model or heuristic will be executed.
pytups is really just python code and python functions so there's very little you cannot do. I show below a couple of examples of what can be done.
An example of grouping, filtering and sorting¶
Imagine you have a dictionary with "orders" and their information. Some orders have an assigned driver ("driver_id") and an assigned sequence ("driver_sequence"). The driver will carry out each order according to its sequence (first sequence #1, then sequence #2, and so forth).
order_info = {
    "order1": {"id": "order1", "driver_id": 5, "driver_sequence": 2},
    "order2": {"id": "order2", "driver_id": 5, "driver_sequence": 1},
    "order3": {"id": "order3", "driver_id": None, "driver_sequence": None},
}
You want to get, for each driver, an ordered list of the orders that they need to do. For our toy example we want this:
driver_orders = {
    5: [
        {"id": "order2", "driver_id": 5, "driver_sequence": 1},
        {"id": "order1", "driver_id": 5, "driver_sequence": 2},
    ]
}
You can do it fairly easily with a for-loop. This is the most straightforward version I can think of:
driver_orders = {}
# iterate over all orders
for order in order_info.values():
    # if the order has no driver, skip it
    if order["driver_id"] is None:
        continue
    if order["driver_id"] not in driver_orders:
        # initialize a list if it's the first order of the driver
        driver_orders[order["driver_id"]] = [order]
    else:
        # add the order to the driver's list
        driver_orders[order["driver_id"]].append(order)
# for each driver, sort their orders according to an attribute
for orders in driver_orders.values():
    orders.sort(key=lambda x: x["driver_sequence"])
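As a side note, the plain-python loop can be tightened a little with collections.defaultdict, which removes the key-existence branch while staying dependency-free:

```python
from collections import defaultdict

order_info = {
    "order1": {"id": "order1", "driver_id": 5, "driver_sequence": 2},
    "order2": {"id": "order2", "driver_id": 5, "driver_sequence": 1},
    "order3": {"id": "order3", "driver_id": None, "driver_sequence": None},
}

# group orders by driver, skipping unassigned ones
driver_orders = defaultdict(list)
for order in order_info.values():
    if order["driver_id"] is not None:
        driver_orders[order["driver_id"]].append(order)

# sort each driver's orders by their sequence
for orders in driver_orders.values():
    orders.sort(key=lambda x: x["driver_sequence"])
```

It is still imperative code, though; the chaining style below expresses the same pipeline declaratively.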
The "pytups way" consists of chaining the operations one after the other:
from pytups import SuperDict

driver_orders = (
    # get the values as a TupList
    SuperDict(order_info).values_tl()
    # filter out orders with no driver
    .vfilter(lambda v: v["driver_id"] is not None)
    # group by attribute (result_col=None keeps the whole object)
    .to_dict(indices=["driver_id"], result_col=None)
    # apply a sort function to each list
    .vapply(lambda v: v.sorted(key=lambda x: x["driver_sequence"]))
)
The same example with pandas¶
I tried to replicate the same example with pandas and failed. I got stuck trying to get a list per driver_id, see below. In any case, it's clear the code is not as clean or functional; e.g., there are several references to my_df.
import pandas as pd
my_df = pd.DataFrame.from_dict(order_info, orient='index').reset_index(drop=True)
my_df[~my_df.driver_id.isnull()].groupby('driver_id')
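For completeness, here is one way (a sketch; there may be more idiomatic options) to finish the pandas version: sort first, then build the per-driver lists from the groupby. Note that the None values may coerce driver_id into a float column, which is exactly the kind of python/pandas round-trip mentioned above:

```python
import pandas as pd

order_info = {
    "order1": {"id": "order1", "driver_id": 5, "driver_sequence": 2},
    "order2": {"id": "order2", "driver_id": 5, "driver_sequence": 1},
    "order3": {"id": "order3", "driver_id": None, "driver_sequence": None},
}

my_df = pd.DataFrame.from_dict(order_info, orient="index").reset_index(drop=True)
# drop unassigned orders, sort by sequence, then collect records per driver
filtered = my_df[~my_df.driver_id.isnull()].sort_values("driver_sequence")
driver_orders = {
    driver: group.to_dict("records")
    for driver, group in filtered.groupby("driver_id")
}
```

Even when it works, the result needs a dict comprehension over groupby output to get back to plain python objects.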
An example of joining¶
Pytups is also useful when you need to join information from two tables (in this case two dictionaries). Let's say we have the following data for drivers:
- driver_info: the general attributes for each driver.
- driver_stats: information coming from a driver plan. "Start" and "end" are minutes since midnight.
driver_info = {
    1: {"id": 1, "max_hours": 8},
    2: {"id": 2, "max_hours": 9},
    3: {"id": 3, "max_hours": 7.5},
}
driver_stats = {
    1: {"start": 500, "end": 1200},
    2: {"start": 700, "end": 1000},
    3: {"start": 740, "end": 1000},
}
You want to check which drivers will do more time than the maximum allowed, and by how much:
from pytups import SuperDict
# we select the column max_hours and convert it into minutes
max_in_min = SuperDict(driver_info).get_property("max_hours").vapply(lambda v: v * 60)
driver_overtime = (
    SuperDict(driver_stats)
    # we calculate the total time for the plan
    .vapply(lambda v: v["end"] - v["start"])
    # we join with and subtract the maximum time
    .kvapply(lambda k, v: v - max_in_min[k])
    # we keep only the drivers that have overtime
    .vfilter(lambda v: v > 0)
)
print(driver_overtime)
This will return {1: 220}, meaning driver 1 did 220 minutes of overtime.
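To double-check the arithmetic in plain python: driver 1 works 1200 - 500 = 700 minutes against an allowance of 8 * 60 = 480, so the overtime is 220 minutes, while drivers 2 and 3 stay under their allowance:

```python
driver_info = {
    1: {"id": 1, "max_hours": 8},
    2: {"id": 2, "max_hours": 9},
    3: {"id": 3, "max_hours": 7.5},
}
driver_stats = {
    1: {"start": 500, "end": 1200},
    2: {"start": 700, "end": 1000},
    3: {"start": 740, "end": 1000},
}

overtime = {}
for driver, stats in driver_stats.items():
    worked = stats["end"] - stats["start"]           # minutes in the plan
    allowed = driver_info[driver]["max_hours"] * 60  # allowance in minutes
    if worked > allowed:
        overtime[driver] = worked - allowed

print(overtime)  # {1: 220}
```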
Pytups also offers a shortcut, called sapply, to join two dictionaries and apply a function across both. On top of that, many of the usual operators (addition, subtraction) are overloaded for SuperDicts so that they "just work". The example above can be re-written as:
from pytups import SuperDict
max_in_min = SuperDict(driver_info).get_property("max_hours").vapply(lambda v: v * 60)
driver_time = SuperDict(driver_stats).vapply(lambda v: v["end"] - v["start"])
driver_overtime = (driver_time - max_in_min).vfilter(lambda v: v > 0)
print(driver_overtime)
Pytups always assumes a left join. If you want a right join, you just swap the order of the objects. If there are missing keys in the right side object, you will get a KeyError. If the right side object is an int, str or float, pytups will broadcast the value.
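A plain-python sketch of the semantics just described (a hypothetical mini re-implementation for illustration, not the pytups source):

```python
def left_join_subtract(left, right):
    """Subtract right from left, key by key, with a left join on left's keys."""
    if isinstance(right, (int, float)):
        # scalar right-hand side: broadcast the value to every key
        return {k: v - right for k, v in left.items()}
    # dict right-hand side: a key missing on the right raises KeyError
    return {k: v - right[k] for k, v in left.items()}

print(left_join_subtract({1: 700, 2: 300}, {1: 480, 2: 540}))  # {1: 220, 2: -240}
print(left_join_subtract({1: 700, 2: 300}, 100))               # {1: 600, 2: 200}
```

Swapping the arguments swaps which object's keys drive the join, which is why a "right join" is just a reversal.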
The same joining with pandas¶
If I wanted to do this in pandas, I would have to do the following, which involves declaring several intermediate objects and going back and forth between pandas and python:
import pandas as pd

my_df = pd.DataFrame.from_dict(driver_info, orient='index').reset_index()
my_df2 = pd.DataFrame.from_dict(driver_stats, orient='index').reset_index()
my_df3 = pd.merge(my_df, my_df2, on='index', how='left')
my_df3['overtime'] = my_df3.end - my_df3.start - my_df3.max_hours * 60
my_df4 = my_df3[my_df3.overtime > 0].to_dict('records')
driver_overtime = {d['id']: d['overtime'] for d in my_df4}
print(driver_overtime)
More examples¶
pytups is particularly useful for wrangling data that will then be fed to an API, like the ones used by mathematical modellers (e.g., pulp or pyomo).
pytups + pulp + MIP
If you're interested in how pytups helps model MIP problems easily, here is an example modelling a simple scheduling problem with pytups + pulp. The MIP model itself takes 7 lines of actual code.
Conclusion¶
Pytups is a lightweight, production-ready library that does multi-dimension data wrangling and works particularly well with the pulp library to create optimization models. It can be considered an alternative to pandas or polars when the amount of data is small and you care more about using native python objects.