我有点不适应:我是一名统计学家,主要从事 SAS 和 R 工作。我正在尝试使用 Pulp 在 Python 中编写本质上经典的背包问题的代码。我的数据看起来像这样(这是一个简化版本):
fan.csv 示例数据
id Fan Cost Switch
10001 32.83 3.18 A/B
10003 75.25 4.20 C
10005 71.60 8.79 A/B
10010 24.23 9.99 C/D
10011 98.69 8.88 D
10012 48.81 8.68 C/D
我需要在每个 Switch 类别中选择 1 或 2 个项目(例如 1 个“A”、2 个“D”等)时最大化粉丝,同时保持在成本阈值以下。到目前为止,相当标准。
我的问题是,某些项目可能属于一个或另一个 Switch 类别(请参阅 Switch 值中带有“/”的 obs - 例如,id 10001 可以充当“A”或“B”)。因为我是一名统计学家,所以我认为就 n x k 数据集而言,在这种情况下,我将以长形式设置数据(就好像这些是重复测量数据一样):
id Fan Cost Switch
10001 32.83 3.18 A
10001 32.83 3.18 B
10003 75.25 4.20 C
10005 71.60 8.79 A
10005 71.60 8.79 B
10010 24.23 9.99 C
10010 24.23 9.99 D
10011 98.69 8.88 D
10012 48.81 8.68 C
10012 48.81 8.68 D
但是,我基本上采用了其他人的示例代码来调用 Pulp 优化模型,并根据我的情况进行了调整。 我当前的问题是,我不清楚如何向线性优化模型添加约束,以便每个项目(每个 id)最多只选择一次。这是一个问题,因为我“解压”Switch 变量并列出了具有不同 Switch 值的多个 id。
我的代码出现在下面。在步骤 1 中,我将 fan.csv 数据转换为长格式(我自己编写了此代码)。第 2 步主要是我改编的示例数据。示例代码将数据组织在嵌套字典中,然后调用 Pulp 函数来指定优化模型。
import os
import pandas as pd
from pulp import *
import numpy as np
# Step 1
os.chdir("E:\\")
df01 = pd.read_csv("fan.csv")
df02 = df01[df01['Switch'].str.contains('/')]
df03 = df01[~df01['Switch'].str.contains('/')]
df02[['Swit1','Swit2']] = df02.Switch.str.split("/", expand=True)
df04 = pd.wide_to_long(df02, ["Swit"], i="id", j="Pos")
df04.reset_index(inplace=True)
df04.drop(["Pos"], axis=1, inplace=True)
df05 = pd.concat([df04,df03])
df06 = df05.sort_values(by=['id'])
df06.reset_index(drop=True, inplace=True)
df06.loc[(~df06['Switch'].str.contains('/')), 'Swit'] = df06.Switch
# Step 2
df07 = df06.groupby(["Swit", "id", "Fan", "Cost"]).agg('count')
df07 = df07.reset_index()
costs = {}
fans = {}
for sth in df07.Swit.unique():
df07_sth = df07[df07.Swit == sth]
cost = list(df07_sth[['id', 'Cost']].set_index("id").to_dict().values())[0]
fan = list(df07_sth[['id', 'Fan']].set_index("id").to_dict().values())[0]
costs[sth] = cost
fans[sth] = fan
sth_num_available = {
"A": 1,
"B": 1,
"C": 1,
"D": 2,
}
cost_cap = 25
_vars = {k: LpVariable.dict(k, v, cat='Binary') for k, v in fans.items()}
model1 = LpProblem("Fans", LpMaximize)
fanval = []
costval = []
switch_constraints = []
for k, v in _vars.items():
costval += lpSum([costs[k][i] * _vars[k][i] for i in v])
fanval += lpSum([fans[k][i] * _vars[k][i] for i in v])
model1 += lpSum([_vars[k][i] for i in v]) == sth_num_available[k]
model1 += lpSum(fanval)
model1 += lpSum(costval) <= cost_cap
model1
model1.solve()
我还有 2 个额外的限制要添加。我将其包含在这篇文章中,因为这个问题与最初的文章相关。
扩展后的 fan.csv 示例数据为:
id Fan Cost Switch Hit Group Cat
10001 32.83 3.18 A/B 0 G1 C1
10003 75.25 4.20 C 1 G1 C2
10005 71.60 8.79 A/B 0 G1 C3
10010 24.23 9.99 C/D 1 G2 C1
10011 98.69 8.88 D 1 G2 C2
10012 48.81 8.68 C/D 1 G2 C3
背包解决方案最多可以包含 n id,编码为 1,用于表示来自任何一组的命中。
我尝试针对此约束调整 Reinderien 的代码。这里 n 设置为 2:
df02 = df01
df02.set_index(["id","Group",], drop=False, inplace=True)
for G, total in df02.Assign.groupby(level='Group').sum().items():
model1.addConstraint(name='Group' + G + 'Hit', constraint=df02.Hit.dot(total) <= 2)
但我收到此错误:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Users\Programs\Python\Python311\Lib\site-packages\pandas\core\series.py", line 3015, in dot
if lvals.shape[0] != rvals.shape[0]:
~~~~~~~~~~~^^^
IndexError: tuple index out of range
>>>
此外,背包解决方案必须包含至少 2 个类别(Cat)。
这是我改编的代码:
df04 = df02
df04.reset_index(drop=True, inplace=True)
df04.set_index(["id","Cat",], drop=False, inplace=True)
for category, total in df04.Assign.groupby(level='Cat').sum().items():
model1.addConstraint(name='Cat_' + category, constraint=total >= 2)
运行没有错误并且似乎可以工作。这是最好的方法(或者至少是一个相当好的方法)吗?
================================================
我链接的其他答案使用 scipy.在这里,我演示了 PuLP,因为它就是您所使用的。旁白:
[我主要工作] SAS 和 RR 有自己的线性程序支持,那么为什么要切换到 Python?无论如何,下面演示的主要要点是多重索引使事情变得简单:
import numpy as np
import pandas as pd
import pulp
df = (
pd.read_fwf('fan.csv')
.set_index('id')
.rename(columns={'Switch': 'SwitchOptions'})
)
# arbitrary columns, id index, values are switch string fragments
exploded_switch = df.SwitchOptions.str.split('/', expand=True)
# A, B, C, D
switch_values = pd.unique(exploded_switch.stack())
# ABCD columns, id index, values are True or NaN to indicate switch match
switch_sparse = pd.DataFrame(
(
exploded_switch.values[:, :, np.newaxis]
== switch_values[np.newaxis, np.newaxis, :]
).any(axis=1),
index=df.index,
columns=pd.Index(name='Switch', data=switch_values),
)
switch_sparse[~switch_sparse] = np.nan
# Multi-index to 10 ID/switch values
df = switch_sparse.stack().align(df)[1]
# Reasonable assignment variable names; will be unique
var_names = (
'asn_i'
+ df.index.get_level_values('id').astype(str)
+ '_s'
+ df.index.get_level_values('Switch')
).to_series(index=df.index)
# Assignment variables, named, binary, with same ID/switch multi-index
df['Assign'] = var_names.apply(pulp.LpVariable, cat=pulp.LpBinary)
# Indexed total switch availability series
switch_avail = pd.Series(
index=switch_sparse.columns,
data=(1, 1, 1, 2),
)
# Maximize total Fan via dot product
model = pulp.LpProblem(name='Fans', sense=pulp.LpMaximize)
model.objective = df.Assign.dot(df.Fan)
model.addConstraint(name='cost_cap', constraint=df.Assign.dot(df.Cost) <= 25)
# Constrain assignments by switch availability
for switch, total in df.Assign.groupby(level='Switch').sum().items():
model.addConstraint(name='switch_' + switch, constraint=total <= switch_avail[switch])
# Constrain IDs to be assigned exclusively. Only add constraints where more than one switch is possible.
id_groups = df.Assign.groupby(level='id')
for ID, total in id_groups.sum()[id_groups.count() > 1].items():
model.addConstraint(name=f'excl_id_{ID}', constraint=total <= 1)
print(model)
model.solve()
assert model.status == pulp.LpStatusOptimal
df['Assign'] = df.Assign.apply(pulp.LpVariable.value)
print(df)
Fans:
MAXIMIZE
32.83*asn_i10001_sA + 32.83*asn_i10001_sB + 75.25*asn_i10003_sC + 71.6*asn_i10005_sA + 71.6*asn_i10005_sB + 24.23*asn_i10010_sC + 24.23*asn_i10010_sD + 98.69*asn_i10011_sD + 48.81*asn_i10012_sC + 48.81*asn_i10012_sD + 0.0
SUBJECT TO
cost_cap: 3.18 asn_i10001_sA + 3.18 asn_i10001_sB + 4.2 asn_i10003_sC
+ 8.79 asn_i10005_sA + 8.79 asn_i10005_sB + 9.99 asn_i10010_sC
+ 9.99 asn_i10010_sD + 8.88 asn_i10011_sD + 8.68 asn_i10012_sC
+ 8.68 asn_i10012_sD <= 25
switch_A: asn_i10001_sA + asn_i10005_sA <= 1
switch_B: asn_i10001_sB + asn_i10005_sB <= 1
switch_C: asn_i10003_sC + asn_i10010_sC + asn_i10012_sC <= 1
switch_D: asn_i10010_sD + asn_i10011_sD + asn_i10012_sD <= 2
excl_id_10001: asn_i10001_sA + asn_i10001_sB <= 1
excl_id_10005: asn_i10005_sA + asn_i10005_sB <= 1
excl_id_10010: asn_i10010_sC + asn_i10010_sD <= 1
excl_id_10012: asn_i10012_sC + asn_i10012_sD <= 1
VARIABLES
0 <= asn_i10001_sA <= 1 Integer
0 <= asn_i10001_sB <= 1 Integer
0 <= asn_i10003_sC <= 1 Integer
0 <= asn_i10005_sA <= 1 Integer
0 <= asn_i10005_sB <= 1 Integer
0 <= asn_i10010_sC <= 1 Integer
0 <= asn_i10010_sD <= 1 Integer
0 <= asn_i10011_sD <= 1 Integer
0 <= asn_i10012_sC <= 1 Integer
0 <= asn_i10012_sD <= 1 Integer
Welcome to the CBC MILP Solver
Version: 2.10.3
Build Date: Dec 15 2019
At line 2 NAME MODEL
At line 3 ROWS
At line 14 COLUMNS
At line 73 RHS
At line 83 BOUNDS
At line 94 ENDATA
Problem MODEL has 9 rows, 10 columns and 28 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Continuous objective value is 277.963 - 0.00 seconds
Cgl0004I processed model has 9 rows, 10 columns (10 integer (10 of which binary)) and 28 elements
Cutoff increment increased from 1e-05 to 0.00999
Cbc0038I Initial state - 1 integers unsatisfied sum - 0.00568828
Cbc0038I Pass 1: suminf. 0.00563 (1) obj. -277.814 iterations 3
Cbc0038I Solution found of -179.68
...
Result - Optimal solution found
Objective value: 255.58000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Option for printingOptions changed from normal to all
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
Fan Cost SwitchOptions Assign
id Switch
10001 A 32.83 3.18 A/B 1.0
B 32.83 3.18 A/B 0.0
10003 C 75.25 4.20 C 1.0
10005 A 71.60 8.79 A/B 0.0
B 71.60 8.79 A/B 0.0
10010 C 24.23 9.99 C/D 0.0
D 24.23 9.99 C/D 0.0
10011 D 98.69 8.88 D 1.0
10012 C 48.81 8.68 C/D 0.0
D 48.81 8.68 C/D 1.0