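I'm having trouble converting a structured array loaded from a CSV with np.genfromtxt into a plain np.array so that I can fit the data to a Scikit-Learn estimator. The problem is that at some point a cast from the structured array to a regular array occurs, which raises ValueError: can't cast from structure to non-structure. For a long time I used .view to perform the conversion, but this now triggers deprecation warnings from NumPy. The code is as follows: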
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
data = np.genfromtxt(path, dtype=float, delimiter=',', names=True)
target = "occupancy"
features = [
"temperature", "relative_humidity", "light", "C02", "humidity"
]
# Doesn't work directly
X = data[features]
y = data[target].astype(int)
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)
The exception raised is: ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.
My second attempt was to use a view, as follows:
# View is raising deprecation warnings
X = data[features]
X = X.view((float, len(X.dtype.names)))
y = data[target].astype(int)
This works and does exactly what I want (I don't need a copy of the data), but it produces a deprecation warning:
FutureWarning: Numpy has detected that you may be viewing or writing to
an array returned by selecting multiple fields in a structured array.
This code may break in numpy 1.15 because this will return a view
instead of a copy -- see release notes for details.
Currently we use tolist() to convert the structured array to a list and then to an np.array. This works, but it seems very inefficient:
# Current method (efficient?)
X = np.array(data[features].tolist())
y = data[target].astype(int)
There must be a better way, and I would appreciate any suggestions.
Note: the data for this example comes from the UCI ML Occupancy Repository and looks like this:
array([(nan, 23.18, 27.272 , 426. , 721.25, 0.00479299, 1.),
(nan, 23.15, 27.2675, 429.5 , 714. , 0.00478344, 1.),
(nan, 23.15, 27.245 , 426. , 713.5 , 0.00477946, 1.), ...,
(nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
(nan, 20.89, 28.0225, 418.75, 1632. , 0.00427949, 1.),
(nan, 21. , 28.1 , 409. , 1864. , 0.00432073, 1.)],
dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'),
('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])
Add a .copy() to data[features]:
X = data[features].copy()
X = X.view((float, len(X.dtype.names)))
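and the FutureWarning message goes away. This should be more efficient than converting to a list first.
If you can read the data into a plain NumPy array in the first place (by omitting the names parameter and skipping the header row instead), you can avoid the copy altogether: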
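As a hedged aside: NumPy 1.16 and later include numpy.lib.recfunctions.structured_to_unstructured, which is intended for this kind of conversion. As far as I know, on those versions data[features] is a view that keeps padding bytes for the unselected fields, so the .copy() plus .view((float, n)) approach may raise a ValueError there. The sketch below assumes NumPy >= 1.16 and uses a small made-up sample in place of the CSV; it may or may not copy the data.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Small stand-in for the occupancy data (three illustrative rows, not the real file).
data = np.array(
    [(np.nan, 23.18, 27.272, 426.0, 721.25, 0.00479299, 1.0),
     (np.nan, 23.15, 27.2675, 429.5, 714.0, 0.00478344, 1.0),
     (np.nan, 21.0, 28.1, 409.0, 1864.0, 0.00432073, 0.0)],
    dtype=[("datetime", "<f8"), ("temperature", "<f8"),
           ("relative_humidity", "<f8"), ("light", "<f8"),
           ("C02", "<f8"), ("humidity", "<f8"), ("occupancy", "<f8")],
)
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]

# Convert the selected fields to a plain 2D float array (one column per field).
X = rfn.structured_to_unstructured(data[features])
y = data["occupancy"].astype(int)
```

The resulting X is an ordinary 2D array with one row per record, so it can be passed to clf.fit(X, y) directly.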
data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)
Then (fortunately) X consists of every column except the first and the last, i.e. everything but the datetime and occupancy columns. So we can express X and y as slices:
X = data[:, 1:-1]
y = data[:, -1].astype(int)
These can then be passed straight to scikit-learn functions:
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)
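A quick way to check that the slice really avoids a copy is np.shares_memory. This is only a sketch: the array below is a made-up stand-in with the same seven-column layout, not the real CSV.

```python
import numpy as np

# Hypothetical stand-in for the loaded CSV: 3 rows, 7 columns
# (datetime, five features, occupancy).
data = np.arange(21, dtype=float).reshape(3, 7)

X = data[:, 1:-1]            # basic slicing returns a view
y = data[:, -1].astype(int)  # astype always produces a new array

x_is_view = np.shares_memory(X, data)
y_is_view = np.shares_memory(y, data)
print(x_is_view, y_is_view)
```

Only X is a view here; y is a small one-dimensional copy because of the cast to int.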
And, if we wish, we can later view the plain NumPy array as a structured array:
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])
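Unfortunately, this workaround relies on X being expressible as a slice. If, for example, occupancy appeared between the other feature columns, we could not avoid a copy. It also means you have to define X with X = data[:, 1:-1] instead of the more readable X = data[features].
The complete code: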
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)
X = data[:, 1:-1]
y = data[:, -1].astype(int)
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])
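One caveat about the last line, offered as a sketch rather than a definitive statement: because X = data[:, 1:-1] is not C-contiguous when there is more than one row, X.ravel() has to make a copy, so the structured array built from it no longer shares memory with data. The example below uses a made-up array to illustrate this.

```python
import numpy as np

# Hypothetical stand-in for the loaded CSV: 3 rows, 7 columns.
data = np.arange(21, dtype=float).reshape(3, 7)
X = data[:, 1:-1]
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]

contiguous = X.flags["C_CONTIGUOUS"]  # False for this column slice

# ravel() copies the non-contiguous slice; view() then reinterprets
# each run of five floats as one structured record.
X_struct = X.ravel().view([(field, X.dtype.type) for field in features])

# Writing to the structured array does not touch the original data.
X_struct["temperature"][0] = -1.0
print(contiguous, X_struct.shape, data[0, 1])
```

So the structured version is convenient for access by field name, but it should be treated as an independent copy.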
If you must start with a structured array, hpaulj's answer shows how to view/reshape/slice the structured array to get a plain array without copying:
import numpy as np
nan = np.nan
data = np.array([(nan, 23.18, 27.272 , 426. , 721.25, 0.00479299, 1.),
(nan, 23.15, 27.2675, 429.5 , 714. , 0.00478344, 1.),
(nan, 23.15, 27.245 , 426. , 713.5 , 0.00477946, 1.),
(nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
(nan, 20.89, 28.0225, 418.75, 1632. , 0.00427949, 1.),
(nan, 21. , 28.1 , 409. , 1864. , 0.00432073, 1.)],
dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'),
('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])
target = 'occupancy'
nrows = len(data)
X = data.view('<f8').reshape(nrows, -1)[:, 1:-1]
y = data[target].astype(int)
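This takes advantage of the fact that every field is 8 bytes long, which makes it easy to view the structured array as a plain array of dtype <f8. The reshape turns it into a 2D array with the same number of rows. The slice then drops the datetime and occupancy columns/fields from the array.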
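Since the trick depends on all fields sharing one dtype, it may be worth guarding that assumption before viewing. The following is a minimal sketch with a made-up three-row sample; the uniform-dtype check is my own addition, not part of the quoted answer.

```python
import numpy as np

# Small illustrative sample with the same field layout as the question.
data = np.array(
    [(np.nan, 23.18, 27.272, 426.0, 721.25, 0.00479299, 1.0),
     (np.nan, 23.15, 27.2675, 429.5, 714.0, 0.00478344, 1.0),
     (np.nan, 21.0, 28.1, 409.0, 1864.0, 0.00432073, 0.0)],
    dtype=[("datetime", "<f8"), ("temperature", "<f8"),
           ("relative_humidity", "<f8"), ("light", "<f8"),
           ("C02", "<f8"), ("humidity", "<f8"), ("occupancy", "<f8")],
)

# The view is only meaningful if every field has the same dtype.
uniform = all(data.dtype[name] == np.dtype("<f8") for name in data.dtype.names)
assert uniform, "fields must all be <f8 for this view to be valid"

X = data.view("<f8").reshape(len(data), -1)[:, 1:-1]

# Demonstrate that X is a view: a write through X shows up in data.
X[0, 0] = 99.0
print(data["temperature"][0])
```

If the check fails (for example, if one column were read as a string or an integer), a copying conversion is the safer route.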