What is the mathematical definition of the quantile transformation in xgboost.QuantileDMatrix?

Question

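The XGBoost package provides the class xgboost.QuantileDMatrix, which takes a numpy.ndarray or pandas.DataFrame as input, applies a quantile transformation, and stores the data in a sparse representation to improve performance. As far as I understand, if the parameter max_bin is set equal to or greater than the number of samples in the input data (max_bin >= number_of_samples), the quantile transformation should have no effect, because every data point is then represented by itself. However, if you do this and then inspect the data with QuantileDMatrix.get_data().data, you will see that the lowest value in the data is always replaced by a completely different value. If you have several features, one value is replaced for each feature.

So how exactly does QuantileDMatrix work? How is this quantization defined mathematically?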

How to reproduce:

import xgboost as xgb
import pandas as pd
import numpy as np

# define data with numpy
feature1 = np.array([1,2,3,4])

# put it into pandas
a = pd.DataFrame({'feature1': feature1})

quantized_a = xgb.QuantileDMatrix(a, max_bin = 4)

# to show that the behaviour is consistent both with pandas and numpy
quantized_feature1 = xgb.QuantileDMatrix(feature1.reshape(-1, 1), max_bin = 4)

print(quantized_a.get_data().data)
print(quantized_feature1.get_data().data)
# output: [-1.e-05, 2.e+00, 3.e+00, 4.e+00 ]

# different data yields similar problem
feature2 = np.array([10399., 34552., -48585., 70.])
quantized_feature2 = xgb.QuantileDMatrix(feature2.reshape(-1, 1), max_bin = 4)
print(quantized_feature2.get_data().data)

np.testing.assert_almost_equal(feature2, quantized_feature2.get_data().data)
# Arrays are not almost equal to 7 decimals

# Mismatched elements: 1 / 4 (25%)
# Max absolute difference: 48585.
# Max relative difference: 0.5
# x: array([ 10399.,  34552., -48585.,     70.])
# y: array([ 1.0399e+04,  3.4552e+04, -9.7170e+04,  7.0000e+01], dtype=float32)
# in this case -48585 is the value affected, the lowest.
# If you make it positive, then the value affected 
# is 70 which becomes the lowest one

Requirements:

xgboost>=1.7.6
numpy>=1.23.5
pandas>=1.5.7

Answer 1

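Each data point is actually replaced by the lower bound of its quantile bin. For the smallest bin, the lower bound is -inf. However, instead of -inf the developers use min(2x, 0)-1.e-05, where x is the smallest value of the feature. This matches the outputs in the question: for x = 1 it gives -1.e-05, and for x = -48585 it gives roughly -9.7170e+04. The developers do acknowledge that min(2x, 0)-1.e-05 is not a good substitute for -inf and that -inf should be used directly [1].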
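To make the rule concrete, here is a minimal pure-Python sketch of the behaviour seen in the question. It only covers the case where max_bin is at least the number of samples, so that every value sits in its own bin. It is not XGBoost's actual binning code. The function names substitute_lower_bound and represent_by_lower_bounds are made up for illustration, and replacing every value that ties for the minimum is an assumption.

```python
def substitute_lower_bound(feature_min: float) -> float:
    """Finite stand-in for -inf described in the answer: min(2x, 0) - 1e-5."""
    return min(2.0 * feature_min, 0.0) - 1e-5


def represent_by_lower_bounds(values):
    """Model of the max_bin >= n_samples case for a single feature.

    Every value is assumed to be the lower bound of its own bin, so it is
    left unchanged, except the minimum, whose bin has no finite lower bound
    and therefore receives the substitute value.
    """
    lowest = min(values)
    floor = substitute_lower_bound(lowest)
    return [floor if v == lowest else float(v) for v in values]


feature1 = [1, 2, 3, 4]
feature2 = [10399.0, 34552.0, -48585.0, 70.0]

print(represent_by_lower_bounds(feature1))
print(represent_by_lower_bounds(feature2))
```

For feature1 the minimum 1 becomes min(2, 0) - 1e-5 = -1e-5. For feature2 the minimum -48585 becomes about -97170, which is the -9.7170e+04 reported in the question once it is stored as float32.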

Open GitHub issues related to this behaviour:

[1] https://github.com/dmlc/xgboost/issues/2914

[2] https://github.com/dmlc/xgboost/issues/9680
