Understanding the L-infinity norm used in TFDV

Question

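I am trying to use TensorFlow Data Validation (TFDV) to check a dataset for drift and skew. It uses the L-infinity norm as the metric, and I don't understand the concept. Can anyone explain how it is calculated and why the threshold is set to 0.01 here?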

import tensorflow_data_validation as tfdv

train_day1_stats = tfdv.generate_statistics_from_tfrecord(data_location=train_day1_data_path)
# Add a drift comparator to the schema for 'payment_type' and set the L-infinity norm threshold that triggers a drift anomaly to 0.01.
tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=train_day2_stats, schema=schema, previous_statistics=train_day1_stats)

Answer 1

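COMPARATOR_L_INFTY_HIGH is triggered as follows:

Schema fields used:
* feature.skew_comparator.infinity_norm.threshold
* feature.drift_comparator.infinity_norm.threshold

Statistics fields:
* feature.string_stats.rank_histogram

Detection condition: the L-infinity norm of the vector that represents the difference between the normalized counts from feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew or previous statistics for drift) and the treatment statistics (i.e., training statistics for skew or current statistics for drift) is greater than feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold.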

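The L-infinity norm is simply max(abs(x1), ..., abs(xn)), the largest absolute value among the components. In this case, xi = count(bucket i values) / total count in the control group - count(bucket i values) / total count in the treatment group. Once we have the L-infinity norm, we check whether it is greater than feature.skew_comparator.infinity_norm.threshold (for skew) or feature.drift_comparator.infinity_norm.threshold (for drift). If it is, COMPARATOR_L_INFTY_HIGH is triggered. With a threshold of 0.01, that happens as soon as the share of any single value changes by more than one percentage point. The actual value (0.01) needs to be tuned for your specific use case and data statistics.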
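The calculation described above can be sketched in plain Python. This is only an illustration of the arithmetic, not TFDV's internal implementation: TFDV reads the counts from feature.string_stats.rank_histogram, whereas the hypothetical helper `linf_distance` below counts raw values itself, and the 'payment_type' data is made up. Both inputs are assumed to be non-empty.

```python
from collections import Counter


def linf_distance(control_values, treatment_values):
    """L-infinity distance between the normalized value counts of two samples.

    Hypothetical helper for illustration; assumes both samples are non-empty.
    """
    control_counts = Counter(control_values)
    treatment_counts = Counter(treatment_values)
    control_total = sum(control_counts.values())
    treatment_total = sum(treatment_counts.values())
    # Use the union of values so a value missing on one side counts as 0.
    all_values = set(control_counts) | set(treatment_counts)
    return max(
        abs(control_counts[v] / control_total - treatment_counts[v] / treatment_total)
        for v in all_values
    )


# Made-up 'payment_type' values for two days (illustrative data only).
day1 = ["Cash"] * 50 + ["Credit Card"] * 45 + ["No Charge"] * 5
day2 = ["Cash"] * 40 + ["Credit Card"] * 52 + ["No Charge"] * 8

distance = linf_distance(day1, day2)
threshold = 0.01  # same value as in the question
print(f"L-infinity distance: {distance:.2f}")
print("drift anomaly" if distance > threshold else "no anomaly")
```

Here the per-value differences are 0.10 (Cash), 0.07 (Credit Card), and 0.03 (No Charge), so the distance is 0.10 and the 0.01 threshold is exceeded.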

Answer 2

The detection conditions are explained in detail in the TensorFlow documentation (link below):

https://www.tensorflow.org/tfx/data_validation/anomalies

For your case, it says:

COMPARATOR_L_INFTY_HIGH

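Schema fields:

feature.skew_comparator.infinity_norm.threshold
feature.drift_comparator.infinity_norm.threshold

Statistics fields:

feature.string_stats.rank_histogram

Detection condition: The L-infinity norm of the vector that represents the difference between the normalized counts from feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew or previous statistics for drift) and the treatment statistics (i.e., training statistics for skew or current statistics for drift) > feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold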
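Written as a formula, the detection condition reads roughly as follows. The symbols are notation introduced here for illustration and do not appear in the TFDV documentation: $c_i$ and $t_i$ are the counts of value $i$ in the control and treatment rank histograms, and $\tau$ is the configured threshold.

```latex
\[
\lVert p - q \rVert_\infty \;=\; \max_i \,\lvert p_i - q_i \rvert \;>\; \tau,
\qquad
p_i = \frac{c_i}{\sum_j c_j},
\quad
q_i = \frac{t_i}{\sum_j t_j}
\]
```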
