pgmpy: Expectation Maximization for Bayesian network parameter learning with missing data

Question (votes: 0, answers: 2)

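I am trying to learn the parameters of a Bayesian network with the Python package pgmpy. If I understand Expectation Maximization (EM) correctly, it should be able to handle missing values. I am currently experimenting with a 3-variable BN in which the first 500 data points have missing values. There are no latent variables. Although the pgmpy documentation suggests that EM should work with missing values, I get an error. The error only occurs when the function is called on data points that have missing values. Am I doing something wrong? Or should I look for another package that supports EM with missing values?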

#import
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch

# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]

# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)

# create some missing values
new_data.loc[new_data.index[:500], "Smoker"] = np.nan

# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)

The error I get is an IndexError:

  File "main.py", line 100, in <module>
    bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
  File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
    cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
    weighted_data = self._compute_weights(latent_card)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
    weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
  File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
    return op.apply().__finalize__(self, method="apply")
  File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
    return self.apply_standard()
  File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
    results, res_index = self.apply_series_generator()
  File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
    results[i] = self.f(v)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
    weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
    likelihood *= cpd.get_value(
  File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
    return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
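
The last frame of the traceback indexes a NumPy array with `tuple(index)`. One plausible reading, offered as an assumption rather than something confirmed against the pgmpy source, is that the NaN in `Smoker` reaches that lookup as a raw float instead of an integer state number. The sketch below reproduces the same message with a hypothetical 2 x 2 table; `values` and `lookup` are illustrative names, not pgmpy objects.

```python
import numpy as np

# Hypothetical stand-in for a CPD table with two binary variables.
values = np.array([[0.9, 0.2],
                   [0.1, 0.8]])

def lookup(table, index):
    """Index the table; return the error text instead of raising."""
    try:
        return table[tuple(index)]
    except IndexError as exc:
        return f"IndexError: {exc}"

ok = lookup(values, [1, 0])               # integer state indices work
bad = lookup(values, [1, float("nan")])   # NaN is a float, not a valid index

print(ok)
print(bad)
```

If this reading is right, the failure comes from the state lookup and not from the EM update itself.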

Thanks!

python missing-data bayesian-networks expectation-maximization pgmpy
2 Answers
2
votes

Since your specific question is still unanswered, let me propose a solution using another module:

#import 
import pandas as pd
import numpy as np
import pyAgrum as gum

# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"]) 
test_data = data[:2000]
new_data = data[2000:].copy() 

# Learn structure of initial model from data
learner=gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model=learner.learnBN()

# create some missing values
new_data.loc[new_data.index[:500], "smoking"] = "?" # "?" marks a missing value here, instead of NaN

# learn parameterization of BN
bn = gum.BayesNet(model)
learner2=gum.BNLearner(new_data,model)
learner2.useEM(1e-10)
learner2.fitParameters(bn)

In a notebook: EM in a notebook
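
To make explicit what EM does with rows whose value is missing, here is a minimal pure-Python sketch for a two-node network Smoker -> Cancer with binary states. It is not how pyAgrum or pgmpy implement EM; the function `em_two_node`, the starting guesses, and the toy counts in `rows` are all assumptions made for illustration. In the E-step each row with a missing Smoker value is split between both states according to the current posterior; in the M-step the parameters are re-estimated from those expected counts.

```python
def em_two_node(rows, n_iter=50):
    """EM for Smoker -> Cancer. rows: (smoker, cancer) pairs,
    smoker in {0, 1, None} (None = missing), cancer in {0, 1}."""
    p_s = 0.5            # P(Smoker = 1), arbitrary starting guess
    p_c = [0.4, 0.6]     # P(Cancer = 1 | Smoker = s), arbitrary starting guess
    for _ in range(n_iter):
        n_s = [0.0, 0.0]     # expected count of Smoker = s
        n_sc = [0.0, 0.0]    # expected count of Smoker = s and Cancer = 1
        # E-step: accumulate expected counts under the current parameters.
        for smoker, cancer in rows:
            if smoker is None:
                like = [p_c[s] if cancer else 1.0 - p_c[s] for s in (0, 1)]
                joint = [(1.0 - p_s) * like[0], p_s * like[1]]
                total = joint[0] + joint[1]
                weights = [joint[0] / total, joint[1] / total]
            else:
                weights = [1.0 - smoker, float(smoker)]
            for s in (0, 1):
                n_s[s] += weights[s]
                n_sc[s] += weights[s] * cancer
        # M-step: re-estimate the parameters from the expected counts.
        p_s = n_s[1] / (n_s[0] + n_s[1])
        p_c = [n_sc[s] / n_s[s] for s in (0, 1)]
    return p_s, p_c


# Toy data (made up): 80 complete rows plus 16 rows with Smoker missing.
rows = ([(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 5 + [(0, 0)] * 35
        + [(None, 1)] * 7 + [(None, 0)] * 9)

p_s, p_c = em_two_node(rows)
print(p_s, p_c)
```

The toy counts are chosen so that the rows with a missing Smoker value have the same Cancer frequency as the complete rows, which means the estimates should settle near the complete-case values P(Smoker = 1) = 0.5 and P(Cancer = 1 | Smoker) = 0.125 and 0.75. With real data the two sets of estimates generally differ.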


0
votes

(This is a question.)

Hi,

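I am working on similar research. This implementation uses Expectation Maximization only for parameter learning. However, we need to use EM for structure learning, because only then can we take advantage of its ability to handle missing values while the structure is being learned.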
