How do I import a Wikipedia XML dump into MongoDB?


I am using this PHP script:

https://github.com/kodekrash/wikipedia.org-xmldump-mongodb

I downloaded the dataset with:

wget -c http://wikipedia.c3sl.ufpr.br/enwiki/20150901/enwiki-20150901-pages-articles.xml.bz2

It is quite large, about 12 GB.

I changed the configuration accordingly:

$dsname = 'mongodb://wiki:wiki@localhost:27017/wikipedia';
$file = '../data/enwiki-20150901-pages-articles.xml.bz2';
$logpath = './';

and ran it from the command line:

php wikipedia.org-xmldump-mongodb.php

I get this error:

PHP Warning:  simplexml_load_string(): Entity: line 37: parser error : expected '>' in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string(): </namespaces> in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string():            ^ in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string(): Entity: line 38: parser error : Premature end of data in tag namespace line 34 in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string():  in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string(): ^ in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string(): Entity: line 38: parser error : Premature end of data in tag namespaces line 1 in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string():  in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
PHP Warning:  simplexml_load_string(): ^ in /home/username/wiki-project/wikipedia.org-xmldump-mongodb/wikipedia.org-xmldump-mongodb.php on line 73
Aborting. Unable to parse namespaces.

I have installed PHP with the mbstring, SimpleXML, and mongodb extensions, and MongoDB 2.69.

Output of php -m:

[PHP Modules]
bcmath
bz2
calendar
Core
ctype
date
dba
dom
ereg
exif
fileinfo
filter
ftp
gettext
hash
iconv
json
libxml
mbstring
mhash
mongo
openssl
pcntl
pcre
PDO
Phar
posix
Reflection
session
shmop
SimpleXML
soap
sockets
SPL
standard
sysvmsg
sysvsem
sysvshm
tokenizer
wddx
xml
xmlreader
xmlwriter
zip
zlib

[Zend Modules]

How can I investigate this error?
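One way to start, as a rough sketch: the warnings come from simplexml_load_string() on line 73 and the script aborts with "Unable to parse namespaces", so it may help to look at the <namespaces> block at the top of the dump directly. The helper names extract_namespaces_block and xml_errors below are hypothetical, and the 64 KB read limit is an assumption about how large the dump header is.

```php
<?php
// Sketch only: helper names are hypothetical, not part of the import script.

// Return the <namespaces>...</namespaces> block found in $head, or null.
function extract_namespaces_block($head)
{
    $start = strpos($head, '<namespaces>');
    $end   = strpos($head, '</namespaces>');
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    return substr($head, $start, $end - $start + strlen('</namespaces>'));
}

// Parse $xml quietly and return libxml's messages as "line N: message" strings.
function xml_errors($xml)
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    simplexml_load_string($xml);
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

$file = '../data/enwiki-20150901-pages-articles.xml.bz2'; // path from the config above

if (function_exists('bzopen') && is_readable($file)) {
    $bz = bzopen($file, 'r');
    if ($bz === false) {
        exit("Cannot open $file\n");
    }
    $head = '';
    // The header sits at the top of the dump; 64 KB is an assumed upper bound.
    while (strlen($head) < 65536) {
        $part = bzread($bz, 8192);
        if ($part === false || $part === '') {
            break;
        }
        $head .= $part;
    }
    bzclose($bz);

    $block = extract_namespaces_block($head);
    if ($block === null) {
        echo "No <namespaces> block in the first 64 KB\n";
    } else {
        echo $block, "\n";
        print_r(xml_errors($block));
    }
}
```

If the raw block parses without errors, the problem is more likely in how the script rewrites the chunk before line 73 than in the dump itself.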

php xml mongodb wikipedia
1 Answer

You have to escape the > character. In the script file, replace this line (line 72):

$chunk = str_replace( [ 'letter">', '</namespace>' ], [ 'letter" name="', '" />' ], $chunk );

with:

$chunk = str_replace( [ 'letter"\>', '</namespace\>' ], [ 'letter" name="', '" /\>' ], $chunk );

This worked for me!
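A possible explanation, offered as a sketch rather than a verified reading of the script: in a single-quoted PHP string, \> is not an escape sequence, so the backslash stays in the search string and the modified str_replace() no longer matches anything in the dump. The chunk then reaches simplexml_load_string() unchanged. The original replacement only rewrites opening tags that end in letter">, so a namespace with a different case attribute keeps its opening tag while its closing tag is turned into " />, which would explain "Premature end of data in tag namespace". The sample chunk below is an assumption modeled on the dump header, and parses() is a hypothetical helper.

```php
<?php
// Sample header modeled on the MediaWiki export format; the exact lines are an assumption.
$chunk = <<<XML
<namespaces>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2302" case="case-sensitive">Gadget definition</namespace>
</namespaces>
XML;

// Hypothetical helper: true if $xml is well-formed, without emitting warnings.
function parses($xml)
{
    $previous = libxml_use_internal_errors(true);
    $ok = simplexml_load_string($xml) !== false;
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// Line 72 as shipped: turns element text into a name="..." attribute,
// but only for opening tags that end in letter">.
$original = str_replace(['letter">', '</namespace>'], ['letter" name="', '" />'], $chunk);

// Line 72 as changed above: the search strings now contain a literal backslash.
$modified = str_replace(['letter"\>', '</namespace\>'], ['letter" name="', '" /\>'], $chunk);

var_dump(parses($original));   // bool(false): the case-sensitive line is left unclosed
var_dump($modified === $chunk); // bool(true): nothing was replaced
var_dump(parses($modified));   // bool(true)

// Alternative (not the script's approach): read the name from the element text.
$names = [];
foreach (simplexml_load_string($chunk)->namespace as $ns) {
    $names[(int) $ns['key']] = (string) $ns;
}
print_r($names);
```

Reading the element text directly avoids depending on the value of the case attribute, though the rest of the script may expect a name attribute, so check how it uses the parsed namespaces before adopting that variant.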
