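I'm using DOMDocument to edit some HTML files, but some of them have spaces in their names. DOMDocument automatically changes the spaces to %20, and then it can't find the file.
This is what the error looks like: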
Warning: DOMDocument::load() [domdocument.load]: Entity 'nbsp' not defined in file:///C:/Path/To/The/File/01%20c%2040-1964.html, line: 11 in C:/Path/To/class.php on line 51
How can I fix this error?
Use DOMDocument::loadHTMLFile() instead of load(). That is what it is for. HTML is not XML.
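The %20 in the message is only the URL-encoded file path; the warning itself is about the entity. XML does not know the named entity &nbsp;. If you use loadHTML() or loadHTMLFile(), libxml's HTML parser handles the document, and it knows the HTML named entities, so the error goes away.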
See also: XML parser error: entity not defined.
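To see the difference between the two parsers, here is a minimal sketch. It assumes a made-up HTML string in place of the real file, and it collects libxml messages with libxml_use_internal_errors() rather than letting PHP print warnings. The variable names are illustrative only.

```php
<?php
declare(strict_types=1);

// Hypothetical markup standing in for the contents of the HTML file.
$html = '<html><body><p>01&nbsp;c&nbsp;40-1964</p></body></html>';

// Collect parser messages instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The XML parser has no definition for &nbsp; and reports an error.
$asXml = new DOMDocument();
$asXml->loadXML($html);
$xmlErrors = libxml_get_errors();
libxml_clear_errors();

// The HTML parser knows &nbsp; and stores it as the character U+00A0.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($html);
$text = $asHtml->getElementsByTagName('p')->item(0)->textContent;
libxml_clear_errors();
```

After this runs, $xmlErrors holds the "Entity 'nbsp' not defined" message, while $text contains the paragraph text with real non-breaking spaces.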
If you are working with XML rather than HTML, escape the text with htmlentities() and the ENT_XML1 flag:
$offerXml->addChild('name', htmlentities($name, ENT_XML1));
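As a rough illustration of that line, the sketch below escapes a value before handing it to SimpleXMLElement::addChild() and reads it back. The <offer> root element and the sample name are assumptions made up for the example.

```php
<?php
declare(strict_types=1);

// Hypothetical value containing characters that are special in XML.
$name = 'Fish & Chips <large>';

// With ENT_XML1, the special characters are replaced by XML's predefined entities.
$escaped = htmlentities($name, ENT_XML1);

// Hypothetical root element; the original answer does not show how $offerXml is built.
$offerXml = new SimpleXMLElement('<offer/>');
$offerXml->addChild('name', $escaped);

// Reading the element back yields the original, unescaped text.
$roundTrip = (string) $offerXml->name;
```

Without the escaping step, a bare & in the value would be read as the start of an entity reference.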
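Here is how to parse an XHTML snippet that contains HTML entities such as &nbsp;.
The example comes from code that parses XHTML exported from the Confluence API, including custom elements such as <ac:structured-macro>. That is why the namespaces need to be declared. Change or remove the xmlns attributes to suit your use case.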
$snippet = '<p>hello&nbsp;world</p>';
$entitydefs = XML_HTML_DEFS;
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
$entitydefs
]>
<root
xmlns:ac="http://www.atlassian.com/schema/confluence/4/ac/"
xmlns:ri="http://www.atlassian.com/schema/confluence/4/ri/"
xmlns="http://www.atlassian.com/schema/confluence/4/">
$snippet
</root>
XML;
$dom = new DOMDocument();
$dom->loadXML($xml);
where XML_HTML_DEFS is the following (you can remove the entities you know will not occur):
const XML_HTML_DEFS = <<<ENTITIES
<!ENTITY amp "&#38;#38;">
<!ENTITY lt "&#38;#60;">
<!ENTITY gt "&#62;">
<!ENTITY Agrave "À">
<!ENTITY Aacute "Á">
<!ENTITY Acirc "Â">
<!ENTITY Atilde "Ã">
<!ENTITY Auml "Ä">
<!ENTITY Aring "Å">
<!ENTITY AElig "Æ">
<!ENTITY Ccedil "Ç">
<!ENTITY Egrave "È">
<!ENTITY Eacute "É">
<!ENTITY Ecirc "Ê">
<!ENTITY Euml "Ë">
<!ENTITY Igrave "Ì">
<!ENTITY Iacute "Í">
<!ENTITY Icirc "Î">
<!ENTITY Iuml "Ï">
<!ENTITY ETH "Ð">
<!ENTITY Ntilde "Ñ">
<!ENTITY Ograve "Ò">
<!ENTITY Oacute "Ó">
<!ENTITY Ocirc "Ô">
<!ENTITY Otilde "Õ">
<!ENTITY Ouml "Ö">
<!ENTITY Oslash "Ø">
<!ENTITY Ugrave "Ù">
<!ENTITY Uacute "Ú">
<!ENTITY Ucirc "Û">
<!ENTITY Uuml "Ü">
<!ENTITY Yacute "Ý">
<!ENTITY THORN "Þ">
<!ENTITY szlig "ß">
<!ENTITY agrave "à">
<!ENTITY aacute "á">
<!ENTITY acirc "â">
<!ENTITY atilde "ã">
<!ENTITY auml "ä">
<!ENTITY aring "å">
<!ENTITY aelig "æ">
<!ENTITY ccedil "ç">
<!ENTITY egrave "è">
<!ENTITY eacute "é">
<!ENTITY ecirc "ê">
<!ENTITY euml "ë">
<!ENTITY igrave "ì">
<!ENTITY iacute "í">
<!ENTITY icirc "î">
<!ENTITY iuml "ï">
<!ENTITY eth "ð">
<!ENTITY ntilde "ñ">
<!ENTITY ograve "ò">
<!ENTITY oacute "ó">
<!ENTITY ocirc "ô">
<!ENTITY otilde "õ">
<!ENTITY ouml "ö">
<!ENTITY oslash "ø">
<!ENTITY ugrave "ù">
<!ENTITY uacute "ú">
<!ENTITY ucirc "û">
<!ENTITY uuml "ü">
<!ENTITY yacute "ý">
<!ENTITY thorn "þ">
<!ENTITY yuml "ÿ">
<!ENTITY nbsp "&#160;">
<!ENTITY iexcl "¡">
<!ENTITY cent "¢">
<!ENTITY pound "£">
<!ENTITY curren "¤">
<!ENTITY yen "¥">
<!ENTITY brvbar "¦">
<!ENTITY sect "§">
<!ENTITY uml "¨">
<!ENTITY copy "©">
<!ENTITY ordf "ª">
<!ENTITY laquo "«">
<!ENTITY not "¬">
<!ENTITY shy "&#173;">
<!ENTITY reg "®">
<!ENTITY macr "¯">
<!ENTITY deg "°">
<!ENTITY plusmn "±">
<!ENTITY sup2 "²">
<!ENTITY sup3 "³">
<!ENTITY acute "´">
<!ENTITY micro "µ">
<!ENTITY para "¶">
<!ENTITY cedil "¸">
<!ENTITY sup1 "¹">
<!ENTITY ordm "º">
<!ENTITY raquo "»">
<!ENTITY frac14 "¼">
<!ENTITY frac12 "½">
<!ENTITY frac34 "¾">
<!ENTITY iquest "¿">
<!ENTITY times "×">
<!ENTITY divide "÷">
<!ENTITY forall "∀">
<!ENTITY part "∂">
<!ENTITY exist "∃">
<!ENTITY empty "∅">
<!ENTITY nabla "∇">
<!ENTITY isin "∈">
<!ENTITY notin "∉">
<!ENTITY ni "∋">
<!ENTITY prod "∏">
<!ENTITY sum "∑">
<!ENTITY minus "−">
<!ENTITY lowast "∗">
<!ENTITY radic "√">
<!ENTITY prop "∝">
<!ENTITY infin "∞">
<!ENTITY ang "∠">
<!ENTITY and "∧">
<!ENTITY or "∨">
<!ENTITY cap "∩">
<!ENTITY cup "∪">
<!ENTITY int "∫">
<!ENTITY there4 "∴">
<!ENTITY sim "∼">
<!ENTITY cong "≅">
<!ENTITY asymp "≈">
<!ENTITY ne "≠">
<!ENTITY equiv "≡">
<!ENTITY le "≤">
<!ENTITY ge "≥">
<!ENTITY sub "⊂">
<!ENTITY sup "⊃">
<!ENTITY nsub "⊄">
<!ENTITY sube "⊆">
<!ENTITY supe "⊇">
<!ENTITY oplus "⊕">
<!ENTITY otimes "⊗">
<!ENTITY perp "⊥">
<!ENTITY sdot "⋅">
<!ENTITY Alpha "Α">
<!ENTITY Beta "Β">
<!ENTITY Gamma "Γ">
<!ENTITY Delta "Δ">
<!ENTITY Epsilon "Ε">
<!ENTITY Zeta "Ζ">
<!ENTITY Eta "Η">
<!ENTITY Theta "Θ">
<!ENTITY Iota "Ι">
<!ENTITY Kappa "Κ">
<!ENTITY Lambda "Λ">
<!ENTITY Mu "Μ">
<!ENTITY Nu "Ν">
<!ENTITY Xi "Ξ">
<!ENTITY Omicron "Ο">
<!ENTITY Pi "Π">
<!ENTITY Rho "Ρ">
<!ENTITY Sigma "Σ">
<!ENTITY Tau "Τ">
<!ENTITY Upsilon "Υ">
<!ENTITY Phi "Φ">
<!ENTITY Chi "Χ">
<!ENTITY Psi "Ψ">
<!ENTITY Omega "Ω">
<!ENTITY alpha "α">
<!ENTITY beta "β">
<!ENTITY gamma "γ">
<!ENTITY delta "δ">
<!ENTITY epsilon "ε">
<!ENTITY zeta "ζ">
<!ENTITY eta "η">
<!ENTITY theta "θ">
<!ENTITY iota "ι">
<!ENTITY kappa "κ">
<!ENTITY lambda "λ">
<!ENTITY mu "μ">
<!ENTITY nu "ν">
<!ENTITY xi "ξ">
<!ENTITY omicron "ο">
<!ENTITY pi "π">
<!ENTITY rho "ρ">
<!ENTITY sigmaf "ς">
<!ENTITY sigma "σ">
<!ENTITY tau "τ">
<!ENTITY upsilon "υ">
<!ENTITY phi "φ">
<!ENTITY chi "χ">
<!ENTITY psi "ψ">
<!ENTITY omega "ω">
<!ENTITY thetasym "ϑ">
<!ENTITY upsih "ϒ">
<!ENTITY piv "ϖ">
<!ENTITY OElig "Œ">
<!ENTITY oelig "œ">
<!ENTITY Scaron "Š">
<!ENTITY scaron "š">
<!ENTITY Yuml "Ÿ">
<!ENTITY fnof "ƒ">
<!ENTITY circ "ˆ">
<!ENTITY tilde "˜">
<!ENTITY ensp "&#8194;">
<!ENTITY emsp "&#8195;">
<!ENTITY thinsp "&#8201;">
<!ENTITY zwnj "&#8204;">
<!ENTITY zwj "&#8205;">
<!ENTITY lrm "&#8206;">
<!ENTITY rlm "&#8207;">
<!ENTITY ndash "–">
<!ENTITY mdash "—">
<!ENTITY lsquo "‘">
<!ENTITY rsquo "’">
<!ENTITY sbquo "‚">
<!ENTITY ldquo "“">
<!ENTITY rdquo "”">
<!ENTITY bdquo "„">
<!ENTITY dagger "†">
<!ENTITY Dagger "‡">
<!ENTITY bull "•">
<!ENTITY hellip "…">
<!ENTITY permil "‰">
<!ENTITY prime "′">
<!ENTITY Prime "″">
<!ENTITY lsaquo "‹">
<!ENTITY rsaquo "›">
<!ENTITY oline "‾">
<!ENTITY euro "€">
<!ENTITY trade "™">
<!ENTITY larr "←">
<!ENTITY uarr "↑">
<!ENTITY rarr "→">
<!ENTITY darr "↓">
<!ENTITY harr "↔">
<!ENTITY crarr "↵">
<!ENTITY lceil "⌈">
<!ENTITY rceil "⌉">
<!ENTITY lfloor "⌊">
<!ENTITY rfloor "⌋">
<!ENTITY loz "◊">
<!ENTITY spades "♠">
<!ENTITY clubs "♣">
<!ENTITY hearts "♥">
<!ENTITY diams "♦">
ENTITIES;
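To check that the entities are really resolved, a cut-down sketch of the same approach follows. It declares only &nbsp;, drops the Confluence namespaces, and passes LIBXML_NOENT so that entity references are replaced by their text while parsing. That flag is an addition that is not part of the code above, and it is only sensible when the DOCTYPE is one you wrote yourself.

```php
<?php
declare(strict_types=1);

// Only the one entity this snippet needs; the full list works the same way.
$entitydefs = '<!ENTITY nbsp "&#160;">';
$snippet = '<p>hello&nbsp;world</p>';

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
$entitydefs
]>
<root>$snippet</root>
XML;

$dom = new DOMDocument();
// LIBXML_NOENT substitutes entity references with their replacement text.
$loaded = $dom->loadXML($xml, LIBXML_NOENT);

// The paragraph now contains a real U+00A0 character.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```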
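If you have some control over the documents passed to DOMDocument, another option is to use the numeric character reference (&#160;) or the literal Unicode character (U+00A0) instead of the named entity (&nbsp;).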
Suppose you have the following files:
Document 1 (test_1.xml)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element>this line has a space ( )</element>
<element>this line has a non-breaking space (&nbsp;)</element>
<element>this line has an integral symbol (&int;)</element>
</root>
Document 2 (test_2.xml)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element>this line has a space ( )</element>
<element>this line has a non-breaking space (&#160;)</element>
<element>this line has an integral symbol (&#8747;)</element>
</root>
Document 3 (test_3.xml)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element>this line has a space ( )</element>
<element>this line has a non-breaking space (U+00A0)</element>
<element>this line has an integral symbol (U+222B)</element>
</root>
And a PHP file that loads these documents (index.php):
<?php
declare(strict_types=1);
$xmlDoc = new DOMDocument();
$xml_file = file_get_contents( "test_2.xml" );
$xmlDoc->loadXML( $xml_file );
?>
Loading the first document produces warnings like these:
Warning: DOMDocument::loadXML(): Entity 'nbsp' not defined in Entity, line: 4 in /path/to/file/index.php on line 5
Warning: DOMDocument::loadXML(): Entity 'int' not defined in Entity, line: 5 in /path/to/file/index.php on line 5
Documents 2 and 3, however, are parsed without problems.
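The comparison can also be run without separate files. The sketch below assumes shortened inline versions of the three documents and counts the errors libxml reports for each; the array keys and variable names are made up for illustration.

```php
<?php
declare(strict_types=1);

// Shortened stand-ins for test_1.xml, test_2.xml and test_3.xml.
$documents = [
    'named'   => '<root><element>(&nbsp;)</element><element>(&int;)</element></root>',
    'numeric' => '<root><element>(&#160;)</element><element>(&#8747;)</element></root>',
    'literal' => "<root><element>(\u{00A0})</element><element>(\u{222B})</element></root>",
];

// Collect parser messages instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$errorCounts = [];
foreach ($documents as $label => $xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    // Record how many errors this document caused, then reset the buffer.
    $errorCounts[$label] = count(libxml_get_errors());
    libxml_clear_errors();
}
```

Only the 'named' variant should end up with a non-zero count, since numeric references and literal characters need no entity declarations.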
$textHTML = '<ul> <li>Dentro da ordem jur&iacute;dica brasileira.</li> </ul>';
Escape it like this before saving it in the XML file:
htmlspecialchars($textHTML, ENT_QUOTES);
Then load the file back like this:
$doc->load('file.xml');
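A minimal round trip of this idea might look as follows. It assumes the escaped HTML is stored as the text of a hypothetical <body> element, and it uses loadXML() on a string instead of load() on a file so that the example is self-contained.

```php
<?php
declare(strict_types=1);

$textHTML = '<ul> <li>Dentro da ordem jur&iacute;dica brasileira.</li> </ul>';

// Escaping turns the markup, including the & of &iacute;, into plain XML text.
$escaped = htmlspecialchars($textHTML, ENT_QUOTES);

// Hypothetical wrapper document; the element names are made up for this sketch.
$xml = '<?xml version="1.0" encoding="UTF-8"?><item><body>' . $escaped . '</body></item>';

$doc = new DOMDocument();
$loaded = $doc->loadXML($xml);

// The text content is the original HTML string, entity reference included.
$restored = $doc->getElementsByTagName('body')->item(0)->textContent;
```

Because the XML parser only ever sees &amp;iacute;, it never has to resolve the HTML entity itself.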