StAX parsing: Transformer.transform advances the cursor automatically, which is not always desirable

Question (0 votes, 2 answers)

I am using XMLStreamReader to split an XML file. It mostly works, but it does not yet give the result I want. My goal is to split out every "nextTag" node from the input file:

<?xml version="1.0" encoding="UTF-8"?>
<firstTag>
    <nextTag>1</nextTag>
    <nextTag>2</nextTag>
</firstTag>

The result should look like this:

<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
<?xml version="1.0" encoding="UTF-8"?><nextTag>2</nextTag>

Following the question "Split 1GB Xml file using Java", I achieved this with the following code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo4 {

    public static void main(String[] args) throws Exception {

        InputStream inputStream = new FileInputStream("input.xml");
        BufferedReader in = new BufferedReader(new InputStreamReader(inputStream));

        XMLInputFactory factory = XMLInputFactory.newInstance();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();

        XMLStreamReader streamReader = factory.createXMLStreamReader(in);

        while (streamReader.hasNext()) {
            streamReader.next();

            if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT
                    && "nextTag".equals(streamReader.getLocalName())) {

                StringWriter writer = new StringWriter();
                t.transform(new StAXSource(streamReader), new StreamResult(
                        writer));
                String output = writer.toString();
                System.out.println(output);

            }

        }

    }

}

That part is simple. However, my actual input file is on a single line:

<?xml version="1.0" encoding="UTF-8"?><firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>

With that input, my Java code no longer produces the desired output, only this:

 <?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>

After spending a few hours on it, I am fairly sure I have found the cause:

 t.transform(new StAXSource(streamReader), new StreamResult(writer));

By the time transform returns, the cursor has already advanced to the next event. My code contains this fragment:

while (streamReader.hasNext()) {
    streamReader.next();
                      ...
        t.transform(new StAXSource(streamReader), new StreamResult(writer));
                      ...
}

After the first transformation, streamReader therefore advances twice:

 1. once inside the transform method
 2. once from the next() call in the while loop

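As a result, with this single-line XML the cursor never lands on the second start tag. If the input XML is pretty-printed, the cursor does reach the second start tag, because a whitespace event follows the first end tag and absorbs the extra step.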
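
To make the difference between the two inputs visible, the following sketch lists the events a plain XMLStreamReader reports for the compact and the indented document. The class name EventTrace and the labels it prints are made up for illustration and are not part of the original code; it assumes the JDK's built-in StAX parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: records which events the cursor visits, in order.
class EventTrace {

    // Returns one readable label per event reported after START_DOCUMENT.
    static List<String> trace(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    events.add("START " + reader.getLocalName());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    events.add("END " + reader.getLocalName());
                    break;
                case XMLStreamConstants.SPACE:
                    events.add("WHITESPACE");
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // Indentation between tags arrives as whitespace-only text
                    events.add(reader.isWhiteSpace() ? "WHITESPACE" : "TEXT");
                    break;
                default:
                    break; // END_DOCUMENT and other events are not relevant here
            }
        }
        reader.close();
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        String compact = "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        String indented = "<firstTag>\n  <nextTag>1</nextTag>\n  <nextTag>2</nextTag>\n</firstTag>";
        System.out.println(trace(compact));
        System.out.println(trace(indented));
    }
}
```

In the compact trace, the event right after the first "END nextTag" is "START nextTag"; in the indented trace a whitespace event sits between them. If transform leaves the cursor one event past the end tag, as described above, the loop's own next() call steps over the second start tag in the compact case but only over the whitespace in the indented case.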

Unfortunately, I could not find any setting that stops the transformer from moving to the next event after transform has run. This is very frustrating.

Does anyone know how I can handle this? Conceptual explanations are welcome too. Thank you very much.

Regards,

Ratna

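PS. I could certainly write a workaround (pretty-print the XML document before transforming it), but that would mean modifying the input XML beforehand, which is not allowed.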

java xml xml-parsing sax stax
2 answers

2 votes

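As you worked out, the transformation step itself advances the cursor, so when element nodes follow each other directly it is already on the next start element when transform returns.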

To solve this, you can rewrite the code with a nested while loop, like this:

while(reader.next() != XMLStreamConstants.END_DOCUMENT) {
    while(reader.getEventType() == XMLStreamConstants.START_ELEMENT && reader.getLocalName().equals("nextTag")) {
        StringWriter writer = new StringWriter();
        // serializes the current element to a String; afterwards the cursor sits on the event that follows its end tag
        t.transform(new StAXSource(reader), new StreamResult(writer)); 
        System.out.println(writer.toString());
    }
}
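
The fragment above leaves out the setup. A self-contained sketch of the same nested-loop idea might look like the following; the class name NestedLoopSplit, the split method, and reading from a String instead of input.xml are assumptions made for illustration, and the cursor behaviour relied on is the one observed in the question with the JDK's built-in transformer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical wrapper around the nested-loop approach from the answer.
class NestedLoopSplit {

    // Serializes every element named tagName into its own XML string.
    static List<String> split(String xml, String tagName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        List<String> fragments = new ArrayList<>();

        while (reader.hasNext()) {
            reader.next();
            // Re-check the current event after each transform, without calling
            // next() in between, so an adjacent start tag is not skipped.
            while (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && tagName.equals(reader.getLocalName())) {
                StringWriter writer = new StringWriter();
                t.transform(new StAXSource(reader), new StreamResult(writer));
                fragments.add(writer.toString());
            }
        }
        reader.close();
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String compact = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        for (String fragment : split(compact, "nextTag")) {
            System.out.println(fragment);
        }
    }
}
```

The inner loop is what makes the difference: it inspects whatever event transform left the cursor on before the outer loop is allowed to call next() again, so it should handle both the single-line and the pretty-printed input.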

1 vote

If your XML file fits in memory, you can try the JOOX library, imported as a dependency, for example (Gradle notation):

compile 'org.jooq:joox:1.3.0'

And a main class such as:

import java.io.File;
import java.io.IOException;
import org.joox.JOOX;
import org.joox.Match;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import static org.joox.JOOX.$;

public class Main {

    public static void main(String[] args) 
            throws IOException, SAXException, TransformerException {
        DocumentBuilder builder = JOOX.builder();
        Document document = builder.parse(new File(args[0]));

        Transformer transformer = 
                TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty("omit-xml-declaration", "no");

        final Match $m = $(document);
        $m.find("nextTag").forEach(tag -> {
            try {
                transformer.transform(
                        new DOMSource(tag), 
                        new StreamResult(System.out));
                System.out.println();
            }
            catch (TransformerException e) {
                System.exit(1);
            }
        });

    }
}

It produces:

<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
<?xml version="1.0" encoding="UTF-8"?><nextTag>2</nextTag>
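
For readers who prefer not to add a dependency, the same in-memory idea can be sketched with the JDK's own DOM classes instead of JOOX. This is a swapped-in variant, not the answer's code: the class name DomSplit and the split method are hypothetical, and like the JOOX version it only suits files that fit in memory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical JDK-only variant of the in-memory approach.
class DomSplit {

    // Parses the whole document, then serializes each matching element separately.
    static List<String> split(String xml, String tagName) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Transformer transformer = TransformerFactory.newInstance().newTransformer();

        NodeList nodes = document.getElementsByTagName(tagName);
        List<String> fragments = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            StringWriter writer = new StringWriter();
            // A DOMSource wraps one node, so no stream cursor is shared between iterations
            transformer.transform(new DOMSource(nodes.item(i)), new StreamResult(writer));
            fragments.add(writer.toString());
        }
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String compact = "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        split(compact, "nextTag").forEach(System.out::println);
    }
}
```

Because each element is handed to the transformer as its own node, the cursor problem from the question does not arise here; the trade-off is that the entire document is held in memory.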