Ordering of docx document text in document.xml

Question

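I am trying to extract text from docx files, and the text comes back jumbled: for example, text located at the bottom of the page or in arbitrary text boxes is extracted first, followed by text from elsewhere. After looking into this, we know that a docx file is a set of XML files zipped together. In document.xml, I can see the text in exactly the same order as in my output. I think this may be caused by the relativeHeight attribute on the anchor tag, or is there some other reason?
These documents contain frames, text boxes, and other objects, and all of them return text in the wrong order: text inside frames, text boxes, and other drawing objects is extracted first, followed by the remaining text.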

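I tried working with the document.xml file using Apache POI and Aspose.Words. I either get only the text that is not inside these frames, text boxes, and so on, or I get the same incorrect output. Why is the text in document.xml out of order, that is, different from the order shown when the docx file is opened? Does a docx file follow any priority when placing its text into the XML file? Can we manipulate the XML to retrieve the text in the order MS Word displays it?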

SharedPtr<Document> doc = MakeObject<Document>(inputPath);
auto paragraphs = doc->GetChildNodes(NodeType::Paragraph, true);

for (int i = 0; i < paragraphs->get_Count(); i++)
{
    auto para = System::ExplicitCast<Paragraph>(paragraphs->idx_get(i));
    auto paragraphFormat = para->get_FrameFormat();

    // Check if the paragraph has frame properties set
    if (paragraphFormat->get_IsFrame())
    {
        // Access various frame properties
        double width = paragraphFormat->get_Width();
        double height = paragraphFormat->get_Height();
        double horizontalPosition = paragraphFormat->get_HorizontalPosition();
        double verticalPosition = paragraphFormat->get_VerticalPosition();
        logFile << L"horizontalPosition " << horizontalPosition << std::endl;
        logFile << L"verticalPosition " << verticalPosition << std::endl;
    }
    else
    {
        String text = para->GetText();
        logFile << text << std::endl;
    }
}

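In the script above, I use the Aspose.Words C++ API to check whether the current paragraph is a frame and retrieve its position and dimensions; if it is not a frame, its text is saved to the output file. I want to extract all text from the document, whatever kind of element contains it, from top left to bottom right. How can I do this with Aspose.Words?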

I also tried the LayoutEnumerator and LayoutCollector classes provided in Aspose.Words to obtain the X and Y coordinates of the paragraphs, sort by them, and extract the text.

auto layoutCollector = MakeObject<LayoutCollector>(doc);
auto layoutEnumerator = MakeObject<LayoutEnumerator>(doc);

// ParagraphInfo and CompareParagraphs are defined elsewhere (not shown)
std::vector<ParagraphInfo> paragraphsInfo;

// Walk every paragraph in the body of every section
for (int sectionIndex = 0; sectionIndex < doc->get_Sections()->get_Count(); ++sectionIndex) {
    auto section = doc->get_Sections()->idx_get(sectionIndex);
    auto body = section->get_Body();
    auto paragraphs = body->get_Paragraphs();

    for (int paraIndex = 0; paraIndex < paragraphs->get_Count(); ++paraIndex) {
        auto para = paragraphs->idx_get(paraIndex);

        // Point the enumerator at the layout entity of this paragraph
        layoutEnumerator->set_Current(layoutCollector->GetEntity(para));

        // Record the paragraph together with its X and Y coordinates
        paragraphsInfo.push_back({ para, layoutEnumerator->get_Rectangle()->get_X(), layoutEnumerator->get_Rectangle()->get_Y() });
    }
}

// Sort by position, then write the text out in that order
std::sort(paragraphsInfo.begin(), paragraphsInfo.end(), CompareParagraphs);

for (const auto& paragraphInfo : paragraphsInfo) {
    logFile << paragraphInfo.paragraph->GetText() << std::endl;
}

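However, the get_X() and get_Y() methods appear to have been deprecated in the library. System::Drawing::RectangleF stores the X and Y coordinates and the dimensions of the rectangle currently pointed to. Is there another class in Aspose.Words that I can use to do the same thing?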

Answer 1

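Usually, the visual order in an MS Word document matches the document structure. This is not a rule, however, because MS Word documents can contain floating objects such as shapes, tables, or frames. When floating objects are used, the document node hierarchy is not expected to match the visual representation of the document in MS Word. Unfortunately, there is no way to return nodes in visual order using Aspose.Words. You are right that you can try using the LayoutCollector and LayoutEnumerator classes to calculate the positions of nodes in the main document body. It is not clear what you mean by get_X and get_Y being deprecated. They are there and work fine on my side: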

#include <drawing/rectangle_f.h>

std::cout << enumerator->get_Rectangle().get_X();
std::cout << enumerator->get_Rectangle().get_Y();
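Note that the snippet above calls get_X() with `.` on the value returned by get_Rectangle(), whereas the code in the question uses `->`; that difference may be what made the methods look unavailable. Once the coordinates have been collected, the sorting step the question leaves out could look roughly like the sketch below. It is only an illustration that compiles without Aspose.Words: the `ParagraphInfo` fields, the `page` index, and the `TextInReadingOrder` helper are assumptions made for this example, and the coordinates are assumed to have been read from the layout classes beforehand.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record. In the question it would hold the Paragraph pointer;
// plain text is used here so the sketch builds without Aspose.Words.
struct ParagraphInfo {
    std::string text;
    int page;  // page index of the paragraph (assumed to be collected by the caller)
    float x;   // left edge of the layout rectangle
    float y;   // top edge of the layout rectangle
};

// Orders by page first, then top to bottom (Y), then left to right (X).
bool CompareParagraphs(const ParagraphInfo& a, const ParagraphInfo& b) {
    return std::tie(a.page, a.y, a.x) < std::tie(b.page, b.y, b.x);
}

// Returns the texts sorted into reading order. The vector is taken by value
// so the caller's data is left untouched.
std::vector<std::string> TextInReadingOrder(std::vector<ParagraphInfo> items) {
    std::stable_sort(items.begin(), items.end(), CompareParagraphs);
    std::vector<std::string> ordered;
    ordered.reserve(items.size());
    for (const auto& item : items) {
        ordered.push_back(item.text);
    }
    return ordered;
}
```

The comparison is exact, so two paragraphs on the same visual line whose Y values differ slightly are ordered by Y alone. If that matters, Y could be rounded to a line height before comparing. Including the page index keeps text on later pages from being mixed into earlier ones, since layout coordinates are typically relative to the page.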
