How to convert HTML to DOCX with formatting and styles intact

Question

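I am trying to convert an HTML5 file to docx using docx4j. The bigger picture is that the HTML contains both Arabic and English data. I have set styles on the elements in my HTML. The HTML looks neat in Chrome, but when I convert it to docx with docx4j, the formatting of the Arabic text is lost. MS Word reports that my Arabic text has the bold style set, yet it is not rendered bold. Likewise, the RTL direction is lost: tables are flipped from RTL to LTR. As a workaround, I used a BufferedWriter to generate a .doc file, which matches my HTML file including the style attributes, but the Base64 images present in the HTML do not appear in the .doc file. Hence the need to convert to .docx. My requirement is an editable document generated from the HTML. I am scratching my head over this, so please guide me. Pointers without sample source code are fine too.

This is the code I use to convert the HTML to docx.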

public boolean convertHTMLToDocx(String inputFilePath, String outputFilePath, boolean headerFlag,
        boolean footerFlag,String orientation, String logoPath, String margin, JSONObject json,boolean isArabic) {
    boolean conversionFlag;
    boolean orientationFlag = false;
    try {
        if(!orientation.equalsIgnoreCase("Y")){
            orientationFlag = true;
        }
        String stringFromFile = FileUtils.readFileToString(new File(inputFilePath), "UTF-8");
        String unescaped = stringFromFile;
        WordprocessingMLPackage wordMLPackage  = WordprocessingMLPackage.createPackage();
        NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
        wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
        ndp.unmarshalDefaultNumbering();

        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.Bidi.Heuristic", true);
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.Element.Heading.MapToStyle", true);
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.serif", "Frutiger LT Arabic 45 Light");
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.sans-serif", "Frutiger LT Arabic 45 Light");
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.monospace", "Frutiger LT Arabic 45 Light");

        XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
        xHTMLImporter.setHyperlinkStyle("Hyperlink");
        xHTMLImporter.setParagraphFormatting(FormattingOption.CLASS_PLUS_OTHER);
        xHTMLImporter.setTableFormatting(FormattingOption.CLASS_PLUS_OTHER);
        xHTMLImporter.setRunFormatting(FormattingOption.CLASS_PLUS_OTHER);

        wordMLPackage.getMainDocumentPart().getContent().addAll(xHTMLImporter.convert(unescaped, ""));

        XmlUtils.marshaltoString(wordMLPackage.getMainDocumentPart().getJaxbElement(),true,true);
        File output = new File(outputFilePath);

        wordMLPackage.save(output);

        Console.log("file path where it is stored is" + " " + output.getAbsolutePath());
        if (headerFlag || footerFlag) {
            File file = new File(outputFilePath);

            // Load directly from the file so that no InputStream is left open
            wordMLPackage = WordprocessingMLPackage.load(file);
            if (headerFlag) {
                // set Header 
            }
            if (footerFlag) {
                // set Footer
            }

            wordMLPackage.save(file);
            Console.log("Finished editing the word document");
        }
        conversionFlag = true;
    } catch (InvalidFormatException e) {
        Error.log("Invalid format found:-" + getStackTrace(e));
        conversionFlag = false;
    } catch (Exception e) {
        Error.log("Error while converting:-" + getStackTrace(e));
        conversionFlag = false;
    }

    return conversionFlag;
}

Answer 1

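I'll start with the answer @JasonPlutext gave to a similar question on the docx4j forum. I should mention that the jar I was using had this problem. I then followed this link:

https://www.docx4java.org/forums/docx-java-f6/convert-html-to-docx-with-rtl-for-hebrew-arabic-language-t2712.html

and found the jar in the following comment on that page:

You could try https://docx4java.org/docx4j/docx4j-Imp ... 180801.jar

It contains https://github.com/plutext/docx4j-Impor ... f378022303

Could you please take a look at https://github.com/plutext/docx4j-Impor ... iTest.java and add further tests for mixed Hebrew/Arabic and left-to-right text, especially for any cases where you feel the implementation is incorrect.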

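The jar would not download from that link either, so I searched for it by name and downloaded it, along with all its dependencies, from jardownload.com. The commons-codec and commons-io jars I got were version 1.3; they need to be upgraded to the latest versions for images to appear in the docx after conversion. Also make sure the HTML is well formed, because docx4j requires strictly well-formed HTML.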
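
Since the importer expects well-formed markup, it can help to check the HTML before handing it to docx4j. The following is a minimal sketch that uses only the JDK's XML parser; `WellFormedCheck` and `isWellFormed` are hypothetical names for illustration and are not part of docx4j. A plain XML parser also rejects HTML-only named entities such as `&nbsp;`, so those would need to be replaced with numeric references first.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of using the built-in error reporting
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The escaped characters spell an Arabic word, to keep this source file ASCII-only
        System.out.println(isWellFormed(
                "<html><body><p dir=\"rtl\">\u0645\u0631\u062D\u0628\u0627</p></body></html>")); // true
        System.out.println(isWellFormed(
                "<html><body><p>unclosed<br></p></body></html>")); // false
    }
}
```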

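Now for the real part: how to get everything looking the same as the HTML. I got around the problem by writing the HTML as a plain byte array to a .doc file instead of a .docx. That way the document looks exactly like the HTML. The only problem I faced was that the binary (Base64) images were not displayed; only an empty box appeared where each image should be. So I wrote two files:

1. The first reads all the binary image tags in the HTML file, decodes each image with a Base64 decoder, saves the image to local disk on the remote server, and replaces the src attribute of each such img tag with the new location. (The new location is prefixed with http://{remote_server}:{remote_port}/{war_deployment_descriptor}/images/)

2. The second is a simple servlet in the war file deployed on the server. It listens for GET requests on /images and, on receiving a GET request with a path name, writes that image to the output stream. Voilà, the images started to appear.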
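
The decoding part of the first step can be sketched with the JDK's own `java.util.Base64` rather than commons-codec. This is only an illustration under stated assumptions: `DataUriImage` and its members are hypothetical names, and the image subtype is used directly as the file extension, so types such as `svg+xml` are deliberately not handled.

```java
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataUriImage {
    // Matches e.g. data:image/png;base64,iVBORw0KGgo...
    private static final Pattern DATA_URI =
            Pattern.compile("^data:image/([a-zA-Z0-9]+);base64,(.*)$", Pattern.DOTALL);

    final String extension; // e.g. "png"
    final byte[] bytes;     // decoded image content

    private DataUriImage(String extension, byte[] bytes) {
        this.extension = extension;
        this.bytes = bytes;
    }

    /** Returns the decoded image, or null if src is not a Base64 image data URI. */
    static DataUriImage parse(String src) {
        Matcher m = DATA_URI.matcher(src.trim());
        if (!m.matches()) {
            return null;
        }
        // The MIME decoder tolerates line breaks inside the Base64 payload
        byte[] decoded = Base64.getMimeDecoder().decode(m.group(2));
        return new DataUriImage(m.group(1).toLowerCase(), decoded);
    }

    public static void main(String[] args) {
        DataUriImage img = parse("data:image/png;base64,aGVsbG8=");
        System.out.println(img.extension + ", " + img.bytes.length + " bytes"); // png, 5 bytes
    }
}
```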

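It is up to you which conversion to use: .docx (which requires docx4j for the conversion) or .doc (which only requires writing the HTML as a byte array to a .doc file, plus the two code files). My suggestion is to convert English documents to .docx. For Arabic, Hebrew, or other RTL languages, use the .doc conversion unless .docx output is strictly required.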
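
The .doc route amounts to saving the HTML text, byte for byte, under a .doc file name. A minimal sketch, assuming the HTML is already a complete document encoded as UTF-8; `HtmlAsDoc` and `writeAsDoc` are hypothetical names, not taken from the code in this thread:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HtmlAsDoc {

    /** Writes the HTML text, encoded as UTF-8, to the given .doc path. */
    static void writeAsDoc(String html, Path docFile) throws IOException {
        Files.write(docFile, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // The escaped characters spell an Arabic word, to keep this source file ASCII-only
        String html = "<html dir=\"rtl\"><body><b>\u0645\u0631\u062D\u0628\u0627</b></body></html>";
        Path out = Files.createTempFile("sample", ".doc");
        writeAsDoc(html, out);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```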

Answer 2

Here are the two files; please adapt them to your needs:

File1.java
------------------------------------------------------------------------------------------

    public static void writeHTMLDatatoDoc(String content, String inputHTMLFile,String outputDocFile,String uniqueName) throws Exception {
        String baseTag = getRemoteServerURL()+"/{war_deployment_desciptor}/images?image=";
        String tag = "Image_";
        String pathOnServer = getDiskPath() + File.separator + "TemplateGeneration"
                + File.separator + "generatedTemplates" + File.separator + uniqueName + File.separator + "images" + File.separator;

        // Make sure the target directory exists before listing or writing files in it
        new File(pathOnServer).mkdirs();

        // Group 1 captures the value of the src attribute
        Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
        Matcher m = p.matcher(content);
        while (m.find()) {
            // srcTag will contain data as data:image/png;base64,AAABAAEAEBAAAAEAGABoAw.........
            // Replace this whole value later with the path on local disk
            String srcTag = m.group(1);

            // Only embedded Base64 images are rewritten; this is checked for every image
            if (!srcTag.contains("base64")) {
                continue;
            }

            // Derive the file extension from the mime type, e.g. image/png -> .png
            String mime = extractMimeType(srcTag);
            String ext = mime.contains("/") ? "." + mime.substring(mime.indexOf('/') + 1) : ".png";

            // Read the files already created for the different documents for this unique entity.
            // The location contains all image files as Image_{i}.{image_extension}.
            // Take the highest counter in use and increase it by one for the next image.
            int i = findiDynamicallyFromFilesCreatedForWI(pathOnServer) + 1;

            // Remove data:image/png;base64, from srcTag, so only the encoded image data remains.
            // Decode this using the Base64 decoder.
            byte[] imageByteArray = decodeImage(srcTag.substring(srcTag.indexOf(",") + 1));

            // Construct replacement tag
            String replacement = (baseTag+pathOnServer+tag+i+ext).replace("\\", "/");

            // Write the image inside the local directory on the server
            try (FileOutputStream imageOutFile = new FileOutputStream(pathOnServer+tag+i+ext)) {
                imageOutFile.write(imageByteArray);
            }
            content = content.replace(srcTag, replacement);
        }

        // Rewrite HTML file
        writeHTMLData(content,inputHTMLFile);

        // Write content to doc file
        writeHTMLData(content,outputDocFile);
    }

    public static int findiDynamicallyFromFilesCreatedForWI(String pathOnServer) {
        int maxFileCount = 0;
        String[] dirListing = new File(pathOnServer).list();
        // list() returns null when the directory does not exist or cannot be read
        if (dirListing == null) {
            return maxFileCount;
        }
        for (String name : dirListing) {
            int underscore = name.indexOf('_');
            int dot = name.lastIndexOf('.');
            if (underscore < 0 || dot <= underscore) {
                continue; // not an Image_{i}.{ext} file
            }
            try {
                // Compare numerically: a lexicographic sort would put Image_10 before Image_2
                maxFileCount = Math.max(maxFileCount, Integer.parseInt(name.substring(underscore + 1, dot)));
            } catch (NumberFormatException e) {
                // Ignore files that do not follow the naming scheme
            }
        }
        return maxFileCount;
    }

    private static String extractMimeType(final String encoded) {
        final Pattern mime = Pattern.compile("^data:([a-zA-Z0-9]+/[a-zA-Z0-9]+).*,.*");
        final Matcher matcher = mime.matcher(encoded);
        if (!matcher.find())
            return "";
        return matcher.group(1).toLowerCase();
    }

    private static void writeHTMLData(String inputData, String outputFilepath) {
        // try-with-resources closes the writer even if write() fails
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(new File(outputFilepath)), StandardCharsets.UTF_8))) {
            writer.write(inputData);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static byte[] decodeImage(String imageDataString) {
        return Base64.decodeBase64(imageDataString);
    }

    private static String readHTMLData(String inputFile) {
        StringBuilder data = new StringBuilder();
        String str;

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(inputFile)), StandardCharsets.UTF_8))) {
            while ((str = reader.readLine()) != null) {
                // readLine() strips the line terminator; re-add it so adjacent lines do not run together
                data.append(str).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data.toString();
    }

------------------------------------------------------------------------------------------

File2.java
------------------------------------------------------------------------------------------

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.newgen.clos.logging.consoleLogger.Console;
public class ImageServlet extends HttpServlet {
    public ImageServlet() {
        super();
    }

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        String param = request.getParameter("image");
        Console.log("Image Servlet executed");
        Console.log("File Name Requested: " + param);
        if (param == null) {
            response.sendError(HttpServletResponse.SC_BAD_REQUEST);
            return;
        }
        // Strings are immutable: replace() returns a new String that must be assigned back
        param = param.replace("\"", "").replace("%20", " ");
        // Note: param is used as a file path as received, so any readable file can be requested
        File file = new File(param);
        if (!file.isFile()) {
            response.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }
        response.setHeader("Content-Type", getServletContext().getMimeType(param));
        response.setHeader("Content-Length", String.valueOf(file.length()));
        response.setHeader("Content-Disposition", "inline; filename=\"" + file.getName() + "\"");
        Files.copy(file.toPath(), response.getOutputStream());
    }
}
------------------------------------------------------------------------------------------
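
Because the servlet turns a request parameter straight into a file path, it may be worth confining lookups to the images directory. The sketch below shows one way to do that with `java.nio.file`; `SafeImagePath`, `resolve`, and the base directory are hypothetical and not part of the original answer. The check is purely syntactic and does not follow symbolic links.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class SafeImagePath {

    /**
     * Resolves the requested name against baseDir and returns the result,
     * or null if the normalized path would fall outside baseDir.
     */
    static Path resolve(Path baseDir, String requested) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(requested).normalize();
        return target.startsWith(base) ? target : null;
    }

    public static void main(String[] args) {
        Path base = Paths.get("images");
        System.out.println(resolve(base, "Image_1.png") != null);   // true
        System.out.println(resolve(base, "../secret.txt") != null); // false
    }
}
```

In the servlet, the `image` parameter would then carry only a file name such as `Image_1.png`, and a `null` result would be answered with a 404.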
