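I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also retrieves the byte-array data of the images in the document, but I need to turn that data into HTML so that I can write it to an HTML file. Any hints or suggestions would be greatly appreciated. Please keep in mind that I am a desktop developer, not a web programmer, when making suggestions. The code below gets the images.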
    private void parseWordText(File file) throws IOException {
        // try-with-resources closes the input stream once the document is loaded
        try (FileInputStream fs = new FileInputStream(file)) {
            doc = new HWPFDocument(fs);
        }
        PicturesTable picTable = doc.getPicturesTable();
        if (picTable != null) {
            picList = new ArrayList<Picture>(picTable.getAllPictures());
            if (!picList.isEmpty()) {
                for (Picture pic : picList) {
                    byte[] byteArray = pic.getContent(); // raw image bytes
                    // These calls return metadata; the results are not stored yet
                    pic.suggestFileExtension();
                    pic.suggestFullFileName();
                    pic.suggestPictureType();
                    pic.getStartOffset();
                }
            }
        }
    }
The code below then converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in the code below?
    private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
        HWPFDocumentCore wordDocument;
        // Let an IOException propagate (the method already declares it)
        // instead of continuing with a null document
        try (FileInputStream in = new FileInputStream(file)) {
            wordDocument = WordToHtmlUtils.loadDoc(in);
        }
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        wordToHtmlConverter.processDocument(wordDocument);
        org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(out);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer serializer = tf.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, streamResult);
        out.close();
        // Decode with the same charset the serializer was told to use
        String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
        acDocTextArea.setText(newDocText);
        htmlText = result;
    }
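A side note on the serialization step: decoding the output with new String(byte[]) uses the platform default charset, which may not match the UTF-8 requested from the serializer. Below is a minimal, JDK-only sketch that writes to a StringWriter instead, so there is no byte round trip. The names DomToHtml, serialize and sampleDocument are made up for illustration, and the tiny DOM built here merely stands in for the document returned by the converter.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomToHtml {
    /** Serializes a DOM document to an HTML string via a character stream. */
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter writer = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds <html><body><p>hello</p></body></html>; a stand-in for the converter output. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("hello");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

When the string is later written to a file, the writer should be opened with UTF-8 explicitly so that the file matches the declared encoding.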
Take a look at the source code of org.apache.poi.hwpf.converter.WordToHtmlConverter:
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740
Its JavaDoc states:
This implementation doesn't create images or links to them. This can be changed by overriding the {@link #processImage(Element, boolean, Picture)} method.
If you look at the processImage(...) method in AbstractWordConverter.java at line 790, it appears to call another method named processImageWithoutPicturesManager(...).
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740
That method is in turn defined in WordToHtmlConverter, and it looks very much like the place where you want to extend the code (line 317):
    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock,
            boolean inlined, Picture picture)
    {
        // no default implementation -- skip
        currentBlock.appendChild(htmlDocumentFacade.document
                .createComment("Image link to '"
                        + picture.suggestFullFileName() + "' can be here"));
    }
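I think this gives you the point at which to start inserting images into the flow.
Create a subclass of the converter, for example
    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter
Then override the method and put whatever code you need into it. I haven't tested this, but from what I can see it should, in theory, be the right approach.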
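To make the idea a bit more concrete, here is a minimal, untested-against-POI sketch of the DOM work such an override could do: create an img element in the block's owner document and append it where the converter currently appends a comment. It uses only the JDK; ImgElementSketch, appendImage and newParagraph are hypothetical names, and the plain src string stands in for whatever you derive from the POI Picture (for instance a file name from suggestFullFileName()).

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class ImgElementSketch {
    /** Appends <img src="..."> to the given block and returns the new element. */
    static Element appendImage(Element currentBlock, String src) {
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", src);
        currentBlock.appendChild(img);
        return img;
    }

    /** Creates an empty <p> in a fresh document; only needed for this demo. */
    static Element newParagraph() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            doc.appendChild(p);
            return p;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = newParagraph();
        Element img = appendImage(p, "imgs/picture1.png");
        System.out.println(img.getTagName() + " src=" + img.getAttribute("src"));
    }
}
```

Inside the real override, currentBlock is the element the converter passes in, so only the body of appendImage would be needed there.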
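@user4887078 It really is as straightforward as @Guga said. All I did was look at org.apache.poi.xwpf.converter.core.FileImageExtractor and voilà! It works as expected, though it may still need some refactoring and optimization.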
    HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(is);
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .newDocument());
    wordToHtmlConverter.setPicturesManager(new PicturesManager() {
        @Override
        public String savePicture(byte[] bytes, PictureType pictureType,
                String suggestedName, float widthInches, float heightInches) {
            // Images are written to pages/imgs; the returned path is relative
            // to the HTML file, which is assumed to live in "pages"
            File imageFile = new File("pages/imgs", suggestedName);
            imageFile.getParentFile().mkdirs();
            // try-with-resources closes the stream even if the write fails
            try (FileOutputStream out = new FileOutputStream(imageFile)) {
                out.write(bytes);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return "imgs/" + imageFile.getName();
        }
    });
    wordToHtmlConverter.processDocument(wordDocument);
    Document htmlDocument = wordToHtmlConverter.getDocument();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.METHOD, "html");
    transformer.transform(domSource, streamResult);
    out.close();
    // Decode with the same charset the serializer was told to use
    String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
    // Write the HTML to outFile (presumably the intended last step)
    try (FileOutputStream fos = new FileOutputStream(outFile)) {
        fos.write(result.getBytes(StandardCharsets.UTF_8));
    }
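The file-writing part of savePicture can also be pulled out into a small helper, which makes it easier to test without a Word document. This is only a sketch under a few assumptions: ImageFileWriter and saveImage are hypothetical names, htmlDir is the directory that will contain the HTML file, and failures are rethrown as UncheckedIOException rather than printed, since savePicture does not declare checked exceptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ImageFileWriter {
    /**
     * Writes the image bytes to htmlDir/subDir/fileName and returns the
     * path relative to htmlDir, suitable for an img src attribute.
     */
    static String saveImage(Path htmlDir, String subDir, String fileName, byte[] content) {
        try {
            Path dir = htmlDir.resolve(subDir);
            Files.createDirectories(dir);          // no-op if it already exists
            Files.write(dir.resolve(fileName), content);
            return subDir + "/" + fileName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path htmlDir = Paths.get(System.getProperty("java.io.tmpdir"), "pages");
        String src = saveImage(htmlDir, "imgs", "picture1.png", new byte[] {1, 2, 3});
        System.out.println("img src: " + src);
    }
}
```

From the PicturesManager callback this would be called as something like saveImage(Paths.get("pages"), "imgs", suggestedName, bytes), returning the value the converter puts into the generated img tag.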
The following should be useful:
    // Needs java.util.Base64 (Java 8+), org.w3c.dom.Document and org.w3c.dom.Element
    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {
        public InlineImageWordToHtmlConverter(Document document) {
            super(document);
        }

        @Override
        protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
            // Embed the image directly in the HTML as a base64 data: URI
            Element img = super.getDocument().createElement("img");
            // Use the picture's own MIME type rather than assuming PNG
            img.setAttribute("src", "data:" + picture.getMimeType() + ";base64,"
                    + Base64.getEncoder().encodeToString(picture.getContent()));
            currentBlock.appendChild(img);
        }
    }
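The data: URI step can be tried on its own, without POI. The sketch below builds the URI from raw bytes and, as an extra that is not part of the answer above, guesses the MIME type from the well-known PNG, JPEG and GIF file signatures for cases where the type is not otherwise known. DataUriImages, sniffMimeType and toDataUri are hypothetical names, and the fallback type is an arbitrary choice.

```java
import java.util.Base64;

final class DataUriImages {
    /** Guesses a MIME type from common file signatures; falls back to a generic type. */
    static String sniffMimeType(byte[] c) {
        if (c.length >= 4 && (c[0] & 0xFF) == 0x89 && c[1] == 'P' && c[2] == 'N' && c[3] == 'G') {
            return "image/png";
        }
        if (c.length >= 2 && (c[0] & 0xFF) == 0xFF && (c[1] & 0xFF) == 0xD8) {
            return "image/jpeg";
        }
        if (c.length >= 4 && c[0] == 'G' && c[1] == 'I' && c[2] == 'F' && c[3] == '8') {
            return "image/gif";
        }
        return "application/octet-stream";
    }

    /** Builds a data: URI that can be used directly as an img src value. */
    static String toDataUri(byte[] content) {
        return "data:" + sniffMimeType(content) + ";base64,"
                + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        // First three bytes of a JPEG file; a real image would be much longer
        byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
        System.out.println(toDataUri(jpegStart));
    }
}
```

Keep in mind that inlined images make the HTML file considerably larger, so for documents with many or large pictures writing separate image files may be the better fit.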