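I am trying to extract text from multi-page PDF documents. Almost all of them extract fine, but a few fail with an error about encoding 10000. The only distinctive thing about the pages that fail is that they contain a button and form fields.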
{
    var pageNumbersToSave = new List<int>();
    for (var i = 1; i <= r.NumberOfPages; i++)
    {
        try
        {
            var s = PdfTextExtractor.GetTextFromPage( r, i, new SimpleTextExtractionStrategy() );
I also tried flattening the form elements with PdfStamper, but that changed nothing.
byte[] flatBytes;
using ( var r = new PdfReader( pdfBytes ) )
{
    using (var ms = new MemoryStream())
    {
        using (var flattener = new PdfStamper(r, ms))
        {
            for ( var i = 1; i <= r.NumberOfPages; i++ )
            {
                r.AcroFields.RemoveFieldsFromPage( i );
            }
            flattener.FormFlattening = true;
            flattener.Close();
        }
        flatBytes = ms.ToArray();
    }
}
Obviously, when I test with the stamper, the top snippet uses flatBytes instead of pdfBytes.
Full exception message: No data is available for encoding 10000. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
Stack Trace:
at System.Text.Encoding.GetEncoding(Int32 codepage)
at System.Text.Encoding.GetEncoding(Int32 codepage, EncoderFallback encoderFallback, DecoderFallback decoderFallback)
at iTextSharp.text.xml.simpleparser.IanaEncodings.GetEncodingEncoding(String name)
at iTextSharp.text.pdf.PdfEncodings.ConvertToString(Byte[] bytes, String encoding)
at iTextSharp.text.pdf.DocumentFont.FillEncoding(PdfName encoding)
at iTextSharp.text.pdf.DocumentFont.DoType1TT()
at iTextSharp.text.pdf.DocumentFont.Init()
at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi, ICollection markedContentInfoStack)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
at VerataParsers.ECWPdfExtractor.StripImagesFromPdf(Byte[] pdfBytes, Int32& adjustedPageCount) in C:\Users\Dell T5610\source\repos\SecureDirectMessaging\VerataParsers\ECWPdfExtractor.cs:line 85
Solved by adding the System.Text.Encoding.CodePages NuGet package and then registering the provider as follows.
var codePages = CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(codePages);
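Why this likely works: code page 10000 is Mac OS Roman, and on .NET Core and later, code-page encodings like it are not available from Encoding.GetEncoding until CodePagesEncodingProvider is registered. The stack trace suggests iTextSharp asks for that encoding while loading a font used by the form XObject, which would explain why only the pages with form fields fail. Below is a minimal sketch of registering the provider once and checking that the lookup succeeds. EncodingSetup and GetMacRoman are hypothetical names used for illustration only; they are not part of iTextSharp or the original code.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of iTextSharp or the original post.
public static class EncodingSetup
{
    // Registers the code-page provider and returns the encoding that the
    // exception complained about. Without the registration, the
    // GetEncoding(10000) call is the one that throws on .NET Core.
    public static Encoding GetMacRoman()
    {
        // Registering the same provider instance more than once is harmless,
        // but doing it once at application startup is enough.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Same lookup that appears at the top of the stack trace.
        return Encoding.GetEncoding(10000);
    }
}
```

In a real application the RegisterProvider call would normally sit at startup (for example at the top of Main), before any PdfReader or PdfTextExtractor call, so every later lookup made inside the library can resolve the encoding.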