I'm building a module that reads the content of a PDF document (which is based on System.FileDocument).
First I made a little setup that uses PDFBox to read a PDF directly from the filesystem (pdfbox-2.0.3.jar, retrieved via Maven).
InputStream is = new FileInputStream("C:\\Users\\dude\\JaarverslagPensioenfonds.pdf");
BufferedInputStream bis = new BufferedInputStream(is);
PDDocument pdDocument = PDDocument.load(bis);
PDFTextStripper textStripper = new PDFTextStripper();
String text = textStripper.getText(pdDocument);
That works nicely. (I could have used a simpler load(), but the version above suits this example better.)
So now I want to use the PDF library that shows up in userlib (pdfbox-app-2.0.3.jar). I guess this is a Mendix-specific tweak of pdfbox-2.0.3.jar?
I use the code above, slightly tweaked, in a Java action:
InputStream inputStream = Core.getFileDocumentContent(getContext(), __document);
PDDocument pdDocument = PDDocument.load(inputStream);
PDFTextStripper textStripper = new PDFTextStripper();
String text = textStripper.getText(pdDocument);
(__document is an extension of System.FileDocument, passed in as an IMendixObject.)
Then I get this error:
java.lang.ExceptionInInitializerError: null
    at amicosensoring.actions.DocumentToTextAction.executeAction(DocumentToTextAction.java:41)
    at amicosensoring.actions.DocumentToTextAction.executeAction(DocumentToTextAction.java:1)
Line 41 is the line where the PDFTextStripper is instantiated. On the line before it, the PDDocument is loaded correctly (the debugger in Eclipse confirms that).
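Since the message is null, the real failure is presumably hidden in the cause of the ExceptionInInitializerError. Below is a minimal sketch of how that cause could be unwrapped. The names InitializerErrorDemo, BrokenStatic and describeCause are made up for illustration, and BrokenStatic only simulates a class whose static initializer fails; it is not what PDFTextStripper actually does.

```java
public class InitializerErrorDemo {

    // Hypothetical stand-in for a class whose static initializer fails.
    static class BrokenStatic {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("simulated static-init failure");
        }
    }

    // Walks the cause chain and describes the innermost exception.
    static String describeCause(Throwable error) {
        Throwable cause = error;
        while (cause.getCause() != null) {
            cause = cause.getCause();
        }
        return cause.getClass().getName() + ": " + cause.getMessage();
    }

    public static void main(String[] args) {
        try {
            // First access triggers class initialization, which fails.
            System.out.println(BrokenStatic.VALUE);
        } catch (ExceptionInInitializerError e) {
            // Prints: java.lang.IllegalStateException: simulated static-init failure
            System.out.println(describeCause(e));
        }
    }
}
```

In the Java action, the same try/catch could be wrapped around new PDFTextStripper() to log the underlying exception. Note that only the first failed initialization raises ExceptionInInitializerError; later attempts to use the same class in the same JVM raise NoClassDefFoundError instead, so the runtime may need a restart to see the original cause again.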
The source of the Mendix version of pdfbox is not available, but it appears that this stripper extends org.apache.pdfbox.text.LegacyPDFStreamEngine. Note the 'Legacy' part; the 'regular' pdfbox-2.0.3.jar doesn't have that.
My questions are: what is the correct way of initialising this PDFTextStripper within Mendix Java actions, what is this 'Legacy' about, and is there a workaround?
Thanks,
Nol