This is the third article in a series on file processing in JavaScript. The first two covered exporting PDF files and previewing Word/Excel/PDF/PPT files online; readers who are interested can refer to them. This article adds a way to extract text and pictures from a PDF in JavaScript.
Extract text from PDF - Core code
The core code still uses the pdfjs library. It was also mentioned in the previous article, where it was mainly used for previewing PDFs on the web.
Document address://api/draft/
/**
 * Retrieves the text of a specific page within a PDF document obtained through pdfjs
 *
 * @param {number} pageNum Specifies the number of the page (pages start at 1)
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 * @returns {Promise<string>} Resolves with the text of the page
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return a Promise that is resolved once the text of the page is retrieved
    return new Promise(function (resolve, reject) {
        PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
            // The main trick to obtain the text of the PDF page: use the getTextContent method
            return pdfPage.getTextContent().then(function (textContent) {
                var textItems = textContent.items;
                var finalString = '';

                // Concatenate the string of each item to the final string
                for (var i = 0; i < textItems.length; i++) {
                    var item = textItems[i];
                    finalString += item.str + ' ';
                }

                // Resolve the promise with the text retrieved from the page
                resolve(finalString);
            });
        }).catch(reject); // Pass loading or parsing errors on to the caller
    });
}
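A document usually has more than one page, so the per-page approach above can be wrapped in a loop. The following is only a minimal sketch: `getAllText` and `fakeDocument` are hypothetical names introduced for illustration, and the code assumes the document object exposes `numPages`, `getPage(n)` and, on each page, `getTextContent()`, as a pdfjs document does.

```javascript
// Sketch: gather the text of every page, in page order.
// `pdfDocument` is assumed to look like a pdfjs document: it has `numPages`
// and `getPage(n)`, and each page has `getTextContent()`.
async function getAllText(pdfDocument) {
    const pages = [];
    // Page numbers start at 1
    for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        const page = await pdfDocument.getPage(pageNum);
        const textContent = await page.getTextContent();
        // Keep only items that carry a string and join them with spaces
        const pageText = textContent.items
            .filter((item) => typeof item.str === 'string')
            .map((item) => item.str)
            .join(' ');
        pages.push(pageText);
    }
    return pages;
}

// Hypothetical stand-in for a loaded document, used only to try the
// function without a real PDF file
const fakeDocument = {
    numPages: 2,
    getPage: async (n) => ({
        getTextContent: async () => ({
            items: [{ str: 'Page' }, { str: String(n) }],
        }),
    }),
};

getAllText(fakeDocument).then((pages) => console.log(pages));
```

Pages are read one after another here to keep the order obvious; for large documents the page promises could also be collected and awaited together with `Promise.all`.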
Extract pictures from PDF
The core code is as follows:
// Assumes Node.js with the pdfjs-dist and canvas (node-canvas) packages installed.
// The pdfjs-dist entry point differs between versions, so adjust the require path to yours.
const fs = require('fs');
const { createCanvas } = require('canvas');
const pdfjsLib = require('pdfjs-dist/legacy/build/pdf.js');

// first, open the document (the path is left empty here; fill in your own PDF file)
pdfjsLib.getDocument('').promise.then(async function (pdfObj) {
    // because I am testing, I just wanted to get page 7
    const page = await pdfObj.getPage(7);
    // the image information lives in the operator list
    const operators = await page.getOperatorList();
    // this looks for paintImageXObject; there are other operators, like paintJpegXObject,
    // which I assume should work the same way. The result is the list of indexes
    // where an image was inserted
    const rawImgOperator = operators.fnArray
        .map((f, index) => (f === pdfjsLib.OPS.paintImageXObject ? index : null))
        .filter((n) => n !== null);
    // your array may be empty; I knew for sure that page 7 had an image
    if (rawImgOperator.length === 0) {
        return;
    }
    // now you need the filename. In this example I just picked the first one from my array;
    // in your actual code you would use loops. The info is in argsArray: the first arg is
    // the filename, followed by the width and height, but the filename will suffice here
    const filename = operators.argsArray[rawImgOperator[0]][0];
    // now we get the object itself from page.objs using the filename
    page.objs.get(filename, (arg) => {
        // here is where we need the canvas; the object contains information such as width and height
        const canvas = createCanvas(arg.width, arg.height);
        const ctx = canvas.getContext('2d');
        // a new clamped array is needed because the original one may not contain RGBA data,
        // and the canvas expects RGBA. A simple check of the array size should work:
        // width*height*3 means RGB and must be converted, width*height*4 means RGBA and can
        // be used as it is. In my case it had to be converted, and I think that will be the
        // most common case
        const data = new Uint8ClampedArray(arg.width * arg.height * 4);
        let k = 0;
        let i = 0;
        while (i < arg.data.length) {
            data[k] = arg.data[i]; // r
            data[k + 1] = arg.data[i + 1]; // g
            data[k + 2] = arg.data[i + 2]; // b
            data[k + 3] = 255; // a
            i += 3;
            k += 4;
        }
        // create the image data and draw it on the canvas
        const imgData = ctx.createImageData(arg.width, arg.height);
        imgData.data.set(data);
        ctx.putImageData(imgData, 0, 0);
        // get a buffer
        const buff = canvas.toBuffer();
        // write the file. This worked like a charm, but the buffer encodes a PNG image,
        // which can be rather large; compressing it with an image conversion utility
        // may give better results
        fs.writeFileSync('test', buff);
    });
});
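The comments above mention two things that real code would need: looping over every image on the page instead of taking only the first one, and checking whether the pixel data is RGB or RGBA before converting it. The sketch below illustrates both ideas in isolation. `findImageNames`, `toRgba` and the sample data are hypothetical names and values made up for this example, and the size check is the simple length comparison suggested above, which does not cover other layouts such as grayscale.

```javascript
// Collect the object names of all images painted on a page.
// `operatorList` is assumed to have the shape returned by page.getOperatorList();
// `imageOp` is the operator code to look for, e.g. pdfjsLib.OPS.paintImageXObject.
function findImageNames(operatorList, imageOp) {
    const names = [];
    operatorList.fnArray.forEach((fn, index) => {
        if (fn === imageOp) {
            // The first argument of the operator is the image's object name
            names.push(operatorList.argsArray[index][0]);
        }
    });
    return names;
}

// Convert raw pixel data to RGBA, guessing the layout from the array length.
function toRgba(source, width, height) {
    const pixels = width * height;
    if (source.length === pixels * 4) {
        return new Uint8ClampedArray(source); // already RGBA, copy as is
    }
    if (source.length !== pixels * 3) {
        throw new Error(
            'Unsupported pixel layout: ' + source.length + ' bytes for ' + pixels + ' pixels'
        );
    }
    const rgba = new Uint8ClampedArray(pixels * 4);
    for (let i = 0, k = 0; i < source.length; i += 3, k += 4) {
        rgba[k] = source[i]; // r
        rgba[k + 1] = source[i + 1]; // g
        rgba[k + 2] = source[i + 2]; // b
        rgba[k + 3] = 255; // a, fully opaque
    }
    return rgba;
}

// Made-up data to try the helpers without a PDF; 85 is only a stand-in operator code
const fakeOperators = {
    fnArray: [10, 85, 11, 85],
    argsArray: [[], ['img_a', 2, 1], [], ['img_b', 4, 4]],
};
console.log(findImageNames(fakeOperators, 85));
console.log(toRgba([255, 0, 0, 0, 0, 255], 2, 1));
```

With helpers like these, each name returned by `findImageNames` could be passed to `page.objs.get` in a loop, and the resulting data run through `toRgba` before being put on the canvas. If the same image is painted several times on a page, its name appears more than once, so deduplicating with a `Set` may be worthwhile.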
Summary
This article introduced how to obtain text and pictures from a PDF in JavaScript. Converting a PDF to Word follows the same general idea: extract the text and pictures, then put them into a Word document. The code here mainly uses the pdfjs library and refers to the issue /mozilla//issues/13541