The picture sums up the motivation behind this article. Imagine that we have The Invincible Iron Man comic available as a PDF and we want to identify the pages that show Iron Man in action. Can we automate this work? Yes, through image processing. Is PDF a suitable format for that? No, images are the best input for image processing. Can we convert a PDF to a sequence of images? Yes, we can, and that is the purpose of this article.

Installation Steps

To accomplish this task we are going to use a few utilities and libraries, and we are going to do the conversion the Pythonic way.

Python 2.7 or 3.3+ is the primary requirement. Refer to Installation-1 to install Python properly.

Poppler is a PDF rendering library based on the xpdf-3.0 code base. It forms the core of utilities such as pdf2image, pdftotext, and pdftohtml, which deal with PDFs. Refer to Installation-2 for installing Poppler.

pdf2image is the Python library that calls pdftoppm to convert a PDF to a sequence of PIL image objects. pdftoppm is a command-line tool that uses Poppler to execute the conversion. The following pip command installs the library:

    pip install pdf2image

For a given PDF, pdf2image returns a list of PIL image objects whose concrete type depends on the chosen format.

These image objects can be converted to png or jpg files using the Pillow library. To install this library in Python, issue the command:

    pip install Pillow

Implementation

    import time
    from pdf2image import convert_from_path

    # PDF_PATH, DPI, OUTPUT_FOLDER, FIRST_PAGE, LAST_PAGE, FORMAT, THREAD_COUNT,
    # USERPWD, USE_CROPBOX and STRICT are constants that must be defined beforehand.

    # This step reads a PDF and converts it into a sequence of images.
    # dpi: adjusts the resolution of the images
    # output_folder: path to the folder in which the PIL images can be stored (optional)
    # first_page: the first page to be processed by pdftoppm
    # last_page: the last page to be processed by pdftoppm
    # fmt: the output format of the pdftoppm conversion (PpmImageFile, TIFF)
    # thread_count: how many threads will be used for the conversion
    # userpw: the password used to unlock the PDF being converted
    # use_cropbox: use the crop box instead of the media box when converting
    # strict: catch pdftoppm syntax errors with the custom type PDFSyntaxError
    start_time = time.time()
    pil_images = convert_from_path(PDF_PATH, dpi=DPI, output_folder=OUTPUT_FOLDER, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print("Time taken : " + str(time.time() - start_time))

    # This step saves the images in PIL Image format to the required image format.
    for index, image in enumerate(pil_images, start=1):
        image.save("page_" + str(index) + ".jpg")

pdftoppm performs much better than its alternative, ImageMagick, in terms of quality. The time taken by the conversion depends on the chosen format. Note: the time taken is measured in seconds. The timing table clearly shows that choosing the jpg format is faster than the other two formats. In addition to picking the right format, the number of threads can be increased to parallelize and speed up the conversion.

PDF to image conversion has a role in several applications.
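The article reports that conversion time depends on the output format and on the number of threads. One way to check this on your own documents is a small timing helper like the sketch below. The names `time_formats` and `run_with_pdf2image`, the path "sample.pdf", and the values `dpi=200` and `thread_count=4` are illustrative assumptions, not part of the original article. The helper accepts any conversion function, so it can be tried without Poppler installed.

```python
import time
from typing import Callable, Dict, Iterable, List


def time_formats(convert: Callable[..., List], pdf_path: str,
                 formats: Iterable[str], **kwargs) -> Dict[str, float]:
    """Return the wall-clock seconds convert() takes for each output format."""
    timings = {}
    for fmt in formats:
        start = time.perf_counter()
        # Extra keyword arguments (dpi, thread_count, ...) are passed through unchanged.
        convert(pdf_path, fmt=fmt, **kwargs)
        timings[fmt] = time.perf_counter() - start
    return timings


def run_with_pdf2image(pdf_path: str = "sample.pdf") -> None:
    """Hypothetical driver: time pdf2image for three formats on one PDF."""
    # Imported lazily so time_formats() stays usable without pdf2image installed.
    from pdf2image import convert_from_path

    results = time_formats(convert_from_path, pdf_path,
                           ["ppm", "png", "jpeg"], dpi=200, thread_count=4)
    # Print the fastest format first.
    for fmt, seconds in sorted(results.items(), key=lambda item: item[1]):
        print(f"{fmt}: {seconds:.2f} s")
```

Calling `run_with_pdf2image("your_file.pdf")` with different `thread_count` values should show whether extra threads help for a particular document; results will vary with page count, resolution, and hardware.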