{"id":65490,"date":"2025-02-07T16:07:04","date_gmt":"2025-02-07T16:07:04","guid":{"rendered":"https:\/\/proxidize.com\/?post_type=blog&#038;p=65490"},"modified":"2025-10-23T11:48:25","modified_gmt":"2025-10-23T10:48:25","slug":"how-to-scrape-pdf-in-python","status":"publish","type":"blog","link":"https:\/\/proxidize.com\/blog\/how-to-scrape-pdf-in-python\/","title":{"rendered":"3 Ways to Scrape PDF in Python"},"content":{"rendered":"\n<p>There are three main ways to scrape PDF files. You can write a script that scrapes a PDF from a URL, one that scrapes directly from a file path, or a multifunctional scraper that handles whatever document you feed it through your terminal.<\/p>\n\n\n\n<p>This article breaks down the three ways to scrape PDF in Python, giving you a step-by-step guide to writing the code for all three methods and covering the challenges that can arise when you scrape PDF files.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized centered\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1010\" height=\"569\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/challenges-of-scraping-pdf.png\" alt=\"Image of four people untangling a knot. 
Text above reads &quot;Challenges of Scraping PDF&quot;\" class=\"wp-image-65418\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/challenges-of-scraping-pdf.png 1010w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/challenges-of-scraping-pdf-300x169.png 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/challenges-of-scraping-pdf-768x433.png 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/challenges-of-scraping-pdf-600x338.png 600w\" sizes=\"(max-width: 1010px) 100vw, 1010px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Challenges of Scraping PDF<\/h2>\n\n\n\n<p>PDF files hold unstructured data whose formatting varies in font size, style, and color. PDFs are designed to preserve the visual appearance of a document, including its fonts, layouts, and graphic elements, rather than to follow a standardized data structure. This makes it difficult to <a href=\"https:\/\/proxidize.com\/blog\/web-scraping\/\">extract data<\/a> accurately because the text is not consistently formatted. Occasionally, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Optical_character_recognition\" target=\"_blank\" rel=\"noopener\">optical character recognition<\/a> (OCR) is used to turn scanned documents into machine-readable text, but it is limited by issues such as image quality, language, and formatting errors. PDFs can also have different layouts with mixed content types, adding a layer of difficulty when <a href=\"https:\/\/proxidize.com\/blog\/parsing-html-python-pyquery\/\">parsing and extracting information<\/a>.<\/p>\n\n\n\n<p>Scraping PDFs also raises practical challenges: maintaining a variety of formats, handling anti-scraping traps, and structuring and formatting the extracted data. 
Many PDF documents are scans, so scrapers cannot read them without OCR. Some automated PDF scrapers combine OCR, <a href=\"https:\/\/www.uipath.com\/rpa\/robotic-process-automation\" target=\"_blank\" rel=\"noopener\">RPA<\/a>, pattern and text recognition, and other techniques to scrape PDF.<\/p>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Error Handling, Performance, and Security<\/h3>\n\n\n\n<p>Before you begin to scrape PDF, set up <a href=\"https:\/\/proxidize.com\/blog\/python-requests-retry\/\">error handling<\/a> to ensure reliable data extraction, especially when dealing with complex layouts and unstructured formats. PDFs might contain non-linear text flow, missing elements, or document images that need OCR for processing. Implementing rule-based data extraction methods such as regular expressions for structured text or fallback mechanisms for missing values can enhance accuracy. Logging errors and handling exceptions such as file corruption, encryption, or PDF parser failures keep the extraction process robust and prevent script crashes.<\/p>\n\n\n\n<p>When scraping large volumes of PDFs, there are performance considerations to keep in mind. Using <a href=\"https:\/\/aws.amazon.com\/what-is\/batch-processing\/#:~:text=Batch%20processing%20is%20the%20method,run%20on%20individual%20data%20transactions.\" target=\"_blank\" rel=\"noopener\">batch processing<\/a> and optimizing resource usage can improve speed. Loading only the necessary pages instead of the entire document and using techniques such as multi-threading can reduce execution time. 
For large-scale applications, using AI-powered automated extraction solutions or APIs like <a href=\"https:\/\/help.openai.com\/en\/articles\/8555496-gpt-4-vision-api\" target=\"_blank\" rel=\"noopener\">GPT-4 Vision API<\/a> can enhance efficiency and accuracy.&nbsp;<\/p>\n\n\n\n<p>There are security implications that come with handling sensitive information such as medical records, insurance forms, or business documents. Proper document processes should include security features such as encrypted storage, controlled access, and redaction of sensitive data. You should be mindful of extracting data from email attachments as malicious PDFs can pose security risks. Implementing automated data extraction software with built-in validation checks can ensure that extracted data is accurate.&nbsp;<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" 
class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized centered\"><img decoding=\"async\" width=\"1010\" height=\"569\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/setting-up-script-to-scrape-pdf.png\" alt=\"Image of a large computer screen with a man standing in front of it holding a tablet. Text above reads &quot;Setting Up Script to Scrape PDF&quot;\" class=\"wp-image-65419\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/setting-up-script-to-scrape-pdf.png 1010w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/setting-up-script-to-scrape-pdf-300x169.png 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/setting-up-script-to-scrape-pdf-768x433.png 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/setting-up-script-to-scrape-pdf-600x338.png 600w\" sizes=\"(max-width: 1010px) 100vw, 1010px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Setting Up Script to Scrape PDF<\/h2>\n\n\n\n<p>There are six libraries that can help scrape PDF with each library specializing in a specific form of PDF scraping. For normal <a href=\"https:\/\/proxidize.com\/blog\/web-scraping-with-beautiful-soup\/\">text scraping<\/a>, you would need PyMuPDF, pdfplumber, or pdfminer.six. To scrape PDF tables, Camelot or pdfplumber will be a good option. For image-based PDF scraping, pdf2image combined with pytesseract will do the trick. We will present you with a script for how to scrape PDF text, tables, and images along with scripts on how to scrape PDF through a URL, directly from a file on your device, or through a scraper that you can launch in your terminal.&nbsp;<\/p>\n\n\n\n<p>For starters, open your IDE and set it to Python. Once that is complete, install the necessary libraries. Enter the following command in your terminal. 
This will install all the libraries we will be using for this project.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>pip install requests PyMuPDF pdfplumber pdfminer.six Camelot-py[cv] pdf2image pytesseract<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">pip<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">install<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">requests<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">PyMuPDF<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">pdfplumber<\/span><span style=\"color: #E0DEF4\"> 
<\/span><span style=\"color: #F6C177\">pdfminer.six<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">Camelot-py[cv]<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">pdf2image<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">pytesseract<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Installation may take a few minutes because the command pulls in several libraries, so be patient while it completes. Requests is needed to download a PDF from a URL; it is not needed if you scrape a PDF from a local file.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized centered\"><img loading=\"lazy\" decoding=\"async\" width=\"1010\" height=\"569\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/how-to-scrape-pdf.png\" alt=\"Image of a woman staring at three exits with a large question mark in front of her. Three boxes to the side read &quot;Scrape PDF from URL&quot;, &quot;Scrape PDF through Terminal&quot;, and &quot;Scrape PDF from File&quot;. 
Text above reads &quot;How to Scrape PDF&quot;\" class=\"wp-image-65421\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/how-to-scrape-pdf.png 1010w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/how-to-scrape-pdf-300x169.png 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/how-to-scrape-pdf-768x433.png 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/02\/how-to-scrape-pdf-600x338.png 600w\" sizes=\"(max-width: 1010px) 100vw, 1010px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">How to Scrape PDF<\/h2>\n\n\n\n<p>In this section, we show how to scrape PDF from a URL, from a file, and through the terminal, covering text, images, and tables. For these examples, we will use <a href=\"https:\/\/api.slingacademy.com\/v1\/sample-data\/files\/text-and-images.pdf\" target=\"_blank\" rel=\"noopener\">this link<\/a> for text and image scraping and <a href=\"https:\/\/api.slingacademy.com\/v1\/sample-data\/files\/text-and-table.pdf\" target=\"_blank\" rel=\"noopener\">this link<\/a> for table scraping. Be sure to download both PDF documents if you wish to follow this tutorial completely.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scrape PDF from URL<\/h3>\n\n\n\n<p>The script below scrapes PDF content by extracting images and text from the first link. When run, it prints the text to the terminal and saves the images to the current working directory, typically the folder that contains the script. 
To reuse this script, change the value of the url variable to the URL of the PDF you want to scrape text and images from.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import requests\nimport fitz  # PyMuPDF\nimport io\nfrom PIL import Image\n# PDF URL\nurl = &quot;https:\/\/api.slingacademy.com\/v1\/sample-data\/files\/text-and-images.pdf&quot;\n# Download the PDF\nresponse = requests.get(url)\npdf_path = &quot;downloaded.pdf&quot;\nwith open(pdf_path, &quot;wb&quot;) as f:\n    f.write(response.content)\n# Open the PDF\ndoc = fitz.open(pdf_path)\n# Extract text and images\nfor page_num, page in enumerate(doc, start=1):\n    text = page.get_text()\n    print(f&quot;\\n--- Page {page_num} Text ---\\n&quot;)\n    print(text)\n    # Extract images\n    image_list = page.get_images(full=True)\n    for img_index, img in enumerate(image_list, start=1):\n        xref = img[0]\n        base_image = doc.extract_image(xref)\n        image_bytes = base_image[&quot;image&quot;]\n        # Convert image bytes to PIL image\n        img = Image.open(io.BytesIO(image_bytes))\n        # Save the image\n        img_filename = f&quot;page_{page_num}_image_{img_index}.png&quot;\n        img.save(img_filename)\n        print(f&quot;Saved image: {img_filename}&quot;)\nprint(&quot;\\nExtraction complete!&quot;)<\/textarea><svg 
xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> requests<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> fitz  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PDF URL<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">url <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;https:\/\/api.slingacademy.com\/v1\/sample-data\/files\/text-and-images.pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Download the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">response <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> requests<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;downloaded.pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">with<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;wb&quot;<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">as<\/span><span style=\"color: #E0DEF4\"> f<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    f<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">write<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">content<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Open the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text and images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> page <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">doc<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #3E8FB0\">}<\/span><span 
style=\"color: #F6C177\"> Text ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        xref <\/span><span 
style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        base_image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_bytes <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Convert image bytes to PIL image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        img <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save the image<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #E0DEF4\">        img_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;page_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.png&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        img<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">img_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">Extraction complete!&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Below is the script to scrape PDF tables from the second link.&nbsp;<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" 
data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import requests\nimport camelot\n# PDF URL\nurl = &quot;https:\/\/api.slingacademy.com\/v1\/sample-data\/files\/text-and-table.pdf&quot;\n# Download the PDF\npdf_path = &quot;downloaded_table.pdf&quot;\nresponse = requests.get(url)\nwith open(pdf_path, &quot;wb&quot;) as f:\n    f.write(response.content)\n# Extract tables from the PDF\ntables = camelot.read_pdf(pdf_path, pages=&quot;all&quot;)\n# Print the number of tables found\nprint(f&quot;Total tables extracted: {len(tables)}&quot;)\n# Save each table as a CSV and print the extracted data\nfor i, table in enumerate(tables, start=1):\n    csv_filename = f&quot;table_{i}.csv&quot;\n    table.to_csv(csv_filename)\n    print(f&quot;\\n--- Table {i} ---&quot;)\n    print(table.df)  # Display the table as a Pandas DataFrame\n    print(f&quot;Saved table to {csv_filename}&quot;)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 
5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> requests<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PDF URL<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">url <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;https:\/\/api.slingacademy.com\/v1\/sample-data\/files\/text-and-table.pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Download the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;downloaded_table.pdf&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">response <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> requests<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">with<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span 
style=\"color: #F6C177\">&quot;wb&quot;<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">as<\/span><span style=\"color: #E0DEF4\"> f<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    f<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">write<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">content<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract tables from the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Print the number of tables found<\/span><\/span>\n<span class=\"line\"><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Total tables extracted: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #EB6F92; font-style: 
italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save each table as a CSV and print the extracted data<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    
<\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Display the table as a Pandas DataFrame<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>This script prints each extracted table in your terminal and saves it as a CSV file. 
If you wish to scrape PDF documents that include text, images, and tables directly from a link, use the following script:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import requests\nimport fitz  # PyMuPDF\nimport camelot\nimport io\nfrom PIL import Image\nimport os\n# URL of the PDF\nurl = &quot;Insert-Your-URL-Here.pdf&quot;\n# Download the PDF\nresponse = requests.get(url)\npdf_path = &quot;sample_with_table.pdf&quot;\nwith open(pdf_path, &quot;wb&quot;) as f:\n    f.write(response.content)\n# Open the PDF with PyMuPDF\ndoc = fitz.open(pdf_path)\n# Directory to save extracted images\nimage_dir = &quot;extracted_images&quot;\nos.makedirs(image_dir, exist_ok=True)\n# Extract text and images\nfor page_num in range(len(doc)):\n    page = doc.load_page(page_num)\n    \n    # Extract text\n    text = page.get_text()\n    print(f&quot;\\n--- Text from Page {page_num + 1} ---\\n&quot;)\n    print(text)\n    \n    # Extract images\n    image_list = page.get_images(full=True)\n    for img_index, img in enumerate(image_list, start=1):\n        xref = img[0]\n        base_image = doc.extract_image(xref)\n        image_bytes = base_image[&quot;image&quot;]\n        image_ext = base_image[&quot;ext&quot;]\n        image = Image.open(io.BytesIO(image_bytes))\n        \n        # Save image\n        image_filename = f&quot;{image_dir}\/page_{page_num + 1}_image_{img_index}.{image_ext}&quot;\n        image.save(image_filename)\n        print(f&quot;Saved image: {image_filename}&quot;)\n# Extract tables using Camelot\ntables = camelot.read_pdf(pdf_path, pages=&quot;all&quot;)\n# Directory to save extracted tables\ntable_dir = &quot;extracted_tables&quot;\nos.makedirs(table_dir, exist_ok=True)\n# Save each table as a CSV file\nfor i, table in enumerate(tables, start=1):\n    csv_filename = f&quot;{table_dir}\/table_{i}.csv&quot;\n    table.to_csv(csv_filename)\n    print(f&quot;Saved table {i} to {csv_filename}&quot;)\n    print(f&quot;\\n--- Table {i} ---\\n&quot;)\n    print(table.df)  # Display the table as a DataFrame<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> requests<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> fitz  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> URL of the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">url <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;Insert-Your-URL-Here.pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Download the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">response <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> requests<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;sample_with_table.pdf&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">with<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: 
#908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;wb&quot;<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">as<\/span><span style=\"color: #E0DEF4\"> f<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    f<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">write<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">content<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Open the PDF with PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">image_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_images&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #908CAA\">,<\/span><span 
style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text and images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">range<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">doc<\/span><span style=\"color: #908CAA\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">load_page<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: 
italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Text from Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: #3E8FB0\">in<\/span><span 
style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        xref <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        base_image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_bytes <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_ext <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;ext&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: 
#E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/page_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_ext<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract tables using Camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted tables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">table_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_tables&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: 
#E0DEF4\">table_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save each table as a CSV file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: 
#908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Display the table as a 
DataFrame<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>This script prints the text and tables in your terminal and saves the images and tables to the extracted_images and extracted_tables folders. Remember to replace the url value with the URL of your choice.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scrape PDF from File<\/h3>\n\n\n\n<p>If your PDF file is stored on your device rather than at a URL, the script changes slightly to accommodate the new source. The script below scrapes a PDF directly from your device when given the path to the document.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import fitz  # PyMuPDF\nimport camelot\nimport io\nfrom PIL import Image\nimport os\n# Path to the local PDF file\npdf_path = r&quot;C:\\Users\\Name\\Documents\\text-and-images.pdf&quot;\n# Open the PDF with PyMuPDF\ndoc = fitz.open(pdf_path)\n# Directory to save extracted images\nimage_dir = &quot;extracted_images&quot;\nos.makedirs(image_dir, exist_ok=True)\n# Extract text and images\nfor page_num in range(len(doc)):\n    page = doc.load_page(page_num)\n    \n    # Extract text\n    text = page.get_text()\n    print(f&quot;\\n--- Text from Page {page_num + 1} ---\\n&quot;)\n    print(text)\n    \n    # Extract images\n    image_list = page.get_images(full=True)\n    for img_index, img in enumerate(image_list, start=1):\n        xref = img[0]\n        base_image = doc.extract_image(xref)\n        image_bytes = base_image[&quot;image&quot;]\n        image_ext = base_image[&quot;ext&quot;]\n        image = Image.open(io.BytesIO(image_bytes))\n        \n        # Save image\n        image_filename = f&quot;{image_dir}\/page_{page_num + 1}_image_{img_index}.{image_ext}&quot;\n        image.save(image_filename)\n        print(f&quot;Saved image: {image_filename}&quot;)\n# Extract tables using Camelot\ntables = camelot.read_pdf(pdf_path, pages=&quot;all&quot;)\n# Directory to save extracted tables\ntable_dir = &quot;extracted_tables&quot;\nos.makedirs(table_dir, exist_ok=True)\n# Save each table as a CSV file\nfor i, table in enumerate(tables, start=1):\n    csv_filename = f&quot;{table_dir}\/table_{i}.csv&quot;\n    table.to_csv(csv_filename)\n    print(f&quot;Saved table {i} to {csv_filename}&quot;)\n    print(f&quot;\\n--- Table {i} ---\\n&quot;)\n    print(table.df)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> fitz  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> 
camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Path to the local PDF file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">r<\/span><span style=\"color: #F6C177\">&quot;C:<\/span><span style=\"color: #3E8FB0\">\\U<\/span><span style=\"color: #F6C177\">sers<\/span><span style=\"color: #3E8FB0\">\\N<\/span><span style=\"color: #F6C177\">ame<\/span><span style=\"color: #9CCFD8\">\\D<\/span><span style=\"color: #F6C177\">ocuments<\/span><span style=\"color: #3E8FB0\">\\t<\/span><span style=\"color: #F6C177\">ext-and-images<\/span><span style=\"color: #9CCFD8\">.<\/span><span style=\"color: #F6C177\">pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Open the PDF with PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: 
#908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">image_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_images&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text and images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">range<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">doc<\/span><span style=\"color: #908CAA\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">load_page<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: 
#E0DEF4\">page_num<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Text from Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract images<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #E0DEF4\">    image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        xref <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        base_image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_bytes <\/span><span style=\"color: 
#3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_ext <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;ext&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/page_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span 
style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_ext<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract tables using Camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: 
#F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted tables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">table_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_tables&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save each table as a CSV file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">    csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> 
---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>To scrape the document that contains tables, the script stays the same; only <code>pdf_path<\/code> changes to point to that document. It will look something like this:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import fitz  # PyMuPDF\nimport camelot\nimport io\nfrom PIL import Image\nimport os\n# Path to the local PDF file\npdf_path = r&quot;C:\\Users\\Name\\Documents\\text-and-table.pdf&quot;\n# Open the PDF with PyMuPDF\ndoc = fitz.open(pdf_path)\n# Directory to save extracted images\nimage_dir = &quot;extracted_images&quot;\nos.makedirs(image_dir, exist_ok=True)\n# Extract text and images\nfor page_num in range(len(doc)):\n    page = doc.load_page(page_num)\n    \n    # Extract text\n    text = page.get_text()\n    print(f&quot;\\n--- Text from Page {page_num + 1} ---\\n&quot;)\n    print(text)\n    \n    # 
Extract images\n    image_list = page.get_images(full=True)\n    for img_index, img in enumerate(image_list, start=1):\n        xref = img[0]\n        base_image = doc.extract_image(xref)\n        image_bytes = base_image[&quot;image&quot;]\n        image_ext = base_image[&quot;ext&quot;]\n        image = Image.open(io.BytesIO(image_bytes))\n        \n        # Save image\n        image_filename = f&quot;{image_dir}\/page_{page_num + 1}_image_{img_index}.{image_ext}&quot;\n        image.save(image_filename)\n        print(f&quot;Saved image: {image_filename}&quot;)\n# Extract tables using Camelot\ntables = camelot.read_pdf(pdf_path, pages=&quot;all&quot;)\n# Directory to save extracted tables\ntable_dir = &quot;extracted_tables&quot;\nos.makedirs(table_dir, exist_ok=True)\n# Save each table as a CSV file\nfor i, table in enumerate(tables, start=1):\n    csv_filename = f&quot;{table_dir}\/table_{i}.csv&quot;\n    table.to_csv(csv_filename)\n    print(f&quot;Saved table {i} to {csv_filename}&quot;)\n    print(f&quot;\\n--- Table {i} ---\\n&quot;)\n    print(table.df)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> fitz  <\/span><span style=\"color: #908CAA; font-style: 
italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Path to the local PDF file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">r<\/span><span style=\"color: #F6C177\">&quot;C:<\/span><span style=\"color: #3E8FB0\">\\U<\/span><span style=\"color: #F6C177\">sers<\/span><span style=\"color: #3E8FB0\">\\N<\/span><span style=\"color: #F6C177\">ame<\/span><span style=\"color: #9CCFD8\">\\D<\/span><span style=\"color: #F6C177\">ocuments<\/span><span style=\"color: #3E8FB0\">\\t<\/span><span style=\"color: #F6C177\">ext-and-table<\/span><span style=\"color: #9CCFD8\">.<\/span><span style=\"color: #F6C177\">pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Open the PDF with PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span 
style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">image_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_images&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text and images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">range<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">doc<\/span><span style=\"color: #908CAA\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    page <\/span><span style=\"color: #3E8FB0\">=<\/span><span 
style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">load_page<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Text from Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        xref <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        base_image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span 
style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_bytes <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_ext <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;ext&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/page_<\/span><span style=\"color: 
#3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_ext<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract tables using Camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: 
#908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted tables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">table_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_tables&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save each table as a CSV file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: 
italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span 
style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>If you wish to scrape PDF files that contain text, images, and tables, this is the script you should use. Remember to change the value of <code>pdf_path<\/code> to the path of your PDF document:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import fitz  # PyMuPDF\nimport camelot\nimport io\nfrom PIL import Image\nimport os\n# Set the PDF path here\npdf_path = r&quot;C:\\Users\\Name\\Documents\\sample.pdf&quot;\n# Open the PDF with PyMuPDF\ndoc = fitz.open(pdf_path)\n# Directory to save extracted images\nimage_dir = &quot;extracted_images&quot;\nos.makedirs(image_dir, exist_ok=True)\n# Extract text and images\nfor page_num in range(len(doc)):\n    page 
= doc.load_page(page_num)\n    \n    # Extract text\n    text = page.get_text()\n    print(f&quot;\\n--- Text from Page {page_num + 1} ---\\n&quot;)\n    print(text)\n    \n    # Extract images\n    image_list = page.get_images(full=True)\n    for img_index, img in enumerate(image_list, start=1):\n        xref = img[0]\n        base_image = doc.extract_image(xref)\n        image_bytes = base_image[&quot;image&quot;]\n        image_ext = base_image[&quot;ext&quot;]\n        image = Image.open(io.BytesIO(image_bytes))\n        \n        # Save image\n        image_filename = f&quot;{image_dir}\/page_{page_num + 1}_image_{img_index}.{image_ext}&quot;\n        image.save(image_filename)\n        print(f&quot;Saved image: {image_filename}&quot;)\n# Extract tables using Camelot\ntables = camelot.read_pdf(pdf_path, pages=&quot;all&quot;)\n# Directory to save extracted tables\ntable_dir = &quot;extracted_tables&quot;\nos.makedirs(table_dir, exist_ok=True)\n# Save each table as a CSV file\nfor i, table in enumerate(tables, start=1):\n    csv_filename = f&quot;{table_dir}\/table_{i}.csv&quot;\n    table.to_csv(csv_filename)\n    print(f&quot;Saved table {i} to {csv_filename}&quot;)\n    print(f&quot;\\n--- Table {i} ---\\n&quot;)\n    print(table.df)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: 
#232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> fitz  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Set the PDF path here<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">r<\/span><span style=\"color: #F6C177\">&quot;C:<\/span><span style=\"color: #3E8FB0\">\\U<\/span><span style=\"color: #F6C177\">sers<\/span><span style=\"color: #3E8FB0\">\\N<\/span><span style=\"color: #F6C177\">ame<\/span><span style=\"color: #9CCFD8\">\\D<\/span><span style=\"color: #F6C177\">ocuments<\/span><span style=\"color: #9CCFD8\">\\s<\/span><span style=\"color: #F6C177\">ample<\/span><span style=\"color: #9CCFD8\">.<\/span><span style=\"color: #F6C177\">pdf&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Open the PDF with 
PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">image_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_images&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text and images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">range<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: 
#E0DEF4\">doc<\/span><span style=\"color: #908CAA\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">load_page<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Text from Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: 
#E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        xref <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        base_image <\/span><span style=\"color: 
#3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_bytes <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_ext <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;ext&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: 
#F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/page_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_ext<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract tables using Camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> 
camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Directory to save extracted tables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">table_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_tables&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save each table as a CSV file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span 
style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; 
font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Scrape PDF through Terminal<\/h3>\n\n\n\n<p>Finally, we will build a scraper that you run from your terminal by passing it the path to the PDF file you wish to extract. While you can use any of the scripts provided above and alter the path or URL each time, a command-line scraper saves you from editing the code for every document. Once it is written, all you need to do is open your terminal, run <code>python<\/code> followed by the path to the <code>.py<\/code> file, and then add the path to the PDF document.
This version extracts text and images only; it does not extract tables. The script looks like this:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import fitz  # PyMuPDF\nimport io\nfrom PIL import Image\nimport os\nimport sys\ndef extract_text_and_images(pdf_path):\n    # Open the PDF\n    doc = fitz.open(pdf_path)\n    # Create a directory to save extracted images\n    image_dir = &quot;extracted_images&quot;\n    os.makedirs(image_dir, exist_ok=True)\n    # Extract text and images\n    for page_num in range(len(doc)):\n        page = doc.load_page(page_num)\n        # Extract text\n        text = page.get_text()\n        print(f&quot;\\n--- Text from Page {page_num + 1} ---\\n&quot;)\n        print(text)\n        # Extract images\n        image_list = page.get_images(full=True)\n        for img_index, img in enumerate(image_list, start=1):\n            xref = img[0]\n            base_image = doc.extract_image(xref)\n            image_bytes = base_image[&quot;image&quot;]\n            image_ext = base_image[&quot;ext&quot;]\n            image = Image.open(io.BytesIO(image_bytes))\n            # Save image\n            image_filename = f&quot;{image_dir}\/page_{page_num + 1}_image_{img_index}.{image_ext}&quot;\n            image.save(image_filename)\n            print(f&quot;Saved image: {image_filename}&quot;)\nif __name__ == &quot;__main__&quot;:\n    if len(sys.argv) != 2:\n        
print(&quot;Usage: python script.py &lt;pdf_path&gt;&quot;)\n        sys.exit(1)\n    pdf_path = sys.argv[1]\n    extract_text_and_images(pdf_path)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> fitz  <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> PyMuPDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> sys<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">extract_text_and_images<\/span><span style=\"color: 
#908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">pdf_path<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Open the PDF<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Create a directory to save extracted images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_images&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text and images<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">range<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">doc<\/span><span style=\"color: #908CAA\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">load_page<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract text<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Text from Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: 
#EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract images<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #E0DEF4\">            xref <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            base_image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image_bytes <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image_ext <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;ext&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            <\/span><span style=\"color: #908CAA; 
font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/page_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_ext<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span 
class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">__name__<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">==<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;__main__&quot;<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">argv<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">!=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">2<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&quot;Usage: python script.py &lt;pdf_path&gt;&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">exit<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">argv<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: 
]<\/span><\/span>\n<span">
#908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    extract_text_and_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Run the script from your terminal, replacing the example paths with the paths to your own <code>.py<\/code> script and PDF file:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>C:\\Users\\User\\File\\FileName\\.venv\\Scripts\\python.exe &quot;C:\\Users\\User\\File\\FileName\\.venv\\FileName.py&quot; &quot;C:\\Users\\User\\Documents\\text-and-images.pdf&quot;<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span 
class=\"line\"><span style=\"color: #EA9A97\">C:\\Users\\User\\File\\FileName\\.venv\\Scripts\\python.exe<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;C:\\Users\\User\\File\\FileName\\.venv\\FileName.py&quot;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;C:\\Users\\User\\Documents\\text-and-images.pdf&quot;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>If you wish to extract tables from a PDF document, here is the scraper script:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import camelot\nimport sys\nimport os\ndef extract_tables(pdf_path):\n    # Extract tables using Camelot\n    tables = camelot.read_pdf(pdf_path, pages=&#8221;all&#8221;)\n    # Create directory to save extracted tables\n    table_dir = &#8220;extracted_tables&#8221;\n    os.makedirs(table_dir, exist_ok=True)\n    # Save each table as a CSV file\n    for i, table in enumerate(tables, start=1):\n        csv_filename = f&#8221;{table_dir}\/table_{i}.csv&#8221;\n        table.to_csv(csv_filename)\n        print(f&#8221;Saved table {i} to {csv_filename}&#8221;)\n        print(f&#8221;\\n&#8212; Table {i} &#8212;\\n&#8221;)\n        print(table.df)\nif __name__ == &#8220;__main__&#8221;:\n    if len(sys.argv) != 2:\n        print(&#8220;Usage: python script.py &lt;pdf_path>&#8221;)\n        sys.exit(1)\n    
pdf_path = sys.argv[1]\n    extract_tables(pdf_path)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> sys<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">extract_tables<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">pdf_path<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Extract tables using Camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: 
#908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Create directory to save extracted tables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_tables&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Save each table as a CSV file<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: 
italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">     
   <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">__name__<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">==<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;__main__&quot;<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">argv<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">!=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">2<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&quot;Usage: python script.py &lt;pdf_path&gt;&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">exit<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">argv<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    extract_tables<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Here is the terminal command:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>C:\\Users\\User\\File\\FileName\\.venv\\Scripts\\python.exe 
&quot;C:\\Users\\User\\File\\FileName\\.venv\\FileName.py&quot; &quot;C:\\Users\\User\\Documents\\text-and-tables.pdf&quot;<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">C:\\Users\\User\\File\\FileName\\.venv\\Scripts\\python.exe<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;C:\\Users\\User\\File\\FileName\\.venv\\FileName.py&quot;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;C:\\Users\\User\\Documents\\text-and-tables.pdf&quot;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>To scrape PDF files that include text, images, and tables, combine both functions in a single script:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea 
class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>import camelot\nimport io\nfrom PIL import Image\nimport os\nimport sys\ndef extract_text_and_images(pdf_path):\n    doc = fitz.open(pdf_path)\n    image_dir = &#8220;extracted_images&#8221;\n    os.makedirs(image_dir, exist_ok=True)\n    for page_num in range(len(doc)):\n        page = doc.load_page(page_num)\n        text = page.get_text()\n        print(f&#8221;\\n&#8212; Text from Page {page_num + 1} &#8212;\\n&#8221;)\n        print(text)\n        image_list = page.get_images(full=True)\n        for img_index, img in enumerate(image_list, start=1):\n            xref = img[0]\n            base_image = doc.extract_image(xref)\n            image_bytes = base_image[&#8220;image&#8221;]\n            image_ext = base_image[&#8220;ext&#8221;]\n            image = Image.open(io.BytesIO(image_bytes))\n            image_filename = f&#8221;{image_dir}\/page_{page_num + 1}_image_{img_index}.{image_ext}&#8221;\n            image.save(image_filename)\n            print(f&#8221;Saved image: {image_filename}&#8221;)\ndef extract_tables(pdf_path):\n    tables = camelot.read_pdf(pdf_path, pages=&#8221;all&#8221;)\n    table_dir = &#8220;extracted_tables&#8221;\n    os.makedirs(table_dir, exist_ok=True)\n    for i, table in enumerate(tables, start=1):\n        csv_filename = f&#8221;{table_dir}\/table_{i}.csv&#8221;\n        table.to_csv(csv_filename)\n        print(f&#8221;Saved table {i} to {csv_filename}&#8221;)\n        print(f&#8221;\\n&#8212; Table {i} &#8212;\\n&#8221;)\n        print(table.df)\nif __name__ == &#8220;__main__&#8221;:\n    if len(sys.argv) != 2:\n        print(&#8220;Usage: python script.py &lt;pdf_path>&#8221;)\n        sys.exit(1)\n    pdf_path = sys.argv[1]\n    extract_text_and_images(pdf_path)\n    extract_tables(pdf_path)<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" 
stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> io<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">PIL<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> Image<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> os<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> sys<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">extract_text_and_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">pdf_path<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    doc <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> fitz<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: 
#E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_images&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> page_num <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">range<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">doc<\/span><span style=\"color: #908CAA\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">load_page<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">page_num<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        text <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> 
page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_text<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Text from Page <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">text<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        image_list <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> page<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">full<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> img_index<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> img <\/span><span style=\"color: 
#3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_list<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            xref <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> img<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            base_image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> doc<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">extract_image<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">xref<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image_bytes <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;image&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image_ext <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> base_image<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #F6C177\">&quot;ext&quot;<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> Image<\/span><span style=\"color: 
#908CAA\">.<\/span><span style=\"color: #E0DEF4\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">io<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">BytesIO<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_bytes<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/page_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">page_num <\/span><span style=\"color: #3E8FB0\">+<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">_image_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">img_index<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_ext<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            image<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">save<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span 
style=\"color: #F6C177\">&quot;Saved image: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">image_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">extract_tables<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">pdf_path<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    tables <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> camelot<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">read_pdf<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">pages<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&quot;all&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    table_dir <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;extracted_tables&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    os<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">makedirs<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">exist_ok<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">True<\/span><span style=\"color: 
#908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> i<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> table <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">enumerate<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">tables<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">start<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        csv_filename <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">table_dir<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/table_<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.csv&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Saved table <\/span><span style=\"color: 
#3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> to <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">csv_filename<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">--- Table <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">i<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\"> ---<\/span><span style=\"color: #3E8FB0\">\\n<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">table<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">df<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">__name__<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">==<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;__main__&quot;<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">len<\/span><span 
style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">argv<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">!=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">2<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #EB6F92; font-style: italic\">print<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&quot;Usage: python script.py &lt;pdf_path&gt;&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">exit<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    pdf_path <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> sys<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">argv<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #EA9A97\">1<\/span><span style=\"color: #908CAA\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    extract_text_and_images<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    extract_tables<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">pdf_path<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>The terminal command would be this:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div 
class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><textarea class=\"code-block-pro-copy-button-textarea\" aria-hidden=\"true\" readonly>python &quot;C:\\Users\\User\\File\\FileName\\.venv\\Example.py&quot; &quot;C:\\Users\\User\\Documents\\Example.pdf&quot;<\/textarea><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">python<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;C:\\Users\\User\\File\\FileName\\.venv\\Example.py&quot;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;C:\\Users\\User\\Documents\\Example.pdf&quot;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 
class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Scraping PDFs in Python is a useful skill for turning unstructured documents into usable data. With libraries like PyMuPDF, pdfplumber, and Camelot, you can handle the text, images, and tables inside a PDF.<\/p>\n\n\n\n<p>Key Takeaways:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>There are several ways to scrape a PDF: from a URL, from a local file path, or with a terminal-based script that accepts the path as an argument.<\/li>\n\n\n\n<li>Choosing the right library matters: PyMuPDF and pdfplumber are effective for text extraction, while Camelot excels at table extraction.<\/li>\n\n\n\n<li>PDFs often lack standardized formatting, so data extraction requires specialized handling.<\/li>\n\n\n\n<li>Proper environment setup, including installing the necessary libraries and understanding their dependencies, is essential to scrape PDFs successfully.<\/li>\n\n\n\n<li>PDF scraping can automate data extraction, reducing manual effort in tasks like data analysis and reporting.<\/li>\n<\/ul>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>PDF structures are diverse, but with the appropriate tools and techniques you can extract their data effectively and integrate it into your workflows.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions<\/h2>\n\n\n<div id=\"rank-math-faq\" class=\"rank-math-block\">\n<div class=\"rank-math-list \">\n<div id=\"faq-question-1738943149907\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">What is the best method for extracting text from PDFs?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>The best method depends on the document format 
and structure of the PDF. If the text is machine-readable, libraries like PyMuPDF, pdfplumber, and pdfminer.six can extract it efficiently. Scanned documents, however, need an OCR tool such as pytesseract to convert the page images into machine-readable text.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1738943181457\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">How can I extract tables from a PDF file?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>For PDFs with selectable text, Camelot and pdfplumber are effective table extraction tools. If the tables are part of an image, the page has to go through OCR first, for example with pytesseract. Rule-based extraction can also be used for specific table layouts.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1738943197769\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">Can I automate PDF data extraction?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes. Python scripts can process PDFs in batches and handle large volumes without manual intervention. More advanced solutions, such as AI-powered tools and the GPT-4 Vision API, can further improve the extraction process.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1738943216214\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">What are the challenges of extracting data from PDFs?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>The unstructured nature of many PDFs makes accurate data extraction difficult. PDFs may contain complex layouts, non-linear text flow, or multiple document formats. 
Additionally, embedded elements such as email attachments, column names, and party names may require special handling to ensure reliable results.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1738943234617\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">How do I extract images from a PDF?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Libraries like PyMuPDF can extract the images embedded within a PDF. Once extracted, the images can be processed further, either with AI-powered OCR or with traditional text recognition.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1738943250544\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">What\u2019s the difference between structured and unstructured PDFs?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>A structured PDF maintains a clear organization with defined elements, such as labeled tables and paragraphs. In contrast, an unstructured PDF uses arbitrary layouts, which makes reliable data extraction harder to implement. 
Advanced data parsers can help convert unstructured data into a usable format.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"author":2627,"featured_media":75383,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","format":"standard","categories":[110],"tags":[],"class_list":["post-65490","blog","type-blog","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping-and-automation"],"acf":[],"_links":{"self":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/65490","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/users\/2627"}],"replies":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/comments?post=65490"}],"version-history":[{"count":5,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/65490\/revisions"}],"predecessor-version":[{"id":87228,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/65490\/revisions\/87228"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media\/75383"}],"wp:attachment":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media?parent=65490"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/categories?post=65490"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/tags?post=65490"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}