{"id":58586,"date":"2024-09-13T17:10:56","date_gmt":"2024-09-13T16:10:56","guid":{"rendered":"https:\/\/proxidize.com\/?post_type=use-cases&#038;p=58586"},"modified":"2025-10-02T12:25:00","modified_gmt":"2025-10-02T11:25:00","slug":"scrapy-web-scraping","status":"publish","type":"blog","link":"https:\/\/proxidize.com\/blog\/scrapy-web-scraping\/","title":{"rendered":"A Guide to Writing a Scrapy Web Scraping Script"},"content":{"rendered":"\n<p>There are many languages and libraries available to <a href=\"https:\/\/proxidize.com\/use-cases\/web-scraping-with-javascript\/\">perform a web scraping project<\/a>. However, if you are planning a large-scale project, Scrapy <a href=\"https:\/\/proxidize.com\/use-cases\/web-scraping\/\">web scraping<\/a> could be the best choice. Scrapy is designed specifically to handle large-scale scraping projects, thanks to its accessibility and extensible framework. This article explains what Scrapy is and walks through how to use it in a step-by-step Scrapy tutorial. 
Finally, it covers some advanced Scrapy techniques and offers tips and tricks for refining your approach.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy-1024x576.png\" alt=\"The logos of Scrapy and Python under a title &quot;What is Scrapy?&quot;\" class=\"wp-image-58649\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy-1024x576.png 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy-300x169.png 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy-768x432.png 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy-1536x864.png 1536w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy-600x338.png 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/what-is-scrapy.png 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">What is Scrapy?<\/h2>\n\n\n\n<p>Scrapy is an open-source Python framework built on an <a href=\"https:\/\/twisted.org\/\" target=\"_blank\" rel=\"noopener\">asynchronous networking engine<\/a> called Twisted. This means that it uses an event-driven networking infrastructure, which allows for higher efficiency and scalability. Scrapy comes with <a href=\"https:\/\/docs.scrapy.org\/en\/latest\/topics\/api.html\" target=\"_blank\" rel=\"noopener\">an engine called Crawler<\/a> that handles low-level logic such as HTTP connections, scheduling, and the overall execution flow. 
For high-level logic, Scrapy provides Spiders, which contain the scraping logic itself. You provide the Crawler with a Spider object, which generates request objects, parses the responses, and retrieves the data to store.<\/p>\n\n\n\n<p>Requests is used for HTTP requests, <a href=\"https:\/\/proxidize.com\/use-cases\/web-scraping-with-beautiful-soup\/\">BeautifulSoup is used for data parsing<\/a>, and Selenium is most common with JavaScript-based websites. Scrapy handles requests and parsing in one convenient library and can render JavaScript through add-ons such as Scrapy Splash.<\/p>\n\n\n\n<p>Here are some common Scrapy terms:<\/p>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Callback: Because Scrapy is asynchronous, most actions are executed in the background, which allows requests to be processed concurrently. A callback is a function attached to a background task that is called when the task completes successfully.<\/li>\n\n\n\n<li>Errorback: Similar to a callback, but triggered when a task fails instead of when it succeeds.<\/li>\n\n\n\n<li>Generator: A function that returns results one at a time instead of all at once.<\/li>\n\n\n\n<li>Settings: Scrapy\u2019s central configuration object, located in the settings.py file of the project.<\/li>\n<\/ul>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Scrapy includes some unique features that make it more powerful than other libraries used for scraping. 
These include HTTP connections, support for CSS selectors and XPath selectors, the ability to store data on FTP, S3, and a local file system, cookie and session management, JavaScript rendering with <a href=\"https:\/\/github.com\/scrapy-plugins\/scrapy-splash\" target=\"_blank\" rel=\"noopener\">Scrapy Splash<\/a>, and built-in crawling capabilities.&nbsp;<\/p>\n\n\n\n<p>With the basic information of Scrapy out of the way, it is time to start building the environment for a scraping project.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping-1024x576.png\" alt=\"A diagram under the title &quot;How To Use Scrapy Web Scraping&quot;.\" class=\"wp-image-58651\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping-1024x576.png 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping-300x169.png 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping-768x432.png 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping-1536x864.png 1536w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping-600x338.png 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2024\/09\/how-to-use-scrapy-web-scraping.png 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">How To Use Scrapy Web Scraping<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Install Python and Scrapy<\/h3>\n\n\n\n<p>Before starting a Scrapy project, you must ensure that you have Python installed. 
You can do this by visiting the <a href=\"https:\/\/www.python.org\/downloads\/\" target=\"_blank\" rel=\"noopener\">Python website<\/a> and downloading the latest version. Once that is complete, open a terminal and use the pip command to install Scrapy:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>pip install scrapy<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">pip install scrapy<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Create a New Project<\/h3>\n\n\n\n<p>Now that you have the library installed, the next step is to create a new project. Choose a name for your project; for this example, we will simply name the project \u201cscraping_example\u201d. 
Enter the following command in your terminal:<\/p>\n\n\n\n<p>scrapy startproject scraping_example<\/p>\n\n\n\n<p>This creates the set of files you will use to control the project\u2019s spiders, settings, and so on. The project structure looks like this:<\/p>\n\n\n\n<p>\u251c\u2500\u2500 scraping_example<\/p>\n\n\n\n<p>\u2502 &nbsp; \u251c\u2500\u2500 __init__.py<\/p>\n\n\n\n<p>\u2502 &nbsp; \u251c\u2500\u2500 items.py<\/p>\n\n\n\n<p>\u2502 &nbsp; \u251c\u2500\u2500 middlewares.py<\/p>\n\n\n\n<p>\u2502 &nbsp; \u251c\u2500\u2500 pipelines.py<\/p>\n\n\n\n<p>\u2502 &nbsp; \u251c\u2500\u2500 settings.py<\/p>\n\n\n\n<p>\u2502 &nbsp; \u2514\u2500\u2500 spiders<\/p>\n\n\n\n<p>\u2502 &nbsp; &nbsp; &nbsp; \u2514\u2500\u2500 __init__.py<\/p>\n\n\n\n<p>\u2514\u2500\u2500 scrapy.cfg<\/p>\n\n\n\n<p>The items.py file defines the model for the extracted data. You can customize it with classes that inherit from Scrapy\u2019s Item class. The middlewares.py file modifies the request\/response lifecycle. The pipelines.py file processes the extracted data: it can clean the HTML, validate the data, and export it in a custom format or save it to a database. The spiders directory contains your Spider classes, which define how a website should be scraped, such as which links to follow and how to extract the data. 
The scrapy.cfg file is the configuration file for the project\u2019s main settings.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Create a Spider<\/h3>\n\n\n\n<p>To create a spider, start from the project root (the directory that contains scrapy.cfg) and navigate to the spiders directory inside your project:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>cd scraping_example\/spiders<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">cd scraping_example<\/span><span style=\"color: #3E8FB0\">\/<\/span><span style=\"color: #E0DEF4\">spiders<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Next, create a Python file and specify which information needs to be scraped. 
For the example below, we will be using the website \u2018<a href=\"https:\/\/quotes.toscrape.com\/\" target=\"_blank\" rel=\"noopener\">quotes.toscrape<\/a>\u2019 and gathering the text, author, and tags:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>import scrapy\n\nclass ExampleSpider(scrapy.Spider):\n\n\u00a0\u00a0\u00a0\u00a0name = \"example\"\n\n\u00a0\u00a0\u00a0\u00a0start_urls = &#91;\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'http:\/\/quotes.toscrape.com\/',\n\n\u00a0\u00a0\u00a0\u00a0&#93;\n\n\u00a0\u00a0\u00a0\u00a0def parse(self, response):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for quote in response.css('div.quote'):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0yield {\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'text': quote.css('span.text::text').get(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'author': quote.css('span 
small::text').get(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'tags': quote.css('div.tags a.tag::text').getall(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0}\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0next_page = response.css('li.next a::attr(href)').get()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if next_page is not None:\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0yield response.follow(next_page, self.parse)<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> scrapy<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">class<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">ExampleSpider<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">scrapy<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #C4A7E7; font-style: italic\">Spider<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0name <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;example&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0start_urls <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">&#91;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: 
#F6C177\">&#39;http:\/\/quotes.toscrape.com\/&#39;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #908CAA\">&#93;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">parse<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">response<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> quote <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;div.quote&#39;<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;text&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span 
style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;span.text::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;author&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;span small::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;tags&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;div.tags a.tag::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">getall<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0next_page <\/span><span 
style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;li.next a::attr(href)&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> next_page <\/span><span style=\"color: #3E8FB0\">is<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">not<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">follow<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">next_page<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">parse<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Run the Spider<\/h3>\n\n\n\n<p>Finally, you need to run the Spider and enter the website you wish to scrape. 
The spider will crawl the website and find the necessary information. Note that scrapy crawl takes the spider\u2019s name (the name attribute, \u201cexample\u201d in the spider above), not the website\u2019s name. Additionally, you can save the results to a file such as a <a href=\"https:\/\/proxidize.com\/blog\/json-vs-csv\/\" data-type=\"link\" data-id=\"https:\/\/proxidize.com\/blog\/json-vs-csv\/\">JSON or a CSV<\/a> with the -o option:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>scrapy crawl example -o output.json<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">scrapy crawl example <\/span><span style=\"color: #3E8FB0\">-<\/span><span style=\"color: #E0DEF4\">o output<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">json<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 
class=\"wp-block-heading\">Running your Script<\/h3>\n\n\n\n<p>The final step is to run your script. Once the website to scrape, the exact information you wish to retrieve, and the method of saving are all defined, you can also start the crawl from a Python script using CrawlerProcess. In the code below, the module path project.spiders.test_spider, the class SpiderName, and the arg1 and arg2 values are placeholders: replace them with your own module path, spider class, and spider arguments.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>from scrapy.crawler import CrawlerProcess\n\nfrom project.spiders.test_spider import SpiderName\n\nprocess = CrawlerProcess()\n\nprocess.crawl(SpiderName, arg1=val1,arg2=val2)\n\nprocess.start()<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> scrapy<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">crawler <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> CrawlerProcess<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span 
style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> project<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">spiders<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">test_spider <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> SpiderName<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">process <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> CrawlerProcess<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">process<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">crawl<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">SpiderName<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">arg1<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">val1<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #C4A7E7; font-style: italic\">arg2<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">val2<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">process<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">start<\/span><span style=\"color: #908CAA\">()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>With that, your script should run and crawl the information you need without worry. Here is the full script including all the information you would need. 
You can reuse this script, but remember to change the necessary details to match what you wish to scrape.<\/p>\n\n\n\n<p>In the terminal, enter these lines:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>pip install scrapy\n\nscrapy startproject scraping_example\n\ncd scraping_example\/scraping_example\/spiders<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">pip install scrapy<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">scrapy startproject scraping_example<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">cd scraping_example<\/span><span style=\"color: #3E8FB0\">\/<\/span><span style=\"color: #E0DEF4\">scraping_example<\/span><span style=\"color: #3E8FB0\">\/<\/span><span style=\"color: #E0DEF4\">spiders<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Then, in your code editor, enter this script:<\/p>\n\n\n\n<div 
style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>import scrapy\n\nclass ExampleSpider(scrapy.Spider):\n\n\u00a0\u00a0\u00a0\u00a0name = \"example\"\n\n\u00a0\u00a0\u00a0\u00a0start_urls = &#91;\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'http:\/\/quotes.toscrape.com\/',\n\n\u00a0\u00a0\u00a0\u00a0&#93;\n\n\u00a0\u00a0\u00a0\u00a0def parse(self, response):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for quote in response.css('div.quote'):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0yield {\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'text': quote.css('span.text::text').get(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'author': quote.css('span small::text').get(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'tags': quote.css('div.tags 
a.tag::text').getall(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0}\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0next_page = response.css('li.next a::attr(href)').get()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if next_page is not None:\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0yield response.follow(next_page, self.parse)\n\nscrapy crawl (website name) -o output.json\n\nfrom scrapy.crawler import CrawlerProcess\u00a0\n\nfrom project.spiders.test_spider import SpiderName\n\nprocess = CrawlerProcess()\n\nprocess.crawl(SpiderName, arg1=val1,arg2=val2)\n\nprocess.start()<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> scrapy<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">class<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">ExampleSpider<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">scrapy<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #C4A7E7; font-style: italic\">Spider<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0name <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;example&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0start_urls <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">&#91;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;http:\/\/quotes.toscrape.com\/&#39;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #908CAA\">&#93;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">parse<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">response<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> quote <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;div.quote&#39;<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;text&#39;<\/span><span 
style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;span.text::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;author&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;span small::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;tags&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;div.tags a.tag::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">getall<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0next_page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;li.next a::attr(href)&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> next_page <\/span><span style=\"color: #3E8FB0\">is<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">not<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">follow<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">next_page<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">parse<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">scrapy crawl <\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">website name<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: 
#3E8FB0\">-<\/span><span style=\"color: #E0DEF4\">o output<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">json<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> scrapy<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">crawler <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> CrawlerProcess\u00a0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">from<\/span><span style=\"color: #E0DEF4\"> project<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">spiders<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">test_spider <\/span><span style=\"color: #3E8FB0\">import<\/span><span style=\"color: #E0DEF4\"> SpiderName<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">process <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> CrawlerProcess<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">process<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">crawl<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">SpiderName<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">arg1<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">val1<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #C4A7E7; font-style: italic\">arg2<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">val2<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
process<\/span>">
#E0DEF4\">process<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">start<\/span><span style=\"color: #908CAA\">()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Tips and Tricks<\/h2>\n\n\n\n<p>One of the most vital steps when it comes to web scraping with any language or library is to ensure you are using a proxy. A proxy adds an extra layer of security by hiding your IP address, and a rotating proxy switches between different IPs, which makes your activity harder for a website to trace and reduces the risk of IP bans. Implementing a proxy within a Scrapy script is simple and requires only a few extra lines of code. This section will discuss two methods by which this can be done.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Add a Meta Parameter<\/h3>\n\n\n\n<p>The first method is to pass the proxy through the meta parameter of a scrapy.Request object.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>yield 
scrapy.Request(\n\nurl,\u00a0\n\ncallback=self.parse,\u00a0\n\nmeta={'proxy': 'http:\/\/&lt;PROXY_IP_ADDRESS>:&lt;PROXY_PORT>'}\n\n)<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> scrapy<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">Request<\/span><span style=\"color: #908CAA\">(<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\">\u00a0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #C4A7E7; font-style: italic\">callback<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">parse<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\">\u00a0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #C4A7E7; font-style: italic\">meta<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #908CAA\">{<\/span><span style=\"color: #F6C177\">&#39;proxy&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;http:\/\/&lt;PROXY_IP_ADDRESS&gt;:&lt;PROXY_PORT&gt;&#39;<\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Simply enter the proxy IP and port within the labeled space and you will be good to go.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Create a Custom 
Middleware<\/h3>\n\n\n\n<p>A downloader middleware is a layer in Scrapy that intercepts requests before they are sent. Once a proxy middleware is enabled, every request will be routed through it, which helps when working on a larger project that involves multiple spiders.<\/p>\n\n\n\n<p>Define a custom proxy middleware class, for example in the middlewares.py file of your project. This can be done as such:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>class CustomProxyMiddleware(object):\n\n\u00a0\u00a0\u00a0\u00a0def __init__(self):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.proxy = 'http:\/\/&lt;PROXY_IP_ADDRESS>:&lt;PROXY_PORT>'\n\n\u00a0\u00a0\u00a0\u00a0def process_request(self, request, spider):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if 'proxy' not in request.meta:\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0request.meta&#91;'proxy'&#93; = self.proxy\n\n\u00a0\u00a0\u00a0\u00a0def get_proxy(self):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return self.proxy<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" 
style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">class<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">CustomProxyMiddleware<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #9CCFD8\">object<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">__init__<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">proxy <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;http:\/\/&lt;PROXY_IP_ADDRESS&gt;:&lt;PROXY_PORT&gt;&#39;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">process_request<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">request<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">spider<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;proxy&#39;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">not<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> request<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">meta<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0request<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">meta<\/span><span style=\"color: #908CAA\">&#91;<\/span><span style=\"color: #F6C177\">&#39;proxy&#39;<\/span><span style=\"color: #908CAA\">&#93;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">proxy<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">get_proxy<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: 
proxy<\/span>">
#E0DEF4\">proxy<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Finally, register the middleware in the DOWNLOADER_MIDDLEWARES setting in the settings.py file. In the snippet below, myproject is a placeholder for your project\u2019s package name, and 350 is an example priority that runs the middleware before Scrapy\u2019s built-in HttpProxyMiddleware (750 by default):<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>DOWNLOADER_MIDDLEWARES = {\n\n\u00a0\u00a0\u00a0\u00a0'myproject.middlewares.CustomProxyMiddleware': 350,\n\n}<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">DOWNLOADER_MIDDLEWARES <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;myproject.middlewares.CustomProxyMiddleware&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">350<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA\">}<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" 
class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Using either of these methods would help input your proxy within the code and make your scraping efforts smooth.&nbsp;<\/p>\n\n\n\n<p>You must keep in mind that if you wish to use proxies, you must enter either line of code under the def prase statement. As an example, it would look like this:<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#232136;display:none;background-color:#e0def4\" aria-label=\"Copy\" data-copied-text=\"Copied!\" data-has-text-button=\"textSimple\" data-inside-header-type=\"none\" aria-live=\"polite\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>def parse(self, response):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for quote in response.css('div.quote'):\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0yield {\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'text': quote.css('span.text::text').get(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'author': quote.css('span small::text').get(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'tags': quote.css('div.tags 
a.tag::text').getall(),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0}\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0next_page = response.css('li.next a::attr(href)').get()\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if next_page is not None:\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0yield scrapy.Request(\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0response.urljoin(next_page),\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0callback=self.parse,\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0meta={'proxy': 'http:\/\/&lt;PROXY_IP_ADDRESS>:&lt;PROXY_PORT>'}\n\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)<\/textarea><\/pre><span class=\"cbp-btn-text\">Copy<\/span><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">parse<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">response<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">for<\/span><span style=\"color: #E0DEF4\"> quote <\/span><span style=\"color: #3E8FB0\">in<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;div.quote&#39;<\/span><span style=\"color: 
#908CAA\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;text&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;span.text::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;author&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;span small::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #F6C177\">&#39;tags&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> quote<\/span><span style=\"color: #908CAA\">.<\/span><span 
style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;div.tags a.tag::text&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">getall<\/span><span style=\"color: #908CAA\">(),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0next_page <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">css<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;li.next a::attr(href)&#39;<\/span><span style=\"color: #908CAA\">).<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> next_page <\/span><span style=\"color: #3E8FB0\">is<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">not<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #3E8FB0\">yield<\/span><span style=\"color: #E0DEF4\"> scrapy<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">Request<\/span><span style=\"color: #908CAA\">(<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span 
style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">urljoin<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">next_page<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #C4A7E7; font-style: italic\">callback<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">parse<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #C4A7E7; font-style: italic\">meta<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #908CAA\">{<\/span><span style=\"color: #F6C177\">&#39;proxy&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;http:\/\/&lt;PROXY_IP_ADDRESS&gt;:&lt;PROXY_PORT&gt;&#39;<\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Scrapy is a wonderful library to use if your scraping projects are bigger than expected. 
If you already know Python, it should be simple to pick up. With the guide in this article, you should be able to build your own scraping script and route its requests through a proxy. For an extra layer of protection, consider using an <a href=\"https:\/\/proxidize.com\/antidetect-browser\/\">antidetect browser<\/a> to help keep your details hidden while you scrape with Scrapy.<\/p>\n","protected":false},"author":2627,"featured_media":75973,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","format":"standard","categories":[110],"tags":[],"class_list":["post-58586","blog","type-blog","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping-and-automation"],"acf":[],"_links":{"self":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/58586","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/users\/2627"}],"replies":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/comments?post=58586"}],"version-history":[{"count":4,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/58586\/revisions"}],"predecessor-version":[{"id":84898,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/58586\/revisions\/84898"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media\/75973"}],"wp:attachment":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media?parent=58586"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/categories?post=58586"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/tags?post=58586"}],"curies":[{"name":"wp","href":"h
ttps:\/\/api.w.org\/{rel}","templated":true}]}}