{"id":90595,"date":"2025-11-28T17:48:28","date_gmt":"2025-11-28T17:48:28","guid":{"rendered":"https:\/\/proxidize.com\/?post_type=blog&#038;p=90595"},"modified":"2025-11-28T18:04:26","modified_gmt":"2025-11-28T18:04:26","slug":"firecrawl-self-host-proxies","status":"publish","type":"blog","link":"https:\/\/proxidize.com\/blog\/firecrawl-self-host-proxies\/","title":{"rendered":"Firecrawl Self Host Guide: 2 Easy Ways to Integrate Proxies"},"content":{"rendered":"\n<p><strong>Should I go with a cloud or self-hosted solution?<\/strong> The question of convenience vs control is a topic of discussion in every Slack channel out there. As developers, we love to have control over the things we build, but sometimes the associated cost and overhead you add to your team are not worth it. One side argues that you should pay for the cloud to save time; the other side claims that open-source self-hosting is the only way to scale without going bankrupt.<\/p>\n\n\n\n<p><strong>Let\u2019s take Firecrawl as an example.<\/strong> A market intelligence agent needs to scrape thousands of websites to provide high-quality data. The discussion naturally turns to <strong>whether a Firecrawl self host version would be the most reasonable solution rather than paying for the Firecrawl API<\/strong>.<\/p>\n\n\n\n<p>I would argue that, other than budget, there are always \u201chidden costs\u201d that we as developers don&#8217;t take into consideration. We just love building, what can I say? <strong>Let\u2019s put each option side by side. If you&#8217;re impatient, you can go straight to the <a href=\"#proxy-integration\" data-type=\"internal\" data-id=\"#proxy-integration\">proxy integration<\/a>.<\/strong><\/p>\n\n\n\n<p><strong>The Firecrawl API is a great choice. It handles everything for you. You provide what you want and the data will come to you clean and ready<\/strong>; no need to worry about anything else. 
You might start scraping 100 pages today, but say the business thrives and you suddenly need 100,000 pages. <strong>The bill grows larger every month<\/strong> and every month you postpone the dream of buying your own Porsche.<\/p>\n\n\n\n<p>By contrast, <strong>Firecrawl self host is a good choice too. You have the code hosted on your servers, which can be more secure and can save you money<\/strong>, and, as a developer, you get to modify it. Happy days! You deploy it and soon enough you hit a wall. Your logs turn red with <a href=\"https:\/\/proxidize.com\/blog\/403-error\/\" target=\"_blank\" rel=\"noreferrer noopener\">403 Forbidden<\/a> errors. The scrape works locally on your machine, but your DigitalOcean or AWS server is getting instantly blocked by Cloudflare.<\/p>\n\n\n\n<p><strong>The consequences of either choice become more obvious as you scale<\/strong> and scrape more data. Sticking with the cloud is just too expensive, and self-hosting reveals a brutal truth: the system works great, but your <a href=\"https:\/\/proxidize.com\/blog\/what-is-an-ip-address\/\" target=\"_blank\" rel=\"noreferrer noopener\">IP address<\/a> is getting flagged as a bot or as spam. It&#8217;s like having a Porsche engine you\u2019re forced to drive at 20.<\/p>\n\n\n\n<p>Every project more complicated than a \u201cHello World\u201d scrape explodes into conflicting opinions about rotating IPs, residential proxies, and avoiding captchas. Another day in the life of a developer trying to scrape. It\u2019s <em>so<\/em> easy to switch back to the cloud and pay up.<\/p>\n\n\n\n<p>That\u2019s why we are going to break down the solution practically. 
<strong>We will show you how to keep the cost savings of your Firecrawl self host and use proxies to avoid getting blocked by websites.<\/strong> Here\u2019s to building AI data pipelines without getting banned.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/what_is_firecrawl-img-1024x536.jpg\" alt=\"a drawing of HTML becoming a JSON under the title &quot;What is Firecrawl?&quot;\" class=\"wp-image-90601\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/what_is_firecrawl-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/what_is_firecrawl-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/what_is_firecrawl-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/what_is_firecrawl-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/what_is_firecrawl-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-firecrawl\">What is Firecrawl?<\/h2>\n\n\n\n<p>To understand why Firecrawl is different, we first have to understand how traditional data extraction works. When you start <a href=\"https:\/\/proxidize.com\/blog\/web-scraping\/\" target=\"_blank\" rel=\"noreferrer noopener\">web scraping<\/a>, you usually get very messy HTML with a lot of &lt;div&gt; tags, navigation bars, and other noise. 
You have to clean it using <a href=\"https:\/\/proxidize.com\/blog\/python-vs-javascript\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python or JavaScript<\/a> and pass clean JSON to the LLM so that you don\u2019t waste valuable tokens and incur extra cost.<\/p>\n\n\n\n<p>Built to solve this kind of problem, <strong><a href=\"https:\/\/www.firecrawl.dev\/\" target=\"_blank\" rel=\"noreferrer noopener\">Firecrawl<\/a> scrapes websites and gives you clean JSON or markdown<\/strong> ready to be used in any LLM. This saves time and effort, as you don\u2019t have to wade through extra data that you won\u2019t use.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The \u201cWeb-to-LLM\u201d Converter<\/h3>\n\n\n\n<p>Firecrawl acts like the middle man between the data and your AI model. You see, back in the old days you had to write custom selector code for every website you visited (crazy, I know). Nowadays, AI web scrapers just do that automatically for you. <strong>Firecrawl scrapes any website you want and produces output in a standardized format<\/strong> (like <a href=\"https:\/\/llmstxt.org\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">llms.txt<\/a>), a file that AI agents can read and understand.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Beyond Basic Scraping: Maps &amp; RAG<\/h3>\n\n\n\n<p>As we said, most scrapers are simple: they grab the HTML of a page of your choice and hand it to you. But we live in a fast-paced world now, so every second counts. Firecrawl AI agents come onto the scene and piece together the website targeted by your scrape; this is where it shines as a deep research tool.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The \/map Endpoint (AI Sitemap Generator)&nbsp;<\/h4>\n\n\n\n<p>Before your AI can read any website, it needs to know how the website is structured and what pages exist. Many websites have a file called <strong>sitemap.xml<\/strong>, but these files are often outdated or incomplete. 
The Firecrawl <strong>\/map<\/strong> endpoint solves this problem by acting as a detective that discovers the pages of your target website.&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Why AI Needs It: <\/strong>In a nutshell, it gives you a good overview of the website before you start. Let\u2019s say you\u2019re building a customer support bot. You don&#8217;t want to scrape just the homepage; you need every piece of information you can find, from FAQs to the help center and API documentation. The <strong>\/map<\/strong> endpoint returns a clean list of the URLs it discovers, which you can filter before spending any scraping credits.<\/li>\n\n\n\n<li><strong>The Hidden Challenge: <\/strong>Mapping websites requires sending multiple \u201chead\u201d requests in seconds. This behaviour does not look like normal browsing, so the chance of being flagged is high. That\u2019s why you should have the option to rotate proxies or IPs to avoid blocks.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Powering RAG Pipelines<\/h4>\n\n\n\n<p>RAG, or <a href=\"https:\/\/aws.amazon.com\/what-is\/retrieval-augmented-generation\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Retrieval-Augmented Generation<\/a>, is the architecture that allows LLMs to \u201cknow\u201d about private data. Firecrawl is purpose-built for this.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Clean Data In, Answers Out: <\/strong>The best thing about Firecrawl is that it doesn\u2019t <em>just<\/em> clean text; it returns clean markdown that any LLM can use, with the page hierarchy (headers, lists, tables) preserved. This is very important for vector databases (like Pinecone, Weaviate and MongoDB) because it keeps related information together.<\/li>\n\n\n\n<li><strong>llms.txt Support: <\/strong>If you really care about your SEO and want to increase the likelihood of your website being recommended by AI chatbots, having an llms.txt file is important. 
It makes it easier for AI chatbots to understand your website efficiently.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Deep Research and Batch Scraping&nbsp;<\/h4>\n\n\n\n<p>For agents that need to \u201cbrowse\u201d the web to answer a simple (or not very simple) question such as \u201cFind all pricing plans for these 5 competitors\u201d, Firecrawl offers deep research capabilities.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>The Crawl Endpoint: <\/strong>Unlike the scrape endpoint, which hits one page, the crawl endpoint traverses a site <a href=\"https:\/\/www.geeksforgeeks.org\/dsa\/difference-between-bfs-and-dfs\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">(BFS\/DFS)<\/a> to a specified depth.<\/li>\n\n\n\n<li><strong>Batch Operations: <\/strong>You can submit a Firecrawl batch scrape job to process hundreds of URLs at the same time.<\/li>\n\n\n\n<li><strong>The Trap: <\/strong>This is where the Firecrawl self host crowd gets banned the fastest. Deep crawling produces unusual traffic spikes that can trigger rate limits right away. 
Without having high-quality IPs masking this activity, your \u201cdeep research\u201d agent will likely get banned after a couple of pages.<\/li>\n<\/ul>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large resized\" id=\"hidden-problem\"><img decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/hidden_problem_with_firecrawl_self_host-img-1024x536.jpg\" alt=\"a drawing of a server hidden behind another server under the title &quot;Problem With Firecrawl Self Hosting&quot;\" class=\"wp-image-90613\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/hidden_problem_with_firecrawl_self_host-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/hidden_problem_with_firecrawl_self_host-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/hidden_problem_with_firecrawl_self_host-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/hidden_problem_with_firecrawl_self_host-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/hidden_problem_with_firecrawl_self_host-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">The Hidden Problem with Firecrawl Self Hosting<\/h2>\n\n\n\n<p>Self-hosting is great. You deploy your own version of the code and you scrape whatever you want with it. You do whatever you want with the data. In other words, you are in control, with no middle man required. However, there is a catch. <strong>When you switch from cloud to self-host you gain a lot, but<\/strong> you lose out as well. You <strong>lose the invisible infrastructure that makes scraping possible in the first place<\/strong>, i.e. 
IP rotation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The \u201cLocalhost\u201d IP Trap<\/h3>\n\n\n\n<p><strong>When you self-host Firecrawl locally or deploy it via Docker, every single request comes from your machine\u2019s static IP address.<\/strong> It\u2019s like calling someone a million times from one number (<strong>you <em>will<\/em> get blocked<\/strong> eventually).<\/p>\n\n\n\n<p><strong>For its cloud version, Firecrawl has agreements with proxy providers<\/strong> (like <a href=\"https:\/\/proxidize.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Proxidize<\/a>), who give Firecrawl access to a large pool of IPs. Firecrawl uses those proxies while scraping or crawling websites, which makes its requests less likely to be blocked because they don\u2019t all come from the same IP. <strong>In practice, switching between IP addresses is the only reliable way to do large-scale web scraping.<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why Cloudflare Hates Your VPS<\/h3>\n\n\n\n<p><strong>Most developers deploy their self-hosted instances on cloud providers<\/strong> such as AWS, DigitalOcean, or Hetzner because it\u2019s cheap and scalable. But something they sometimes forget to take into account is <a href=\"https:\/\/proxidize.com\/blog\/ip-score\/#ip-reputation\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>IP reputation<\/strong><\/a>. Modern anti-bot systems check the reputation of incoming IP addresses. If yours has a bad one, you are in trouble:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>The Datacenter Flag:<\/strong> Let\u2019s be clear here. 
Websites know that real people don\u2019t doomscroll or tweet from AWS servers, so that\u2019s already a red flag.<\/li>\n\n\n\n<li><strong>The ASN Block: <\/strong>Security systems look at the <a href=\"https:\/\/www.cloudflare.com\/learning\/network-layer\/what-is-an-autonomous-system\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">ASN (Autonomous System Number)<\/a> of your IP. If it belongs to a hosting provider like AWS or Google Cloud, for example, it can be flagged automatically as &#8220;non-human&#8221; traffic.<\/li>\n<\/ul>\n\n\n\n<p>Even if your scraper is the best, most efficient code in the world, you will probably hit 403 Forbidden errors or <a href=\"https:\/\/proxidize.com\/blog\/error-code-502\/\" target=\"_blank\" rel=\"noreferrer noopener\">502 error<\/a> screens before you even load the HTML. 
That\u2019s why you need a high-quality third-party proxy provider to help you overcome the anti-bot systems and identify as a human.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large resized\"><img decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/firecrawl_proxy-img-1024x536.jpg\" alt=\"A drawing of the firecrawl logo next to a server under the title &quot;Why You Need a Firecrawl Proxy&quot;\" class=\"wp-image-90614\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/firecrawl_proxy-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/firecrawl_proxy-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/firecrawl_proxy-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/firecrawl_proxy-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/firecrawl_proxy-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"firecrawl-proxy\">Why You Need a Firecrawl Proxy<\/h2>\n\n\n\n<p>Since you are <strong>self hosting Firecrawl, you are responsible for the \u201cnetworking\u201d layer<\/strong>. Most developers try to save money by <strong>buying cheap proxies<\/strong>, but this <strong>is a mistake that will cost you<\/strong> money and time. To scrape more high value data without having any problems, you need to fundamentally change how your crawler looks to the outside world. 
That starts with <a href=\"https:\/\/proxidize.com\/proxy-server\/\" target=\"_blank\" rel=\"noreferrer noopener\">proxy servers<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Problem with Cheap Proxy Providers<\/h3>\n\n\n\n<p>Buying cheap proxies won\u2019t get you better scraping results. Their IPs have often been abused by many other users before you, so they\u2019re already on numerous blacklists.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Instant Flagging: <\/strong>Providers like Cloudflare have vast databases of these \u201cdirty IPs\u201d. Trying to use these IPs can get you flagged right away and blacklisted before your first request even completes.<\/li>\n\n\n\n<li><strong>The Captcha Loop:<\/strong> Maybe your first few requests went through and life is good. But your IP is not clean, so you will get captchas from time to time. More than just annoying, it\u2019s expensive, so it\u2019s something you should factor in as well.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scraping Sophisticated Platforms&nbsp;<\/h3>\n\n\n\n<p><strong>Platforms like LinkedIn and Instagram have really strict anti-scraping measures<\/strong> and lots of mechanisms to verify that people are human. Trying to scrape them with a cheap proxy or your own local IP won\u2019t work at scale. <strong>You will need high-quality IP addresses, which you can only get via trusted proxy providers.<\/strong><\/p>\n\n\n\n<p>Another thing to consider is session continuity. When you crawl these platforms, or any website in general, you don\u2019t want the crawl to fail mid-session after hours of waiting. For the more sophisticated platforms, you also need sticky sessions to stay logged in and keep scraping without interruption.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/step-by-step_guide_configuring_proxies_for_firecrawl-1024x536.jpg\" alt=\"A drawing of a server and laptops with proxies under the title &quot;Step-by-Step guide: Configuring Proxies for Firecrawl Self Host&quot;\" class=\"wp-image-90598\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/step-by-step_guide_configuring_proxies_for_firecrawl-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/step-by-step_guide_configuring_proxies_for_firecrawl-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/step-by-step_guide_configuring_proxies_for_firecrawl-768x402.jpg 768w, 
https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/step-by-step_guide_configuring_proxies_for_firecrawl-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/11\/step-by-step_guide_configuring_proxies_for_firecrawl.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"proxy-integration\">Step-by-Step Guide: Configuring Proxies for Firecrawl Self Host<\/h2>\n\n\n\n<p>Since the <a href=\"https:\/\/docs.firecrawl.dev\/sdks\/python\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Firecrawl Python SDK<\/a> or Node client only connects to your self-hosted instance, your proxy configuration should happen on the Docker side, i.e. at the infrastructure level. If you skip this, your crawl is exposed to blocks from Cloudflare, and we don&#8217;t want that.<\/p>\n\n\n\n<p>Here are two ways you can inject your proxies and route your requests through them to prevent your Firecrawl self host from getting blocked while crawling websites.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Method 1: The Simple .env<\/h3>\n\n\n\n<p>This is one of the simplest ways out there. You just need to <strong>create a .env file in your project (preferably in the root)<\/strong>. 
Here are the steps to make it easier for you:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Navigate to the root of the project you are working on.<\/li>\n\n\n\n<li>Open (or create) your .env file.<\/li>\n\n\n\n<li>Add the proxy variables to the .env file. You normally get these details from the proxy provider you subscribed to.<\/li>\n<\/ol>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly># .env file configuration\n# Format: http:\/\/username:password@proxy-gateway.com:port\n\n# Standard Proxy Variables\nHTTP_PROXY=http:\/\/user123:pass123@proxy-gateway.com:8080<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" 
style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> .env file configuration<\/span><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Format: http:\/\/username:password@proxy-gateway.com:port<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #908CAA; font-style: italic\">#<\/span><span style=\"color: #6E6A86; font-style: italic\"> Standard Proxy Variables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4; font-style: italic\">HTTP_PROXY<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">http:\/\/user123:pass123@proxy-gateway.com:8080<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>This is a very simple, straightforward way to do it. <strong>When you start the containers with Docker Compose, it reads the variables inside the .env file and uses them.<\/strong> Make sure the .env file sits where Docker Compose expects it, or point to its location explicitly, so that the variables can be found.<\/p>\n\n\n\n<p><strong>Note: <\/strong>Proxy formats can differ between proxy providers. You <em>can<\/em> change the format, but each provider will have a different way to do that.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Method 2: Docker Compose for Playwright<\/h3>\n\n\n\n<p>For scraping at scale, Playwright, a browser automation framework that is also available as a <a href=\"https:\/\/proxidize.com\/blog\/python-libraries-for-web-scraping\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python library for web scraping<\/a>, is one of the best choices a developer can make. It\u2019s open-source, easy to use, and lets you integrate proxies.&nbsp;<\/p>\n\n\n\n<p>Sometimes the .env file isn\u2019t the ideal solution since the Playwright container might be isolated. 
In that case, <strong>you need to inject proxies specifically into the browser service configuration<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open your docker-compose.yaml file.<\/li>\n\n\n\n<li>Locate the playwright-service section.<\/li>\n\n\n\n<li>Add the proxy variables under the environment key.<\/li>\n<\/ol>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>services:\n  playwright-service:\n    image: mendableai\/firecrawl-playwright-service:latest\n    environment:\n      # Inject Proxy for Browser Traffic\n      - PROXY_SERVER=http:\/\/proxy-gateway.com:8080\n      - PROXY_USERNAME=user123\n      - PROXY_PASSWORD=pass123\n      \n      # Fallback to standard conventions\n      - HTTPS_PROXY=http:\/\/user123:pass123@proxy-gateway.com:8080\n    depends_on:\n      - redis<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 
2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">services:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  playwright-service:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    image: mendableai\/firecrawl-playwright-service:latest<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    environment:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      # Inject Proxy for Browser Traffic<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      - PROXY_SERVER=http:\/\/proxy-gateway.com:8080<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      - PROXY_USERNAME=user123<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      - PROXY_PASSWORD=pass123<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      # Fallback to standard conventions<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      - HTTPS_PROXY=http:\/\/user123:pass123@proxy-gateway.com:8080<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    depends_on:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      - redis<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p><strong>Pro Tip:<\/strong> If you do a large amount of web scraping, make sure that you have a provider that offers session rotation, i.e. the ability to rotate after every request or at will. 
It\u2019s important because every request, or every Playwright browser you open, then gets a fresh, clean IP, which makes blocks much less likely.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\">Conclusion&nbsp;<\/h2>\n\n\n\n<p>Firecrawl is one of the best scraping platforms out there. Yes, it\u2019s still a startup, but it has a large audience of developers and, let\u2019s not forget, it&#8217;s also backed by <a href=\"https:\/\/www.ycombinator.com\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Y Combinator<\/a>, one of the best-known startup accelerators in the world.<\/p>\n\n\n\n<p><strong>Key takeaways:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>If you have the technical experience<\/strong> and you don\u2019t mind putting in a little bit of effort, <strong>self-hosting Firecrawl might be a good option for you<\/strong>.<\/li>\n\n\n\n<li><strong>The more you scale<\/strong> your project, <strong>the more you are going to pay<\/strong> to operate it. That\u2019s life.<\/li>\n\n\n\n<li><strong>If you prefer to have everything in one place<\/strong> and spare yourself a headache while scraping, <strong>and you don\u2019t mind the cost, going with the Firecrawl API is the best option for you<\/strong>.<\/li>\n\n\n\n<li><strong>If you decide to self host Firecrawl, having a great proxy provider is essential<\/strong> to prevent any problems and get great results.<\/li>\n\n\n\n<li><strong>Don&#8217;t ever use your local IP for scraping projects<\/strong>, large or small, since it might get blacklisted.<\/li>\n<\/ul>\n\n\n\n<p>Firecrawl has some decent features to offer. That\u2019s why people are using it. The ability to take messy HTML with unused divs and CSS selectors and turn it into clean JSON or markdown that can be fed directly into LLMs is a great feature to have. It really saves time and effort. 
No need to create additional scripts to clean the data after collecting it, unlike traditional scrapers.<\/p>\n\n\n\n<p>The ability to choose between scraping, crawling, and mapping is great as well. Many people might want to know what a website has to offer, so they\u2019ll decide to generate a content map. By contrast, if you want to follow the website\u2019s links from page to page you go with the crawl option. If you want to get <em>all<\/em> the information from a single page you normally go with scraping.<\/p>\n\n\n\n<p>To have all of Firecrawl\u2019s amazing features <em>and<\/em> remain in complete control, you should choose the Firecrawl self host option. With self hosting, you will need a good proxy provider to avoid blocks or cutoffs while scraping. Choosing to go with the cloud version means you don\u2019t have to worry about that, though.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"faq\">Frequently Asked Questions<\/h2>\n\n\n<div id=\"rank-math-faq\" class=\"rank-math-block\">\n<div class=\"rank-math-list \">\n<div id=\"faq-question-1764349620577\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">Is Firecrawl free to use?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes, the self-hosted version of Firecrawl is open source and can be used by anyone under the AGPL-3.0 license, though you will be responsible for the costs of the server (VPS) and the proxy infrastructure needed to run it.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350404086\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">What is the main difference between Firecrawl Cloud and self hosting Firecrawl?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>In the cloud version, the service is fully managed by the Firecrawl team. Both technical and non-technical people can use it. 
With Firecrawl\u2019s self-hosted version, you will have to manage all the infrastructure and the servers related to hosting the code, which requires some technical expertise.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350405431\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">Why am I getting 403 Forbidden errors on my self-hosted Firecrawl instance?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>This usually happens because you are sending requests from your own IP or from datacenter servers like AWS or Google Cloud. Anti-bot services such as Cloudflare often flag and block these IPs. To reduce such errors, route your requests through <a href=\"https:\/\/proxidize.com\/proxy-server\/mobile-proxies\/\">mobile proxies<\/a> or residential proxies.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350406167\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">Does Firecrawl respect robots.txt?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes. By default, it checks a website\u2019s robots.txt first to see what it is allowed and disallowed from accessing, and adjusts the scraper accordingly.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350406900\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">Can Firecrawl scrape websites behind a login?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes, Firecrawl does support that, but you will need to provide it with the credentials in the headers of the request. 
Using sticky sessions here is important so that your IP stays the same and the login session is not dropped midway.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350408264\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">What is the difference between \/map, \/scrape, and \/crawl?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>\/scrape is used to extract data from a single URL into Markdown or JSON; \/map is used to build a map of a website\u2019s URLs without scraping their content; and \/crawl is used to follow a website\u2019s links and scrape the pages it finds.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350408917\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">What are the hardware requirements to self-host Firecrawl?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>To run the server comfortably, you will need Docker Compose and 2GB of RAM, along with PostgreSQL and a Redis instance.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1764350410316\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">Does Firecrawl self-host have rate limits?<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>No, it\u2019s unlimited. 
There is no rate limit, and you can scrape as much as your hardware allows.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"author":8854,"featured_media":90603,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","format":"standard","categories":[110],"tags":[],"class_list":["post-90595","blog","type-blog","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping-and-automation"],"acf":[],"_links":{"self":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/90595","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/users\/8854"}],"replies":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/comments?post=90595"}],"version-history":[{"count":9,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/90595\/revisions"}],"predecessor-version":[{"id":90616,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/90595\/revisions\/90616"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media\/90603"}],"wp:attachment":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media?parent=90595"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/categories?post=90595"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/tags?post=90595"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}