{"id":80626,"date":"2025-08-19T15:16:28","date_gmt":"2025-08-19T14:16:28","guid":{"rendered":"https:\/\/proxidize.com\/?post_type=blog&#038;p=80626"},"modified":"2025-10-30T16:06:03","modified_gmt":"2025-10-30T16:06:03","slug":"reddit-scraper","status":"publish","type":"blog","link":"https:\/\/proxidize.com\/blog\/reddit-scraper\/","title":{"rendered":"Reddit Scraper: How to Scrape Reddit for Free"},"content":{"rendered":"\n<p>I want to scrape Reddit data. Simple enough, right? It turns out that most Reddit scraping tutorials fall into two categories: the \u201chere\u2019s how to grab 10 posts\u201d quick and messy scripts, or the over-engineered enterprise solutions that feel like using a spaceship.<\/p>\n\n\n\n<p>The problem is that real-world scraping is messy: Some builders only need to scrape a couple of Reddit posts for analysis purposes, others want to scrape thousands of records for research purposes, and others still are only interested in the comments, and so on. The truth is that you might start with a small job today and find yourself needing something more scaled up next month.<\/p>\n\n\n\n<p>I needed something that could handle both kinds of requests without breaking, so I built a Reddit scraper that gets smarter as your needs grow. Small jobs stay simple and fast; large jobs automatically get <a href=\"https:\/\/proxidize.com\/proxy-server\/\" target=\"_blank\" rel=\"noreferrer noopener\">proxy<\/a> protection and async processing \u2014 same interface but different engines under the hood.<\/p>\n\n\n\n<p>Here\u2019s how I did it, the problems I tackled along the way, and why Reddit turned out to be surprisingly friendly to scrape (spoiler: it\u2019s not like the other platforms). If you\u2019re not interested in the journey and just want the code, <a href=\"#reddit-scraper-end\" data-type=\"internal\" data-id=\"#reddit-scraper-end\"><em>here you go<\/em><\/a>. 
We also have a <a href=\"https:\/\/proxidize.com\/blog\/twitter-scraper\/\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter scraper<\/a> you might be interested in.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-why-python-img-1024x536.jpg\" alt=\"\" class=\"wp-image-80648\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-why-python-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-why-python-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-why-python-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-why-python-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-why-python-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Why Python? (And Not the Reasons You Think)<\/h2>\n\n\n\n<p>I picked <a href=\"https:\/\/proxidize.com\/blog\/what-is-python\/\" target=\"_blank\" data-type=\"blog\" data-id=\"84057\" rel=\"noreferrer noopener\">Python<\/a> for this project, but not for the usual \u201cPython is great for scraping\u201d reasons that every builder agrees on.<\/p>\n\n\n\n<p>Here\u2019s the thing: Reddit scraping is not that complex, but it heavily depends on your use case. 
Scraping a couple of posts is not the same as scraping thousands of posts: the more you try to scale, the more problems you will face, such as async requests, blocked IPs, and error handling that doesn\u2019t fall over after the first timeout.<\/p>\n\n\n\n<p>Most programming languages force you to pick a lane and you stick to it: Go? Super fast but verbose for simple tasks. <a href=\"https:\/\/proxidize.com\/blog\/what-is-javascript\/\" target=\"_blank\" data-type=\"blog\" data-id=\"83360\" rel=\"noreferrer noopener\">JavaScript<\/a>? Although it\u2019s my personal favorite and it\u2019s great for async, it\u2019s painful for data processing. PHP? 
I\u2019ve been working with it recently for a custom plugin, and let\u2019s not go there.<\/p>\n\n\n\n<p>Python offered me something different: a way to take two completely different approaches in the same ecosystem.<\/p>\n\n\n\n<p>For example, for a quick job I can use requests, which is dead simple, reliable, and perfect for synchronous pagination:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>def _make_request(self, url: str, params: Optional&#091;Dict&#093; = None) -> Optional[Dict&#091;str, Any&#093;]:\n    try:\n        response = self.session.get(url, params=params)\n        response.raise_for_status()\n        self._sleep_with_delay()\n        return response.json()\n    except requests.exceptions.RequestException as e:\n        logger.error(f\"Request failed for {url}: {e}\")\n        return None<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 
5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">_make_request<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">url<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">params<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> Optional<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #E0DEF4\">Dict<\/span><span style=\"color: #908CAA\">&#093;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">-&gt;<\/span><span style=\"color: #E0DEF4\"> Optional<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #E0DEF4\">Dict<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> Any<\/span><span style=\"color: #908CAA\">&#093;]:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">try<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        
response <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">session<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">params<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">params<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">raise_for_status<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">_sleep_with_delay<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">json<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">except<\/span><span style=\"color: #E0DEF4\"> requests<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">exceptions<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">RequestException <\/span><span style=\"color: #3E8FB0\">as<\/span><span style=\"color: #E0DEF4\"> e<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #E0DEF4\">        logger<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">error<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;Request failed for <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">: <\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">e<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>For the heavy lifting I used async with proxy rotation:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>proxy_url = None\nif self.proxy_manager:\n    proxy_dict = self.proxy_manager.get_next_http_proxy()\n    if proxy_dict:\n        proxy_url = proxy_dict.get('http', proxy_dict.get('https'))\n\nasync with session.get(url, params=params, 
proxy=proxy_url) as response:\n    if response.status == 200:\n        data = await response.json()\n        return data<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">proxy_url <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">proxy_manager<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    proxy_dict <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">proxy_manager<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get_next_http_proxy<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> proxy_dict<\/span><span 
style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        proxy_url <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> proxy_dict<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;http&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> proxy_dict<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;https&#39;<\/span><span style=\"color: #908CAA\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">async<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">with<\/span><span style=\"color: #E0DEF4\"> session<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">params<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">params<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">proxy<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\">proxy_url<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">as<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: 
status">
#E0DEF4\">status <\/span><span style=\"color: #3E8FB0\">==<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">200<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        data <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">await<\/span><span style=\"color: #E0DEF4\"> response<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">json<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> data<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>I built both Reddit scrapers in the same codebase and let the system choose which one to use based on the job size. Python\u2019s ecosystem works really well for this kind of stuff:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/click.palletsprojects.com\/en\/stable\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Click<\/a> for nice command-line interfaces<\/li>\n\n\n\n<li><a href=\"https:\/\/rich.readthedocs.io\/en\/stable\/introduction.html\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Rich<\/a> for progress bars that actually look good<\/li>\n\n\n\n<li><a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Pandas<\/a> for when the CEO says \u201ccan you export this to CSV?\u201d<\/li>\n\n\n\n<li><a href=\"https:\/\/docs.aiohttp.org\/en\/stable\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Aiohttp<\/a> and <a href=\"https:\/\/requests.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">requests<\/a> for the async and synchronous HTTP paths, respectively<\/li>\n<\/ul>\n\n\n\n<p>Python is the best for this kind of 
use case, and those who say otherwise\u2026 That\u2019s their opinion and they can keep it to themselves.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-pagination-1024x536.jpg\" alt=\"\" class=\"wp-image-80646\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-pagination-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-pagination-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-pagination-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-pagination-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-pagination.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">The Pagination Problem (And Why It Broke My Brain)<\/h2>\n\n\n\n<p>I thought pagination would be the easy part. \u201cGrab page 1, then page 2, then page 3,\u201d right? Wrong.<\/p>\n\n\n\n<p>Reddit doesn\u2019t use page numbers; it uses <em>cursor-based pagination<\/em> with an after token. Each response gives you a token pointing to the next batch to scrape. If you miss that token or handle it wrong, it\u2019s game over: you\u2019ll be stuck in an infinite loop of the same 20\u201325 posts. To be fair, though, that wasn\u2019t the real problem.<\/p>\n\n\n\n<p>The real problem was that differently sized jobs needed completely different approaches. 
Here\u2019s how I split them up:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Small jobs:<\/strong> I just want to scrape a couple of posts. I want something fast, simple, and reliable; no fancy stuff.<\/li>\n\n\n\n<li><strong>Large jobs:<\/strong> I want to scrape a few hundred \u2014maybe thousands\u2014 of posts, which requires speed, error recovery, and proxy rotation to avoid getting blocked.<\/li>\n<\/ul>\n\n\n\n<p>I was faced with an annoying decision: Do I keep it simple and hit a wall later or over-engineer a complex solution that will make me hate myself when I apply it to small tasks?<\/p>\n\n\n\n<p>To which I thought: why not build both? I built two completely different pagination engines and let the system choose which one to use based on the job size.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Direct Request<\/h3>\n\n\n\n<p>For small jobs, I keep it straightforward with synchronous pagination:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>def scrape_subreddit_paginated(self, subreddit: str, sort_by: str = \"hot\",\n                              max_posts: int = 1000, batch_size: int = 100):\n    url = f\"https:\/\/www.reddit.com\/r\/{subreddit}\/{sort_by}.json\"\n    after = None\n    posts_fetched = 0\n    \n    while posts_fetched &lt; max_posts:\n        params = {\n           
 'limit': min(batch_size, 100),  \n            'raw_json': 1\n        }\n        \n        if after:\n            params&#091;'after'&#093; = after\n            \n        data = self._make_request(url, params)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">scrape_subreddit_paginated<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">subreddit<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">sort_by<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;hot&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #E0DEF4\">                              <\/span><span style=\"color: #C4A7E7; font-style: italic\">max_posts<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">int<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1000<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">batch_size<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">int<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">100<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    url <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;https:\/\/www.reddit.com\/r\/<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">subreddit<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">\/<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">sort_by<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">.json&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    after <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">None<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    posts_fetched <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">while<\/span><span style=\"color: #E0DEF4\"> posts_fetched <\/span><span style=\"color: #3E8FB0\">&lt;<\/span><span style=\"color: #E0DEF4\"> max_posts<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        params <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            <\/span><span style=\"color: #F6C177\">&#39;limit&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">min<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">batch_size<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">100<\/span><span style=\"color: #908CAA\">),<\/span><span style=\"color: #E0DEF4\">  <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            <\/span><span style=\"color: #F6C177\">&#39;raw_json&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> after<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            params<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;after&#39;<\/span><span style=\"color: #908CAA\">&#093;<\/span><span 
style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> after<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        data <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">_make_request<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> params<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">The Async Beast<\/h3>\n\n\n\n<p>For large jobs, I went fully with async with proxy rotation:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>async def scrape_subreddit(self, subreddit: str, sort_by: str = \"hot\", \n                          limit: int = 25) -> List[Dict&#091;str, Any&#093;]:\n    while posts_fetched &lt; limit:\n        data = await self._make_request(url, params) \n        await 
asyncio.sleep(self.delay)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">async<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">scrape_subreddit<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">subreddit<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">sort_by<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;hot&quot;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">                          
<\/span><span style=\"color: #C4A7E7; font-style: italic\">limit<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">int<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">25<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">-&gt;<\/span><span style=\"color: #E0DEF4\"> List<\/span><span style=\"color: #908CAA\">[<\/span><span style=\"color: #E0DEF4\">Dict<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> Any<\/span><span style=\"color: #908CAA\">&#093;]:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">while<\/span><span style=\"color: #E0DEF4\"> posts_fetched <\/span><span style=\"color: #3E8FB0\">&lt;<\/span><span style=\"color: #E0DEF4\"> limit<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        data <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">await<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">_make_request<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">url<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> params<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">await<\/span><span style=\"color: #E0DEF4\"> asyncio<\/span><span style=\"color: #908CAA\">.<\/span><span 
style=\"color: #E0DEF4\">sleep<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">delay<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>With these two approaches, we\u2019ve got the same interface but with a different engine for each use case!<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-json-chaos-img-1024x536.jpg\" alt=\"\" class=\"wp-image-80645\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-json-chaos-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-json-chaos-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-json-chaos-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-json-chaos-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-json-chaos-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Taming Reddit\u2019s JSON Chaos<\/h2>\n\n\n\n<p>Reddit\u2019s API gives you data of course, but calling it \u201cclean\u201d would be generous and you\u2019ll probably need to see an ophthalmologist after sifting through it.<\/p>\n\n\n\n<p>Here\u2019s what a raw Reddit post looks like in JSON:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro 
cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>{\n  \"kind\": \"t3\",\n  \"data\": {\n    \"subreddit\": \"programming\",\n    \"selftext\": \"\",\n    \"author_fullname\": \"t2_abc123\",\n    \"title\": \"Some programming post\",\n    \"subreddit_name_prefixed\": \"r\/programming\",\n    \"ups\": 42,\n    \"downs\": 0,\n    \"score\": 42,\n    \"created_utc\": 1703875200.0,\n    \"num_comments\": 15,\n    \"permalink\": \"\/r\/programming\/comments\/abc123\/some_post\/\",\n    \"url\": \"https:\/\/example.com\",\n    \"author\": \"username\",\n    \"is_self\": false,\n    \"stickied\": false,\n    \"over_18\": false,\n    \/\/ ... 
and about 50 more fields you don't need\n  }\n}<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">kind<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;t3&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">data<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">subreddit<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;programming&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">selftext<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">author_fullname<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;t2_abc123&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">title<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;Some programming post&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">subreddit_name_prefixed<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;r\/programming&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">ups<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">42<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span 
style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">downs<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">score<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">42<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">created_utc<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1703875200.0<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">num_comments<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">15<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">permalink<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;\/r\/programming\/comments\/abc123\/some_post\/&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: 
#9CCFD8\">url<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;https:\/\/example.com&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">author<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;username&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">is_self<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">false<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">stickied<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">false<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">over_18<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">false<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA; font-style: italic\">\/\/<\/span><span style=\"color: #6E6A86; font-style: italic\"> ... 
and about 50 more fields you don&#39;t need<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #908CAA\">}<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>This is honestly a nightmare for analysis:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Timestamps are Unix epochs (good luck reading those)<\/li>\n\n\n\n<li>Inconsistent field names<\/li>\n\n\n\n<li>Tons of fields you will never use<\/li>\n\n\n\n<li>Some fields might be missing entirely<\/li>\n<\/ul>\n\n\n\n<p>I needed consistent, clean data that would make my life easier and wouldn\u2019t break my analysis code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Cleaning Process<\/h3>\n\n\n\n<p>Here\u2019s how I transformed Reddit\u2019s data into something nice and useable:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>def _clean_post_data(self, raw_post: Dict) -> Dict&#091;str, Any&#093;:\n    return {\n        'id': raw_post.get('id'),\n        'title': raw_post.get('title'),\n        'author': raw_post.get('author'),\n        'score': raw_post.get('score', 0),\n        'upvotes': raw_post.get('ups', 0),\n        'downvotes': raw_post.get('downs', 
0),\n        'upvote_ratio': raw_post.get('upvote_ratio', 0),\n        'url': raw_post.get('url'),\n        'permalink': f\"https:\/\/reddit.com{raw_post.get('permalink', '')}\",\n        'created_utc': raw_post.get('created_utc'),\n        'created_date': self._format_date(raw_post.get('created_utc')),\n        'num_comments': raw_post.get('num_comments', 0),\n        'subreddit': raw_post.get('subreddit'),\n        'is_self_post': raw_post.get('is_self', False),\n        'is_nsfw': raw_post.get('over_18', False),\n        'is_stickied': raw_post.get('stickied', False),\n        'flair': raw_post.get('link_flair_text'),\n        'post_text': raw_post.get('selftext', ''),\n        'domain': raw_post.get('domain')\n    } <\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">_clean_post_data<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">self<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">raw_post<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> Dict<\/span><span style=\"color: 
#908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">-&gt;<\/span><span style=\"color: #E0DEF4\"> Dict<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> Any<\/span><span style=\"color: #908CAA\">&#093;:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;id&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;id&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;title&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;title&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;author&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;author&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: 
#F6C177\">&#39;score&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;score&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;upvotes&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;ups&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;downvotes&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;downs&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;upvote_ratio&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;upvote_ratio&#39;<\/span><span 
style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;url&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;url&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;permalink&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">f<\/span><span style=\"color: #F6C177\">&quot;https:\/\/reddit.com<\/span><span style=\"color: #3E8FB0\">{<\/span><span style=\"color: #E0DEF4\">raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;permalink&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;&#39;<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #3E8FB0\">}<\/span><span style=\"color: #F6C177\">&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;created_utc&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;created_utc&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        
<\/span><span style=\"color: #F6C177\">&#39;created_date&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #E0DEF4; font-style: italic\">self<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">_format_date<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;created_utc&#39;<\/span><span style=\"color: #908CAA\">)),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;num_comments&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;num_comments&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">0<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;subreddit&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;subreddit&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;is_self_post&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: 
#F6C177\">&#39;is_self&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">False<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;is_nsfw&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;over_18&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">False<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;is_stickied&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;stickied&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">False<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;flair&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;link_flair_text&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;post_text&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> 
raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;selftext&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;&#39;<\/span><span style=\"color: #908CAA\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #F6C177\">&#39;domain&#39;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> raw_post<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">get<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #F6C177\">&#39;domain&#39;<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">}<\/span><span style=\"color: #E0DEF4\"> <\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">What This Gets You<\/h3>\n\n\n\n<p>This is the same post, but after cleaning it up it now looks like this:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>{\n  \"id\": \"abc123\",\n  \"title\": \"Some programming 
post\", \n  \"author\": \"username\",\n  \"score\": 42,\n  \"created_date\": \"2023-12-29 18:40:00\",\n  \"num_comments\": 15,\n  \"permalink\": \"https:\/\/reddit.com\/r\/programming\/comments\/abc123\/some_post\/\",\n  \"is_nsfw\": false,\n  \"subreddit\": \"programming\"\n}<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">id<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;abc123&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">title<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;Some programming post&quot;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: 
#908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">author<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;username&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">score<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">42<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">created_date<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;2023-12-29 18:40:00&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">num_comments<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">15<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">permalink<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;https:\/\/reddit.com\/r\/programming\/comments\/abc123\/some_post\/&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: 
&quot;<\/span><span style=\"color: #9CCFD8\">is_nsfw">
#908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">is_nsfw<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">false<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">subreddit<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;programming&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #908CAA\">}<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>This is the kind of data we can work with, and it won\u2019t break the code:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Consistent field names across all posts<\/li>\n\n\n\n<li>Human-readable data<\/li>\n\n\n\n<li>No missing fields<\/li>\n\n\n\n<li>Only the data you actually need<\/li>\n<\/ul>\n\n\n\n<p>Reddit changes their APIs (and they do that a lot). 
When they do, I only need to update the cleaning function in one place and we\u2019re back in business.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-data-storage-img-1024x536.jpg\" alt=\"\" class=\"wp-image-80643\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-data-storage-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-data-storage-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-data-storage-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-data-storage-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-data-storage-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Data Storage: Pick Your Favorite<\/h2>\n\n\n\n<p>Clean data is useless if you can\u2019t get it where you need it \u2014 facts. While building the code, I fielded a lot of requests from the team, with questions like \u201ccan we get this in CSV format?\u201d or \u201ccan we use this to train models?\u201d.<\/p>\n\n\n\n<p>Trying to guess what someone needs is a losing game (I just hate talking to people). 
The data scientist wants JSON for flexibility and the analyst wants CSV for their beloved Excel sheet, so I built multiple output formats and let people decide for themselves.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">JSON: The Default Choice<\/h3>\n\n\n\n<p>Most of the time, JSON just works (why would you choose anything else, really?):<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>def save_data(data, output_file: str, format: str = \"json\"):\n    if format == \"json\":\n        with open(output_file, 'w', encoding='utf-8') as f:\n            json.dump(data, f, indent=2, ensure_ascii=False)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" 
tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">def<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">save_data<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #C4A7E7; font-style: italic\">data<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">output_file<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">format<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #9CCFD8\">str<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;json&quot;<\/span><span style=\"color: #908CAA\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">format<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">==<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;json&quot;<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #3E8FB0\">with<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">open<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">output_file<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&#39;w&#39;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span 
style=\"color: #C4A7E7; font-style: italic\">encoding<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #F6C177\">&#39;utf-8&#39;<\/span><span style=\"color: #908CAA\">)<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">as<\/span><span style=\"color: #E0DEF4\"> f<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">            json<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">dump<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">data<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> f<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">indent<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">2<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">ensure_ascii<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">False<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>It\u2019s clean, readable, and preserves data types \u2014 perfect for feeding into other tools or doing analysis in general.<\/p>\n\n\n\n<p>What you get as we mentioned earlier:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" 
aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>[\n  {\n    \"id\": \"abc123\",\n    \"title\": \"Some programming post\",\n    \"author\": \"username\", \n    \"score\": 42,\n    \"created_date\": \"2025-08-19 12:00:00\",\n    \"comments\": &#091;\n      {\n        \"author\": \"commenter1\",\n        \"body\": \"Great post!\",\n        \"score\": 5\n      }\n    &#093;\n  }\n]<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #908CAA\">[<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">id<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;abc123&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: 
#9CCFD8\">title<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;Some programming post&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">author<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;username&quot;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">score<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">42<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">created_date<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;2025-08-19 12:00:00&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">comments<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">&#091;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span 
style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">author<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;commenter1&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">body<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;Great post!&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">score<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">5<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&#093;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #908CAA\">]<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">CSV: For the Excel(lent) People<\/h3>\n\n\n\n<p>Sometimes you just need a spreadsheet (don\u2019t ask me why):<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" 
style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>elif format == \"csv\":\n    df = pd.DataFrame(data)\n    df.to_csv(output_file, index=False)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #3E8FB0\">elif<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EB6F92; font-style: italic\">format<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">==<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;csv&quot;<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    df <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> pd<\/span><span style=\"color: #908CAA\">.<\/span><span 
style=\"color: #E0DEF4\">DataFrame<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">data<\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    df<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">to_csv<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">output_file<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">index<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">False<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>What you get:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>id,title,author,score,created_date,num_comments\nabc123,Some programming post,username,42,2025-08-19 12:00:00,15\ndef456,Another post,user2,128,2025-08-19 13:00:00,42<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 
002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #e0def4\">id,title,author,score,created_date,num_comments<\/span><\/span>\n<span class=\"line\"><span style=\"color: #e0def4\">abc123,Some programming post,username,42,2025-08-19 12:00:00,15<\/span><\/span>\n<span class=\"line\"><span style=\"color: #e0def4\">def456,Another post,user2,128,2025-08-19 13:00:00,42<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>The perfect format for quick analysis, sharing with non-technical people (I love you guys), or just importing it into a database.<\/p>\n\n\n\n<p>As mentioned earlier, the scraper also has a CLI, so all of this can be run straight from the command line:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output 
posts.json\n\npython3 -m reddit_scraper.cli json subreddit programming --limit 50 --output posts.csv --format csv\n\npython3 -m reddit_scraper.cli interactive<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--output<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">posts.json<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: 
#F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--output<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">posts.csv<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--format<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">csv<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">interactive<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Handling Edge Cases<\/h3>\n\n\n\n<p>Real data can sometimes have problems:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Unicode characters in titles and comments (hence ensure_ascii=False)<\/li>\n\n\n\n<li>Large datasets that might not fit in memory<\/li>\n\n\n\n<li>Nested data like comments (JSON handles this gracefully)<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" 
style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>json.dump(data, f, indent=2, ensure_ascii=False)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">json<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">dump<\/span><span style=\"color: #908CAA\">(<\/span><span style=\"color: #E0DEF4\">data<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> f<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #C4A7E7; font-style: italic\">indent<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">2<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: 
#C4A7E7; font-style: italic\">ensure_ascii<\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #EA9A97\">False<\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>At the end of the day, whether you\u2019re a builder, data analyst, or just someone who loves spreadsheets, the code will handle all these cases for you.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-the-brain-img-1024x536.jpg\" alt=\"\" class=\"wp-image-80647\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-the-brain-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-the-brain-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-the-brain-img-768x402.jpg 768w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-the-brain-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-the-brain-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">The Brain: When to Go Simple vs When to Go Heavy<\/h2>\n\n\n\n<p>Here\u2019s the part that took me way too long to figure out. After a couple of hours of testing, I found out when we actually need the complex stuff.<\/p>\n\n\n\n<p>I started by always using the async scraper with proxy rotation. It seemed logical at the time, but I was wrong as usual. 
For small jobs, all of that complexity was just\u2026 slow.<\/p>\n\n\n\n<p>Spinning up async sessions, initializing proxy managers, health-checking our <a href=\"https:\/\/proxidize.com\/proxy-server\/reddit-proxies\/\" target=\"_blank\" rel=\"noreferrer noopener\">Reddit proxies<\/a> \u2014 all of that just to scrape a few posts that would have taken a couple of seconds to scrape using simple GET requests.<\/p>\n\n\n\n<p>The problem is that we can\u2019t take the simple route all the time. For larger jobs, plain GET requests are not enough: we might hit Reddit\u2019s API limit or get our <a href=\"https:\/\/proxidize.com\/blog\/what-is-an-ip-address\/\" target=\"_blank\" data-type=\"blog\" data-id=\"83356\" rel=\"noreferrer noopener\">IP addresses<\/a> flagged, so we needed a decision engine that could look at the job and apply the right approach automatically.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Magic Number Is 100<\/h3>\n\n\n\n<p>The determining factor is, unsurprisingly, Reddit\u2019s API limit. 
I figured that out after a couple of tests (I should have read the documentation).<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>use_proxies = inputs&#091;'post_count'&#093; > 100 and config_manager.has_proxies()\n\nif inputs&#091;'post_count'&#093; > 100:\n    return await handle_large_scraping_job(\n        inputs&#091;'subject'&#093;, inputs&#091;'post_count'&#093;, inputs&#091;'sort_method'&#093;,\n        scraper_config, proxy_manager, captcha_solver,\n        use_proxies, use_captcha\n    )\nelse:\n    return handle_regular_scraping_job(\n        inputs&#091;'subject'&#093;, inputs&#091;'post_count'&#093;, inputs&#091;'sort_method'&#093;,\n        scraper_config, proxy_manager, captcha_solver,\n        use_proxies, use_captcha\n    )<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 
2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #E0DEF4\">use_proxies <\/span><span style=\"color: #3E8FB0\">=<\/span><span style=\"color: #E0DEF4\"> inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;post_count&#39;<\/span><span style=\"color: #908CAA\">&#093;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">&gt;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">100<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">and<\/span><span style=\"color: #E0DEF4\"> config_manager<\/span><span style=\"color: #908CAA\">.<\/span><span style=\"color: #E0DEF4\">has_proxies<\/span><span style=\"color: #908CAA\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">if<\/span><span style=\"color: #E0DEF4\"> inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;post_count&#39;<\/span><span style=\"color: #908CAA\">&#093;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">&gt;<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">100<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">await<\/span><span style=\"color: #E0DEF4\"> handle_large_scraping_job<\/span><span style=\"color: #908CAA\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;subject&#39;<\/span><span style=\"color: #908CAA\">&#093;,<\/span><span style=\"color: #E0DEF4\"> inputs<\/span><span 
style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;post_count&#39;<\/span><span style=\"color: #908CAA\">&#093;,<\/span><span style=\"color: #E0DEF4\"> inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;sort_method&#39;<\/span><span style=\"color: #908CAA\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        scraper_config<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> proxy_manager<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> captcha_solver<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        use_proxies<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> use_captcha<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #3E8FB0\">else<\/span><span style=\"color: #908CAA\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #3E8FB0\">return<\/span><span style=\"color: #E0DEF4\"> handle_regular_scraping_job<\/span><span style=\"color: #908CAA\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;subject&#39;<\/span><span style=\"color: #908CAA\">&#093;,<\/span><span style=\"color: #E0DEF4\"> inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;post_count&#39;<\/span><span style=\"color: #908CAA\">&#093;,<\/span><span style=\"color: #E0DEF4\"> inputs<\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #F6C177\">&#39;sort_method&#39;<\/span><span style=\"color: #908CAA\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        
scraper_config<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> proxy_manager<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> captcha_solver<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">        use_proxies<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> use_captcha<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>This will translate to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>&lt;= 100 posts: Direct requests work fine, Reddit doesn&#8217;t care, job is done in a matter of seconds.&nbsp;<\/li>\n\n\n\n<li>&gt; 100 posts: You start hitting rate limits (now we are talking), and here we will use proxy rotation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What This Looks Like in Practice<\/h3>\n\n\n\n<p>Do you want 50 posts from r\/programming?<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>Method: RequestsScraper direct (small job)\nProxies: No<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" 
fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #e0def4\">Method: RequestsScraper direct (small job)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #e0def4\">Proxies: No<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Direct HTTP requests&nbsp;<\/li>\n\n\n\n<li>No proxy overhead<\/li>\n\n\n\n<li>Done in 5\u201310 seconds<\/li>\n<\/ul>\n\n\n\n<p>Do you want 500 posts from r\/programming?<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>Method: JSONScraper with proxies (large job)\nProxies: Yes<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" 
style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #e0def4\">Method: JSONScraper with proxies (large job)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #e0def4\">Proxies: Yes<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li>HTTP proxy rotation on every request<\/li>\n\n\n\n<li>Async pagination<\/li>\n\n\n\n<li>Protected from rate limit<\/li>\n<\/ul>\n\n\n\n<p>The system will take the number of posts as input and then act based on that number.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized centered\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-customization-img-1024x536.jpg\" alt=\"\" class=\"wp-image-80642\" style=\"object-fit:cover\" srcset=\"https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-customization-img-1024x536.jpg 1024w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-customization-img-300x157.jpg 300w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-customization-img-768x402.jpg 768w, 
https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-customization-img-600x314.jpg 600w, https:\/\/proxidize.com\/wp-content\/uploads\/2025\/08\/reddit-scraper-customization-img.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">It\u2019s All About the Customization<\/h2>\n\n\n\n<p>Most <a href=\"https:\/\/proxidize.com\/blog\/web-scraping\/\" target=\"_blank\" rel=\"noreferrer noopener\">web scraping<\/a> tutorials serve only one use case most of the time, so while building this I was thinking of dynamic ways of adding your subreddit or sorting.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Different Subreddits: Just Change the Parameter<\/h3>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>python3 -m reddit_scraper.cli interactive\n\npython3 -m reddit_scraper.cli json subreddit python --limit 50\n\npython3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 
00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">interactive<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">python<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: 
#E0DEF4\"> <\/span><span style=\"color: #F6C177\">technology<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--config<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">config.json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Different Sorting: Hot, New, Top, Rising<\/h3>\n\n\n\n<p>Reddit has different ways to sort posts, and each gives you different data:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>python3 -m reddit_scraper.cli json subreddit programming --sort hot --limit 50\n\npython3 -m reddit_scraper.cli json subreddit programming --sort new --limit 50\n\npython3 -m reddit_scraper.cli json subreddit programming --sort top --limit 50\n\npython3 -m reddit_scraper.cli json subreddit programming --sort rising --limit 50<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" 
stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--sort<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">hot<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--sort<\/span><span 
style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">new<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--sort<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">top<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--sort<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">rising<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: 
#EA9A97\">50<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Comments: When You Need the Full Picture<\/h3>\n\n\n\n<p>In some cases you need the full context of the post you are looking for:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>python3 -m reddit_scraper.cli json subreddit programming --limit 100\n\npython3 -m reddit_scraper.cli json subreddit-with-comments programming --limit 100 --include-comments --comment-limit 50\n\npython3 -m reddit_scraper.cli json comments programming POST_ID --sort best --output single_post_comments.json<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki 
rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">100<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit-with-comments<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">100<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--include-comments<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--comment-limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: 
#F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">comments<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">POST_ID<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--sort<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">best<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--output<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">single_post_comments.json<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>When you include comments, the scraper makes a separate API call for each post\u2019s comments. That\u2019s where the proxy rotation really comes into play, because for each 100 posts there will be 101 total requests (1 for posts + 100 for comments).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Output Formats<\/h3>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(1 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output analysis.json\n 
\npython3 -m reddit_scraper.cli json subreddit programming --limit 50 --output analysis.csv --format csv\n\npython3 -m reddit_scraper.cli interactive<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--output<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">analysis.json<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: 
#E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">json<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--limit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">50<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--output<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">analysis.csv<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">--format<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">csv<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">python3<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">-m<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">reddit_scraper.cli<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">interactive<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Configuration File<\/h3>\n\n\n\n<p>In the config.json file we have all the environment variables we need:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" 
aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>{\n  \"proxies\": &#091;\n    {\n      \"host\": \"your-proxy.com\",\n      \"port\": 8080,\n      \"username\": \"your_username\", \n      \"password\": \"your_password\",\n      \"proxy_type\": \"http\"\n    }\n  &#093;,\n  \"scraping\": {\n    \"default_delay\": 1.0,\n    \"max_retries\": 3,\n    \"requests_per_minute\": 60,\n    \"user_agent\": \"RedditScraper\/1.0.0\"\n  }\n}<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">proxies<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">&#091;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: 
#9CCFD8\">host<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;your-proxy.com&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">port<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">8080<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">username<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;your_username&quot;<\/span><span style=\"color: #908CAA\">,<\/span><span style=\"color: #E0DEF4\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">password<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;your_password&quot;<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">      <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">proxy_type<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;http&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span 
style=\"color: #908CAA\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">scraping<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #908CAA\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">default_delay<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">1.0<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">max_retries<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">3<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">requests_per_minute<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #EA9A97\">60<\/span><span style=\"color: #908CAA\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">    <\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #9CCFD8\">user_agent<\/span><span style=\"color: #908CAA\">&quot;<\/span><span style=\"color: #908CAA\">:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">&quot;RedditScraper\/1.0.0&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #908CAA\">}<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #908CAA\">}<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:12px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>It\u2019s flexible: nothing is hardcoded, obviously\u2026 We&#8217;re professionals here.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Interactive Mode<\/h3>\n\n\n\n<p>We wanted you to love the CLI even more, so we built an interactive mode for you!<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.75rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#e0def4;--cbp-line-number-width:calc(2 * 0.6 * .75rem);line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#e0def4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>Enter subreddit name: programming\nHow many posts? &#091;50&#093;: 200  \nSort method (hot, new, top, rising) &#091;hot&#093;: new\nUse captcha solving? 
&#091;Y\/n&#093;: n\nProxy usage: Yes (automatic for >100 posts)\nOutput filename: programming_new_posts.json\n\nStarting scrape:\n  Subreddit: r\/programming\n  Posts: 200\n  Sort: new  \n  Method: JSONScraper with proxies (large job)\n  Proxies: Yes\n  Captcha: No<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki rose-pine-moon\" style=\"background-color: #232136\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #EA9A97\">Enter<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">subreddit<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">name:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming<\/span><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">How<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">many<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">posts?<\/span><span style=\"color: #E0DEF4\"> &#091;50&#093;: 200  <\/span><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">Sort<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">method<\/span><span style=\"color: #E0DEF4\"> (hot, <\/span><span style=\"color: #F6C177\">new,<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">top,<\/span><span style=\"color: #E0DEF4\"> 
<\/span><span style=\"color: #F6C177\">rising<\/span><span style=\"color: #E0DEF4\">) <\/span><span style=\"color: #908CAA\">&#091;<\/span><span style=\"color: #E0DEF4\">hot<\/span><span style=\"color: #908CAA\">&#093;<\/span><span style=\"color: #E0DEF4\">: new<\/span><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">Use<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">captcha<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">solving?<\/span><span style=\"color: #E0DEF4\"> &#091;Y\/n&#093;: n<\/span><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">Proxy<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">usage:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">Yes<\/span><span style=\"color: #E0DEF4\"> (automatic <\/span><span style=\"color: #F6C177\">for<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #3E8FB0\">&gt;<\/span><span style=\"color: #F6C177\">100<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">posts<\/span><span style=\"color: #E0DEF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">Output<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">filename:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">programming_new_posts.json<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #EA9A97\">Starting<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">scrape:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #EA9A97\">Subreddit:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">r\/programming<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #EA9A97\">Posts:<\/span><span style=\"color: #E0DEF4\"> <\/span><span 
style=\"color: #EA9A97\">200<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #EA9A97\">Sort:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">new<\/span><span style=\"color: #E0DEF4\">  <\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #EA9A97\">Method:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">JSONScraper<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">with<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">proxies<\/span><span style=\"color: #E0DEF4\"> (large <\/span><span style=\"color: #F6C177\">job<\/span><span style=\"color: #E0DEF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #EA9A97\">Proxies:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">Yes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #E0DEF4\">  <\/span><span style=\"color: #EA9A97\">Captcha:<\/span><span style=\"color: #E0DEF4\"> <\/span><span style=\"color: #F6C177\">No<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reddit-scraper-end\">Conclusion<\/h2>\n\n\n\n<p>Here\u2019s the thing about Reddit: it\u2019s one of the easiest of the major platforms to scrape. With clean JSON endpoints and reasonable rate limits, Reddit practically invites you to scrape its data, on its own terms.<\/p>\n\n\n\n<p>However, the difference between a small project and a production-scale one comes down to one key point: different job sizes call for different approaches. 
The goal is not to build the most complex Reddit scraper of all time; it\u2019s to build a system that adapts to the job at hand, regardless of size.<\/p>\n\n\n\n<p><strong>Real Lessons:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use JSON endpoints (\/r\/subreddit\/hot.json), not HTML parsing<\/li>\n\n\n\n<li>Handle pagination properly with Reddit\u2019s after token<\/li>\n\n\n\n<li>Clean the data early: Reddit\u2019s JSON is messy<\/li>\n\n\n\n<li>Automate the complexity: users shouldn\u2019t have to think about whether or not they should be using proxies<\/li>\n\n\n\n<li>Start simple, scale smart, and add features only when you actually need them<\/li>\n<\/ul>\n\n\n\n<p>In the end, we built this project to scale and to fit different use cases. Feel free to take a look at the code: it&#8217;s an <a href=\"https:\/\/github.com\/proxidize\/reddit-scraper\" target=\"_blank\" rel=\"noreferrer noopener\">open-source Reddit scraper<\/a> created for builders who want to access data in the easiest way 
possible.<\/p>\n","protected":false},"author":8854,"featured_media":80644,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","format":"standard","categories":[266],"tags":[],"class_list":["post-80626","blog","type-blog","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-tutorials-and-programming"],"acf":[],"_links":{"self":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/80626","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/users\/8854"}],"replies":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/comments?post=80626"}],"version-history":[{"count":8,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/80626\/revisions"}],"predecessor-version":[{"id":88025,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/blog\/80626\/revisions\/88025"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media\/80644"}],"wp:attachment":[{"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/media?parent=80626"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/categories?post=80626"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/proxidize.com\/wp-json\/wp\/v2\/tags?post=80626"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}