This program is designed to scrape content from Patreon creators, focusing primarily on posts. The scraped content is saved as individual HTML files and can be further consolidated into a single HTML file.
- `getlinks.py`: Fetches the links to the posts of a given Patreon creator.
- `createHtml.py`: Downloads the content of the links fetched by `getlinks.py` and saves them as individual HTML files. It also handles downloading images linked within the posts.
- `consolidatehtml.py`: Consolidates all the HTML files generated by `createHtml.py` into a single HTML file.
- `main.py`: An orchestrator that runs the above three modules in sequence.
- Starting a Remote Debugging Session in Chrome:
  - Depending on your platform, start Chrome with remote debugging enabled on port 9223:
    - Windows: `chrome.exe --remote-debugging-port=9223`
    - macOS: `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9223`
    - Linux: `google-chrome --remote-debugging-port=9223`
  - Ensure no other Chrome windows are open when you start this.
  - Important: If the Patreon creator's content is behind a paywall, log in with a Patreon account that is subscribed to the creator in the Chrome session before running the script.
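Before running the scripts, it can help to confirm that Chrome is actually listening on the debugging port. The following is a minimal sketch and not part of this project: the helper names `devtools_version_url` and `chrome_debugger_info` are hypothetical, and it relies on the `/json/version` endpoint that Chrome serves when remote debugging is enabled.

```python
import http.client
import json
import urllib.request


def devtools_version_url(port: int = 9223, host: str = "127.0.0.1") -> str:
    """Build the URL of Chrome's DevTools version endpoint (hypothetical helper)."""
    return f"http://{host}:{port}/json/version"


def chrome_debugger_info(port: int = 9223, timeout: float = 2.0):
    """Return Chrome's version info as a dict, or None if nothing usable answers."""
    try:
        with urllib.request.urlopen(devtools_version_url(port), timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts,
        # ValueError covers a response that is not valid JSON.
        return None
```

Calling `chrome_debugger_info()` should return a dictionary (including a `Browser` entry) when the session is up, and `None` otherwise.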
- Running the Program:
  - Navigate to the directory containing the scripts.
  - Run `main.py`: `python main.py`
  - Follow the on-screen prompts.
- Providing Inputs:
  - Patreon Creator Name: When prompted, enter the username of the Patreon creator whose posts you want to scrape.
  - Date Range: You'll be asked to provide a start month/year and an end month/year. This range determines which posts will be scraped based on their publication date.
  - Select a Directory: A file dialog will open, asking you to select a directory. The HTML files (and associated images) will be saved in a subdirectory within your chosen directory.
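To illustrate how a month/year range can select posts by publication date, here is a small sketch. It is not taken from the scripts: the function name `in_month_range`, the sample posts, and the choice to treat both the start and end months as inclusive are assumptions made for illustration.

```python
from datetime import date


def in_month_range(published: date, start: tuple, end: tuple) -> bool:
    """True if `published` falls between the start and end (year, month), inclusive."""
    # Tuples compare element by element, so (year, month) pairs order chronologically.
    return start <= (published.year, published.month) <= end


# Made-up sample data standing in for scraped post metadata.
posts = [
    {"title": "January update", "published": date(2023, 1, 15)},
    {"title": "March update", "published": date(2023, 3, 2)},
    {"title": "June update", "published": date(2023, 6, 30)},
]

# Keep only posts published from February 2023 through June 2023.
selected = [p["title"] for p in posts if in_month_range(p["published"], (2023, 2), (2023, 6))]
print(selected)  # ['March update', 'June update']
```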
- Output:
  - The individual posts will be saved as HTML files in the specified directory, under a subfolder named after the Patreon creator.
  - A consolidated HTML file will also be generated in a new specified directory.
- Post-Processing:
  - You can open the generated HTML files in a browser. For the consolidated file, ensure all images and linked content display correctly.
- Re-Scraping or Re-Running:
  - If you encounter issues with the fetched links or the output seems incorrect, delete the `links.json` file and run the program again.
  - If you wish to scrape the posts again or perform another scraping session for a different creator, either delete the existing `patreonHTML` folder or rename it to avoid potential conflicts.
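The cleanup before a re-run can also be scripted. The sketch below is only an illustration: `prepare_rerun` is a hypothetical helper, and the locations of `links.json` and `patreonHTML` are assumptions (passed in as arguments), since where the scripts place them depends on your setup. It renames the old folder with a timestamp rather than deleting it. The demonstration runs inside a temporary directory so it does not touch real files.

```python
import tempfile
import time
from pathlib import Path


def prepare_rerun(links_dir: Path, output_parent: Path):
    """Delete links.json and move an existing patreonHTML folder out of the way.

    Returns the path of the renamed folder, or None if there was nothing to rename.
    """
    # Remove the cached links file if it exists (missing_ok needs Python 3.8+).
    (links_dir / "links.json").unlink(missing_ok=True)

    old_output = output_parent / "patreonHTML"
    if not old_output.is_dir():
        return None
    # Keep the previous results under a timestamped name.
    backup = output_parent / f"patreonHTML_{time.strftime('%Y%m%d_%H%M%S')}"
    old_output.rename(backup)
    return backup


# Demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "links.json").write_text("[]")
    (root / "patreonHTML").mkdir()
    backup = prepare_rerun(root, root)
    links_removed = not (root / "links.json").exists()
    backup_created = backup is not None and backup.is_dir()
```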
- Ensure the required Python packages are installed. You can do this using: `pip install -r requirements.txt`
- Make sure the ChromeDriver version matches the version of your Chrome browser.
- The program is designed to handle most common scenarios, but there may be unique posts or media types that aren't perfectly handled. Always verify the output.
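One way to check the ChromeDriver note above is to compare the major version numbers reported by `google-chrome --version` and `chromedriver --version`. This is a sketch under assumptions: the helper names are hypothetical, the version strings are expected to contain a four-part number such as `114.0.5735.90`, and matching major versions is treated as sufficient.

```python
import re


def major_version(version_text: str) -> int:
    """Extract the major version from text such as 'ChromeDriver 114.0.5735.90 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text: str, driver_text: str) -> bool:
    """True if Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_text) == major_version(driver_text)
```

For example, `versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 114.0.5735.90 (abc)")` evaluates to `True`, while a Chrome 115 string paired with a ChromeDriver 114 string evaluates to `False`.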
- ChromeDriver Version Mismatch: If there's a mismatch between ChromeDriver and your Chrome browser, download the correct version of ChromeDriver and replace the existing executable.
- Content Not Appearing in HTML: Sometimes, the content might not appear as expected due to the dynamic nature of web pages. Ensure JavaScript content has loaded before scraping.
- Network Errors: If you encounter network-related errors, ensure you have a stable internet connection and that the Patreon website is accessible from your location.
- File Errors: Ensure the directory where files are being written has the necessary write permissions.
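For file errors, a quick way to test write access is to try creating a temporary file in the target directory. The sketch below is illustrative only; `is_writable` is a hypothetical helper and not part of the scripts.

```python
import tempfile
from pathlib import Path


def is_writable(directory: Path) -> bool:
    """Return True if a file can actually be created inside `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers missing directories as well as permission problems.
        return False
```

Calling `is_writable(Path("path/to/output"))` before a run tells you whether the chosen directory (a placeholder path here) can receive the HTML files.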
For further issues or customization requirements, refer to the source code or seek assistance from a developer familiar with web scraping and Python.