Reddit Data Scraper
A modern, interactive Reddit data scraper built with Streamlit. Extract posts, comments, and analytics from any subreddit or specific Reddit post with a beautiful, responsive interface.

Features
- Subreddit Scraping: Extract posts from any subreddit with advanced filtering options
- Single Post Analysis: Deep dive into specific posts and their entire comment threads
- Real-time Analytics: Interactive charts and visualizations using Plotly
- Advanced Filtering: Filter by date range, score, comments, awards, NSFW content, and more
- Multiple Export Formats: Download data as CSV or JSON
- Modern Dark Theme: Beautiful, responsive UI with custom CSS styling
- Fast & Efficient: Optimized data fetching with caching for better performance
Quick Start (Streamlit Cloud)
Deploy to Streamlit Cloud
1. Fork this repository to your GitHub account
2. Get Reddit API credentials:
   - Go to Reddit Apps
   - Click "Create App" or "Create Another App"
   - Choose "script" as the app type
   - Note down your client_id and client_secret
3. Deploy to Streamlit Cloud:
   - Go to share.streamlit.io
   - Click "New app" and connect your GitHub repository
   - Set the main file path to main.py
   - Configure secrets (see step 4)
4. Configure App Secrets:
   - In your Streamlit Cloud app dashboard, go to Settings > Secrets
   - Add your Reddit API credentials:
       REDDIT_CLIENT_ID = "your_client_id_here"
       REDDIT_CLIENT_SECRET = "your_client_secret_here"
       REDDIT_USER_AGENT = "YourAppName/1.0 by /u/yourusername"
5. Deploy your app!
Your Reddit scraper will be live and accessible to everyone!
Local Development
Prerequisites
- Python 3.8 or higher
- Reddit API credentials (see above)
Installation
- Clone the repository:
    git clone https://github.com/pakagronglb/reddit-scraper.git
    cd reddit-scraper
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables:
  - Copy .streamlit/secrets.toml to create your local secrets file
  - Or create a .env file with:
      REDDIT_CLIENT_ID=your_client_id_here
      REDDIT_CLIENT_SECRET=your_client_secret_here
      REDDIT_USER_AGENT=YourAppName/1.0 by /u/yourusername
- Run the application:
    streamlit run main.py
The app will be available at http://localhost:8501
Usage Guide
Subreddit Scraping
- Enter Subreddit Name: Type the subreddit name (without r/)
- Set Time Filter: Choose from "All", "Last Week", "Last Month", "Last Year", or custom date range
- Configure Filters: Set minimum score, comments, awards, and content preferences
- Start Scraping: Click the "Start Scraping" button
- View Results: Explore the data with interactive charts and tables
- Download Data: Export your results as CSV or JSON
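The filtering step above can be sketched in plain Python. This is a minimal illustration, not the app's actual code: the `Post` record, its field names, and the `filter_posts` helper are hypothetical, and the real app may apply its filters differently (for example on a Pandas DataFrame).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass
class Post:
    """Hypothetical minimal post record; the app's real fields may differ."""
    title: str
    score: int
    num_comments: int
    created_utc: float  # Unix timestamp in seconds
    over_18: bool = False


def filter_posts(
    posts: Iterable[Post],
    min_score: int = 0,
    min_comments: int = 0,
    start: Optional[datetime] = None,  # timezone-aware lower bound
    end: Optional[datetime] = None,    # timezone-aware upper bound
    include_nsfw: bool = True,
) -> List[Post]:
    """Keep only the posts that satisfy every configured filter."""
    kept = []
    for post in posts:
        created = datetime.fromtimestamp(post.created_utc, tz=timezone.utc)
        if post.score < min_score or post.num_comments < min_comments:
            continue
        if start is not None and created < start:
            continue
        if end is not None and created > end:
            continue
        if post.over_18 and not include_nsfw:
            continue
        kept.append(post)
    return kept
```

Calling `filter_posts(posts, min_score=100, include_nsfw=False)` would keep only non-NSFW posts with a score of at least 100. The date bounds must be timezone-aware because the creation time is converted to UTC before comparison.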
Single Post Analysis
- Enter Post URL: Paste the full Reddit post URL
- Configure Options: Set comment filtering and sorting preferences
- Scrape Post: Click "Scrape Post & Comments"
- Analyze Results: Review post metrics and comment analytics
- Export Data: Download post and comment data separately
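The first step relies on turning a full post URL into a post ID. The sketch below shows one way to do that with the standard library, assuming the usual `/r/<subreddit>/comments/<id>/<slug>/` URL shape. The `extract_post_id` helper is hypothetical and not taken from the app; PRAW can also accept a URL directly through `reddit.submission(url=...)`.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches the "/comments/<id>" segment found in full Reddit post URLs.
_POST_ID_PATTERN = re.compile(r"/comments/([a-z0-9]+)(?:/|$)", re.IGNORECASE)


def extract_post_id(url: str) -> Optional[str]:
    """Return the post ID from a full Reddit post URL, or None if absent."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    # Accept reddit.com and its subdomains (www, old, ...), nothing else.
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        return None
    match = _POST_ID_PATTERN.search(parsed.path)
    return match.group(1) if match else None
```

Returning `None` instead of raising lets the caller show a friendly "please paste a full post URL" message before any API request is made.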
Configuration
Streamlit Configuration
The app includes optimized Streamlit configuration in .streamlit/config.toml:
- Theme: Custom dark theme with Reddit-inspired colors
- Performance: Optimized caching and data handling
- Security: XSRF protection and secure headers
Environment Variables
| Variable | Description | Example |
|---|---|---|
| REDDIT_CLIENT_ID | Reddit API client ID | abcd1234efgh5678 |
| REDDIT_CLIENT_SECRET | Reddit API client secret | your_secret_key_here |
| REDDIT_USER_AGENT | User agent string | RedditScraper/1.0 by /u/username |
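As an illustration of how these three variables might be consumed, the sketch below reads them from the environment, fails early if any is missing, and passes them to `praw.Reddit`. The helper names `load_reddit_credentials` and `make_reddit_client` are hypothetical; how the app itself loads its settings is not shown in this README.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the three variables and fail early if any is missing or blank."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing Reddit settings: " + ", ".join(missing))
    # Keys are renamed to match the keyword arguments of praw.Reddit.
    return {
        "client_id": source["REDDIT_CLIENT_ID"].strip(),
        "client_secret": source["REDDIT_CLIENT_SECRET"].strip(),
        "user_agent": source["REDDIT_USER_AGENT"].strip(),
    }


def make_reddit_client(env: Optional[Mapping[str, str]] = None):
    """Build a PRAW client; praw is imported lazily so the loader works without it."""
    import praw  # third-party dependency: pip install praw

    return praw.Reddit(**load_reddit_credentials(env))
```

Checking for missing values up front turns a confusing authentication failure into a message that names the exact variable to fix.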
Data Export
The scraper provides comprehensive data export options:
Post Data Fields
- ID, Title, Post Text, Subreddit
- Author, Created UTC, Score, Upvote Ratio
- Total Comments, Total Awards, Flair
- Content flags (NSFW, Spoiler, OC)
- URLs and Permalinks
Comment Data Fields
- Comment ID, Parent ID, Comment Text
- Author, Score, Created UTC
- Permalink, Submitter Status
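To make the two export formats concrete, here is a small sketch that serializes a list of records to CSV and JSON using only the standard library. The helper names and the record keys in the usage note are hypothetical; the app itself is built on Pandas and may produce its downloads differently.

```python
import csv
import io
import json
from typing import Dict, List, Sequence


def to_csv(rows: Sequence[Dict[str, object]], fields: List[str]) -> str:
    """Serialize rows to CSV text, writing only the listed fields in order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # silently drop keys not listed in fields
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows: Sequence[Dict[str, object]]) -> str:
    """Serialize rows to a JSON array, keeping non-ASCII text readable."""
    return json.dumps(list(rows), ensure_ascii=False, indent=2)
```

For example, `to_csv(rows, ["id", "title", "score"])` writes a header line followed by one line per record, quoting any title that contains a comma.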
Privacy & Rate Limiting
- Rate Limiting: The app respects Reddit's API rate limits
- Data Privacy: No data is stored permanently; everything is processed in real-time
- Caching: Uses Streamlit's caching for better performance (1-hour TTL)
- Security: API credentials are handled securely through Streamlit secrets
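The 1-hour TTL mentioned above means a cached result is reused for up to 3600 seconds and then fetched again. Streamlit offers this through `st.cache_data(ttl=...)`; the standalone decorator below is only a sketch of the same idea, with a hypothetical `ttl_cache` name and an injectable clock, and is not the app's code.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, Tuple


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache results per positional-argument tuple and recompute once they expire."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        store: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}

        @wraps(func)
        def wrapper(*args: Any) -> Any:
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now, value)  # remember when this value was fetched
            return value

        return wrapper

    return decorator
```

Decorating a fetch function with `@ttl_cache(3600)` would give repeated requests for the same subreddit the cached result for an hour. Arguments must be hashable, since they serve as the cache key.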
Troubleshooting
Common Issues
- "Invalid credentials" error:
  - Verify your Reddit API credentials
  - Ensure the user agent string is descriptive
  - Check that your Reddit app type is set to "script"
- "No posts found" error:
  - Verify the subreddit name is correct
  - Check if the subreddit is private or banned
  - Try adjusting your date filters
- Rate limiting:
  - The app automatically handles rate limits
  - If you hit limits, wait a few minutes before retrying
- App won't start:
  - Check that all dependencies are installed
  - Verify Python version compatibility (3.8+)
  - Ensure environment variables are set correctly
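The "wait before retrying" advice for rate limits can be automated in your own scripts with exponential backoff. The sketch below is a generic illustration under stated assumptions: `retry_with_backoff` is a hypothetical helper, the delays are arbitrary, and it is not how the app or PRAW handles rate limits internally.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action, waiting base_delay, 2*base_delay, ... between failed attempts."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:  # in real code, catch only the rate-limit error type
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

With the defaults, a call that keeps failing is retried after 2, 4, and 8 seconds before the last error is raised to the caller.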
Performance Tips
- Use specific date ranges instead of "All" for large subreddits
- Apply filters to reduce data volume
- Clear browser cache if the app becomes slow
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Development Setup
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and test thoroughly
- Commit your changes: git commit -m 'Add feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with Streamlit for the web interface
- Uses PRAW for Reddit API access
- Visualizations powered by Plotly
- Data processing with Pandas
Support
If you encounter any issues or have questions:
- Check the troubleshooting section
- Search existing issues
- Create a new issue with detailed information about the problem
Happy Scraping!
Made with ❤️ for the Reddit community