# GitHub Actions Compatibility Guide
✅ **Yes, the MTB scraper works with GitHub Actions!**

The Playwright browser automation solution is fully compatible with GitHub Actions after proper configuration.
## Required Setup

### 1. GitHub Actions Workflow

The workflow file `.github/workflows/scrape-races.yml` has been created with:

- **Playwright installation**: Automatically installs the Chromium browser
- **System dependencies**: Installs required libraries with `playwright install-deps`
- **Python environment**: Sets up Python 3.12 with pip caching
- **Automated commits**: Pushes generated race files back to the repository
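For orientation, here is a minimal sketch of what such a workflow looks like. It is assembled from the steps this guide describes, not copied from the actual file, and the `stages/` output path is taken from the artifact example later in this guide:

```yaml
name: Scrape Daily Races

on:
  schedule:
    - cron: '0 6 * * *'   # Daily at 6:00 AM UTC
  workflow_dispatch:       # Manual trigger from the GitHub UI

permissions:
  contents: write          # Required for the auto-commit step

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Install Python dependencies
        run: pip install -r scripts/requirements.txt

      - name: Install Playwright browsers
        run: |
          playwright install chromium
          playwright install-deps chromium

      - name: Run scraper for today
        run: |
          TODAY=$(date -u +%Y-%m-%d)
          python scripts/scrape_and_generate.py "$TODAY"

      - name: Commit generated race files
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add stages/
          git diff --cached --quiet || (git commit -m "Add scraped races" && git push)
```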
### 2. Key Components

**Python Dependencies** (`scripts/requirements.txt`)

```text
requests
beautifulsoup4
playwright
google-generativeai  # Optional
packaging
```
**Playwright Browser Installation**

```yaml
- name: Install Playwright browsers
  run: |
    playwright install chromium
    playwright install-deps chromium
```
This ensures GitHub Actions runners have the Chromium browser binary and all system dependencies.
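If you want to confirm on the runner that Chromium actually launches, a short smoke test using Playwright's Python sync API can be added as an extra step. This is an optional suggestion, not part of the existing workflow:

```python
from playwright.sync_api import sync_playwright

# Smoke test: launch headless Chromium and print a page title.
# If this fails, the browser binary or its system deps are missing.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```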
### 3. Workflow Triggers

**Scheduled (Daily at 6:00 AM UTC)**

```yaml
on:
  schedule:
    - cron: '0 6 * * *'
```
**Manual Trigger**

```yaml
on:
  workflow_dispatch:  # Trigger from GitHub UI
```
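Both triggers can live in the same `on:` block, so the workflow runs on its schedule and can still be launched by hand:

```yaml
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:
```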
## Important Differences from Local Execution

**Local (Your Machine)**

- Uses a `venv` virtual environment
- Chromium installed to `~/.cache/ms-playwright/`
- Manual execution: `venv/bin/python3 scripts/scrape_and_generate.py 2025-11-30`
**GitHub Actions**

- Fresh Python environment on each run
- Chromium installed to the GitHub runner's cache
- Automatic execution: runs daily at the scheduled time
- Auto-commits results to the repository
## Testing the Workflow

**Method 1: Manual Trigger (Recommended for Testing)**

1. Go to the GitHub repository → **Actions** tab
2. Select the “Scrape Daily Races” workflow
3. Click the **Run workflow** button
4. Select the branch (`main`)
5. Click **Run workflow**
**Method 2: Wait for Scheduled Run**

- Automatically runs daily at 6:00 AM UTC
- Check the **Actions** tab for execution logs
**Method 3: Test Locally First**

```bash
# Activate virtual environment
source venv/bin/activate

# Test with today's date
TODAY=$(date -u +%Y-%m-%d)
python scripts/scrape_and_generate.py "$TODAY"
```
## Configuration Options

### 1. Change Schedule Time

Edit `.github/workflows/scrape-races.yml`:

```yaml
on:
  schedule:
    # Run at 8:00 PM UTC (4:00 PM EDT)
    - cron: '0 20 * * *'
```

Cron format: `minute hour day-of-month month day-of-week`
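Two worked examples of that format (both expressions are illustrative, not from the project's workflow):

```yaml
schedule:
  - cron: '30 18 * * 1-5'   # 6:30 PM UTC, Monday through Friday
  - cron: '0 */6 * * *'     # Every 6 hours, on the hour
```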
### 2. Add Gemini API Key (Optional)

If using Google Gemini for race summaries:

1. Go to repository **Settings → Secrets and variables → Actions**
2. Click **New repository secret**
3. Name: `GEMINI_API_KEY`; Value: your API key
4. Uncomment the relevant step in the workflow:
```yaml
- name: Run scraper for today
  run: |
    TODAY=$(date -u +%Y-%m-%d)
    python scripts/scrape_and_generate.py "$TODAY"
  env:
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```
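On the script side, the key would typically be picked up from the environment. A minimal sketch, assuming the standard `google-generativeai` setup; the actual wiring inside `scrape_and_generate.py` may differ:

```python
import os

import google.generativeai as genai

# GEMINI_API_KEY is injected by the workflow's env block above; when it
# is absent, skip AI summaries instead of failing the whole run.
api_key = os.environ.get("GEMINI_API_KEY")
if api_key:
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # example model name
```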
### 3. Disable Auto-Commit

If you want to review changes before committing, replace the commit step with an artifact upload:

```yaml
- name: Upload artifacts instead of committing
  uses: actions/upload-artifact@v4
  with:
    name: race-files
    path: stages/
```
## Troubleshooting

### Issue: Playwright Installation Fails

**Solution**: Ensure both commands run:

```bash
playwright install chromium
playwright install-deps chromium
```
### Issue: 403 Forbidden Errors Persist

**Solution**: GitHub Actions runners use different IP ranges than your machine, so the target site may block them more readily. Consider:

- Adding random delays between requests (see the sketch after this list)
- Running during off-peak hours
- Using proxy services (paid)
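A minimal sketch of a request helper with randomized delays, assuming the scraper uses a `requests` session; the helper name and delay range are illustrative, not taken from the actual script:

```python
import random
import time

import requests

def polite_get(session: requests.Session, url: str) -> requests.Response:
    # Sleep 2-5 seconds before each request so traffic from the
    # runner's IP looks less like an automated burst.
    time.sleep(random.uniform(2.0, 5.0))
    return session.get(url, timeout=30)
```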
### Issue: Workflow Runs But No Commits

**Check**:

- View the workflow logs in the **Actions** tab
- Look for a “No changes to commit” message
- Verify that the scraper found races for that date
### Issue: Permission Denied on Push

**Solution**: Ensure the repository grants workflows write permissions:

1. Repository **Settings → Actions → General**
2. Scroll to “Workflow permissions”
3. Select **Read and write permissions**
4. Save
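Alternatively, the workflow file can request write access explicitly via a `permissions` block, which takes precedence over the repository default for that workflow:

```yaml
permissions:
  contents: write
```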
## Monitoring

### View Execution Logs

1. Go to the repository → **Actions**
2. Click on the workflow run
3. Click on the job name “scrape”
4. Expand steps to view logs
### Download Artifacts (on failure)

If the scraper fails, logs are uploaded as artifacts (a sketch of the upload step follows):

1. Open the failed workflow run → **Artifacts** section
2. Download `scraper-logs.zip`
3. Review the error details
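The step producing this artifact typically looks like the following sketch; the `logs/` path is an assumption, so check the actual workflow for the real path:

```yaml
- name: Upload scraper logs
  if: failure()           # Only runs when an earlier step failed
  uses: actions/upload-artifact@v4
  with:
    name: scraper-logs
    path: logs/
```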
## Performance Considerations

### Browser Automation in CI/CD

- **Execution time**: 2-10 minutes (depends on how many races are scheduled that day)
- **Memory usage**: ~500 MB for Chromium
- **GitHub Actions limits**: 2,000 minutes/month (Free tier)
- **Cost per run**: ~2-10 minutes = 0.1-0.5% of the monthly quota
- **Note**: only processes races for the current day, not bulk historical data
### Optimization Tips

- **Test with low-activity dates first**:

  ```bash
  # Test with a date that has fewer races
  python scripts/scrape_and_generate.py 2025-12-25  # Likely fewer races on Christmas
  ```

- **Cache Playwright browsers**:

  ```yaml
  - name: Cache Playwright browsers
    uses: actions/cache@v4
    with:
      path: ~/.cache/ms-playwright
      key: ${{ runner.os }}-playwright-${{ hashFiles('scripts/requirements.txt') }}
  ```

- **Run only on specific days**:

  ```yaml
  schedule:
    # Only on race days (Saturday/Sunday)
    - cron: '0 6 * * 6,0'
  ```
## Deployment Checklist

- [x] `.github/workflows/scrape-races.yml` created
- [x] Playwright installation configured
- [x] Auto-commit enabled
- [ ] Repository write permissions enabled
- [ ] Test the manual workflow trigger
- [ ] Review the first automated run
- [ ] Monitor execution time
- [ ] Adjust schedule if needed
## Alternative: Disable Browser Automation

If GitHub Actions has issues with Playwright, fall back to HTTP-only scraping by removing the browser fallback in `scrape_and_generate.py`:

```python
# Comment out browser_fetch() calls
# html = browser_fetch(url)
```

**Pros**: Faster, uses less memory
**Cons**: More 403 errors, fewer races scraped
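For context, an HTTP-only fetch path might look like this sketch, assuming the script uses `requests`; the helper name and headers are illustrative:

```python
import requests

HEADERS = {
    # A browser-like User-Agent reduces, but does not eliminate, 403s.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}

def http_fetch(url: str) -> str | None:
    """Fetch a page over plain HTTP; return None instead of escalating
    to a browser when the site refuses the request."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    if resp.status_code == 403:
        return None
    resp.raise_for_status()
    return resp.text
```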
## Summary

✅ **Yes, it works with GitHub Actions!**

The workflow automatically:

1. Installs Python dependencies
2. Installs Playwright and the Chromium browser
3. Runs the scraper with the current date
4. Commits generated files
5. Pushes to the repository
**Next Steps**:

1. Enable repository write permissions
2. Test with a manual workflow trigger
3. Monitor the first automated run
4. Adjust the schedule/configuration as needed
**Important**: The scraper processes only races scheduled for the target date. It fetches today’s races from ProcyclingStats and generates markdown files for those specific events.
---

**Created**: 2025-01-XX
**Tested on**: `ubuntu-latest` runner
**Python Version**: 3.12
**Browser**: Chromium 141.0.7390.37