# GitHub Actions Compatibility Guide
✅ **Yes, the MTB scraper works with GitHub Actions!**

The Playwright browser automation solution is fully compatible with GitHub Actions after proper configuration.
## Required Setup

### 1. GitHub Actions Workflow

The workflow file `.github/workflows/scrape-races.yml` has been created with:

- **Playwright installation**: Automatically installs the Chromium browser
- **System dependencies**: Installs required libraries with `playwright install-deps`
- **Python environment**: Sets up Python 3.12 with pip caching
- **Automated commits**: Pushes generated race files back to the repository
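For orientation, here is a minimal sketch of what such a workflow looks like. It is assembled from the steps this guide describes, not copied from the actual file, and the `stages/` output path is taken from the artifact example later in this guide:

```yaml
name: Scrape Daily Races

on:
  schedule:
    - cron: '0 6 * * *'   # Daily at 6:00 AM UTC
  workflow_dispatch:       # Manual trigger from the GitHub UI

permissions:
  contents: write          # Required for the auto-commit step

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Install Python dependencies
        run: pip install -r scripts/requirements.txt

      - name: Install Playwright browsers
        run: |
          playwright install chromium
          playwright install-deps chromium

      - name: Run scraper for today
        run: |
          TODAY=$(date -u +%Y-%m-%d)
          python scripts/scrape_and_generate.py "$TODAY"

      - name: Commit generated race files
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add stages/
          git diff --cached --quiet || (git commit -m "Add scraped races" && git push)
```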
### 2. Key Components

**Python Dependencies** (`scripts/requirements.txt`)

```text
requests
beautifulsoup4
playwright
google-generativeai  # Optional
packaging
```
**Playwright Browser Installation**

```yaml
- name: Install Playwright browsers
  run: |
    playwright install chromium
    playwright install-deps chromium
```
This ensures GitHub Actions runners have the Chromium browser binary and all system dependencies.
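If you want to confirm on the runner that Chromium actually launches, a short smoke test using Playwright's Python sync API can be added as an extra step. This is an optional suggestion, not part of the existing workflow:

```python
from playwright.sync_api import sync_playwright

# Smoke test: launch headless Chromium and print a page title.
# If this fails, the browser binary or its system deps are missing.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```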
### 3. Workflow Triggers

**Scheduled (Daily at 6:00 AM UTC)**

```yaml
on:
  schedule:
    - cron: '0 6 * * *'
```
**Manual Trigger**

```yaml
on:
  workflow_dispatch:  # Trigger from GitHub UI
```
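Both triggers can live in the same `on:` block, so the workflow runs on its schedule and can still be launched by hand:

```yaml
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:
```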
## Important Differences from Local Execution

**Local (Your Machine)**

- Uses a `venv` virtual environment
- Chromium installed to `~/.cache/ms-playwright/`
- Manual execution: `venv/bin/python3 scripts/scrape_and_generate.py 2025-11-30`
**GitHub Actions**

- Fresh Python environment on each run
- Chromium installed to the GitHub runner's cache
- Automatic execution: runs daily at the scheduled time
- Auto-commits results to the repository
## Testing the Workflow

**Method 1: Manual Trigger (Recommended for Testing)**

1. Go to the GitHub repository → **Actions** tab
2. Select the “Scrape Daily Races” workflow
3. Click the **Run workflow** button
4. Select the branch (`main`)
5. Click **Run workflow**
**Method 2: Wait for Scheduled Run**

- Automatically runs daily at 6:00 AM UTC
- Check the **Actions** tab for execution logs
**Method 3: Test Locally First**

```bash
# Activate virtual environment
source venv/bin/activate

# Test with today's date
TODAY=$(date -u +%Y-%m-%d)
python scripts/scrape_and_generate.py "$TODAY"
```
## Configuration Options

### 1. Change Schedule Time

Edit `.github/workflows/scrape-races.yml`:

```yaml
on:
  schedule:
    # Run at 8:00 PM UTC (4:00 PM EDT)
    - cron: '0 20 * * *'
```

Cron format: `minute hour day-of-month month day-of-week`
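Two worked examples of that format (both expressions are illustrative, not from the project's workflow):

```yaml
schedule:
  - cron: '30 18 * * 1-5'   # 6:30 PM UTC, Monday through Friday
  - cron: '0 */6 * * *'     # Every 6 hours, on the hour
```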
### 2. Add Gemini API Key (Optional)

If using Google Gemini for race summaries:

1. Go to repository **Settings → Secrets and variables → Actions**
2. Click **New repository secret**
3. Name: `GEMINI_API_KEY`; Value: your API key
4. Uncomment the relevant step in the workflow:
```yaml
- name: Run scraper for today
  run: |
    TODAY=$(date -u +%Y-%m-%d)
    python scripts/scrape_and_generate.py "$TODAY"
  env:
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```
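On the script side, the key would typically be picked up from the environment. A minimal sketch, assuming the standard `google-generativeai` setup; the actual wiring inside `scrape_and_generate.py` may differ:

```python
import os

import google.generativeai as genai

# GEMINI_API_KEY is injected by the workflow's env block above; when it
# is absent, skip AI summaries instead of failing the whole run.
api_key = os.environ.get("GEMINI_API_KEY")
if api_key:
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # example model name
```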
### 3. Disable Auto-Commit

If you want to review changes before committing, replace the commit step with an artifact upload:

```yaml
- name: Upload artifacts instead of committing
  uses: actions/upload-artifact@v4
  with:
    name: race-files
    path: stages/
```
## Troubleshooting

### Issue: Playwright Installation Fails

**Solution**: Ensure both commands run:

```bash
playwright install chromium
playwright install-deps chromium
```
### Issue: 403 Forbidden Errors Persist

**Solution**: GitHub Actions runners use different IP ranges than your machine, so the target site may block them more readily. Consider:

- Adding random delays between requests (see the sketch after this list)
- Running during off-peak hours
- Using proxy services (paid)
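A minimal sketch of a request helper with randomized delays, assuming the scraper uses a `requests` session; the helper name and delay range are illustrative, not taken from the actual script:

```python
import random
import time

import requests

def polite_get(session: requests.Session, url: str) -> requests.Response:
    # Sleep 2-5 seconds before each request so traffic from the
    # runner's IP looks less like an automated burst.
    time.sleep(random.uniform(2.0, 5.0))
    return session.get(url, timeout=30)
```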
### Issue: Workflow Runs But No Commits

**Check**:

- View the workflow logs in the **Actions** tab
- Look for a “No changes to commit” message
- Verify that the scraper found races for that date
### Issue: Permission Denied on Push

**Solution**: Ensure the repository grants workflows write permissions:

1. Repository **Settings → Actions → General**
2. Scroll to “Workflow permissions”
3. Select **Read and write permissions**
4. Save
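Alternatively, the workflow file can request write access explicitly via a `permissions` block, which takes precedence over the repository default for that workflow:

```yaml
permissions:
  contents: write
```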
## Monitoring

### View Execution Logs

1. Go to the repository → **Actions**
2. Click on the workflow run
3. Click on the job name “scrape”
4. Expand steps to view logs
### Download Artifacts (on failure)

If the scraper fails, logs are uploaded as artifacts (a sketch of the upload step follows):

1. Open the failed workflow run → **Artifacts** section
2. Download `scraper-logs.zip`
3. Review the error details
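The step producing this artifact typically looks like the following sketch; the `logs/` path is an assumption, so check the actual workflow for the real path:

```yaml
- name: Upload scraper logs
  if: failure()           # Only runs when an earlier step failed
  uses: actions/upload-artifact@v4
  with:
    name: scraper-logs
    path: logs/
```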
## Performance Considerations

### Browser Automation in CI/CD

- **Execution time**: 2-10 minutes (depends on how many races are scheduled that day)
- **Memory usage**: ~500 MB for Chromium
- **GitHub Actions limits**: 2,000 minutes/month (Free tier)
- **Cost per run**: ~2-10 minutes = 0.1-0.5% of the monthly quota
- **Note**: only processes races for the current day, not bulk historical data
### Optimization Tips

- **Test with low-activity dates first**:

  ```bash
  # Test with a date that has fewer races
  python scripts/scrape_and_generate.py 2025-12-25  # Likely fewer races on Christmas
  ```

- **Cache Playwright browsers**:

  ```yaml
  - name: Cache Playwright browsers
    uses: actions/cache@v4
    with:
      path: ~/.cache/ms-playwright
      key: ${{ runner.os }}-playwright-${{ hashFiles('scripts/requirements.txt') }}
  ```

- **Run only on specific days**:

  ```yaml
  schedule:
    # Only on race days (Saturday/Sunday)
    - cron: '0 6 * * 6,0'
  ```
## Deployment Checklist

- [x] `.github/workflows/scrape-races.yml` created
- [x] Playwright installation configured
- [x] Auto-commit enabled
- [ ] Repository write permissions enabled
- [ ] Test the manual workflow trigger
- [ ] Review the first automated run
- [ ] Monitor execution time
- [ ] Adjust schedule if needed
## Alternative: Disable Browser Automation

If GitHub Actions has issues with Playwright, fall back to HTTP-only scraping by removing the browser fallback in `scrape_and_generate.py`:

```python
# Comment out browser_fetch() calls
# html = browser_fetch(url)
```

**Pros**: Faster, uses less memory
**Cons**: More 403 errors, fewer races scraped
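For context, an HTTP-only fetch path might look like this sketch, assuming the script uses `requests`; the helper name and headers are illustrative:

```python
import requests

HEADERS = {
    # A browser-like User-Agent reduces, but does not eliminate, 403s.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}

def http_fetch(url: str) -> str | None:
    """Fetch a page over plain HTTP; return None instead of escalating
    to a browser when the site refuses the request."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    if resp.status_code == 403:
        return None
    resp.raise_for_status()
    return resp.text
```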
## Summary

✅ **Yes, it works with GitHub Actions!**

The workflow automatically:

1. Installs Python dependencies
2. Installs Playwright and the Chromium browser
3. Runs the scraper with the current date
4. Commits generated files
5. Pushes to the repository
**Next Steps**:

1. Enable repository write permissions
2. Test with a manual workflow trigger
3. Monitor the first automated run
4. Adjust the schedule/configuration as needed
**Important**: The scraper processes only races scheduled for the target date. It fetches today’s races from ProcyclingStats and generates markdown files for those specific events.
---

**Created**: 2025-01-XX
**Tested on**: `ubuntu-latest` runner
**Python Version**: 3.12
**Browser**: Chromium 141.0.7390.37