GitHub Actions Compatibility Guide

✅ Yes, the MTB scraper works with GitHub Actions!

The Playwright browser automation solution is fully compatible with GitHub Actions after proper configuration.

Required Setup

1. GitHub Actions Workflow

The workflow file .github/workflows/scrape-races.yml has been created with:

  • Playwright installation: Automatically installs Chromium browser
  • System dependencies: Installs required libraries with playwright install-deps
  • Python environment: Sets up Python 3.12 with pip caching
  • Automated commits: Pushes generated race files back to repository

2. Key Components

Python Dependencies (scripts/requirements.txt)

requests
beautifulsoup4
playwright
google-generativeai  # Optional
packaging

Playwright Browser Installation

- name: Install Playwright browsers
  run: |
    playwright install chromium
    playwright install-deps chromium

This ensures GitHub Actions runners have the Chromium browser binary and all system dependencies.
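A quick way to confirm the browser binary landed where Playwright expects is to check the default Linux cache path (`chromium_cached` is an illustrative helper, not part of the scraper, and it assumes `PLAYWRIGHT_BROWSERS_PATH` is unset):

```python
# Sketch: check whether a Chromium build exists in Playwright's default
# Linux cache directory (~/.cache/ms-playwright).
from pathlib import Path

def chromium_cached(cache_dir: Path = Path.home() / ".cache" / "ms-playwright") -> bool:
    """Return True if any chromium-* build directory exists in the cache."""
    if not cache_dir.is_dir():
        return False
    return any(p.name.startswith("chromium") for p in cache_dir.iterdir())

print("Chromium cached:", chromium_cached())
```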

3. Workflow Triggers

Scheduled (Daily at 6:00 AM UTC)

on:
  schedule:
    - cron: '0 6 * * *'

Manual Trigger

  workflow_dispatch:  # Trigger from GitHub UI

Important Differences from Local Execution

Local (Your Machine)

  • Uses venv/ virtual environment
  • Chromium installed to ~/.cache/ms-playwright/
  • Manual execution: venv/bin/python3 scripts/scrape_and_generate.py 2025-11-30

GitHub Actions

  • Fresh Python environment each run
  • Chromium installed to GitHub runner’s cache
  • Automatic execution: Runs daily at scheduled time
  • Auto-commits results to repository

Testing the Workflow

Method 1: Manual Trigger

  1. Go to GitHub repository → Actions tab
  2. Select “Scrape Daily Races” workflow
  3. Click Run workflow button
  4. Select branch (main)
  5. Click Run workflow

Method 2: Wait for Scheduled Run

  • Automatically runs daily at 6:00 AM UTC
  • Check Actions tab for execution logs

Method 3: Test Locally First

# Activate virtual environment
source venv/bin/activate

# Test with today's date
TODAY=$(date -u +%Y-%m-%d)
python scripts/scrape_and_generate.py "$TODAY"

Configuration Options

1. Change Schedule Time

Edit .github/workflows/scrape-races.yml:

schedule:
  # Run at 8:00 PM UTC (4:00 PM EDT)
  - cron: '0 20 * * *'

Cron format: minute hour day month day-of-week
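A minimal sketch for sanity-checking an expression's five fields before editing the workflow (`cron_fields` is an illustrative helper; it validates field count only, not value ranges):

```python
# Sketch: split a cron expression into its five standard fields
# (minute, hour, day, month, day-of-week) and fail fast on malformed input.
def cron_fields(expr: str) -> list:
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return fields

minute, hour, day, month, dow = cron_fields("0 6 * * *")
print(minute, hour)  # the workflow's daily 06:00 UTC schedule
```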

2. Add Gemini API Key (Optional)

If using Google Gemini for race summaries:

  1. Go to repository Settings → Secrets and variables → Actions
  2. Click New repository secret
  3. Name: GEMINI_API_KEY
  4. Value: Your API key
  5. Uncomment in workflow:
- name: Run scraper for today
  run: |
    TODAY=$(date -u +%Y-%m-%d)
    python scripts/scrape_and_generate.py "$TODAY"
  env:
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
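On the scraper side, a typical pattern for reading the secret looks like the sketch below (the actual code in scrape_and_generate.py may differ):

```python
import os

# Read the secret injected by the workflow's `env:` block. Summaries are
# optional, so the scraper should degrade gracefully when the key is absent.
def gemini_enabled() -> bool:
    return bool(os.environ.get("GEMINI_API_KEY"))

if gemini_enabled():
    print("Gemini summaries enabled")
else:
    print("GEMINI_API_KEY not set; skipping AI summaries")
```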

3. Disable Auto-Commit

If you want to review changes before committing:

- name: Upload artifacts instead of committing
  uses: actions/upload-artifact@v4
  with:
    name: race-files
    path: stages/

Troubleshooting

Issue: Playwright Installation Fails

Solution: Ensure both commands run:

playwright install chromium
playwright install-deps chromium

Issue: 403 Forbidden Errors Persist

Solution: GitHub Actions runners have different IPs. Consider:

  • Adding random delays between requests
  • Running during off-peak hours
  • Using proxy services (paid)
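The random-delay suggestion can be sketched as a jittered retry wrapper; `fetch` here is a hypothetical callable standing in for whatever HTTP call the scraper actually makes:

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0):
    """Retry `fetch(url)` with growing, randomized delays between attempts.

    `fetch` must return a (status_code, body) tuple; returns the body on
    any non-403 response, raises after exhausting retries.
    """
    for attempt in range(retries):
        status, body = fetch(url)
        if status != 403:
            return body
        # Random jitter makes the request pattern look less bot-like.
        time.sleep(base_delay * (attempt + 1) + random.uniform(0, base_delay))
    raise RuntimeError(f"still 403 after {retries} attempts: {url}")
```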

Issue: Workflow Runs But No Commits

Check:

  1. View workflow logs in Actions tab
  2. Look for “No changes to commit” message
  3. Verify scraper found races for that date

Issue: Permission Denied on Push

Solution: Ensure repository has Write permissions:

  1. Repository Settings → Actions → General
  2. Scroll to “Workflow permissions”
  3. Select Read and write permissions
  4. Save

Monitoring

View Execution Logs

  1. Go to repository → Actions
  2. Click on workflow run
  3. Click on job name “scrape”
  4. Expand steps to view logs

Download Artifacts (on failure)

If scraper fails, logs are uploaded as artifacts:

  1. Failed workflow run → Artifacts section
  2. Download scraper-logs.zip
  3. Review error details

Performance Considerations

Browser Automation in CI/CD

  • Execution time: 2-10 minutes (depends on races scheduled that day)
  • Memory usage: ~500MB for Chromium
  • GitHub Actions limits: 2000 minutes/month (Free tier)
  • Cost per run: ~2-10 minutes = 0.1-0.5% of monthly quota
  • Note: Only processes races for the current day, not bulk historical data
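The quota figures above check out arithmetically:

```python
# Verify the per-run quota percentages against the free-tier allowance.
free_tier_minutes = 2000  # GitHub Actions Free tier, per month

for run_minutes in (2, 10):
    pct_per_run = run_minutes / free_tier_minutes * 100
    monthly = run_minutes * 31  # worst case: one run every day
    print(f"{run_minutes} min run = {pct_per_run:.1f}% of quota; "
          f"~{monthly} min/month if run daily")
```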

Optimization Tips

  1. Test with low-activity dates first:
    # Test with a date that has fewer races
    python scripts/scrape_and_generate.py 2025-12-25  # Likely fewer races on Christmas
    
  2. Cache Playwright browsers:
- name: Cache Playwright browsers
  uses: actions/cache@v4
  with:
    path: ~/.cache/ms-playwright
    key: ${{ runner.os }}-playwright-${{ hashFiles('scripts/requirements.txt') }}
  3. Run only on specific days:
    schedule:
      # Only on race days (Saturday/Sunday)
      - cron: '0 6 * * 6,0'
    

Deployment Checklist

  • .github/workflows/scrape-races.yml created
  • Playwright installation configured
  • Auto-commit enabled
  • Repository write permissions enabled
  • Test manual workflow trigger
  • Review first automated run
  • Monitor execution time
  • Adjust schedule if needed

Alternative: Disable Browser Automation

If Playwright proves unreliable on GitHub Actions, fall back to HTTP-only fetching:

Remove the browser fallback in scrape_and_generate.py:

# Comment out browser_fetch() calls
# html = browser_fetch(url)

Pros: Faster, uses less memory
Cons: More 403 errors, fewer races scraped
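Rather than deleting the calls outright, a switch keeps the fallback reversible. `http_fetch`, `browser_fetch`, and `fetch_page` below are hypothetical names, not functions confirmed to exist in scrape_and_generate.py:

```python
def fetch_page(url, http_fetch, browser_fetch, use_browser=True):
    """Try plain HTTP first; fall back to the browser only when allowed.

    Both fetchers take a URL and return HTML, or None on failure.
    Setting use_browser=False gives the HTTP-only behavior described above.
    """
    html = http_fetch(url)
    if html is None and use_browser:
        html = browser_fetch(url)
    return html
```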

Summary

Yes, it works with GitHub Actions!

The workflow automatically:

  1. Installs Python dependencies
  2. Installs Playwright + Chromium browser
  3. Runs scraper with current date
  4. Commits generated files
  5. Pushes to repository

Next Steps:

  1. Enable repository write permissions
  2. Test with manual workflow trigger
  3. Monitor first automated run
  4. Adjust schedule/configuration as needed

Important: The scraper processes only races scheduled for the target date. It fetches today’s races from ProcyclingStats and generates markdown files for those specific events.
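The target date matches what the shell steps produce with `date -u +%Y-%m-%d`; in Python the equivalent is:

```python
# Build the UTC date string passed to the scraper, mirroring the
# workflow's `TODAY=$(date -u +%Y-%m-%d)` shell step.
from datetime import datetime, timezone

today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
print(today)
```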


Created: 2025-01-XX
Tested on: ubuntu-latest runner
Python Version: 3.12
Browser: Chromium 141.0.7390.37
