BixArena Leaderboard Snapshot Automation¶

Metadata¶

Status: Proposed
Author: Thomas Schaffter
Created: 2026-01-09
Last Updated: 2026-01-12

Summary¶

Automate the daily generation of BixArena leaderboard snapshots using AWS Lambda and EventBridge scheduling, eliminating the manual process and improving system security by removing the need for external database access via SSH tunneling.

Motivation¶

Problem Statement¶

The BixArena leaderboard snapshot generation currently requires manual intervention by engineering team members. This manual process involves:

Setting up SSH port forwarding via AWS SSM to a bastion host
Running a Python script from bixarena-tools that connects directly to the production database
Generating the leaderboard snapshot from battle evaluation data
Publishing the snapshot to make it visible to users

This process must be repeated each time a leaderboard update is needed, consuming engineering resources and limiting how frequently leaderboards can be updated.

Current State¶

Process: Manual execution of Python script from bixarena-tools
Frequency: Ad-hoc, dependent on engineer availability (typically Rong)
Database Access: Direct connection from development environment via AWS SSM port forwarding through bastion host
Security: Database accessible from outside the private network (via tunneling)
Reliability: Dependent on manual execution, prone to being forgotten or delayed

Desired State¶

Process: Fully automated, scheduled execution
Frequency: Predictable daily updates at consistent times
Database Access: Lambda functions running within VPC private subnets
Security: Database connections confined to private network, no external access required
Reliability: Guaranteed daily execution with monitoring and alerting

Business Impact¶

Credibility Risk: Infrequent or irregular leaderboard updates can negatively impact BixArena's credibility with users who expect to see current rankings reflecting recent battles and evaluations.

User Engagement: Stale leaderboards reduce user engagement and confidence in the platform's activity level.

Engineering Efficiency: Manual process consumes engineering time that could be better spent on feature development.

Proposal¶

Overview¶

Implement automated daily leaderboard snapshot generation using a hybrid approach:

Primary Execution Path: AWS Lambda function triggered by EventBridge (CloudWatch Events) on a daily schedule
Secondary Execution Path: Protected admin API endpoint in AI service for on-demand generation
Shared Logic: Reusable leaderboard library extracted from bixarena-tools

High-Level Architecture¶

graph TD
    A[EventBridge Rule<br/>Daily 2 AM UTC] -->|Triggers| B[Lambda Function<br/>leaderboard-snapshot-gen]
    C[AI Service<br/>Admin Endpoint] -->|Async Invoke| B
    B -->|Reads| D[RDS PostgreSQL<br/>Battle Evaluations]
    B -->|Writes| E[RDS PostgreSQL<br/>Leaderboard Snapshots]
    B -->|Notifies| F[SNS Topic]
    F -->|Triggers| G[Slack Notification]

    H[Admin User] -->|Authenticated Request| C

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px

Key Components¶

Shared Library (libs/bixarena/leaderboard/python/)
- Extract snapshot generation logic from bixarena-tools
- Reusable by both Lambda function and CLI tool
- Contains Bradley-Terry ranking algorithm, database queries, snapshot generation
Lambda Function (Container Image)
- Scheduled daily execution via EventBridge
- Runs in VPC private subnet with access to RDS
- Retrieves credentials from AWS Secrets Manager
- Publishes notifications to SNS on success/failure
Admin API Endpoint
- Protected endpoint in AI service (/admin/generate-snapshot)
- Requires JWT authentication with admin role
- Async Lambda invocation (returns immediately with correlation ID)
- Enables on-demand snapshot generation when needed
Monitoring & Notifications
- CloudWatch logs with structured logging
- SNS notifications to Slack on success/failure
- CloudWatch alarms for Lambda errors
- Correlation IDs for tracing requests

Key Design Decisions¶

Lambda vs ECS Scheduled Task: Lambda chosen for simplicity, cost-effectiveness, and built-in retry logic
Container Image vs Zip: Container image chosen to support NumPy/SciPy dependencies without size constraints
RDS Proxy: Optional connection pooling for Lambda (to be evaluated)
Auto-publish: Snapshots automatically set to visible=true after generation
Async API Invocation: Admin endpoint returns immediately (HTTP 202) to avoid timeout issues

Security Improvements¶

Network Isolation: Lambda functions run in VPC private subnets, no external database access
Least Privilege IAM: Scoped permissions for Lambda execution and ECS task roles
Secrets Management: Database credentials stored in AWS Secrets Manager
Authentication: Admin API requires JWT with admin role validation
Input Validation: Whitelist-based validation of all Lambda inputs
Audit Logging: Track trigger source (scheduled/manual), user email, correlation IDs

Alternatives Considered¶

Alternative 1: ECS Scheduled Tasks¶

Description: Use ECS scheduled tasks instead of Lambda functions.

Pros:

Familiar deployment model (already using ECS for AI service)
No execution time limits (Lambda has 15-minute max)
Can reuse existing Docker images and infrastructure

Cons:

Higher cost (ECS task runs for full duration, Lambda charges per 100ms)
More complex infrastructure (task definitions, scheduling, networking)
Requires VPC and load balancer configuration
Overkill for a simple scheduled job

Decision: Rejected. Lambda is more appropriate for scheduled batch jobs with predictable execution times under 15 minutes.

Alternative 2: Step Functions for Orchestration¶

Description: Use AWS Step Functions to orchestrate a multi-step workflow (fetch data → compute rankings → save snapshot → notify).

Pros:

Visual workflow representation
Built-in error handling and retries
State persistence across steps
Can handle complex multi-step workflows

Cons:

Added complexity for a simple single-step process
Additional cost for Step Functions executions
Current snapshot generation is a single atomic operation
Unnecessary for initial implementation

Decision: Rejected for v1. Consider if future requirements introduce multi-step workflows (e.g., data validation → generation → approval → publishing).

Alternative 3: Keep Manual Process¶

Description: Continue with manual CLI execution via SSH tunneling.

Pros:

No infrastructure changes required
Engineers maintain direct control
Simple debugging and troubleshooting

Cons:

Does not address core problems (reliability, frequency, security)
Continues to consume engineering resources
Scales poorly as more leaderboards are added
Security risk from external database access

Decision: Rejected. Automation is necessary to improve reliability, security, and engineering efficiency.

Alternative 4: AWS Batch¶

Description: Use AWS Batch for scheduled job execution.

Pros:

Designed for batch compute workloads
Can handle long-running jobs (hours)
Built-in retry and failure handling

Cons:

More complex than Lambda for simple jobs
Requires compute environment configuration
Higher cost for our use case
Slower cold starts compared to Lambda

Decision: Rejected. AWS Batch is better suited for large-scale batch processing. Lambda is sufficient for our needs.

Performance Implications¶

Current Performance:

Manual execution time: ~2-5 minutes (depending on data volume)
Database query time: ~30 seconds for battle evaluations
Bradley-Terry computation: ~1-2 minutes (1000 bootstrap iterations)

Expected Performance with Lambda:

Execution time: 2-5 minutes (same as manual)
Cold start: ~2-3 seconds (container image)
Memory: 2GB (sufficient for NumPy/SciPy)
Timeout: 15 minutes (3x buffer)

Database Impact:

Read queries: Same as manual process (fetch battle evaluations)
Write queries: Same as manual process (insert snapshot + entries)
Connection pooling: RDS Proxy can be added if connection limits become an issue

Network Performance:

Lambda in same VPC as RDS: Low latency, no data transfer costs
No external network hops (unlike current SSH tunneling approach)

Security Considerations¶

Current Security Concerns¶

External Database Access: SSH tunneling exposes database to connections from development environments
Credential Management: Engineers need database credentials to run manual script
Audit Trail: Limited visibility into who generated snapshots and when
Network Exposure: Bastion host and SSM session manager create additional attack surface

Proposed Security Improvements¶

Network Isolation:
- Lambda runs in VPC private subnets
- No ingress from external networks
- Database never exposed outside VPC
IAM-Based Access Control:
- Lambda execution role with least-privilege permissions
- Scoped access to Secrets Manager, RDS, SNS
- No long-lived credentials in environment variables
Secrets Management:
- Database credentials stored in AWS Secrets Manager
- Automatic rotation supported
- Encrypted at rest and in transit
Authentication & Authorization:
- Admin API requires valid JWT token
- Admin role validation (Cognito groups)
- User email tracked for audit logging
Audit Logging:
- CloudWatch logs capture all executions
- Structured logging with correlation IDs
- Track trigger source (scheduled vs manual) and user identity
Input Validation:
- Whitelist allowed leaderboard IDs
- Validate bootstrap iteration ranges
- Sanitize all inputs before database queries

Timeline¶

Phase 1: Foundation¶

Extract bixarena-leaderboard library from bixarena-tools
Write unit tests for snapshot generation logic
Update CLI tool to use library (Optional, no longer needed when calling the API server)

Phase 2: Lambda Function¶

Implement Lambda handler with Docker container
Create CDK infrastructure (Lambda, EventBridge, SNS)
Add Slack notifications
Deploy to dev environment

Phase 3: Admin API¶

Add protected endpoint to AI service
Implement JWT authentication with admin role check
Grant IAM permissions for Lambda invocation
Test end-to-end flow

Phase 4: Testing & Deployment¶

Integration testing in dev environment
Verify daily scheduled executions
Test manual triggering via API
Deploy to staging and production

Total Estimated Timeline: 2-3 weeks

Success Criteria¶

Immediate Success Criteria (Week 1-2)¶

First successful automated snapshot generated in dev environment
Slack notification received for success and failure scenarios
Snapshot correctly marked as visible (auto-publish works)
Lambda execution time < 5 minutes (within performance targets)
No database connection errors or timeouts

Short-Term Success Criteria (30 Days)¶

Operational Reliability: 95% success rate over 30 consecutive days (≤2 failures/month)
Scheduled Execution: Daily snapshots generated at consistent time (2 AM UTC ±5 minutes)
Zero Manual Interventions: No engineer involvement required for routine updates
Alert Response Time: All failures detected and notified to Slack within 5 minutes
Security Compliance: Zero external database connections (SSH tunneling eliminated)

Long-Term Success Criteria (90 Days)¶

Update Frequency: Leaderboard updated daily without interruption for 90 days
User Engagement: Increase in user visits to leaderboard page (baseline TBD)
Engineering Time Saved: 4-8 hours/month of engineering time freed (estimated)
System Scalability: Successfully extended to support multiple leaderboard types (open-source, multimodal)
Monitoring Coverage: CloudWatch dashboard with key metrics (success rate, duration, entry counts)

Measurable KPIs¶

Success Rate: (successful_executions / total_executions) × 100
- Target: ≥95%
Execution Consistency: Standard deviation of execution times
- Target: Within ±10% of mean execution time
Notification Delivery: Time from execution completion to Slack notification
- Target: <1 minute
Security Incidents: External database access attempts
- Target: 0 (after cutover from manual process)
Engineering Efficiency: Hours saved per month
- Baseline: ~6-8 hours/month (manual execution + troubleshooting)
- Target: Reduced to ~1 hour/month (monitoring only)

Open Questions¶

Scheduling & Frequency¶

Daily Schedule: Is 2 AM UTC the optimal time? Should we consider user timezone distributions?
Multiple Executions: Should we support multiple daily snapshots (e.g., morning and evening updates)?
Holiday Handling: Should execution be paused during holidays or maintenance windows?

Failure Handling¶

Retry Policy: How many automatic retries before alerting engineers? (Current: 2 retries)
Failure Escalation: Should failures page on-call engineer or just notify Slack?
Partial Failures: How to handle cases where snapshot generates but notification fails?

Notification Preferences¶

Slack Channel: Which channel should receive notifications? (#bixarena-alerts, #bixarena-leaderboard, or both?)
Notification Content: Should success notifications include full details or just a summary?
Quiet Hours: Should notifications be suppressed during certain hours to avoid noise?

Leaderboard Scope¶

Multiple Leaderboards: When will we add support for other leaderboards (open-source, multimodal)?
Leaderboard Priority: If multiple leaderboards exist, should they run sequentially or in parallel?
Bootstrap Iterations: Is 1000 iterations appropriate for production, or should it be configurable per leaderboard?

Database Optimization¶

RDS Proxy: Should we implement RDS Proxy immediately or wait until we see connection pooling needs?
Read Replicas: Should snapshot generation read from a replica to reduce load on primary database?
Query Optimization: Are current database queries optimized for production data volumes?

Monitoring & Observability¶

CloudWatch Dashboard: What metrics are most valuable to track? (Duration, entry counts, success rate?)
Custom Metrics: Should we publish custom metrics to CloudWatch for leaderboard-specific KPIs?
Log Retention: Is 30 days sufficient for CloudWatch logs, or should we archive to S3?

Security & Compliance¶

Audit Requirements: Do we need to track snapshot generation in a compliance log beyond CloudWatch?
Access Control: Should multiple admin roles exist (e.g., admin-read-only vs admin-write)?
Rate Limiting: Is 100 API requests/day sufficient rate limit for manual triggering?

Future Enhancements¶

Snapshot Retention: Should old snapshots be automatically archived or deleted after a certain period?
Snapshot Comparison: Should the system detect and alert on significant ranking changes between snapshots?
User Notifications: Should users be notified when new snapshots are published (e.g., email, in-app)?

BixArena Leaderboard Snapshot Automation¶

Metadata¶

Summary¶

Motivation¶

Problem Statement¶

Current State¶

Desired State¶

Business Impact¶

Proposal¶

Overview¶

High-Level Architecture¶

Key Components¶

Key Design Decisions¶

Security Improvements¶

Alternatives Considered¶

Alternative 1: ECS Scheduled Tasks¶

Alternative 2: Step Functions for Orchestration¶

Alternative 3: Keep Manual Process¶

Alternative 4: AWS Batch¶

Performance Implications¶

Security Considerations¶

Current Security Concerns¶

Proposed Security Improvements¶

Timeline¶

Phase 1: Foundation¶

Phase 2: Lambda Function¶

Phase 3: Admin API¶

Phase 4: Testing & Deployment¶

Success Criteria¶

Immediate Success Criteria (Week 1-2)¶

Short-Term Success Criteria (30 Days)¶

Long-Term Success Criteria (90 Days)¶

Measurable KPIs¶

Open Questions¶

Scheduling & Frequency¶

Failure Handling¶

Notification Preferences¶

Leaderboard Scope¶

Database Optimization¶

Monitoring & Observability¶

Security & Compliance¶

Future Enhancements¶

References¶