# Risk Assessment & Known Issues - Booking Workflow Plan

**Document Version:** 1.0
**Date:** January 9, 2025
**Status:** Pre-Implementation Review

---

## 🔴 Critical Risks

### 1. Race Conditions & Concurrency
**Risk Level:** HIGH - Could cause double bookings or data loss

**Issues:**
- User books appointment while background sync is running → duplicate or conflicting data
- Two admins approve same booking simultaneously → status conflicts
- Nextcloud event modified during sync → data inconsistency
- No database transaction handling in appointments API

**Mitigation Required:**
- Add database transaction locks for booking creation
- Implement optimistic locking with ETags for updates
- Add conflict resolution logic with "last write wins" or manual reconciliation
- Add unique constraints to prevent duplicates

**Missing from Plan:** Transaction handling completely absent

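One way to close the gap between the availability check and the insert is to let a uniqueness rule pick the winner. The sketch below models that idea in memory. `SlotStore`, `BookingRequest`, and the `(artistId, startUtc)` key are hypothetical names, and a real implementation would rely on a database unique constraint inside a transaction rather than a `Map`. It only rejects identical start times, so overlapping intervals would still need a range check.

```typescript
// Hypothetical types: the real appointments schema is not shown in the plan.
interface BookingRequest {
  artistId: string;
  clientId: string;
  startUtc: string; // ISO-8601 start time in UTC
}

type ClaimResult =
  | { ok: true; bookingId: number }
  | { ok: false; reason: "slot_taken" };

// Stands in for a table with a UNIQUE (artist_id, start_utc) constraint.
class SlotStore {
  private readonly slots = new Map<string, number>();
  private nextId = 1;

  // Check and insert happen in a single step, so no other request can slip
  // in between them. In SQL the unique constraint plays this role: the
  // second INSERT fails instead of creating a duplicate row.
  claim(req: BookingRequest): ClaimResult {
    const key = `${req.artistId}|${req.startUtc}`;
    if (this.slots.has(key)) {
      return { ok: false, reason: "slot_taken" };
    }
    const bookingId = this.nextId++;
    this.slots.set(key, bookingId);
    return { ok: true, bookingId };
  }
}

// Two clients race for the same slot: only the first claim succeeds.
const demoStore = new SlotStore();
const winner = demoStore.claim({ artistId: "artist-1", clientId: "user-1", startUtc: "2025-03-01T17:00:00Z" });
const loser = demoStore.claim({ artistId: "artist-1", clientId: "user-2", startUtc: "2025-03-01T17:00:00Z" });
```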
---

### 2. Authentication & Authorization Gaps
**Risk Level:** HIGH - Security vulnerability

**Issues:**
- Assumes `session.user.id` exists and matches the `appointments.client_id` format
- Admin role check duplicated on every page - error-prone
- No middleware protecting admin routes - easy to miss a check
- User table schema not verified in plan

**Mitigation Required:**
- Create authentication middleware for all admin routes
- Verify user schema has compatible `id` field
- Add comprehensive auth tests
- Use Next.js middleware for route protection

**Missing from Plan:** No middleware implementation, no schema verification

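Centralizing the admin check could look roughly like the pure function below, which a Next.js `middleware.ts` might call and then translate into `NextResponse.next()` or `NextResponse.redirect(...)`. The `/admin` prefix, the `/login` target, the `role` field, and the `Session` shape are all assumptions for illustration. The plan only states that `session.user.id` is expected to exist.

```typescript
// Hypothetical session shape: only `user.id` is mentioned in the plan.
interface Session {
  user?: { id: string; role?: string };
}

type GuardDecision =
  | { action: "allow" }
  | { action: "redirect"; to: string };

const ADMIN_PREFIX = "/admin"; // assumed route prefix
const ADMIN_ROLE = "admin"; // assumed role name

// Single place that decides whether a request may reach an admin route.
function guardRequest(pathname: string, session: Session | null): GuardDecision {
  const isAdminRoute =
    pathname === ADMIN_PREFIX || pathname.startsWith(`${ADMIN_PREFIX}/`);
  if (!isAdminRoute) {
    return { action: "allow" };
  }
  if (!session || !session.user) {
    // Not signed in: send to the (assumed) login page.
    return { action: "redirect", to: "/login" };
  }
  if (session.user.role !== ADMIN_ROLE) {
    // Signed in but not an admin: send back to the home page.
    return { action: "redirect", to: "/" };
  }
  return { action: "allow" };
}
```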
---

### 3. Background Sync Reliability
**Risk Level:** HIGH - Core functionality breaks

**Issues:**
- Worker failures are only logged - no alerts or retries
- Nextcloud down = all syncs fail with no recovery
- Network timeouts cause partial syncs
- 5-minute sync interval = 5-minute lag for critical status changes
- No queue for failed operations

**Mitigation Required:**
- Implement retry queue with exponential backoff
- Add Cloudflare Workers monitoring/alerting
- Create health check endpoint
- Consider webhook alternative to reduce lag
- Add dead letter queue for permanent failures

**Missing from Plan:** Retry mechanism, monitoring, alerting

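The retry and dead-letter behaviour can be sketched as follows. This is a minimal in-process illustration, not the plan's implementation: `RetryPolicy`, `syncWithRetry`, and the delay values are hypothetical. A cron-driven worker would more likely persist failed operations and retry them on a later run than sleep inside one invocation.

```typescript
interface RetryPolicy {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number;
  maxDelayMs: number;
}

// Delay before retry number `retry` (1-based): base * 2^(retry - 1), capped.
function backoffDelayMs(retry: number, policy: RetryPolicy): number {
  const raw = policy.baseDelayMs * Math.pow(2, retry - 1);
  return Math.min(raw, policy.maxDelayMs);
}

type SyncOutcome<T> =
  | { status: "ok"; value: T; attempts: number }
  | { status: "dead_letter"; error: unknown; attempts: number };

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `op` until it succeeds or the attempts are used up. Operations that
// never succeed are reported as "dead_letter" so the caller can park them.
async function syncWithRetry<T>(
  op: () => Promise<T>,
  policy: RetryPolicy,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<SyncOutcome<T>> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      const value = await op();
      return { status: "ok", value, attempts: attempt };
    } catch (err) {
      lastError = err;
      if (attempt < policy.maxAttempts) {
        await sleep(backoffDelayMs(attempt, policy));
      }
    }
  }
  return { status: "dead_letter", error: lastError, attempts: policy.maxAttempts };
}

// One initial try plus three retries, waiting 1 s, 2 s, then 4 s.
const examplePolicy: RetryPolicy = { maxAttempts: 4, baseDelayMs: 1000, maxDelayMs: 30000 };
```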
---

### 4. Email Notification Dependency
**Risk Level:** HIGH - User communication breaks

**Issues:**
- Entire workflow depends on email but marked as "TODO"
- Users/artists never know about status changes without email
- SMTP configuration might not be set
- No email templates defined
- No fallback if email fails

**Mitigation Required:**
- Implement email system BEFORE other phases
- Choose email provider (SendGrid, Postmark, AWS SES)
- Create email templates
- Add in-app notifications as backup
- Queue failed emails for retry

**Missing from Plan:** Email is Phase 3+ but should be Phase 1

---

## 🟡 Medium Risks

### 5. Status Detection Brittleness
**Risk Level:** MEDIUM - Incorrect status updates

**Issues:**
- Relies on "REQUEST:" prefix - artist could manually edit title
- External calendar events could be misidentified as bookings
- ical.js might not parse STATUS field correctly
- No validation that event belongs to booking system
- Magic string "REQUEST:" is hardcoded everywhere

**Mitigation Required:**
- Add unique identifier (UUID) in event description
- Validate event source before processing
- Add manual reconciliation UI for admins
- Move magic strings to constants
- Add event ownership verification

**Missing from Plan:** Event validation, reconciliation UI

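Ownership verification and the move away from scattered magic strings might be combined as in this sketch. The `X-BOOKING-ID:` marker, the `CalendarEvent` shape, and the rule that a missing prefix means "confirmed" are assumptions for illustration. Only the `REQUEST:` prefix itself comes from the plan.

```typescript
const REQUEST_PREFIX = "REQUEST:"; // title prefix named in the plan
const BOOKING_ID_MARKER = "X-BOOKING-ID:"; // hypothetical marker line in the description

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Simplified event: a real one would come from the parsed iCalendar data.
interface CalendarEvent {
  summary: string;
  description?: string;
}

type DetectedStatus = "pending" | "confirmed" | "not_a_booking";

// Returns the booking UUID embedded in the description, or null when the
// event carries no well-formed marker line.
function extractBookingId(event: CalendarEvent): string | null {
  const lines = (event.description ?? "").split(/\r?\n/);
  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed.startsWith(BOOKING_ID_MARKER)) {
      const candidate = trimmed.slice(BOOKING_ID_MARKER.length).trim();
      return UUID_PATTERN.test(candidate) ? candidate.toLowerCase() : null;
    }
  }
  return null;
}

// Events without a known booking id are ignored, so an external event whose
// title happens to start with "REQUEST:" is never treated as a booking.
function detectStatus(event: CalendarEvent, knownBookingIds: Set<string>): DetectedStatus {
  const id = extractBookingId(event);
  if (id === null || !knownBookingIds.has(id)) {
    return "not_a_booking";
  }
  // Assumption: the artist confirms a booking by removing the prefix.
  return event.summary.trim().startsWith(REQUEST_PREFIX) ? "pending" : "confirmed";
}
```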
---

### 6. CalDAV/Nextcloud Availability
**Risk Level:** MEDIUM - Degrades user experience

**Issues:**
- Nextcloud down = slow booking submission (waits for timeout)
- CalDAV credentials could expire without notice
- Network latency makes availability checks slow (the 300 ms debounce helps, but not enough)
- Multiple calendars per artist not supported
- Calendar URL format might vary by Nextcloud version

**Mitigation Required:**
- Add CalDAV health check endpoint
- Implement credential rotation monitoring
- Add faster timeout for availability checks (2-3 seconds max)
- Cache availability results briefly
- Test with different Nextcloud versions

**Missing from Plan:** Health checks, caching, timeout limits

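The timeout and short-lived cache could be sketched as below. `TtlCache` and `withTimeout` are hypothetical helpers, and the 30-second TTL in the example is an arbitrary illustration. Note that `Promise.race` only stops waiting: it does not cancel the underlying request, which would need something like an abort signal.

```typescript
// Small time-based cache. The clock is injectable so expiry can be tested.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Example wiring: cache availability answers for 30 s (arbitrary value).
const availabilityCache = new TtlCache<boolean>(30000);
```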
---

### 7. Performance & Scalability
**Risk Level:** MEDIUM - Won't scale beyond ~50 artists

**Issues:**
- Background worker syncs ALL artists every 5 minutes (expensive)
- Fetches 90-day event range every sync (slow with many bookings)
- No pagination on bookings DataTable (breaks with 1000+ bookings)
- Availability check fires on every form field change
- No incremental sync using sync-token

**Mitigation Required:**
- Implement incremental sync with sync-token (CalDAV supports this)
- Add pagination to bookings table
- Limit event range to 30 days with on-demand expansion
- Implement smarter caching for availability
- Consider syncing only changed calendars

**Missing from Plan:** Incremental sync, pagination, performance testing

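Incremental sync relies on the WebDAV `sync-collection` REPORT defined in RFC 6578. The sketch below only builds the request body. `buildSyncCollectionBody` is a hypothetical helper, and sending the request, parsing the multistatus response, and storing the returned token per calendar are left out. A null token requests a full initial sync.

```typescript
// Escapes the characters that are unsafe inside XML element content.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Builds the XML body of a sync-collection REPORT (RFC 6578).
// `syncToken` is the token returned by the previous sync, or null for the
// first sync, in which case the server reports every member of the calendar.
function buildSyncCollectionBody(syncToken: string | null): string {
  const tokenElement =
    syncToken === null
      ? "<d:sync-token/>"
      : `<d:sync-token>${escapeXml(syncToken)}</d:sync-token>`;
  return [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<d:sync-collection xmlns:d="DAV:">',
    `  ${tokenElement}`,
    "  <d:sync-level>1</d:sync-level>",
    "  <d:prop><d:getetag/></d:prop>",
    "</d:sync-collection>",
  ].join("\n");
}

const initialBody = buildSyncCollectionBody(null);
const incrementalBody = buildSyncCollectionBody("opaque-token-1");
```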
---

### 8. Timezone Edge Cases
**Risk Level:** MEDIUM - Wrong-time bookings

**Issues:**
- Hardcoded America/Denver prevents expansion
- Daylight Saving Time transitions not tested
- Date comparisons between systems are prone to timezone bugs
- User browser, server, and Nextcloud timezones may differ
- No verification that times are displayed correctly

**Mitigation Required:**
- Store all times in UTC internally
- Use date-fns-tz for ALL timezone operations
- Test DST transitions (spring forward, fall back)
- Add timezone to user preferences if expanding
- Display timezone clearly in UI

**Missing from Plan:** DST testing, UTC storage verification

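A DST test can be as small as formatting UTC instants on either side of a transition and checking the local wall-clock time. The plan names date-fns-tz. To stay dependency-free, this sketch uses the built-in `Intl.DateTimeFormat` instead, and `formatInZone` is a hypothetical helper. The dates are the 2025 US transitions (March 9 and November 2).

```typescript
const STUDIO_TZ = "America/Denver"; // timezone named in the plan

// Formats a UTC instant as local wall-clock time "YYYY-MM-DD HH:mm".
function formatInZone(instant: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    hourCycle: "h23",
  }).formatToParts(instant);
  const get = (type: string): string =>
    parts.find((p) => p.type === type)?.value ?? "";
  return `${get("year")}-${get("month")}-${get("day")} ${get("hour")}:${get("minute")}`;
}

// Spring forward: one minute after 01:59 MST the clock reads 03:00 MDT.
const beforeSpring = formatInZone(new Date("2025-03-09T08:59:00Z"), STUDIO_TZ);
const afterSpring = formatInZone(new Date("2025-03-09T09:00:00Z"), STUDIO_TZ);

// Fall back: two instants one hour apart show the same wall-clock time,
// which is why bookings should be stored and compared in UTC.
const firstPass = formatInZone(new Date("2025-11-02T07:30:00Z"), STUDIO_TZ);
const secondPass = formatInZone(new Date("2025-11-02T08:30:00Z"), STUDIO_TZ);
```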
---

### 9. Data Consistency & Integrity
**Risk Level:** MEDIUM - Data quality degrades

**Issues:**
- ETag conflicts if event updated simultaneously
- No global unique constraint on `caldav_uid` (only per artist)
- `calendar_sync_logs` will grow unbounded
- No validation on calendar URL format
- No cascade delete handling documented

**Mitigation Required:**
- Add global unique constraint on `caldav_uid`
- Implement log rotation (keep last 90 days)
- Validate calendar URLs with regex
- Add ETag conflict resolution
- Document cascade delete behavior

**Missing from Plan:** Constraints, log rotation, URL validation

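URL validation and the 90-day retention cutoff might look like this. The path pattern assumes the common Nextcloud layout `.../remote.php/dav/calendars/<user>/<calendar>/`, which should be checked against the Nextcloud versions actually in use. `isValidCalendarUrl` and `retentionCutoff` are hypothetical helpers.

```typescript
// Assumed layout: an optional install prefix, then
// /remote.php/dav/calendars/<user>/<calendar>/ with an optional trailing slash.
const CALENDAR_PATH = /^(?:\/[^/]+)*\/remote\.php\/dav\/calendars\/[^/]+\/[^/]+\/?$/;

// Accepts only HTTPS URLs whose path looks like a single calendar collection.
function isValidCalendarUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not parseable as an absolute URL
  }
  return url.protocol === "https:" && CALENDAR_PATH.test(url.pathname);
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Rows in calendar_sync_logs older than the returned instant can be deleted.
function retentionCutoff(now: Date, retentionDays: number = 90): Date {
  return new Date(now.getTime() - retentionDays * MS_PER_DAY);
}

const sampleCutoff = retentionCutoff(new Date("2025-04-10T00:00:00Z"));
```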
---

## 🟢 Low Risks (Nice to Have)

### 10. User Experience Gaps
**Issues:**
- No way to edit booking after submission
- No user-facing cancellation flow
- Confirmation page doesn't show sync status
- No booking history for users
- No real-time updates (5-min lag)

**Mitigation:** Add these as Phase 2 features post-launch

---

### 11. Admin Experience Gaps
**Issues:**
- No bulk operations in dashboard
- No manual reconciliation UI for conflicts
- No artist notification preferences
- No test connection button (only validates on save)

**Mitigation:** Add as Phase 3 enhancements

---

### 12. Testing Coverage
**Issues:**
- No automated tests (marked TODO)
- Manual checklist not integrated into CI/CD
- No load testing
- No concurrent booking tests

**Mitigation:** Add comprehensive test suite before production

---

### 13. Monitoring & Observability
**Issues:**
- No monitoring for worker failures
- Toast errors disappear on navigation
- No dashboard for sync health
- No Sentry or error tracking

**Mitigation:** Add monitoring in Phase 4

---

### 14. Deployment & Operations
**Issues:**
- Workers cron needs separate deployment
- No staging strategy
- No migration rollback plan
- Environment variables not documented

**Mitigation:** Create deployment runbook

---

## 🔧 Technical Debt & Limitations

### 15. Architecture Limitations
- Single Nextcloud credentials (no per-artist OAuth)
- One calendar per artist only
- No recurring appointments
- No multi-day appointments
- No support for artist breaks/vacations

### 16. Code Quality Issues
- Admin role checks duplicated (should be middleware)
- Magic strings not in constants
- No API versioning
- No TypeScript strict mode mentioned

### 17. Missing Features (Known)
- Email notifications (CRITICAL)
- Automated tests (CRITICAL)
- Background worker deployment (CRITICAL)
- Booking edit flow
- User cancellation
- Webhook support
- In-app notifications
- SMS option

---

## 🚨 Showstopper Scenarios

### Scenario 1: Nextcloud Down During Peak Hours
**Impact:** Users book but syncs fail → artists don't see bookings
**Current Plan:** Fallback to DB-only
**Gap:** No retry queue when Nextcloud returns
**Required:** Implement sync queue

### Scenario 2: Background Worker Stops
**Impact:** No Nextcloud→Web sync, status changes invisible
**Current Plan:** Worker runs but no monitoring
**Gap:** No alerts if worker dies
**Required:** Health monitoring + alerting

### Scenario 3: Double Booking
**Impact:** Two users book same slot simultaneously
**Current Plan:** Availability check before booking
**Gap:** Race condition between check and create
**Required:** Transaction locks

### Scenario 4: Email System Down
**Impact:** Zero user/artist communication
**Current Plan:** Email marked as TODO
**Gap:** No fallback communication method
**Required:** Email + in-app notifications

### Scenario 5: DST Transition Bug
**Impact:** Appointments booked 1 hour off
**Current Plan:** Use date-fns-tz
**Gap:** No DST testing mentioned
**Required:** DST test suite

---

## 📋 Pre-Launch Checklist

### ✅ Must-Have (Blocking)
1. [ ] Implement email notification system with templates
2. [ ] Add authentication middleware for admin routes
3. [ ] Implement retry queue for failed syncs
4. [ ] Add transaction handling to appointments API
5. [ ] Deploy and test background worker
6. [ ] Verify timezone handling with DST tests
7. [ ] Add monitoring and alerting (Cloudflare Workers analytics + Sentry)
8. [ ] Write critical path tests (booking flow, sync flow)
9. [ ] Create deployment runbook
10. [ ] Set up staging environment with test Nextcloud

### ⚠️ Should-Have (Important)
- [ ] Rate limiting on booking endpoint
- [ ] CSRF protection verification
- [ ] Calendar URL validation with regex
- [ ] Sync log rotation (90-day retention)
- [ ] Admin reconciliation UI for conflicts
- [ ] User booking history page
- [ ] Load test background worker (100+ artists)
- [ ] Global unique constraint on caldav_uid

### 💚 Nice-to-Have (Post-Launch)
- [ ] Webhook support for instant sync (eliminate 5-min lag)
- [ ] In-app real-time notifications (WebSockets)
- [ ] User edit/cancel flows
- [ ] Bulk admin operations
- [ ] Multiple calendars per artist
- [ ] SMS notification option
- [ ] Recurring appointment support

---

## 🎯 Revised Implementation Order

### Phase 0: Critical Foundation (NEW - REQUIRED FIRST)
**Duration:** 2-3 days
**Blockers:** Authentication, email, transactions

1. Add authentication middleware to protect admin routes
2. Verify user schema matches `appointments.client_id`
3. Add transaction handling to appointments API
4. Choose and set up email provider (SendGrid recommended)
5. Create basic email templates
6. Add error tracking (Sentry)

**Acceptance Criteria:**
- Admin routes redirect unauthorized users
- Email sends successfully in dev
- Transaction prevents double bookings
- Errors logged to Sentry

---

### Phase 1: Core Booking Flow ✅ (As Planned)
**Duration:** 3-4 days
**Dependencies:** Phase 0 complete

1. Booking form submission with React Query
2. Confirmation page with timezone display
3. CalDAV sync on booking creation
4. Email notification on booking submission

**Acceptance Criteria:**
- User can submit booking
- Booking appears in Nextcloud with REQUEST: prefix
- User receives confirmation email
- Toast shows success/error

---

### Phase 2: Admin Infrastructure ✅ (As Planned)
**Duration:** 3-4 days
**Dependencies:** Phase 1 complete

1. Calendar configuration UI
2. Bookings DataTable with filters
3. Approve/reject actions
4. Status sync to Nextcloud

**Acceptance Criteria:**
- Admin can link calendars
- Admin sees pending bookings
- Approve updates status + Nextcloud
- Email sent on status change

---

### Phase 3: Background Sync ⚠️ (Enhanced)
**Duration:** 4-5 days
**Dependencies:** Phase 2 complete

1. Smart status detection logic
2. Background worker implementation
3. **NEW:** Retry queue for failed syncs
4. **NEW:** Health check endpoint
5. **NEW:** Cloudflare Workers monitoring

**Acceptance Criteria:**
- Worker runs every 5 minutes
- Status changes detected from Nextcloud
- Failed syncs retry 3 times
- Alerts sent on persistent failures
- Health check returns sync status

---

### Phase 4: Production Hardening (NEW - CRITICAL)
**Duration:** 3-4 days
**Dependencies:** Phase 3 complete

1. Comprehensive error handling
2. Rate limiting (10 bookings/user/hour)
3. DST timezone testing
4. Load testing (100 artists, 1000 bookings)
5. Monitoring dashboard
6. Sync log rotation
7. Admin reconciliation UI

**Acceptance Criteria:**
- All errors handled gracefully
- Rate limits prevent abuse
- DST transitions work correctly
- Worker handles load without issues
- Admins can see sync health
- Logs don't grow unbounded

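The 10 bookings/user/hour limit could be enforced with a sliding-window counter like the sketch below. `BookingRateLimiter` is a hypothetical name, and the in-memory `Map` is for illustration only. Since worker instances do not share memory, a real deployment would need a shared store for the counters.

```typescript
// Sliding-window limiter: at most `limit` bookings per user per window.
class BookingRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly history = new Map<string, number[]>();

  constructor(limit: number = 10, windowMs: number = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the attempt if the user is under the limit,
  // false if the booking should be rejected.
  tryAcquire(userId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only the timestamps that are still inside the window.
    const recent = (this.history.get(userId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.history.set(userId, recent);
      return false;
    }
    recent.push(nowMs);
    this.history.set(userId, recent);
    return true;
  }
}

// Ten bookings pass, the eleventh inside the same hour is rejected.
const limiter = new BookingRateLimiter();
const results: boolean[] = [];
for (let i = 0; i < 11; i++) {
  results.push(limiter.tryAcquire("user-1", i));
}
```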
---

### Phase 5: Staging & Launch 🚀
**Duration:** 2-3 days
**Dependencies:** Phase 4 complete

1. Deploy to staging with test Nextcloud
2. Run full test suite
3. Load test in staging
4. Security review
5. Deploy to production
6. Monitor for 48 hours

**Acceptance Criteria:**
- All tests pass in staging
- No critical errors in 24h staging run
- Security review approved
- Production deploy successful
- Zero critical issues in first 48h

---

## 💡 Recommendations

### Before Starting Implementation

**Critical Decisions Needed:**
1. ✅ Which email provider? (Recommend: SendGrid or Postmark)
2. ✅ Confirm user schema structure
3. ✅ Set up staging Nextcloud instance
4. ✅ Choose error tracking (Sentry vs Cloudflare Logs)
5. ✅ Define rate limits for bookings

**Infrastructure Setup:**
1. Create staging environment
2. Set up Nextcloud test instance
3. Configure email provider
4. Set up error tracking
5. Document all environment variables

---

### During Implementation

**Code Quality:**
1. Add TypeScript strict mode
2. Create constants file for magic strings
3. Write tests alongside features
4. Add comprehensive JSDoc comments
5. Use auth middleware everywhere

**Testing Strategy:**
1. Unit tests for sync logic
2. Integration tests for booking flow
3. E2E tests for critical paths
4. Load tests for background worker
5. DST timezone tests

---

### After Implementation

**Operations:**
1. Create runbook for common issues
2. Train staff on admin dashboards
3. Set up monitoring alerts (PagerDuty/Slack)
4. Document troubleshooting steps
5. Plan for scaling (if needed)

**Monitoring:**
1. Track booking success rate (target: >99%)
2. Track sync success rate (target: >95%)
3. Track email delivery rate (target: >98%)
4. Monitor worker execution time (target: <30s)
5. Alert on 3 consecutive sync failures

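The "alert on 3 consecutive sync failures" rule can be expressed as a small counter. `SyncFailureTracker` is a hypothetical name. How the alert is delivered (PagerDuty, Slack, or otherwise) and where the counter is persisted between worker runs are left open.

```typescript
// Tracks consecutive sync failures and signals when an alert is due.
class SyncFailureTracker {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Records one sync result. Returns true only on the run where the failure
  // streak first reaches the threshold, so a long outage raises one alert
  // rather than one per run.
  record(success: boolean): boolean {
    if (success) {
      this.consecutiveFailures = 0;
      return false;
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures === this.threshold;
  }
}

// fail, fail, ok, fail, fail, fail, fail
const tracker = new SyncFailureTracker();
const alerts = [false, false, true, false, false, false, false].map((ok) =>
  tracker.record(ok),
);
```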
---

## 📊 Risk Summary

| Category | Critical | Medium | Low | Total |
|----------|----------|--------|-----|-------|
| Bugs/Issues | 4 | 5 | 5 | 14 |
| Missing Features | 3 | 2 | 8 | 13 |
| Technical Debt | 2 | 3 | 5 | 10 |
| **TOTAL** | **9** | **10** | **18** | **37** |

**Showstoppers:** 5 scenarios requiring mitigation
**Blocking Issues:** 9 must-fix before production
**Estimated Additional Work:** 8-10 days (new Phase 0 + Phase 4)

---

## ✅ Next Steps

1. **Review this document with team** - Discuss acceptable risks
2. **Prioritize Phase 0 items** - Authentication + email are blocking
3. **Set up infrastructure** - Staging env, email provider, monitoring
4. **Revise timeline** - Add 8-10 days for hardening phases
5. **Get approval** - Confirm scope changes are acceptable
6. **Begin Phase 0** - Don't skip the foundation!

---

**Document Status:** Ready for Review
**Requires Action:** Team discussion and approval before proceeding