Deep Dive — SafeLLM / Fall Risk Detection AI System (safellm3/safellm_deploy)
Scope note: this workspace contains multiple iterations (safellm/, safellm2/, safellm3/). The only directory with git history is safellm3/safellm_deploy/ (contains .git/), so this Deep Dive treats that as the “repo” for evidence and history.
1. What This Is (one paragraph)
SafeLLM is a deployable web app + API that takes a single photo of a home environment, classifies the scene into one of 11 fall‑risk categories, retrieves fall‑prevention guidelines from a small curated knowledge base, and returns a structured safety report (score, hazards, prioritized actions, cost/difficulty) plus an optional AI‑generated “visual improvements” image that overlays the recommended fixes. The repo also contains extracted guideline documents (e.g., CDC STEADI PDFs → markdown) under knowledge_base/processed/ for provenance/transparency.
2. Who It’s For + Use Cases
Primary users (as described in the repo docs):
- Family caregivers assessing an elderly parent’s home for preventable fall hazards.
- Clinicians / discharge planners doing a quick home safety pre‑screen.
- Home modification services triaging what to fix first and estimating effort/cost.
- Real estate / property managers evaluating accessibility and safety.
What “success” means (repo evidence + gaps):
- Success (evidenced): the system returns a structured report and can boot/build/test reliably (run_full_deploy_test.sh).
- Success (inferred but not measured in repo): fewer missed hazards, fewer hallucinated hazards, actionable fixes, low latency, low cost.
- Unknown (not found in repo): defined product metrics (accuracy, NPS, retention, clinical outcomes). Suggested metrics are in §8.
3. Product Surface Area (Features)
A. End‑user web experience (React)
- Upload photo → POST /assess (multipart file) from the UI (frontend/src/App.jsx:27).
- Live “Analyzing…” UI while waiting (non‑streaming; one blocking request).
- Structured results page:
- Score + risk level
- Hazard lists (critical/important/minor)
- Priority action plan
- Cost + difficulty
- Knowledge Base References section (shows which guidelines were retrieved and match %)
- Visual Safety Improvements (polls the backend until the edited image is ready) (frontend/src/components/Results.jsx:43)
- Print report via window.print() (frontend‑only).
B. Backend API (FastAPI)
User‑visible endpoints:
- GET / serves the built frontend (frontend/dist/) if present, else returns API info (backend/api.py:235).
- GET /health returns workflow status + active model configuration (backend/api.py:267).
- POST /assess runs the core pipeline (steps 0–3 sync; step 4 async) (backend/api.py:290).
- POST /scene-detect runs scene classification only (backend/api.py:529).
- GET /categories returns supported scene categories (backend/api.py:587).
- GET /stats returns curated KB chunk counts by category (backend/api.py:609).
- GET /edit_status/{image_id} returns async image-edit job status (backend/api.py:642).
- GET /edited_images/{image_id}_edited.png serves the edited image (backend/api.py:633).
C. Knowledge base tooling
- Curated KB stored as structured markdown under knowledge_base/curated_knowledge/ and compiled into JSONL (knowledge_base/curated_chunks/metadata.jsonl) via knowledge_base/process_curated_knowledge.py.
- FAISS index built via knowledge_base/create_curated_embeddings.py (OpenAI embeddings).
- KB linting via knowledge_base/kb_lint.py + unit tests in tests/test_knowledge_base.py.
D. Deployment/build tooling
- Dockerfile builds the frontend and runs the backend (Cloud Run‑style PORT support).
- run_full_deploy_test.sh simulates a deploy build: venv deps → pytest → frontend build.
- start_server.sh and start_server.bat for local startup (shell + Windows).
Constraints / caveats (evidence vs unknown):
- No auth / no user accounts (evidenced by code search; no auth middleware; see §7).
- Docs drift: multiple READMEs describe older model choices and paths (e.g., gpt‑4o vs gemini/gpt‑5; different env var names). Evidence is in file-level references in §11.
4. Architecture Overview
A. Components (text diagram)
Browser (React + Vite)
- Uploads image to backend (/assess)
- Renders structured JSON results
- Polls /edit_status for async edited image
|
v
FastAPI backend (backend/api.py)
- Saves upload -> uploads/<uuid>.<ext>
- Normalizes EXIF orientation + downscales large photos
- Runs workflow steps 0–3 synchronously
- Spawns async background task for step 4 (image edit)
|
v
LangGraph-style workflow implementation (backend/workflow.py)
Step 1: Scene detection (Gemini or OpenAI)
Step 2: Hybrid retrieval (FAISS + BM25) over curated KB
Step 3: Safety assessment (Gemini or OpenAI) -> strict JSON
Step 4: Image editing (Gemini Image / OpenRouter / OpenAI Images) -> edited_images/<uuid>_edited.png
B. Repo inventory (top 2–3 levels, focus on runtime)
safellm3/safellm_deploy/
backend/
api.py # FastAPI server + endpoints
workflow.py # 4-step workflow + providers + async image edit
frontend/
src/ # React UI (upload/results/polling)
knowledge_base/
curated_knowledge/ # human-authored markdown hazards by scene
curated_chunks/ # JSONL metadata for 105 curated chunks
curated_embeddings/ # FAISS index used at runtime
processed/ # extracted source docs (CDC PDFs -> markdown) for transparency
prompts/
scene_detection_prompt.py
safety_assessment_prompt.py
image_editing_prompt.py
tests/
test_knowledge_base.py
Dockerfile
requirements.txt
run_full_deploy_test.sh
start_server.sh
test_frontend.py # manual integration script (skipped under pytest)
C. Key modules (what they do / why they matter)
- backend/api.py: FastAPI app that owns the HTTP contract (uploads, responses, polling) and also serves the built SPA in production; it’s the main deployable surface and where reliability controls (EXIF fixes, cleanup, async jobs) live.
- backend/workflow.py: Core orchestration for the 4-step pipeline (provider selection, determinism, retrieval wiring, image-edit generation); this is where most AI behavior is defined.
- knowledge_base/curated_retrieval.py: Hybrid FAISS+BM25 retrieval that grounds the LLM in a small, scene-filtered knowledge base; it strongly shapes output relevance and consistency.
- knowledge_base/process_curated_knowledge.py: “Compiler” from curated markdown → structured JSONL chunks; enforces enumerations (risk levels, hazard types) and creates stable IDs for retrieval.
- knowledge_base/create_curated_embeddings.py: Builds the FAISS index used at runtime; without it, retrieval cannot load.
- prompts/scene_detection_prompt.py: Defines the scene classifier output shape and allowed categories; constrains LLM1 to avoid adding noise.
- prompts/safety_assessment_prompt.py: Defines the strict JSON schema + scoring conventions for LLM2; the primary control surface for hallucination and output stability.
- prompts/image_editing_prompt.py: Converts a small structured “edit plan” into a constrained natural-language image prompt; drives consistent visuals across providers.
- frontend/src/App.jsx: Upload handler and environment-based API routing (VITE_API_BASE vs localhost); defines the user flow into /assess.
- frontend/src/components/Results.jsx: Results rendering and async polling for the edited image (/edit_status/{image_id}); defines the post-upload UX.
- tests/test_knowledge_base.py: Unit tests that protect curated KB processing/validation from regressions.
- run_full_deploy_test.sh: Repeatable “deploy simulation” (deps → pytest → build) that makes build confidence auditable.
- test_frontend.py: Manual end-to-end script (uploads real images and polls image edits); useful for smoke testing but intentionally skipped in CI-style pytest runs.
D. Key runtime assumptions
- The backend is single-process and keeps job status in memory (JOBS = {} in backend/api.py:154), so pending image edits are not durable across restarts.
- File storage for uploads/edited images is local disk; cleanup is best-effort (24h window) (backend/api.py:60, called on startup and after a successful /assess).
- External providers must be reachable for /assess to complete fully (OpenAI embeddings always; Gemini/OpenAI for LLMs; optional OpenRouter/OpenAI Images for step 4).
5. Data Model
There is no database in this deployable repo. Data is stored as:
A. Runtime request state (in-memory)
- Per-request workflow state is a Python dict matching WorkflowState in backend/workflow.py (contains image_base64, scene_category, retrieved_knowledge, hazards, etc.; a partial sketch follows this list).
- Async image-editing job status is stored in an in-memory dict JOBS (backend/api.py:154; /edit_status/{image_id} at backend/api.py:642).
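For orientation, a minimal sketch of that state shape (field names come from the list above; the types and the class name are assumptions, and the real WorkflowState in backend/workflow.py carries more fields):

```python
from typing import Any, TypedDict

class WorkflowStateSketch(TypedDict, total=False):
    """Illustrative subset of the per-request state dict; not the full definition."""
    image_base64: str                             # uploaded photo, base64-encoded
    scene_category: str                           # Step 1 output (one of the 11 scene categories)
    retrieved_knowledge: list[dict[str, Any]]     # Step 2: curated chunks returned by retrieval
    hazards: list[dict[str, Any]]                 # Step 3: structured hazards from the safety assessment
```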
B. Runtime files (local disk)
- uploads/<uuid>.<ext>: incoming images (kept at least long enough for the async edit to run).
- edited_images/<uuid>_edited.png: the “visual improvements” output image.
- Cleanup: files older than 24h are deleted on startup and after a successful /assess (backend/api.py:60, backend/api.py:183, backend/api.py:370).
C. Curated Knowledge Base (static artifacts in repo)
- knowledge_base/curated_knowledge/*.md: scene-specific hazard lists (structured markdown).
- knowledge_base/curated_chunks/metadata.jsonl: 105 lines (one per chunk) with fields like chunk_id, category, hazard_name, risk_level, keywords, hazard_types, version. (The schema is visible by reading any JSONL line; see the processor in knowledge_base/process_curated_knowledge.py and the inspection sketch below.)
- knowledge_base/curated_embeddings/faiss_index/: FAISS index built from the curated chunks.
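A quick way to confirm the chunk schema locally (this only reads the first JSONL line; the field names listed above come from the processor, not from this sketch):

```python
import json
from pathlib import Path

# Print the field names of the first curated chunk to confirm the schema
# (expected keys include chunk_id, category, hazard_name, risk_level, ...).
first_line = Path("knowledge_base/curated_chunks/metadata.jsonl").read_text(encoding="utf-8").splitlines()[0]
chunk = json.loads(first_line)
print(sorted(chunk.keys()))
```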
D. “Raw/processed” source docs (transparency / provenance)
- knowledge_base/raw_documents/ and knowledge_base/processed/: downloaded PDFs (e.g., CDC STEADI) and extracted markdown by category. Example file: knowledge_base/processed/indoor/kitchen/cdc_81518_DS1_extracted.md contains per-page text plus a metadata header.
- Note: the “raw documents” pipeline in this deployable folder references text_extractor.py (knowledge_base/process_documents.py:18), which is not present in safellm3/safellm_deploy/ (it exists in sibling directories). Running that pipeline here is Unknown (likely broken without copying that file).
6. AI System Design (if applicable)
A. Knowledge ingestion & curation
Two parallel concepts exist in this repo:
- Curated knowledge (used at runtime)
  - Source: knowledge_base/curated_knowledge/*.md (e.g., knowledge_base/curated_knowledge/kitchen_safety.md).
  - Processing: knowledge_base/process_curated_knowledge.py parses markdown sections into structured JSONL with standardized:
    - risk levels (CRITICAL|HIGH|MEDIUM-HIGH|MEDIUM|LOW-MEDIUM|LOW) (knowledge_base/process_curated_knowledge.py:20)
    - hazard types enumeration (slip|trip|visibility|accessibility|structural|clutter|electrical|weather) (knowledge_base/process_curated_knowledge.py:23)
  - Output: knowledge_base/curated_chunks/metadata.jsonl + FAISS embeddings index.
- Downloaded source docs (mostly for provenance / auditability)
  - Sources are declared in knowledge_base/config.py (DOCUMENT_SOURCES includes CDC STEADI URLs).
  - Processed markdown under knowledge_base/processed/... includes metadata such as source_key, filename, and processed_at.
  - This path is not referenced by the runtime retrieval implementation in backend/workflow.py (runtime uses CuratedHybridRetriever).
B. Embeddings
Runtime retriever uses OpenAI embeddings:
- Embedding model: text-embedding-3-small (knowledge_base/curated_retrieval.py:32).
- Vector store: FAISS, loaded from disk (knowledge_base/curated_retrieval.py:39).
Operational impact:
- Every retrieval likely requires embedding the query (cost + latency).
- Optimization opportunity: the current retrieval query is category-driven and repeated per scene ("{category} fall safety hazards and improvements for elderly"), so embeddings can be cached per category (see the sketch below).
C. Retrieval (vector + keyword hybrid)
Retriever implementation: knowledge_base/curated_retrieval.py:
- Loads all curated chunks from knowledge_base/curated_chunks/metadata.jsonl (knowledge_base/curated_retrieval.py:28).
- Builds a BM25 index over hazard_name + content + keywords (knowledge_base/curated_retrieval.py:68–93).
- Hybrid scoring defaults (a combination sketch follows this list):
  - vector_weight=0.6, bm25_weight=0.4 (knowledge_base/curated_retrieval.py:191–192)
  - score threshold filtering min_score_threshold=0.3 (knowledge_base/curated_retrieval.py:193)
  - adaptive threshold exists but is disabled by default for consistency (use_adaptive_threshold=False) (knowledge_base/curated_retrieval.py:194)
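A sketch of how those defaults could combine per-chunk scores (illustrative only; the implementation in knowledge_base/curated_retrieval.py may normalize and threshold differently):

```python
import numpy as np

def hybrid_scores(vector_sims: np.ndarray, bm25_scores: np.ndarray,
                  vector_weight: float = 0.6, bm25_weight: float = 0.4,
                  min_score_threshold: float = 0.3) -> np.ndarray:
    """Weighted blend of vector similarity and BM25, with a floor filter."""
    def minmax(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x, dtype=float)

    combined = vector_weight * minmax(vector_sims) + bm25_weight * minmax(bm25_scores)
    # Chunks below the threshold are dropped from the candidate set.
    return np.where(combined >= min_score_threshold, combined, -np.inf)
```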
Runtime retrieval in the workflow (backend/workflow.py):
- Uses a fixed k=5 “consistency mode” (backend/workflow.py:551).
- Uses a category-driven query and filters by the detected category (backend/workflow.py:557).
- Tradeoff: more deterministic retrieval vs less image-specific retrieval.
D. Generation
Step 1 — Scene detection (LLM1)
- Prompt: prompts/scene_detection_prompt.py produces strict JSON with scene_category and confidence.
- Provider selection is model-name based (a minimal routing sketch follows this list):
  - gemini-* → Gemini (backend/workflow.py:207)
  - otherwise → OpenAI (backend/workflow.py:207)
- Robustness mechanisms (from code + git history):
  - Strict JSON schema for Gemini outputs + fallback parsing when JSON decode fails.
  - Git history shows multiple fixes for “iPhone direct camera upload” and Gemini empty responses (e.g., commit f1b6c43).
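A minimal sketch of that naming convention (the helper name is hypothetical; the real routing lives in backend/workflow.py):

```python
import os

def resolve_provider(model_name: str) -> str:
    """Route a model call by name: gemini-* goes to Gemini, everything else to OpenAI."""
    return "gemini" if model_name.startswith("gemini-") else "openai"

print(resolve_provider(os.getenv("SCENE_DETECTION_MODEL", "gemini-2.5-flash")))  # -> "gemini"
```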
Step 3 — Safety assessment (LLM2)
- Prompt: prompts/safety_assessment_prompt.py defines a strict JSON schema and scoring rules.
- Provider selection mirrors Step 1 (backend/workflow.py:208).
- The backend performs basic “post-LLM” validation and attaches validation_warnings (citation count, summary length, hazard counts) (backend/workflow.py:843–895).
Important consistency note:
- The current prompt/schema includes internal contradictions (e.g., schema enumerations that force confidence="high" while other parts mention low/medium; some post-processing expects different hazard field names). This does not crash normal paths at runtime, but it is a maintainability risk (§9).
E. Visual feedback (image editing; Step 4)
Step 4 is optional and runs asynchronously in the API:
- /assess schedules a background task and returns immediately (backend/api.py:405).
- Job status is polled via /edit_status/{image_id} (backend/api.py:642); a polling sketch appears at the end of this subsection.
How the image edit is generated (backend/workflow.py:969+):
- Ask LLM2 for a small editing plan (JSON; ≤3 annotations) (backend/workflow.py:980+).
- Convert that plan into a constrained natural-language image prompt (build_prompt_from_plan) (backend/workflow.py:1085, prompts/image_editing_prompt.py:150).
- Call one of:
  - Gemini image generation (if IMAGE_EDIT_MODEL starts with gemini-)
  - OpenRouter image generation (if IMAGE_EDIT_MODEL contains /) (backend/workflow.py:1142; POST https://openrouter.ai/api/v1/chat/completions at backend/workflow.py:1188)
  - OpenAI Images edit API (client.images.edit) (backend/workflow.py:1272)
- Save the output to edited_images/<image_id>_edited.png and serve it via a backend route (backend/workflow.py:1310+, backend/api.py:633).
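A minimal client-side polling sketch mirroring what Results.jsx does (the status values and poll interval here are assumptions; check the actual /edit_status response shape):

```python
import time

import requests

API = "http://localhost:8765"

def wait_for_edited_image(image_id: str, timeout_s: float = 120.0) -> str | None:
    """Poll the async edit job until it finishes; return the edited-image URL or None."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        job = requests.get(f"{API}/edit_status/{image_id}", timeout=10).json()
        if job.get("status") == "completed":   # assumed terminal state names
            return f"{API}/edited_images/{image_id}_edited.png"
        if job.get("status") == "error":
            return None
        time.sleep(2)                          # simple fixed-interval polling
    return None
```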
F. Evaluation (what exists)
Evidenced evaluation/testing assets:
- Unit tests for curated knowledge processing + linting: tests/test_knowledge_base.py.
- A manual end-to-end integration script (skipped under pytest) that uploads test images and polls edit status: test_frontend.py (pytest.mark.skip at test_frontend.py:16).
- Sample test images + saved API outputs: test_images/bathroom2.jpg, test_images/bathroom2_api_result.json, etc.
Unknown (not found in repo):
- Formal offline eval harness (gold labels, precision/recall, calibration).
- Regression tests for hallucination rate or grounding quality.
7. Reliability, Security, and Privacy
Reliability & correctness mechanisms
- EXIF orientation normalization + downscaling on upload (addresses iPhone camera photos and reduces model token load): backend/api.py:94 with MAX_IMAGE_DIMENSION default 2048 (backend/api.py:91). A sketch follows this list.
- Retry logic for the Gemini safety assessment (handles transient empty responses): in backend/workflow.py Step 3 (Gemini path).
- Async image editing is non-fatal: Step 4 failures do not fail the whole assessment (backend/workflow.py:1320+).
- Disk growth control: deletes uploads/edited images older than 24h (backend/api.py:60).
- Determinism controls:
  - DETERMINISTIC_MODE + LLM_SEED (backend/workflow.py:225–226)
  - fixed retrieval k=5 and a category-driven query (backend/workflow.py:551–557)
  - strict JSON schemas in prompts (prompt-level determinism)
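A sketch of the EXIF + downscale step, assuming Pillow (the actual helper is _normalize_image_orientation in backend/api.py; this only illustrates the approach):

```python
from PIL import Image, ImageOps

MAX_IMAGE_DIMENSION = 2048  # documented default

def normalize_upload(path: str) -> None:
    """Bake EXIF rotation into the pixels and cap the longest side, in place."""
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)  # fixes sideways/upside-down iPhone photos
    img.thumbnail((MAX_IMAGE_DIMENSION, MAX_IMAGE_DIMENSION))  # preserves aspect ratio
    img.save(path)
```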
Security posture (current)
- No authentication / authorization. Anyone with network access to the backend can call /assess.
- CORS is fully open (allow_origins=["*"]) (backend/api.py:52), which is convenient for demos but not safe for a public deployment without further controls.
- File upload validation checks that the MIME type starts with image/ (backend/api.py:308) but does not enforce a server-side size limit (the frontend UI suggests “max 10MB”, but the backend does not appear to enforce it; a hypothetical guard is sketched below).
- Prompt injection risk exists via image content (e.g., text in the image). Mitigations in the repo are mostly prompt-level constraints + strict JSON output; there is no explicit “prompt injection sanitizer” or allowlist of visual evidence.
- In-memory job state means attackers could cause memory growth by spamming /assess (no rate limit).
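A hypothetical server-side guard for the missing size limit (not present in the repo; the rest of the endpoint body is elided):

```python
from fastapi import FastAPI, File, HTTPException, UploadFile

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # mirror the "max 10MB" hint shown by the UI

app = FastAPI()

@app.post("/assess")
async def assess(file: UploadFile = File(...)):
    data = await file.read()
    if len(data) > MAX_UPLOAD_BYTES:
        raise HTTPException(status_code=413, detail="Image exceeds the 10MB limit")
    if not (file.content_type or "").startswith("image/"):
        raise HTTPException(status_code=415, detail="Only image uploads are accepted")
    ...  # the existing pipeline would run here
```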
Privacy & data handling
- Local disk stores uploaded images and edited images at least temporarily (uploads are retained for async editing; best-effort cleanup after 24h).
- Third-party processing:
- OpenAI: embeddings (always) and optionally vision + image editing depending on env config.
- Google Gemini: vision LLM steps if configured.
- OpenRouter: image generation if configured.
- Repo docs claim “images are not stored permanently” (README.md), but the server implementation stores them on disk and cleans them up later. Treat “not stored permanently” as “not intended to be retained long-term,” not as “never written to disk.”
8. Performance & Cost
What is evidenced
- Build/test performance: run_full_deploy_test.sh completes quickly locally; the frontend build outputs ~209 kB JS (~65 kB gzipped) (Vite build output).
- The backend prints timings for workflow steps 0–3 on each /assess (backend/api.py prints a timing summary; not captured here).
- The upload pipeline includes image downscaling to reduce Gemini token usage (the code explicitly mentions truncated-JSON risk) (backend/api.py:91–123).
What is unknown (not measured in repo)
- p50/p95 latency for /assess and for async image-edit completion.
- p50/p95 cost per image (varies by provider + model).
- Accuracy metrics: scene classification accuracy, hazard precision/recall, hallucination rate.
Suggested metrics to add (high leverage)
- Latency: per-step timings are already captured in state (timing_* in the API response); emit them to logs as structured JSON and track p50/p95 (see the sketch after this list).
- Cost: record model + token usage per request (OpenAI usage fields; Gemini usage metadata) and track cost per request.
- Quality:
- Scene classification confusion matrix using a labeled test set.
- Hazard extraction: precision/recall against a rubric for each scene.
- Citation/grounding: fraction of hazards explicitly tied to retrieved guidelines.
- Reliability: /assess error rate by provider; async image-edit failure rate.
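A sketch of the logging side of this, assuming the timing_* fields in the /assess response (the function and logger names here are illustrative):

```python
import json
import logging

logger = logging.getLogger("safellm.metrics")

def log_assess_metrics(response: dict, provider: str, status: str) -> None:
    """Emit one structured log line per /assess call so p50/p95 can be derived downstream."""
    record = {k: v for k, v in response.items() if k.startswith("timing_")}
    record.update({"provider": provider, "status": status})
    logger.info(json.dumps(record))
```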
9. Hardest Problems + Key Tradeoffs
- High-resolution mobile photos vs structured JSON reliability
  - Problem: large images can increase model token usage and cause truncated/malformed JSON from Gemini.
  - Mitigation: EXIF normalization + downscale to MAX_IMAGE_DIMENSION (default 2048) (backend/api.py:91–123).
  - Tradeoff: downscaling may hide small hazards (e.g., cords, subtle step edges).
- Determinism vs image-specific retrieval
  - Chosen: fixed k=5 and a category-driven query (backend/workflow.py:551–557) to stabilize retrieval + outputs.
  - Tradeoff: retrieval may miss hazards that an image-driven query would surface; relies on LLM2 to pick relevant guidelines from a smaller set.
- Multi-provider orchestration (Gemini + OpenAI + OpenRouter)
  - Enables picking the best model per subtask (vision vs images vs embeddings).
  - Tradeoff: operational complexity (3 keys, 3 SLAs, different failure modes).
- Async image-edit UX vs durable job system
  - Current: FastAPI BackgroundTasks + in-memory JOBS store.
  - Tradeoff: simplest implementation, but not durable across restarts and not horizontally scalable without shared state.
- Security simplicity vs production hardening
  - Current: CORS wildcard, no auth, no rate limiting.
  - Tradeoff: easy demo deployment vs exposure risk if publicly reachable.
- Docs maintainability vs rapid iteration
  - Many docs describe earlier versions (gpt‑4o, different folder roots, different env vars).
  - Tradeoff: faster iteration, but increased onboarding and interview/due-diligence risk because “the code says X, docs say Y.”
10. Operational Guide (Repro & Deploy)
A. Local setup (tested)
From safellm3/safellm_deploy/:
- Run the “deploy simulation” script (installs deps into .venv, runs pytest, builds the frontend):
bash ./run_full_deploy_test.sh
- Start the backend (serves API and, if built, the frontend from the same port):
bash ./start_server.sh
# or:
cd backend
../.venv/bin/python api.py
- Verify safe endpoints:
curl http://localhost:8765/health
curl http://localhost:8765/categories
curl http://localhost:8765/stats
Observed in this environment (smoke test):
- /health returned workflow_initialized=true and the active models: scene_detection_model=gemini-2.5-flash, safety_assessment_model=gemini-2.5-flash, image_edit_model=google/gemini-2.5-flash-image.
- /stats returned total_chunks=105 and category counts consistent with the curated KB.
- Run a manual end-to-end test (will trigger external model calls; costs apply):
python test_frontend.py
Notes:
- This script is skipped in pytest (test_frontend.py:16) and expects the backend to already be running.
- It uploads test_images/bathroom2.jpg and test_images/cable.jpg, then polls /edit_status/{image_id}.
B. Required environment variables (names only)
From .env.template:
- OPENAI_API_KEY (embeddings; also used for OpenAI LLM/image modes)
- GEMINI_API_KEY (or GOOGLE_API_KEY) (required when using gemini-* models)
- OPENROUTER_API_KEY (required when IMAGE_EDIT_MODEL uses OpenRouter, e.g. google/gemini-2.5-flash-image)
- SCENE_DETECTION_MODEL, SAFETY_ASSESSMENT_MODEL, IMAGE_EDIT_MODEL
- API_PORT (local) and/or PORT (Cloud Run style)
- DETERMINISTIC_MODE, LLM_SEED, LLM_MAX_OUTPUT_TOKENS (optional)
- MAX_IMAGE_DIMENSION (optional; upload downscale control)
- IMAGE_EDIT_SIZE (OpenAI image-edit size; optional)
- DEBUG (optional; enables a richer debug payload from /assess)
C. Rebuilding the curated knowledge base (requires OpenAI embeddings)
From repo root safellm3/safellm_deploy/:
python knowledge_base/process_curated_knowledge.py
python knowledge_base/kb_lint.py
python knowledge_base/create_curated_embeddings.py
D. Deployment
Option 1: Single-container deployment (backend serves built frontend)
- Dockerfile is multi-stage:
  - Stage 1 builds the React frontend (npm run build).
  - Stage 2 installs Python deps and runs backend/api.py (Cloud Run‑style PORT support).
- Build/run:
docker build -t safellm-fall-risk .
docker run -p 8765:8080 \
-e PORT=8080 \
-e OPENAI_API_KEY=... \
-e GEMINI_API_KEY=... \
-e OPENROUTER_API_KEY=... \
safellm-fall-risk
(Do not commit or paste real secrets into docs/logs.)
Option 2: Split frontend/backend (traditional SPA hosting)
- Frontend: cd frontend && npm run build and host frontend/dist/ on a static host.
- Backend: host the FastAPI service and configure VITE_API_BASE in the frontend to point to it.
E. Debugging common failures
- Backend won’t start:
  - Check that .env exists and the required keys are present for the configured providers.
  - Check that knowledge_base/curated_embeddings/faiss_index/ exists; if missing, build the embeddings.
- /assess returns 500:
  - Enable DEBUG=1 and inspect the debug payload (prompts + outputs + top RAG results).
  - Verify provider keys: embeddings (OpenAI) are required even if the LLMs are Gemini.
- Async edited image never appears:
  - Poll /edit_status/{image_id} and inspect status/error.
  - If using OpenRouter image generation, ensure OPENROUTER_API_KEY is set.
- Mobile/iPhone upload produces weird orientation or fails:
  - Verify EXIF transpose and resizing behavior in _normalize_image_orientation (backend/api.py:94).
  - Tune MAX_IMAGE_DIMENSION downward if Gemini outputs are truncated.
11. Evidence Map (repo anchors)
| Claim | Evidence (file:line) |
|---|---|
| Backend listens on Cloud Run PORT if set, else API_PORT (default 8765) | backend/api.py:34 |
| Backend serves built frontend frontend/dist at / | backend/api.py:235 |
| Static assets are mounted when frontend/dist exists | backend/api.py:147, backend/api.py:150 |
| CORS is currently wildcard (allow_origins=["*"]) | backend/api.py:52 |
| Upload pipeline normalizes EXIF orientation and downscales large images | backend/api.py:91, backend/api.py:94 |
| Old uploads/edited images are pruned (24h default) | backend/api.py:60, backend/api.py:157 |
| /assess runs steps 0–3 sync and schedules step 4 async | backend/api.py:290, backend/api.py:405 |
| Async edit jobs are tracked in memory via JOBS and exposed via /edit_status | backend/api.py:154, backend/api.py:642 |
| Edited images are saved under edited_images/ and served via /edited_images/{id}_edited.png | backend/api.py:633, backend/workflow.py:1301 |
| Model/provider selection is driven by env vars and naming conventions | backend/workflow.py:202–216 |
| Determinism toggle + seed exist | backend/workflow.py:225–226 |
| Runtime retriever is initialized as CuratedHybridRetriever() | backend/workflow.py:275 |
| Retrieval uses fixed k=5 and a category-driven query | backend/workflow.py:551–557 |
| Curated hybrid retrieval uses OpenAI embeddings + FAISS + BM25 | knowledge_base/curated_retrieval.py:32, knowledge_base/curated_retrieval.py:39, knowledge_base/curated_retrieval.py:68 |
| Hybrid score weights default to 0.6 vector / 0.4 BM25 | knowledge_base/curated_retrieval.py:191–192 |
| KB processor standardizes risk levels + hazard types | knowledge_base/process_curated_knowledge.py:20, knowledge_base/process_curated_knowledge.py:23 |
| Safety assessment performs validation warnings (citations/length/hazard counts) | backend/workflow.py:843–895 |
| Image editing can call OpenRouter image generation endpoint | backend/workflow.py:1188 |
| Image editing can call OpenAI client.images.edit | backend/workflow.py:1272 |
| Frontend calls /assess and displays results | frontend/src/App.jsx:27 |
| Frontend polls /edit_status/{image_id} for async edited image | frontend/src/components/Results.jsx:43 |
| .env.template documents required keys and model env vars (names only) | .env.template:2–22 |
| Dockerfile is multi-stage: Node build → Python runtime, sets PORT=8080 | Dockerfile:3, Dockerfile:47, Dockerfile:54 |
| Deploy simulation script runs pytest + frontend build | run_full_deploy_test.sh:48, run_full_deploy_test.sh:68 |
| Git history includes fixes for iPhone camera uploads and workflow startup robustness | git log in safellm3/safellm_deploy/.git/ (e.g., commit f1b6c43) |
| Config includes CDC STEADI source URLs for document ingestion | knowledge_base/config.py:112–118 |
| Docs claim images are “not stored permanently” (note: implementation writes to disk and prunes later) | README.md:273, backend/api.py:60 |
12. Interview Question Bank + Answer Outlines
System design
Q1: Walk me through the end-to-end request path for a photo upload. Where are the slow parts?
- Frontend sends a multipart upload to POST /assess (frontend/src/App.jsx:27, backend/api.py:290).
- Backend saves the file, normalizes EXIF + downscales, then runs workflow steps 0–3 synchronously (backend/api.py:94, backend/api.py:290).
- Step 1: scene classifier; Step 2: hybrid retrieval; Step 3: vision safety assessment (backend/workflow.py).
- Step 4 (image edit) is async; the client polls /edit_status/{image_id} (frontend/src/components/Results.jsx:43, backend/api.py:642).
- Likely slow parts: vision LLM calls + image generation; retrieval is local but the embeddings call may be remote.
Evidence: backend/api.py:290, backend/workflow.py:551, backend/workflow.py:969.
Q2: How would you scale this backend horizontally?
- Current blockers: in-memory JOBS state and local-disk image storage (not shared across instances).
- Make job state durable: Redis / DB + queue (Celery/RQ) or a managed task queue (a minimal Redis-backed sketch follows below).
- Move images to object storage (S3/GCS) with signed URLs.
- Ensure retrieval artifacts are bundled or cached per instance; warm start loads FAISS index at boot.
Evidence: backend/api.py:154, backend/api.py:633, backend/workflow.py:275.
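A minimal sketch of replacing the in-memory JOBS dict with a shared store (key names and TTL are illustrative choices, assuming redis-py):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def set_job_status(image_id: str, status: dict, ttl_s: int = 24 * 3600) -> None:
    """Write job state where any backend instance can read it; expire with the 24h cleanup window."""
    r.setex(f"edit_job:{image_id}", ttl_s, json.dumps(status))

def get_job_status(image_id: str) -> dict | None:
    raw = r.get(f"edit_job:{image_id}")
    return json.loads(raw) if raw else None
```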
Q3: How would you enforce tenant isolation / authentication?
- Add auth layer (JWT/OAuth) at FastAPI, enforce per-user rate limits and storage namespaces.
- Right now there is no auth; CORS is wildcard.
Evidence: backend/api.py:52; no auth code found.
Q4: What’s your data retention policy and how do you implement it?
- Intended: temporary storage only; best-effort cleanup after 24 hours (_cleanup_old_images).
- In practice: images are written to disk; deletions occur on startup and after a successful /assess.
- For production: add a periodic cleanup job + explicit retention configuration + a delete-on-success option.
Evidence: backend/api.py:60, backend/api.py:157.
AI/RAG
Q5: Why use hybrid retrieval (FAISS + BM25) on a curated KB?
- Vector search captures semantic similarity; BM25 captures exact keyword matches and domain terms.
- Hybrid weights are explicitly tuned (0.6/0.4) and threshold-filtered to avoid noise.
- Curated KB is small and structured, improving precision vs open web scraping.
Evidence: knowledge_base/curated_retrieval.py:68, knowledge_base/curated_retrieval.py:191–194.
Q6: How do you control hallucinations?
- Hard constraints: strict JSON schema outputs, consistent scoring rules, and “only report what is visible” instructions (prompt-level).
- Retrieval grounding: the prompt injects retrieved guidelines and instructs explicit citations; backend warns when citations are missing.
- Limitations: no automated grounding verifier; no image-region evidence mapping; confidence field currently may be forced high by schema.
Evidence: prompts/safety_assessment_prompt.py (schema + instructions), backend/workflow.py:843–895.
Q7: Why is the retrieval query category-driven instead of image-driven?
- Chosen for determinism and stability: a fixed query + fixed k reduces run-to-run variability.
- Tradeoff: less adaptive retrieval; relies on the small curated KB and LLM2 to choose relevant hazards.
Evidence: backend/workflow.py:551–557.
Q8: How would you evaluate this system offline?
- Create a labeled dataset of images per scene + expected hazards and priority actions.
- Compute:
- scene classification accuracy
- hazard precision/recall by risk tier
- grounding score: % hazards mapped to retrieved guidelines
- cost/difficulty calibration vs human baseline
- Add regression tests that run on PRs with fixed seeds and stable snapshots.
Evidence: existing hooks: tests/test_knowledge_base.py, test_images/*, test_frontend.py.
Debugging & reliability
Q9: Tell me about a tricky production bug you fixed.
- Example evidenced in git history: “Root cause of iPhone direct camera upload failures” and Gemini robustness fixes.
- Likely root issues (from code): EXIF orientation, huge pixel dimensions causing token blowups and truncated JSON.
- Fix: EXIF transpose + downscale + retries/fallback parsing.
Evidence: commit f1b6c43 and backend/api.py:94.
Q10: What happens when image editing fails?
- Step 4 exceptions are caught and do not fail the main response; the job status becomes error and the frontend continues showing the text results.
Evidence: backend/workflow.py:1320+, backend/api.py:405, backend/api.py:642.
Product sense
Q11: What’s the core user promise and how do you keep it?
- Promise: “upload one photo, get actionable fall-prevention improvements.”
- Kept by: structured output, prioritized action plan, cost/difficulty, and visual explanation image.
- Risk: no explicit user feedback loop in product; no saved reports; no trust UI beyond “knowledge base references.”
Evidence: frontend/src/components/Results.jsx + KB reference section.
Q12: If you had to pick one metric to optimize first, what is it?
- Suggest: “Actionability rate” (users who implement ≥1 recommendation) + “time-to-first-action.”
- Instrument: track which priority actions were shown + user follow-up surveys.
- Repo currently has no analytics; would need to add.
Evidence: no telemetry found in code search.
Behavioral
Q13: How did you manage ambiguity in requirements?
- Built a small, high-precision curated KB (105 chunks) rather than scraping the web; kept scope to 11 scenes.
- Added determinism controls to keep outputs stable for demos and testing.
- Validated by unit tests for KB processing and a deploy simulation script.
Evidence: knowledge_base/curated_chunks/metadata.jsonl, backend/workflow.py:551, run_full_deploy_test.sh.
Q14: How do you communicate tradeoffs to stakeholders?
- Example: “We use fixed retrieval to reduce variability, which may miss some edge hazards; roadmap adds image-driven query + eval harness.”
- Backed by clear evidence anchors and a phased hardening plan (auth, storage, queue).
Evidence: backend/workflow.py:551–557, backend/api.py:154.
13. Roadmap (high-leverage upgrades)
Must (production hardening)
- Add auth + rate limiting (protect /assess, mitigate abuse); restrict CORS in production (backend/api.py:52).
- Make async image-edit jobs durable (queue + shared state) and move images to object storage (S3/GCS).
- Unify docs with the current code: model defaults, env var names (VITE_API_BASE vs the older VITE_API_URL), deployment steps, and correct file paths.
- Add server-side upload limits (max bytes, pixel count) and explicit content-type sniffing.
- Remove prompt/schema contradictions and align validation/post-processing with the actual output fields.
- Add CI (GitHub Actions) to run pytest + npm run build on PRs (mirrors run_full_deploy_test.sh).
Nice-to-have (quality + UX)
- Cache per-category embeddings to reduce latency/cost (query is repeated per category).
- Add offline eval harness + regression suite (labeled images; grounding metrics).
- Add streaming UX / progress per step (backend already has step timings).
- Add multi-image “whole home” report that merges hazards across rooms.
- Add multilingual UI and report export (PDF) with citations.
- Add observability: structured logs, tracing, and provider usage counters (cost dashboards).