Deep Dive — SafeLLM / Fall Risk Detection AI System (safellm3/safellm_deploy)
Scope note: this workspace contains multiple iterations (safellm/, safellm2/, safellm3/). The only directory with git history is safellm3/safellm_deploy/ (contains .git/), so this Deep Dive treats that as the “repo” for evidence and history.
1. What This Is (one paragraph)
SafeLLM is a deployable web app + API that takes a single photo of a home environment, classifies the scene into one of 11 fall‑risk categories, retrieves fall‑prevention guidelines from a small curated knowledge base, and returns a structured safety report (score, hazards, prioritized actions, cost/difficulty) plus an optional AI‑generated “visual improvements” image that overlays the recommended fixes. The repo also contains extracted guideline documents (e.g., CDC STEADI PDFs → markdown) under knowledge_base/processed/ for provenance/transparency.
2. Who It’s For + Use Cases
Primary users (as described in the repo docs):
- Family caregivers assessing an elderly parent’s home for preventable fall hazards.
- Clinicians / discharge planners doing a quick home safety pre‑screen.
- Home modification services triaging what to fix first and estimating effort/cost.
- Real estate / property managers evaluating accessibility and safety.
What “success” means (repo evidence + gaps):
- Success (evidenced): the system returns a structured report and can boot/build/test reliably (run_full_deploy_test.sh).
- Success (inferred but not measured in repo): fewer missed hazards, fewer hallucinated hazards, actionable fixes, low latency, low cost.
- Unknown (not found in repo): defined product metrics (accuracy, NPS, retention, clinical outcomes). Suggested metrics are in §8.
3. Product Surface Area (Features)
A. End‑user web experience (React)
- Upload photo → POST /assess (multipart file) from the UI (frontend/src/App.jsx:27).
- Live “Analyzing…” UI while waiting (non‑streaming; one blocking request).
- Structured results page:
- Score + risk level
- Hazard lists (critical/important/minor)
- Priority action plan
- Cost + difficulty
- Knowledge Base References section (shows which guidelines were retrieved and match %)
- Visual Safety Improvements (polls the backend until the edited image is ready) (frontend/src/components/Results.jsx:43)
- Print report via window.print() (frontend‑only).
B. Backend API (FastAPI)
User‑visible endpoints:
- GET / serves the built frontend (frontend/dist/) if present, else returns API info (backend/api.py:235).
- GET /health returns workflow status + active model configuration (backend/api.py:267).
- POST /assess runs the core pipeline (steps 0–3 sync; step 4 async) (backend/api.py:290).
- POST /scene-detect runs scene classification only (backend/api.py:529).
- GET /categories returns supported scene categories (backend/api.py:587).
- GET /stats returns curated KB chunk counts by category (backend/api.py:609).
- GET /edit_status/{image_id} returns async image-edit job status (backend/api.py:642).
- GET /edited_images/{image_id}_edited.png serves the edited image (backend/api.py:633).
C. Knowledge base tooling
- Curated KB stored as structured markdown under knowledge_base/curated_knowledge/ and compiled into JSONL (knowledge_base/curated_chunks/metadata.jsonl) via knowledge_base/process_curated_knowledge.py.
- FAISS index built via knowledge_base/create_curated_embeddings.py (OpenAI embeddings).
- KB linting via knowledge_base/kb_lint.py + unit tests in tests/test_knowledge_base.py.
D. Deployment/build tooling
- Dockerfile builds the frontend and runs the backend (Cloud Run‑style PORT support).
- run_full_deploy_test.sh simulates a deploy build: venv deps → pytest → frontend build.
- start_server.sh and start_server.bat for local startup (shell + Windows).
Constraints / caveats (evidence vs unknown):
- No auth / no user accounts (evidenced by code search; no auth middleware; see §7).
- Docs drift: multiple READMEs describe older model choices and paths (e.g., gpt‑4o vs gemini/gpt‑5; different env var names). Evidence is in file-level references in §11.
4. Architecture Overview
A. Components (text diagram)
Browser (React + Vite)
- Uploads image to backend (/assess)
- Renders structured JSON results
- Polls /edit_status for async edited image
|
v
FastAPI backend (backend/api.py)
- Saves upload -> uploads/<uuid>.<ext>
- Normalizes EXIF orientation + downscales large photos
- Runs workflow steps 0–3 synchronously
- Spawns async background task for step 4 (image edit)
|
v
LangGraph-style workflow implementation (backend/workflow.py)
Step 1: Scene detection (Gemini or OpenAI)
Step 2: Hybrid retrieval (FAISS + BM25) over curated KB
Step 3: Safety assessment (Gemini or OpenAI) -> strict JSON
Step 4: Image editing (Gemini Image / OpenRouter / OpenAI Images) -> edited_images/<uuid>_edited.png
B. Repo inventory (top 2–3 levels, focus on runtime)
safellm3/safellm_deploy/
backend/
api.py # FastAPI server + endpoints
workflow.py # 4-step workflow + providers + async image edit
frontend/
src/ # React UI (upload/results/polling)
knowledge_base/
curated_knowledge/ # human-authored markdown hazards by scene
curated_chunks/ # JSONL metadata for 105 curated chunks
curated_embeddings/ # FAISS index used at runtime
processed/ # extracted source docs (CDC PDFs -> markdown) for transparency
prompts/
scene_detection_prompt.py
safety_assessment_prompt.py
image_editing_prompt.py
tests/
test_knowledge_base.py
Dockerfile
requirements.txt
run_full_deploy_test.sh
start_server.sh
test_frontend.py # manual integration script (skipped under pytest)
C. Key modules (what they do / why they matter)
- backend/api.py: FastAPI app that owns the HTTP contract (uploads, responses, polling) and also serves the built SPA in production; it’s the main deployable surface and where reliability controls (EXIF fixes, cleanup, async jobs) live.
- backend/workflow.py: Core orchestration for the 4-step pipeline (provider selection, determinism, retrieval wiring, image-edit generation); this is where most AI behavior is defined.
- knowledge_base/curated_retrieval.py: Hybrid FAISS+BM25 retrieval that grounds the LLM in a small, scene-filtered knowledge base; it strongly shapes output relevance and consistency.
- knowledge_base/process_curated_knowledge.py: “Compiler” from curated markdown → structured JSONL chunks; enforces enumerations (risk levels, hazard types) and creates stable IDs for retrieval.
- knowledge_base/create_curated_embeddings.py: Builds the FAISS index used at runtime; without it, retrieval cannot load.
- prompts/scene_detection_prompt.py: Defines the scene classifier output shape and allowed categories; constrains LLM1 to avoid adding noise.
- prompts/safety_assessment_prompt.py: Defines the strict JSON schema + scoring conventions for LLM2; the primary control surface for hallucination and output stability.
- prompts/image_editing_prompt.py: Converts a small structured “edit plan” into a constrained natural-language image prompt; drives consistent visuals across providers.
- frontend/src/App.jsx: Upload handler and environment-based API routing (VITE_API_BASE vs localhost); defines the user flow into /assess.
- frontend/src/components/Results.jsx: Results rendering and async polling for the edited image (/edit_status/{image_id}); defines the post-upload UX.
- tests/test_knowledge_base.py: Unit tests that protect curated KB processing/validation from regressions.
- run_full_deploy_test.sh: Repeatable “deploy simulation” (deps → pytest → build) that makes build confidence auditable.
- test_frontend.py: Manual end-to-end script (uploads real images and polls image edits); useful for smoke testing but intentionally skipped in CI-style pytest runs.
D. Key runtime assumptions
- The backend is single-process and keeps job status in memory (JOBS = {} in backend/api.py:154), so pending image edits are not durable across restarts.
- File storage for uploads/edited images is local disk; cleanup is best-effort (24h window) (backend/api.py:60, called on startup and after a successful /assess).
- External providers must be reachable for /assess to complete fully (OpenAI embeddings always; Gemini/OpenAI for LLMs; optional OpenRouter/OpenAI Images for step 4).
5. Data Model
There is no database in this deployable repo. Data is stored as:
A. Runtime request state (in-memory)
- Per-request workflow state is a Python dict matching WorkflowState in backend/workflow.py (contains image_base64, scene_category, retrieved_knowledge, hazards, etc.; a partial sketch follows this list).
- Async image-editing job status is stored in an in-memory dict JOBS (backend/api.py:154; /edit_status/{image_id} at backend/api.py:642).
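For orientation, a minimal sketch of that state shape (field names come from the list above; the types and the class name are assumptions, and the real WorkflowState in backend/workflow.py carries more fields):

```python
from typing import Any, TypedDict

class WorkflowStateSketch(TypedDict, total=False):
    """Illustrative subset of the per-request state dict; not the full definition."""
    image_base64: str                             # uploaded photo, base64-encoded
    scene_category: str                           # Step 1 output (one of the 11 scene categories)
    retrieved_knowledge: list[dict[str, Any]]     # Step 2: curated chunks returned by retrieval
    hazards: list[dict[str, Any]]                 # Step 3: structured hazards from the safety assessment
```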
B. Runtime files (local disk)
- uploads/<uuid>.<ext>: incoming images (kept at least long enough for the async edit to run).
- edited_images/<uuid>_edited.png: the “visual improvements” output image.
- Cleanup: files older than 24h are deleted on startup and after a successful /assess (backend/api.py:60, backend/api.py:183, backend/api.py:370).
C. Curated Knowledge Base (static artifacts in repo)
- knowledge_base/curated_knowledge/*.md: scene-specific hazard lists (structured markdown).
- knowledge_base/curated_chunks/metadata.jsonl: 105 lines (one per chunk) with fields like chunk_id, category, hazard_name, risk_level, keywords, hazard_types, version. (The schema is visible by reading any JSONL line; see the processor in knowledge_base/process_curated_knowledge.py and the inspection sketch below.)
- knowledge_base/curated_embeddings/faiss_index/: FAISS index built from the curated chunks.
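A quick way to confirm the chunk schema locally (this only reads the first JSONL line; the field names listed above come from the processor, not from this sketch):

```python
import json
from pathlib import Path

# Print the field names of the first curated chunk to confirm the schema
# (expected keys include chunk_id, category, hazard_name, risk_level, ...).
first_line = Path("knowledge_base/curated_chunks/metadata.jsonl").read_text(encoding="utf-8").splitlines()[0]
chunk = json.loads(first_line)
print(sorted(chunk.keys()))
```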
D. “Raw/processed” source docs (transparency / provenance)
- knowledge_base/raw_documents/ and knowledge_base/processed/: downloaded PDFs (e.g., CDC STEADI) and extracted markdown by category. Example file: knowledge_base/processed/indoor/kitchen/cdc_81518_DS1_extracted.md contains per-page text plus a metadata header.
- Note: the “raw documents” pipeline in this deployable folder references text_extractor.py (knowledge_base/process_documents.py:18), which is not present in safellm3/safellm_deploy/ (it exists in sibling directories). Running that pipeline here is Unknown (likely broken without copying that file).
6. AI System Design (if applicable)
A. Knowledge ingestion & curation
Two parallel concepts exist in this repo:
- Curated knowledge (used at runtime)
  - Source: knowledge_base/curated_knowledge/*.md (e.g., knowledge_base/curated_knowledge/kitchen_safety.md).
  - Processing: knowledge_base/process_curated_knowledge.py parses markdown sections into structured JSONL with standardized:
    - risk levels (CRITICAL|HIGH|MEDIUM-HIGH|MEDIUM|LOW-MEDIUM|LOW) (knowledge_base/process_curated_knowledge.py:20)
    - hazard types enumeration (slip|trip|visibility|accessibility|structural|clutter|electrical|weather) (knowledge_base/process_curated_knowledge.py:23)
  - Output: knowledge_base/curated_chunks/metadata.jsonl + FAISS embeddings index.
- Downloaded source docs (mostly for provenance / auditability)
  - Sources are declared in knowledge_base/config.py (DOCUMENT_SOURCES includes CDC STEADI URLs).
  - Processed markdown under knowledge_base/processed/... includes metadata such as source_key, filename, and processed_at.
  - This path is not referenced by the runtime retrieval implementation in backend/workflow.py (runtime uses CuratedHybridRetriever).
B. Embeddings
Runtime retriever uses OpenAI embeddings:
- Embedding model: text-embedding-3-small (knowledge_base/curated_retrieval.py:32).
- Vector store: FAISS, loaded from disk (knowledge_base/curated_retrieval.py:39).
Operational impact:
- Every retrieval likely requires embedding the query (cost + latency).
- Optimization opportunity: the current retrieval query is category-driven and repeated per scene ("{category} fall safety hazards and improvements for elderly"), so embeddings can be cached per category (see the sketch below).
C. Retrieval (vector + keyword hybrid)
Retriever implementation: knowledge_base/curated_retrieval.py:
- Loads all curated chunks from knowledge_base/curated_chunks/metadata.jsonl (knowledge_base/curated_retrieval.py:28).
- Builds a BM25 index over hazard_name + content + keywords (knowledge_base/curated_retrieval.py:68–93).
- Hybrid scoring defaults (a combination sketch follows this list):
  - vector_weight=0.6, bm25_weight=0.4 (knowledge_base/curated_retrieval.py:191–192)
  - score threshold filtering min_score_threshold=0.3 (knowledge_base/curated_retrieval.py:193)
  - adaptive threshold exists but is disabled by default for consistency (use_adaptive_threshold=False) (knowledge_base/curated_retrieval.py:194)
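A sketch of how those defaults could combine per-chunk scores (illustrative only; the implementation in knowledge_base/curated_retrieval.py may normalize and threshold differently):

```python
import numpy as np

def hybrid_scores(vector_sims: np.ndarray, bm25_scores: np.ndarray,
                  vector_weight: float = 0.6, bm25_weight: float = 0.4,
                  min_score_threshold: float = 0.3) -> np.ndarray:
    """Weighted blend of vector similarity and BM25, with a floor filter."""
    def minmax(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x, dtype=float)

    combined = vector_weight * minmax(vector_sims) + bm25_weight * minmax(bm25_scores)
    # Chunks below the threshold are dropped from the candidate set.
    return np.where(combined >= min_score_threshold, combined, -np.inf)
```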
Runtime retrieval in the workflow (backend/workflow.py):
- Uses a fixed k=5 “consistency mode” (backend/workflow.py:551).
- Uses a category-driven query and filters by the detected category (backend/workflow.py:557).
- Tradeoff: more deterministic retrieval vs less image-specific retrieval.
D. Generation
Step 1 — Scene detection (LLM1)
- Prompt: prompts/scene_detection_prompt.py produces strict JSON with scene_category and confidence.
- Provider selection is model-name based (a minimal routing sketch follows this list):
  - gemini-* → Gemini (backend/workflow.py:207)
  - otherwise → OpenAI (backend/workflow.py:207)
- Robustness mechanisms (from code + git history):
  - Strict JSON schema for Gemini outputs + fallback parsing when JSON decode fails.
  - Git history shows multiple fixes for “iPhone direct camera upload” and Gemini empty responses (e.g., commit f1b6c43).
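A minimal sketch of that naming convention (the helper name is hypothetical; the real routing lives in backend/workflow.py):

```python
import os

def resolve_provider(model_name: str) -> str:
    """Route a model call by name: gemini-* goes to Gemini, everything else to OpenAI."""
    return "gemini" if model_name.startswith("gemini-") else "openai"

print(resolve_provider(os.getenv("SCENE_DETECTION_MODEL", "gemini-2.5-flash")))  # -> "gemini"
```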
Step 3 — Safety assessment (LLM2)
- Prompt: prompts/safety_assessment_prompt.py defines a strict JSON schema and scoring rules.
- Provider selection mirrors Step 1 (backend/workflow.py:208).
- The backend performs basic “post-LLM” validation and attaches validation_warnings (citation count, summary length, hazard counts) (backend/workflow.py:843–895).
Important consistency note:
- The current prompt/schema includes internal contradictions (e.g., schema enumerations that force confidence="high" while other parts mention low/medium; some post-processing expects different hazard field names). This does not crash normal paths at runtime, but it is a maintainability risk (§9).
E. Visual feedback (image editing; Step 4)
Step 4 is optional and runs asynchronously in the API:
- /assess schedules a background task and returns immediately (backend/api.py:405).
- Job status is polled via /edit_status/{image_id} (backend/api.py:642); a polling sketch appears at the end of this subsection.
How the image edit is generated (backend/workflow.py:969+):
- Ask LLM2 for a small editing plan (JSON; ≤3 annotations) (backend/workflow.py:980+).
- Convert that plan into a constrained natural-language image prompt (build_prompt_from_plan) (backend/workflow.py:1085, prompts/image_editing_prompt.py:150).
- Call one of:
  - Gemini image generation (if IMAGE_EDIT_MODEL starts with gemini-)
  - OpenRouter image generation (if IMAGE_EDIT_MODEL contains /) (backend/workflow.py:1142; POST https://openrouter.ai/api/v1/chat/completions at backend/workflow.py:1188)
  - OpenAI Images edit API (client.images.edit) (backend/workflow.py:1272)
- Save the output to edited_images/<image_id>_edited.png and serve it via a backend route (backend/workflow.py:1310+, backend/api.py:633).
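A minimal client-side polling sketch mirroring what Results.jsx does (the status values and poll interval here are assumptions; check the actual /edit_status response shape):

```python
import time

import requests

API = "http://localhost:8765"

def wait_for_edited_image(image_id: str, timeout_s: float = 120.0) -> str | None:
    """Poll the async edit job until it finishes; return the edited-image URL or None."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        job = requests.get(f"{API}/edit_status/{image_id}", timeout=10).json()
        if job.get("status") == "completed":   # assumed terminal state names
            return f"{API}/edited_images/{image_id}_edited.png"
        if job.get("status") == "error":
            return None
        time.sleep(2)                          # simple fixed-interval polling
    return None
```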
F. Evaluation (what exists)
Evidenced evaluation/testing assets:
- Unit tests for curated knowledge processing + linting: tests/test_knowledge_base.py.
- A manual end-to-end integration script (skipped under pytest) that uploads test images and polls edit status: test_frontend.py (pytest.mark.skip at test_frontend.py:16).
- Sample test images + saved API outputs: test_images/bathroom2.jpg, test_images/bathroom2_api_result.json, etc.
Unknown (not found in repo):
- Formal offline eval harness (gold labels, precision/recall, calibration).
- Regression tests for hallucination rate or grounding quality.
7. Reliability, Security, and Privacy
Reliability & correctness mechanisms
- EXIF orientation normalization + downscaling on upload (addresses iPhone camera photos and reduces model token load): backend/api.py:94 with MAX_IMAGE_DIMENSION default 2048 (backend/api.py:91). A sketch follows this list.
- Retry logic for the Gemini safety assessment (handles transient empty responses): in backend/workflow.py Step 3 (Gemini path).
- Async image editing is non-fatal: Step 4 failures do not fail the whole assessment (backend/workflow.py:1320+).
- Disk growth control: deletes uploads/edited images older than 24h (backend/api.py:60).
- Determinism controls:
  - DETERMINISTIC_MODE + LLM_SEED (backend/workflow.py:225–226)
  - fixed retrieval k=5 and a category-driven query (backend/workflow.py:551–557)
  - strict JSON schemas in prompts (prompt-level determinism)
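A sketch of the EXIF + downscale step, assuming Pillow (the actual helper is _normalize_image_orientation in backend/api.py; this only illustrates the approach):

```python
from PIL import Image, ImageOps

MAX_IMAGE_DIMENSION = 2048  # documented default

def normalize_upload(path: str) -> None:
    """Bake EXIF rotation into the pixels and cap the longest side, in place."""
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)  # fixes sideways/upside-down iPhone photos
    img.thumbnail((MAX_IMAGE_DIMENSION, MAX_IMAGE_DIMENSION))  # preserves aspect ratio
    img.save(path)
```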
Security posture (current)
- No authentication / authorization. Anyone with network access to the backend can call /assess.
- CORS is fully open (allow_origins=["*"]) (backend/api.py:52), which is convenient for demos but not safe for a public deployment without further controls.
- File upload validation checks that the MIME type starts with image/ (backend/api.py:308) but does not enforce a server-side size limit (the frontend UI suggests “max 10MB”, but the backend does not appear to enforce it; a hypothetical guard is sketched below).
- Prompt injection risk exists via image content (e.g., text in the image). Mitigations in the repo are mostly prompt-level constraints + strict JSON output; there is no explicit “prompt injection sanitizer” or allowlist of visual evidence.
- In-memory job state means attackers could cause memory growth by spamming /assess (no rate limit).
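A hypothetical server-side guard for the missing size limit (not present in the repo; the rest of the endpoint body is elided):

```python
from fastapi import FastAPI, File, HTTPException, UploadFile

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # mirror the "max 10MB" hint shown by the UI

app = FastAPI()

@app.post("/assess")
async def assess(file: UploadFile = File(...)):
    data = await file.read()
    if len(data) > MAX_UPLOAD_BYTES:
        raise HTTPException(status_code=413, detail="Image exceeds the 10MB limit")
    if not (file.content_type or "").startswith("image/"):
        raise HTTPException(status_code=415, detail="Only image uploads are accepted")
    ...  # the existing pipeline would run here
```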
Privacy & data handling
- Local disk stores uploaded images and edited images at least temporarily (uploads are retained for async editing; best-effort cleanup after 24h).
- Third-party processing:
- OpenAI: embeddings (always) and optionally vision + image editing depending on env config.
- Google Gemini: vision LLM steps if configured.
- OpenRouter: image generation if configured.
- Repo docs claim “images are not stored permanently” (README.md), but the server implementation stores them on disk and cleans them up later. Treat “not stored permanently” as “not intended to be retained long-term,” not as “never written to disk.”
8. Performance & Cost
What is evidenced
- Build/test performance: run_full_deploy_test.sh completes quickly locally; the frontend build outputs ~209 kB JS (~65 kB gzipped) (Vite build output).
- The backend prints timings for workflow steps 0–3 on each /assess (backend/api.py prints a timing summary; not captured here).
- The upload pipeline includes image downscaling to reduce Gemini token usage (the code explicitly mentions truncated-JSON risk) (backend/api.py:91–123).
What is unknown (not measured in repo)
- p50/p95 latency for /assess and for async image-edit completion.
- p50/p95 cost per image (varies by provider + model).
- Accuracy metrics: scene classification accuracy, hazard precision/recall, hallucination rate.
Suggested metrics to add (high leverage)
- Latency: per-step timings are already captured in state (timing_* in the API response); emit them to logs as structured JSON and track p50/p95 (see the sketch after this list).
- Cost: record model + token usage per request (OpenAI usage fields; Gemini usage metadata) and track cost per request.
- Quality:
- Scene classification confusion matrix using a labeled test set.
- Hazard extraction: precision/recall against a rubric for each scene.
- Citation/grounding: fraction of hazards explicitly tied to retrieved guidelines.
- Reliability: /assess error rate by provider; async image-edit failure rate.
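A sketch of the logging side of this, assuming the timing_* fields in the /assess response (the function and logger names here are illustrative):

```python
import json
import logging

logger = logging.getLogger("safellm.metrics")

def log_assess_metrics(response: dict, provider: str, status: str) -> None:
    """Emit one structured log line per /assess call so p50/p95 can be derived downstream."""
    record = {k: v for k, v in response.items() if k.startswith("timing_")}
    record.update({"provider": provider, "status": status})
    logger.info(json.dumps(record))
```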
9. Hardest Problems + Key Tradeoffs
- High-resolution mobile photos vs structured JSON reliability
  - Problem: large images can increase model token usage and cause truncated/malformed JSON from Gemini.
  - Mitigation: EXIF normalization + downscale to MAX_IMAGE_DIMENSION (default 2048) (backend/api.py:91–123).
  - Tradeoff: downscaling may hide small hazards (e.g., cords, subtle step edges).
- Determinism vs image-specific retrieval
  - Chosen: fixed k=5 and a category-driven query (backend/workflow.py:551–557) to stabilize retrieval + outputs.
  - Tradeoff: retrieval may miss hazards that an image-driven query would surface; relies on LLM2 to pick relevant guidelines from a smaller set.
- Multi-provider orchestration (Gemini + OpenAI + OpenRouter)
  - Enables picking the best model per subtask (vision vs images vs embeddings).
  - Tradeoff: operational complexity (3 keys, 3 SLAs, different failure modes).
- Async image-edit UX vs durable job system
  - Current: FastAPI BackgroundTasks + in-memory JOBS store.
  - Tradeoff: simplest implementation, but not durable across restarts and not horizontally scalable without shared state.
- Security simplicity vs production hardening
  - Current: CORS wildcard, no auth, no rate limiting.
  - Tradeoff: easy demo deployment vs exposure risk if publicly reachable.
- Docs maintainability vs rapid iteration
  - Many docs describe earlier versions (gpt‑4o, different folder roots, different env vars).
  - Tradeoff: faster iteration, but increased onboarding and interview/due-diligence risk because “the code says X, docs say Y.”
10. Operational Guide (Repro & Deploy)
A. Local setup (tested)
From safellm3/safellm_deploy/:
- Run the “deploy simulation” script (installs deps into .venv, runs pytest, builds the frontend):
bash ./run_full_deploy_test.sh
- Start the backend (serves API and, if built, the frontend from the same port):
bash ./start_server.sh
# or:
cd backend
../.venv/bin/python api.py
- Verify safe endpoints:
curl http://localhost:8765/health
curl http://localhost:8765/categories
curl http://localhost:8765/stats
Observed in this environment (smoke test):
- /health returned workflow_initialized=true and the active models: scene_detection_model=gemini-2.5-flash, safety_assessment_model=gemini-2.5-flash, image_edit_model=google/gemini-2.5-flash-image.
- /stats returned total_chunks=105 and category counts consistent with the curated KB.
- Run a manual end-to-end test (will trigger external model calls; costs apply):
python test_frontend.py
Notes:
- This script is skipped in pytest (test_frontend.py:16) and expects the backend to already be running.
- It uploads test_images/bathroom2.jpg and test_images/cable.jpg, then polls /edit_status/{image_id}.
B. Required environment variables (names only)
From .env.template:
- OPENAI_API_KEY (embeddings; also used for OpenAI LLM/image modes)
- GEMINI_API_KEY (or GOOGLE_API_KEY) (required when using gemini-* models)
- OPENROUTER_API_KEY (required when IMAGE_EDIT_MODEL uses OpenRouter, e.g. google/gemini-2.5-flash-image)
- SCENE_DETECTION_MODEL, SAFETY_ASSESSMENT_MODEL, IMAGE_EDIT_MODEL
- API_PORT (local) and/or PORT (Cloud Run style)
- DETERMINISTIC_MODE, LLM_SEED, LLM_MAX_OUTPUT_TOKENS (optional)
- MAX_IMAGE_DIMENSION (optional; upload downscale control)
- IMAGE_EDIT_SIZE (OpenAI image-edit size; optional)
- DEBUG (optional; enables a richer debug payload from /assess)
C. Rebuilding the curated knowledge base (requires OpenAI embeddings)
From repo root safellm3/safellm_deploy/:
python knowledge_base/process_curated_knowledge.py
python knowledge_base/kb_lint.py
python knowledge_base/create_curated_embeddings.py
D. Deployment
Option 1: Single-container deployment (backend serves built frontend)
- Dockerfile is multi-stage:
  - Stage 1 builds the React frontend (npm run build).
  - Stage 2 installs Python deps and runs backend/api.py (Cloud Run‑style PORT support).
- Build/run:
docker build -t safellm-fall-risk .
docker run -p 8765:8080 \
-e PORT=8080 \
-e OPENAI_API_KEY=... \
-e GEMINI_API_KEY=... \
-e OPENROUTER_API_KEY=... \
safellm-fall-risk
(Do not commit or paste real secrets into docs/logs.)
Option 2: Split frontend/backend (traditional SPA hosting)
- Frontend: cd frontend && npm run build and host frontend/dist/ on a static host.
- Backend: host the FastAPI service and configure VITE_API_BASE in the frontend to point to it.
E. Debugging common failures
- Backend won’t start:
  - Check that .env exists and the required keys are present for the configured providers.
  - Check that knowledge_base/curated_embeddings/faiss_index/ exists; if missing, build the embeddings.
- /assess returns 500:
  - Enable DEBUG=1 and inspect the debug payload (prompts + outputs + top RAG results).
  - Verify provider keys: embeddings (OpenAI) are required even if the LLMs are Gemini.
- Async edited image never appears:
  - Poll /edit_status/{image_id} and inspect status/error.
  - If using OpenRouter image generation, ensure OPENROUTER_API_KEY is set.
- Mobile/iPhone upload produces weird orientation or fails:
  - Verify EXIF transpose and resizing behavior in _normalize_image_orientation (backend/api.py:94).
  - Tune MAX_IMAGE_DIMENSION downward if Gemini outputs are truncated.
11. Evidence Map (repo anchors)
| Claim | Evidence (file:line) |
|---|---|
| Backend listens on Cloud Run PORT if set, else API_PORT (default 8765) | backend/api.py:34 |
| Backend serves built frontend frontend/dist at / | backend/api.py:235 |
| Static assets are mounted when frontend/dist exists | backend/api.py:147, backend/api.py:150 |
| CORS is currently wildcard (allow_origins=["*"]) | backend/api.py:52 |
| Upload pipeline normalizes EXIF orientation and downscales large images | backend/api.py:91, backend/api.py:94 |
| Old uploads/edited images are pruned (24h default) | backend/api.py:60, backend/api.py:157 |
| /assess runs steps 0–3 sync and schedules step 4 async | backend/api.py:290, backend/api.py:405 |
| Async edit jobs are tracked in memory via JOBS and exposed via /edit_status | backend/api.py:154, backend/api.py:642 |
| Edited images are saved under edited_images/ and served via /edited_images/{id}_edited.png | backend/api.py:633, backend/workflow.py:1301 |
| Model/provider selection is driven by env vars and naming conventions | backend/workflow.py:202–216 |
| Determinism toggle + seed exist | backend/workflow.py:225–226 |
| Runtime retriever is initialized as CuratedHybridRetriever() | backend/workflow.py:275 |
| Retrieval uses fixed k=5 and a category-driven query | backend/workflow.py:551–557 |
| Curated hybrid retrieval uses OpenAI embeddings + FAISS + BM25 | knowledge_base/curated_retrieval.py:32, knowledge_base/curated_retrieval.py:39, knowledge_base/curated_retrieval.py:68 |
| Hybrid score weights default to 0.6 vector / 0.4 BM25 | knowledge_base/curated_retrieval.py:191–192 |
| KB processor standardizes risk levels + hazard types | knowledge_base/process_curated_knowledge.py:20, knowledge_base/process_curated_knowledge.py:23 |
| Safety assessment performs validation warnings (citations/length/hazard counts) | backend/workflow.py:843–895 |
| Image editing can call OpenRouter image generation endpoint | backend/workflow.py:1188 |
| Image editing can call OpenAI client.images.edit | backend/workflow.py:1272 |
| Frontend calls /assess and displays results | frontend/src/App.jsx:27 |
| Frontend polls /edit_status/{image_id} for async edited image | frontend/src/components/Results.jsx:43 |
| .env.template documents required keys and model env vars (names only) | .env.template:2–22 |
| Dockerfile is multi-stage: Node build → Python runtime, sets PORT=8080 | Dockerfile:3, Dockerfile:47, Dockerfile:54 |
| Deploy simulation script runs pytest + frontend build | run_full_deploy_test.sh:48, run_full_deploy_test.sh:68 |
| Git history includes fixes for iPhone camera uploads and workflow startup robustness | git log in safellm3/safellm_deploy/.git/ (e.g., commit f1b6c43) |
| Config includes CDC STEADI source URLs for document ingestion | knowledge_base/config.py:112–118 |
| Docs claim images are “not stored permanently” (note: implementation writes to disk and prunes later) | README.md:273, backend/api.py:60 |
12. Interview Question Bank + Answer Outlines
System design
Q1: Walk me through the end-to-end request path for a photo upload. Where are the slow parts?
- Frontend sends a multipart upload to POST /assess (frontend/src/App.jsx:27, backend/api.py:290).
- Backend saves the file, normalizes EXIF + downscales, then runs workflow steps 0–3 synchronously (backend/api.py:94, backend/api.py:290).
- Step 1: scene classifier; Step 2: hybrid retrieval; Step 3: vision safety assessment (backend/workflow.py).
- Step 4 (image edit) is async; the client polls /edit_status/{image_id} (frontend/src/components/Results.jsx:43, backend/api.py:642).
- Likely slow parts: vision LLM calls + image generation; retrieval is local but the embeddings call may be remote.
Evidence: backend/api.py:290, backend/workflow.py:551, backend/workflow.py:969.
Q2: How would you scale this backend horizontally?
- Current blockers: in-memory JOBS state and local-disk image storage (not shared across instances).
- Make job state durable: Redis / DB + queue (Celery/RQ) or a managed task queue (a minimal Redis-backed sketch follows below).
- Move images to object storage (S3/GCS) with signed URLs.
- Ensure retrieval artifacts are bundled or cached per instance; warm start loads FAISS index at boot.
Evidence: backend/api.py:154, backend/api.py:633, backend/workflow.py:275.
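A minimal sketch of replacing the in-memory JOBS dict with a shared store (key names and TTL are illustrative choices, assuming redis-py):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def set_job_status(image_id: str, status: dict, ttl_s: int = 24 * 3600) -> None:
    """Write job state where any backend instance can read it; expire with the 24h cleanup window."""
    r.setex(f"edit_job:{image_id}", ttl_s, json.dumps(status))

def get_job_status(image_id: str) -> dict | None:
    raw = r.get(f"edit_job:{image_id}")
    return json.loads(raw) if raw else None
```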
Q3: How would you enforce tenant isolation / authentication?
- Add auth layer (JWT/OAuth) at FastAPI, enforce per-user rate limits and storage namespaces.
- Right now there is no auth; CORS is wildcard.
Evidence: backend/api.py:52; no auth code found.
Q4: What’s your data retention policy and how do you implement it?
- Intended: temporary storage only; best-effort cleanup after 24 hours (_cleanup_old_images).
- In practice: images are written to disk; deletions occur on startup and after a successful /assess.
- For production: add a periodic cleanup job + explicit retention configuration + a delete-on-success option.
Evidence: backend/api.py:60, backend/api.py:157.
AI/RAG
Q5: Why use hybrid retrieval (FAISS + BM25) on a curated KB?
- Vector search captures semantic similarity; BM25 captures exact keyword matches and domain terms.
- Hybrid weights are explicitly tuned (0.6/0.4) and threshold-filtered to avoid noise.
- Curated KB is small and structured, improving precision vs open web scraping.
Evidence: knowledge_base/curated_retrieval.py:68, knowledge_base/curated_retrieval.py:191–194.
Q6: How do you control hallucinations?
- Hard constraints: strict JSON schema outputs, consistent scoring rules, and “only report what is visible” instructions (prompt-level).
- Retrieval grounding: the prompt injects retrieved guidelines and instructs explicit citations; backend warns when citations are missing.
- Limitations: no automated grounding verifier; no image-region evidence mapping; confidence field currently may be forced high by schema.
Evidence: prompts/safety_assessment_prompt.py (schema + instructions), backend/workflow.py:843–895.
Q7: Why is the retrieval query category-driven instead of image-driven?
- Chosen for determinism and stability: a fixed query + fixed k reduces run-to-run variability.
- Tradeoff: less adaptive retrieval; relies on the small curated KB and LLM2 to choose relevant hazards.
Evidence: backend/workflow.py:551–557.
Q8: How would you evaluate this system offline?
- Create a labeled dataset of images per scene + expected hazards and priority actions.
- Compute:
- scene classification accuracy
- hazard precision/recall by risk tier
- grounding score: % hazards mapped to retrieved guidelines
- cost/difficulty calibration vs human baseline
- Add regression tests that run on PRs with fixed seeds and stable snapshots.
Evidence: existing hooks: tests/test_knowledge_base.py, test_images/*, test_frontend.py.
Debugging & reliability
Q9: Tell me about a tricky production bug you fixed.
- Example evidenced in git history: “Root cause of iPhone direct camera upload failures” and Gemini robustness fixes.
- Likely root issues (from code): EXIF orientation, huge pixel dimensions causing token blowups and truncated JSON.
- Fix: EXIF transpose + downscale + retries/fallback parsing.
Evidence: commit f1b6c43 and backend/api.py:94.
Q10: What happens when image editing fails?
- Step 4 exceptions are caught and do not fail the main response; the job status becomes error and the frontend continues showing the text results.
Evidence: backend/workflow.py:1320+, backend/api.py:405, backend/api.py:642.
Product sense
Q11: What’s the core user promise and how do you keep it?
- Promise: “upload one photo, get actionable fall-prevention improvements.”
- Kept by: structured output, prioritized action plan, cost/difficulty, and visual explanation image.
- Risk: no explicit user feedback loop in product; no saved reports; no trust UI beyond “knowledge base references.”
Evidence: frontend/src/components/Results.jsx + KB reference section.
Q12: If you had to pick one metric to optimize first, what is it?
- Suggest: “Actionability rate” (users who implement ≥1 recommendation) + “time-to-first-action.”
- Instrument: track which priority actions were shown + user follow-up surveys.
- Repo currently has no analytics; would need to add.
Evidence: no telemetry found in code search.
Behavioral
Q13: How did you manage ambiguity in requirements?
- Built a small, high-precision curated KB (105 chunks) rather than scraping the web; kept scope to 11 scenes.
- Added determinism controls to keep outputs stable for demos and testing.
- Validated by unit tests for KB processing and a deploy simulation script.
Evidence: knowledge_base/curated_chunks/metadata.jsonl, backend/workflow.py:551, run_full_deploy_test.sh.
Q14: How do you communicate tradeoffs to stakeholders?
- Example: “We use fixed retrieval to reduce variability, which may miss some edge hazards; roadmap adds image-driven query + eval harness.”
- Backed by clear evidence anchors and a phased hardening plan (auth, storage, queue).
Evidence: backend/workflow.py:551–557, backend/api.py:154.
13. Roadmap (high-leverage upgrades)
Must (production hardening)
- Add auth + rate limiting (protect /assess, mitigate abuse); restrict CORS in production (backend/api.py:52).
- Make async image-edit jobs durable (queue + shared state) and move images to object storage (S3/GCS).
- Unify docs with the current code: model defaults, env var names (VITE_API_BASE vs the older VITE_API_URL), deployment steps, and correct file paths.
- Add server-side upload limits (max bytes, pixel count) and explicit content-type sniffing.
- Remove prompt/schema contradictions and align validation/post-processing with the actual output fields.
- Add CI (GitHub Actions) to run pytest + npm run build on PRs (mirrors run_full_deploy_test.sh).
Nice-to-have (quality + UX)
- Cache per-category embeddings to reduce latency/cost (query is repeated per category).
- Add offline eval harness + regression suite (labeled images; grounding metrics).
- Add streaming UX / progress per step (backend already has step timings).
- Add multi-image “whole home” report that merges hazards across rooms.
- Add multilingual UI and report export (PDF) with citations.
- Add observability: structured logs, tracing, and provider usage counters (cost dashboards).