Step towards Self-Serve Analytics: Natural Language Queries on Databases
This post builds on my earlier one, https://waqassiddiqi.com/from-text-to-charts-generative-ui-using-assistant-ui, which documented the UI part. Here I cover what happens under the hood when a user asks a question like "How many orders were placed yesterday?"
What is LlamaIndex?
LlamaIndex is an open-source framework that connects large language models (LLMs) with private data. It takes unstructured or structured data (databases, documents, APIs, etc.), builds indices, and allows natural-language queries on top of them.
At its core, LlamaIndex does three things:
Data ingestion: Pulls data from multiple sources: databases, data lakes, APIs, or documents.
Indexing: Creates efficient structures so that LLMs can understand and query the data.
Querying: Lets users ask questions in plain English, with the LLM turning them into the right underlying queries.
Building Blocks of Our Implementation
SQLDatabase: A thin wrapper over SQLAlchemy that exposes our database to LlamaIndex. It examines tables, columns, and table relationships, and can sample rows.
Nodes & Schemas: LlamaIndex represents table schemas as structured Nodes (e.g. SQLTableSchema). These become part of the model's context, so the model knows what exists.
ObjectIndex: An index over schema objects (tables/views). Useful for routing questions to the right subset of tables.
NLSQLTableQueryEngine: Creates a prompt that includes (a) the user's question, (b) the relevant schema, and (c) examples, then asks the LLM to suggest SQL.
Settings / Providers: We configure the global LLM and embedding models. For our use case, we use an Azure OpenAI deployment.
The Request Lifecycle (What happens on /ask-sql)
User asks: “Top 5 customers by total invoice amount for 2024.”
Schema context: SQLDatabase exposes table/column metadata. Pre-index these with ObjectIndex for cleaner prompts.
```python
# app/sql_engine.py (excerpt)
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.core.objects import SQLTableNodeMapping, ObjectIndex
from llama_index.core.indices.struct_store import SQLTableSchema

# `settings` is the app's config object (connection URL, row caps, etc.)
engine = create_engine(settings.sqlalchemy_url, pool_pre_ping=True)
sqldb = SQLDatabase(engine)

node_mapping = SQLTableNodeMapping(sqldb)
schema_nodes = [SQLTableSchema(table_name=t) for t in sqldb.get_usable_table_names()]
obj_index = ObjectIndex.from_objects(schema_nodes, node_mapping, VectorStoreIndex)
```
Prompt build: NLSQLTableQueryEngine assembles a system prompt with rules (only SELECT, use explicit JOINs, respect LIMIT), the filtered schema, and examples.
```python
# app/sql_engine.py (excerpt)
from llama_index.core.query_engine import NLSQLTableQueryEngine

table_retriever = obj_index.as_retriever(similarity_top_k=3)
query_engine = NLSQLTableQueryEngine(
    sql_database=sqldb,
    table_retriever=table_retriever,
    synthesize_response=True,
)

resp = query_engine.query("Top 5 customers by total invoice amount in 2024")
candidate_sql = getattr(resp, "metadata", {}).get("sql_query") or str(resp)
```
LLM proposes SQL: Azure OpenAI returns a candidate query (e.g., a SELECT with joins/filters).
Guardrails: We validate the SQL (deny DDL/DML, multiple statements, comments), whitelist tables, and enforce LIMIT.
```python
# app/guardrails.py usage (excerpt)
from sqlalchemy import text
from .guardrails import validate_sql, whitelist_tables, enforce_limit

validate_sql(candidate_sql)
whitelist_tables(candidate_sql)
safe_sql = enforce_limit(candidate_sql, settings.max_rows)
```
Execution: If validation passes, we run it via SQLAlchemy to get results.

```python
import pandas as pd

df = pd.read_sql(text(safe_sql), engine)
```
Prompting (Why it matters)
Good prompts lower hallucinations and push the model toward minimal, correct SQL.
We encode:
Rules: No SELECT *, explicit JOINs, always include LIMIT, no data modification.
Scope: Only the whitelisted tables/columns (keeps the search space small and safer).
Few‑shot examples: Show the model how we want dates, GROUP BYs, and aliases formatted.
Fallback behavior: If the question is ambiguous, ask a brief clarifying question (we allow the engine to respond with a follow‑up, or you can implement a retry policy).
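The rules and few-shot examples above can be folded into a single text-to-SQL prompt. Here is a minimal sketch using plain string templating; the example question/SQL pair, table names, and rule wording are illustrative, not the exact prompt our engine uses.

```python
# Illustrative few-shot pair: (question, SQL we want the model to imitate).
FEW_SHOTS = [
    (
        "Total invoice amount per month in 2024",
        "SELECT date_trunc('month', invoice_date) AS month, "
        "SUM(total) AS total_amount FROM invoices "
        "WHERE invoice_date >= '2024-01-01' AND invoice_date < '2025-01-01' "
        "GROUP BY 1 ORDER BY 1 LIMIT 100;",
    ),
]

RULES = (
    "Rules:\n"
    "- SELECT statements only; never modify data.\n"
    "- No SELECT *; list columns explicitly.\n"
    "- Use explicit JOIN ... ON syntax.\n"
    "- Always include a LIMIT clause.\n"
)

def build_prompt(question: str, schema: str) -> str:
    """Assemble rules, the scoped schema, few-shot examples, and the question."""
    examples = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOTS)
    return f"{RULES}\nSchema:\n{schema}\n\nExamples:\n{examples}\n\nQ: {question}\nSQL:"

prompt = build_prompt(
    "Top 5 customers by total invoice amount in 2024",
    "customers(id, name); invoices(id, customer_id, invoice_date, total)",
)
```

In LlamaIndex proper, a custom prompt like this would typically be supplied to the query engine as a prompt template rather than assembled by hand.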
Schema Exposure & Scoping
Large schemas overwhelm the prompt and increase error rates. To keep the context small and relevant:
Create read‑only reporting views that reflect analytics needs (clean joins, intuitive names).
Maintain a synonym dictionary (e.g., “AU” → country='Australia', “last quarter” → [date range]).
For multi‑tenant apps, scope by RLS (Row-Level Security) or inject tenant predicates into generated SQL.
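A synonym pass like this can run on the question before it reaches the model. A minimal sketch, where the mappings are examples only and a real dictionary would come from your business glossary:

```python
import re

# Illustrative entries; in practice this comes from a maintained glossary.
SYNONYMS = {
    "AU": "country = 'Australia'",
    "LTM": "the last twelve months",
}

def normalize_question(question: str, synonyms: dict[str, str]) -> str:
    """Replace known business shorthand with canonical phrasing pre-prompt.

    Word boundaries prevent 'AU' from matching inside words like 'AUDIT'.
    """
    for term, canonical in synonyms.items():
        question = re.sub(rf"\b{re.escape(term)}\b", canonical, question)
    return question
```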
Guardrails (What we block & why)
Forbidden tokens: Reject anything that looks like DDL/DML or comment injection: DROP, UPDATE, --, /* ... */, etc.
Whitelist tables: Only allow known‑safe tables/views.
Single‑statement only: Block semicolons and multi‑statements.
LIMIT: Always enforce an upper bound on rows.
AST validation (optional): For stricter control, parse the SQL (e.g., via sqlglot) and inspect the abstract syntax tree.
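The token, whitelist, and LIMIT checks above could be sketched with plain regexes as below. This is an illustrative approximation, not our actual guardrails module: the allowlist contents are made up, and a real AST-based walk (e.g. with sqlglot) is stricter than pattern matching.

```python
import re

FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|GRANT)\b|--|/\*",
    re.IGNORECASE,
)
ALLOWED_TABLES = {"customers", "invoices"}  # illustrative allowlist

def validate_sql(sql: str) -> None:
    """Reject DDL/DML keywords, comments, and multi-statement input."""
    if FORBIDDEN.search(sql):
        raise ValueError("forbidden token in SQL")
    if ";" in sql.rstrip().rstrip(";"):  # any semicolon not at the very end
        raise ValueError("multiple statements are not allowed")

def whitelist_tables(sql: str) -> None:
    """Rough check that only allow-listed tables appear after FROM/JOIN."""
    for table in re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, re.IGNORECASE):
        if table.lower() not in ALLOWED_TABLES:
            raise ValueError(f"table not allowed: {table}")

def enforce_limit(sql: str, max_rows: int) -> str:
    """Append a LIMIT if the query has none, and cap an existing one."""
    m = re.search(r"\bLIMIT\s+(\d+)", sql, re.IGNORECASE)
    if m:
        if int(m.group(1)) > max_rows:
            sql = sql[: m.start(1)] + str(max_rows) + sql[m.end(1):]
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"
```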
As I heard a speaker on this topic put it:
Treat the model like an untrusted user—you must validate and constrain everything it produces.
Common Pitfalls (and how we mitigate them)
Ambiguous business terms
“Sales” could mean gross, net, or invoiced. Mitigation: a glossary/synonym map and/or a clarify‑question policy.
Join path confusion
With multiple possible join routes, the model guesses. Mitigation: expose reporting views with the joins baked in; hide raw tables.
Date math & time zones
“Last month” depends on locale/time zone. Mitigation: convert relative periods to exact ranges server‑side (e.g., Australia/Perth), and teach the model the pattern.
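Resolving a relative period server-side might look like the sketch below: "last month" becomes an exact half-open range, pinned to a known time zone, before the question ever reaches the model. The function name and range convention are illustrative.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

def last_month_range(now: datetime) -> tuple[date, date]:
    """Return the half-open range [first of previous month, first of current month)."""
    first_of_this_month = now.date().replace(day=1)
    if first_of_this_month.month == 1:
        first_of_prev = first_of_this_month.replace(
            year=first_of_this_month.year - 1, month=12
        )
    else:
        first_of_prev = first_of_this_month.replace(
            month=first_of_this_month.month - 1
        )
    return first_of_prev, first_of_this_month

# Evaluate "now" in the server's canonical time zone, not the client's.
start, end = last_month_range(datetime.now(ZoneInfo("Australia/Perth")))
```

The resolved range can then be injected into the prompt ("between 2024-02-01 and 2024-03-01") so the model never does date math itself.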
NULL semantics & aggregates
SUM/COUNT with NULLs can be tricky. Mitigation: steer the model via few‑shots to use COALESCE and correct grouping.
Performance outliers
A generated query may be logically correct but slow. Mitigation: LIMITs, timeouts, and a vault of approved views with proper indexes.
Prompt bloat: cost/latency
Too much schema in context increases tokens. Mitigation: scope tables, cache schema summaries, and keep few‑shots tight.
Security drift
New tables accidentally become queryable. Mitigation: an explicit allowlist.
Known Shortcomings / What LlamaIndex won’t solve for you
Ground truth vs. business logic: LlamaIndex can suggest SQL, but it doesn’t know the business semantics; we must add them (views/glossary).
Optimiser hints / fine‑grained planning: The LLM cannot guarantee optimal queries; we still need indexes and materialized views.
100% correctness: No framework can fully prevent occasional hallucinations or mis‑joins.
Complex multi‑step analytics: Long, multi‑CTE analytics may exceed token budgets. Consider prebuilt views or “tooling” that lets the app expose safe stored procedures.
Stateful chats: Carrying lots of conversation context increases tokens and error surface. Be deliberate with memory.
Performance & Cost Tuning Recipe
Keep schema small (views, not raw tables).
Short prompts with a handful of quality few‑shots.
Cache schema summaries and reuse the engine.
LIMIT early, and use Postgres statement timeouts.
Prefer a smaller model for straightforward queries; upgrade only when needed.
Introduce result caching or materialised views for popular questions.
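Result caching for popular questions can start very small. A sketch of an in-process cache keyed on the normalized SQL text; in production you would add a TTL and likely back it with Redis or a materialised view (names here are illustrative):

```python
import hashlib

_cache: dict[str, list[dict]] = {}

def cache_key(sql: str) -> str:
    """Collapse whitespace and lowercase before hashing, so trivially
    different renderings of the same SQL hit the same cache entry."""
    normalized = " ".join(sql.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

def run_cached(sql: str, execute) -> list[dict]:
    """Run `execute(sql)` only on a cache miss; otherwise reuse the rows."""
    key = cache_key(sql)
    if key not in _cache:
        _cache[key] = execute(sql)
    return _cache[key]
```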
Future Improvements
AST validator: Integrate sqlglot to parse and verify only safe SELECT shapes (no cross joins, bounded GROUP BY, approved functions).
Query templates: Allow approved parameterized templates for high‑risk analytics while still supporting free‑text for low‑risk ones.
Observability to Azure: Push traces/metrics to Azure Monitor / App Insights; add OpenTelemetry spans for each stage.
Synonym & unit dictionary: Map business terms (e.g., “AU”, “LTM”) and units (e.g., k, m) to canonical forms pre‑prompt.
Partial structured outputs: Ask the model for YAML‑typed intents before emitting SQL; these can be easier to validate.
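Validating such an intent before any SQL is generated could look like the sketch below. The intent is shown already parsed into a dict (e.g. from the model's YAML output), and the field names are hypothetical, not a spec:

```python
# Hypothetical required fields for a query intent.
REQUIRED_FIELDS = {"metric", "tables", "limit"}

def validate_intent(intent: dict) -> dict:
    """Reject intents with missing fields or a non-positive row limit."""
    missing = REQUIRED_FIELDS - intent.keys()
    if missing:
        raise ValueError(f"intent missing fields: {sorted(missing)}")
    if not isinstance(intent["limit"], int) or intent["limit"] <= 0:
        raise ValueError("limit must be a positive integer")
    return intent

intent = validate_intent({
    "metric": "sum(invoices.total)",
    "tables": ["customers", "invoices"],
    "limit": 5,
})
```

The appeal is that a rejected intent produces a precise error ("missing fields") to feed back to the model, instead of a vague SQL failure.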
Link to the example GitHub repository: https://github.com/waqassiddiqi/llamaindex-postgres-nlp-starter