arXiv

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

June 2, 2026 · Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

Title: Evaluating Local LLMs for Natural-Language-to-SQL Queries in Biopharmaceutical Production: An Empirical Study on Standard Hardware

Abstract:

Biopharmaceutical manufacturers are bound by strict regulatory standards, including FDA guidelines, EU Good Manufacturing Practice (GMP) protocols, and the EU AI Act, which often limit the deployment of cloud-based artificial intelligence solutions. Locally hosted large language models (LLMs) offer a privacy-preserving alternative, but their effectiveness in pharmaceutical production environments has not been thoroughly investigated. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B), served locally via Ollama, on natural-language-to-SQL translation for a pharmaceutical manufacturing database.
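As an illustration of what a fully local natural-language-to-SQL call can look like, the following minimal sketch builds a request for Ollama's default local REST endpoint (`/api/generate`) using only the Python standard library. It is not the authors' implementation: the model tags, the prompt wording, the temperature setting, and the one-table schema are assumptions made for the example.

```python
import json
import urllib.request

# Ollama's default local endpoint; no data leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumed Ollama tags for the four evaluated models (verify with `ollama list`).
MODEL_TAGS = ["qwen2.5-coder:7b", "llama3.1:8b", "mistral:7b", "meditron:7b"]

# Hypothetical schema excerpt; the real PharmaBatchDB AI schema is not given here.
EXAMPLE_SCHEMA = "Batch(BatchID INT, ProductName NVARCHAR(100), Status NVARCHAR(20))"


def build_prompt(schema: str, question: str) -> str:
    """Combine the schema and the user question into a single instruction prompt."""
    return (
        "You translate questions into Microsoft SQL Server (T-SQL) queries.\n"
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\n"
        "Return only the SQL query."
    )


def build_request(model: str, schema: str, question: str) -> urllib.request.Request:
    """Create the POST request for a non-streaming generation call."""
    payload = {
        "model": model,
        "prompt": build_prompt(schema, question),
        "stream": False,                 # return one JSON object instead of chunks
        "options": {"temperature": 0},   # assumption: low randomness for SQL output
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, schema: str, question: str, timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, schema, question), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With an Ollama server running and the model pulled, `ask(MODEL_TAGS[0], EXAMPLE_SCHEMA, "How many batches are on hold?")` would return the model's raw text, which still has to be parsed and validated before any query is executed.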

For the evaluation, the authors built a platform, PharmaBatchDB AI, on a synthetic Microsoft SQL Server database of roughly 63,000 records distributed across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Each model was tested on 60 domain-specific natural-language questions and scored on SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency.
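Two of these metrics are simple enough to sketch directly. The code below shows one plausible way to compute an SQL extraction rate (the share of responses from which a query can be recovered) and a ROUGE-L F1 score based on the longest common subsequence of whitespace tokens. The extraction heuristic, the tokenization, and the use of the balanced F1 variant are assumptions; the paper's exact definitions may differ.

```python
import re
from typing import List, Optional, Sequence

FENCE = "`" * 3  # Markdown code fence that chat models often wrap SQL in


def extract_sql(response: str) -> Optional[str]:
    """Return the first SQL statement found in a model response, or None."""
    # Prefer the contents of a fenced block if the model produced one.
    fenced = re.search(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, response,
                       re.DOTALL | re.IGNORECASE)
    candidate = fenced.group(1) if fenced else response
    # Crude heuristic: take text from the first SELECT/WITH up to a semicolon or the end.
    match = re.search(r"\b(SELECT|WITH)\b.*?(?:;|\Z)", candidate,
                      re.DOTALL | re.IGNORECASE)
    return match.group(0).strip() if match else None


def extraction_rate(responses: Sequence[str]) -> float:
    """Fraction of responses from which an SQL statement could be extracted."""
    if not responses:
        return 0.0
    return sum(extract_sql(r) is not None for r in responses) / len(responses)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (assumed tokenization)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

The regular-expression fallback can misfire on prose that happens to contain "with" or "select", so a production evaluation would more likely parse the candidate with a real SQL parser before counting it as extracted.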

The findings indicate that Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B successfully generated SQL queries for every task, whereas Meditron 7B was largely ineffective, failing on most tasks due to insufficient context-window capacity and weak SQL generation skills. Among the successful models, Llama 3.1 8B demonstrated the highest level of SQL compliance, while Qwen 2.5 Coder 7B showed superior text similarity and factual consistency. Statistical analysis revealed no significant performance gap between these two leading models.

Ultimately, the study concludes that general-purpose LLMs fine-tuned for coding tasks surpass specialized biomedical models in generating structured queries for pharmaceutical data. While it is technically feasible to implement GxP-aligned natural-language query systems entirely locally on consumer-grade hardware, current performance metrics necessitate human supervision and subsequent validation before such systems can be safely utilized in regulated environments.
