An Empirical Study of LLM-Based Data Quality Rule Generation from Natural Language Requirements
Supervisor Name
Moamin Abughazala
Supervisor Email
m.abughazaleh@najah.edu
University
An-Najah National University
Research field
Computer Science
Bio
Assistant Professor at An-Najah National University, with 9+ years in research and 15+ years of experience in software development and testing.
Description
Modern data-intensive systems rely on automated data validation frameworks such as Great Expectations, Deequ, and Pandera to ensure the reliability and correctness of datasets. However, writing validation rules by hand requires both domain knowledge and familiarity with each framework's syntax, which makes the process time-consuming and error-prone. Recent advances in Large Language Models (LLMs) have demonstrated promising capabilities in generating code and assisting with software engineering tasks, creating an opportunity to automate the transformation of natural language data quality requirements into executable validation rules (a sketch of such a requirement-to-rule pair follows this description).

This project will conduct an empirical comparative study of multiple LLMs and prompting strategies, evaluating their ability to generate correct and executable data quality validation rules from natural language requirements. The study will:
- construct a benchmark dataset of data quality requirements,
- generate validation rules using several LLMs,
- evaluate the generated rules with execution-based metrics (a sketch of one possible harness appears after the example below),
- analyze common error patterns in LLM-generated rules.

The outcome of the project will provide insights into the capabilities and limitations of LLMs for automating data quality validation tasks and propose best practices for using LLMs in data engineering workflows.
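
To make the target output concrete, the following is a minimal sketch of the kind of requirement-to-rule pair the benchmark could contain, written with Pandera (one of the frameworks named above). The requirement wording, column names, and value ranges are illustrative assumptions, not actual project materials:

# Hypothetical requirement: "customer_id must be unique and non-null,
# and age must be an integer between 0 and 120."
import pandas as pd
import pandera as pa

# The validation rule an LLM would be asked to generate from that requirement.
schema = pa.DataFrameSchema({
    "customer_id": pa.Column(str, unique=True, nullable=False),
    "age": pa.Column(int, checks=pa.Check.in_range(0, 120)),
})

df = pd.DataFrame({"customer_id": ["c001", "c002"], "age": [34, 29]})
schema.validate(df)  # raises pandera.errors.SchemaError if any check fails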

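As for execution-based metrics, one plausible harness (an assumption about the study design, not a prescribed method) pairs each requirement with a clean dataset and a deliberately corrupted one, then scores a generated rule on whether it executes at all, accepts the clean data, and rejects the dirty data:

import pandas as pd
import pandera as pa

def evaluate_rule(schema: pa.DataFrameSchema,
                  clean: pd.DataFrame,
                  dirty: pd.DataFrame) -> dict:
    # Three execution-based signals: the rule runs, it passes valid data,
    # and it catches known violations in the corrupted data.
    result = {"executable": True, "accepts_clean": False, "rejects_dirty": False}
    try:
        schema.validate(clean)
        result["accepts_clean"] = True
    except pa.errors.SchemaError:
        pass  # rule ran but wrongly rejected clean data
    except Exception:
        result["executable"] = False  # generated code crashed outright
        return result
    try:
        schema.validate(dirty)  # should fail on the corrupted dataset
    except pa.errors.SchemaError:
        result["rejects_dirty"] = True
    except Exception:
        result["executable"] = False
    return result

Aggregating these flags across the benchmark would yield pass rates that can be compared across LLMs and prompting strategies.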