An Empirical Study of LLM-Based Data Quality Rule Generation from Natural Language Requirements
Supervisor Name
Moamin Abughazala
Supervisor Email
m.abughazaleh@najah.edu
University
An-Najah National University
Research field
Computer Science
Bio
Assistant Professor at An-Najah National University, with 9+ years in research and 15+ years of experience in software development and testing.
Description
Modern data-intensive systems rely on automated data validation frameworks such as Great Expectations, Deequ, and Pandera to ensure the reliability and correctness of datasets. However, writing validation rules by hand requires both domain knowledge and familiarity with each framework's syntax, which makes the process time-consuming and error-prone. Recent advances in Large Language Models (LLMs) have demonstrated promising capabilities in generating code and assisting with software engineering tasks, creating an opportunity to automate the transformation of natural language data quality requirements into executable validation rules (a sketch of such a requirement-to-rule pair follows this description).

This project will conduct an empirical comparative study of multiple LLMs and prompting strategies, evaluating their ability to generate correct and executable data quality validation rules from natural language requirements. The study will:
- construct a benchmark dataset of data quality requirements,
- generate validation rules using several LLMs,
- evaluate the generated rules with execution-based metrics (a sketch of one possible harness appears after the example below),
- analyze common error patterns in LLM-generated rules.

The outcome of the project will provide insights into the capabilities and limitations of LLMs for automating data quality validation tasks and propose best practices for using LLMs in data engineering workflows.
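
To make the target output concrete, the following is a minimal sketch of the kind of requirement-to-rule pair the benchmark could contain, written with Pandera (one of the frameworks named above). The requirement wording, column names, and value ranges are illustrative assumptions, not actual project materials:

# Hypothetical requirement: "customer_id must be unique and non-null,
# and age must be an integer between 0 and 120."
import pandas as pd
import pandera as pa

# The validation rule an LLM would be asked to generate from that requirement.
schema = pa.DataFrameSchema({
    "customer_id": pa.Column(str, unique=True, nullable=False),
    "age": pa.Column(int, checks=pa.Check.in_range(0, 120)),
})

df = pd.DataFrame({"customer_id": ["c001", "c002"], "age": [34, 29]})
schema.validate(df)  # raises pandera.errors.SchemaError if any check fails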

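As for execution-based metrics, one plausible harness (an assumption about the study design, not a prescribed method) pairs each requirement with a clean dataset and a deliberately corrupted one, then scores a generated rule on whether it executes at all, accepts the clean data, and rejects the dirty data:

import pandas as pd
import pandera as pa

def evaluate_rule(schema: pa.DataFrameSchema,
                  clean: pd.DataFrame,
                  dirty: pd.DataFrame) -> dict:
    # Three execution-based signals: the rule runs, it passes valid data,
    # and it catches known violations in the corrupted data.
    result = {"executable": True, "accepts_clean": False, "rejects_dirty": False}
    try:
        schema.validate(clean)
        result["accepts_clean"] = True
    except pa.errors.SchemaError:
        pass  # rule ran but wrongly rejected clean data
    except Exception:
        result["executable"] = False  # generated code crashed outright
        return result
    try:
        schema.validate(dirty)  # should fail on the corrupted dataset
    except pa.errors.SchemaError:
        result["rejects_dirty"] = True
    except Exception:
        result["executable"] = False
    return result

Aggregating these flags across the benchmark would yield pass rates that can be compared across LLMs and prompting strategies.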