Large language models (LLMs) such as GPT-4o are increasingly used to support the development of exam items for knowledge-based certification tests. Early studies established that LLMs can generate plausible item drafts, but they also revealed recurring problems: hallucination, lack of depth, and superficial coverage of the targeted topic. Most prior research found that LLM-generated items typically require review and revision by subject-matter experts (SMEs), though newer models and more structured prompting strategies have shown promise.
This study builds on those findings by evaluating three advanced prompting strategies: retrieval-augmented generation (RAG), agentic AI, and thinking AI. Each approach seeks to address the limitations of zero-shot prompting, which simply instructs a model to "write an item" without further structure or context.
Retrieval-augmented generation (RAG) enriches the model's prompt with curated content retrieved from authoritative sources, with the aim of increasing factual accuracy and domain relevance, especially where precise distinctions matter. In our previous implementation, which used generic psychometric content, RAG did not improve item usability, likely because the model already encoded similar knowledge internally. In the current research, we apply RAG to narrower, higher-stakes domains where hallucination is both more likely and more costly: Illinois traffic law and MariaDB version 10. We hypothesize that RAG will be more effective in these contexts, where tight alignment to specific regulations or versioned documentation is essential.
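The sketch below illustrates the general shape of such a pipeline: relevant excerpts are retrieved from a curated source corpus and injected into the item-writing prompt. It is a minimal illustration rather than our production implementation; the excerpts, the keyword-based retrieval, and the prompt wording are placeholders (a real pipeline would use embedding-based retrieval over the actual Illinois or MariaDB source documents).

```python
# Minimal RAG sketch (illustrative only): retrieve the most relevant excerpts
# from a small, curated corpus and prepend them to the item-writing prompt.
from openai import OpenAI

client = OpenAI()

SOURCE_EXCERPTS = [
    "Excerpt 1: placeholder text from the curated reference corpus ...",
    "Excerpt 2: placeholder text covering a related rule or feature ...",
]

def retrieve(topic: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; stands in for a vector search."""
    topic_words = set(topic.lower().split())
    scored = sorted(
        SOURCE_EXCERPTS,
        key=lambda ex: len(topic_words & set(ex.lower().split())),
        reverse=True,
    )
    return scored[:k]

def draft_item(topic: str) -> str:
    """Build a prompt that grounds the item in the retrieved excerpts."""
    context = "\n\n".join(retrieve(topic))
    prompt = (
        "Using ONLY the reference material below, write one four-option "
        "multiple-choice item on the topic, with the key marked.\n\n"
        f"Reference material:\n{context}\n\nTopic: {topic}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(draft_item("right-of-way at uncontrolled intersections"))
```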
Agentic AI refers to prompting strategies that allow the model to engage in multi-step workflows, incorporating its own evaluation and revision. Our earlier implementation used a single model with limited autonomy and no tools, relying on fixed psychometric rules and a one-pass generate-review-revise cycle. This was often ineffective, as the model struggled to critique its own output. Our current approach introduces two independent models, one to generate and the other to review and revise, interacting over multiple rounds. This division of labor is designed to overcome the model's self-evaluation blind spots, improve quality, and reduce the need for SME revision.
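A simplified version of this generate-and-review loop is sketched below. The model names, prompt wording, and number of rounds are illustrative assumptions; the essential feature is the separation between the generating model and an independent reviewing model.

```python
# Illustrative two-model generate/review loop (not the exact study pipeline):
# one model drafts the item, a second model critiques and revises it, and the
# exchange repeats for a fixed number of rounds.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    """Single-turn call to the given chat model."""
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def generate_and_refine(objective: str, rounds: int = 3) -> str:
    # Generator model produces the first draft.
    item = ask(
        "gpt-4o",
        f"Write one four-option multiple-choice item assessing: {objective}",
    )
    for _ in range(rounds):
        # An independent reviewer model critiques the draft against
        # item-writing guidelines and returns a revised version,
        # sidestepping the generator's self-evaluation blind spots.
        item = ask(
            "gpt-4o-mini",  # placeholder for a distinct reviewer model
            "Review the item below against standard multiple-choice "
            "item-writing guidelines (one unambiguous key, plausible "
            "distractors, no cueing) and return an improved version only.\n\n"
            + item,
        )
    return item
```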
Thinking AI encompasses prompting strategies that direct the model to reason step-by-step within a single session. These include “instructed thinking” (a fixed sequence of steps, such as planning and revision) and “iterated thinking” (cycles of self-questioning and improvement). We compare these prompting strategies with OpenAI’s o3, a dedicated reasoning model that builds on the company’s earlier structured-reasoning models. Thinking prompts are more computationally expensive but may yield greater logical consistency, item complexity, and nuance, especially in areas requiring subtle conceptual understanding or balanced distractors.
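The following sketch shows what an “instructed thinking” prompt might look like: a fixed plan-draft-check-revise sequence executed within a single call. The step wording is an assumption for illustration; “iterated thinking” would instead feed the model’s own critique back across multiple calls.

```python
# Sketch of an "instructed thinking" prompt (assumed wording): the model is
# told to plan, draft, self-check, and revise inside one call, and to return
# only the final item.
from openai import OpenAI

client = OpenAI()

INSTRUCTED_THINKING_PROMPT = """
You are writing one four-option multiple-choice item assessing: {objective}

Work through these steps in order:
1. Plan: identify the single concept to test and a realistic scenario.
2. Draft: write the stem, the key, and three plausible distractors.
3. Check: verify there is exactly one defensible key and no cueing.
4. Revise: fix any problems found in step 3.

Return only the final revised item.
"""

def write_item_with_thinking(objective: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": INSTRUCTED_THINKING_PROMPT.format(objective=objective),
        }],
    )
    return response.choices[0].message.content
```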
In addition to evaluating these prompting strategies, we have refined our evaluation metrics. Previous scoring focused on the amount of revision a draft required, which could reward trivial but technically correct items. We are introducing a complementary measure of “suitability,” aligned with how SMEs assess item quality and covering factors such as depth, relevance, and instructional value.
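As a hypothetical illustration, the two measures can be recorded side by side, with suitability computed from SME-style ratings independently of how much revision a draft needs. The dimensions, scales, and weighting below are placeholders, not our actual rubric.

```python
# Hypothetical record of the two complementary measures for one drafted item.
from dataclasses import dataclass

@dataclass
class ItemEvaluation:
    revision_needed: int      # e.g., 0 = usable as is ... 3 = rewrite required
    depth: int                # 1-5 SME rating
    relevance: int            # 1-5 SME rating
    instructional_value: int  # 1-5 SME rating

    def suitability(self) -> float:
        """Average of the SME-style ratings, independent of revision effort."""
        return (self.depth + self.relevance + self.instructional_value) / 3
```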
All AI-generated content is treated as draft material subject to SME oversight. An SME initiates each item by defining a topic or objective, then selects among the AI-generated drafts and edits the chosen draft. Items enter the pool only after review by at least two additional SMEs. This process reflects our commitment to the ethical use of generative AI, prioritizing transparency, accountability, and human judgment in high-stakes assessment contexts.
This research evaluates how far prompting strategies can be pushed to improve the quality of AI-authored items. By isolating the effects of structured input, iterative reasoning, and model collaboration, we aim to identify practical methods for reducing SME workload without compromising quality. The findings have implications for any test development program that hopes to use LLMs to accelerate item authoring while preserving the rigor and fairness essential to certification.