Multiple-Choice Questions are Efficient and Robust LLM Evaluators. In this work, we convert two of the most popular LLM evaluation benchmarks, GSM8K and MATH, into multiple-choice format, and also construct a new program reasoning benchmark, PythonIO, from HumanEval and MBPP.
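A minimal sketch of how a free-form benchmark item (e.g., a GSM8K problem) could be recast as a four-option multiple-choice item by pairing the gold answer with sampled distractors. The function name and the distractor strategy (drawing from a pool of plausible wrong answers) are assumptions for illustration, not the paper's exact conversion procedure.

```python
import random

def to_multiple_choice(question: str, gold_answer: str,
                       distractor_pool: list[str], k: int = 3, seed: int = 0):
    """Turn a free-form QA item into a 4-option multiple-choice item.

    `distractor_pool` is assumed to hold plausible wrong answers
    (e.g., gold answers from other items or perturbed numbers).
    """
    rng = random.Random(seed)
    distractors = rng.sample([d for d in distractor_pool if d != gold_answer], k)
    options = distractors + [gold_answer]
    rng.shuffle(options)
    letters = ["A", "B", "C", "D"]
    prompt_lines = [question] + [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    correct_letter = letters[options.index(gold_answer)]
    return "\n".join(prompt_lines), correct_letter

# Example: a GSM8K-style item with numeric distractors.
prompt, answer = to_multiple_choice(
    "Natalia sold clips to 48 friends in April, and half as many in May. "
    "How many clips did she sell in total?",
    gold_answer="72",
    distractor_pool=["24", "48", "96", "120"],
)
```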
Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions . . . Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices.
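The probe described here can be illustrated with a choices-only prompt builder: the question stem is withheld and the model sees only the options, so above-chance accuracy signals reliance on answer-choice artifacts. The exact prompt wording below is an assumption, not the paper's template.

```python
def choices_only_prompt(options: list[str]) -> str:
    """Build a prompt that shows only the answer choices, omitting the question stem."""
    letters = "ABCD"
    lines = ["Select the most likely correct answer from the choices below."]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

# Compare model accuracy on prompts like this against full question+choices prompts.
print(choices_only_prompt(["Paris", "London", "Berlin", "Madrid"]))
```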
The dataset for Generating Multiple Choice Questions from Scientific . . . The experimental evaluation of our dataset reveals the potential of LLMs in producing diverse and high-quality MCQs. The results highlight the current capabilities of LLMs in handling different types of scientific data and generating meaningful educational content.
Evaluation of LLM Generated Multiple Choice Questions Motivated by . . . Abstract: Automated question generation could reduce the amount of time educators must spend developing content for their courses. This research explores the potential of large language models (LLMs), specifically GPT-4o, to automate the generation of multiple-choice questions.
Generation and Assessment of Multiple-Choice Questions from Video . . . We present an empirical study evaluating the quality of multiple-choice questions (MCQs) generated by Large Language Models (LLMs) from a corpus of video transcripts of course lectures in an online data science degree program.
MMLU | DeepEval - The Open-Source LLM Evaluation Framework. MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating LLMs through multiple-choice questions. These questions cover 57 subjects such as math, history, law, and ethics. For more information, visit the MMLU GitHub page.
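For context, a minimal sketch of how an MMLU-style multiple-choice evaluation loop typically works: format each item as question plus lettered options, ask for a letter, and score exact matches. This is a generic illustration of the format and scoring, not DeepEval's own API; `query_model` is a hypothetical stand-in for whatever LLM client is used.

```python
import re

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with your own client."""
    raise NotImplementedError

def mmlu_style_accuracy(items) -> float:
    """`items` are dicts with 'question', 'choices' (4 strings), 'answer' ('A'-'D')."""
    letters = "ABCD"
    correct = 0
    for item in items:
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
            + "\nAnswer:"
        )
        reply = query_model(prompt)
        match = re.search(r"\b([ABCD])\b", reply)
        predicted = match.group(1) if match else None
        correct += predicted == item["answer"]
    return correct / len(items)
```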
Can Multiple-choice Questions Really Be Useful in Detecting the . . . Our results reveal a relatively low correlation between answers from MCQs and LFGQs (long-form generation questions) for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks.
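The snippet does not spell out the paper's two metrics, so the sketch below shows one common way such ideas are operationalized: consistency as agreement between MCQ and free-form answers to the same question, and confidence as the majority-vote frequency over repeated samples. Treat both definitions as assumptions, not the paper's actual formulas.

```python
from collections import Counter

def consistency(mcq_answers: list[str], lfgq_answers: list[str]) -> float:
    """Fraction of questions where the MCQ choice agrees with the free-form answer
    (after whatever normalization or grading is applied upstream)."""
    matches = sum(a == b for a, b in zip(mcq_answers, lfgq_answers))
    return matches / len(mcq_answers)

def confidence(samples: list[str]) -> float:
    """Majority-vote frequency over repeated samples of the same question:
    1.0 means the model always gives the same answer; ~1/k suggests guessing."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

# Example: five samples of one MCQ answered by the same model.
print(confidence(["B", "B", "B", "C", "B"]))  # 0.8
```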
Self-evaluation of LLMs on challenging LLM-generated STEM MCQs. Given the absence of benchmark STEM datasets on multiple-choice questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia.
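A rough sketch of the kind of prompt such a pipeline might use to turn a Wikipedia passage into an MCQ with a machine-readable answer key. The prompt wording and the JSON schema are assumptions for illustration, not the study's actual generation protocol.

```python
import json

MCQ_TEMPLATE = """Read the passage below and write one challenging multiple-choice
question about it, with four options (A-D) and exactly one correct answer.
Return JSON with keys: question, options, correct_letter.

Passage:
{passage}"""

def build_mcq_generation_prompt(passage: str) -> str:
    """Fill the template with a source passage (e.g., a Wikipedia excerpt)."""
    return MCQ_TEMPLATE.format(passage=passage)

def parse_mcq(raw: str) -> dict:
    """Parse the model's JSON reply; raises if the reply is malformed or incomplete."""
    mcq = json.loads(raw)
    assert set(mcq) >= {"question", "options", "correct_letter"}
    return mcq
```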