As artificial intelligence is increasingly applied in education, Large Language Models (LLMs), despite their remarkable capabilities, still show notable limitations when handling multiple-choice questions. Building on this insight, the research team, including Dr. Vu Duc Ly, lecturer at the School of Computing and Information Technology, Eastern International University (EIU), and one of the main authors, proposed a method called Single-Token Logit (STL). The approach lets the model evaluate each answer option independently, significantly improving accuracy and opening up more practical applications of AI in education.
The study, entitled “Enhancing large language model performance for automatic zero-shot multiple-choice question answering via single-token logit prompting”, has recently been published in Computers and Education: Artificial Intelligence, an open-access journal from Elsevier that is currently ranked Q1 and holds the #1 position in Artificial Intelligence and #2 in Education according to SCImago Journal Rank.
The research content can be summarized as follows:
Although Large Language Models (LLMs) offer significant potential for educational applications, they still demonstrate clear limitations when answering multiple-choice questions (MCQs). Since LLMs are optimized for autoregressive token prediction, their performance declines considerably when the answer choices are shuffled — a phenomenon known as the Multiple-Choice Symbol Binding (MCSB) limitation.
To mitigate this issue, we introduce a new prompting technique called Single-Token Logit (STL). Instead of evaluating the output logits of all answer labels at once, STL extracts and normalizes the logit of a single token, specifically “yes”, to verify each option independently.
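The core idea can be sketched in a few lines. The helper names below (`stl_select`, `fake_yes_logit`) and the exact prompt wording are illustrative assumptions, not the paper's implementation; in a real system, the per-option score would be the logit assigned to the “yes” token in the LLM's next-token distribution rather than a stub function.

```python
import math

def softmax(xs):
    # Numerically stable softmax, used to normalize the per-option "yes" logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stl_select(question, options, yes_logit_fn):
    """Score each option independently via the logit of the "yes" token,
    normalize across options, and return (best index, probabilities)."""
    logits = []
    for opt in options:
        # One verification prompt per option: because each option is judged
        # in isolation, shuffling the options no longer affects the result.
        prompt = f"{question}\nProposed answer: {opt}\nIs this answer correct?"
        logits.append(yes_logit_fn(prompt))
    probs = softmax(logits)
    best = max(range(len(options)), key=lambda i: probs[i])
    return best, probs

# Toy stand-in for a model call (assumption for the sketch): a real run would
# read the "yes" logit from the model's output distribution instead.
def fake_yes_logit(prompt):
    return 5.0 if "Paris" in prompt else 1.0

best, probs = stl_select(
    "What is the capital of France?",
    ["Lyon", "Paris", "Nice", "Lille"],
    fake_yes_logit,
)
print(best)  # index of the highest-scoring option
```

Because every option is scored in its own prompt, the method sidesteps the symbol-binding sensitivity described above: there is no fixed label ordering for the model to latch onto.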
The research team conducted a comprehensive evaluation of STL against established baseline methods, including Labels Token Logits (LTL) and Chain-of-Thought (CoT), on the ARC, OpenBookQA, and SciQ datasets. The findings show that:
- Superior performance: In most configurations, STL matched or outperformed the standard baseline method (LTL), achieving an improvement of up to 11 percentage points.
- Reasonable operational cost: STL incurred only a slight increase in computational cost, including latency and GPU memory usage, compared with LTL.
- Statistical reliability: Sample-level McNemar’s tests (p < 0.05) confirmed that STL was statistically superior to LTL and highly competitive with CoT, which is known to be computationally expensive.
- High applicability: Finally, the research team demonstrated the robustness of STL in knowledge-intensive environments by integrating it with Retrieval-Augmented Generation (RAG). In this setting, the method achieved an accuracy of up to 81.06% on the combined ARC dataset using the Mistral 7B model, an increase of 9.36 percentage points over the initial no-context LTL baseline of 71.7%.
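To make the statistical comparison above concrete, a sample-level McNemar's test only looks at the discordant pairs, the items one method answers correctly and the other does not. The exact two-sided version can be computed with the standard library alone; the counts below are illustrative, not the paper's data.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pairs:
    b = items method A got right and method B got wrong; c = the reverse."""
    n = b + c
    k = min(b, c)
    # Under the null hypothesis the discordant pairs split 50/50,
    # so sum the binomial tail and double it for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: STL fixes 12 items LTL missed, LTL wins on only 3.
p = mcnemar_exact(3, 12)
print(round(p, 4))  # 0.0352, below the 0.05 threshold
```

A p-value below 0.05, as in the paper's sample-level tests, indicates the accuracy gap is unlikely to be chance disagreement on a handful of items.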
Dr. Vu Duc Ly – Lecturer at the School of Computing and Information Technology, EIU
Sharing more about this research, Dr. Vu Duc Ly said:
This study is an outcome that exceeded the research team’s expectations, marking a significant step forward in the application of AI in education. The paper was published in an open-access journal by Elsevier, currently ranked Q1 and leading in the field of Education, while also being among the top journals in Applied Computer Science and Artificial Intelligence according to SCImago.
Notably, this achievement is also the result of effective collaboration and connection between researchers from Eastern International University and Ho Chi Minh City University of Technology. This success not only affirms the team’s research capacity but also opens up expectations for further strengthening scientific collaboration in the future.
