
, Atanu Mukhopadhyay2, Santanu Mukhopadhyay3
1Department of Pediatric and Preventive Dentistry, Dr R Ahmed Dental College and Hospital, Kolkata, India
2Data and Analytics Manager, Amazon.com, New York, NY, USA
3Department of Dentistry, Malda Medical College and Hospital, Malda, India
© 2025 Yeungnam University College of Medicine, Yeungnam University Institute of Medical Science
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Conflicts of interest
No potential conflict of interest relevant to this article was reported.
Funding
None.
Author contributions
Conceptualization, Resources: all authors; Data curation: RB, AM; Formal analysis, Supervision: AM, SM; Software, Validation: AM; Investigation: SM; Methodology: RB; Writing-original draft: SM; Writing-review & editing: RB, AM.
Data availability statement
All study materials, including the fluoride-related MCQs (Supplementary Material 1), open-ended questions (Supplementary Material 2), and Python analysis code (Supplementary Material 3), are provided with this manuscript to ensure reproducibility. The raw LLM response dataset was deleted following data analysis and is no longer available.
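As a brief illustration of the kind of analysis the supplementary Python code performs, the sketch below computes per-model MCQ accuracy (as in Table 1) from a long-format graded response sheet. It is not the supplementary code itself; the file path and column names are hypothetical, since the raw response dataset was deleted.

```python
# Minimal sketch of per-model MCQ accuracy scoring; NOT the supplementary
# analysis code. The CSV path and column names (model, question_id, response,
# answer_key) are hypothetical, as the raw response dataset was deleted.
import pandas as pd

def accuracy_by_model(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # A response counts as correct when it matches the answer key,
    # ignoring case and surrounding whitespace.
    df["correct"] = (
        df["response"].str.strip().str.upper()
        == df["answer_key"].str.strip().str.upper()
    )
    out = df.groupby("model")["correct"].agg(correct="sum", total="count")
    out["accuracy_pct"] = (100 * out["correct"] / out["total"]).round(1)
    return out.reset_index()
```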
Table 1. Performance of the four large language models on the 50 fluoride-related multiple-choice questions

| Model | Correct | Incorrect | Total | Accuracy (%) |
|---|---|---|---|---|
| ChatGPT-4 | 44 | 6 | 50 | 88.0 |
| Claude 3.5 Sonnet | 47 | 3 | 50 | 94.0 |
| Copilot | 47 | 3 | 50 | 94.0 |
| Grok 3 | 46 | 4 | 50 | 92.0 |
Values are presented as number or percentage. ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA (applies to all tables).
Table 2. Rater scores for the large language model responses to the open-ended questions

| Model | Rater | Accuracy | Depth | Clarity | Evidence | Total |
|---|---|---|---|---|---|---|
| ChatGPT-4 | 1 | 48±4.2 | 43±4.8 | 47±4.8 | 40±4.7 | 177±9.5 |
| ChatGPT-4 | 2 | 49±3.2 | 45±5.3 | 47±4.8 | 40±4.7 | 181±7.4 |
| Copilot | 1 | 49±3.2 | 46±5.2 | 45±5.3 | 38±4.2 | 178±9.2 |
| Copilot | 2 | 48±4.2 | 42±4.2 | 44±5.2 | 40±0.0 | 174±9.7 |
| Grok 3 | 1 | 48±4.2 | 43±4.8 | 42±4.2 | 42±4.2 | 175±8.5 |
| Grok 3 | 2 | 47±4.8 | 44±5.2 | 45±5.3 | 41±3.2 | 177±13.4 |
| Claude 3.5 Sonnet | 1 | 49±3.2 | 45±5.3 | 49±3.2 | 43±4.8 | 183±6.7 |
| Claude 3.5 Sonnet | 2 | 50±0.0 | 46±5.2 | 49±3.2 | 38±4.2 | 188±9.2 |
Values are presented as mean±standard deviation. Each rater evaluated 10 questions; the maximum score for each category (accuracy, depth, clarity, and evidence) is 5, the maximum score per question is 20, and the maximum total score is 200.
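A sketch of how cells like these could be aggregated from a long-format rating sheet (one row per model, rater, and question) is shown below; the column layout, and the convention that each cell is a rater's category total with a dispersion estimate, are assumptions.

```python
# Sketch only: aggregate per-question rater scores into per-rater category
# summaries. The column layout (model, rater, accuracy, depth, clarity,
# evidence) and the sum-with-SD convention are assumptions about the
# deleted rating sheet.
import pandas as pd

def rater_summary(ratings: pd.DataFrame) -> pd.DataFrame:
    cats = ["accuracy", "depth", "clarity", "evidence"]
    ratings = ratings.copy()
    ratings["total"] = ratings[cats].sum(axis=1)   # max 20 per question
    grouped = ratings.groupby(["model", "rater"])[cats + ["total"]]
    return grouped.agg(["sum", "std"]).round(1)    # total and spread per cell
```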
Table 3. Interrater agreement (Cohen’s kappa) between the two raters for each model

| Model | Cohen’s kappa | Agreement level |
|---|---|---|
| ChatGPT-4 | 0.231 | Fair agreement |
| Claude 3.5 Sonnet | 0.104 | Slight agreement |
| Copilot | 0.315 | Fair agreement |
| Grok 3 | 0.259 | Fair agreement |
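For reference, kappa statistics like those above can be computed with scikit-learn's cohen_kappa_score; the two score vectors below are illustrative placeholders, not study data, and the descriptive labels follow the Landis and Koch (1977) bands.

```python
# Interrater agreement sketch using scikit-learn; the score vectors are
# illustrative placeholders, not study data.
from sklearn.metrics import cohen_kappa_score

def agreement_label(kappa: float) -> str:
    # Landis & Koch (1977) descriptive bands; kappa < 0 is "poor".
    if kappa < 0:
        return "Poor agreement"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost perfect")]
    return next(label for cut, label in bands if kappa <= cut) + " agreement"

rater1 = [5, 4, 5, 3, 4, 5, 4, 4, 5, 3]  # placeholder per-question scores
rater2 = [5, 4, 4, 3, 5, 5, 4, 3, 5, 4]
kappa = cohen_kappa_score(rater1, rater2)
print(f"kappa={kappa:.3f}: {agreement_label(kappa)}")
```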
Table 4. Spearman’s rank correlation between the two raters’ scores for each model

| Model | Spearman’s rank correlation coefficient | p-value |
|---|---|---|
| ChatGPT-4 | 0.653 | 0.040 |
| Claude 3.5 Sonnet | 0.469 | 0.171 |
| Copilot | 0.862 | <0.001 |
| Grok 3 | 0.871 | <0.001 |
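This correlation analysis maps directly onto SciPy's spearmanr; each model contributes the two raters' per-question totals (n=10 questions), so the placeholder vectors below have length 10.

```python
# Spearman rank-correlation sketch via SciPy; the totals are illustrative
# placeholders (each rater scored 10 questions, max 20 per question).
from scipy.stats import spearmanr

rater1_totals = [18, 17, 20, 15, 16, 19, 18, 17, 20, 14]
rater2_totals = [17, 17, 19, 14, 17, 19, 18, 16, 20, 15]
rho, p = spearmanr(rater1_totals, rater2_totals)
print(f"rho={rho:.3f}, p={p:.3f}")
```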
Table 5. Comparison of rater scores across the four models (ANOVA and Kruskal-Wallis tests)

| Metric | ANOVA (p-value) | Kruskal-Wallis (p-value) | Interpretation |
|---|---|---|---|
| Accuracy | 0.867 | 0.858 | Not statistically significant |
| Depth | 0.456 | 0.441 | Not statistically significant |
| Clarity | 0.009 | 0.014 | Statistically significant |
| Evidence | 0.156 | 0.158 | Not statistically significant |
| Total | 0.215 | 0.219 | Not statistically significant |
ANOVA, analysis of variance.
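Both tests above are single calls in SciPy: a one-way ANOVA across the four models and its nonparametric counterpart, the Kruskal-Wallis H test. The per-model score lists below are placeholders illustrating the call pattern for one metric.

```python
# Between-model comparison sketch for one metric (e.g., clarity) via SciPy;
# the four per-question score lists are illustrative placeholders.
from scipy.stats import f_oneway, kruskal

chatgpt4 = [5, 4, 5, 4, 5, 5, 4, 5, 5, 4]
claude   = [5, 5, 5, 4, 5, 5, 5, 4, 5, 5]
copilot  = [5, 4, 4, 5, 5, 4, 5, 5, 4, 5]
grok3    = [4, 5, 5, 4, 5, 5, 4, 5, 5, 4]

f_stat, p_anova = f_oneway(chatgpt4, claude, copilot, grok3)
h_stat, p_kw = kruskal(chatgpt4, claude, copilot, grok3)
print(f"ANOVA p={p_anova:.3f}; Kruskal-Wallis p={p_kw:.3f}")
```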
Table 6. Questions answered incorrectly by each model

| Model | Incorrect questions | Total number of errors | Accuracy (%) |
|---|---|---|---|
| ChatGPT-4 | Q2, Q19, Q33, Q37, Q43, Q50 | 6 | 88 |
| Claude 3.5 Sonnet | Q2, Q33, Q41 | 3 | 94 |
| Copilot | Q15, Q28, Q32 | 3 | 94 |
| Grok 3 | Q15, Q28, Q32, Q33 | 4 | 92 |
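The error lists in Table 6 also show which items tripped up multiple models (e.g., Q33 was missed by three of the four). A small sketch using only the question IDs reported above:

```python
# Overlap of incorrectly answered questions across models; question IDs are
# taken directly from Table 6.
from itertools import combinations

errors = {
    "ChatGPT-4": {2, 19, 33, 37, 43, 50},
    "Claude 3.5 Sonnet": {2, 33, 41},
    "Copilot": {15, 28, 32},
    "Grok 3": {15, 28, 32, 33},
}
for a, b in combinations(errors, 2):
    shared = sorted(errors[a] & errors[b])
    if shared:
        print(f"{a} & {b}: " + ", ".join(f"Q{q}" for q in shared))
```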