JYMS : Journal of Yeungnam Medical Science
Original article
Dentistry
Performance of large language models in fluoride-related dental knowledge: a comparative evaluation study of ChatGPT-4, Claude 3.5 Sonnet, Copilot, and Grok 3
Raju Biswas1, Atanu Mukhopadhyay2, Santanu Mukhopadhyay3
Journal of Yeungnam Medical Science 2025;42:53.
DOI: https://doi.org/10.12701/jyms.2025.42.53
Published online: September 1, 2025

1Department of Pediatric and Preventive Dentistry, Dr R Ahmed Dental College and Hospital, Kolkata, India

2Data and Analytics Manager, Amazon.com, New York, NY, USA

3Department of Dentistry, Malda Medical College and Hospital, Malda, India

Corresponding author: Santanu Mukhopadhyay, MDS Department of Dentistry, Malda Medical College and Hospital, Uma Roy Sarani, Malda 732101, West Bengal, India Tel: +91-3512-221087 • E-mail: msantanu25@gmail.com
• Received: July 14, 2025   • Revised: August 23, 2025   • Accepted: August 27, 2025

© 2025 Yeungnam University College of Medicine, Yeungnam University Institute of Medical Science

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Background
    Large language models (LLMs) are increasingly used in medical and dental education to enhance clinical reasoning, patient communication, and academic learning. This study evaluates the effectiveness of four advanced LLMs—ChatGPT-4 (OpenAI), Claude 3.5 Sonnet (Anthropic), Microsoft Copilot, and Grok 3 (xAI)—in conveying fluoride-related dental knowledge.
  • Methods
    A cross-sectional comparative study was conducted using a mixed-methods approach. Each LLM answered 50 multiple-choice questions (MCQs) and 10 open-ended questions on fluoride chemistry, clinical applications, and safety concerns. Two blinded experts rated the open-ended responses on accuracy, depth, clarity, and evidence. Interrater reliability was assessed using Cohen’s kappa and Spearman’s correlation, and statistical analyses were performed using analysis of variance, Kruskal-Wallis, and post-hoc tests.
  • Results
    All models showed high MCQ accuracy (88%–94%). Claude 3.5 Sonnet achieved the highest scores in open-ended responses, especially for clarity (p=0.009). Minor differences in accuracy, depth, and evidence were not statistically significant. Overall, all LLMs performed strongly, with high interrater agreement supporting result reliability.
  • Conclusion
    Advanced LLMs show strong potential as supportive tools in dental education and patient communication on fluoride use. Claude 3.5 Sonnet demonstrated superior linguistic clarity, enhancing its educational value. Continued evaluation and clinical oversight are crucial for their safe and effective integration into dentistry.
Large language models (LLMs) based on deep learning and natural language processing technologies are emerging as valuable tools in healthcare. They can assist professionals in clinical decision-making by offering rapid access to extensive medical knowledge and current research findings [1-3]. Additionally, LLMs have the potential to be virtual teaching assistants capable of delivering personalized learning experiences [4-7]. Within the field of dentistry, LLMs may serve both as educational resources for students and tools to enhance patient understanding. Their ability to process large volumes of text and generate coherent, human-like responses makes them promising assets for dental education and clinical support [8-10].
According to the World Health Organization, untreated caries of primary teeth affects an estimated 43% of children worldwide, underscoring the urgent need for effective oral health interventions [11]. Fluoride remains a cornerstone of modern dental care and is recognized for its role in strengthening enamel and preventing dental caries. It functions through mechanisms that promote enamel remineralization and inhibit bacterial activity. Fluoride is commonly used in public health via water fluoridation, toothpaste, and mouth rinses. However, its use must be carefully managed, as excessive exposure can lead to dental fluorosis, manifesting as discoloration or pitting of teeth, which has raised public concerns regarding safety [12,13]. Therefore, a comprehensive understanding of the chemical properties, applications, and potential risks of fluoride is essential for clinicians and the public.
Despite the growing interest in LLMs for healthcare applications, a significant research gap remains regarding their performance in specialized dental topics, particularly fluoride-related knowledge. Although previous studies evaluated LLMs in general medical contexts, limited research has specifically examined the accuracy, depth of understanding, and clarity of LLMs in addressing complex fluoride chemistry, clinical applications, and safety considerations in dentistry. Furthermore, comparative analyses between different LLM platforms on domain-specific dental content are scarce, leaving educators and clinicians uncertain about which models might be the most suitable for educational or clinical support purposes.
If accurate and reliable, LLMs can complement dental education with evidence-based information on topics such as fluoride [14]. They also have the potential to address public concerns by offering scientifically supported explanations, thereby promoting trust and informed use.
This study aimed to evaluate the performance of four advanced LLMs—ChatGPT-4 (OpenAI), Claude 3.5 Sonnet (Anthropic), Microsoft Copilot, and Grok 3 (xAI)—in delivering fluoride-related content. Using a mixed methods approach involving multiple-choice questions (MCQs) and open-ended questions, we assessed each model's accuracy, depth, clarity, and use of evidence. The objective of this study was to explore the usefulness of LLMs in dental education to improve public understanding and support evidence-based clinical practice.
Ethics statement: This study did not involve human participants, identifiable data, or clinical interventions. Therefore, Institutional Review Board approval was not required.
1. Study design
This cross-sectional comparative study evaluated the performance of four LLMs on fluoride-related dental knowledge using a mixed methods framework. This study employed both MCQs and open-ended questions to assess factual knowledge, conceptual understanding, and analytical reasoning related to fluoride use in dentistry.
2. Language models evaluated
Four prominent LLMs were selected for evaluation: ChatGPT-4 (OpenAI, San Francisco, CA, USA), Claude 3.5 Sonnet (Anthropic, San Francisco, CA, USA), Microsoft Copilot (Microsoft, Redmond, WA, USA), and Grok 3 (xAI, San Francisco, CA, USA). The evaluation was conducted between June 4 and June 22, 2025, with all models accessed through their standard public interfaces using default configurations to ensure consistency in the testing conditions.
3. Question development and validation
Fifty MCQs were developed by experts based on the current literature and clinical guidelines to cover the core aspects of fluoride science and dental applications, including factual content, mechanistic understanding, and clinical principles (Supplementary Material 1). LLMs were tested using the original validated questions, without prompt engineering or added context, to evaluate their baseline performance on clinical queries. These questions were validated by a panel of three dental experts specializing in pediatric dentistry to ensure scientific accuracy, clinical relevance, and alignment with current standards in dental education.
Additionally, 10 open-ended questions were crafted to probe deeper analytical and explanatory abilities, focusing on the complex aspects of fluoride application and dental health (Supplementary Material 2). Both MCQs and open-ended prompts were reviewed and refined by dental professionals to ensure scientific rigor and suitability for evaluating LLM performance.
4. Evaluation framework

1) Scoring system for open-ended questions

Responses were evaluated using a structured four-dimensional rubric, with each criterion scored from 0 to 5 points (maximum score, 20 points per response). The four criteria were as follows: accuracy, referring to the factual correctness of scientific content; depth, referring to the breadth and detail of explanation; clarity, referring to coherence, structure, and readability; and evidence, referring to the use of supporting mechanisms or research-based reasoning.
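
As a concrete illustration of the rubric, the following minimal Python sketch (not part of the study materials; the scores shown are hypothetical) shows how one rater's four criterion scores combine into a per-response total.

```python
# Minimal sketch of the four-criterion rubric described above (assumed data structure,
# not the authors' scoring software). Each criterion is scored 0-5; maximum 20 per response.
RUBRIC_CRITERIA = ("accuracy", "depth", "clarity", "evidence")

def total_response_score(ratings: dict) -> int:
    """Sum one rater's 0-5 criterion scores for a single open-ended response."""
    assert set(ratings) == set(RUBRIC_CRITERIA), "all four criteria must be scored"
    assert all(0 <= v <= 5 for v in ratings.values()), "scores must lie in 0-5"
    return sum(ratings.values())

# Hypothetical example: a response scoring 5, 4, 5, 4 totals 18 of 20 points.
print(total_response_score({"accuracy": 5, "depth": 4, "clarity": 5, "evidence": 4}))
```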

2) Rater selection and training

Two independent raters with expertise in dental science and fluoride research scored all the open-ended responses. The raters were blinded to the identity of the language model during the evaluation. Prior to scoring, they underwent calibration sessions using sample responses to ensure the consistent application of the rubric. Discrepancies in scoring were resolved through discussion and consensus.

3) Interrater reliability assessment

Given the inherent subjectivity of evaluating open-ended qualitative responses, interrater reliability was assessed using two complementary measures: Cohen’s kappa coefficient for exact score agreement and Spearman’s rank correlation coefficient for consistency in the relative model performance rankings. Cohen’s kappa provides insight into absolute scoring consistency, whereas Spearman’s correlation evaluates whether raters maintain consistent relative judgments of model performance across different responses, which is particularly relevant for comparative studies of this nature.
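
A minimal sketch of how these two complementary measures could be computed is given below; the rater score lists are invented for illustration, and the scikit-learn and SciPy functions are standard implementations rather than the authors' own code.

```python
# Illustrative computation of the two agreement measures described above.
# Rater totals (0-20 per response, 10 responses) are hypothetical.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

rater1 = [18, 17, 19, 16, 18, 20, 17, 18, 19, 16]
rater2 = [17, 17, 18, 16, 19, 20, 18, 18, 18, 17]

kappa = cohen_kappa_score(rater1, rater2)   # exact-score agreement (scores treated as categories)
rho, p = spearmanr(rater1, rater2)          # consistency of relative rankings

print(f"Cohen's kappa = {kappa:.3f}; Spearman's rho = {rho:.3f} (p = {p:.3f})")
```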

4) Data collection procedure

Each LLM was presented with an identical set of questions, and each question was asked only once to avoid memory or learning effects. Responses were collected and anonymized prior to the evaluation. The MCQ responses were scored as correct (1) or incorrect (0), whereas the open-ended responses were rated independently by both raters using a standardized rubric.
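
For the MCQ arm, scoring reduces to binary marking and a percentage, as in the brief sketch below (the answer letters shown are invented placeholders, not study items).

```python
# Sketch of MCQ scoring as described: 1 for a correct response, 0 otherwise,
# with accuracy reported as the percentage of correct responses.
def mcq_accuracy(model_answers, answer_key):
    scores = [1 if given == correct else 0 for given, correct in zip(model_answers, answer_key)]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical 4-item example: 3 of 4 correct -> 75.0%
print(mcq_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))
```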

5) Mode-of-failure analysis

In addition to quantitative performance metrics, we conducted a systematic analysis of error patterns to identify common misconceptions and failure modes across the LLMs.
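One simple way to operationalize such an error-pattern analysis is to tally how many models missed each item; the sketch below does so using the error sets later reported in Table 6 (the tallying code itself is illustrative, not the authors' script).

```python
# Count how many models missed each MCQ and flag items missed by more than one model.
# Error sets are taken from Table 6 of this article; the code is an illustrative sketch.
from collections import Counter

errors = {
    "ChatGPT-4": {"Q2", "Q19", "Q33", "Q37", "Q43", "Q50"},
    "Claude 3.5 Sonnet": {"Q2", "Q33", "Q41"},
    "Copilot": {"Q15", "Q28", "Q32"},
    "Grok 3": {"Q15", "Q28", "Q32", "Q33"},
}

miss_counts = Counter(q for missed in errors.values() for q in missed)
shared_misses = {q: n for q, n in miss_counts.items() if n > 1}
print(shared_misses)  # Q33 missed by 3 models; Q2, Q15, Q28, Q32 each missed by 2
```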
5. Statistical analysis
Descriptive statistics (mean scores with standard deviations) were calculated for all the models and scoring dimensions. MCQ accuracy was reported as the percentage of correct responses. The normality of data distribution was assessed using the Shapiro-Wilk test to determine the appropriate statistical approach for subsequent analyses. Interrater agreement was assessed using Cohen’s kappa for exact score matching and Spearman’s rank correlation for comparative consistency in the model rankings. Cohen’s kappa values were interpreted according to established guidelines. Comparisons between models were conducted using one-way analysis of variance (ANOVA) for normally distributed data and the Kruskal-Wallis test for nonparametric distributions, with post-hoc comparisons performed using Tukey’s honest significant difference (HSD) and Dunn’s tests, respectively. Interrater agreement plots were generated using Python (ver. 3.10; Python Software Foundation, Wilmington, DE, USA) and the Matplotlib library (ver. 3.6.2; Matplotlib Development Team) (Supplementary Material 3). The alpha level for significance was set at p<0.05.
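
The model-comparison workflow described above can be sketched as follows. The per-dimension scores are invented, SciPy supplies the Shapiro-Wilk, ANOVA, Kruskal-Wallis, and Tukey HSD routines, and the third-party scikit-posthocs package is assumed for Dunn's test; the authors' own code (for the interrater agreement plots) is in Supplementary Material 3, so this sketch is illustrative only.

```python
# Illustrative sketch of the model-comparison workflow described above (not the authors' code).
# Scores for one dimension (e.g., clarity, 0-5 per question) are invented for four models.
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # assumed third-party package providing Dunn's post-hoc test

scores = {
    "ChatGPT-4":         np.array([5, 4, 5, 4, 5, 5, 4, 5, 5, 5]),
    "Claude 3.5 Sonnet": np.array([5, 5, 5, 5, 5, 4, 5, 5, 5, 5]),
    "Copilot":           np.array([4, 5, 4, 4, 5, 5, 4, 4, 5, 5]),
    "Grok 3":            np.array([4, 4, 4, 5, 4, 4, 5, 4, 4, 4]),
}
groups = list(scores.values())

# Shapiro-Wilk on each group decides between the parametric and nonparametric branch.
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)

if normal:
    omnibus_p = stats.f_oneway(*groups).pvalue                 # one-way ANOVA
    posthoc = stats.tukey_hsd(*groups)                          # Tukey's HSD pairwise comparisons
else:
    omnibus_p = stats.kruskal(*groups).pvalue                   # Kruskal-Wallis test
    posthoc = sp.posthoc_dunn(groups, p_adjust="bonferroni")    # Dunn's test with correction

print(f"Omnibus p-value: {omnibus_p:.4f}")
print(posthoc)
```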
1. Multiple-choice question performance
All four models showed high accuracy for the 50 MCQs, with performances ranging from 88% to 94% (Table 1, Fig. 1). Claude 3.5 Sonnet and Copilot had the highest accuracy at 94%, followed by Grok 3 at 92% and ChatGPT-4 at 88%.
2. Open-ended question performance

1) Overall mean scores

Claude 3.5 Sonnet achieved the highest total average scores from both raters (183/200 and 188/200, calculated from 10 open-ended responses) (Table 2, Fig. 2).
In summary, all models performed strongly across the evaluation dimensions, with Claude 3.5 Sonnet achieving the highest overall scores for the open-ended responses.

2) Interrater agreement

(1) Cohen’s kappa

Cohen’s kappa values ranged from 0.104 to 0.315, indicating slight to fair agreement between the raters on exact scores (Table 3). Copilot achieved the highest kappa (0.315, fair agreement), while Claude 3.5 Sonnet had the lowest (0.104, slight agreement). This modest level of agreement likely reflects the subjective nature of evaluating open-ended answers for clarity, depth, and evidence: despite calibration, the raters may have interpreted subtle differences in the responses differently, especially in borderline cases. Although this affects the precision of the exact scores, it does not undermine the overall findings.
In summary, Cohen’s kappa values indicated only slight to fair agreement between raters on exact scores, owing to this inherent subjectivity, whereas Spearman’s rank correlations, which were strongest for Copilot and Grok 3, indicated largely consistent relative model rankings.

(2) Spearman’s rank correlation

Spearman’s rank correlations were strong for Copilot (ρ=0.862, p<0.001) and Grok 3 (ρ=0.871, p<0.001), moderate for ChatGPT-4 (ρ=0.653, p=0.040), and weakest, and not statistically significant, for Claude 3.5 Sonnet (ρ=0.469, p=0.171), indicating high interrater agreement on relative rankings for the former two models (Table 4).

(3) Statistical comparison between models

No statistically significant differences were observed across models in terms of accuracy, depth, evidence, or total scores (Table 5). The only exception was the clarity dimension, which showed significant differences (ANOVA, p=0.009; Kruskal-Wallis, p=0.014). Post-hoc Tukey’s HSD tests revealed that Claude 3.5 Sonnet significantly outperformed Grok 3 (p=0.0009) and Copilot (p=0.0029) in clarity. When comparing the total scores across all models, the differences were not statistically significant (ANOVA, p=0.215; Kruskal-Wallis, p=0.219).
In summary, there were no significant differences among the models in terms of accuracy, depth, evidence, or total scores. However, Claude 3.5 Sonnet significantly outperformed Grok 3 and Copilot in the clarity dimension.

(4) Mode-of-failure analysis

To identify specific limitations, we analyzed the most frequently missed questions across all four LLMs (Table 6). ChatGPT-4 missed six questions, Claude 3.5 Sonnet missed three, Grok 3 missed four, and Copilot missed three. Several questions were missed by multiple models, highlighting conceptual challenges. For example, both ChatGPT-4 and Claude 3.5 Sonnet confused stannous and sodium fluoride (Q2), whereas three models struggled with populations benefiting the least from water fluoridation (Q33). Other common errors included distinguishing between pre- and post-eruptive fluoride action (Q50) and identifying compounds in silver diamine fluoride (Q15). These findings show that even top-performing LLMs have difficulty with complex fluoride topics that require detailed chemical and public health understanding.
In summary, specific conceptual challenges were identified where multiple models made errors, revealing limitations in nuanced fluoride topics, such as compound roles, population health interpretation, and fluoride action timing.
This evaluation compared the performance of four advanced LLMs—ChatGPT-4, Claude 3.5 Sonnet, Microsoft Copilot, and Grok 3—on both MCQs and open-ended questions related to fluoride use in dentistry. This analysis yielded several important results. First, all models demonstrated strong performance on MCQs, with accuracy rates ranging from 88% to 94%. Claude 3.5 Sonnet and Copilot achieved the highest scores in this category, suggesting their strong ability to retrieve, recognize, and correctly select factual dental knowledge. These results align with prior studies that identified newer LLMs as increasingly competitive in standardized question formats [15-20].
In the open-ended responses, all four models achieved high mean scores across the four core evaluation dimensions: accuracy, depth, clarity, and evidence. Claude 3.5 Sonnet consistently received the highest total scores from both expert raters, reflecting its strength in generating coherent, logically structured, detailed, and well-supported responses. While the total score differences across the models were not statistically significant, clarity was the only dimension for which significant variation was observed. Post-hoc analysis confirmed that Claude 3.5 Sonnet significantly outperformed both Grok 3 and Copilot in terms of clarity, suggesting that it may have a distinctive advantage in expressing complex dental topics with fluency, precision, and coherence. This finding highlights the importance of not only factual content but also how effectively information is communicated in educational and clinical contexts.
This finding is consistent with previously published studies. For instance, Salman et al. [16] found that ChatGPT-4 outperformed Copilot and Gemini (Google DeepMind, Mountain View, CA, USA) in advanced pharmacology questions, whereas the performance of Copilot declined moderately on short-answer formats. The broader implication is that while newer LLMs, such as Claude 3.5 Sonnet, may show improved clarity and coherence, their performance on complex topics still varies by model architecture and training depth.
Similarly, Aldukhail [17] highlighted the differences in educational utility across models, observing that ChatGPT outperformed Bard (Google LLC, Mountain View, CA, USA) in most dental education tasks, particularly in crafting comprehensive exercises and explanations. This supports our finding that some LLMs, such as Claude 3.5 Sonnet, may be better suited for educational content generation, particularly where clarity and depth are essential.
The slight to fair interrater agreement (Cohen’s kappa, 0.104–0.315) represents a methodological limitation that requires careful consideration and highlights a broader challenge in LLM evaluation research: the difficulty of achieving consistent subjective assessments of artificial intelligence (AI)-generated content quality. The relatively narrow performance range among these advanced models (most scores of 4–5 out of 5) may have exacerbated rater disagreement, as subtle qualitative differences become more influential in scoring decisions. Although our raters underwent training and calibration with a standardized rubric, the limited kappa values likely reflect variation in how they interpreted subtle differences in tone, depth, and evidence citation across model responses, underscoring that the subjective evaluation of complex explanatory AI-generated content remains a fundamental challenge requiring improved methodological approaches.
However, Spearman’s rank correlation coefficients (ρ>0.86, p<0.001 for Copilot and Grok 3) revealed a stronger alignment between the raters in the relative ranking of model performance. Copilot and Grok 3, in particular, demonstrated strong interrater rank-order consistency, indicating that while raters may differ in assigning precise numeric scores, they largely agree on the relative quality of responses across models. This consistency supports the reliability of the comparative insights derived from the scoring framework.
Such interrater consistency in relative rankings has also been documented by Tussie and Starosta [19], who tested models on national dental board questions and reported Claude 3.5 Sonnet and ChatGPT-4 as consistent top performers. Their findings emphasize that performance evaluation should consider both absolute accuracy and model ranking stability, a point supported by our statistical observations.
Importantly, the lack of statistically significant differences across most evaluation dimensions, combined with the nonsignificant result in the total score comparison (ANOVA, p=0.215), suggests that all four models are competitively strong performers within the context of fluoride-related dental knowledge. Nonetheless, the statistically significant advantage observed for Claude 3.5 Sonnet in clarity may reflect a refinement in its language generation architecture or training, giving it an edge in real-world applications that require patient-friendly communication or educational resource development.
Giannakopoulos et al. [20] evaluated the performance of several LLMs in answering clinical orthodontic questions and found that while the models were generally capable of producing structured answers, their responses occasionally lacked scientific rigor, clarity, or comprehensiveness, mirroring the qualitative variation observed in our open-ended data. These results are consistent with those of previous studies that evaluated LLM performance in dental education and communication. Künzle and Paris [21] observed a similar trend in the restorative and endodontic contexts, where only advanced versions, such as ChatGPT-4, achieved a suitable performance for academic dentistry. Their analysis supports the idea that only top-tier LLMs can consistently handle complex educational tasks, a conclusion mirrored in our study findings on clarity. Similarly, Buldur and Sezer [22] compared ChatGPT responses to fluoride-related questions with official guidance from the American Dental Association and found a high degree of similarity in both content and language, suggesting the utility of ChatGPT in patient education and public health messaging. However, they also noted limitations in ChatGPT’s clinical reasoning depth, reinforcing that such models may be best suited as adjuncts rather than as stand-alone tools. Yilmaz et al. [23] also indicated that LLMs show promise as supplementary educational tools.
The mode-of-failure analysis provides critical insights regarding LLM integration into dental practice. The failure of three of the four models on Q33 (milk fluoridation) indicates a systematic gap in training data on alternative public health interventions. This finding is particularly concerning for settings where water fluoridation may not be feasible. Model-specific error patterns suggest different training emphases: ChatGPT-4 struggled with quantitative pharmacology, whereas Grok 3 and Copilot both failed on pediatric guidelines, indicating potential shared training limitations. Most importantly, the identification of high-risk errors in toxicology and public health planning provides specific benchmarks for evaluating future LLM improvements in dental applications.
Furthermore, Dermata et al. [18] observed that variations in chatbot accuracy, response quality, and content integrity across different LLMs are likely attributable to differences in training data, model algorithms, feedback loops, and update frequency. This observation is echoed in our findings, as Claude 3.5 Sonnet consistently produced more polished and fluent responses, whereas the other models demonstrated more variability. Understanding the model-specific factors driving these differences remains a valuable area for future research and optimization, particularly for high-stakes applications in clinical environments.
1. Limitations
Although this study provides valuable insights into the current performance of state-of-the-art LLMs in dental domains, several limitations should be acknowledged. Our mode-of-failure analysis emphasizes that even high-performing LLMs demonstrate recurring weaknesses in specific fluoride topics. Several models struggled with questions on compound-specific usage (e.g., silver diamine fluoride), differentiating pre- versus post-eruptive effects, and identifying the populations that benefit least from fluoridation. These patterns reveal that conceptual and interpretive limitations persist, particularly for nuanced or quantitative clinical knowledge; identifying these weaknesses provides useful benchmarks for future model refinement and training optimization.
The study design relied on a fixed set of fluoride-related questions developed and validated by dental experts. Although this ensured clinical relevance and content alignment, the results may not fully reflect the diversity and unpredictability of real-world user prompts and follow-up interactions. In addition, although guided by a standardized rubric and trained raters, the open-ended scoring process is inherently subjective and may introduce bias.
Each model was queried only once per question to standardize the responses and avoid variability from repeated prompts. Although LLMs do not learn within a single dialogue, this approach ensured consistent evaluation conditions across models. However, LLMs exhibit inherent randomness in response generation owing to their probabilistic nature, and multiple runs per question would provide more robust performance estimates by capturing response variability and allowing the calculation of confidence intervals around performance metrics.
The statistical analysis treated the model responses as independent groups using ANOVA/Kruskal-Wallis tests. However, because all four models answered identical questions, the data had a paired structure with inherent within-question correlations. Methods designed for paired categorical data, such as Cochran’s Q test with post-hoc McNemar tests, would therefore be more statistically appropriate for comparing model accuracy rates, and future studies should consider these paired-data approaches (an illustrative sketch is provided at the end of this section).
Moreover, LLMs are frequently fine-tuned, retrained, and updated through their application programming interfaces, so their performance, accuracy, and response styles can shift over time. Consequently, the findings of LLM evaluations may not be fully reproducible or generalizable once the underlying models are modified, which highlights the need for ongoing assessment and transparent reporting of the model versions used in research. In addition, the lack of source attribution in model responses restricts traceability and may raise concerns in contexts where evidence-based citations are critical, such as clinical decision-making or formal education.
Finally, Rokhshad et al. [24] observed that chatbots currently underperform compared with dentists, and Tokgöz Kaplan and Cankar [25] emphasized that only through additional research, clinical validation, and ongoing model refinement can LLMs be responsibly integrated into dental practice. Torres-Zegarra et al. [26] further noted that the educational potential of chatbots deserves focus beyond their examination performance.
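To make the suggested paired-data alternative concrete, the following sketch applies Cochran's Q and a pairwise McNemar test to a small invented correctness matrix using statsmodels; it illustrates the proposed approach and is not an analysis performed in this study.

```python
# Cochran's Q on a binary correctness matrix (rows = questions, columns = models),
# followed by one pairwise McNemar test. The 6x4 matrix below is invented for illustration.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

correct = np.array([
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
])

q_res = cochrans_q(correct)
print(f"Cochran's Q = {q_res.statistic:.3f}, p = {q_res.pvalue:.3f}")

# Pairwise follow-up between the first two columns (two hypothetical models):
a, b = correct[:, 0], correct[:, 1]
table = [[int(np.sum((a == 1) & (b == 1))), int(np.sum((a == 1) & (b == 0)))],
         [int(np.sum((a == 0) & (b == 1))), int(np.sum((a == 0) & (b == 0)))]]
print(f"McNemar p = {mcnemar(table, exact=True).pvalue:.3f}")
```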
2. Practical implications for dental education and patient communication
Despite these limitations, the present study has several practical implications. The strong performances of ChatGPT-4, Claude 3.5 Sonnet, Microsoft Copilot, and Grok 3 across both objective and subjective assessment tasks suggest that LLMs have substantial potential as supplementary tools in dental education, patient engagement, and clinical support. In academic settings, they can be used to develop quizzes, explain complex concepts, and serve as virtual tutors. In public health communication, they may help generate accurate and accessible information on fluoride use, reduce misinformation, and improve community oral health literacy.
This study found that LLMs, especially Claude 3.5 Sonnet, performed well in delivering fluoride-related dental knowledge. All models showed high accuracy, but Claude 3.5 Sonnet stood out for its clear and structured explanations. These results suggest that LLMs are helpful tools in dental education and patient communication. However, expert oversight is necessary to ensure safe and reliable use, and further research is required to explore real-world applications in clinical settings.
Supplementary Materials 1–3 can be found at https://doi.org/10.12701/jyms.2025.42.53.
Supplementary Material 1.
jyms-2025-42-53-Supplementary-Material-1.pdf
Supplementary Material 2.
jyms-2025-42-53-Supplementary-Material-2.pdf
Supplementary Material 3.
jyms-2025-42-53-Supplementary-Material-3.pdf

Conflicts of interest

No potential conflict of interest relevant to this article was reported.

Funding

None.

Author contributions

Conceptualization, Resources: all authors; Data curation: RB, AM; Formal analysis, Supervision: AM, SM; Software, Validation: AM; Investigation: SM; Methodology: RB; Writing-original draft: SM; Writing-review & editing: RB, AM.

Data availability statement

All study materials, including the fluoride-related MCQs (Supplementary Material 1), open-ended questions (Supplementary Material 2), and Python analysis code (Supplementary Material 3), are provided with this manuscript to ensure reproducibility. The raw LLM response dataset was deleted following data analysis and is no longer available.

Fig. 1.
Multiple-choice question accuracy of large language models in fluoride-related dental knowledge (accuracy computed as Correct/Total×100). ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA.
jyms-2025-42-53f1.jpg
Fig. 2.
Total performance scores for open-ended questions across 10 questions (maximum of 200). ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA.
jyms-2025-42-53f2.jpg
Table 1.
Multiple-choice question performance
Model Correct Incorrect Total Accuracy (%)
ChatGPT-4 44 6 50 88.0
Claude 3.5 Sonnet 47 3 50 94.0
Copilot 47 3 50 94.0
Grok 3 46 4 50 92.0

ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA.

Table 2.
Total performance scores for open-ended questions (out of 200)
Model Rater Accuracy Depth Clarity Evidence Total
ChatGPT-4 1 48±4.2 43±4.8 47±4.8 40±4.7 177±9.5
2 49±3.2 45±5.3 47±4.8 40±4.7 181±7.4
Copilot 1 49±3.2 46±5.2 45±5.3 38±4.2 178±9.2
2 48±4.2 42±4.2 44±5.2 40±0.0 174±9.7
Grok 3 1 48±4.2 43±4.8 42±4.2 42±4.2 175±8.5
2 47±4.8 44±5.2 45±5.3 41±3.2 177±13.4
Claude 3.5 Sonnet 1 49±3.2 45±5.3 49±3.2 43±4.8 183±6.7
2 50±0.0 46±5.2 49±3.2 38±4.2 188±9.2

Values are presented as mean±standard deviation.

Each rater evaluated 10 questions; the maximum score per criterion (accuracy, depth, clarity, evidence) is 5, the maximum score per question is 20, and the maximum total score across the 10 questions is 200.

ChatGPT-4: OpenAI, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA.

Table 3.
Interrater agreement (Cohen’s kappa)
Model Cohen’s kappa Agreement level
ChatGPT-4 0.231 Fair agreement
Claude 3.5 Sonnet 0.104 Slight agreement
Copilot 0.315 Fair agreement
Grok 3 0.259 Fair agreement

ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA.

Table 4.
Interrater agreement (Spearman’s rank correlation)
Model Spearman’s rank correlation coefficient p-value
ChatGPT-4 0.653 0.040
Claude 3.5 Sonnet 0.469 0.171
Copilot 0.862 <0.001
Grok 3 0.871 <0.001

ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA.

Table 5.
Statistical comparison of model performance
Metric ANOVA (p-value) Kruskal-Wallis (p-value) Interpretation
Accuracy 0.867 0.858 Statistically not significant
Depth 0.456 0.441 Statistically not significant
Clarity 0.009 0.014 Statistically significant
Evidence 0.156 0.158 Statistically not significant
Total 0.215 0.219 Statistically not significant

ANOVA, analysis of variance.

Table 6.
Mode-of-failure analysis: multiple-choice question error distribution by model
Model Incorrect question Total number of errors Accuracy (%)
ChatGPT-4 Q2, Q19, Q33, Q37, Q43, Q50 6 88
Claude 3.5 Sonnet Q2, Q33, Q41 3 94
Copilot Q15, Q28, Q32 3 94
Grok 3 Q15, Q28, Q32, Q33 4 92

ChatGPT-4: OpenAI, San Francisco, CA, USA; Claude 3.5 Sonnet: Anthropic, San Francisco, CA, USA; Copilot: Microsoft, Redmond, WA, USA; Grok 3: xAI, San Francisco, CA, USA.

  • 1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DS. Large language models in medicine. Nat Med 2023;29:1930–40.
  • 2. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, et al. Towards expert-level medical question answering with large language models [Internet]. arXiv; 2023 [cited 2025 Jun 27]. https://doi.org/10.48550/arXiv.2305.09617
  • 3. Maity S, Saikia MJ. Large language models in healthcare and medical applications: a review. Bioengineering (Basel) 2025;12:631.
  • 4. Ahn S. Large language model usage guidelines in Korean medical journals: a survey using human-artificial intelligence collaboration. J Yeungnam Med Sci 2025;42:14.
  • 5. Wang D, Zhang S. Large language models in medical and healthcare fields: applications, advances, and challenges. Artif Intell Rev 2024;57:299.
  • 6. Liu M, Okuhara T, Huang W, Ogihara A, Nagao HS, Okada H, et al. Large language models in dental licensing examinations: systematic review and meta-analysis. Int Dent J 2025;75:213–22.
  • 7. Yu E, Chu X, Zhang W, Meng X, Yang Y, Ji X, et al. Large language models in medicine: applications, challenges, and future directions. Int J Med Sci 2025;22:2792–801.
  • 8. Huang H, Zheng O, Wang D, Yin J, Wang Z, Ding S, et al. ChatGPT for shaping the future of dentistry: the potential of multi-modal large language model. Int J Oral Sci 2023;15:29.
  • 9. Roganović J, Radenković M, Miličić B. Responsible use of artificial intelligence in dentistry: survey on dentists’ and final-year undergraduates' perspectives. Healthcare (Basel) 2023;11:1480.
  • 10. Balel Y. Can ChatGPT be used in oral and maxillofacial surgery? J Stomatol Oral Maxillofac Surg 2023;124:101471.
  • 11. Cherian JM, Kurian N, Varghese KG, Thomas HA. World Health Organization’s global oral health status report: paediatric dentistry in the spotlight. J Paediatr Child Health 2023;59:925–6.
  • 12. Aoun A, Darwiche F, Al Hayek S, Doumit J. The fluoride debate: the pros and cons of fluoridation. Prev Nutr Food Sci 2018;23:171–80.
  • 13. Petersen PE, Lennon MA. Effective use of fluorides for the prevention of dental caries in the 21st century: the WHO approach. Community Dent Oral Epidemiol 2004;32:319–21.
  • 14. Ahmed WM, Azhari AA, Alfaraj A, Alhamadani A, Zhang M, Lu CT. The quality of AI-generated dental caries multiple choice questions: a comparative analysis of ChatGPT and Google Bard language models. Heliyon 2024;10:e28198.
  • 15. Nguyen HC, Dang HP, Nguyen TL, Hoang V, Nguyen VA. Accuracy of latest large language models in answering multiple choice questions in dentistry: a comparative study. PLoS One 2025;20:e0317423.
  • 16. Salman IM, Ameer OZ, Khanfar MA, Hsieh YH. Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology. Front Med (Lausanne) 2025;12:1495378.
  • 17. Aldukhail S. Mapping the landscape of generative language models in dental education: a comparison between ChatGPT and Google Bard. Eur J Dent Educ 2025;29:136–48.
  • 18. Dermata A, Arhakis A, Makrygiannakis MA, Giannakopoulos K, Kaklamanos EG. Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence. Eur Arch Paediatr Dent 2025;26:527–35.
  • 19. Tussie C, Starosta A. Comparing the dental knowledge of large language models. Br Dent J 2024 Oct 31 [Epub]. https://doi.org/10.1038/s41415-024-8015-2
  • 20. Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the performance of generative AI large language models ChatGPT, Google Bard, and Microsoft Bing Chat in supporting evidence-based dentistry: comparative mixed methods study. J Med Internet Res 2023;25:e51580.
  • 21. Künzle P, Paris S. Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments. Clin Oral Investig 2024;28:575.
  • 22. Buldur M, Sezer B. Can artificial intelligence effectively respond to frequently asked questions about fluoride usage and effects?: a qualitative study on ChatGPT. Fluoride Q Rep 2023;56:201–16.
  • 23. Yilmaz BE, Gokkurt Yilmaz BN, Ozbey F. Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis. BMC Oral Health 2025;25:573.
  • 24. Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: a pilot study. J Dent 2024;144:104938.
  • 25. Tokgöz Kaplan T, Cankar M. Evidence-based potential of generative artificial intelligence large language models on dental avulsion: ChatGPT versus Gemini. Dent Traumatol 2025;41:178–86.
  • 26. Torres-Zegarra BC, Rios-Garcia W, Ñaña-Cordova AM, Arteaga-Cisneros KF, Chalco XC, Ordoñez MA, et al. Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study. J Educ Eval Health Prof 2023;20:30.
