Evaluating LLM Accuracy at Answering Quantitative Occupational Hygiene Questions
Abstract No:
1676
Abstract Type:
Student Poster
Authors:
A Lee1, P Raynor2
Institutions:
1N/A, Minneapolis, MN, 2University of Minnesota, Minneapolis, MN
Presenter:
Anna Lee
N/A
Faculty Advisor:
Peter Raynor
University of Minnesota
Description:
How effectively do different large language models respond to occupational hygiene queries? This research evaluated three commonly used models - OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini - comparing their accuracy on open-ended quantitative questions. Variations in prompts were applied to determine the impact of prompt engineering, and standalone queries were compared against multi-turn conversations to provide insight into model effectiveness under different user interaction styles.
Situation/Problem:
With artificial intelligence usage becoming mainstream within workplaces, questions arise regarding the reliability of AI. If large language models are going to be used for occupational health and safety purposes, it is essential to understand the strengths and limitations of these models. This research sought to understand:
-How effectively can common LLMs respond to quantitative industrial hygiene problems?
-What types of questions can LLMs most effectively answer?
-Which models and methods of engagement are most effective for obtaining accurate answers?
Methods:
One hundred quantitative, open-ended occupational hygiene questions were compiled from University of Minnesota occupational hygiene coursework. Questions ranged in difficulty from simple unit conversions to more complex problems requiring both mathematical calculation and subject knowledge, and covered a variety of topics within the realm of occupational hygiene. These one hundred questions were sent to three models - ChatGPT, Claude, and Gemini. Calls to the models were performed with and without an engineered prompt prefacing the questions, and using both standalone queries and multi-turn conversations. Calls to each model under each condition were repeated five times. The testing process was automated using Python code hosted in Google Colab, which executed calls to the LLMs' respective Application Programming Interfaces (APIs). Answers were graded for correctness, including correct units. The grading process allowed a one-percent margin of error when delineating correct from incorrect answers. Data were analyzed both descriptively and statistically; statistical analyses consisted of a multi-factor analysis of the experimental factors influencing the correctness of the LLM answers.
Results / Conclusions:
Across all tests, Gemini was the most accurate of the three models, answering an average of 66.5% of questions correctly. ChatGPT answered an average of 39.5% of questions correctly and Claude an average of 36.8%. Use of engineered prompts appeared to have little effect on LLM accuracy, as did whether the models were engaged with standalone queries or multi-turn conversations. Accuracy declined with increased question difficulty: across all models and tests, 48.1% of "easy", 35.1% of "medium", and 16.8% of "hard" questions were answered correctly. A strength of using quantitative questions is that correctness can be graded objectively. However, applicability of the results is limited to users relying on AI for quantitative assessments, with limited relevance to more qualitative uses.
Core Competencies:
IH/OH Program Management
Choose at least one (1) and up to five (5) keywords from the following list. These selections will optimize your presentation's search results for attendees.
Education and training
Based on the information that will be presented during your proposed session, please indicate the targeted audience practice level: (select one)
Professional: Professional is a job title given to persons who have obtained a baccalaureate or graduate degree in IH/OH, public health, safety, environmental sciences, biology, chemistry, physics, or engineering, or who have a degree in another area that meets the standards set forth in the next section, Knowledge and Skill Sets of IH/OH Practice Levels, and who have had 4 or more years of practice. One significant way of demonstrating professional competence is to achieve certification by a 3rd party whose certification scheme is recognized by the International Occupational Hygiene Association (IOHA), such as the Board of Global EHS Credentialing (BGC).
Was this session organized by an AIHA Technical Committee, Special Interest Group, Working Group, Advisory Group or other AIHA project Team?
No
Are worker exposure data and/or results of worker exposure data analysis presented?
No
How will this help advance the science of IH/OH?
With AI usage on the rise, professionals and non-professionals alike are bound to use AI to problem-solve occupational hygiene issues. Gaining an understanding of LLMs’ strengths and weaknesses specific to occupational hygiene is crucial to preventing misapplication of AI-generated guidance.
Have you presented this information before?
No
I have read and agree to these guidelines.
Yes