Mon, 5/20: 1:30 PM - 1:55 PM EDT
Pop-Up Education
Greater Columbus Convention Center
Room: Pop-Up Education, Exhibit Hall AB, Aisle 1300
Generative pre-trained transformers (GPTs), such as the large language models (LLMs) behind ChatGPT, have captured headlines for their utility (or lack thereof) in writing stories, analyzing legal briefs, and writing computer code. Many organizations have hesitated to fully embrace these technologies because of uncertainty about the completeness and accuracy of responses and because using these models requires surrendering potentially sensitive data to third parties. Although the most widely used LLMs are hosted on third-party platforms, end users can instead run pre-trained LLMs locally, isolated to their own device. However, this approach requires the user to sacrifice model quality because of limits on computing capacity. While machine learning has seen limited use in the industrial hygiene literature, there has been almost no investigation into how these technologies will affect the daily work of occupational health professionals. This pilot study evaluated the general utility of LLMs for reading and summarizing OSHA regulations.
Five different LLMs were evaluated. Two of these, ChatGPT 3.5 and Llama2-70B, were hosted on third-party servers. The remaining three were all instances of Llama2-13B hosted on a local device. The first local LLM had access only to the data it was pre-trained on, while the second relied only on copies of OSHA's standards for occupational noise and respirable crystalline silica (RCS) and no other information. The third local model used copies of both standards together with its pre-trained data. The same prompts were provided to each of the five LLMs, and the responses were evaluated independently by four certified industrial hygienists (CIHs) on a 4-point Likert-type scale with the options "terrible," "bad," "acceptable," and "good." Descriptive statistics and Fleiss' kappa were calculated, and ordinal logistic regression was used to determine which of the five models performed best while controlling for individual rater variability.
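The inter-rater agreement statistic named above can be illustrated with a short sketch. This is not the study's analysis code: the function name `fleiss_kappa`, the `CATEGORIES` tuple, and the example ratings are all hypothetical, and the only assumption carried over from the abstract is that each response receives one of four labels from each of the same number of raters.

```python
from collections import Counter

# The four rating labels described in the abstract.
CATEGORIES = ("terrible", "bad", "acceptable", "good")


def fleiss_kappa(ratings, categories=CATEGORIES):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who gave item i the j-th category
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(n * n for n in row) - n_raters) / pairs for row in counts
    ) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical ratings: three responses, each scored by four raters.
example = [
    ["good", "good", "acceptable", "good"],
    ["bad", "terrible", "bad", "acceptable"],
    ["acceptable", "acceptable", "good", "bad"],
]
kappa = fleiss_kappa(example)
```

A value of 1 means the raters always agree, and a value near 0 means agreement is no better than chance would produce given how often each label was used overall.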
The four CIHs assigned a mean score of 2.4 across all LLMs. When stratified by LLM, ChatGPT 3.5 had the highest mean score (3.3), while the Llama2-13B model restricted to the text of the standards had the lowest (1.8). The overall inter-rater agreement was 0.29, indicating "fair" agreement between CIHs. When controlling for rater, ChatGPT 3.5 had the highest probability (0.57) of a response being marked "good," while the restricted Llama2-13B had the lowest (0.07). Overall, ChatGPT 3.5 performed best in responding to the prompts about the noise and RCS standards, but was still judged to be less than "good" at the task. The results of this preliminary study suggest that some LLMs performed reasonably well at the task. However, performance is linked to the sophistication of the model, so the performance of individual models can vary widely. These results suggest that organizations wishing to use LLMs effectively with their private data still need either to rely on LLMs hosted by a third party or to invest in significant IT infrastructure.
Describe how large language models operate and the implications of using a model hosted on a third party platform compared to one that is hosted locally.
Recognize the potential and limitations of these models to summarize occupational safety and health regulations into actionable information that can be used to improve training and decision making.
Content Level
Intermediate
Topics
Also part of the Virtual Program
Available as part of AIHA CONNECT OnDemand
Big Data
Computer/Mobile Apps and Tools
Standards, Regulations and Legal Issues