Using Big Data to Identify Trends and Patterns in OSHA’s Chemical Exposure Health Data: An Example Using Lead

Abstract No:

1193 

Abstract Type:

Professional Poster 

Authors:

S Smith¹

Institutions:

¹Benchmark Risk Group, Chicago, IL

Presenter:

Sierra Smith  
Benchmark Risk Group

Description:

As part of its mission to ensure the health and safety of American workers, the Occupational Safety and Health Administration (OSHA) maintains the Chemical Exposure Health Data (CEHD), which contains sampling results from OSHA compliance inspections. While others have analyzed limited subsets of these data, we used the entirety of the available data to identify the potential presence of hazards by North American Industry Classification System (NAICS) code. After significant data cleaning, we identified 1,349,079 personal, 63,806 area, and 150,066 bulk samples of various substances collected across more than 900 industries between 1984 and 2024. After excluding substances with fewer than 100 measurements or that did not represent a chemical exposure, a total of 285 unique substances were identified, ranging from 103,535 (inorganic lead) to 100 (hexylene glycol) measurements. To demonstrate the advantages and challenges of working with such a dataset, we used personal airborne lead exposures as a case study for the types of industry-specific analyses that can be conducted. This analysis demonstrates the value of the CEHD dataset in helping OEHS professionals identify potential exposures in specific industries and provides preliminary information on the upper-bound estimates of potential exposure that may exist.

Situation / Problem:

OEHS professionals are often challenged by a lack of 1) information that can be used to help identify occupational hazards, and 2) quantitative information that can calibrate professional judgment. Large publicly available datasets of occupational exposure measurements do exist, but OEHS professionals are often unaware of such datasets or unsure how to use them in their work. Further complicating matters, many of these datasets are too large to analyze effectively in Excel and contain numerous errors, ranging from minor spelling mistakes to information being recorded in the wrong column. These large datasets also often code information in a manner unfamiliar to OEHS professionals, making it challenging for them to identify the data of interest.

Despite these issues, large datasets, such as the OSHA CEHD, provide a wealth of information that can help with industry-wide hazard identification and prioritization of additional data collection to fill in any gaps in the available data or confirm existing data. To address the lack of examples, we used the OSHA CEHD to broadly characterize industrial hazards over time and across industries and specifically used personal airborne concentrations of lead collected in all industries from 1984 to 2024 as a case study of the value of this data.

Methods:

As part of OSHA's ongoing efforts to monitor worker exposures to chemicals in the workplace, OSHA compliance officers routinely take and record industrial hygiene samples, which are submitted to the OSHA Technical Center (OTC) for analysis. The OTC sampling information system has collected data from 1984 to the present, including information on personal exposures, as well as bulk and area samples for numerous airborne contaminants. OSHA typically samples activities with the highest exposure potential; therefore, data in the CEHD database can reasonably be treated as estimates of worst-case, upper-bound exposure levels.

We compiled OSHA CEHD personal air, area air, and bulk measurements recorded from 1984 through 2024. Observations from the 2025 reporting period were excluded because the calendar year was still in progress at the time of our analysis. We also excluded samples that were missing both North American Industry Classification System (NAICS) and Standard Industrial Classification (SIC) codes, as well as samples with unidentifiable codes. These exclusions allowed us to assign each establishment to a specific industry for temporal comparisons across different sectors.
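The exclusion rules in this paragraph can be sketched as a small filter. The analysis itself was run in Stata 16 and R; Python is used here only for illustration. The field names (`year`, `naics`, `sic`), the two lookup sets, and the rule that a record counts as missing industry information only when both codes are absent are all assumptions made for this sketch, not the actual CEHD column headers or the authors' code.

```python
from typing import Optional

# Illustrative lookup sets (assumption): a real analysis would load the
# complete NAICS and SIC code lists rather than these few placeholders.
KNOWN_NAICS = {"331410", "711211"}
KNOWN_SIC = {"3341"}


def exclusion_reason(record: dict) -> Optional[str]:
    """Return the first exclusion rule a record fails, or None if it is kept."""
    year = record.get("year")
    if year is None or not 1984 <= year <= 2024:
        return "outside 1984-2024"
    naics, sic = record.get("naics"), record.get("sic")
    if not naics and not sic:
        return "missing NAICS and SIC"
    if naics not in KNOWN_NAICS and sic not in KNOWN_SIC:
        return "unidentifiable industry"
    return None


def clean(records):
    """Split records into kept rows and a tally of exclusion reasons."""
    kept, dropped = [], {}
    for record in records:
        reason = exclusion_reason(record)
        if reason is None:
            kept.append(record)
        else:
            dropped[reason] = dropped.get(reason, 0) + 1
    return kept, dropped
```

Because each record is tallied under the first rule it fails, the per-rule counts depend on the order in which the rules are applied.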

To demonstrate the advantages and challenges of using such a large dataset, trends in airborne lead levels were assessed across different industries and over time. Frequency and trend analyses were conducted using Stata 16 and R. Personal airborne measurements of lead were converted to 8-hr TWA values using the provided air concentrations and sampling times. Data were explored graphically, and descriptive statistics (e.g., sample size, geometric mean, and geometric standard deviation) were calculated for each year for the entire dataset and for each industry in a given year. Changes in 8-hr TWAs over time were assessed using a simple linear regression model with year as the predictor. This approach allowed for a high-level assessment of airborne concentrations of lead in numerous industries; however, the CEHD does not contain information on job title, job task, or personal protective equipment (PPE), making it impossible to ascertain important information on exposure determinants.
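A minimal sketch of the TWA conversion and the descriptive statistics named above, written in Python for illustration (the analysis itself used Stata 16 and R). It assumes zero exposure during any unsampled part of the 8-hour (480-minute) shift, which is one common convention but is not stated in the abstract. The function names are hypothetical.

```python
import math


def twa_8hr(concentration: float, minutes_sampled: float) -> float:
    """8-hr TWA, assuming zero exposure outside the sampled period."""
    return concentration * minutes_sampled / 480.0


def geometric_stats(values):
    """Return (n, geometric mean, geometric standard deviation).

    Requires at least two strictly positive values.
    """
    logs = [math.log(v) for v in values]
    n = len(logs)
    mean_log = sum(logs) / n
    # Sample variance of the log-transformed values (n - 1 denominator).
    var_log = sum((x - mean_log) ** 2 for x in logs) / (n - 1)
    return n, math.exp(mean_log), math.exp(math.sqrt(var_log))


# Example: a 100 µg/m³ sample collected over 240 minutes.
print(twa_8hr(100.0, 240.0))  # 50.0
```

Non-detects need a substitution value or a censored-data method before logarithms are taken; the abstract does not say which approach was used, so none is shown here.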

Results / Conclusions:

A total of 2,640,997 samples were measured and recorded between 1984 and 2024. We excluded samples with a missing substance name (n = 11,476), blank samples (n = 470,108), those missing both NAICS and SIC codes (n = 80,406), sample types other than personal air, area air, and bulk samples (n = 213,602; e.g., pH, soil samples), and unidentifiable industries (n = 123,573), resulting in a total of 1,787,138 samples for analysis. Substances with fewer than 100 measurements in the dataset were then excluded, leaving a total of 1,562,951 measurements across 285 different substances. These measurements included 1,349,079 personal air, 63,806 area air, and 150,066 bulk samples collected across various industries. A total of 907 unique 6-digit, 296 unique 4-digit, and 24 unique 2-digit NAICS codes were identified.
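NAICS codes are hierarchical, so counts of unique codes at coarser levels can be obtained by truncating the 6-digit code. The Python sketch below illustrates that roll-up; it assumes every record carries a 6-digit NAICS code stored as a string, which may not hold for records that have only a SIC code. The function name and example codes are illustrative only.

```python
def count_unique_by_level(codes, levels=(6, 4, 2)):
    """Count distinct NAICS codes at each digit level by prefix truncation."""
    # Keep only well-formed 6-digit codes; anything else is ignored here.
    six_digit = {c for c in codes if len(c) == 6 and c.isdigit()}
    return {n: len({c[:n] for c in six_digit}) for n in levels}


example = ["331410", "331420", "332999", "711211", "711211"]
print(count_unique_by_level(example))  # {6: 4, 4: 3, 2: 2}
```

Some NAICS sectors span several 2-digit prefixes (manufacturing covers 31 through 33), so a count of 2-digit codes is not the same as a count of sectors.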

A total of 100,545 lead measurements were available for analysis. Of these measurements, 83,614 were personal air, 3,425 were area air, and 13,506 were bulk samples. The number of measurements collected per year ranged from 4,802 in 1988 to 416 in 2020. A total of 54,632 (65%) of the personal samples were below the limit of detection (LOD). The yearly geometric mean 8-hr TWA concentrations of lead ranged from 11.9 µg/m³ to 49.9 µg/m³ (compared to the current OSHA permissible exposure limit (PEL) of 50 µg/m³). Geometric mean 8-hr TWA concentrations were highest for the 4-digit NAICS codes associated with Spectator Sports (5,507 µg/m³), Traveler Accommodation (520 µg/m³), Taxi and Limousine Service (397 µg/m³), and Fabric Mills (387 µg/m³). However, each of these industries had fewer than five measurements, and upon closer inspection, all the measurements within each industry were collected at the same job site, making it difficult to assess whether the exposures were specific to those job sites or representative of the industry in general. More broadly, the percentage of total lead measurements exceeding the OSHA PEL fluctuated over time, ranging from approximately 3% to 22%. However, measurements exceeding 50 µg/m³ consistently made up less than 10% of all yearly measurements from 2010 onward. Across all industries with multiple years of measurements, airborne concentrations of lead decreased significantly over time, although at varying rates.
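The yearly exceedance percentages and the time trend can be sketched as follows. This is an illustrative Python outline, not the Stata or R code used for the analysis. Regressing the natural log of the 8-hr TWA on year is an assumption made here because exposure data are usually right-skewed; the abstract does not state whether concentrations were log-transformed. Function names are hypothetical.

```python
import math
from collections import defaultdict

PEL_UG_M3 = 50.0  # OSHA PEL for lead cited in the text, in µg/m³


def exceedance_by_year(samples):
    """samples: list of (year, twa). Returns {year: fraction above the PEL}."""
    totals, over = defaultdict(int), defaultdict(int)
    for year, twa in samples:
        totals[year] += 1
        if twa > PEL_UG_M3:
            over[year] += 1
    return {year: over[year] / totals[year] for year in sorted(totals)}


def ols_slope(xs, ys):
    """Slope of a simple least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def log_trend_per_year(samples):
    """Slope of ln(TWA) on year; negative values indicate a decline."""
    years = [year for year, _ in samples]
    logs = [math.log(twa) for _, twa in samples]
    return ols_slope(years, logs)
```

With a log-transformed outcome, `math.exp(slope) - 1` gives the approximate fractional change in concentration per year, which makes rates of decline comparable across industries.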

The example of occupational lead exposure demonstrates that cleaning, organizing, and interpreting data in the CEHD can be challenging. However, these data can be useful for identifying hazards present in various industries and tracking industry-wide trends in exposure over time. The methods employed in this analysis could readily be applied to any other substance in the CEHD database or focused on a specific set of industries. Future efforts will focus on using these data to develop an online hazard matrix that OEHS professionals can use to explore high-level data trends in the CEHD.

Core Competencies:

Chemical Hazards

Secondary Core Competencies:

Exposure Assessment
Work Environments, Occupations, and Industrial Processes

Keywords

Choose at least one (1) and up to five (5) keywords from the following list. These selections will optimize your presentation's search results for attendees.

Asbestos, lead, and dust
Exposure Assessment

Peer Review Group Selection

Based on the selected primary competency area of your proposal, select one group below that would be best suited to serve as a subject matter expert for peer review: (Select one)

Content Portfolio AG (CPAG) - includes Big Data, A.I., & Sensor Technologies, Enhancing OEHS Communication Skills, and Changing Work Dynamics

Targeted Audience (IH/OH Practice Level)

Based on the information that will be presented during your proposed session, please indicate the targeted audience practice level: (select one)

Practitioner: Practitioner is a job title given to persons in various occupational fields who are trained to assist professionals but are not themselves licensed or certified at a professional level by a certification body recognized by the National Accreditation Recognition (NAR) Committee of IOHA. The IH/OH practitioner performs tasks requiring significant knowledge and skill in the IH/OH field, such as conducting worker exposure monitoring and, in some cases, may even function independently of a professional IH/OH but may not be involved in the breadth of IH/OH practice nor have the level of responsibility of a professional IH/OH certified by examination. The IH/OH practitioner requires a certain level of education that can be obtained from an accredited university or equivalent. Additional training in specific skill sets that provide additional career paths to the IH/OH practitioner can also be obtained. IH/OH practitioners may also serve as team leaders or project managers.

Volunteer Groups

Was this session organized by an AIHA Technical Committee, Special Interest Group, Working Group, Advisory Group, or other AIHA project team?

No

Worker Exposure Data and/ or Results

Are worker exposure data and/or results of worker exposure data analysis presented?

Yes

If yes (i.e., worker exposure data and/or results of worker exposure data analysis are to be presented), please describe the statistical methods and tools used for analysis of the data (e.g., IHSTAT, Expostats, IHSTAT_Bayes, IHDA-AIHA, or another statistical tool; please specify).

Stata 16, R

Practical Application

How will this help advance the science of IH/OH?

This poster demonstrates the value of the OSHA CEHD dataset; specifically, how the data may be used by OEHS professionals to identify potential worst-case exposures in various industries across time. This information may be used by OEHS professionals to fill gaps in occupational data or confirm existing data. Additionally, it highlights the importance of data cleaning in order to effectively analyze and understand large datasets. Lastly, this poster provides a case study on lead exposure to serve as an example of how to understand and use CEHD data in an effective manner.

Content Level

What level would you consider your presentation content geared towards?

Intermediate: Specific topics within a subject. The participant would have two (2) to ten (10) years experience in industrial hygiene or OEHS and a good understanding of the subject area, but not of the specific topic presented. Prerequisites required: another course, skill, or working knowledge of the general subject.

Presentation History

Have you presented this information before?

No

Poster Presentation Submission Agreement

I have read and agree to these guidelines.

Yes