Rethinking How We Wrap Command Line Tools from Python

Poster No:

2270 

Submission Type:

Abstract Submission 

Authors:

Florian Rupprecht¹, Connor Lane², Reinder Vos de Wael³, Michael Milham², Gregory Kiar³

Institutions:

¹Child Mind Institute, Brooklyn, NY, ²Child Mind Institute, New York, NY, ³Child Mind Institute, Montreal, Quebec

First Author:

Florian Rupprecht  
Child Mind Institute
Brooklyn, NY

Co-Author(s):

Connor Lane  
Child Mind Institute
New York, NY
Reinder Vos de Wael  
Child Mind Institute
Montreal, Quebec
Michael Milham  
Child Mind Institute
New York, NY
Gregory Kiar  
Child Mind Institute
Montreal, Quebec

Introduction:

Given the increasing prominence of Python for brain imaging signal analysis, there is a common desire to also adopt Python for data preprocessing. However, most fundamental brain imaging tools (e.g. FSL, AFNI, FreeSurfer) are distributed as collections of executables, primarily written in C and GNU Bash, that expose heterogeneous command line interfaces. This makes their integration with Python cumbersome at best and error-prone at worst. In the last decade, the Python library Nipype [1] has seen increased adoption [2], as it bridges the interoperability gap by offering uniform Python interfaces for a variety of brain imaging software collections. However, this approach caters to the development of complex pipelines and falls short in several key areas. It adds complexity in simpler applications, and manual management of these interfaces becomes burdensome and error-prone [3], requiring extensive code modifications for even minor structural changes. Independent of Nipype, reading and writing files to disk storage is challenging to implement robustly. Detecting errors and unexpected behaviours, such as tools accessing the wrong files, is extremely difficult. In addition, large-scale processing in the cloud or across federated data centres requires separate handling of data streamed from remote file storage. These complexities result in unwieldy and user-unfriendly APIs, making the creation, expansion, and maintenance of brain imaging data pipelines challenging. At the other end of the spectrum of approaches to tool portability is Boutiques [4]. While Nipype provides a framework for describing and connecting complex workflows, Boutiques provides a descriptive framework through which individual tools can be defined. While this reduces the complexity of building simple pipelines, it does not overcome file-based data storage limitations.

Methods:

To tackle these issues and make the boundary between Python programs and command line tools more transparent, we are developing Styx. Styx is a compiler that automatically generates statically typed Python functions from Boutiques command line tool descriptors. This lets us create efficient and easy-to-inspect code for any tool. It provides users with the benefits of statically written code, specifically IntelliSense (i.e. in-editor type checking, auto-completion, and documentation) and static code validation. Our code generation system maintains flexibility: structural changes are made once in the compiler and propagate automatically to any number of generated interfaces. Styx also handles execution across different platforms (native, Singularity, Docker) and works closely with a FUSE-based virtual file system. This file system helps sandbox [5], stream, and monitor file-based inputs and outputs for command line tools, whether they come from local disk storage, Python-managed memory, or web file storage such as S3.
Supporting Image: figure_1.png
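
To make the idea of a generated, statically typed wrapper more concrete, the following is a minimal hand-written sketch of what such a function could look like. It is not Styx output and does not reflect the actual Styx API: the tool name `skullstrip`, its flags `--threshold` and `--mask`, the `Runner` protocol, the two runner classes, and the image name are all hypothetical and chosen only for illustration. The wrapper returns the final argument vector instead of executing it, so the sketch runs without any imaging software installed.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


class Runner(Protocol):
    """Hypothetical execution backend: turns tool arguments into a full command."""

    def wrap(self, args: Sequence[str]) -> list[str]: ...


@dataclass(frozen=True)
class NativeRunner:
    """Runs the tool directly on the host, so arguments pass through unchanged."""

    def wrap(self, args: Sequence[str]) -> list[str]:
        return list(args)


@dataclass(frozen=True)
class DockerRunner:
    """Prefixes the arguments with a `docker run` invocation."""

    image: str  # hypothetical image name supplied by the caller

    def wrap(self, args: Sequence[str]) -> list[str]:
        return ["docker", "run", "--rm", self.image, *args]


def skullstrip(
    infile: str,
    outfile: str,
    threshold: Optional[float] = None,
    mask: bool = False,
    runner: Optional[Runner] = None,
) -> list[str]:
    """Build the command line for the hypothetical `skullstrip` tool.

    Typed parameters let an editor or type checker flag misuse
    (e.g. passing a string as `threshold`) before anything runs.
    """
    args = ["skullstrip", infile, outfile]
    if threshold is not None:
        args += ["--threshold", str(threshold)]
    if mask:
        args.append("--mask")
    # Fall back to native execution when no runner is given.
    return (runner or NativeRunner()).wrap(args)


if __name__ == "__main__":
    print(skullstrip("in.nii.gz", "out.nii.gz", threshold=0.5))
    print(skullstrip("in.nii.gz", "out.nii.gz", mask=True,
                     runner=DockerRunner("example/imaging-tools")))
```

Keeping the execution backend behind a small interface is one way the same wrapper could target native, Docker, or Singularity execution without changes to the wrapper itself; a real implementation would also need to handle file mounts and output collection, which this sketch omits.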
 

Results:

Developing Styx is an ongoing research experiment. We will compare Styx-generated wrappers against Nipype interfaces on different storage types (disk, memory, database) across a range of metrics such as processing speed, memory usage, and CPU usage. We expect Styx-generated wrappers to be notably faster on distributed file systems that have considerable overhead for individual file access, while showing comparable performance on low-latency file systems.

Conclusions:

In summary, the development of Styx offers a novel solution to the challenges of wrapping command line tools in Python for brain imaging data pipelines. By generating efficient and easy-to-inspect code, addressing performance issues, and simplifying file management, Styx represents a significant advancement in this field. With its ability to adapt to emerging technologies and streamline data processing, Styx is poised to make a substantial impact on the development and maintenance of such pipelines.

Modeling and Analysis Methods:

Other Methods

Neuroinformatics and Data Sharing:

Databasing and Data Sharing 2
Workflows 1
Informatics Other

Keywords:

Computational Neuroscience
Computing
Data analysis
Data Organization
Informatics
Machine Learning
Open Data
Open-Source Code
Open-Source Software
Workflows

1|2 Indicates the priority used for review

References:

[1] K. Gorgolewski et al., “Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python”, Frontiers in Neuroinformatics, p. 13, 2011.
[2] O. Esteban et al., “fMRIPrep: a robust preprocessing pipeline for functional MRI”, Nature Methods, no. 1, pp. 111–116, 2019.
[3] Y. Chen et al., “Reproducing FSL's fMRI data analysis via Nipype: Relevance, challenges, and solutions”, Frontiers in Neuroimaging, p. 953215, 2022.
[4] T. Glatard et al., “Boutiques: a flexible framework to integrate command-line applications in computing platforms”, GigaScience, no. 5, 2018.
[5] J. Merino, “Optimizing sandbox creation with a FUSE file system: Using sandboxfs to speed up Bazel builds”, 2020. Available: https://ftp.heanet.ie/mirrors/fosdem-video/2020/H.2215/sandboxfs_bazel_speedup.webm