Creating a manifest from a wide-form tabular file for a longitudinal study with imaging and non-imaging visits

In this example, we have a longitudinal study with both non-imaging and imaging visits. Specifically, non-imaging (neuropsychological) data was collected every year, and imaging data (anatomical only) was collected every two years.

We start with two CSV files:

example2-demographics_neuropsych.csv contains demographics information and dates for the neuropsych visits

PARTICIPANT	SEX	DATE_OF_BIRTH	DATE_NEUROPSYCH1	DATE_NEUROPSYCH2	DATE_NEUROPSYCH3
ABC_001	F	1970/12/31	2015/01/30	2016/02/01	2017/02/10
ABC_002	M	1967/02/20	2015/02/19	2016/02/22	2017/02/25
ABC_003	F	1955/05/21	2016/03/03	2017/03/10

example2-mri.csv contains dates for the MRI visits

PARTICIPANT	DATE_MRI1	DATE_MRI2
ABC_001	2015/02/07	2017/02/15
ABC_002	2015/02/26	2017/03/01
ABC_003	2016/03/09

These files give us the following information:

The study has 3 participants
Each participant has up to 3 non-imaging visits and up to 2 imaging visits (ABC_003 has only 2 non-imaging visits and 1 imaging visit so far)
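The visit counts can be checked directly from the wide-form files by counting the non-empty date columns for each participant. The snippet below is a minimal sketch rather than part of the example itself: the inline strings are assumed to mirror the comma-separated layout of the two files (with empty trailing fields for ABC_003), and `count_visits` is a hypothetical helper introduced only for illustration.

```python
import io

import pandas as pd

# Inline copies of the two example files so that the sketch runs on its own.
# Assumption: the real files are comma-separated with empty fields for missing dates.
NEUROPSYCH_CSV = """\
PARTICIPANT,SEX,DATE_OF_BIRTH,DATE_NEUROPSYCH1,DATE_NEUROPSYCH2,DATE_NEUROPSYCH3
ABC_001,F,1970/12/31,2015/01/30,2016/02/01,2017/02/10
ABC_002,M,1967/02/20,2015/02/19,2016/02/22,2017/02/25
ABC_003,F,1955/05/21,2016/03/03,2017/03/10,
"""
MRI_CSV = """\
PARTICIPANT,DATE_MRI1,DATE_MRI2
ABC_001,2015/02/07,2017/02/15
ABC_002,2015/02/26,2017/03/01
ABC_003,2016/03/09,
"""


def count_visits(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Count the non-empty date columns starting with `prefix` for each participant."""
    date_columns = [col for col in df.columns if col.startswith(prefix)]
    return df.set_index("PARTICIPANT")[date_columns].notna().sum(axis=1)


# dtype=str keeps the dates as text; empty fields are still read as missing values
df_neuropsych = pd.read_csv(io.StringIO(NEUROPSYCH_CSV), dtype=str)
df_mri = pd.read_csv(io.StringIO(MRI_CSV), dtype=str)

n_neuropsych = count_visits(df_neuropsych, "DATE_NEUROPSYCH")
n_mri = count_visits(df_mri, "DATE_MRI")
print(pd.DataFrame({"neuropsych": n_neuropsych, "mri": n_mri}))
```

Note that the `DATE_NEUROPSYCH` prefix deliberately excludes the `DATE_OF_BIRTH` column, which is not a visit date.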

Given that we know that all imaging sessions collected anatomical data only, we have all the information required for the manifest file. Here is a manifest-generation script that does the job:

Attention

The script below was written for Python 3.11 with pandas 2.2.3. It may not work with older/different versions.

#!/usr/bin/env python
"""Manifest-generation script for Example 2."""

from pathlib import Path

import pandas as pd

if __name__ == "__main__":

    # get the path to the demographics/neuropsych file and the MRI file
    # we assume that it is in the same directory as this script
    path_neuropsych = Path(__file__).parent / "example2-demographics_neuropsych.csv"
    path_mri = Path(__file__).parent / "example2-mri.csv"

    # load the files and merge them
    df_neuropsych = pd.read_csv(path_neuropsych, dtype=str)
    df_mri = pd.read_csv(path_mri, dtype=str)
    df_merged = pd.merge(
        df_neuropsych, df_mri, how="left", left_on="PARTICIPANT", right_on="PARTICIPANT"
    )

    data_for_manifest = []
    for _, row in df_merged.iterrows():

        # remove underscores
        participant_id = row["PARTICIPANT"].replace("_", "")

        # each row in the demographics file is multiple rows in the manifest file
        for visit_id in [
            "NEUROPSYCH1",
            "NEUROPSYCH2",
            "NEUROPSYCH3",
            "MRI1",
            "MRI2",
        ]:

            # if the DATE column is empty, the visit did not happen yet
            if pd.isna(row[f"DATE_{visit_id}"]):
                continue

            # session_id is only defined for MRI visits
            if visit_id.startswith("MRI"):
                session_id = visit_id.removeprefix("MRI")

                # all participants only have anat datatype
                datatype = ["anat"]
            else:
                session_id = pd.NA
                datatype = []

            # create the manifest entry
            data_for_manifest.append(
                {
                    "participant_id": participant_id,
                    "visit_id": visit_id,
                    "session_id": session_id,
                    "datatype": datatype,
                }
            )

    df_manifest = pd.DataFrame(data_for_manifest)

    # write the manifest in the same directory as this script
    df_manifest.to_csv(
        Path(__file__).parent / "example2-manifest.tsv", sep="\t", index=False
    )

Running this script creates a manifest that looks like this:

participant_id	visit_id	session_id	datatype
ABC001	NEUROPSYCH1		[]
ABC001	NEUROPSYCH2		[]
ABC001	NEUROPSYCH3		[]
ABC001	MRI1	1	['anat']
ABC001	MRI2	2	['anat']
ABC002	NEUROPSYCH1		[]
ABC002	NEUROPSYCH2		[]
ABC002	NEUROPSYCH3		[]
ABC002	MRI1	1	['anat']
ABC002	MRI2	2	['anat']
ABC003	NEUROPSYCH1		[]
ABC003	NEUROPSYCH2		[]
ABC003	MRI1	1	['anat']
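
Because the script passes Python lists to `to_csv`, the `datatype` column is written as the string form of each list (for example `['anat']` or `[]`). If the manifest needs to be inspected again with pandas, one possible approach is to parse that column back into lists with `ast.literal_eval`. The snippet below is only a sketch for manual inspection, using an inline copy of the ABC003 rows as assumed input; it is not meant to describe how any downstream tool reads the file.

```python
import ast
import io

import pandas as pd

# Inline copy of the ABC003 rows from the manifest (tab-separated),
# used here instead of reading example2-manifest.tsv from disk.
MANIFEST_TSV = (
    "participant_id\tvisit_id\tsession_id\tdatatype\n"
    "ABC003\tNEUROPSYCH1\t\t[]\n"
    "ABC003\tNEUROPSYCH2\t\t[]\n"
    "ABC003\tMRI1\t1\t['anat']\n"
)

df_manifest = pd.read_csv(io.StringIO(MANIFEST_TSV), sep="\t", dtype=str)

# turn the string "['anat']" back into the Python list ["anat"]
df_manifest["datatype"] = df_manifest["datatype"].apply(ast.literal_eval)

# imaging visits are the rows with a non-empty session_id
df_imaging = df_manifest[df_manifest["session_id"].notna()]
print(df_imaging)
```

Reading with `dtype=str` keeps `session_id` as text, so an empty field shows up as a missing value rather than turning the column into floats.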