Creating a manifest from a wide-form tabular file for a longitudinal study with imaging and non-imaging visits
In this example, we have a longitudinal study with both non-imaging and imaging visits. Specifically, non-imaging (neuropsychological) data was collected every year, and imaging data (anatomical only) was collected every two years.
We start with two CSV files:
example2-demographics_neuropsych.csv contains demographics information and dates for the neuropsych visits:

PARTICIPANT | SEX | DATE_OF_BIRTH | DATE_NEUROPSYCH1 | DATE_NEUROPSYCH2 | DATE_NEUROPSYCH3 |
---|---|---|---|---|---|
ABC_001 | F | 1970/12/31 | 2015/01/30 | 2016/02/01 | 2017/02/10 |
ABC_002 | M | 1967/02/20 | 2015/02/19 | 2016/02/22 | 2017/02/25 |
ABC_003 | F | 1955/05/21 | 2016/03/03 | 2017/03/10 | |

example2-mri.csv contains dates for the MRI visits:

PARTICIPANT | DATE_MRI1 | DATE_MRI2 |
---|---|---|
ABC_001 | 2015/02/07 | 2017/02/15 |
ABC_002 | 2015/02/26 | 2017/03/01 |
ABC_003 | 2016/03/09 | |

These files give us the following information:
- The study has 3 participants
- Each participant has up to 3 non-imaging visits and up to 2 imaging visits (ABC_003 has only 2 non-imaging visits and 1 imaging visit so far)
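
The visit counts above can be checked with a short pandas sketch. It is not part of the original example: the function name `count_visits` and the inline copies of the two CSV files are assumptions made so that the snippet runs on its own. It relies on empty date cells being read as missing values.

```python
import io

import pandas as pd

# Inline copies of the two example files, so the sketch runs on its own
NEUROPSYCH_CSV = """PARTICIPANT,SEX,DATE_OF_BIRTH,DATE_NEUROPSYCH1,DATE_NEUROPSYCH2,DATE_NEUROPSYCH3
ABC_001,F,1970/12/31,2015/01/30,2016/02/01,2017/02/10
ABC_002,M,1967/02/20,2015/02/19,2016/02/22,2017/02/25
ABC_003,F,1955/05/21,2016/03/03,2017/03/10,
"""
MRI_CSV = """PARTICIPANT,DATE_MRI1,DATE_MRI2
ABC_001,2015/02/07,2017/02/15
ABC_002,2015/02/26,2017/03/01
ABC_003,2016/03/09,
"""


def count_visits(neuropsych_csv: str, mri_csv: str) -> pd.DataFrame:
    """Count completed neuropsych and MRI visits for each participant."""
    df_neuropsych = pd.read_csv(io.StringIO(neuropsych_csv), dtype=str)
    df_mri = pd.read_csv(io.StringIO(mri_csv), dtype=str)
    df = df_neuropsych.merge(df_mri, how="left", on="PARTICIPANT")
    df = df.set_index("PARTICIPANT")

    # an empty date cell is read as NaN, which count() ignores
    return pd.DataFrame(
        {
            "n_neuropsych": df.filter(like="DATE_NEUROPSYCH").count(axis="columns"),
            "n_mri": df.filter(like="DATE_MRI").count(axis="columns"),
        }
    )


if __name__ == "__main__":
    print(count_visits(NEUROPSYCH_CSV, MRI_CSV))
```

Counting non-missing date columns per row is a quick way to spot participants who have not completed every visit before writing the manifest.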
Given that we know that all imaging sessions collected anatomical data only, we have all the information required for the manifest file. Here is a manifest-generation script that does the job:
Attention
The script below was written for Python 3.11 with pandas 2.2.3. It may not work with older or different versions.
```python
#!/usr/bin/env python
"""Manifest-generation script for Example 2."""

from pathlib import Path

import pandas as pd

if __name__ == "__main__":

    # get the paths to the demographics/neuropsych file and the MRI file
    # we assume that they are in the same directory as this script
    path_neuropsych = Path(__file__).parent / "example2-demographics_neuropsych.csv"
    path_mri = Path(__file__).parent / "example2-mri.csv"

    # load the files and merge them
    df_neuropsych = pd.read_csv(path_neuropsych, dtype=str)
    df_mri = pd.read_csv(path_mri, dtype=str)
    df_merged = pd.merge(
        df_neuropsych, df_mri, how="left", left_on="PARTICIPANT", right_on="PARTICIPANT"
    )

    data_for_manifest = []
    for _, row in df_merged.iterrows():

        # remove underscores
        participant_id = row["PARTICIPANT"].replace("_", "")

        # each row in the merged table becomes multiple rows in the manifest file
        for visit_id in [
            "NEUROPSYCH1",
            "NEUROPSYCH2",
            "NEUROPSYCH3",
            "MRI1",
            "MRI2",
        ]:

            # if the DATE column is empty, the visit did not happen yet
            if pd.isna(row[f"DATE_{visit_id}"]):
                continue

            # session_id is only defined for MRI visits
            if visit_id.startswith("MRI"):
                session_id = visit_id.removeprefix("MRI")

                # all participants only have anat datatype
                datatype = ["anat"]
            else:
                session_id = pd.NA
                datatype = []

            # create the manifest entry
            data_for_manifest.append(
                {
                    "participant_id": participant_id,
                    "visit_id": visit_id,
                    "session_id": session_id,
                    "datatype": datatype,
                }
            )

    df_manifest = pd.DataFrame(data_for_manifest)

    # write the manifest in the same directory as this script
    df_manifest.to_csv(
        Path(__file__).parent / "example2-manifest.tsv", sep="\t", index=False
    )
```
Running this script creates a manifest that looks like this:

participant_id | visit_id | session_id | datatype |
---|---|---|---|
ABC001 | NEUROPSYCH1 | | [] |
ABC001 | NEUROPSYCH2 | | [] |
ABC001 | NEUROPSYCH3 | | [] |
ABC001 | MRI1 | 1 | ['anat'] |
ABC001 | MRI2 | 2 | ['anat'] |
ABC002 | NEUROPSYCH1 | | [] |
ABC002 | NEUROPSYCH2 | | [] |
ABC002 | NEUROPSYCH3 | | [] |
ABC002 | MRI1 | 1 | ['anat'] |
ABC002 | MRI2 | 2 | ['anat'] |
ABC003 | NEUROPSYCH1 | | [] |
ABC003 | NEUROPSYCH2 | | [] |
ABC003 | MRI1 | 1 | ['anat'] |
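
As an alternative to looping over rows, the same wide-to-long reshaping can be sketched with `DataFrame.melt`. This is a hedged illustration and not the script used in this example: the function name `wide_to_manifest` and its `visit_ids` parameter are hypothetical. It assumes the input has a `PARTICIPANT` column and one `DATE_<visit_id>` column per visit, and that visits whose ID starts with `MRI` are imaging sessions with anatomical data only.

```python
import pandas as pd


def wide_to_manifest(df_wide: pd.DataFrame, visit_ids: list[str]) -> pd.DataFrame:
    """Reshape a wide table with DATE_<visit_id> columns into manifest rows."""
    # one row per (participant, visit); visits without a date are dropped
    df_long = (
        df_wide.melt(
            id_vars="PARTICIPANT",
            value_vars=[f"DATE_{visit_id}" for visit_id in visit_ids],
            var_name="visit_id",
            value_name="date",
        )
        .dropna(subset=["date"])
        # a stable sort keeps the visit order within each participant
        .sort_values("PARTICIPANT", kind="stable")
        .reset_index(drop=True)
    )

    visit_id = df_long["visit_id"].str.removeprefix("DATE_")
    is_mri = visit_id.str.startswith("MRI")

    return pd.DataFrame(
        {
            # remove underscores, as in the loop-based script
            "participant_id": df_long["PARTICIPANT"].str.replace("_", "", regex=False),
            "visit_id": visit_id,
            # session_id is only defined for MRI visits (missing otherwise)
            "session_id": visit_id.str.removeprefix("MRI").where(is_mri),
            # assumption: MRI visits only have the anat datatype
            "datatype": [["anat"] if mri else [] for mri in is_mri],
        }
    )
```

Calling `wide_to_manifest(df_merged, ["NEUROPSYCH1", "NEUROPSYCH2", "NEUROPSYCH3", "MRI1", "MRI2"])` on the merged table from the script should yield the same rows as the loop. Passing the visit IDs explicitly keeps `DATE_OF_BIRTH` from being mistaken for a visit date.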