Extracting Data#


import pandas as pd
import numpy as np

‘id_file’ represents the 'subject_list.txt' given with HCP data (inside the hcp_task folder) by NMA.

It contains the subject ids of the 100 participants.

id_file = r'/Users/rajdeep_ch/Documents/nma/project/hcp_task/subjects_list.txt' 
# using .loadtxt() method from numpy to load the text file
id_Series = np.loadtxt(id_file)

The values contained in id_Series are floating point numbers.

They are first converted to int and then to str type.

id_Series_str = [str(int(sub_id)) for sub_id in id_Series]
csv_file = r'/Users/rajdeep_ch/Documents/nma/project/alldata.csv' 

allData_df = pd.read_csv(csv_file)

The allData_df contains data for all 1200 participants across all the tasks.

allData_df.shape
(1207, 383)
allData_df.head()
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 373 Unnamed: 374 Unnamed: 375 Unnamed: 376 Unnamed: 377 Unnamed: 378 Unnamed: 379 Unnamed: 380 Unnamed: 381 Unnamed: 382
0 Subject Release Acquisition Gender Age 3T_Full_MR_Compl T1_Count T2_Count 3T_RS-fMRI_Count 3T_RS-fMRI_PctCompl ... Noise_Comp Odor_Unadj Odor_AgeAdj PainIntens_RawScore PainInterf_Tscore Taste_Unadj Taste_AgeAdj Mars_Log_Score Mars_Errs Mars_Final
1 100004 S900 Q06 M 22-25 FALSE 0 0 0 0.0 ... 5.2 101.12 86.45 2 45.9 107.17 105.31 1.8 0 1.8
2 100206 S900 Q11 M 26-30 TRUE 1 1 4 100.0 ... 6.0 108.79 97.19 1 49.7 72.63 72.03 1.84 0 1.84
3 100307 Q1 Q01 F 26-30 TRUE 1 1 4 100.0 ... 3.6 101.12 86.45 0 38.6 71.69 71.76 1.76 0 1.76
4 100408 Q3 Q03 M 31-35 TRUE 1 1 4 100.0 ... 2.0 108.79 98.04 2 52.6 114.01 113.59 1.76 2 1.68

5 rows × 383 columns

Cleaned the dataframe by using row 0 as the column names of the dataframe.

The 0th row is then deleted.

allData_df.columns.name = None
allData_df.columns = list(allData_df.iloc[0])
allData_df.head()
Subject Release Acquisition Gender Age 3T_Full_MR_Compl T1_Count T2_Count 3T_RS-fMRI_Count 3T_RS-fMRI_PctCompl ... Noise_Comp Odor_Unadj Odor_AgeAdj PainIntens_RawScore PainInterf_Tscore Taste_Unadj Taste_AgeAdj Mars_Log_Score Mars_Errs Mars_Final
0 Subject Release Acquisition Gender Age 3T_Full_MR_Compl T1_Count T2_Count 3T_RS-fMRI_Count 3T_RS-fMRI_PctCompl ... Noise_Comp Odor_Unadj Odor_AgeAdj PainIntens_RawScore PainInterf_Tscore Taste_Unadj Taste_AgeAdj Mars_Log_Score Mars_Errs Mars_Final
1 100004 S900 Q06 M 22-25 FALSE 0 0 0 0.0 ... 5.2 101.12 86.45 2 45.9 107.17 105.31 1.8 0 1.8
2 100206 S900 Q11 M 26-30 TRUE 1 1 4 100.0 ... 6.0 108.79 97.19 1 49.7 72.63 72.03 1.84 0 1.84
3 100307 Q1 Q01 F 26-30 TRUE 1 1 4 100.0 ... 3.6 101.12 86.45 0 38.6 71.69 71.76 1.76 0 1.76
4 100408 Q3 Q03 M 31-35 TRUE 1 1 4 100.0 ... 2.0 108.79 98.04 2 52.6 114.01 113.59 1.76 2 1.68

5 rows × 383 columns

allData_df = allData_df.drop(0)
allData_df
Subject Release Acquisition Gender Age 3T_Full_MR_Compl T1_Count T2_Count 3T_RS-fMRI_Count 3T_RS-fMRI_PctCompl ... Noise_Comp Odor_Unadj Odor_AgeAdj PainIntens_RawScore PainInterf_Tscore Taste_Unadj Taste_AgeAdj Mars_Log_Score Mars_Errs Mars_Final
1 100004 S900 Q06 M 22-25 FALSE 0 0 0 0.0 ... 5.2 101.12 86.45 2 45.9 107.17 105.31 1.8 0 1.8
2 100206 S900 Q11 M 26-30 TRUE 1 1 4 100.0 ... 6.0 108.79 97.19 1 49.7 72.63 72.03 1.84 0 1.84
3 100307 Q1 Q01 F 26-30 TRUE 1 1 4 100.0 ... 3.6 101.12 86.45 0 38.6 71.69 71.76 1.76 0 1.76
4 100408 Q3 Q03 M 31-35 TRUE 1 1 4 100.0 ... 2.0 108.79 98.04 2 52.6 114.01 113.59 1.76 2 1.68
5 100610 S900 Q08 M 26-30 TRUE 2 1 4 100.0 ... 2.0 122.25 110.45 0 38.6 84.84 85.31 1.92 1 1.88
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1202 992774 Q2 Q02 M 31-35 TRUE 2 2 4 100.0 ... 8.4 122.25 111.41 4 50.1 107.17 103.55 1.76 0 1.76
1203 993675 S900 Q09 F 26-30 TRUE 2 2 4 100.0 ... 0.4 122.25 110.45 0 38.6 84.07 84.25 1.8 1 1.76
1204 994273 S500 Q06 M 26-30 TRUE 1 1 4 100.0 ... 6.0 122.25 111.41 7 63.8 110.65 109.73 1.8 1 1.76
1205 995174 S1200 Q13 M 22-25 FALSE 1 1 2 0.0 ... 3.6 88.61 64.58 3 50.1 117.16 117.4 1.8 0 1.8
1206 996782 S900 Q08 F 26-30 TRUE 2 2 4 100.0 ... 6.0 108.79 97.19 0 38.6 75.43 73.72 1.84 0 1.84

1206 rows × 383 columns

This is how the dataframe looks now.

From this dataframe, personality data is extracted.

The personality data columns have start index of 308 and end index of 373.

allData_df.iloc[:,308:373]
NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 NEORAW_05 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
1 27 15 28 24 25 A D SA SA N ... N A A A A D N N D A
2 29 23 26 21 32 D A A SA N ... N A A D N D N D A N
3 37 24 35 15 37 N SA A A SA ... D SA A SD D D D D SD A
4 33 29 34 15 33 D A N SA A ... D A A N D N D A A A
5 31 33 36 7 15 SA D D SA N ... SD N SA SD D A SA SA D A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1202 36 24 39 16 32 N N A SA N ... D SA A D D SD SD A SD SA
1203 42 31 37 10 24 A D N A N ... D N A D D SD D D SD SA
1204 28 32 29 22 27 D D N SA N ... D A A N N D N N N A
1205 27 30 41 18 36 A A N SA A ... D SA SA SA D A D SA D SA
1206 34 32 35 20 22 SD D D A N ... D N N D A N A D D A

1206 rows × 65 columns

The personality data for all 1200 participants in stored in the variable testing_personality

testing_personality = allData_df.iloc[:,308:373]
testing_personality.head()
NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 NEORAW_05 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
1 27 15 28 24 25 A D SA SA N ... N A A A A D N N D A
2 29 23 26 21 32 D A A SA N ... N A A D N D N D A N
3 37 24 35 15 37 N SA A A SA ... D SA A SD D D D D SD A
4 33 29 34 15 33 D A N SA A ... D A A N D N D A A A
5 31 33 36 7 15 SA D D SA N ... SD N SA SD D A SA SA D A

5 rows × 65 columns

Extracting the subject ids for all 1200 participants in the variable all_id

all_id = allData_df.iloc[:,0]
all_id
1       100004
2       100206
3       100307
4       100408
5       100610
         ...  
1202    992774
1203    993675
1204    994273
1205    995174
1206    996782
Name: Subject, Length: 1206, dtype: object

A new dataframe, named personality_df is created by extracting personality data from allData_df and combining it with all_id.

It contains the personality data for all 1200 participants.

personality_df = pd.concat([all_id,testing_personality],axis=1)
personality_df.head()
Subject NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
1 100004 27 15 28 24 25 A D SA SA ... N A A A A D N N D A
2 100206 29 23 26 21 32 D A A SA ... N A A D N D N D A N
3 100307 37 24 35 15 37 N SA A A ... D SA A SD D D D D SD A
4 100408 33 29 34 15 33 D A N SA ... D A A N D N D A A A
5 100610 31 33 36 7 15 SA D D SA ... SD N SA SD D A SA SA D A

5 rows × 66 columns

The personality_subset dataframe is meant to contain the personality data for the 100 participants.

personality_subset = pd.DataFrame(columns=personality_df.columns)
personality_subset
Subject NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60

0 rows × 66 columns

Looping through the list id_Series_str, which contains the subject ids of the 100 participants we want, the personality data for that respective participant subset_id is extracted from the personality data of all 1200 participants.

The extracted value is added to personality_subset

for subset_id in id_Series_str:

    subject_data = personality_df[personality_df['Subject'] == subset_id]
    personality_subset = pd.concat([personality_subset, subject_data],axis=0)
personality_subset.head()
Subject NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
3 100307 37 24 35 15 37 N SA A A ... D SA A SD D D D D SD A
4 100408 33 29 34 15 33 D A N SA ... D A A N D N D A A A
12 101915 35 30 45 8 31 N D A SA ... D SA SA SD SD SD D D D SA
19 102816 36 27 32 10 31 A D N SA ... SD A N N D D D N D SA
23 103414 27 30 31 20 34 D A A SA ... D SA A D D D N N D A

5 rows × 66 columns

personality_subset.shape
(100, 66)

The dataframe is then converted to a csv file and saved.

Note that index argument is set to False for to_csv method.

personality_subset.to_csv(r'/Users/rajdeep_ch/Documents/nma/project/personality_data.csv',index=False)
personality_df.head()
Subject NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
1 100004 27 15 28 24 25 A D SA SA ... N A A A A D N N D A
2 100206 29 23 26 21 32 D A A SA ... N A A D N D N D A N
3 100307 37 24 35 15 37 N SA A A ... D SA A SD D D D D SD A
4 100408 33 29 34 15 33 D A N SA ... D A A N D N D A A A
5 100610 31 33 36 7 15 SA D D SA ... SD N SA SD D A SA SA D A

5 rows × 66 columns

Note that the subject 131924 does not have data for the Big Five Inventory questions but does have scores for the five personality traits.

personality_df[personality_df['Subject'] == '131924']
Subject NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
178 131924 30 30 12 12 12 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 66 columns

personality_subset.head()
Subject NEOFAC_A NEOFAC_O NEOFAC_C NEOFAC_N NEOFAC_E NEORAW_01 NEORAW_02 NEORAW_03 NEORAW_04 ... NEORAW_51 NEORAW_52 NEORAW_53 NEORAW_54 NEORAW_55 NEORAW_56 NEORAW_57 NEORAW_58 NEORAW_59 NEORAW_60
3 100307 37 24 35 15 37 N SA A A ... D SA A SD D D D D SD A
4 100408 33 29 34 15 33 D A N SA ... D A A N D N D A A A
12 101915 35 30 45 8 31 N D A SA ... D SA SA SD SD SD D D D SA
19 102816 36 27 32 10 31 A D N SA ... SD A N N D D D N D SA
23 103414 27 30 31 20 34 D A A SA ... D SA A D D D N N D A

5 rows × 66 columns


Preliminarly going through the personality trait scores to check max, min.

neofac_a = [int(score1) for score1 in personality_subset.iloc[:,1]]
np.max(neofac_a)
46
neofac_b = [int(score2) for score2 in personality_subset.iloc[:,2]]
np.max(neofac_b)
44
neofac_c = [int(score3) for score3 in personality_subset.iloc[:,3]]
np.max(neofac_c)
48
neofac_d = [int(score4) for score4 in personality_subset.iloc[:,4]]
np.max(neofac_d)
36
neofac_e = [int(score5) for score5 in personality_subset.iloc[:,5]]
np.max(neofac_e)
45
np.max([int(score_val) for score_val in personality_df.iloc[:,5] if type(score_val) != float])
47