Extracting Data
Extracting Data#
import pandas as pd
import numpy as np
‘id_file’ represents the 'subject_list.txt'
given with HCP data (inside the hcp_task folder) by NMA.
It contains the subject ids of the 100 participants.
id_file = r'/Users/rajdeep_ch/Documents/nma/project/hcp_task/subjects_list.txt'
# using .loadtxt() method from numpy to load the text file
id_Series = np.loadtxt(id_file)
The values contained in id_Series
are floating point numbers.
They are first converted to int
and then to str
type.
id_Series_str = [str(int(sub_id)) for sub_id in id_Series]
csv_file = r'/Users/rajdeep_ch/Documents/nma/project/alldata.csv'
allData_df = pd.read_csv(csv_file)
The allData_df
contains data for all 1200 participants across all the tasks.
allData_df.shape
(1207, 383)
allData_df.head()
Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 373 | Unnamed: 374 | Unnamed: 375 | Unnamed: 376 | Unnamed: 377 | Unnamed: 378 | Unnamed: 379 | Unnamed: 380 | Unnamed: 381 | Unnamed: 382 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Subject | Release | Acquisition | Gender | Age | 3T_Full_MR_Compl | T1_Count | T2_Count | 3T_RS-fMRI_Count | 3T_RS-fMRI_PctCompl | ... | Noise_Comp | Odor_Unadj | Odor_AgeAdj | PainIntens_RawScore | PainInterf_Tscore | Taste_Unadj | Taste_AgeAdj | Mars_Log_Score | Mars_Errs | Mars_Final |
1 | 100004 | S900 | Q06 | M | 22-25 | FALSE | 0 | 0 | 0 | 0.0 | ... | 5.2 | 101.12 | 86.45 | 2 | 45.9 | 107.17 | 105.31 | 1.8 | 0 | 1.8 |
2 | 100206 | S900 | Q11 | M | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 6.0 | 108.79 | 97.19 | 1 | 49.7 | 72.63 | 72.03 | 1.84 | 0 | 1.84 |
3 | 100307 | Q1 | Q01 | F | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 3.6 | 101.12 | 86.45 | 0 | 38.6 | 71.69 | 71.76 | 1.76 | 0 | 1.76 |
4 | 100408 | Q3 | Q03 | M | 31-35 | TRUE | 1 | 1 | 4 | 100.0 | ... | 2.0 | 108.79 | 98.04 | 2 | 52.6 | 114.01 | 113.59 | 1.76 | 2 | 1.68 |
5 rows × 383 columns
Cleaned the dataframe by using row 0 as the column names of the dataframe.
The 0th row is then deleted.
allData_df.columns.name = None
allData_df.columns = list(allData_df.iloc[0])
allData_df.head()
Subject | Release | Acquisition | Gender | Age | 3T_Full_MR_Compl | T1_Count | T2_Count | 3T_RS-fMRI_Count | 3T_RS-fMRI_PctCompl | ... | Noise_Comp | Odor_Unadj | Odor_AgeAdj | PainIntens_RawScore | PainInterf_Tscore | Taste_Unadj | Taste_AgeAdj | Mars_Log_Score | Mars_Errs | Mars_Final | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Subject | Release | Acquisition | Gender | Age | 3T_Full_MR_Compl | T1_Count | T2_Count | 3T_RS-fMRI_Count | 3T_RS-fMRI_PctCompl | ... | Noise_Comp | Odor_Unadj | Odor_AgeAdj | PainIntens_RawScore | PainInterf_Tscore | Taste_Unadj | Taste_AgeAdj | Mars_Log_Score | Mars_Errs | Mars_Final |
1 | 100004 | S900 | Q06 | M | 22-25 | FALSE | 0 | 0 | 0 | 0.0 | ... | 5.2 | 101.12 | 86.45 | 2 | 45.9 | 107.17 | 105.31 | 1.8 | 0 | 1.8 |
2 | 100206 | S900 | Q11 | M | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 6.0 | 108.79 | 97.19 | 1 | 49.7 | 72.63 | 72.03 | 1.84 | 0 | 1.84 |
3 | 100307 | Q1 | Q01 | F | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 3.6 | 101.12 | 86.45 | 0 | 38.6 | 71.69 | 71.76 | 1.76 | 0 | 1.76 |
4 | 100408 | Q3 | Q03 | M | 31-35 | TRUE | 1 | 1 | 4 | 100.0 | ... | 2.0 | 108.79 | 98.04 | 2 | 52.6 | 114.01 | 113.59 | 1.76 | 2 | 1.68 |
5 rows × 383 columns
allData_df = allData_df.drop(0)
allData_df
Subject | Release | Acquisition | Gender | Age | 3T_Full_MR_Compl | T1_Count | T2_Count | 3T_RS-fMRI_Count | 3T_RS-fMRI_PctCompl | ... | Noise_Comp | Odor_Unadj | Odor_AgeAdj | PainIntens_RawScore | PainInterf_Tscore | Taste_Unadj | Taste_AgeAdj | Mars_Log_Score | Mars_Errs | Mars_Final | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 100004 | S900 | Q06 | M | 22-25 | FALSE | 0 | 0 | 0 | 0.0 | ... | 5.2 | 101.12 | 86.45 | 2 | 45.9 | 107.17 | 105.31 | 1.8 | 0 | 1.8 |
2 | 100206 | S900 | Q11 | M | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 6.0 | 108.79 | 97.19 | 1 | 49.7 | 72.63 | 72.03 | 1.84 | 0 | 1.84 |
3 | 100307 | Q1 | Q01 | F | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 3.6 | 101.12 | 86.45 | 0 | 38.6 | 71.69 | 71.76 | 1.76 | 0 | 1.76 |
4 | 100408 | Q3 | Q03 | M | 31-35 | TRUE | 1 | 1 | 4 | 100.0 | ... | 2.0 | 108.79 | 98.04 | 2 | 52.6 | 114.01 | 113.59 | 1.76 | 2 | 1.68 |
5 | 100610 | S900 | Q08 | M | 26-30 | TRUE | 2 | 1 | 4 | 100.0 | ... | 2.0 | 122.25 | 110.45 | 0 | 38.6 | 84.84 | 85.31 | 1.92 | 1 | 1.88 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1202 | 992774 | Q2 | Q02 | M | 31-35 | TRUE | 2 | 2 | 4 | 100.0 | ... | 8.4 | 122.25 | 111.41 | 4 | 50.1 | 107.17 | 103.55 | 1.76 | 0 | 1.76 |
1203 | 993675 | S900 | Q09 | F | 26-30 | TRUE | 2 | 2 | 4 | 100.0 | ... | 0.4 | 122.25 | 110.45 | 0 | 38.6 | 84.07 | 84.25 | 1.8 | 1 | 1.76 |
1204 | 994273 | S500 | Q06 | M | 26-30 | TRUE | 1 | 1 | 4 | 100.0 | ... | 6.0 | 122.25 | 111.41 | 7 | 63.8 | 110.65 | 109.73 | 1.8 | 1 | 1.76 |
1205 | 995174 | S1200 | Q13 | M | 22-25 | FALSE | 1 | 1 | 2 | 0.0 | ... | 3.6 | 88.61 | 64.58 | 3 | 50.1 | 117.16 | 117.4 | 1.8 | 0 | 1.8 |
1206 | 996782 | S900 | Q08 | F | 26-30 | TRUE | 2 | 2 | 4 | 100.0 | ... | 6.0 | 108.79 | 97.19 | 0 | 38.6 | 75.43 | 73.72 | 1.84 | 0 | 1.84 |
1206 rows × 383 columns
This is how the dataframe looks now.
From this dataframe, personality data is extracted.
The personality data columns have start index of 308 and end index of 373.
allData_df.iloc[:,308:373]
NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | NEORAW_05 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 27 | 15 | 28 | 24 | 25 | A | D | SA | SA | N | ... | N | A | A | A | A | D | N | N | D | A |
2 | 29 | 23 | 26 | 21 | 32 | D | A | A | SA | N | ... | N | A | A | D | N | D | N | D | A | N |
3 | 37 | 24 | 35 | 15 | 37 | N | SA | A | A | SA | ... | D | SA | A | SD | D | D | D | D | SD | A |
4 | 33 | 29 | 34 | 15 | 33 | D | A | N | SA | A | ... | D | A | A | N | D | N | D | A | A | A |
5 | 31 | 33 | 36 | 7 | 15 | SA | D | D | SA | N | ... | SD | N | SA | SD | D | A | SA | SA | D | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1202 | 36 | 24 | 39 | 16 | 32 | N | N | A | SA | N | ... | D | SA | A | D | D | SD | SD | A | SD | SA |
1203 | 42 | 31 | 37 | 10 | 24 | A | D | N | A | N | ... | D | N | A | D | D | SD | D | D | SD | SA |
1204 | 28 | 32 | 29 | 22 | 27 | D | D | N | SA | N | ... | D | A | A | N | N | D | N | N | N | A |
1205 | 27 | 30 | 41 | 18 | 36 | A | A | N | SA | A | ... | D | SA | SA | SA | D | A | D | SA | D | SA |
1206 | 34 | 32 | 35 | 20 | 22 | SD | D | D | A | N | ... | D | N | N | D | A | N | A | D | D | A |
1206 rows × 65 columns
The personality data for all 1200 participants in stored in the variable testing_personality
testing_personality = allData_df.iloc[:,308:373]
testing_personality.head()
NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | NEORAW_05 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 27 | 15 | 28 | 24 | 25 | A | D | SA | SA | N | ... | N | A | A | A | A | D | N | N | D | A |
2 | 29 | 23 | 26 | 21 | 32 | D | A | A | SA | N | ... | N | A | A | D | N | D | N | D | A | N |
3 | 37 | 24 | 35 | 15 | 37 | N | SA | A | A | SA | ... | D | SA | A | SD | D | D | D | D | SD | A |
4 | 33 | 29 | 34 | 15 | 33 | D | A | N | SA | A | ... | D | A | A | N | D | N | D | A | A | A |
5 | 31 | 33 | 36 | 7 | 15 | SA | D | D | SA | N | ... | SD | N | SA | SD | D | A | SA | SA | D | A |
5 rows × 65 columns
Extracting the subject ids for all 1200 participants in the variable all_id
all_id = allData_df.iloc[:,0]
all_id
1 100004
2 100206
3 100307
4 100408
5 100610
...
1202 992774
1203 993675
1204 994273
1205 995174
1206 996782
Name: Subject, Length: 1206, dtype: object
A new dataframe, named personality_df
is created by extracting personality data from allData_df
and combining it with all_id
.
It contains the personality data for all 1200 participants.
personality_df = pd.concat([all_id,testing_personality],axis=1)
personality_df.head()
Subject | NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 100004 | 27 | 15 | 28 | 24 | 25 | A | D | SA | SA | ... | N | A | A | A | A | D | N | N | D | A |
2 | 100206 | 29 | 23 | 26 | 21 | 32 | D | A | A | SA | ... | N | A | A | D | N | D | N | D | A | N |
3 | 100307 | 37 | 24 | 35 | 15 | 37 | N | SA | A | A | ... | D | SA | A | SD | D | D | D | D | SD | A |
4 | 100408 | 33 | 29 | 34 | 15 | 33 | D | A | N | SA | ... | D | A | A | N | D | N | D | A | A | A |
5 | 100610 | 31 | 33 | 36 | 7 | 15 | SA | D | D | SA | ... | SD | N | SA | SD | D | A | SA | SA | D | A |
5 rows × 66 columns
The personality_subset
dataframe is meant to contain the personality data for the 100 participants.
personality_subset = pd.DataFrame(columns=personality_df.columns)
personality_subset
Subject | NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 |
---|
0 rows × 66 columns
Looping through the list id_Series_str
, which contains the subject ids of the 100 participants we want, the personality data for that respective participant subset_id
is extracted from the personality data of all 1200 participants.
The extracted value is added to personality_subset
for subset_id in id_Series_str:
subject_data = personality_df[personality_df['Subject'] == subset_id]
personality_subset = pd.concat([personality_subset, subject_data],axis=0)
personality_subset.head()
Subject | NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 100307 | 37 | 24 | 35 | 15 | 37 | N | SA | A | A | ... | D | SA | A | SD | D | D | D | D | SD | A |
4 | 100408 | 33 | 29 | 34 | 15 | 33 | D | A | N | SA | ... | D | A | A | N | D | N | D | A | A | A |
12 | 101915 | 35 | 30 | 45 | 8 | 31 | N | D | A | SA | ... | D | SA | SA | SD | SD | SD | D | D | D | SA |
19 | 102816 | 36 | 27 | 32 | 10 | 31 | A | D | N | SA | ... | SD | A | N | N | D | D | D | N | D | SA |
23 | 103414 | 27 | 30 | 31 | 20 | 34 | D | A | A | SA | ... | D | SA | A | D | D | D | N | N | D | A |
5 rows × 66 columns
personality_subset.shape
(100, 66)
The dataframe is then converted to a csv file and saved.
Note that index
argument is set to False for to_csv
method.
personality_subset.to_csv(r'/Users/rajdeep_ch/Documents/nma/project/personality_data.csv',index=False)
personality_df.head()
Subject | NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 100004 | 27 | 15 | 28 | 24 | 25 | A | D | SA | SA | ... | N | A | A | A | A | D | N | N | D | A |
2 | 100206 | 29 | 23 | 26 | 21 | 32 | D | A | A | SA | ... | N | A | A | D | N | D | N | D | A | N |
3 | 100307 | 37 | 24 | 35 | 15 | 37 | N | SA | A | A | ... | D | SA | A | SD | D | D | D | D | SD | A |
4 | 100408 | 33 | 29 | 34 | 15 | 33 | D | A | N | SA | ... | D | A | A | N | D | N | D | A | A | A |
5 | 100610 | 31 | 33 | 36 | 7 | 15 | SA | D | D | SA | ... | SD | N | SA | SD | D | A | SA | SA | D | A |
5 rows × 66 columns
Note that the subject 131924
does not have data for the Big Five Inventory questions but does have scores for the five personality traits.
personality_df[personality_df['Subject'] == '131924']
Subject | NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
178 | 131924 | 30 | 30 | 12 | 12 | 12 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 66 columns
personality_subset.head()
Subject | NEOFAC_A | NEOFAC_O | NEOFAC_C | NEOFAC_N | NEOFAC_E | NEORAW_01 | NEORAW_02 | NEORAW_03 | NEORAW_04 | ... | NEORAW_51 | NEORAW_52 | NEORAW_53 | NEORAW_54 | NEORAW_55 | NEORAW_56 | NEORAW_57 | NEORAW_58 | NEORAW_59 | NEORAW_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 100307 | 37 | 24 | 35 | 15 | 37 | N | SA | A | A | ... | D | SA | A | SD | D | D | D | D | SD | A |
4 | 100408 | 33 | 29 | 34 | 15 | 33 | D | A | N | SA | ... | D | A | A | N | D | N | D | A | A | A |
12 | 101915 | 35 | 30 | 45 | 8 | 31 | N | D | A | SA | ... | D | SA | SA | SD | SD | SD | D | D | D | SA |
19 | 102816 | 36 | 27 | 32 | 10 | 31 | A | D | N | SA | ... | SD | A | N | N | D | D | D | N | D | SA |
23 | 103414 | 27 | 30 | 31 | 20 | 34 | D | A | A | SA | ... | D | SA | A | D | D | D | N | N | D | A |
5 rows × 66 columns
Preliminarly going through the personality trait scores to check max, min.
neofac_a = [int(score1) for score1 in personality_subset.iloc[:,1]]
np.max(neofac_a)
46
neofac_b = [int(score2) for score2 in personality_subset.iloc[:,2]]
np.max(neofac_b)
44
neofac_c = [int(score3) for score3 in personality_subset.iloc[:,3]]
np.max(neofac_c)
48
neofac_d = [int(score4) for score4 in personality_subset.iloc[:,4]]
np.max(neofac_d)
36
neofac_e = [int(score5) for score5 in personality_subset.iloc[:,5]]
np.max(neofac_e)
45
np.max([int(score_val) for score_val in personality_df.iloc[:,5] if type(score_val) != float])
47