Clinical
genie_registry.clinical
Clinical file format validation and processing
Functions
_check_year(clinicaldf, year_col, filename, allowed_string_values=None)
Check year columns
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical dataframe |
| year_col | YEAR column |
| filename | Name of file |
| allowed_string_values | List of other allowed string values |

| RETURNS | DESCRIPTION |
|---|---|
| str | Error message |
_check_int_dead_consistency(clinicaldf)
Check if vital status interval and dead column are consistent
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical data frame |

| RETURNS | DESCRIPTION |
|---|---|
| str | Error message if values are inconsistent, otherwise a blank string |
_check_int_year_consistency(clinicaldf, cols, string_vals)
Check if vital status interval and year columns are consistent in their values.
What determines text consistency:
- IF exactly one of the columns being checked is "Unknown" in a row, THEN the column with the "Unknown" value has to be the interval column
- OTHERWISE, the values in each row must be either all numeric or the SAME string value across all columns being checked
The following examples show invalid vs. valid text consistency.
Note: INT_CONTACT and YEAR_CONTACT are the columns being checked; INT_CONTACT is the interval column and YEAR_CONTACT is the year column.
VALID Examples

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| Unknown | 2012 |

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| Unknown | Unknown |

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| 2012 | 2012 |

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| Not Collected | Not Collected |

INVALID Examples

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| 2012 | Unknown |

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| 2012 | Not Collected |

| INT_CONTACT | YEAR_CONTACT |
|---|---|
| Not collected | Not Released |
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Input clinical data frame |
| cols | Columns in the clinical data frame to be checked for consistency |
| string_vals | String values that aren't integers |

| RETURNS | DESCRIPTION |
|---|---|
| str | Error message if values are inconsistent, otherwise a blank string |
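
The two text-consistency rules can be illustrated with a minimal row-level sketch. This is not the library's implementation: `is_numeric` and `row_is_text_consistent` are hypothetical helpers, rows are plain dicts rather than dataframe rows, and the interval column is passed in explicitly as an assumption.

```python
from typing import Sequence


def is_numeric(value) -> bool:
    """Return True if the value can be read as an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def row_is_text_consistent(row: dict, cols: Sequence[str], interval_col: str) -> bool:
    """Apply the two documented rules to a single row (hypothetical helper)."""
    values = [row[col] for col in cols]
    unknown_cols = [col for col in cols if row[col] == "Unknown"]
    # Rule 1: exactly one "Unknown" is only allowed in the interval column.
    if len(unknown_cols) == 1:
        return unknown_cols[0] == interval_col
    # Rule 2: otherwise all values are numeric, or all are the same string.
    if all(is_numeric(value) for value in values):
        return True
    no_numeric_values = not any(is_numeric(value) for value in values)
    return no_numeric_values and len(set(values)) == 1


cols = ["INT_CONTACT", "YEAR_CONTACT"]
valid_row = {"INT_CONTACT": "Unknown", "YEAR_CONTACT": 2012}
invalid_row = {"INT_CONTACT": 2012, "YEAR_CONTACT": "Unknown"}
valid_result = row_is_text_consistent(valid_row, cols, "INT_CONTACT")
invalid_result = row_is_text_consistent(invalid_row, cols, "INT_CONTACT")
```

Under these assumptions, `valid_result` is `True` and `invalid_result` is `False`, matching the first VALID and first INVALID examples.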
_check_year_death_validity(clinicaldf)
YEAR_DEATH should always be greater than or equal to YEAR_CONTACT when both are available. This function checks that YEAR_DEATH >= YEAR_CONTACT and returns the row indices of invalid YEAR_DEATH rows.
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical data frame |

| RETURNS | DESCRIPTION |
|---|---|
| Index | pd.Index: The row indices of the rows with YEAR_DEATH < YEAR_CONTACT in the input clinical data |
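
As a rough illustration of this kind of check, the sketch below finds rows where YEAR_DEATH < YEAR_CONTACT with pandas. The function name `find_invalid_year_death` is hypothetical, and treating non-numeric placeholders such as "Unknown" as "not available" is an assumption, not necessarily how the library handles them.

```python
import pandas as pd


def find_invalid_year_death(clinicaldf: pd.DataFrame) -> pd.Index:
    """Return indices of rows where YEAR_DEATH < YEAR_CONTACT (sketch)."""
    # Assumption: non-numeric placeholders (e.g. "Unknown") become NaN and
    # are skipped, since the rule only applies when both values are available.
    year_death = pd.to_numeric(clinicaldf["YEAR_DEATH"], errors="coerce")
    year_contact = pd.to_numeric(clinicaldf["YEAR_CONTACT"], errors="coerce")
    both_available = year_death.notna() & year_contact.notna()
    is_invalid = both_available & (year_death < year_contact)
    return clinicaldf.loc[is_invalid].index


example = pd.DataFrame(
    {
        "YEAR_DEATH": [2015, 2010, "Unknown"],
        "YEAR_CONTACT": [2012, 2012, 2012],
    }
)
invalid = find_invalid_year_death(example)
# invalid.tolist() gives [1]: only the second row has YEAR_DEATH < YEAR_CONTACT
```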
_check_year_death_validity_message(invalid_year_death_indices)
Returns the error and warning messages if the input clinical data has rows with YEAR_DEATH < YEAR_CONTACT
| PARAMETER | DESCRIPTION |
|---|---|
| invalid_year_death_indices | The row indices of the rows with YEAR_DEATH < YEAR_CONTACT in the input clinical data |

| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, str] | The error message that tells you how many patients with invalid YEAR_DEATH values your input clinical data has |
_check_int_dod_validity(clinicaldf)
INT_DOD should always be greater than or equal to INT_CONTACT when both are available. This function checks that INT_DOD >= INT_CONTACT and returns the row indices of invalid INT_DOD rows.
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical data frame |

| RETURNS | DESCRIPTION |
|---|---|
| Index | pd.Index: The row indices of the rows with INT_DOD < INT_CONTACT in the input clinical data |
_check_int_dod_validity_message(invalid_int_dod_indices)
Returns the error and warning messages if the input clinical data has rows with INT_DOD < INT_CONTACT
| PARAMETER | DESCRIPTION |
|---|---|
| invalid_int_dod_indices | The row indices of the rows with INT_DOD < INT_CONTACT in the input clinical data |

| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, str] | The error message that tells you how many patients with invalid INT_DOD values your input clinical data has |
remap_clinical_values(clinicaldf, sex_mapping, race_mapping, ethnicity_mapping, sampletype_mapping)
Remap clinical attributes from integer to string values
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical data |
| sex_mapping | Sex mapping data |
| race_mapping | Race mapping data |
| ethnicity_mapping | Ethnicity mapping data |
| sampletype_mapping | Sample type mapping data |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Mapped clinical dataframe |
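
To give a feel for what an integer-to-string remap looks like, here is a minimal sketch for a single column. Everything in it is an assumption for illustration: the helper `remap_column`, the mapping-table column names `CODE` and `CBIO_LABEL`, the example labels, and the choice to leave unmapped codes unchanged are not taken from this page.

```python
import pandas as pd


def remap_column(values: pd.Series, mapping: pd.DataFrame) -> pd.Series:
    """Replace integer codes with string labels (hypothetical helper)."""
    # Assumption: the mapping table has a CODE column and a CBIO_LABEL column.
    code_to_label = dict(zip(mapping["CODE"], mapping["CBIO_LABEL"]))
    # Codes missing from the mapping are left unchanged in this sketch.
    return values.map(lambda code: code_to_label.get(code, code))


sex_mapping = pd.DataFrame({"CODE": [1, 2], "CBIO_LABEL": ["Male", "Female"]})
clinicaldf = pd.DataFrame({"SEX": [1, 2, 99]})
clinicaldf["SEX"] = remap_column(clinicaldf["SEX"], sex_mapping)
# clinicaldf["SEX"].tolist() gives ["Male", "Female", 99]
```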
genie_registry.clinical.Clinical
Bases: FileTypeFormat
Attributes
_fileType = 'clinical'
class-attribute, instance-attribute
_process_kwargs = ['newPath', 'parentId', 'clinicalTemplate', 'sample', 'patient', 'patientCols', 'sampleCols']
class-attribute, instance-attribute
Functions
_validateFilename(filePath)
update_clinical(row)
Transform the values of each row of the clinical file
uploadMissingData(df, col, dbSynId, stagingSynId)
Uploads missing clinical samples / patients
| PARAMETER | DESCRIPTION |
|---|---|
| df | Dataframe with clinical data |
| col | Column in dataframe. Usually SAMPLE_ID or PATIENT_ID. |
| dbSynId | Synapse ID of the Synapse table |
| stagingSynId | Center Synapse staging ID |
_process(clinical, clinicalTemplate)
preprocess(newpath)
Gather preprocess parameters
| PARAMETER | DESCRIPTION |
|---|---|
| newpath | Path to file |

| RETURNS | DESCRIPTION |
|---|---|
| dict | Dictionary with keys 'clinicalTemplate', 'sample', 'patient', 'patientCols', 'sampleCols' |
process_steps(clinicalDf, newPath, parentId, clinicalTemplate, sample, patient, patientCols, sampleCols)
Process clinical file, redact PHI values, upload to clinical database
_validate_oncotree_code_mapping(clinicaldf, oncotree_mapping)
staticmethod
Checks that the oncotree codes in the input clinical data are valid oncotree codes from the official oncotree site
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical input data to validate |
| oncotree_mapping | Table of official oncotree mappings |

| RETURNS | DESCRIPTION |
|---|---|
| Index | pd.Index: row indices of unmapped oncotree codes in the input clinical data |
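
A membership check of this kind can be sketched with pandas as follows. The helper name `find_unmapped_oncotree_codes` is hypothetical, and the assumption that both tables carry the codes in a column named `ONCOTREE_CODE` is made for illustration only; any normalization the real method applies is omitted.

```python
import pandas as pd


def find_unmapped_oncotree_codes(
    clinicaldf: pd.DataFrame, oncotree_mapping: pd.DataFrame
) -> pd.Index:
    """Return row indices whose code is not in the mapping table (sketch)."""
    # Assumption: both tables store the codes in an ONCOTREE_CODE column.
    is_unmapped = ~clinicaldf["ONCOTREE_CODE"].isin(oncotree_mapping["ONCOTREE_CODE"])
    return clinicaldf.loc[is_unmapped].index


oncotree_mapping = pd.DataFrame({"ONCOTREE_CODE": ["LUAD", "BRCA"]})
clinical_example = pd.DataFrame({"ONCOTREE_CODE": ["LUAD", "NOT_A_CODE", "BRCA"]})
unmapped = find_unmapped_oncotree_codes(clinical_example, oncotree_mapping)
# unmapped.tolist() gives [1]
```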
_validate_oncotree_code_mapping_message(clinicaldf, unmapped_oncotree_indices)
staticmethod
Returns the error and warning messages if the input clinical data has rows with unmapped oncotree codes
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Input clinical data |
| unmapped_oncotree_indices | Row indices of the input clinical data with unmapped oncotree codes |

| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, str] | Error message that tells you how many samples AND the unique unmapped oncotree codes that your input clinical data has |
_validate_sample_class_and_type(clinicaldf, sampletype_mapping)
Validates that the values of SAMPLE_CLASS and SAMPLE_TYPE in the clinical data are consistent and returns an error message describing any error(s).
The following conditions must be met:
- When SAMPLE_CLASS is cfDNA, SAMPLE_TYPE must be 8
- When SAMPLE_TYPE is 8, SAMPLE_CLASS must be cfDNA
Valid Examples

| SAMPLE_TYPE | SAMPLE_CLASS |
|---|---|
| 8 | cfDNA |
| 8 | cfDNA |

| SAMPLE_TYPE | SAMPLE_CLASS |
|---|---|
| 8 | cfDNA |
| 8 | cfDNA |
| 2 | Tumor |

Invalid Examples

| SAMPLE_TYPE | SAMPLE_CLASS |
|---|---|
| 8 | Other |
| 2 | cfDNA |

| SAMPLE_TYPE | SAMPLE_CLASS |
|---|---|
| 8 | cfDNA |
| 8 | Other |

| SAMPLE_TYPE | SAMPLE_CLASS |
|---|---|
| 8 | NaN |
| 2 | cfDNA |
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Input clinical data |
| sampletype_mapping | Sample type mapping table containing the mappings of the SAMPLE_TYPE to the value behind it. This is used to check the SAMPLE_TYPE and SAMPLE_CLASS values. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The concatenated error message(s) |
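
The two conditions amount to "SAMPLE_CLASS is cfDNA if and only if SAMPLE_TYPE is 8". A minimal sketch of that rule follows; `find_class_type_mismatches` is a hypothetical helper, and it compares against the raw code 8 directly instead of resolving it through `sampletype_mapping`, which is a simplifying assumption.

```python
import pandas as pd


def find_class_type_mismatches(clinicaldf: pd.DataFrame) -> pd.Index:
    """Return indices of rows that break the cfDNA <-> 8 rule (sketch)."""
    is_cfdna = clinicaldf["SAMPLE_CLASS"] == "cfDNA"  # NaN compares as False
    is_type_8 = clinicaldf["SAMPLE_TYPE"] == 8
    # A row is consistent only when both flags agree.
    return clinicaldf.loc[is_cfdna != is_type_8].index


valid_example = pd.DataFrame(
    {"SAMPLE_TYPE": [8, 8, 2], "SAMPLE_CLASS": ["cfDNA", "cfDNA", "Tumor"]}
)
invalid_example = pd.DataFrame(
    {"SAMPLE_TYPE": [8, 2], "SAMPLE_CLASS": [float("nan"), "cfDNA"]}
)
valid_mismatches = find_class_type_mismatches(valid_example)
invalid_mismatches = find_class_type_mismatches(invalid_example)
# valid_mismatches is empty; invalid_mismatches.tolist() gives [0, 1]
```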
_validate(clinicaldf)
Validates the clinical file to make sure it adheres to the clinical SOP.
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Merged clinical file with patient and sample information |

| RETURNS | DESCRIPTION |
|---|---|
| | Error message |
Note
SAMPLE_CLASS is a required column and must contain only 'Tumor' or 'cfDNA' values.
_get_dataframe(filePathList)
_cross_validate_bed_files_exist(clinicaldf)
Check that a bed file exists for each SEQ_ASSAY_ID value in the clinical file
_cross_validate_bed_files_exist_message(missing_bed_files)
Gets the warning/error messages given the missing bed files list
| PARAMETER | DESCRIPTION |
|---|---|
| missing_bed_files | List of missing bed files |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | Error + warning |
_cross_validate_assay_info_has_seq(clinicaldf)
Cross-validates that the assay information file has all the SEQ_ASSAY_IDs present in the clinical file.
TODO: Refactor this function (similar to _cross_validate in maf) once the clinical files have been taken care of so that it can be generalized to any file type
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Input clinical data |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | Errors and warnings |
_cross_validate(clinicaldf)
Cross-validation for clinical file(s)