Clinical

genie_registry.clinical

Clinical file format validation and processing

Functions

_check_year(clinicaldf, year_col, filename, allowed_string_values=None)

Check year columns

PARAMETER DESCRIPTION
clinicaldf

Clinical dataframe

TYPE: DataFrame

year_col

YEAR column

TYPE: int

filename

Name of file

TYPE: str

allowed_string_values

list of other allowed string values

TYPE: Optional[list] DEFAULT: None

RETURNS DESCRIPTION
str

Error message

_check_int_dead_consistency(clinicaldf)

Check if vital status interval and dead column are consistent

PARAMETER DESCRIPTION
clinicaldf

Clinical Data Frame

TYPE: DataFrame

RETURNS DESCRIPTION
str

Error message if values are inconsistent, or a blank string

_check_int_year_consistency(clinicaldf, cols, string_vals)

Check if vital status interval and year columns are consistent in their values.

What determines text consistency
  • IF exactly one of the columns being checked is "Unknown" in a row, THEN the column holding the "Unknown" value must be the interval column
  • OTHERWISE the values in each row must be either all numeric or the SAME string value across all columns being checked

Here we will show examples of invalid vs valid text consistency:

Note: INT_CONTACT and YEAR_CONTACT are the columns being checked, and INT_CONTACT
is the interval column while YEAR_CONTACT is the year column.

VALID Examples

INT_CONTACT      YEAR_CONTACT
Unknown          2012

INT_CONTACT      YEAR_CONTACT
Unknown          Unknown

INT_CONTACT      YEAR_CONTACT
2012             2012

INT_CONTACT      YEAR_CONTACT
Not Collected    Not Collected

INVALID Examples

INT_CONTACT      YEAR_CONTACT
2012             Unknown

INT_CONTACT      YEAR_CONTACT
2012             Not Collected

INT_CONTACT      YEAR_CONTACT
Not collected    Not Released

PARAMETER DESCRIPTION
clinicaldf

input Clinical Data Frame

TYPE: DataFrame

cols

Columns in the clinical data frame to be checked for consistency

TYPE: List[str]

string_vals

String values that aren't integers

TYPE: List[str]

RETURNS DESCRIPTION
str

Error message if values are inconsistent or blank string
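
The text-consistency rule above can be sketched for a single row as follows. This is an illustrative reading of the documented rule, not the library's implementation: `is_text_consistent` and `STRING_VALS` are hypothetical names, and any value not listed in `string_vals` is assumed to be numeric.

```python
from typing import List

import pandas as pd

# Hypothetical list of allowed non-numeric values, for illustration only.
STRING_VALS = ["Unknown", "Not Collected", "Not Released"]


def is_text_consistent(row: pd.Series, interval_col: str, string_vals: List[str]) -> bool:
    """Return True if one row satisfies the text-consistency rule."""
    values = [str(v) for v in row]
    unknown_cols = [col for col, v in zip(row.index, values) if v == "Unknown"]
    # Exactly one "Unknown": it has to sit in the interval column.
    if len(unknown_cols) == 1:
        return unknown_cols[0] == interval_col
    # Otherwise: all values numeric (assumed: not in string_vals),
    # or all values the same string.
    is_string = [v in string_vals for v in values]
    if not any(is_string):
        return True
    return all(is_string) and len(set(values)) == 1


df = pd.DataFrame(
    {
        "INT_CONTACT": ["Unknown", "2012", "2012", "Not Collected"],
        "YEAR_CONTACT": ["2012", "2012", "Unknown", "Not Released"],
    }
)
consistent = df.apply(
    is_text_consistent, axis=1, interval_col="INT_CONTACT", string_vals=STRING_VALS
)
print(consistent.tolist())  # [True, True, False, False]
```

Checking one row at a time keeps the rule readable. A vectorized version would be faster on large files but harder to compare against the examples above.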

_check_year_death_validity(clinicaldf)

YEAR_DEATH should always be greater than or equal to YEAR_CONTACT when both are available. This function checks whether YEAR_DEATH >= YEAR_CONTACT and returns the row indices of rows with an invalid YEAR_DEATH.

PARAMETER DESCRIPTION
clinicaldf

Clinical Data Frame

TYPE: DataFrame

RETURNS DESCRIPTION
Index

pd.Index: The row indices of the rows with YEAR_DEATH < YEAR_CONTACT in the input clinical data
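
A minimal sketch of the check described above, assuming that "available" means both values parse as numbers. `find_invalid_year_death` is a hypothetical name, and the real function may handle placeholder values differently.

```python
import pandas as pd


def find_invalid_year_death(clinicaldf: pd.DataFrame) -> pd.Index:
    """Return indices of rows where YEAR_DEATH < YEAR_CONTACT."""
    # Non-numeric placeholders (e.g. "Unknown") become NaN. Comparisons
    # with NaN are False, so only rows with both years present are flagged.
    year_death = pd.to_numeric(clinicaldf["YEAR_DEATH"], errors="coerce")
    year_contact = pd.to_numeric(clinicaldf["YEAR_CONTACT"], errors="coerce")
    return clinicaldf.loc[year_death < year_contact].index


example = pd.DataFrame(
    {
        "YEAR_DEATH": [2010, "Unknown", 2015, 2001],
        "YEAR_CONTACT": [2012, 2012, 2015, "Unknown"],
    }
)
invalid = find_invalid_year_death(example)
print(list(invalid))  # [0]
```

The same pattern applies to the INT_DOD and INT_CONTACT pair by swapping the column names.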

_check_year_death_validity_message(invalid_year_death_indices)

This function returns the error and warning messages if the input clinical data has rows with YEAR_DEATH < YEAR_CONTACT

PARAMETER DESCRIPTION
invalid_year_death_indices

The row indices of the rows with YEAR_DEATH < YEAR_CONTACT in the input clinical data

TYPE: Index

RETURNS DESCRIPTION
Tuple[str, str]

The error and warning messages stating how many patients in your input clinical data have invalid YEAR_DEATH values

_check_int_dod_validity(clinicaldf)

INT_DOD should always be greater than or equal to INT_CONTACT when both are available. This function checks whether INT_DOD >= INT_CONTACT and returns the row indices of rows with an invalid INT_DOD.

PARAMETER DESCRIPTION
clinicaldf

Clinical Data Frame

TYPE: DataFrame

RETURNS DESCRIPTION
Index

pd.Index: The row indices of the rows with INT_DOD < INT_CONTACT in the input clinical data

_check_int_dod_validity_message(invalid_int_dod_indices)

This function returns the error and warning messages if the input clinical data has rows with INT_DOD < INT_CONTACT

PARAMETER DESCRIPTION
invalid_int_dod_indices

The row indices of the rows with INT_DOD < INT_CONTACT in the input clinical data

TYPE: Index

RETURNS DESCRIPTION
Tuple[str, str]

The error and warning messages stating how many patients in your input clinical data have invalid INT_DOD values

remap_clinical_values(clinicaldf, sex_mapping, race_mapping, ethnicity_mapping, sampletype_mapping)

Remap clinical attributes from integer to string values

PARAMETER DESCRIPTION
clinicaldf

Clinical data

TYPE: DataFrame

sex_mapping

Sex mapping data

TYPE: DataFrame

race_mapping

Race mapping data

TYPE: DataFrame

ethnicity_mapping

Ethnicity mapping data

TYPE: DataFrame

sampletype_mapping

Sample type mapping data

RETURNS DESCRIPTION
DataFrame

Mapped clinical dataframe
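
The remapping of integer codes to string values can be sketched for one column as below. The layout of the mapping data frames is not described here, so the `CODE` and `LABEL` column names, the `remap_column` helper, and the sample labels are all assumptions made for illustration.

```python
import pandas as pd


def remap_column(
    clinicaldf: pd.DataFrame,
    column: str,
    mapping: pd.DataFrame,
    code_col: str = "CODE",    # assumed name of the integer-code column
    label_col: str = "LABEL",  # assumed name of the string-label column
) -> pd.DataFrame:
    """Return a copy of clinicaldf with one column remapped via a lookup table."""
    lookup = dict(zip(mapping[code_col], mapping[label_col]))
    remapped = clinicaldf.copy()
    # Values without an entry in the lookup are left unchanged.
    remapped[column] = remapped[column].map(lambda value: lookup.get(value, value))
    return remapped


clinical = pd.DataFrame({"SEX": [1, 2, 99]})
sex_mapping = pd.DataFrame({"CODE": [1, 2], "LABEL": ["Male", "Female"]})
mapped = remap_column(clinical, "SEX", sex_mapping)
print(mapped["SEX"].tolist())  # ['Male', 'Female', 99]
```

Working on a copy leaves the input data frame untouched, which makes it easy to compare values before and after remapping.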

genie_registry.clinical.Clinical

Bases: FileTypeFormat

Attributes

_fileType = 'clinical' class-attribute instance-attribute

_process_kwargs = ['newPath', 'parentId', 'clinicalTemplate', 'sample', 'patient', 'patientCols', 'sampleCols'] class-attribute instance-attribute

Functions

_validateFilename(filePath)

update_clinical(row)

Transform the values of each row of the clinical file

uploadMissingData(df, col, dbSynId, stagingSynId)

Uploads missing clinical samples / patients

PARAMETER DESCRIPTION
df

dataframe with clinical data

TYPE: DataFrame

col

column in dataframe. Usually SAMPLE_ID or PATIENT_ID.

TYPE: str

dbSynId

Synapse ID of the Synapse table

TYPE: str

stagingSynId

Center Synapse staging Id

TYPE: str

_process(clinical, clinicalTemplate)

preprocess(newpath)

Gather preprocess parameters

PARAMETER DESCRIPTION
newpath

Path to file

RETURNS DESCRIPTION

dict with keys - 'clinicalTemplate', 'sample', 'patient', 'patientCols', 'sampleCols'

process_steps(clinicalDf, newPath, parentId, clinicalTemplate, sample, patient, patientCols, sampleCols)

Process clinical file, redact PHI values, upload to clinical database

_validate_oncotree_code_mapping(clinicaldf, oncotree_mapping) staticmethod

Checks that the oncotree codes in the input clinical data are valid oncotree codes from the official oncotree site

PARAMETER DESCRIPTION
clinicaldf

clinical input data to validate

TYPE: DataFrame

oncotree_mapping

table of official oncotree mappings

TYPE: DataFrame

RETURNS DESCRIPTION
Index

pd.Index: row indices of unmapped oncotree codes in the input clinical data
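
A minimal sketch of the lookup described above, assuming both data frames expose the codes in a column named `ONCOTREE_CODE`. The column name and the `find_unmapped_codes` helper are assumptions, and the actual function may normalize codes before comparing them.

```python
import pandas as pd


def find_unmapped_codes(
    clinicaldf: pd.DataFrame,
    oncotree_mapping: pd.DataFrame,
    code_col: str = "ONCOTREE_CODE",  # assumed column name in both frames
) -> pd.Index:
    """Return indices of rows whose code is absent from the mapping table."""
    is_mapped = clinicaldf[code_col].isin(oncotree_mapping[code_col])
    return clinicaldf.loc[~is_mapped].index


clinical_codes = pd.DataFrame({"ONCOTREE_CODE": ["LUAD", "XXXX", "BRCA"]})
official_codes = pd.DataFrame({"ONCOTREE_CODE": ["LUAD", "BRCA"]})
unmapped = find_unmapped_codes(clinical_codes, official_codes)
print(list(unmapped))  # [1]
```

Returning indices rather than the codes themselves lets a separate message function report both the number of affected samples and the unique unmapped codes.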

_validate_oncotree_code_mapping_message(clinicaldf, unmapped_oncotree_indices) staticmethod

This function returns the error and warning messages if the input clinical data has row indices with unmapped oncotree codes

PARAMETER DESCRIPTION
clinicaldf

input clinical data

TYPE: DataFrame

unmapped_oncotree_indices

row indices of the input clinical data with unmapped oncotree codes

TYPE: DataFrame

RETURNS DESCRIPTION
Tuple[str, str]

Error and warning messages stating how many samples in your input clinical data have unmapped oncotree codes, along with the unique unmapped codes

_validate_sample_class_and_type(clinicaldf, sampletype_mapping)

Validates that the values of SAMPLE_CLASS and SAMPLE_TYPE in the clinical data are consistent and returns an error message describing any inconsistencies.

The following conditions must be met
  • When SAMPLE_CLASS is cfDNA, SAMPLE_TYPE must be 8
  • When SAMPLE_TYPE is 8, SAMPLE_CLASS must be cfDNA

Valid Examples

SAMPLE_TYPE    SAMPLE_CLASS
8              cfDNA
8              cfDNA

SAMPLE_TYPE    SAMPLE_CLASS
8              cfDNA
8              cfDNA
2              Tumor

Invalid Examples

SAMPLE_TYPE    SAMPLE_CLASS
8              Other
2              cfDNA

SAMPLE_TYPE    SAMPLE_CLASS
8              cfDNA
8              Other

SAMPLE_TYPE    SAMPLE_CLASS
8              NaN
2              cfDNA

PARAMETER DESCRIPTION
clinicaldf

input clinical data

TYPE: DataFrame

sampletype_mapping

sample type mapping table containing the mappings of the SAMPLE_TYPE to the value behind it. This is used to check the SAMPLE_TYPE and SAMPLE_CLASS values.

TYPE: DataFrame

RETURNS DESCRIPTION
str

The concatenated error message(s)

TYPE: str
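
The two conditions above amount to "SAMPLE_CLASS is cfDNA if and only if SAMPLE_TYPE is 8". A rough sketch of that rule follows. It compares the raw code 8 directly instead of going through `sampletype_mapping`, and `find_class_type_mismatches` is a hypothetical helper, so treat it as an approximation of the documented behavior.

```python
import pandas as pd


def find_class_type_mismatches(clinicaldf: pd.DataFrame) -> pd.Index:
    """Return indices of rows where exactly one of the two conditions holds."""
    is_cfdna = clinicaldf["SAMPLE_CLASS"] == "cfDNA"
    is_type_8 = pd.to_numeric(clinicaldf["SAMPLE_TYPE"], errors="coerce") == 8
    # A row is inconsistent when one flag is True and the other is False.
    return clinicaldf.loc[is_cfdna != is_type_8].index


samples = pd.DataFrame(
    {
        "SAMPLE_TYPE": [8, 8, 2, 2, 8],
        "SAMPLE_CLASS": ["cfDNA", "Other", "cfDNA", "Tumor", float("nan")],
    }
)
mismatches = find_class_type_mismatches(samples)
print(list(mismatches))  # [1, 2, 4]
```

A missing SAMPLE_CLASS compares as not equal to cfDNA, so a row with SAMPLE_TYPE 8 and no class is flagged, matching the last invalid example.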

_validate(clinicaldf)

This function validates the clinical file to make sure it adheres to the clinical SOP.

PARAMETER DESCRIPTION
clinicaldf

Merged clinical file with patient and sample information

RETURNS DESCRIPTION

Error message

Note

SAMPLE_CLASS is a required column and must contain only 'Tumor' or 'cfDNA' values.

_get_dataframe(filePathList)

_cross_validate_bed_files_exist(clinicaldf)

Check that a bed file exists for each SEQ_ASSAY_ID value in the clinical file

_cross_validate_bed_files_exist_message(missing_bed_files)

Gets the warning/error messages given the missing bed files list

PARAMETER DESCRIPTION
missing_bed_files

list of missing bed files

TYPE: list

RETURNS DESCRIPTION
tuple

error + warning

TYPE: tuple

_cross_validate_assay_info_has_seq(clinicaldf)

Cross-validates that the assay information file has all the SEQ_ASSAY_IDs present in the clinical file.

TODO: Refactor this function (similar to _cross_validate in maf) once the clinical files have been taken care of so that it can be generalized to any file type

PARAMETER DESCRIPTION
clinicaldf

input clinical data

TYPE: DataFrame

RETURNS DESCRIPTION
tuple

errors and warnings

TYPE: tuple
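
The core comparison behind this cross-validation can be sketched as a set difference. How the assay information file is located and loaded is not covered here, so this sketch assumes it is already available as a data frame with a `SEQ_ASSAY_ID` column. `missing_seq_assay_ids` and the sample IDs are hypothetical.

```python
from typing import List

import pandas as pd


def missing_seq_assay_ids(clinicaldf: pd.DataFrame, assay_info_df: pd.DataFrame) -> List[str]:
    """Return SEQ_ASSAY_IDs found in the clinical data but not in the assay info."""
    clinical_ids = set(clinicaldf["SEQ_ASSAY_ID"].dropna())
    assay_ids = set(assay_info_df["SEQ_ASSAY_ID"].dropna())
    return sorted(clinical_ids - assay_ids)


clinical_ids_df = pd.DataFrame({"SEQ_ASSAY_ID": ["SITE-PANEL-1", "SITE-PANEL-2", "SITE-PANEL-1"]})
assay_ids_df = pd.DataFrame({"SEQ_ASSAY_ID": ["SITE-PANEL-1"]})
missing_ids = missing_seq_assay_ids(clinical_ids_df, assay_ids_df)
print(missing_ids)  # ['SITE-PANEL-2']
```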

_cross_validate(clinicaldf)

Cross-validation for clinical file(s)