Vcf
genie_registry.vcf
Attributes
logger = logging.getLogger(__name__)
module-attribute
Classes
FileTypeFormat
Functions
__init__(syn, center, genie_config=None, ancillary_files=None)
A validator helper class for a center's files.
| PARAMETER | DESCRIPTION |
|---|---|
| syn | A synapseclient.Synapse object. |
| center | The participating center name. |
| genie_config | The configurations needed for the GENIE codebase: a mapping of GENIE table type/name to Synapse ID. Defaults to None. |
| ancillary_files | All files downloaded for validation. Defaults to None. |
read_file(filePathList)
Reads in each file for validation and processing. This method is not meant to be changed by any file type.
| PARAMETER | DESCRIPTION |
|---|---|
| filePathList | A list of file paths (maximum of 2, for the two clinical files). |

| RETURNS | DESCRIPTION |
|---|---|
| df | Pandas dataframe of the file. |
validateFilename(filePath)
Validates the file name. The file name is what maps the file to its validation and processing.
| PARAMETER | DESCRIPTION |
|---|---|
| filePath | Path to file. |

| RETURNS | DESCRIPTION |
|---|---|
| str | File type defined by self._fileType. |
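The mapping from file name to file type described above can be illustrated with a minimal sketch. The class `VcfNameSketch`, the helper `_validateFilename`, and the `GENIE-<center>-*.vcf` naming rule are all assumptions made for illustration; this page does not state the actual naming convention.

```python
import os


class VcfNameSketch:
    """Hypothetical stand-in for a file type class; not the genie_registry code."""

    _fileType = "vcf"

    def __init__(self, center):
        self.center = center

    def _validateFilename(self, filePath):
        # Assumed rule: the name starts with "GENIE-<center>-" and ends with ".vcf".
        basename = os.path.basename(filePath)
        prefix = f"GENIE-{self.center}-"
        if not (basename.startswith(prefix) and basename.endswith(".vcf")):
            raise AssertionError(f"{basename} does not match the assumed VCF naming rule")

    def validateFilename(self, filePath):
        # Run the file-type-specific check, then report the file type.
        self._validateFilename(filePath)
        return self._fileType
```

Under these assumptions, `VcfNameSketch("SAGE").validateFilename("/tmp/GENIE-SAGE-1.vcf")` returns `"vcf"`, while a name that does not match raises an `AssertionError`.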
process_steps(df, **kwargs)
This function is overridden for each file type. It reformats the file and stores it in the database and Synapse.
preprocess(newpath)
Performs any preprocessing that has to be applied to the entity name in order to add values to kwargs for processing. The entity name is included in the new path.
| PARAMETER | DESCRIPTION |
|---|---|
| newpath | Path to file. |
process(filePath, **kwargs)
This is the main processing function.
| PARAMETER | DESCRIPTION |
|---|---|
| filePath | Path to file. |
| kwargs | The kwargs are determined by self._process_kwargs. |

| RETURNS | DESCRIPTION |
|---|---|
| str | File path of the processed file. |
validate(filePathList, **kwargs)
This is the main validation function. Every file type calls self._validate, whose implementation differs per file type.
| PARAMETER | DESCRIPTION |
|---|---|
| filePathList | A list of file paths. |
| kwargs | The kwargs are determined by self._validation_kwargs. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | The errors and warnings from validating the file. |
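The relationship between `validate` and `self._validate` follows a template-method pattern, which can be sketched as below. The classes `FileTypeSketch` and `VcfSketch`, the error text, and the choice to pass already-read rows instead of file paths are assumptions for illustration only; the real method also reads the files first.

```python
class FileTypeSketch:
    """Hypothetical base class illustrating the pattern; not the genie_registry code."""

    # Names of keyword arguments that validate() expects to receive.
    _validation_kwargs = []

    def _validate(self, rows, **kwargs):
        # Default: no checks, so no errors and no warnings.
        return "", ""

    def validate(self, rows, **kwargs):
        # Make sure every declared keyword argument was supplied.
        missing = [key for key in self._validation_kwargs if key not in kwargs]
        if missing:
            raise ValueError(f"Missing validation kwargs: {missing}")
        # Delegate to the file-type-specific checks.
        errors, warnings = self._validate(rows, **kwargs)
        return errors, warnings


class VcfSketch(FileTypeSketch):
    def _validate(self, rows, **kwargs):
        # Hypothetical check: an empty file is reported as an error.
        errors = "" if rows else "VCF: file is empty.\n"
        return errors, ""
```

The shared `validate` entry point stays the same for every file type, and only `_validate` changes in each subclass.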
vcf
Bases: FileTypeFormat
Functions
process_steps(df)
The processing of VCF files is specific to GENIE, so it is not included in this function.
validate_tumor_and_normal_sample_columns(input_df)
Validates that the expected tumor sample column and the optional normal sample column are present in the VCF, based on the number of columns in the file, and that they have no missing values.
Rules
- A VCF can have at most 11 columns, including the 9 required columns.
- An 11-column VCF is assumed to be a matched tumor-normal VCF file, so both a tumor sample column and a normal sample column should be present.
- A 10-column VCF is assumed to be a single-sample VCF file, so a tumor sample column should be present.
- A VCF with fewer than 10 columns is INVALID, because it must have at least a tumor sample column in addition to the 9 required VCF columns.
- If the tumor sample and/or normal sample columns are present, they must not have any missing values.
VCF with Matched Tumor Normal columns
| OTHER_VCF_COLUMNS | GENIE-GOLD-1-1-tumor | GENIE-GOLD-1-1-normal |
|---|---|---|
| ... | ... | ... |
VCFs with Single Sample Column
| OTHER_VCF_COLUMNS | TUMOR |
|---|---|
| ... | ... |
| OTHER_VCF_COLUMNS | GENIE-GOLD-1-1 |
|---|---|
| ... | ... |
| PARAMETER | DESCRIPTION |
|---|---|
| input_df | Input VCF data to be validated. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Error message. |
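The column-count rules above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the real function takes a pandas dataframe, whereas the hypothetical `check_sample_columns` below takes a list of column names and a list of row dictionaries. The error messages are made up, and the sketch assumes the 9 required columns come first.

```python
# The 9 fixed columns defined by the VCF format.
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def check_sample_columns(columns, rows):
    """Return an error message, or "" when the sample columns look valid.

    columns: column names in file order (assumed: the 9 required columns first).
    rows: list of dicts mapping column name to value; None or "" counts as missing.
    """
    n_columns = len(columns)
    if n_columns > 11:
        return "VCF: must not have more than 11 columns.\n"
    if n_columns < 10:
        return "VCF: must have at least a tumor sample column after the 9 required columns.\n"
    # 10 columns: tumor only. 11 columns: tumor and normal.
    errors = ""
    for name in columns[len(REQUIRED_COLUMNS):]:
        if any(row.get(name) in (None, "") for row in rows):
            errors += f"VCF: sample column '{name}' must not have missing values.\n"
    return errors
```

For example, a 10-column input with a fully populated `TUMOR` column yields an empty string, while an 11-column input with a blank value in the normal column yields a message naming that column.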
Functions
contains_whitespace(row)
Gets the total number of whitespaces from each column of a row.
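One possible reading of this description is a count of how many string values in a row contain a space. The sketch below follows that reading; the name `count_values_with_whitespace` is hypothetical, and the sketch is not presented as the library's implementation.

```python
def count_values_with_whitespace(row):
    """Count the string values in a row that contain at least one space.

    Non-string values (for example integers) are ignored.
    """
    return sum(" " in value for value in row if isinstance(value, str))
```

A helper like this could be applied to each row of a dataframe, for instance with `df.apply(count_values_with_whitespace, axis=1)`, to find rows that contain values with stray spaces.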