Vcf

genie_registry.vcf

Attributes

logger = logging.getLogger(__name__) module-attribute

Classes

FileTypeFormat

Functions
__init__(syn, center, genie_config=None, ancillary_files=None)

A validator helper class for a center's files.

PARAMETER DESCRIPTION
syn

a synapseclient.Synapse object

TYPE: Synapse

center

The participating center name.

TYPE: str

genie_config

The configuration needed for the GENIE codebase: a mapping of GENIE table type/name to Synapse ID. Defaults to None.

TYPE: dict DEFAULT: None

ancillary_files

All files downloaded for validation. Defaults to None.

TYPE: List[List[Entity]] DEFAULT: None

read_file(filePathList)

Reads in each file for validation and processing. This method is not meant to be changed by any other function.

PARAMETER DESCRIPTION
filePathList

A list of file paths (Max is 2 for the two clinical files)

RETURNS DESCRIPTION
df

Pandas dataframe of file

validateFilename(filePath)

Validation of file name. The filename is what maps the file to its validation and processing.

PARAMETER DESCRIPTION
filePath

Path to file

RETURNS DESCRIPTION
str

file type defined by self._fileType

process_steps(df, **kwargs)

This function is modified for each file type. It reformats the file and stores it in the database and Synapse.

preprocess(newpath)

Performs any preprocessing of the entity name that is needed to add values to kwargs for processing. The entity name is included in the new path.

PARAMETER DESCRIPTION
newpath

Path to file

process(filePath, **kwargs)

This is the main processing function.

PARAMETER DESCRIPTION
filePath

Path to file

kwargs

The kwargs are determined by self._process_kwargs

DEFAULT: {}

RETURNS DESCRIPTION
str

file path of processed file

validate(filePathList, **kwargs)

This is the main validation function. Every file type calls self._validate, whose implementation differs for each file type.

PARAMETER DESCRIPTION
filePathList

A list of file paths.

kwargs

The kwargs are determined by self._validation_kwargs

DEFAULT: {}

RETURNS DESCRIPTION
tuple

The errors and warnings produced by validating the file.

TYPE: ValidationResults
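
The relationship between validate and self._validate described above can be illustrated with a small, self-contained sketch. Everything here is hypothetical: ToyFileTypeFormat and ToyVcf are stand-ins written for illustration, not the genie classes, the toy reader returns a list of lines rather than a pandas dataframe, and the result is a plain (errors, warnings) tuple rather than a ValidationResults object.

```python
from typing import List, Tuple


class ToyFileTypeFormat:
    """Hypothetical stand-in for the base class; not the genie implementation."""

    _fileType = "fileType"

    def read_file(self, file_path_list: List[str]) -> List[str]:
        # Toy reader: return the lines of the first file in the list
        with open(file_path_list[0]) as handle:
            return handle.read().splitlines()

    def _validate(self, data: List[str]) -> Tuple[str, str]:
        # Hook that each file type overrides with its own checks
        return "", ""

    def validate(self, file_path_list: List[str]) -> Tuple[str, str]:
        # Shared entry point: read the file, then delegate to the hook
        data = self.read_file(file_path_list)
        return self._validate(data)


class ToyVcf(ToyFileTypeFormat):
    """Hypothetical file type with a single illustrative check."""

    _fileType = "vcf"

    def _validate(self, data: List[str]) -> Tuple[str, str]:
        errors = ""
        if not any(line.startswith("#CHROM") for line in data):
            errors += "vcf: a #CHROM header line is required.\n"
        return errors, ""
```

Calling ToyVcf().validate([path]) runs the shared entry point and returns whatever the subclass hook reports, which is the general shape the descriptions above suggest.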

vcf

Bases: FileTypeFormat

Functions
process_steps(df)

The processing of VCF files is specific to GENIE, so it is not included in this function.

validate_tumor_and_normal_sample_columns(input_df)

Validates that the expected tumor sample column and the optional normal sample column are present in the VCF, based on the number of columns in the file, and that these columns have no missing values.

Rules
  • VCFs can have at most 11 columns, including the 9 required columns.
  • An 11-column VCF is assumed to be a matched tumor-normal VCF file, so both a tumor sample column and a normal sample column should be present.
  • A 10-column VCF is assumed to be a single-sample VCF file, so a tumor sample column should be present.
  • Anything with fewer than 10 columns is INVALID, because at least a tumor sample column is required on top of the 9 required VCF columns.
  • If tumor sample and/or normal sample columns are present, they must not have any missing values.

VCF with Matched Tumor Normal columns

OTHER_VCF_COLUMNS   GENIE-GOLD-1-1-tumor   GENIE-GOLD-1-1-normal
...                 ...                    ...

VCFs with Single Sample Column

OTHER_VCF_COLUMNS   TUMOR
...                 ...

OTHER_VCF_COLUMNS   GENIE-GOLD-1-1
...                 ...

PARAMETER DESCRIPTION
input_df

input vcf data to be validated

TYPE: DataFrame

RETURNS DESCRIPTION
str

error message

TYPE: str
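
The column-count rules for validate_tumor_and_normal_sample_columns can be sketched as a simplified, standard-library-only check. This is an illustration under stated assumptions, not the genie_registry implementation: check_sample_columns is a hypothetical name, the input is a header list plus rows instead of a DataFrame, missing values are assumed to be None or an empty string, the error wording is made up, and any checks on the sample column names are omitted.

```python
from typing import List, Optional, Sequence

REQUIRED_COLUMN_COUNT = 9  # the 9 required VCF columns named in the rules
MAX_COLUMN_COUNT = 11  # required columns plus tumor and normal sample columns


def check_sample_columns(
    header: Sequence[str], rows: Sequence[Sequence[Optional[str]]]
) -> str:
    """Return an error message, or "" when the rules are satisfied."""
    n_cols = len(header)
    if n_cols > MAX_COLUMN_COUNT:
        return f"vcf: at most {MAX_COLUMN_COUNT} columns are allowed, found {n_cols}.\n"
    if n_cols <= REQUIRED_COLUMN_COUNT:
        return (
            "vcf: a tumor sample column is required in addition to "
            "the 9 required columns.\n"
        )

    errors: List[str] = []
    # Every column after the 9 required ones is treated as a sample column
    for idx in range(REQUIRED_COLUMN_COUNT, n_cols):
        # Assumption: None or "" marks a missing value
        if any(row[idx] is None or row[idx] == "" for row in rows):
            errors.append(
                f"vcf: sample column '{header[idx]}' must not have missing values.\n"
            )
    return "".join(errors)
```

A 10-column input with a fully populated sample column yields an empty string, while a 9-column input, a 12-column input, or a sample column containing a missing value each yield a non-empty message.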

Functions

contains_whitespace(row)

Gets the total number of whitespaces from each column of a row
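
One possible reading of this description is a per-row count of the string values that contain a space. The sketch below follows that reading. The name count_values_with_whitespace is hypothetical, and both the choice to count values rather than individual whitespace characters and the choice to look only for the space character are assumptions.

```python
def count_values_with_whitespace(row):
    """Count the string values in a row that contain a space character."""
    # Non-string values (for example integers or None) are skipped
    return sum(" " in value for value in row if isinstance(value, str))
```

Under this reading, a row such as ["chr 1", 100, "A", "T G"] has two values containing a space, so a result greater than zero flags the row as containing whitespace.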