Maf
genie_registry.maf
Attributes
logger = logging.getLogger(__name__)
module-attribute
Classes
FileTypeFormat
Functions
__init__(syn, center, genie_config=None, ancillary_files=None)
A validator helper class for a center's files.
| PARAMETER | DESCRIPTION |
|---|---|
| syn | A synapseclient.Synapse object. |
| center | The participating center name. |
| genie_config | The configurations needed for the GENIE codebase. Maps GENIE table type/name to Synapse ID. Defaults to None. |
| ancillary_files | All files downloaded for validation. Defaults to None. |
read_file(filePathList)
Each file is to be read in for validation and processing. This is not to be changed in any functions.
| PARAMETER | DESCRIPTION |
|---|---|
| filePathList | A list of file paths (maximum of 2, for the two clinical files) |

| RETURNS | DESCRIPTION |
|---|---|
| df | Pandas dataframe of file |
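
As a rough illustration of reading a file into a dataframe, the sketch below loads a tab-delimited file with pandas. The standalone `read_file` function, the tab separator, and the `#` comment prefix are assumptions made for this example; the actual method may read files differently.

```python
import os
import tempfile

import pandas as pd


def read_file(file_path_list):
    """Read the first path in the list into a dataframe (assumed format)."""
    # Assumption: a tab-separated file whose comment lines start with '#'.
    return pd.read_csv(file_path_list[0], sep="\t", comment="#")


# Write a tiny example file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.txt")
    with open(example_path, "w") as handle:
        handle.write("#version 2.4\n")
        handle.write("REFERENCE_ALLELE\tTUMOR_SEQ_ALLELE2\n")
        handle.write("C\tA\n")
        handle.write("T\tC\n")
    df = read_file([example_path])

print(df.shape)  # (2, 2)
```

The comment line is skipped, so the header row becomes the column names and the two data rows remain.
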
validateFilename(filePath)
Validates the file name. The file name is what maps the file to its validation and processing.
| PARAMETER | DESCRIPTION |
|---|---|
| filePath | Path to file |

| RETURNS | DESCRIPTION |
|---|---|
| str | file type defined by self._fileType |
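
A minimal sketch of mapping a file name to a file type is shown below. The naming pattern, the `SAGE` center name, and the standalone `validate_filename` function are all hypothetical and chosen only for illustration; they are not taken from this page.

```python
import os


def validate_filename(file_path, center, file_type="maf"):
    """Return the file type when the base name matches the expected name."""
    # Assumption: a hypothetical naming convention that embeds the center name.
    expected = f"data_mutations_extended_{center}.txt"
    actual = os.path.basename(file_path)
    if actual != expected:
        raise ValueError(f"Expected a file named {expected}, got {actual}")
    return file_type


print(validate_filename("/data/data_mutations_extended_SAGE.txt", "SAGE"))  # maf
```

A file whose base name does not match raises `ValueError` in this sketch, which is one way to signal that the file does not belong to this file type.
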
process_steps(df, **kwargs)
This function is overridden for each file type. It reformats the file and stores it in the database and Synapse.
preprocess(newpath)
Handles any preprocessing of the entity name that is needed to add values to kwargs for processing. The entity name is included in the new path.
| PARAMETER | DESCRIPTION |
|---|---|
| newpath | Path to file |
process(filePath, **kwargs)
This is the main processing function.
| PARAMETER | DESCRIPTION |
|---|---|
| filePath | Path to file |
| kwargs | The kwargs are determined by self._process_kwargs |

| RETURNS | DESCRIPTION |
|---|---|
| str | file path of processed file |
validate(filePathList, **kwargs)
This is the main validation function. Every file type calls self._validate, which differs for each file type.
| PARAMETER | DESCRIPTION |
|---|---|
| filePathList | A list of file paths. |
| kwargs | The kwargs are determined by self._validation_kwargs |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | The errors and warnings as a file from validation. |
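
The relationship between `validate`, `self._validation_kwargs`, and `self._validate` can be sketched with a simplified stand-in. `FileTypeFormatSketch`, `MafSketch`, the `required_columns` keyword argument, and the error message text are hypothetical names invented for this example; the real classes may differ in how they read files and report results.

```python
import os
import tempfile


class FileTypeFormatSketch:
    """Simplified stand-in for the base class, for illustration only."""

    _fileType = "fileType"
    _validation_kwargs = []  # names of the keyword arguments _validate needs

    def read_file(self, file_path_list):
        # Stand-in reader: split the first file into rows of tab-separated cells.
        with open(file_path_list[0]) as handle:
            return [line.rstrip("\n").split("\t") for line in handle]

    def _validate(self, rows, **kwargs):
        # Each file type overrides this with its own checks.
        return "", ""

    def validate(self, file_path_list, **kwargs):
        # Pass along only the keyword arguments this file type declares.
        missing = [k for k in self._validation_kwargs if k not in kwargs]
        if missing:
            raise ValueError(f"Missing validation kwargs: {missing}")
        selected = {k: kwargs[k] for k in self._validation_kwargs}
        rows = self.read_file(file_path_list)
        return self._validate(rows, **selected)


class MafSketch(FileTypeFormatSketch):
    _fileType = "maf"
    _validation_kwargs = ["required_columns"]  # hypothetical kwarg

    def _validate(self, rows, required_columns):
        header = rows[0] if rows else []
        absent = [col for col in required_columns if col not in header]
        errors = f"Missing columns: {', '.join(absent)}\n" if absent else ""
        return errors, ""


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_maf.txt")
    with open(path, "w") as handle:
        handle.write("REFERENCE_ALLELE\tTUMOR_SEQ_ALLELE2\nC\tA\n")
    errors, warnings = MafSketch().validate(
        [path],
        required_columns=["REFERENCE_ALLELE", "TUMOR_SEQ_ALLELE1"],
        unused_option=True,  # ignored because it is not declared
    )

print(repr(errors))
```

Keeping the shared flow in the base class and the checks in `_validate` means each file type only has to declare which keyword arguments it needs and how to interpret them.
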
maf
Bases: FileTypeFormat
MAF file format validation / processing
Functions
_check_allele_col_validity(df)
This function checks specific columns in a MAF (Mutation Annotation Format) file for certain conditions.
The following conditions must be met.
If the MAF file has all three of these columns:
- TUMOR_SEQ_ALLELE1 (TSA1)
- TUMOR_SEQ_ALLELE2 (TSA2)
- REFERENCE_ALLELE (REF)

then one of the following must be true:
- Every value in TSA1 must be the same as the value in REF
- Every value in TSA1 must be the same as the value in TSA2

Additionally, if the MAF file has at least these two columns:
- REFERENCE_ALLELE (REF)
- TUMOR_SEQ_ALLELE2 (TSA2)

then no value in REF can match the TSA2 value in the same row.
These rules are important because Genome Nexus (GN) uses TSA1 to annotate data
when it's not clear which variant to use. So, there can't be a mix of rows where
some have TSA1 equal to REF and some have TSA1 equal to TSA2.
Valid Examples

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| C | C | A |
| T | T | C |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| C | A | A |
| T | C | C |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- |
| C | A |
| T | C |

Invalid Examples

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| C | C | A |
| C | A | A |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| A | C | A |
| T | C | T |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- |
| C | C |
| T | C |

See this Genome Nexus issue for more background regarding why this validation rule was implemented.
| PARAMETER | DESCRIPTION |
|---|---|
| df | input mutation dataframe |

| RETURNS | DESCRIPTION |
|---|---|
| str | the error message |
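
The allele-column rules described for `_check_allele_col_validity` can be sketched with pandas as follows. This is an illustration of the stated rules, not the library's implementation: the standalone function name and the error message wording are assumptions, and missing values are not handled specially.

```python
import pandas as pd

REF = "REFERENCE_ALLELE"
TSA1 = "TUMOR_SEQ_ALLELE1"
TSA2 = "TUMOR_SEQ_ALLELE2"


def check_allele_col_validity(df):
    """Return an error message, or an empty string when both rules pass."""
    errors = []
    has_ref = REF in df.columns
    has_tsa1 = TSA1 in df.columns
    has_tsa2 = TSA2 in df.columns

    # Rule 1: TSA1 must equal REF in every row, or equal TSA2 in every row.
    if has_ref and has_tsa1 and has_tsa2:
        all_match_ref = (df[TSA1] == df[REF]).all()
        all_match_tsa2 = (df[TSA1] == df[TSA2]).all()
        if not (all_match_ref or all_match_tsa2):
            # Hypothetical message wording.
            errors.append(f"{TSA1} must match {REF} or {TSA2} in every row.")

    # Rule 2: no row may have REF equal to TSA2.
    if has_ref and has_tsa2:
        if (df[REF] == df[TSA2]).any():
            # Hypothetical message wording.
            errors.append(f"{REF} must not equal {TSA2} in any row.")

    return "\n".join(errors)


valid_df = pd.DataFrame({REF: ["C", "T"], TSA1: ["C", "T"], TSA2: ["A", "C"]})
mixed_df = pd.DataFrame({REF: ["C", "C"], TSA1: ["C", "A"], TSA2: ["A", "A"]})

print(repr(check_allele_col_validity(valid_df)))  # ''
print(check_allele_col_validity(mixed_df))
```

The first frame passes because TSA1 equals REF in every row. The second fails the first rule because one row matches REF and the other matches TSA2.
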
_check_allele_col(df, col)
Check the Allele column is correctly formatted.
| PARAMETER | DESCRIPTION |
|---|---|
| df | mutation dataframe |
| col | Column header name |

| RETURNS | DESCRIPTION |
|---|---|
| | error, warning |