Maf

genie_registry.maf

Attributes

logger = logging.getLogger(__name__) module-attribute

Classes

FileTypeFormat

Functions
__init__(syn, center, genie_config=None, ancillary_files=None)

A validator helper class for a center's files.

PARAMETER DESCRIPTION
syn

a synapseclient.Synapse object

TYPE: Synapse

center

The participating center name.

TYPE: str

genie_config

The configuration needed by the GENIE codebase: a mapping from GENIE table type/name to Synapse ID. Defaults to None.

TYPE: dict DEFAULT: None

ancillary_files

All files downloaded for validation. Defaults to None.

TYPE: List[List[Entity]] DEFAULT: None

read_file(filePathList)

Reads in each file for validation and processing. This is not to be changed in any other function.

PARAMETER DESCRIPTION
filePathList

A list of file paths (at most 2, for the two clinical files).

RETURNS DESCRIPTION
df

Pandas dataframe of file
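As a rough illustration of the contract described above (a list of paths in, one dataframe out), a reader could look like the sketch below. This is not the library's implementation: the function name `read_tab_delimited`, the tab separator, and the `#` comment prefix are all assumptions made for the example.

```python
import pandas as pd


def read_tab_delimited(file_path_list):
    """Read the first path in the list into a pandas dataframe.

    Assumes a tab-separated text file whose comment lines start with '#'.
    Hypothetical helper; only the first path is read in this sketch.
    """
    return pd.read_csv(file_path_list[0], sep="\t", comment="#")
```

A file type that accepts two paths (such as the two clinical files mentioned above) would need its own logic for combining them; that is left out here.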

validateFilename(filePath)

Validation of file name. The filename is what maps the file to its validation and processing.

PARAMETER DESCRIPTION
filePath

Path to file

RETURNS DESCRIPTION
str

file type defined by self._fileType
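The idea that the file name selects the validation and processing path can be sketched as follows. The function `validate_filename`, its `center` and `file_type` arguments, and the expected name pattern are hypothetical and chosen only for illustration; the real naming rules live in each file type's own implementation.

```python
import os


def validate_filename(file_path, center, file_type="maf"):
    """Return the file type if the file name matches the expected pattern.

    The pattern below is an assumed naming convention, not a documented one.
    """
    expected = f"data_mutations_extended_{center}.txt"
    actual = os.path.basename(file_path)
    if actual != expected:
        raise ValueError(f"Expected file name {expected!r}, got {actual!r}")
    return file_type
```

Returning the file type on success mirrors the documented return value (the type defined by `self._fileType`), so the caller can route the file to the matching validator.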

process_steps(df, **kwargs)

This function is overridden for each file type. It reformats the file and stores it in the database and in Synapse.

preprocess(newpath)

Performs any preprocessing of the entity name whose results need to be added to kwargs for processing. The entity name is included in the new path.

PARAMETER DESCRIPTION
newpath

Path to file

process(filePath, **kwargs)

This is the main processing function.

PARAMETER DESCRIPTION
filePath

Path to file

kwargs

The kwargs are determined by self._process_kwargs

DEFAULT: {}

RETURNS DESCRIPTION
str

file path of processed file

validate(filePathList, **kwargs)

This is the main validation function. Every file type calls self._validate, whose implementation differs by file type.

PARAMETER DESCRIPTION
filePathList

A list of file paths.

kwargs

The kwargs are determined by self._validation_kwargs

DEFAULT: {}

RETURNS DESCRIPTION
tuple

The errors and warnings from validating the file.

TYPE: ValidationResults
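The relationship between `validate` and the per-file-type `_validate` can be pictured as a template method. The sketch below uses stand-in classes (`Results`, `BaseFormat`, `RequiresRows`), all hypothetical; it is not the `FileTypeFormat` source, and the real `validate` may perform additional steps.

```python
from dataclasses import dataclass


@dataclass
class Results:
    """Stand-in for the ValidationResults type named above."""
    errors: str
    warnings: str


class BaseFormat:
    """Hypothetical stand-in for FileTypeFormat, reduced to validation."""

    def read_file(self, file_path_list):
        raise NotImplementedError

    def _validate(self, df, **kwargs):
        # Subclasses return (errors, warnings) as strings.
        return "", ""

    def validate(self, file_path_list, **kwargs):
        # Shared flow: read the file, then defer to the subclass check.
        df = self.read_file(file_path_list)
        errors, warnings = self._validate(df, **kwargs)
        return Results(errors=errors, warnings=warnings)


class RequiresRows(BaseFormat):
    """Toy file type that only checks that something was read."""

    def read_file(self, file_path_list):
        # Pretend each path is one already-parsed row.
        return list(file_path_list)

    def _validate(self, df, **kwargs):
        errors = "" if df else "file has no rows\n"
        return errors, ""
```

With this shape, adding a file type means overriding `_validate` (and `read_file` if needed) while the public `validate` entry point stays the same.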

maf

Bases: FileTypeFormat

MAF file format validation / processing

Functions
process_steps(df)

The processing of MAF files is specific to GENIE, so it is not included in this function.

Functions

_check_allele_col_validity(df)

This function checks specific columns in a MAF (Mutation Annotation Format) file for certain conditions.

The following conditions must be met.

If the MAF file has all three of these columns:

- TUMOR_SEQ_ALLELE1 (TSA1)
- TUMOR_SEQ_ALLELE2 (TSA2)
- REFERENCE_ALLELE (REF)

then one of the following must be true:

- Every value in TSA1 is the same as the value in REF in the same row
- Every value in TSA1 is the same as the value in TSA2 in the same row

Additionally, if the MAF file has at least these two columns:

- REFERENCE_ALLELE (REF)
- TUMOR_SEQ_ALLELE2 (TSA2)

then no value in REF can match the TSA2 value in the same row.

These rules matter because Genome Nexus (GN) uses TSA1 to annotate data when it is not clear which variant to use. A file therefore cannot mix rows where TSA1 equals REF with rows where TSA1 equals TSA2.

Valid Examples

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| C                | C                 | A                 |
| T                | T                 | C                 |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| C                | A                 | A                 |
| T                | C                 | C                 |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- |
| C                | A                 |
| T                | C                 |

Invalid Examples

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| C                | C                 | A                 |
| C                | A                 | A                 |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE1 | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- | ----------------- |
| A                | C                 | A                 |
| T                | C                 | T                 |

| REFERENCE_ALLELE | TUMOR_SEQ_ALLELE2 |
| ---------------- | ----------------- |
| C                | C                 |
| T                | C                 |

See this Genome Nexus issue for more background regarding why this validation rule was implemented.
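The two rules can be expressed compactly with pandas. The following is a minimal sketch under stated assumptions, not the function's actual source: the name `check_allele_col_validity` and the wording of the error messages are illustrative, and missing values are not given any special handling.

```python
import pandas as pd


def check_allele_col_validity(df: pd.DataFrame) -> str:
    """Return an error message, or an empty string when both rules hold."""
    ref = "REFERENCE_ALLELE"
    tsa1 = "TUMOR_SEQ_ALLELE1"
    tsa2 = "TUMOR_SEQ_ALLELE2"
    error = ""

    # Rule 1: TSA1 must match REF in every row, or TSA2 in every row.
    if {ref, tsa1, tsa2}.issubset(df.columns):
        all_match_ref = (df[tsa1] == df[ref]).all()
        all_match_tsa2 = (df[tsa1] == df[tsa2]).all()
        if not (all_match_ref or all_match_tsa2):
            # Illustrative message text (assumption).
            error += "TSA1 must equal REF in all rows or TSA2 in all rows.\n"

    # Rule 2: REF must never equal TSA2 in the same row.
    if {ref, tsa2}.issubset(df.columns):
        if (df[ref] == df[tsa2]).any():
            # Illustrative message text (assumption).
            error += "REF must not match TSA2 in any row.\n"

    return error
```

Run against the example tables, this sketch returns an empty string for each valid table and a non-empty message for each invalid one. The comparisons are row-wise, which is why a single mismatched row is enough to break the "every value" condition.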

PARAMETER DESCRIPTION
df

input mutation dataframe

TYPE: DataFrame

RETURNS DESCRIPTION
str

the error message

TYPE: str

_check_allele_col(df, col)

Check that the allele column is correctly formatted.

PARAMETER DESCRIPTION
df

mutation dataframe

col

Column header name

RETURNS DESCRIPTION

error, warning
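To make the `(error, warning)` return shape concrete, here is a sketch of a per-column check. The specific rules (null values raise an error, the literal string "NA" raises a warning) and the message wording are assumptions for illustration; the documentation above does not spell out what "correctly formatted" means, so the real checks may differ.

```python
import pandas as pd


def check_allele_col(df: pd.DataFrame, col: str):
    """Return (error, warning) strings for one allele column."""
    error, warning = "", ""
    # A column that is absent produces no messages in this sketch.
    if col in df.columns:
        # Assumed rule: missing values are an error.
        if df[col].isnull().any():
            error = f"maf: {col} can't have any blank or null values.\n"
        # Assumed rule: the literal string "NA" only merits a warning.
        if (df[col] == "NA").any():
            warning = f"maf: {col} column contains 'NA' values.\n"
    return error, warning
```

Returning both strings lets the caller append them to running error and warning totals without branching on which kind of problem was found.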