process_mutation

genie.process_mutation

Process mutation files. TODO: deprecate this module and spread its functions around.

Attributes

logger = logging.getLogger(__name__) module-attribute

MAF_COL_MAPPING = {'HUGO_SYMBOL': 'Hugo_Symbol', 'ENTREZ_GENE_ID': 'Entrez_Gene_Id', 'CENTER': 'Center', 'NCBI_BUILD': 'NCBI_Build', 'CHROMOSOME': 'Chromosome', 'START_POSITION': 'Start_Position', 'END_POSITION': 'End_Position', 'STRAND': 'Strand', 'VARIANT_CLASSIFICATION': 'Variant_Classification', 'VARIANT_TYPE': 'Variant_Type', 'REFERENCE_ALLELE': 'Reference_Allele', 'TUMOR_SEQ_ALLELE1': 'Tumor_Seq_Allele1', 'TUMOR_SEQ_ALLELE2': 'Tumor_Seq_Allele2', 'DBSNP_RS': 'dbSNP_RS', 'DBSNP_VAL_STATUS': 'dbSNP_Val_Status', 'TUMOR_SAMPLE_BARCODE': 'Tumor_Sample_Barcode', 'MATCHED_NORM_SAMPLE_BARCODE': 'Matched_Norm_Sample_Barcode', 'MATCH_NORM_SEQ_ALLELE1': 'Match_Norm_Seq_Allele1', 'MATCH_NORM_SEQ_ALLELE2': 'Match_Norm_Seq_Allele2', 'TUMOR_VALIDATION_ALLELE1': 'Tumor_Validation_Allele1', 'TUMOR_VALIDATION_ALLELE2': 'Tumor_Validation_Allele2', 'MATCH_NORM_VALIDATION_ALLELE1': 'Match_Norm_Validation_Allele1', 'MATCH_NORM_VALIDATION_ALLELE2': 'Match_Norm_Validation_Allele2', 'VERIFICATION_STATUS': 'Verification_Status', 'VALIDATION_STATUS': 'Validation_Status', 'MUTATION_STATUS': 'Mutation_Status', 'SEQUENCING_PHASE': 'Sequencing_Phase', 'SEQUENCE_SOURCE': 'Sequence_Source', 'VALIDATION_METHOD': 'Validation_Method', 'SCORE': 'Score', 'BAM_FILE': 'BAM_File', 'SEQUENCER': 'Sequencer', 'T_REF_COUNT': 't_ref_count', 'T_ALT_COUNT': 't_alt_count', 'N_REF_COUNT': 'n_ref_count', 'N_ALT_COUNT': 'n_alt_count', 'ALLELE': 'Allele', 'AMINO_ACID_CHANGE': 'amino_acid_change', 'AMINO_ACIDS': 'Amino_acids', 'CDS_POSITION': 'CDS_position', 'CODONS': 'Codons', 'CONSEQUENCE': 'Consequence', 'EXISTING_VARIATION': 'Existing_variation', 'EXON_NUMBER': 'Exon_Number', 'FEATURE': 'Feature', 'FEATURE_TYPE': 'Feature_type', 'GENE': 'Gene', 'HGVSC': 'HGVSc', 'HGVSP': 'HGVSp', 'HGVSP_SHORT': 'HGVSp_Short', 'HOTSPOT': 'Hotspot', 'MA:FIMPACT': 'MA:FImpact', 'MA:LINK.MSA': 'MA:link.MSA', 'MA:LINK.PDB': 'MA:link.PDB', 'MA:LINK.VAR': 'MA:link.var', 'MA:PROTEIN.CHANGE': 'MA:protein.change', 'POLYPHEN': 
'PolyPhen', 'PROTEIN_POSITION': 'Protein_position', 'REFSEQ': 'RefSeq', 'TRANSCRIPT': 'transcript', 'TRANSCRIPT_ID': 'Transcript_ID', 'ALL_EFFECTS': 'all_effects', 'CDNA_CHANGE': 'cdna_change', 'CDNA_POSITION': 'cDNA_position', 'N_DEPTH': 'n_depth', 'T_DEPTH': 't_depth'} module-attribute
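
The keys of MAF_COL_MAPPING are upper-cased header names and the values are the canonical MAF casing, which suggests a case-insensitive header lookup. The sketch below shows one way such a mapping could be applied to a list of headers. It uses a four-entry excerpt of the mapping, and `normalize_headers` is a hypothetical helper, not a function of this module.

```python
# Small excerpt of MAF_COL_MAPPING; the full mapping is listed above.
MAF_COL_MAPPING_EXCERPT = {
    "HUGO_SYMBOL": "Hugo_Symbol",
    "CHROMOSOME": "Chromosome",
    "T_REF_COUNT": "t_ref_count",
    "HGVSP_SHORT": "HGVSp_Short",
}


def normalize_headers(headers, mapping):
    """Return headers with canonical MAF casing (hypothetical helper).

    Each header is upper-cased for the lookup; headers without an entry
    in the mapping are returned unchanged.
    """
    return [mapping.get(header.upper(), header) for header in headers]


headers = ["hugo_symbol", "Chromosome", "T_Ref_Count", "hgvsp_short", "Custom_Col"]
normalized = normalize_headers(headers, MAF_COL_MAPPING_EXCERPT)
print(normalized)
```

Headers that are not in the mapping, such as `Custom_Col` here, pass through untouched in this sketch; whether the module treats them the same way is not stated on this page.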

KNOWN_STRING_COLS = ['IS_NEW', 'ALLELE_NUM', 'Chromosome', 'CLIN_SIG', 'MOTIF_NAME', 'HIGH_INF_POS', 'MINIMISED', 'CHROMOSOME', 'VERIFICATION_STATUS', 'VALIDATION_STATUS', 'MUTATION_STATUS', 'SEQUENCE_SOURCE', 'SEQUENCER', 'REPORT_AF', 'CDNA_CHANGE', 'AMINO_ACID_CHANGE', 'TRANSCRIPT', 'transcript', 'STRAND_VEP', 'HGNC_ID', 'PUBMED', 'PICK', 'Exon_Number', 'genomic_location_explanation', 'Annotation_Status', 'Variant_Classification', 'Transcript_Exon'] module-attribute

Functions

_convert_to_str_dtype(column_types, known_string_cols)

The dtype determined from the first 100 rows is sometimes incorrect; this function updates the incorrect dtypes.
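
As an illustration of the idea, a column such as Chromosome can look numeric in the first rows and still contain values like X later in the file. The sketch below forces every known string column that is present to the `object` dtype. It is a minimal sketch under the assumption that `column_types` is a plain dict of column name to dtype; `sketch_convert_to_str_dtype` is a hypothetical stand-in, not the module's private function.

```python
def sketch_convert_to_str_dtype(column_types, known_string_cols):
    """Force known string columns to the object dtype (illustrative sketch)."""
    updated = dict(column_types)  # copy so the caller's dict is not mutated
    for col in known_string_cols:
        # Only touch columns that are actually present in the file.
        if col in updated:
            updated[col] = "object"
    return updated


inferred = {"Chromosome": "int64", "Start_Position": "int64", "Hugo_Symbol": "object"}
fixed = sketch_convert_to_str_dtype(inferred, ["Chromosome", "PICK"])
print(fixed)
```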

determine_dtype(path)

Reads part of a dataframe and determines the dtype of each column.
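
A partial read of this kind can be sketched with `pandas.read_csv` and its `nrows` argument. The tab separator, the `#` comment character, the default of 100 rows and the name `sketch_determine_dtype` are assumptions made for this example; the page itself only says that the file is read partially.

```python
import io

import pandas as pd


def sketch_determine_dtype(path_or_buffer, nrows=100):
    """Infer column dtypes from the first nrows rows (illustrative sketch)."""
    subset = pd.read_csv(path_or_buffer, sep="\t", comment="#", nrows=nrows)
    return subset.dtypes.to_dict()


maf_text = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\n"
    "TP53\t17\t7577120\n"
    "KRAS\t12\t25398284\n"
    "AR\tX\t66765158\n"
)

# Only the first two data rows are inspected, so Chromosome looks numeric.
partial_types = sketch_determine_dtype(io.StringIO(maf_text), nrows=2)
# Reading every row reveals the non-numeric value "X".
full_types = sketch_determine_dtype(io.StringIO(maf_text), nrows=None)
print(partial_types["Chromosome"], full_types["Chromosome"])
```

The difference between the two results is the reason a list such as KNOWN_STRING_COLS is useful: a dtype inferred from a prefix of the file can be wrong for the rest of it.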

move_and_configure_maf(mutation_path, input_files_dir)

Moves a maf file into the processing directory. The maf file's column headers are renamed if necessary, and any .0 suffixes are stripped.

PARAMETER DESCRIPTION
mutation_path

Mutation file path

TYPE: str

input_files_dir

Input file directory

TYPE: str

RETURNS DESCRIPTION
str

Filepath to moved and configured maf

TYPE: str
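
The .0 stripping can be pictured as follows: when pandas holds whole numbers in a float column, values such as 7577120 are written out as 7577120.0. The sketch below renders a column as text and removes a trailing .0. It is an assumption about how the cleanup could be done, and `strip_trailing_point_zero` is a hypothetical helper, not part of this module.

```python
import pandas as pd


def strip_trailing_point_zero(series):
    """Render values as text and drop a trailing '.0' (illustrative helper)."""
    as_text = series.astype(str)
    # The pattern is anchored at the end, so "10.05" would stay unchanged.
    return as_text.str.replace(r"\.0$", "", regex=True)


positions = pd.Series([7577120.0, 25398284.0, 66765158.0])
cleaned = strip_trailing_point_zero(positions)
print(cleaned.tolist())
```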

move_mutation(mutation_path, input_files_dir)

Move mutation file into processing directory

process_mutation_workflow(syn, center, validfiles, genie_config, workdir)

Process vcf/maf workflow

PARAMETER DESCRIPTION
syn

Synapse connection

TYPE: Synapse

center

Center name

TYPE: str

validfiles

Center validated files

TYPE: DataFrame

genie_config

GENIE configuration.

TYPE: dict

workdir

Working directory

TYPE: str

RETURNS DESCRIPTION
Optional[str]

Annotated Maf Path. None if there are no valid mutation files.

create_annotation_paths(center, workdir)

Creates the filepaths required in the annotation process

PARAMETER DESCRIPTION
center

name of the center

TYPE: str

workdir

work directory to create paths in

TYPE: str

RETURNS DESCRIPTION
namedtuple

tuple with all the paths

TYPE: namedtuple
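
A namedtuple keeps related paths together under readable attribute names. The sketch below shows the pattern only: the field names, directory names and file names are all hypothetical and chosen for illustration, and the sketch builds path strings without creating anything on disk.

```python
import os
from collections import namedtuple

# Hypothetical field names, chosen for illustration only.
AnnotationPaths = namedtuple(
    "AnnotationPaths", ["input_files_dir", "output_files_dir", "merged_maf_path"]
)


def sketch_create_annotation_paths(center, workdir):
    """Build per-center paths under workdir (illustrative sketch)."""
    input_files_dir = os.path.join(workdir, f"{center}-inputs")
    output_files_dir = os.path.join(workdir, f"{center}-outputs")
    merged_maf_path = os.path.join(output_files_dir, f"{center}-merged.maf")
    return AnnotationPaths(input_files_dir, output_files_dir, merged_maf_path)


paths = sketch_create_annotation_paths("SAGE", "workdir")
print(paths.merged_maf_path)
```

Downstream functions can then accept a single `annotation_paths` argument and read `paths.input_files_dir` and similar attributes, instead of taking several separate path parameters.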

concat_annotation_error_reports(center, input_dir)

Concatenates the annotation error reports

PARAMETER DESCRIPTION
center

name of center associated with error report

TYPE: str

input_dir

directory where error reports are

TYPE: str

RETURNS DESCRIPTION
DataFrame

Full annotation error report
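
Concatenating per-file reports into one dataframe can be sketched with `glob` and `pandas.concat`. The file naming scheme, the tab separator, the column names and the function name `sketch_concat_error_reports` are assumptions made for this example; the page does not state how the error reports are named or laid out.

```python
import glob
import os
import tempfile

import pandas as pd


def sketch_concat_error_reports(center, input_dir):
    """Concatenate a center's error reports found in input_dir (sketch)."""
    # Hypothetical naming scheme: "<center>_<n>_error_report.txt".
    pattern = os.path.join(input_dir, f"{center}_*_error_report.txt")
    frames = [pd.read_csv(path, sep="\t") for path in sorted(glob.glob(pattern))]
    if not frames:
        return pd.DataFrame()  # no reports found
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmpdir:
    # Write two small reports so the example is self-contained.
    for idx, genes in enumerate([["TP53", "KRAS"], ["EGFR"]]):
        report = pd.DataFrame({"Hugo_Symbol": genes, "Center": "SAGE"})
        report_path = os.path.join(tmpdir, f"SAGE_{idx}_error_report.txt")
        report.to_csv(report_path, sep="\t", index=False)
    full_error_report = sketch_concat_error_reports("SAGE", tmpdir)
    empty_report = sketch_concat_error_reports("OTHER", tmpdir)

print(len(full_error_report))
```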

check_annotation_error_reports(syn, maf_table_synid, full_error_report, center)

A simple QC check to make sure that the failed annotations in the Genome Nexus error report match the failed annotations in the final processed maf table.

PARAMETER DESCRIPTION
syn

synapse client

TYPE: Synapse

maf_table_synid

synapse_id of the narrow maf table

TYPE: str

full_error_report

the failed annotations error report

TYPE: DataFrame

center

the center this is for

TYPE: str

store_annotation_error_reports(full_error_report, full_error_report_path, syn, errors_folder_synid)

Stores the annotation error reports to synapse

PARAMETER DESCRIPTION
full_error_report

full error report to store

TYPE: DataFrame

full_error_report_path

filepath of the full error report

syn

synapse client object

TYPE: Synapse

errors_folder_synid

synapse id of error report folder to store reports in

TYPE: str

annotate_mutation(annotation_paths, mutation_files, genie_annotation_pkg, center)

Process vcf/maf files

PARAMETER DESCRIPTION
center

Center name

TYPE: str

mutation_files

list of mutation files

TYPE: list

genie_annotation_pkg

Path to GENIE annotation package

TYPE: str

annotation_paths

filepaths in the annotation process

TYPE: namedtuple

RETURNS DESCRIPTION
None

Path to final maf

append_or_createdf(dataframe, filepath)

Creates a file with the dataframe or appends to an existing file.

PARAMETER DESCRIPTION
dataframe

pandas.DataFrame to write out

filepath

Filepath to append or create

TYPE: str
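
The create-or-append behaviour can be sketched with `DataFrame.to_csv`, writing the header only when the file does not yet have content. The tab separator, the dropped index and the name `sketch_append_or_create` are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd


def sketch_append_or_create(dataframe, filepath):
    """Write dataframe to filepath, appending if the file has content."""
    has_content = os.path.exists(filepath) and os.path.getsize(filepath) > 0
    dataframe.to_csv(
        filepath,
        sep="\t",  # assumption: tab-separated output
        mode="a" if has_content else "w",
        header=not has_content,  # write the header only once
        index=False,
    )


with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "merged.maf")
    first = pd.DataFrame({"Hugo_Symbol": ["TP53"], "Chromosome": ["17"]})
    second = pd.DataFrame({"Hugo_Symbol": ["KRAS"], "Chromosome": ["12"]})
    sketch_append_or_create(first, out_path)  # creates the file with a header
    sketch_append_or_create(second, out_path)  # appends rows only
    with open(out_path) as handle:
        lines = handle.read().splitlines()

print(lines)
```

Appending without a header assumes both dataframes share the same columns in the same order; this sketch does not check that.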

format_maf(mafdf, center)

Formats the maf file and shortens the maf file length.

PARAMETER DESCRIPTION
mafdf

mutation dataframe

TYPE: DataFrame

center

Center name

TYPE: str

RETURNS DESCRIPTION
DataFrame

Formatted mutation dataframe

split_and_store_maf(syn, center, maf_tableid, annotation_paths, flatfiles_synid)

Separates annotated maf file into narrow and full maf and stores them

PARAMETER DESCRIPTION
syn

Synapse connection

TYPE: Synapse

center

Center

TYPE: str

maf_tableid

Mutation table synapse id

TYPE: str

annotation_paths

filepaths in the annotation process

TYPE: namedtuple

flatfiles_synid

GENIE flat files folder

TYPE: str
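
One way to picture the narrow/full split is as a column selection: the full maf keeps every annotated column, while the narrow maf keeps a fixed subset. That reading is an assumption, as are the column list `NARROW_COLS` and the helper `sketch_split_maf` below; the sketch leaves out the Synapse storage step entirely.

```python
import pandas as pd

# Hypothetical subset of columns kept in the narrow maf.
NARROW_COLS = ["Hugo_Symbol", "Chromosome", "Start_Position"]


def sketch_split_maf(annotated_maf, narrow_cols):
    """Return (narrow, full) views of an annotated maf (illustrative sketch)."""
    present = [col for col in narrow_cols if col in annotated_maf.columns]
    narrow = annotated_maf[present].copy()
    full = annotated_maf.copy()
    return narrow, full


annotated = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS"],
        "Chromosome": ["17", "12"],
        "Start_Position": [7577120, 25398284],
        "HGVSp_Short": ["p.R248Q", "p.G12D"],
    }
)
narrow_maf, full_maf = sketch_split_maf(annotated, NARROW_COLS)
print(list(narrow_maf.columns))
```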