process_mutation
genie.process_mutation
Process mutation files. TODO: deprecate this module and spread its functions around.
Attributes
logger = logging.getLogger(__name__)
module-attribute
MAF_COL_MAPPING = {'HUGO_SYMBOL': 'Hugo_Symbol', 'ENTREZ_GENE_ID': 'Entrez_Gene_Id', 'CENTER': 'Center', 'NCBI_BUILD': 'NCBI_Build', 'CHROMOSOME': 'Chromosome', 'START_POSITION': 'Start_Position', 'END_POSITION': 'End_Position', 'STRAND': 'Strand', 'VARIANT_CLASSIFICATION': 'Variant_Classification', 'VARIANT_TYPE': 'Variant_Type', 'REFERENCE_ALLELE': 'Reference_Allele', 'TUMOR_SEQ_ALLELE1': 'Tumor_Seq_Allele1', 'TUMOR_SEQ_ALLELE2': 'Tumor_Seq_Allele2', 'DBSNP_RS': 'dbSNP_RS', 'DBSNP_VAL_STATUS': 'dbSNP_Val_Status', 'TUMOR_SAMPLE_BARCODE': 'Tumor_Sample_Barcode', 'MATCHED_NORM_SAMPLE_BARCODE': 'Matched_Norm_Sample_Barcode', 'MATCH_NORM_SEQ_ALLELE1': 'Match_Norm_Seq_Allele1', 'MATCH_NORM_SEQ_ALLELE2': 'Match_Norm_Seq_Allele2', 'TUMOR_VALIDATION_ALLELE1': 'Tumor_Validation_Allele1', 'TUMOR_VALIDATION_ALLELE2': 'Tumor_Validation_Allele2', 'MATCH_NORM_VALIDATION_ALLELE1': 'Match_Norm_Validation_Allele1', 'MATCH_NORM_VALIDATION_ALLELE2': 'Match_Norm_Validation_Allele2', 'VERIFICATION_STATUS': 'Verification_Status', 'VALIDATION_STATUS': 'Validation_Status', 'MUTATION_STATUS': 'Mutation_Status', 'SEQUENCING_PHASE': 'Sequencing_Phase', 'SEQUENCE_SOURCE': 'Sequence_Source', 'VALIDATION_METHOD': 'Validation_Method', 'SCORE': 'Score', 'BAM_FILE': 'BAM_File', 'SEQUENCER': 'Sequencer', 'T_REF_COUNT': 't_ref_count', 'T_ALT_COUNT': 't_alt_count', 'N_REF_COUNT': 'n_ref_count', 'N_ALT_COUNT': 'n_alt_count', 'ALLELE': 'Allele', 'AMINO_ACID_CHANGE': 'amino_acid_change', 'AMINO_ACIDS': 'Amino_acids', 'CDS_POSITION': 'CDS_position', 'CODONS': 'Codons', 'CONSEQUENCE': 'Consequence', 'EXISTING_VARIATION': 'Existing_variation', 'EXON_NUMBER': 'Exon_Number', 'FEATURE': 'Feature', 'FEATURE_TYPE': 'Feature_type', 'GENE': 'Gene', 'HGVSC': 'HGVSc', 'HGVSP': 'HGVSp', 'HGVSP_SHORT': 'HGVSp_Short', 'HOTSPOT': 'Hotspot', 'MA:FIMPACT': 'MA:FImpact', 'MA:LINK.MSA': 'MA:link.MSA', 'MA:LINK.PDB': 'MA:link.PDB', 'MA:LINK.VAR': 'MA:link.var', 'MA:PROTEIN.CHANGE': 'MA:protein.change', 'POLYPHEN': 'PolyPhen', 'PROTEIN_POSITION': 'Protein_position', 'REFSEQ': 'RefSeq', 'TRANSCRIPT': 'transcript', 'TRANSCRIPT_ID': 'Transcript_ID', 'ALL_EFFECTS': 'all_effects', 'CDNA_CHANGE': 'cdna_change', 'CDNA_POSITION': 'cDNA_position', 'N_DEPTH': 'n_depth', 'T_DEPTH': 't_depth'}
module-attribute
KNOWN_STRING_COLS = ['IS_NEW', 'ALLELE_NUM', 'Chromosome', 'CLIN_SIG', 'MOTIF_NAME', 'HIGH_INF_POS', 'MINIMISED', 'CHROMOSOME', 'VERIFICATION_STATUS', 'VALIDATION_STATUS', 'MUTATION_STATUS', 'SEQUENCE_SOURCE', 'SEQUENCER', 'REPORT_AF', 'CDNA_CHANGE', 'AMINO_ACID_CHANGE', 'TRANSCRIPT', 'transcript', 'STRAND_VEP', 'HGNC_ID', 'PUBMED', 'PICK', 'Exon_Number', 'genomic_location_explanation', 'Annotation_Status', 'Variant_Classification', 'Transcript_Exon']
module-attribute
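MAF_COL_MAPPING pairs upper-case column headers with their canonical MAF spellings. As a rough illustration of how such a mapping could be applied, the sketch below renames the columns of a small dataframe with pandas. The helper name `rename_maf_columns`, the four-entry excerpt of the mapping, and the sample data are assumptions made for this example, not part of the module.

```python
import pandas as pd

# A four-entry excerpt of MAF_COL_MAPPING, copied from the attribute above.
MAF_COL_MAPPING_EXCERPT = {
    "HUGO_SYMBOL": "Hugo_Symbol",
    "CHROMOSOME": "Chromosome",
    "START_POSITION": "Start_Position",
    "T_ALT_COUNT": "t_alt_count",
}


def rename_maf_columns(mafdf: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Hypothetical helper: rename columns found in `mapping`, leave the rest alone."""
    return mafdf.rename(columns=mapping)


# Made-up sample data; "Center" is already canonical and is left untouched.
example = pd.DataFrame(
    {"HUGO_SYMBOL": ["TP53"], "CHROMOSOME": ["17"], "Center": ["SAGE"]}
)
renamed = rename_maf_columns(example, MAF_COL_MAPPING_EXCERPT)
print(list(renamed.columns))  # ['Hugo_Symbol', 'Chromosome', 'Center']
```

Columns that do not appear as keys in the mapping pass through unchanged, so the rename is safe to apply to files that already use the canonical headers.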
Functions
_convert_to_str_dtype(column_types, known_string_cols)
Sometimes the dtype determined from the first 100 rows is incorrect; update the incorrect dtypes.
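One way to picture this correction step: take the inferred column-to-dtype mapping and force every column listed in known_string_cols to a string-compatible dtype. The sketch below is an assumption about the behavior based only on the description above. The function name `convert_to_str_dtype_sketch` and the choice of "object" as the target dtype are illustrative.

```python
def convert_to_str_dtype_sketch(column_types: dict, known_string_cols: list) -> dict:
    """Return a copy of column_types with known string columns set to "object".

    Assumption: "object" is used as the string-compatible pandas dtype.
    """
    updated = dict(column_types)  # copy so the caller's mapping is not mutated
    for col in known_string_cols:
        if col in updated:
            updated[col] = "object"
    return updated


# Made-up inferred dtypes: Chromosome looked numeric in the sampled rows.
inferred = {"Chromosome": "int64", "Start_Position": "int64", "Hugo_Symbol": "object"}
fixed = convert_to_str_dtype_sketch(inferred, ["Chromosome", "PICK"])
print(fixed["Chromosome"])  # object
```

Columns named in known_string_cols but absent from the file (here "PICK") are simply skipped.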
determine_dtype(path)
Reads in part of a dataframe and determines the dtype of each column.
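A partial read of this kind can be sketched with the `nrows` argument of `pandas.read_csv`. The function name `determine_dtype_sketch`, the tab separator, and the sample text are assumptions for illustration. The example also shows why a partial read can mislead: a Chromosome column looks like an integer column until a later row contains "X".

```python
import io

import pandas as pd


def determine_dtype_sketch(path_or_buffer, nrows: int = 100) -> dict:
    """Read only the first `nrows` rows and return {column name: dtype name}.

    Assumption: the input is tab-separated, as MAF files usually are.
    """
    subset = pd.read_csv(path_or_buffer, sep="\t", nrows=nrows)
    return {col: str(dtype) for col, dtype in subset.dtypes.items()}


# Made-up three-row file; only the first two rows are sampled.
maf_text = "Chromosome\tStart_Position\n1\t100\n2\t200\nX\t300\n"
dtypes = determine_dtype_sketch(io.StringIO(maf_text), nrows=2)
print(dtypes)  # {'Chromosome': 'int64', 'Start_Position': 'int64'}
```

Because the third row is never read, Chromosome is inferred as an integer column, which is the kind of mistake a list such as KNOWN_STRING_COLS can be used to correct.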
move_and_configure_maf(mutation_path, input_files_dir)
Moves MAF files into the processing directory. The MAF file's column headers are renamed if necessary, and .0 is stripped.
| PARAMETER | DESCRIPTION |
|---|---|
| mutation_path | Mutation file path |
| input_files_dir | Input file directory |

| RETURNS | DESCRIPTION |
|---|---|
| str | Filepath to moved and configured maf |
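The description mentions stripping .0 without saying where it comes from. One plausible reading, stated here as an assumption, is that integer-like values pick up a trailing ".0" when a column is read as floating point. The sketch below removes such a suffix from string values; the helper name `strip_trailing_point_zero` and the sample values are hypothetical.

```python
import pandas as pd


def strip_trailing_point_zero(series: pd.Series) -> pd.Series:
    """Hypothetical helper: drop a trailing '.0' from values such as '7577120.0'."""
    # The pattern is anchored at the end so that '10.05' or '0.01' are untouched.
    return series.astype(str).str.replace(r"\.0$", "", regex=True)


# Made-up position values, two of which carry the float artifact.
positions = pd.Series(["7577120.0", "7577121", "25398284.0"])
cleaned = strip_trailing_point_zero(positions)
print(cleaned.tolist())  # ['7577120', '7577121', '25398284']
```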
move_mutation(mutation_path, input_files_dir)
Move mutation file into processing directory
process_mutation_workflow(syn, center, validfiles, genie_config, workdir)
Process vcf/maf workflow
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse connection |
| center | Center name |
| validfiles | Center validated files |
| genie_config | GENIE configuration. |
| workdir | Working directory |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[str] | Annotated Maf Path. None if there are no valid mutation files. |
create_annotation_paths(center, workdir)
Creates the filepaths required in the annotation process
| PARAMETER | DESCRIPTION |
|---|---|
| center | name of the center |
| workdir | work directory to create paths in |

| RETURNS | DESCRIPTION |
|---|---|
| namedtuple | tuple with all the paths |
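To make the returned namedtuple concrete, the sketch below builds a set of per-center paths from center and workdir. The field names, directory layout, and file name are all hypothetical, since the description above does not list them. The sketch only composes path strings and does not create anything on disk.

```python
import os
from collections import namedtuple

# Field names and layout are hypothetical, chosen only for illustration.
AnnotationPaths = namedtuple(
    "AnnotationPaths",
    ["input_files_dir", "output_files_dir", "error_dir", "merged_maf_path"],
)


def create_annotation_paths_sketch(center: str, workdir: str) -> AnnotationPaths:
    """Compose (but do not create) the per-center paths used during annotation."""
    center_dir = os.path.join(workdir, center)
    return AnnotationPaths(
        input_files_dir=os.path.join(center_dir, "input_files"),
        output_files_dir=os.path.join(center_dir, "output_files"),
        error_dir=os.path.join(center_dir, "error_files"),
        merged_maf_path=os.path.join(center_dir, f"merged_{center}.maf"),
    )


paths = create_annotation_paths_sketch("SAGE", "/tmp/genie_work")
print(paths.input_files_dir)
```

Returning a namedtuple keeps the related paths together and lets callers use attribute access (`paths.error_dir`) instead of positional indexing.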
concat_annotation_error_reports(center, input_dir)
Concatenates the annotation error reports
| PARAMETER | DESCRIPTION |
|---|---|
| center | name of center associated with error report |
| input_dir | directory where error reports are |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | full annotation error report |
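Concatenating per-file reports into one dataframe can be sketched with `glob` and `pandas.concat`. The file-name pattern, the tab separator, and the added Center column are assumptions for this example; the actual report layout is not given above.

```python
import glob
import os

import pandas as pd


def concat_error_reports_sketch(center: str, input_dir: str) -> pd.DataFrame:
    """Read every matching report in input_dir and stack them into one dataframe."""
    # Hypothetical naming convention for the per-file reports.
    pattern = os.path.join(input_dir, "*_error_report.txt")
    frames = [pd.read_csv(path, sep="\t") for path in sorted(glob.glob(pattern))]
    if not frames:
        # No reports found: return an empty dataframe rather than failing.
        return pd.DataFrame()
    report = pd.concat(frames, ignore_index=True)
    report["Center"] = center  # assumption: tag each row with its center
    return report
```

Sorting the matched paths keeps the row order deterministic across runs, and `ignore_index=True` gives the combined report a clean 0..n-1 index.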
check_annotation_error_reports(syn, maf_table_synid, full_error_report, center)
A simple QC check to make sure that the failed annotations in our Genome Nexus error report match the failed annotations in our final processed maf table
| PARAMETER | DESCRIPTION |
|---|---|
| syn | synapse client |
| maf_table_synid | synapse_id of the narrow maf table |
| full_error_report | the failed annotations error report |
| center | the center this is for |
store_annotation_error_reports(full_error_report, full_error_report_path, syn, errors_folder_synid)
Stores the annotation error reports to synapse
| PARAMETER | DESCRIPTION |
|---|---|
| full_error_report | full error report to store |
| syn | synapse client object |
| errors_folder_synid | synapse id of error report folder to store reports in |
annotate_mutation(annotation_paths, mutation_files, genie_annotation_pkg, center)
Process vcf/maf files
| PARAMETER | DESCRIPTION |
|---|---|
| center | Center name |
| mutation_files | list of mutation files |
| genie_annotation_pkg | Path to GENIE annotation package |

| RETURNS | DESCRIPTION |
|---|---|
| None | Path to final maf |
append_or_createdf(dataframe, filepath)
Creates a file with the dataframe or appends to an existing file.
| PARAMETER | DESCRIPTION |
|---|---|
| dataframe | pandas.dataframe to write out |
| filepath | Filepath to append or create |
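The create-or-append behavior can be sketched with `DataFrame.to_csv` and its `mode` and `header` arguments. The function name `append_or_create_sketch`, the tab separator, and the rule of writing the header only when the file is missing or empty are assumptions for illustration.

```python
import os

import pandas as pd


def append_or_create_sketch(dataframe: pd.DataFrame, filepath: str) -> None:
    """Write dataframe to filepath, appending without a header if data exists."""
    if os.path.exists(filepath) and os.path.getsize(filepath) > 0:
        # File already has content: append rows only, so the header appears once.
        dataframe.to_csv(filepath, sep="\t", mode="a", index=False, header=False)
    else:
        # New or empty file: write the header followed by the rows.
        dataframe.to_csv(filepath, sep="\t", index=False)
```

Calling the function twice on the same path yields one header line followed by the rows of both dataframes, provided both share the same column order.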
format_maf(mafdf, center)
Format maf file, shortens the maf file length
| PARAMETER | DESCRIPTION |
|---|---|
| mafdf | mutation dataframe |
| center | Center name |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Formatted mutation dataframe |
split_and_store_maf(syn, center, maf_tableid, annotation_paths, flatfiles_synid)
Separates annotated maf file into narrow and full maf and stores them
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse connection |
| center | Center |
| maf_tableid | Mutation table synapse id |
| annotation_paths | filepaths in the annotation process |
| flatfiles_synid | GENIE flat files folder |