Database to Staging
genie.database_to_staging
Functions for creating GENIE consortium releases
Attributes
logger = logging.getLogger(__name__)
module-attribute
GENIE_RELEASE_DIR = os.path.join(os.path.expanduser('~/.synapseCache'), 'GENIE_release')
module-attribute
CASE_LIST_PATH = os.path.join(GENIE_RELEASE_DIR, 'case_lists')
module-attribute
CNA_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_CNA_%s.txt')
module-attribute
SAMPLE_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_clinical_supp_sample_%s.txt')
module-attribute
PATIENT_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_clinical_supp_patient_%s.txt')
module-attribute
MUTATIONS_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_mutations_extended_%s.txt')
module-attribute
FUSIONS_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_fusions_%s.txt')
module-attribute
SEG_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_cna_hg19_%s.seg')
module-attribute
SV_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_sv_%s.txt')
module-attribute
BED_DIFFS_SEQASSAY_PATH = os.path.join(GENIE_RELEASE_DIR, 'diff_%s.csv')
module-attribute
FULL_MAF_RELEASE_COLUMNS = ['Hugo_Symbol', 'Entrez_Gene_Id', 'Center', 'NCBI_Build', 'Chromosome', 'Start_Position', 'End_Position', 'Strand', 'Consequence', 'Variant_Classification', 'Variant_Type', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', 'dbSNP_RS', 'dbSNP_Val_Status', 'Tumor_Sample_Barcode', 'Matched_Norm_Sample_Barcode', 'Match_Norm_Seq_Allele1', 'Match_Norm_Seq_Allele2', 'Tumor_Validation_Allele1', 'Tumor_Validation_Allele2', 'Match_Norm_Validation_Allele1', 'Match_Norm_Validation_Allele2', 'Verification_Status', 'Validation_Status', 'Mutation_Status', 'Sequencing_Phase', 'Sequence_Source', 'Validation_Method', 'Score', 'BAM_File', 'Sequencer', 't_ref_count', 't_alt_count', 'n_ref_count', 'n_alt_count', 'HGVSc', 'HGVSp', 'HGVSp_Short', 'Transcript_ID', 'RefSeq', 'Protein_position', 'Codons', 'Exon_Number', 'gnomAD_AF', 'gnomAD_AFR_AF', 'gnomAD_AMR_AF', 'gnomAD_ASJ_AF', 'gnomAD_EAS_AF', 'gnomAD_FIN_AF', 'gnomAD_NFE_AF', 'gnomAD_OTH_AF', 'gnomAD_SAS_AF', 'FILTER', 'Polyphen_Prediction', 'Polyphen_Score', 'SIFT_Prediction', 'SIFT_Score', 'SWISSPROT', 'n_depth', 't_depth', 'Annotation_Status', 'mutationInCis_Flag']
module-attribute
DEPRECATED_CONSORTIUM_RELEASE_FILES = ['data_fusions.txt', 'meta_fusions.txt']
module-attribute
Functions
_to_redact_interval(df_col)
Determines year values that are "<18" and interval values that are >89, both of which need to be redacted. Returns boolean vectors because BIRTH_YEAR needs to be redacted as well, based on the results.
| PARAMETER | DESCRIPTION |
|---|---|
| df_col | Dataframe column/pandas.Series of an interval column |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | pandas.Series: to redact boolean vector; pandas.Series: to redact pediatric boolean vector |
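A minimal sketch of how such a pair of boolean vectors could be computed with pandas. It assumes that intervals are recorded in days, that the cutoffs are 89 and 18 years expressed as 365-day years, and that centers may pre-mark values with `<` or `>`. The function name and the cutoff constants are illustrative assumptions, not the module's confirmed implementation.

```python
import pandas as pd

# Assumed cutoffs: intervals in days, 365-day years (illustrative only).
PHI_CUTOFF_DAYS = 365 * 89
PEDIATRIC_CUTOFF_DAYS = 365 * 18


def sketch_to_redact_interval(df_col: pd.Series):
    """Hypothetical helper: flag interval values that would need redaction."""
    as_text = df_col.astype(str)
    # Values a center already marked with ">" or "<" are always flagged.
    has_greater_than = as_text.str.contains(">", na=False)
    has_less_than = as_text.str.contains("<", na=False)
    # Non-numeric entries become NaN; comparisons against NaN are False.
    numeric = pd.to_numeric(df_col, errors="coerce")
    to_redact = (numeric > PHI_CUTOFF_DAYS) | has_greater_than
    to_redact_pediatric = (numeric < PEDIATRIC_CUTOFF_DAYS) | has_less_than
    return to_redact, to_redact_pediatric
```

Returning two vectors rather than a redacted column lets the caller apply the same masks to related columns such as BIRTH_YEAR.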
_redact_year(df_col)
Redacts year values that have < or >
| PARAMETER | DESCRIPTION |
|---|---|
| df_col | Dataframe column/pandas.Series of a year column |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pandas.Series: Redacted series |
_redact_ped_year(df_col)
Redacts year values that have <
| PARAMETER | DESCRIPTION |
|---|---|
| df_col | Dataframe column/pandas.Series of a year column |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pandas.Series: Redacted series |
_to_redact_difference(df_col_year1, df_col_year2)
Determine if difference between year2 and year1 is > 89
| PARAMETER | DESCRIPTION |
|---|---|
| df_col_year1 | Dataframe column/pandas.Series of a year column |
| df_col_year2 | Dataframe column/pandas.Series of a year column |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pandas.Series: to redact boolean vector |
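The year-difference check can be sketched as follows. This is an illustration only: the function name is hypothetical, and coercing non-numeric years to NaN (so that they are never flagged) is an assumption about how unparseable values should be treated.

```python
import pandas as pd


def sketch_to_redact_difference(df_col_year1: pd.Series, df_col_year2: pd.Series):
    """Hypothetical helper: True where year2 - year1 is greater than 89."""
    # Non-numeric years become NaN, and NaN > 89 evaluates to False.
    year1 = pd.to_numeric(df_col_year1, errors="coerce")
    year2 = pd.to_numeric(df_col_year2, errors="coerce")
    return (year2 - year1) > 89
```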
redact_phi(clinicaldf, interval_cols_to_redact=['AGE_AT_SEQ_REPORT', 'INT_CONTACT', 'INT_DOD'])
Redacts the PHI by re-annotating the clinical file
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | merged clinical dataframe |
| interval_cols_to_redact | List of interval columns to redact. Defaults to ['AGE_AT_SEQ_REPORT', 'INT_CONTACT', 'INT_DOD'] |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | pandas.DataFrame: Redacted clinical dataframe |
remove_maf_samples(mafdf, keep_samples)
Remove samples from maf file
| PARAMETER | DESCRIPTION |
|---|---|
| mafdf | Maf dataframe |
| keep_samples | Samples to keep |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Filtered maf dataframe |
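A minimal sketch of this kind of sample filter, assuming samples are identified by the `Tumor_Sample_Barcode` column listed in `FULL_MAF_RELEASE_COLUMNS`. The function name is hypothetical.

```python
import pandas as pd


def sketch_remove_maf_samples(mafdf: pd.DataFrame, keep_samples) -> pd.DataFrame:
    """Hypothetical helper: keep only rows whose sample is in keep_samples."""
    keep_rows = mafdf["Tumor_Sample_Barcode"].isin(keep_samples)
    return mafdf[keep_rows]
```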
get_whitelist_variants_idx(mafdf)
Get boolean vector for variants that are known somatic sites. This is used to override the germline filter.
configure_maf(mafdf, remove_variants, flagged_variants)
Configures each maf dataframe and performs germline filtering.
Germline filtering for MAF files uses the gnomAD (Genome Aggregation Database) columns, which hold the allele frequencies (AF) of variants in different population groups. The filter removes variants whose maximum AF across all populations is greater than 0.05%; these are typically common germline variants.
Germline filtering for MAF files occurs during release instead of during processing because the MAF file gets re-annotated during processing via Genome Nexus annotation.
| PARAMETER | DESCRIPTION |
|---|---|
| mafdf | Maf dataframe |
| remove_variants | Variants to remove |
| flagged_variants | Variants to flag |

| RETURNS | DESCRIPTION |
|---|---|
| | configured maf row |
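The germline filter described above can be sketched with pandas as below. The gnomAD column names come from `FULL_MAF_RELEASE_COLUMNS`, and 0.05% is written as 0.0005. The function name is hypothetical, and the sketch omits the whitelist override and the variant removal and flagging steps.

```python
import pandas as pd

GNOMAD_AF_COLS = [
    "gnomAD_AF", "gnomAD_AFR_AF", "gnomAD_AMR_AF", "gnomAD_ASJ_AF",
    "gnomAD_EAS_AF", "gnomAD_FIN_AF", "gnomAD_NFE_AF", "gnomAD_OTH_AF",
    "gnomAD_SAS_AF",
]


def sketch_germline_filter(mafdf: pd.DataFrame, max_af: float = 0.0005) -> pd.DataFrame:
    """Hypothetical helper: drop variants whose max gnomAD AF exceeds max_af."""
    cols = [col for col in GNOMAD_AF_COLS if col in mafdf.columns]
    if not cols:
        return mafdf
    # Coerce to numbers so blank or text entries become NaN.
    af_values = mafdf[cols].apply(pd.to_numeric, errors="coerce")
    max_gnomad_af = af_values.max(axis=1)  # NaN when every AF is missing
    is_common = max_gnomad_af > max_af  # NaN compares as False, so it is kept
    return mafdf[~is_common]
```

Variants with no gnomAD annotation at all are kept, since a missing frequency gives no evidence that the variant is common.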
calculate_missing_variant_counts(depth, alt_count, ref_count)
Calculate missing counts. t_depth = t_alt_count + t_ref_count
| PARAMETER | DESCRIPTION |
|---|---|
| depth | Allele Depth |
| alt_count | Allele alt counts |
| ref_count | Allele ref counts |

| RETURNS | DESCRIPTION |
|---|---|
| dict | filled in depth, alt_count and ref_count values |
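The identity `t_depth = t_alt_count + t_ref_count` lets any one missing value be derived from the other two. A minimal sketch, assuming the three inputs are index-aligned pandas Series and that the result dictionary uses the keys shown; the function name and key names are illustrative assumptions.

```python
import pandas as pd


def sketch_fill_missing_counts(depth: pd.Series, alt_count: pd.Series, ref_count: pd.Series) -> dict:
    """Hypothetical helper: fill whichever of the three values is missing."""
    # depth = alt + ref
    depth = depth.fillna(alt_count + ref_count)
    # ref = depth - alt
    ref_count = ref_count.fillna(depth - alt_count)
    # alt = depth - ref
    alt_count = alt_count.fillna(depth - ref_count)
    return {"depth": depth, "alt_count": alt_count, "ref_count": ref_count}
```

Rows with two or more missing values stay missing, because a single equation cannot recover two unknowns.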
runMAFinBED(syn, center_mappingdf, test=False, staging=False, genieVersion='test')
Run MAF in BED script, filter data and update MAFinBED files
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse client connection |
| center_mappingdf | center mapping dataframe |
| test | Testing parameter. Defaults to False. |
| staging | Staging parameter. Defaults to False. |
| genieVersion | GENIE version. Defaults to "test". |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pd.Series: Variants to remove |
get_run_maf_in_bed_script_cmd(notinbed_file, script_dir, test, staging)
This function gets the MAFinBED R script command call based on whether we are running in test, staging or production mode
| PARAMETER | DESCRIPTION |
|---|---|
| notinbed_file | Full path to the notinbed csv |
| script_dir | directory of the MAFinBED script |
| test | Testing parameter. |
| staging | Staging parameter |

| RETURNS | DESCRIPTION |
|---|---|
| str | Full command call for the MAFinBED script |
store_maf_in_bed_filtered_variants(syn, notinbed_file, center_mapping_df, genie_version)
Retrieves the filtered variants generated by running the MAFinBED script and stores them as files on Synapse, then returns the not-in-bed file with the removed variants column added. TODO: Add handling for empty file?
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse client connection |
| notinbed_file | input file |
| center_mapping_df | the center mapping dataframe |
| genie_version | version of this genie run |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | pd.DataFrame: filtered variants dataframe with removed variants column added |
seq_date_filter(clinicalDf, processingDate, consortiumReleaseCutOff)
Filter samples by seq date
| PARAMETER | DESCRIPTION |
|---|---|
| clinicalDf | Clinical dataframe |
| processingDate | Processing date in the form Apr-XXXX |
| consortiumReleaseCutOff | Release cut off days |

| RETURNS | DESCRIPTION |
|---|---|
| list | Samples to remove |
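One way such a date cutoff could work is sketched below. It assumes the clinical dataframe has `SAMPLE_ID` and `SEQ_DATE` columns, that `SEQ_DATE` uses the same month-year form as the processing date (for example `Apr-2019`), and that a sample is removed unless its sequencing date plus the cutoff falls before the processing date. All of these, and the function name, are assumptions for illustration.

```python
from datetime import datetime, timedelta

import pandas as pd


def sketch_seq_date_filter(clinicaldf: pd.DataFrame, processing_date: str, cutoff_days: int) -> list:
    """Hypothetical helper: list samples sequenced too recently to release."""
    processing = datetime.strptime(processing_date, "%b-%Y")
    # Earliest date on which each sample becomes eligible for release.
    eligible_on = clinicaldf["SEQ_DATE"].apply(
        lambda seq_date: datetime.strptime(seq_date, "%b-%Y") + timedelta(days=cutoff_days)
    )
    keep = eligible_on < processing
    return clinicaldf["SAMPLE_ID"][~keep].tolist()
```

Note that `%b` parses month abbreviations according to the current locale, so this sketch expects English month names.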
mutation_in_cis_filter(syn, skipMutationsInCis, variant_filtering_synId, center_mappingdf, genieVersion, test=False, staging=False)
Run mutation in cis filter, look up samples to remove.
The mutation in cis script ONLY runs WHEN the skipMutationsInCis parameter is FALSE.
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| skipMutationsInCis | Skip this filter |
| variant_filtering_synId | Synapse id of the variant filtering (mergeCheck) table |
| center_mappingdf | center mapping dataframe |
| genieVersion | GENIE version |
| test | Testing parameter. Default is False. |
| staging | Staging parameter. Default is False. |

| RETURNS | DESCRIPTION |
|---|---|
| | pd.Series: Samples to remove |
get_mutation_in_cis_filter_script_cmd(test, staging)
This function gets the mutation_in_cis_filter R script command call based on whether we are running in test, staging or production mode
| PARAMETER | DESCRIPTION |
|---|---|
| test | Testing parameter. |
| staging | Staging parameter. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Full command call for the mergeCheck script |
get_mutation_in_cis_filtered_samples(syn, variant_filtering_synId)
Gets the samples to remove in our variant filtering table. TODO: Add handling for when we have 0 row query results
| PARAMETER | DESCRIPTION |
|---|---|
| syn | synapse client connection (synapseclient.Synapse) |
| variant_filtering_synId | variant filtering synapse id (str) |

| RETURNS | DESCRIPTION |
|---|---|
| list | pd.Series: removed samples |
get_mutation_in_cis_flagged_variants(syn, variant_filtering_synId)
Gets the flagged variants in our variant filtering table. Each flagged variant is a unique string concatenation of the Chromosome, Start_Position, HGVSp_Short, Reference_Allele, Tumor_Seq_Allele2 and Tumor_Sample_Barcode columns. TODO: Add handling for when we have 0 row query results
| PARAMETER | DESCRIPTION |
|---|---|
| syn | synapse client connection |
| variant_filtering_synId | variant filtering synapse id |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pd.Series: flagged variants |
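To match MAF rows against flagged variants, the same concatenated key has to be built from the MAF columns. A minimal sketch, assuming a single space as the separator and string conversion of every field; both the separator and the function name are assumptions, since the source only names the columns.

```python
import pandas as pd

# Columns named in the description above, in the same order.
VARIANT_KEY_COLS = [
    "Chromosome", "Start_Position", "HGVSp_Short",
    "Reference_Allele", "Tumor_Seq_Allele2", "Tumor_Sample_Barcode",
]


def sketch_variant_key(mafdf: pd.DataFrame, sep: str = " ") -> pd.Series:
    """Hypothetical helper: one concatenated key string per MAF row."""
    return mafdf[VARIANT_KEY_COLS].astype(str).apply(sep.join, axis=1)
```

A boolean flag column can then be derived with `sketch_variant_key(mafdf).isin(flagged_variants)`.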
store_mutation_in_cis_files_to_staging(syn, center_mappingdf, variant_filtering_synId, genieVersion)
Stores the mutation in cis files to synapse per center
| PARAMETER | DESCRIPTION |
|---|---|
| syn | synapse client connection |
| center_mappingdf | center mapping dataframe |
| variant_filtering_synId | variant filtering synapse id |
| genieVersion | version of the genie pipeline run |
seq_assay_id_filter(clinicaldf)
(Deprecated) Remove samples that are part of SEQ_ASSAY_IDs with fewer than 50 samples
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Sample clinical dataframe |

| RETURNS | DESCRIPTION |
|---|---|
| | pd.Series: samples to remove |
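A minimal sketch of a panel-size filter like the one described, assuming the clinical dataframe has `SAMPLE_ID` and `SEQ_ASSAY_ID` columns. The function name and the `min_samples` parameter are hypothetical.

```python
import pandas as pd


def sketch_seq_assay_id_filter(clinicaldf: pd.DataFrame, min_samples: int = 50) -> pd.Series:
    """Hypothetical helper: samples belonging to under-populated SEQ_ASSAY_IDs."""
    counts = clinicaldf["SEQ_ASSAY_ID"].value_counts()
    small_panels = counts[counts < min_samples].index
    in_small_panel = clinicaldf["SEQ_ASSAY_ID"].isin(small_panels)
    return clinicaldf["SAMPLE_ID"][in_small_panel]
```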
no_genepanel_filter(clinicaldf, beddf)
Remove samples that don't have bed files associated with them
| PARAMETER | DESCRIPTION |
|---|---|
| clinicaldf | Clinical data |
| beddf | bed data |

| RETURNS | DESCRIPTION |
|---|---|
| | pd.Series: samples to remove |
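The lookup can be sketched as a membership test, assuming both dataframes carry a `SEQ_ASSAY_ID` column and the clinical dataframe has `SAMPLE_ID`. The function name is hypothetical.

```python
import pandas as pd


def sketch_no_genepanel_filter(clinicaldf: pd.DataFrame, beddf: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: samples whose SEQ_ASSAY_ID has no bed data."""
    has_bed = clinicaldf["SEQ_ASSAY_ID"].isin(beddf["SEQ_ASSAY_ID"])
    return clinicaldf["SAMPLE_ID"][~has_bed]
```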
store_gene_panel_files(syn, fileviewSynId, genieVersion, data_gene_panel, consortiumReleaseSynId, wes_seqassayids)
filter_out_germline_variants(input_data, status_col_str)
Filters out germline variants given a search string for the status column. The GENIE pipeline cannot have any of these variants. NOTE: The status column has to be searched for because there is no column name validation in the release steps, so the status column may have different casing.
| PARAMETER | DESCRIPTION |
|---|---|
| input_data | input data with germline variants to filter out |
| status_col_str | search string for the status column for the data |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | pd.DataFrame: filtered out germline variant data |
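A sketch of a case-insensitive status column lookup followed by the filter. It assumes germline rows are marked with the value `GERMLINE` and that data without a matching column is returned unchanged; both behaviors, and the function name, are assumptions for illustration.

```python
import pandas as pd


def sketch_filter_out_germline(input_data: pd.DataFrame, status_col_str: str) -> pd.DataFrame:
    """Hypothetical helper: drop rows whose status column says GERMLINE."""
    # Find the status column regardless of how its name is cased.
    matches = [
        col for col in input_data.columns
        if status_col_str.lower() in str(col).lower()
    ]
    if not matches:
        return input_data
    status = input_data[matches[0]].astype(str).str.upper()
    return input_data[status != "GERMLINE"]
```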
store_sv_files(syn, release_synid, genie_version, synid, keep_for_center_consortium_samples, keep_for_merged_consortium_samples, current_release_staging, center_mappingdf)
Create, filter, configure, and store structural variant file
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| release_synid | Synapse id to store release file |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| synid | SV database synid |
| keep_for_center_consortium_samples | Samples to keep for center files |
| keep_for_merged_consortium_samples | Samples to keep for merged file |
| current_release_staging | Staging flag |
| center_mappingdf | Center mapping dataframe |

| RETURNS | DESCRIPTION |
|---|---|
| | List of SV Samples |
append_or_create_release_maf(dataframe, filepath)
Creates a file with the dataframe or appends to an existing file.
| PARAMETER | DESCRIPTION |
|---|---|
| dataframe | pandas.DataFrame to write out |
| filepath | Filepath to append or create |
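The create-or-append pattern can be sketched as follows, assuming a tab-separated file and that the header is written only when the file is new or empty. The function name and the example file name are hypothetical.

```python
import os
import tempfile

import pandas as pd


def sketch_append_or_create(dataframe: pd.DataFrame, filepath: str) -> None:
    """Hypothetical helper: write with a header once, then append rows only."""
    if not os.path.exists(filepath) or os.path.getsize(filepath) == 0:
        dataframe.to_csv(filepath, sep="\t", index=False)
    else:
        # Appended chunks must use the same column order as the first write.
        dataframe.to_csv(filepath, sep="\t", index=False, header=False, mode="a")


# Usage: two chunks written to the same file end up as one table.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_maf.txt")
    sketch_append_or_create(pd.DataFrame({"Hugo_Symbol": ["TP53"], "Center": ["A"]}), path)
    sketch_append_or_create(pd.DataFrame({"Hugo_Symbol": ["KRAS"], "Center": ["B"]}), path)
    combined = pd.read_csv(path, sep="\t")
```

Writing in chunks this way keeps memory use bounded when a merged file is assembled from many per-center files.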
store_maf_files(syn, genie_version, flatfiles_view_synid, release_synid, clinicaldf, center_mappingdf, keep_for_merged_consortium_samples, keep_for_center_consortium_samples, remove_mafinbed_variants, flagged_mutationInCis_variants, current_release_staging)
Create, filter, configure, and store maf file
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| flatfiles_view_synid | Synapse id of fileview with all the flat files |
| release_synid | Synapse id to store release file |
| clinicaldf | Clinical dataframe with SAMPLE_ID and CENTER |
| center_mappingdf | Center mapping dataframe |
| keep_for_merged_consortium_samples | Samples to keep for merged file |
| keep_for_center_consortium_samples | Samples to keep for center files |
| remove_mafinbed_variants | Variants to remove |
| flagged_mutationInCis_variants | Variants to flag |
| current_release_staging | Staging flag |
run_genie_filters(syn, genie_version, variant_filtering_synId, clinicaldf, beddf, center_mappingdf, processing_date, skip_mutationsincis, consortium_release_cutoff, test, current_release_staging)
Runs GENIE filters and returns the variants and samples to remove
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| variant_filtering_synId | Synapse id of mutationInCis table |
| clinicaldf | Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID |
| beddf | Bed dataframe |
| center_mappingdf | Center mapping dataframe |
| processing_date | Processing date |
| skip_mutationsincis | Skip mutation in cis filter |
| consortium_release_cutoff | Release cutoff in days |
| test | Test flag |
| current_release_staging | Staging flag |

| RETURNS | DESCRIPTION |
|---|---|
| | pandas.Series: Variants to remove |
| set | samples to remove for release files |
| set | samples to remove for center files |
| | pandas.Series: Variants to flag |
store_assay_info_files(syn, genie_version, assay_info_synid, clinicaldf, release_synid)
Creates and stores assay information, and gets the WES panel list
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| assay_info_synid | Assay information database synid |
| clinicaldf | Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID |
| release_synid | Synapse id to store release file |

| RETURNS | DESCRIPTION |
|---|---|
| | List of whole exome sequencing SEQ_ASSAY_IDs |
store_clinical_files(syn, genie_version, clinicaldf, oncotree_url, sample_cols, patient_cols, remove_center_consortium_samples, remove_merged_consortium_samples, release_synid, current_release_staging, center_mappingdf, databaseSynIdMappingDf, used=None)
Create, filter, configure, and store clinical file
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| clinicaldf | Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID |
| oncotree_url | Oncotree URL |
| sample_cols | Clinical sample columns |
| patient_cols | Clinical patient columns |
| remove_center_consortium_samples | Samples to remove for center files |
| remove_merged_consortium_samples | Samples to remove for merged file |
| release_synid | Synapse id to store release file |
| current_release_staging | Staging flag |
| center_mappingdf | Center mapping dataframe |
| databaseSynIdMappingDf | Database to Synapse Id mapping |

| RETURNS | DESCRIPTION |
|---|---|
| | pandas.DataFrame: configured clinical dataframe |
| | pandas.Series: samples to keep for center files |
| | pandas.Series: samples to keep for release files |
store_cna_files(syn, flatfiles_view_synid, keep_for_center_consortium_samples, keep_for_merged_consortium_samples, center_mappingdf, genie_version, release_synid, current_release_staging)
Create, filter and store cna file
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| flatfiles_view_synid | Synapse id of fileview with all the flat files |
| keep_for_center_consortium_samples | Samples to keep for center files |
| keep_for_merged_consortium_samples | Samples to keep for merged file |
| center_mappingdf | Center mapping dataframe |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| release_synid | Synapse id to store release file |
| current_release_staging | Staging flag |

| RETURNS | DESCRIPTION |
|---|---|
| list | CNA samples |
store_seg_files(syn, genie_version, seg_synid, release_synid, keep_for_center_consortium_samples, keep_for_merged_consortium_samples, center_mappingdf, current_release_staging)
Create, filter and store seg file
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| seg_synid | Seg database synid |
| release_synid | Synapse id to store release file |
| keep_for_center_consortium_samples | Samples to keep for center files |
| keep_for_merged_consortium_samples | Samples to keep for merged file |
| center_mappingdf | Center mapping dataframe |
| current_release_staging | Staging flag |
store_data_gene_matrix(syn, genie_version, clinicaldf, cna_samples, release_synid, wes_seqassayids, sv_samples)
Create and store data gene matrix file
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| clinicaldf | Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID |
| cna_samples | Samples with CNA |
| release_synid | Synapse id to store release file |
| wes_seqassayids | Whole exome sequencing SEQ_ASSAY_IDs |
| sv_samples | Samples with SV |

| RETURNS | DESCRIPTION |
|---|---|
| | pandas.DataFrame: data gene matrix dataframe |
store_bed_files(syn, genie_version, beddf, seq_assay_ids, center_mappingdf, current_release_staging, release_synid, used=None)
Stores bed files and the bed regions that had symbols remapped. Filters the bed file by the seq assays in the clinical dataframe.
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version (e.g. v6.1-consortium) |
| beddf | Bed dataframe |
| seq_assay_ids | All SEQ_ASSAY_IDs in the clinical file |
| center_mappingdf | Center mapping dataframe |
| current_release_staging | Staging flag |
| release_synid | Synapse id to store release file |
stagingToCbio(syn, processingDate, genieVersion, CENTER_MAPPING_DF, databaseSynIdMappingDf, oncotree_url=None, consortiumReleaseCutOff=183, current_release_staging=False, skipMutationsInCis=False, test=False)
Main function that takes the GENIE database and creates release files
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| processingDate | Processing date in the form Apr-XXXX |
| genieVersion | GENIE version |
| CENTER_MAPPING_DF | center mapping dataframe |
| databaseSynIdMappingDf | Database to Synapse Id mapping |
| oncotree_url | Oncotree link. Default is None. |
| consortiumReleaseCutOff | Release cut off days. Default is 183. |
| current_release_staging | Whether this is a staging run. Default is False. |
| skipMutationsInCis | Skip mutation in cis filter. Default is False. |
| test | Testing parameter. Default is False. |

| RETURNS | DESCRIPTION |
|---|---|
| list | Gene panel entities |
revise_metadata_files(syn, consortiumid, genie_version=None)
Rewrite metadata files with the correct GENIE version
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| consortiumid | Synapse id of consortium release folder |
| genie_version | GENIE version. Defaults to None. |
search_or_create_folder(syn, parentid, folder_name)
Searches for an existing Synapse Folder given a parent id and creates the Synapse folder if it doesn't exist
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse connection |
| parentid | Synapse Id of a project or folder |
| folder_name | Folder name |

| RETURNS | DESCRIPTION |
|---|---|
| str | Synapse Folder id |
create_link_version(syn, genie_version, case_list_entities, gene_panel_entities, database_synid_mappingdf, release_type='consortium')
Create release links from the actual entity and version
TODO: Refactor to use fileviews
| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| genie_version | GENIE version number |
| case_list_entities | Case list entities |
| gene_panel_entities | Gene panel entities |
| database_synid_mappingdf | dataframe containing database to synapse id mapping |
| release_type | 'consortium' or 'public' release. Defaults to 'consortium'. |