
Database to Staging

genie.database_to_staging

Functions for creating GENIE consortium releases

Attributes

logger = logging.getLogger(__name__) module-attribute

GENIE_RELEASE_DIR = os.path.join(os.path.expanduser('~/.synapseCache'), 'GENIE_release') module-attribute

CASE_LIST_PATH = os.path.join(GENIE_RELEASE_DIR, 'case_lists') module-attribute

CNA_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_CNA_%s.txt') module-attribute

SAMPLE_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_clinical_supp_sample_%s.txt') module-attribute

PATIENT_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_clinical_supp_patient_%s.txt') module-attribute

MUTATIONS_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_mutations_extended_%s.txt') module-attribute

FUSIONS_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_fusions_%s.txt') module-attribute

SEG_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_cna_hg19_%s.seg') module-attribute

SV_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, 'data_sv_%s.txt') module-attribute

BED_DIFFS_SEQASSAY_PATH = os.path.join(GENIE_RELEASE_DIR, 'diff_%s.csv') module-attribute
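The per-center path attributes above are `%`-style templates with a single `%s` placeholder. A minimal sketch of how one might be filled in, assuming the placeholder takes a center abbreviation (the value `"SAGE"` is purely illustrative):

```python
import os

# Same construction as the module attributes documented above
GENIE_RELEASE_DIR = os.path.join(
    os.path.expanduser("~/.synapseCache"), "GENIE_release"
)
CNA_CENTER_PATH = os.path.join(GENIE_RELEASE_DIR, "data_CNA_%s.txt")

# "SAGE" is a hypothetical center abbreviation used only for illustration
center = "SAGE"
cna_path = CNA_CENTER_PATH % center
print(cna_path)
```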

FULL_MAF_RELEASE_COLUMNS = ['Hugo_Symbol', 'Entrez_Gene_Id', 'Center', 'NCBI_Build', 'Chromosome', 'Start_Position', 'End_Position', 'Strand', 'Consequence', 'Variant_Classification', 'Variant_Type', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', 'dbSNP_RS', 'dbSNP_Val_Status', 'Tumor_Sample_Barcode', 'Matched_Norm_Sample_Barcode', 'Match_Norm_Seq_Allele1', 'Match_Norm_Seq_Allele2', 'Tumor_Validation_Allele1', 'Tumor_Validation_Allele2', 'Match_Norm_Validation_Allele1', 'Match_Norm_Validation_Allele2', 'Verification_Status', 'Validation_Status', 'Mutation_Status', 'Sequencing_Phase', 'Sequence_Source', 'Validation_Method', 'Score', 'BAM_File', 'Sequencer', 't_ref_count', 't_alt_count', 'n_ref_count', 'n_alt_count', 'HGVSc', 'HGVSp', 'HGVSp_Short', 'Transcript_ID', 'RefSeq', 'Protein_position', 'Codons', 'Exon_Number', 'gnomAD_AF', 'gnomAD_AFR_AF', 'gnomAD_AMR_AF', 'gnomAD_ASJ_AF', 'gnomAD_EAS_AF', 'gnomAD_FIN_AF', 'gnomAD_NFE_AF', 'gnomAD_OTH_AF', 'gnomAD_SAS_AF', 'FILTER', 'Polyphen_Prediction', 'Polyphen_Score', 'SIFT_Prediction', 'SIFT_Score', 'SWISSPROT', 'n_depth', 't_depth', 'Annotation_Status', 'mutationInCis_Flag'] module-attribute

DEPRECATED_CONSORTIUM_RELEASE_FILES = ['data_fusions.txt', 'meta_fusions.txt'] module-attribute

Functions

_to_redact_interval(df_col)

Determines year values that are "<18" and interval values that are >89, both of which need to be redacted. Returns boolean vectors because BIRTH_YEAR needs to be redacted as well, based on the results.

PARAMETER DESCRIPTION
df_col

Dataframe column/pandas.Series of an interval column

TYPE: Series

RETURNS DESCRIPTION
tuple

pandas.Series: to redact boolean vector

pandas.Series: to redact pediatric boolean vector

TYPE: Tuple[Series, Series]
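A minimal sketch of how two such boolean vectors could be derived. It assumes interval values are recorded in days, that some values arrive pre-redacted with a `<` or `>` marker, and that the cutoffs are 18 and 89 years expressed in days. The helper name `flag_intervals_to_redact` and both constants are illustrative and not taken from the module.

```python
import pandas as pd

# Assumed cutoffs: intervals are taken to be in days (hypothetical constants)
PHI_CUTOFF_DAYS = 365 * 89
PEDIATRIC_CUTOFF_DAYS = 365 * 18


def flag_intervals_to_redact(df_col: pd.Series):
    """Hypothetical helper: return (to_redact, to_redact_pediatric) masks."""
    as_text = df_col.astype(str)
    has_greater = as_text.str.contains(">", na=False)
    has_less = as_text.str.contains("<", na=False)
    # Non-numeric entries such as "<6570" become NaN and fail both comparisons
    numeric = pd.to_numeric(df_col, errors="coerce")
    to_redact = (numeric > PHI_CUTOFF_DAYS) | has_greater
    to_redact_pediatric = (numeric < PEDIATRIC_CUTOFF_DAYS) | has_less
    return to_redact, to_redact_pediatric


intervals = pd.Series([20000, 33000, 5000, ">32485", "<6570", None])
to_redact, to_redact_pediatric = flag_intervals_to_redact(intervals)
```

Returning two separate masks lets a caller treat the over-89 and pediatric cases differently, for example when deciding how to redact a related year column.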

_redact_year(df_col)

Redacts year values that have < or >

PARAMETER DESCRIPTION
df_col

Dataframe column/pandas.Series of a year column

TYPE: Series

RETURNS DESCRIPTION
Series

pandas.Series: Redacted series

_redact_ped_year(df_col)

Redacts year values that have <

PARAMETER DESCRIPTION
df_col

Dataframe column/pandas.Series of a year column

TYPE: Series

RETURNS DESCRIPTION
Series

pandas.Series: Redacted series

_to_redact_difference(df_col_year1, df_col_year2)

Determine if difference between year2 and year1 is > 89

PARAMETER DESCRIPTION
df_col_year1

Dataframe column/pandas.Series of a year column

TYPE: Series

df_col_year2

Dataframe column/pandas.Series of a year column

TYPE: Series

RETURNS DESCRIPTION
Series

pandas.Series: to redact boolean vector
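The year-difference check can be sketched as follows. This is an illustration under the assumption that unparseable years should simply not be flagged; the helper name `flag_year_difference` is hypothetical.

```python
import pandas as pd


def flag_year_difference(year1: pd.Series, year2: pd.Series) -> pd.Series:
    """Hypothetical helper: True where year2 - year1 exceeds 89."""
    y1 = pd.to_numeric(year1, errors="coerce")
    y2 = pd.to_numeric(year2, errors="coerce")
    # NaN differences compare as False, so unparseable years are not flagged
    return (y2 - y1) > 89


birth_year = pd.Series([1920, 1950, "Unknown"])
contact_year = pd.Series([2015, 2015, 2015])
flags = flag_year_difference(birth_year, contact_year)
```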

redact_phi(clinicaldf, interval_cols_to_redact=['AGE_AT_SEQ_REPORT', 'INT_CONTACT', 'INT_DOD'])

Redacts the PHI by re-annotating the clinical file

PARAMETER DESCRIPTION
clinicaldf

merged clinical dataframe

TYPE: DataFrame

interval_cols_to_redact

List of interval columns to redact. Defaults to ['AGE_AT_SEQ_REPORT', 'INT_CONTACT', 'INT_DOD']

TYPE: list DEFAULT: ['AGE_AT_SEQ_REPORT', 'INT_CONTACT', 'INT_DOD']

RETURNS DESCRIPTION
DataFrame

pandas.DataFrame: Redacted clinical dataframe

remove_maf_samples(mafdf, keep_samples)

Remove samples from maf file

PARAMETER DESCRIPTION
mafdf

Maf dataframe

TYPE: DataFrame

keep_samples

Samples to keep

TYPE: list

RETURNS DESCRIPTION
DataFrame

Filtered maf dataframe
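A minimal sketch of this kind of sample filter, assuming samples are identified by the `Tumor_Sample_Barcode` column (one of the columns in `FULL_MAF_RELEASE_COLUMNS`). The helper name `keep_maf_samples` and the sample IDs are illustrative only.

```python
import pandas as pd


def keep_maf_samples(mafdf: pd.DataFrame, keep_samples: list) -> pd.DataFrame:
    """Hypothetical helper: keep only rows whose sample is in keep_samples."""
    keep_idx = mafdf["Tumor_Sample_Barcode"].isin(keep_samples)
    return mafdf[keep_idx]


maf = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS", "EGFR"],
        "Tumor_Sample_Barcode": ["S1", "S2", "S3"],
    }
)
filtered = keep_maf_samples(maf, ["S1", "S3"])
```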

get_whitelist_variants_idx(mafdf)

Get boolean vector for variants that are known somatic sites. This is used to override the germline filter.

configure_maf(mafdf, remove_variants, flagged_variants)

Configures each maf dataframe, does germline filtering

Germline filtering for MAF files uses the gnomAD (Genome Aggregation Database) columns, which hold the allele frequencies (AF) of variants in different population groups. The filter removes variants whose maximum AF across all populations is > 0.05%, since these are typically common germline variants.

Germline filtering for MAF files occurs during release instead of during processing because the MAF file gets re-annotated during processing via genome nexus annotation.

PARAMETER DESCRIPTION
mafdf

Maf dataframe

remove_variants

Variants to remove

flagged_variants

Variants to flag

RETURNS DESCRIPTION

configured maf row
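The germline filter described above can be sketched as a row-wise maximum over the population AF columns. This sketch assumes AF values are stored as fractions (so 0.05% is 0.0005) and uses the per-population gnomAD columns listed in `FULL_MAF_RELEASE_COLUMNS`. The helper name `flag_common_variants` is hypothetical, and the sketch leaves out the known-somatic-site override.

```python
import pandas as pd

# Assumption: AF values are fractions, so 0.05% is written as 0.0005
GNOMAD_AF_THRESHOLD = 0.0005
POPULATION_AF_COLS = [
    "gnomAD_AFR_AF", "gnomAD_AMR_AF", "gnomAD_ASJ_AF", "gnomAD_EAS_AF",
    "gnomAD_FIN_AF", "gnomAD_NFE_AF", "gnomAD_OTH_AF", "gnomAD_SAS_AF",
]


def flag_common_variants(mafdf: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: True where the max population AF exceeds 0.05%."""
    cols = [col for col in POPULATION_AF_COLS if col in mafdf.columns]
    af = mafdf[cols].apply(pd.to_numeric, errors="coerce")
    # Rows with no AF values yield NaN, which compares as False (kept)
    return af.max(axis=1, skipna=True) > GNOMAD_AF_THRESHOLD


maf = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS", "EGFR"],
        "gnomAD_AFR_AF": [0.0001, 0.2, None],
        "gnomAD_NFE_AF": [0.0004, 0.0, None],
    }
)
common = flag_common_variants(maf)
somatic_candidates = maf[~common]
```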

calculate_missing_variant_counts(depth, alt_count, ref_count)

Calculate missing counts. t_depth = t_alt_count + t_ref_count

PARAMETER DESCRIPTION
depth

Allele Depth

TYPE: Series

alt_count

Allele alt counts

TYPE: Series

ref_count

Allele ref counts

TYPE: Series

RETURNS DESCRIPTION
dict

filled in depth, alt_count and ref_count values
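Given the relation `t_depth = t_alt_count + t_ref_count`, any one missing value can be recovered from the other two. A minimal sketch, assuming the returned dict is keyed by the parameter names; the helper name `fill_missing_counts` is hypothetical.

```python
import pandas as pd


def fill_missing_counts(depth: pd.Series, alt_count: pd.Series, ref_count: pd.Series) -> dict:
    """Hypothetical helper: fill each missing value from the other two."""
    filled_depth = depth.fillna(alt_count + ref_count)
    filled_alt = alt_count.fillna(depth - ref_count)
    filled_ref = ref_count.fillna(depth - alt_count)
    # Rows missing two or more values cannot be recovered and stay NaN
    return {"depth": filled_depth, "alt_count": filled_alt, "ref_count": filled_ref}


counts = fill_missing_counts(
    depth=pd.Series([10, None, 30, None]),
    alt_count=pd.Series([4, 5, None, None]),
    ref_count=pd.Series([None, 7, 12, 3]),
)
```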

runMAFinBED(syn, center_mappingdf, test=False, staging=False, genieVersion='test')

Run MAF in BED script, filter data and update MAFinBED files

PARAMETER DESCRIPTION
syn

Synapse client connection

TYPE: Synapse

center_mappingdf

center mapping dataframe

TYPE: DataFrame

test

Testing parameter. Defaults to False.

TYPE: bool DEFAULT: False

staging

Staging parameter. Defaults to False.

TYPE: bool DEFAULT: False

genieVersion

GENIE version. Defaults to "test".

TYPE: str DEFAULT: 'test'

RETURNS DESCRIPTION
Series

pd.Series: Variants to remove

get_run_maf_in_bed_script_cmd(notinbed_file, script_dir, test, staging)

This function gets the MAFinBED R script command call based on whether we are running in test, staging or production mode

PARAMETER DESCRIPTION
notinbed_file

Full path to the notinbed csv

TYPE: str

script_dir

directory of the MAFinBED script

TYPE: str

test

Testing parameter.

TYPE: bool

staging

Staging parameter

TYPE: bool

RETURNS DESCRIPTION
str

Full command call for the MAFinBED script

TYPE: str

store_maf_in_bed_filtered_variants(syn, notinbed_file, center_mapping_df, genie_version)

Retrieves the filtered variants generated by running the MAFinBED script and stores them as files on Synapse, then returns the not in bed file with the removed variants column added. TODO: Add handling for empty file?

PARAMETER DESCRIPTION
syn

Synapse client connection

TYPE: Synapse

notinbed_file

input file

TYPE: str

center_mapping_df

the center mapping dataframe

TYPE: DataFrame

genie_version

version of this genie run

TYPE: str

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: filtered variants dataframe with removed variants column added

seq_date_filter(clinicalDf, processingDate, consortiumReleaseCutOff)

Filter samples by seq date

PARAMETER DESCRIPTION
clinicalDf

Clinical dataframe

processingDate

Processing date in form of Apr-XXXX

consortiumReleaseCutOff

Release cut off days

RETURNS DESCRIPTION
list

Samples to remove
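A rough sketch of a sequencing-date cutoff of this kind. It assumes the clinical dataframe has `SAMPLE_ID` and `SEQ_DATE` columns, that `SEQ_DATE` uses the same `Mon-YYYY` form as the processing date, and that samples sequenced fewer than the cutoff number of days before processing are removed. The helper name and these details are assumptions, not the module's confirmed behavior.

```python
import pandas as pd


def recent_seq_date_samples(clinicaldf: pd.DataFrame, processing_date: str, cutoff_days: int) -> list:
    """Hypothetical helper: SAMPLE_IDs sequenced within cutoff_days of processing."""
    processed = pd.to_datetime(processing_date, format="%b-%Y")
    seq_dates = pd.to_datetime(clinicaldf["SEQ_DATE"], format="%b-%Y", errors="coerce")
    days_since_seq = (processed - seq_dates).dt.days
    # Unparseable dates give NaN, which compares as False (sample is kept)
    too_recent = days_since_seq < cutoff_days
    return clinicaldf["SAMPLE_ID"][too_recent].tolist()


clinical = pd.DataFrame(
    {
        "SAMPLE_ID": ["S1", "S2", "S3"],
        "SEQ_DATE": ["Jan-2017", "Oct-2016", "Jun-2017"],
    }
)
removed = recent_seq_date_samples(clinical, "Jul-2017", 183)
```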

mutation_in_cis_filter(syn, skipMutationsInCis, variant_filtering_synId, center_mappingdf, genieVersion, test=False, staging=False)

Run mutation in cis filter, look up samples to remove.

The mutation in cis script ONLY runs WHEN the skipMutationsInCis parameter is FALSE

PARAMETER DESCRIPTION
syn

Synapse object

skipMutationsInCis

Skip this filter

variant_filtering_synId

Synapse id of the variant filtering (mergeCheck) table

center_mappingdf

center mapping dataframe

genieVersion

GENIE version.

test

Testing parameter. Default is False.

DEFAULT: False

staging

Staging parameter. Default is False.

DEFAULT: False

RETURNS DESCRIPTION

pd.Series: Samples to remove

get_mutation_in_cis_filter_script_cmd(test, staging)

This function gets the mutation_in_cis_filter R script command call based on whether we are running in test, staging or production mode

PARAMETER DESCRIPTION
test

Testing parameter.

TYPE: bool

staging

Staging parameter.

TYPE: bool

RETURNS DESCRIPTION
str

Full command call for the mergeCheck script

TYPE: str

get_mutation_in_cis_filtered_samples(syn, variant_filtering_synId)

Gets the samples to remove in our variant filtering table TODO: Add handling for when we have 0 row query results

PARAMETER DESCRIPTION
syn

synapse client connection

TYPE: Synapse

variant_filtering_synId

variant filtering synapse id

TYPE: str

RETURNS DESCRIPTION
list

pd.Series: removed samples

get_mutation_in_cis_flagged_variants(syn, variant_filtering_synId)

Gets the flagged variants in our variant filtering table which is a unique string concatenation of the Chromosome, Start_Position, HGVSp_Short, Reference_Allele, Tumor_Seq_Allele2 and Tumor_Sample_Barcode columns TODO: Add handling for when we have 0 row query results

PARAMETER DESCRIPTION
syn

synapse client connection

TYPE: Synapse

variant_filtering_synId

variant filtering synapse id

TYPE: str

RETURNS DESCRIPTION
Series

pd.Series: flagged variants
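The flagged-variant identifier is described as a string concatenation of six columns. A minimal sketch of building such a key from a dataframe, assuming a single space as the separator (the actual separator is not stated here); the helper name `build_variant_key` and the example row are illustrative.

```python
import pandas as pd

# Columns named in the description above, in the same order
FLAG_KEY_COLS = [
    "Chromosome", "Start_Position", "HGVSp_Short",
    "Reference_Allele", "Tumor_Seq_Allele2", "Tumor_Sample_Barcode",
]


def build_variant_key(df: pd.DataFrame, sep: str = " ") -> pd.Series:
    """Hypothetical helper: one concatenated key per row (separator assumed)."""
    return df[FLAG_KEY_COLS].astype(str).apply(sep.join, axis=1)


variants = pd.DataFrame(
    {
        "Chromosome": ["17"],
        "Start_Position": [7577120],
        "HGVSp_Short": ["p.R273H"],
        "Reference_Allele": ["C"],
        "Tumor_Seq_Allele2": ["T"],
        "Tumor_Sample_Barcode": ["GENIE-TEST-0001"],
    }
)
keys = build_variant_key(variants)
```

Including the sample barcode in the key makes each flag specific to one sample rather than to a genomic position.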

store_mutation_in_cis_files_to_staging(syn, center_mappingdf, variant_filtering_synId, genieVersion)

Stores the mutation in cis files to synapse per center

PARAMETER DESCRIPTION
syn

synapse client connection

TYPE: Synapse

center_mappingdf

center mapping dataframe

TYPE: DataFrame

variant_filtering_synId

variant filtering synapse id

TYPE: str

genieVersion

version of the genie pipeline run

TYPE: str

seq_assay_id_filter(clinicaldf)

(Deprecated) Remove samples that are part of SEQ_ASSAY_IDs with less than 50 samples

PARAMETER DESCRIPTION
clinicaldf

Sample clinical dataframe

RETURNS DESCRIPTION

pd.Series: samples to remove
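A minimal sketch of a panel-size filter like the one described, assuming the clinical dataframe has `SAMPLE_ID` and `SEQ_ASSAY_ID` columns. The helper name `small_panel_samples` is hypothetical.

```python
import pandas as pd


def small_panel_samples(clinicaldf: pd.DataFrame, min_samples: int = 50) -> pd.Series:
    """Hypothetical helper: SAMPLE_IDs whose panel has fewer than min_samples."""
    panel_sizes = clinicaldf.groupby("SEQ_ASSAY_ID")["SAMPLE_ID"].transform("count")
    return clinicaldf["SAMPLE_ID"][panel_sizes < min_samples]


clinical = pd.DataFrame(
    {
        "SAMPLE_ID": [f"A-{i}" for i in range(50)] + ["B-0", "B-1"],
        "SEQ_ASSAY_ID": ["PANEL-A"] * 50 + ["PANEL-B"] * 2,
    }
)
removed = small_panel_samples(clinical)
```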

no_genepanel_filter(clinicaldf, beddf)

Remove samples that don't have bed files associated with them

PARAMETER DESCRIPTION
clinicaldf

Clinical data

beddf

bed data

RETURNS DESCRIPTION

pd.Series: samples to remove
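One way to find samples without an associated bed file is to compare panel identifiers between the two dataframes. This sketch assumes both dataframes carry a `SEQ_ASSAY_ID` column and that the clinical dataframe has `SAMPLE_ID`; the helper name `samples_without_panel` is hypothetical.

```python
import pandas as pd


def samples_without_panel(clinicaldf: pd.DataFrame, beddf: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: SAMPLE_IDs whose SEQ_ASSAY_ID has no bed rows."""
    has_bed = clinicaldf["SEQ_ASSAY_ID"].isin(beddf["SEQ_ASSAY_ID"])
    return clinicaldf["SAMPLE_ID"][~has_bed]


clinical = pd.DataFrame(
    {"SAMPLE_ID": ["S1", "S2", "S3"], "SEQ_ASSAY_ID": ["P1", "P2", "P3"]}
)
bed = pd.DataFrame({"SEQ_ASSAY_ID": ["P1", "P1", "P3"]})
removed = samples_without_panel(clinical, bed)
```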

store_gene_panel_files(syn, fileviewSynId, genieVersion, data_gene_panel, consortiumReleaseSynId, wes_seqassayids)

filter_out_germline_variants(input_data, status_col_str)

Filters out germline variants given a status col str. Genie pipeline cannot have any of these variants. NOTE: We have to search for the status column because there's no column name validation in the release steps so the status column may have different casing.

PARAMETER DESCRIPTION
input_data

input data with germline variants to filter out

TYPE: DataFrame

status_col_str

search string for the status column for the data

TYPE: str

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: filtered out germline variant data
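Because the status column's casing may vary, the lookup has to be case-insensitive. A minimal sketch, assuming the status column is found by a case-insensitive substring match and that germline rows carry the value `GERMLINE` in any casing. The helper name `drop_germline_rows` and the example column name are illustrative.

```python
import pandas as pd


def drop_germline_rows(input_data: pd.DataFrame, status_col_str: str) -> pd.DataFrame:
    """Hypothetical helper: drop rows whose status column says germline."""
    matches = [
        col for col in input_data.columns if status_col_str.lower() in col.lower()
    ]
    if not matches:
        # No status column found: nothing to filter on
        return input_data
    status = input_data[matches[0]].astype(str).str.upper()
    return input_data[status != "GERMLINE"]


sv = pd.DataFrame(
    {
        "SAMPLE_ID": ["S1", "S2", "S3"],
        "sv_status": ["SOMATIC", "GERMLINE", "somatic"],
    }
)
somatic_only = drop_germline_rows(sv, "SV_Status")
```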

store_sv_files(syn, release_synid, genie_version, synid, keep_for_center_consortium_samples, keep_for_merged_consortium_samples, current_release_staging, center_mappingdf)

Create, filter, configure, and store structural variant file

PARAMETER DESCRIPTION
syn

Synapse object

TYPE: Synapse

release_synid

Synapse id to store release file

TYPE: str

genie_version

GENIE version (e.g. v6.1-consortium)

TYPE: str

synid

SV database synid

TYPE: str

keep_for_center_consortium_samples

Samples to keep for center files

TYPE: List[str]

keep_for_merged_consortium_samples

Samples to keep for merged file

TYPE: List[str]

current_release_staging

Staging flag

TYPE: bool

center_mappingdf

Center mapping dataframe

TYPE: DataFrame

RETURNS DESCRIPTION

List of SV Samples

append_or_create_release_maf(dataframe, filepath)

Creates a file with the dataframe or appends to an existing file.

PARAMETER DESCRIPTION
dataframe

pandas.DataFrame to write out

filepath

Filepath to append or create

TYPE: str
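The append-or-create pattern can be sketched with `DataFrame.to_csv`, writing the header only when the file is first created. This assumes a tab-separated file; the helper name `append_or_create` and the file name are illustrative.

```python
import os
import tempfile

import pandas as pd


def append_or_create(dataframe: pd.DataFrame, filepath: str) -> None:
    """Hypothetical helper: create the file, or append rows without a header."""
    file_exists = os.path.exists(filepath)
    dataframe.to_csv(
        filepath,
        sep="\t",
        index=False,
        mode="a" if file_exists else "w",
        header=not file_exists,
    )


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data_mutations_extended.txt")
    append_or_create(pd.DataFrame({"Hugo_Symbol": ["TP53"], "Center": ["C1"]}), path)
    append_or_create(pd.DataFrame({"Hugo_Symbol": ["KRAS"], "Center": ["C2"]}), path)
    combined = pd.read_csv(path, sep="\t")
```

Appending this way only produces a valid file when every chunk has the same columns in the same order, which is one reason to reorder each dataframe to a fixed column list before writing.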

store_maf_files(syn, genie_version, flatfiles_view_synid, release_synid, clinicaldf, center_mappingdf, keep_for_merged_consortium_samples, keep_for_center_consortium_samples, remove_mafinbed_variants, flagged_mutationInCis_variants, current_release_staging)

Create, filter, configure, and store maf file

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

flatfiles_view_synid

Synapse id of fileview with all the flat files

release_synid

Synapse id to store release file

clinicaldf

Clinical dataframe with SAMPLE_ID and CENTER

center_mappingdf

Center mapping dataframe

keep_for_merged_consortium_samples

Samples to keep for merged file

keep_for_center_consortium_samples

Samples to keep for center files

remove_mafinbed_variants

Variants to remove

flagged_mutationInCis_variants

Variants to flag

current_release_staging

Staging flag

run_genie_filters(syn, genie_version, variant_filtering_synId, clinicaldf, beddf, center_mappingdf, processing_date, skip_mutationsincis, consortium_release_cutoff, test, current_release_staging)

Run GENIE filters and returns variants and samples to remove

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

variant_filtering_synId

Synapse id of mutationInCis table

clinicaldf

Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID

beddf

Bed dataframe

center_mappingdf

Center mapping dataframe

processing_date

Processing date

skip_mutationsincis

Skip mutation in cis filter

consortium_release_cutoff

Release cutoff in days

test

Test flag

current_release_staging

Staging flag

RETURNS DESCRIPTION

pandas.Series: Variants to remove

set

samples to remove for release files

set

samples to remove for center files

pandas.Series: Variants to flag

store_assay_info_files(syn, genie_version, assay_info_synid, clinicaldf, release_synid)

Creates, stores assay information and gets WES panel list

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

assay_info_synid

Assay information database synid

clinicaldf

Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID

release_synid

Synapse id to store release file

RETURNS DESCRIPTION

List of whole exome sequencing SEQ_ASSAY_IDs

store_clinical_files(syn, genie_version, clinicaldf, oncotree_url, sample_cols, patient_cols, remove_center_consortium_samples, remove_merged_consortium_samples, release_synid, current_release_staging, center_mappingdf, databaseSynIdMappingDf, used=None)

Create, filter, configure, and store clinical file

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

clinicaldf

Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID

oncotree_url

Oncotree URL

sample_cols

Clinical sample columns

patient_cols

Clinical patient columns

remove_center_consortium_samples

Samples to remove for center files

remove_merged_consortium_samples

Samples to remove for merged file

release_synid

Synapse id to store release file

current_release_staging

Staging flag

center_mappingdf

Center mapping dataframe

databaseSynIdMappingDf

Database to Synapse Id mapping

RETURNS DESCRIPTION

pandas.DataFrame: configured clinical dataframe

pandas.Series: samples to keep for center files

pandas.Series: samples to keep for release files

store_cna_files(syn, flatfiles_view_synid, keep_for_center_consortium_samples, keep_for_merged_consortium_samples, center_mappingdf, genie_version, release_synid, current_release_staging)

Create, filter and store cna file

PARAMETER DESCRIPTION
syn

Synapse object

flatfiles_view_synid

Synapse id of fileview with all the flat files

keep_for_center_consortium_samples

Samples to keep for center files

keep_for_merged_consortium_samples

Samples to keep for merged file

center_mappingdf

Center mapping dataframe

genie_version

GENIE version (e.g. v6.1-consortium)

release_synid

Synapse id to store release file

RETURNS DESCRIPTION
list

CNA samples

store_seg_files(syn, genie_version, seg_synid, release_synid, keep_for_center_consortium_samples, keep_for_merged_consortium_samples, center_mappingdf, current_release_staging)

Create, filter and store seg file

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

seg_synid

Seg database synid

release_synid

Synapse id to store release file

keep_for_center_consortium_samples

Samples to keep for center files

keep_for_merged_consortium_samples

Samples to keep for merged file

center_mappingdf

Center mapping dataframe

current_release_staging

Staging flag

store_data_gene_matrix(syn, genie_version, clinicaldf, cna_samples, release_synid, wes_seqassayids, sv_samples)

Create and store data gene matrix file

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

clinicaldf

Clinical dataframe with SAMPLE_ID and SEQ_ASSAY_ID

cna_samples

Samples with CNA

release_synid

Synapse id to store release file

wes_seqassayids

Whole exome sequencing SEQ_ASSAY_IDs

sv_samples

Samples with SV

RETURNS DESCRIPTION

pandas.DataFrame: data gene matrix dataframe

store_bed_files(syn, genie_version, beddf, seq_assay_ids, center_mappingdf, current_release_staging, release_synid, used=None)

Store bed files and the bed regions that had symbols remapped. Filters the bed file by the SEQ_ASSAY_IDs in the clinical dataframe.

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version (e.g. v6.1-consortium)

beddf

Bed dataframe

seq_assay_ids

All SEQ_ASSAY_IDs in the clinical file

center_mappingdf

Center mapping dataframe

current_release_staging

Staging flag

release_synid

Synapse id to store release file

stagingToCbio(syn, processingDate, genieVersion, CENTER_MAPPING_DF, databaseSynIdMappingDf, oncotree_url=None, consortiumReleaseCutOff=183, current_release_staging=False, skipMutationsInCis=False, test=False)

Main function that takes the GENIE database and creates release files

PARAMETER DESCRIPTION
syn

Synapse object

processingDate

Processing date in form of Apr-XXXX

genieVersion

GENIE version.

CENTER_MAPPING_DF

center mapping dataframe

databaseSynIdMappingDf

Database to Synapse Id mapping

oncotree_url

Oncotree link

DEFAULT: None

consortiumReleaseCutOff

Release cut off days

DEFAULT: 183

current_release_staging

Whether this is a staging run. Default is False.

DEFAULT: False

skipMutationsInCis

Skip mutation in cis filter. Default is False.

DEFAULT: False

test

Testing parameter. Default is False.

DEFAULT: False

RETURNS DESCRIPTION
list

Gene panel entities

revise_metadata_files(syn, consortiumid, genie_version=None)

Rewrite metadata files with the correct GENIE version

PARAMETER DESCRIPTION
syn

Synapse object

consortiumid

Synapse id of consortium release folder

genie_version

GENIE version. Defaults to None.

DEFAULT: None

search_or_create_folder(syn, parentid, folder_name)

Searches for an existing Synapse Folder given a parent id and creates the Synapse folder if it doesn't exist

PARAMETER DESCRIPTION
syn

Synapse connection

TYPE: Synapse

parentid

Synapse Id of a project or folder

TYPE: str

folder_name

Folder name

TYPE: str

RETURNS DESCRIPTION
str

Synapse Folder id

TYPE: str

Create release links from the actual entity and version

TODO: Refactor to use fileviews

PARAMETER DESCRIPTION
syn

Synapse object

genie_version

GENIE version number

case_list_entities

Case list entities

gene_panel_entities

Gene panel entities

database_synid_mappingdf

dataframe containing database to synapse id mapping

release_type

'consortium' or 'public' release

DEFAULT: 'consortium'