process_functions

genie.process_functions

Processing functions that are used in the GENIE pipeline

Attributes

__version__ = '17.1.0' module-attribute

logger = logging.getLogger(__name__) module-attribute

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) module-attribute

Functions

to_unix_epoch_time_utc(dt)

Wrapper for Synapse to_unix_epoch_time that forces UTC tzinfo

PARAMETER DESCRIPTION
dt

Input date, datetime, or string to convert

TYPE: Union[date, datetime, str]

RETURNS DESCRIPTION
int

The UTC datetime converted to UNIX epoch time

TYPE: int
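As a rough illustration of what forcing UTC means here, the sketch below converts a date, datetime, or string to UNIX time without consulting the local timezone. The name `epoch_ms_utc` is hypothetical, and both the millisecond unit and the ISO string parsing are assumptions; this is not the library's implementation.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Union

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_utc(dt: Union[date, datetime, str]) -> int:
    """Hypothetical stand-in: treat naive inputs as UTC, return ms since epoch."""
    if isinstance(dt, str):
        # Assumption: strings are ISO formatted, e.g. "2017-01-01"
        dt = datetime.fromisoformat(dt)
    elif not isinstance(dt, datetime):
        # A plain date becomes midnight of that day
        dt = datetime(dt.year, dt.month, dt.day)
    if dt.tzinfo is None:
        # Naive datetimes are labelled as UTC rather than local time
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas keeps the result exact (no float rounding)
    return (dt - EPOCH) // timedelta(milliseconds=1)


print(epoch_ms_utc("2017-01-01"))
```

Because the naive input is labelled as UTC instead of being interpreted in the machine's timezone, the result should be the same on any host.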

get_clinical_dataframe(filePathList)

Gets the clinical file(s) and reads them in as a dataframe

PARAMETER DESCRIPTION
filePathList

List of clinical files

TYPE: list

RAISES DESCRIPTION
ValueError

When the PATIENT_ID column doesn't exist

ValueError

When PATIENT_IDs in the sample file don't exist in the patient file

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: clinical file as a dataframe

get_assay_dataframe(filepath_list)

Reads in assay_information.yaml file and outputs it as a dataframe

PARAMETER DESCRIPTION
filepath_list

list of files

TYPE: list

RAISES DESCRIPTION
ValueError

Raised if the file cannot be read

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: dataframe version of assay info file

retry_get_url(url)

Implements retry logic when getting URLs. Times out after 3 seconds and retries 5 times.

PARAMETER DESCRIPTION
url

Http or https url

RETURNS DESCRIPTION

requests.get()
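The retry behaviour described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the function's actual code: `call_with_retries` and `fetch` are hypothetical names, "retries 5 times" is read as one initial attempt plus up to five retries, and the sketch returns raw bytes where the documented function returns the `requests.get()` response.

```python
import time
import urllib.request
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], retries: int = 5, wait: float = 0.0) -> T:
    """Call func; if it raises, try again up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries, surface the last error
            attempt += 1
            time.sleep(wait)


def fetch(url: str) -> bytes:
    """Fetch a URL with a 3 second timeout per attempt (hypothetical helper)."""

    def attempt() -> bytes:
        with urllib.request.urlopen(url, timeout=3) as response:
            return response.read()

    return call_with_retries(attempt, retries=5, wait=1.0)
```

Catching every `Exception` keeps the sketch short; a production version would more likely retry only on connection errors and selected HTTP status codes.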

checkUrl(url)

Check if URL link is live

PARAMETER DESCRIPTION
url

web URL

checkColExist(DF, key)

This function checks if the column(s) exist(s) in a dataframe

PARAMETER DESCRIPTION
DF

pandas dataframe

TYPE: DataFrame

key

Expected column header name(s)

TYPE: Union[str, int, list]

RETURNS DESCRIPTION
bool

True if column(s) exist(s)

TYPE: bool
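Since `key` may be a single header or a list of headers, a check of this kind usually normalizes the input first. The following is only a sketch of that idea; `columns_exist` is a hypothetical name, not the function documented above.

```python
import pandas as pd


def columns_exist(df: pd.DataFrame, key) -> bool:
    """Return True only if every requested header is a column of df."""
    # Wrap a single header so both input shapes share one code path
    keys = key if isinstance(key, list) else [key]
    return all(k in df.columns for k in keys)


df = pd.DataFrame({"SAMPLE_ID": ["S1"], "PATIENT_ID": ["P1"]})
print(columns_exist(df, "SAMPLE_ID"))
print(columns_exist(df, ["SAMPLE_ID", "AGE"]))
```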

validate_genie_identifier(identifiers, center, filename, col)

Validate GENIE sample and patient ids.

PARAMETER DESCRIPTION
identifiers

Array of GENIE identifiers

TYPE: Series

center

GENIE center name

TYPE: str

filename

name of file

TYPE: str

col

Column with identifiers

TYPE: str

return

str: Errors

lookup_dataframe_value(df, col, query)

Look up dataframe value given query and column

PARAMETER DESCRIPTION
df

dataframe

col

column with value to return

query

Query for specific column

RETURNS DESCRIPTION

value

rmFiles(folderPath, recursive=True)

Convenience function to remove all files in dir

PARAMETER DESCRIPTION
folderPath

Path to folder

recursive

Removes all files recursively

DEFAULT: True

removeStringFloat(string)

Removes string floats in a TSV file

PARAMETER DESCRIPTION
string

tsv file in string format

Return

string: string with float removed
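One plausible way to remove such floats is to drop a ".0" wherever it ends a tab-separated field. The sketch below assumes exactly that rule; `strip_float_suffix` is a hypothetical name and the real function may behave differently.

```python
def strip_float_suffix(tsv_text: str) -> str:
    """Hypothetical helper: drop ".0" where it ends a tab-separated field."""
    # A field ends either at a tab or at the end of its line
    return tsv_text.replace(".0\t", "\t").replace(".0\n", "\n")


text = "SAMPLE\tAGE\tSCORE\nS1\t61.0\t0.5\nS2\t47.0\t2.0\n"
print(strip_float_suffix(text))
```

A plain text replacement like this cannot tell numbers from other content, so a free-text field ending in ".0" (a version string, for example) would be altered as well.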

removePandasDfFloat(df, header=True)

Remove decimal for integers due to pandas

PARAMETER DESCRIPTION
df

Pandas dataframe

Return

str: tsv in text

removeFloat(df)

Need to remove this function as it calls another function

checkGenieId(ID, center)

Checks if GENIE ID is labelled correctly and reformats the GENIE ID

PARAMETER DESCRIPTION
ID

string

center

GENIE center

Return

str: Formatted GENIE ID string
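To make the reformatting step concrete, the sketch below assumes that a correctly labelled ID starts with `GENIE-<center>-` and that other IDs are fixed by prepending that prefix. Both the convention and the name `format_genie_id` are assumptions made for illustration.

```python
def format_genie_id(identifier, center: str) -> str:
    """Hypothetical helper: ensure an ID carries the GENIE-<center>- prefix."""
    identifier = str(identifier)
    prefix = f"GENIE-{center}-"
    if identifier.startswith(prefix):
        # Already labelled correctly, return unchanged
        return identifier
    return prefix + identifier


print(format_genie_id("P-001", "SAGE"))
```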

seqDateFilter(clinicalDf, processingDate, days)

SEQ_DATE filter. SEQ_DATE - Clinical data (6 and 12 as parameters).

Jan-2017, given processing date (today) -> staging release (processing date - Jan-2017 < 6 months)

July-2016, given processing date (today) -> consortium release (processing date - July-2016 between 6 months - 12 months)

addClinicalHeaders(clinicalDf, mapping, patientCols, sampleCols, samplePath, patientPath)

Add clinical file headers

PARAMETER DESCRIPTION
clinicalDf

clinical dataframe

mapping

mapping dataframe, maps clinical columns to labels and descriptions

patientCols

list of patient columns

sampleCols

list of sample columns

samplePath

clinical sample path

patientPath

clinical patient path

_check_valid_df(df, col)

Checks that the variable is a pandas dataframe and that the specified column exists

PARAMETER DESCRIPTION
df

Pandas dataframe

col

Column name

_get_left_diff_df(left, right, checkby)

Subset the dataframe based on 'checkby' by taking values in the left df that aren't in the right df

PARAMETER DESCRIPTION
left

Dataframe

right

Dataframe

checkby

Column of values to compare

Return

Dataframe: Subset of dataframe from left that don't exist in the right

_get_left_union_df(left, right, checkby)

Subset the dataframe based on 'checkby' by taking the values in the left df that also exist in the right df

PARAMETER DESCRIPTION
left

Dataframe

right

Dataframe

checkby

Column of values to compare

Return

Dataframe: Subset of dataframe from left that also exist in the right
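Both left-side subsets can be sketched with a single boolean mask. This is an illustration of the documented return values, not the helpers' actual code; the column names and data are made up.

```python
import pandas as pd

left = pd.DataFrame({"ID": ["a", "b", "c"], "VALUE": [1, 2, 3]})
right = pd.DataFrame({"ID": ["b", "c", "d"], "VALUE": [20, 30, 40]})

# True where the left key also appears in the right dataframe
in_right = left["ID"].isin(right["ID"])

left_only = left[~in_right]    # left rows whose key is missing from the right
left_shared = left[in_right]   # left rows whose key also exists in the right

print(left_only)
print(left_shared)
```

In both cases the rows (and their values) come from the left dataframe; the right dataframe only decides which keys are kept.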

_append_rows(new_datasetdf, databasedf, checkby)

Compares the input dataset against the existing database and determines which rows of the dataset to append

PARAMETER DESCRIPTION
new_datasetdf

Input data dataframe

databasedf

Existing data dataframe

checkby

Column of values to compare

Return

Dataframe: Dataframe of rows to append

_delete_rows(new_datasetdf, databasedf, checkby)

Compares the input dataset against the existing database and determines which rows to delete

PARAMETER DESCRIPTION
new_datasetdf

Input data dataframe

databasedf

Existing data dataframe

checkby

Column of values to compare

Return

Dataframe: Dataframe of rows to delete

_create_update_rowsdf(updating_databasedf, updatesetdf, rowids, differentrows)

Create the update dataset dataframe

PARAMETER DESCRIPTION
updating_databasedf

Update database dataframe

updatesetdf

Update dataset dataframe

rowids

rowids of the database (Synapse ROW_ID, ROW_VERSION)

differentrows

Vector of booleans marking which rows need to be updated: True for update, False otherwise

RETURNS DESCRIPTION
dataframe

Update dataframe

_update_rows(new_datasetdf, databasedf, checkby)

Compares the input dataset against the existing database and determines which rows to update

PARAMETER DESCRIPTION
new_datasetdf

Input data dataframe

databasedf

Existing data dataframe

checkby

Column of values to compare

Return

Dataframe: Dataframe of rows to update
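The append, delete, and update comparisons can be pictured on plain dictionaries keyed by the `checkby` value. The sketch below is a simplified model under that assumption; `plan_changes` is a hypothetical name, and the real helpers operate on dataframes and Synapse row ids.

```python
def plan_changes(new_rows: dict, db_rows: dict):
    """Split rows into append / delete / update sets by comparing keys."""
    # Keys only in the new dataset are appended
    to_append = {k: v for k, v in new_rows.items() if k not in db_rows}
    # Keys only in the existing database are deleted
    to_delete = {k: v for k, v in db_rows.items() if k not in new_rows}
    # Keys in both are updated only when the values differ
    to_update = {
        k: v for k, v in new_rows.items() if k in db_rows and db_rows[k] != v
    }
    return to_append, to_delete, to_update


database = {"S1": {"AGE": 61}, "S2": {"AGE": 47}, "S3": {"AGE": 30}}
incoming = {"S2": {"AGE": 48}, "S3": {"AGE": 30}, "S4": {"AGE": 55}}
print(plan_changes(incoming, database))
```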

checkInt(element)

Check if an item can become an integer

PARAMETER DESCRIPTION
element

Any variable and type

RETURNS DESCRIPTION
boolean

True/False
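One common reading of "can become an integer" is that `int()` accepts the value. The sketch below follows that reading only; `can_be_int` is a hypothetical name, and the documented function may treat cases such as the string "1.0" differently.

```python
def can_be_int(element) -> bool:
    """Return True if int(element) succeeds, False otherwise."""
    try:
        int(element)
    except (TypeError, ValueError, OverflowError):
        # TypeError: e.g. None; ValueError: e.g. "abc"; OverflowError: infinity
        return False
    return True


print(can_be_int("42"), can_be_int("4.2"), can_be_int(None))
```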

check_col_and_values(df, col, possible_values, filename, na_allowed=False, required=False, sep=None)

This function checks that the column exists, then checks that the values in the column are among the possible values

PARAMETER DESCRIPTION
df

Input dataframe

col

Expected column name

possible_values

list of possible values

filename

Name of file

required

If the column is required. Default is False

DEFAULT: False

RETURNS DESCRIPTION
tuple

warning, error

extract_oncotree_code_mappings_from_oncotree_json(oncotree_json, primary, secondary)

get_oncotree_code_mappings(oncotree_tumortype_api_endpoint_url)

Creates an oncotree dictionary mapping to primary, secondary, cancer type, and cancer description

getCODE(mapping, key, useDescription=False)

getPrimary(code, oncotreeDict, primary)

synapse_login(debug=False)

Logs into Synapse if credentials are saved. If they are not saved, the user is prompted for a username and auth token.

PARAMETER DESCRIPTION
debug

Synapse debug feature. Defaults to False

TYPE: Optional[bool] DEFAULT: False

RETURNS DESCRIPTION
Synapse

Synapseclient object

get_gdc_data_dictionary(filetype)

Use the GDC API to get the values allowed for columns of different filetypes (e.g. disease_type in the case file)

PARAMETER DESCRIPTION
filetype

GDC file type (e.g. case, read_group)

Return

json: Dictionary of allowed columns for the filetype and allowed values for those columns

_create_schema(syn, table_name, parentid, columns=None, annotations=None)

Creates Table Schema

PARAMETER DESCRIPTION
syn

Synapse object

table_name

Name of table

parentid

Project synapse id

columns

Columns of Table

DEFAULT: None

annotations

Dictionary of annotations to add

DEFAULT: None

RETURNS DESCRIPTION

Schema

_update_database_mapping(syn, database_synid_mappingdf, database_mapping_synid, fileformat, new_tableid)

Updates database to synapse id mapping table

PARAMETER DESCRIPTION
syn

Synapse object

database_synid_mappingdf

Database to synapse id mapping dataframe

database_mapping_synid

Database to synapse id table id

fileformat

File format updated

new_tableid

New file format table id

RETURNS DESCRIPTION

Updated Table object

_move_entity(syn, ent, parentid, name=None)

Moves an entity (works like linux mv)

PARAMETER DESCRIPTION
syn

Synapse object

ent

Synapse Entity

parentid

Synapse Project id

name

New Entity name if a new name is desired

DEFAULT: None

RETURNS DESCRIPTION

Moved Entity

get_dbmapping(syn, projectid)

Gets database mapping information

PARAMETER DESCRIPTION
syn

Synapse connection

TYPE: Synapse

projectid

Project id where new data lives

TYPE: str

RETURNS DESCRIPTION
dict

{'synid': database mapping syn id, 'df': database mapping pd.DataFrame}

create_new_fileformat_table(syn, file_format, newdb_name, projectid, archive_projectid)

Creates new database table based on old database table and archives old database table

PARAMETER DESCRIPTION
syn

Synapse object

TYPE: Synapse

file_format

File format to update

TYPE: str

newdb_name

Name of new database table

TYPE: str

projectid

Project id where new database should live

TYPE: str

archive_projectid

Project id where old database should be moved

TYPE: str

RETURNS DESCRIPTION
dict

{"newdb_ent": New database synapseclient.Table, "newdb_mappingdf": new database pd.DataFrame, "moved_ent": old database synapseclient.Table}

create_missing_columns(dataset, schema)

Creates and fills missing columns with the relevant NA value for the given data type. Note that integer columns need special handling to allow NAs in pandas: they are converted to Int64, the pandas nullable integer data type.

PARAMETER DESCRIPTION
dataset

input dataset to fill missing columns for

TYPE: DataFrame

schema

the expected schema {column_name(str): data_type(str)} for the input dataset

TYPE: dict

RETURNS DESCRIPTION
Series

pd.Series: updated dataset
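The Int64 note matters because a plain NumPy integer column cannot hold missing values. The sketch below shows one way to fill missing columns under that constraint; `add_missing_columns` is a hypothetical name and the data type strings ("integer", "float", "string") are assumptions about the schema format.

```python
import pandas as pd


def add_missing_columns(dataset: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Add any schema column absent from dataset, filled with an NA value."""
    result = dataset.copy()
    for column, data_type in schema.items():
        if column in result.columns:
            continue
        if data_type == "integer":
            # Nullable integer dtype: keeps the column integer typed with NAs
            result[column] = pd.array([pd.NA] * len(result), dtype="Int64")
        elif data_type == "float":
            result[column] = float("nan")
        else:
            result[column] = None
    return result


clinical = pd.DataFrame({"SAMPLE_ID": ["S1", "S2"]})
filled = add_missing_columns(
    clinical, {"SAMPLE_ID": "string", "AGE": "integer", "SCORE": "float"}
)
print(filled.dtypes)
```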

check_values_in_column(df, col, values)

Check if a column in a dataframe contains specific values

PARAMETER DESCRIPTION
df

The clinical dataframe

TYPE: DataFrame

col

The column name

TYPE: str

values

Expected values in the column

TYPE: list

RETURNS DESCRIPTION
bool

True if the column contains the specified values

TYPE: bool

get_row_indices_for_invalid_column_values(df, col, possible_values, na_allowed=False, sep=None)

This function checks the column values against possible_values and returns row indices of invalid rows.

PARAMETER DESCRIPTION
df

Input dataframe

TYPE: DataFrame

col

The column to be checked

TYPE: str

possible_values

The list of possible values

TYPE: list

na_allowed

If NA is allowed. Defaults to False.

TYPE: bool DEFAULT: False

sep

The string separator. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Index

pd.Index: The row indices of the rows with values that are not in possible_values.
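The interplay of `na_allowed` and `sep` is easiest to see on a small example. The sketch below is an approximation written for illustration; `invalid_row_indices` is a hypothetical name, and stripping whitespace around separated values is an assumption.

```python
import pandas as pd


def invalid_row_indices(df, col, possible_values, na_allowed=False, sep=None):
    """Return the index labels of rows whose value is not allowed."""
    allowed = set(possible_values)

    def is_valid(value) -> bool:
        if pd.isna(value):
            return na_allowed
        if sep is None:
            return value in allowed
        # With a separator, every piece of the cell must be allowed
        return all(part.strip() in allowed for part in str(value).split(sep))

    mask = df[col].map(is_valid).astype(bool)
    return df.index[(~mask).to_numpy()]


panel = pd.DataFrame({"PANEL": ["A", "B;C", "D", None]})
print(list(invalid_row_indices(panel, "PANEL", ["A", "B", "C"], sep=";")))
```

Returning index labels instead of a single pass/fail flag lets the caller report exactly which rows need fixing.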

get_message_for_invalid_column_value(col, filename, invalid_indices, possible_values)

This function returns the error and warning messages if the target column has rows with invalid values.

PARAMETER DESCRIPTION
col

The column to be checked

TYPE: str

filename

The file name

TYPE: str

invalid_indices

The row indices of the rows with invalid values

TYPE: Index

possible_values

The list of possible values

TYPE: list

RETURNS DESCRIPTION
tuple

warning, error

TYPE: tuple

check_column_and_values_row_specific(df, col, possible_values, filename, na_allowed=False, required=False, sep=None)

This function checks that the column exists and that the values in the column are valid. Currently, this function is only used in assay.py

PARAMETER DESCRIPTION
df

Input dataframe

TYPE: DataFrame

col

The column to be checked

TYPE: str

possible_values

The list of possible values

TYPE: list

filename

The file name

TYPE: str

na_allowed

If NA is allowed. Defaults to False.

TYPE: bool DEFAULT: False

required

If the column is required. Defaults to False.

TYPE: bool DEFAULT: False

sep

The string separator. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
tuple

warning, error

TYPE: tuple

add_columns_to_data_gene_matrix(data_gene_matrix, sample_list, column_name)

Add CNA and SV columns to data gene matrix

PARAMETER DESCRIPTION
data_gene_matrix

data gene matrix

TYPE: DataFrame

sample_list

The list of cna or sv samples

TYPE: list

column_name

The column name to be added

TYPE: str