process_functions

genie.process_functions

Processing functions that are used in the GENIE pipeline.
Attributes

__version__ = '17.1.0' (module attribute)

logger = logging.getLogger(__name__) (module attribute)

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) (module attribute)
Functions
to_unix_epoch_time_utc(dt)

Wrapper for the Synapse to_unix_epoch_time function that forces UTC tzinfo.

| PARAMETER | DESCRIPTION |
|---|---|
| dt | Input datetime object |

| RETURNS | DESCRIPTION |
|---|---|
| int | The UTC datetime converted to UNIX time |
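The behavior can be sketched as follows. This is a minimal stand-in, assuming the underlying Synapse helper returns epoch milliseconds (the millisecond unit is an assumption, not stated by this page):

```python
from datetime import datetime, timezone

def to_unix_epoch_time_utc(dt: datetime) -> int:
    # Force UTC tzinfo so naive datetimes are not read in local time,
    # then convert to UNIX epoch milliseconds (unit assumed)
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

to_unix_epoch_time_utc(datetime(1970, 1, 1))  # 0
```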
get_clinical_dataframe(filePathList)

Reads the clinical file(s) into a dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| filePathList | List of clinical files |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | When the PATIENT_ID column doesn't exist |
| ValueError | When PATIENT_IDs in the sample file don't exist in the patient file |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Clinical file as a pd.DataFrame |
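A sketch of the read-and-merge behavior, under the assumption that the files are tab-delimited and, when two files are given, arrive in patient-then-sample order (the ordering and delimiter are assumptions for illustration):

```python
import pandas as pd
from io import StringIO

def get_clinical_dataframe(filePathList):
    # Read each clinical file as a tab-delimited dataframe
    frames = [pd.read_csv(path, sep="\t", comment="#") for path in filePathList]
    for frame in frames:
        if "PATIENT_ID" not in frame.columns:
            raise ValueError("Clinical file must have a PATIENT_ID column")
    if len(frames) == 1:
        return frames[0]
    # Assumed ordering: patient file first, sample file second
    patient_df, sample_df = frames
    missing = set(sample_df["PATIENT_ID"]) - set(patient_df["PATIENT_ID"])
    if missing:
        raise ValueError(f"PATIENT_IDs not in the patient file: {sorted(missing)}")
    return sample_df.merge(patient_df, on="PATIENT_ID")

patient = StringIO("PATIENT_ID\tSEX\nP1\tFemale\n")
sample = StringIO("PATIENT_ID\tSAMPLE_ID\nP1\tS1\n")
merged = get_clinical_dataframe([patient, sample])
```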
get_assay_dataframe(filepath_list)

Reads the assay_information.yaml file and outputs it as a dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| filepath_list | List of files |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | Thrown if there is a read error with the file |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Dataframe version of the assay info file |
retry_get_url(url)

Implements retry logic when getting URLs. Times out at 3 seconds and retries 5 times.

| PARAMETER | DESCRIPTION |
|---|---|
| url | HTTP or HTTPS URL |

| RETURNS | DESCRIPTION |
|---|---|
| | The requests.get() response |
checkUrl(url)

Checks whether a URL link is live.

| PARAMETER | DESCRIPTION |
|---|---|
| url | Web URL |
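One way to get the documented behavior (3-second timeout, 5 retries) is a requests session mounted with a urllib3 Retry adapter; the actual pipeline code may use a different mechanism, so treat this as a sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_retry_session(total_retries: int = 5) -> requests.Session:
    # Mount an adapter that retries failed connections up to `total_retries` times
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=Retry(total=total_retries, backoff_factor=1))
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def retry_get_url(url: str) -> requests.Response:
    # Each attempt times out after 3 seconds, matching the docstring
    return build_retry_session().get(url, timeout=3)

def checkUrl(url: str) -> None:
    # A URL is considered live when the GET request reports success
    response = retry_get_url(url)
    assert response.ok, f"{url} is not a live URL"
```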
checkColExist(DF, key)

This function checks if the column(s) exist(s) in a dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| DF | Pandas dataframe |
| key | Expected column header name(s) |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the column(s) exist(s) |
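A minimal sketch of the check, assuming `key` may be either a single name or a list of names (the list handling is an assumption drawn from the plural wording above):

```python
import pandas as pd

def checkColExist(DF: pd.DataFrame, key) -> bool:
    # `key` may be a single column name or a list of names (assumed)
    keys = key if isinstance(key, list) else [key]
    return all(k in DF.columns for k in keys)

df = pd.DataFrame(columns=["PATIENT_ID", "SAMPLE_ID"])
checkColExist(df, "PATIENT_ID")           # True
checkColExist(df, ["PATIENT_ID", "AGE"])  # False
```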
validate_genie_identifier(identifiers, center, filename, col)

Validate GENIE sample and patient ids.

| PARAMETER | DESCRIPTION |
|---|---|
| identifiers | Array of GENIE identifiers |
| center | GENIE center name |
| filename | Name of the file |
| col | Column with identifiers |

| RETURNS | DESCRIPTION |
|---|---|
| str | Errors |
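A sketch of the validation, assuming identifiers must start with a `GENIE-<center>-` prefix; both the prefix rule and the error-message wording are assumptions, not the pipeline's exact behavior:

```python
def validate_genie_identifier(identifiers, center, filename, col) -> str:
    # GENIE ids are assumed to start with "GENIE-<center>-"
    prefix = f"GENIE-{center}-"
    invalid = [i for i in identifiers if not str(i).startswith(prefix)]
    if invalid:
        # Error message wording is illustrative, not the pipeline's exact text
        return f"{filename}: {col} must start with {prefix}. Invalid: {invalid}\n"
    return ""
```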
lookup_dataframe_value(df, col, query)

Looks up a dataframe value given a query and column.

| PARAMETER | DESCRIPTION |
|---|---|
| df | Dataframe |
| col | Column with the value to return |
| query | Query for a specific column |

| RETURNS | DESCRIPTION |
|---|---|
| | The looked-up value |
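A plausible one-liner sketch, assuming `query` is a `DataFrame.query` expression and the first match wins (both assumptions):

```python
import pandas as pd

def lookup_dataframe_value(df, col, query):
    # Filter rows with DataFrame.query, then return the first value in `col`
    return df.query(query)[col].iloc[0]

db = pd.DataFrame({"fileType": ["maf", "cna"], "synId": ["syn111", "syn222"]})
lookup_dataframe_value(db, "synId", 'fileType == "cna"')  # "syn222"
```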
rmFiles(folderPath, recursive=True)

Convenience function to remove all files in a directory.

| PARAMETER | DESCRIPTION |
|---|---|
| folderPath | Path to the folder |
| recursive | Removes all files recursively. DEFAULT: True |
removeStringFloat(string)

Removes string floats in a tsv file.

| PARAMETER | DESCRIPTION |
|---|---|
| string | tsv file in string format |

| RETURNS | DESCRIPTION |
|---|---|
| string | String with floats removed |
removePandasDfFloat(df, header=True)

Removes the decimals that pandas adds to integers.

| PARAMETER | DESCRIPTION |
|---|---|
| df | Pandas dataframe |

| RETURNS | DESCRIPTION |
|---|---|
| str | tsv as text |
removeFloat(df)

This function needs to be removed, as it only calls another function.
checkGenieId(ID, center)

Checks if a GENIE ID is labelled correctly and reformats the GENIE ID.

| PARAMETER | DESCRIPTION |
|---|---|
| ID | GENIE ID string |
| center | GENIE center |

| RETURNS | DESCRIPTION |
|---|---|
| str | Formatted GENIE ID string |
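A minimal sketch of the reformatting, assuming IDs are prefixed with `GENIE-<center>-` when the prefix is missing (the real function may do more, such as stripping stale prefixes):

```python
def checkGenieId(ID, center) -> str:
    # Prepend the GENIE-<center> prefix when it is missing (assumed logic)
    ID = str(ID)
    prefix = f"GENIE-{center}"
    if not ID.startswith(prefix):
        return f"{prefix}-{ID}"
    return ID

checkGenieId("1234", "SAGE")             # "GENIE-SAGE-1234"
checkGenieId("GENIE-SAGE-1234", "SAGE")  # unchanged
```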
seqDateFilter(clinicalDf, processingDate, days)

Filters clinical data by SEQ_DATE, using 6- and 12-month windows as parameters. For example, given today as the processing date: a SEQ_DATE of Jan-2017 with a processing date less than 6 months later goes to the staging release, while a SEQ_DATE of July-2016 with a processing date between 6 and 12 months later goes to the consortium release.
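The window arithmetic above can be made concrete with a whole-month difference; this helper and the Jun-2017 processing date are illustrative assumptions, not the pipeline's implementation:

```python
from datetime import datetime

def months_between(processing_date: datetime, seq_date: datetime) -> int:
    # Whole-month difference; the real filter works on SEQ_DATE strings
    return (processing_date.year - seq_date.year) * 12 + (
        processing_date.month - seq_date.month
    )

proc = datetime(2017, 6, 1)  # hypothetical processing date
months_between(proc, datetime(2017, 1, 1))  # 5  -> staging (< 6 months)
months_between(proc, datetime(2016, 7, 1))  # 11 -> consortium (6-12 months)
```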
addClinicalHeaders(clinicalDf, mapping, patientCols, sampleCols, samplePath, patientPath)

Adds clinical file headers.

| PARAMETER | DESCRIPTION |
|---|---|
| clinicalDf | Clinical dataframe |
| mapping | Mapping dataframe that maps clinical columns to labels and descriptions |
| patientCols | List of patient columns |
| sampleCols | List of sample columns |
| samplePath | Clinical sample path |
| patientPath | Clinical patient path |
_check_valid_df(df, col)

Checks that the variable is a pandas dataframe and that the specified column exists.

| PARAMETER | DESCRIPTION |
|---|---|
| df | Pandas dataframe |
| col | Column name |
_get_left_diff_df(left, right, checkby)

Subsets the dataframe based on 'checkby' by taking the values in the left df that aren't in the right df.

| PARAMETER | DESCRIPTION |
|---|---|
| left | Dataframe |
| right | Dataframe |
| checkby | Column of values to compare |

| RETURNS | DESCRIPTION |
|---|---|
| Dataframe | Subset of the left dataframe with rows that don't exist in the right |
_get_left_union_df(left, right, checkby)

Subsets the dataframe based on 'checkby' by taking the values in the left df that also exist in the right df.

| PARAMETER | DESCRIPTION |
|---|---|
| left | Dataframe |
| right | Dataframe |
| checkby | Column of values to compare |

| RETURNS | DESCRIPTION |
|---|---|
| Dataframe | Subset of the left dataframe with rows that also exist in the right |
_append_rows(new_datasetdf, databasedf, checkby)

Compares the new dataset with the database and determines which rows to append.

| PARAMETER | DESCRIPTION |
|---|---|
| new_datasetdf | Input data dataframe |
| databasedf | Existing data dataframe |
| checkby | Column of values to compare |

| RETURNS | DESCRIPTION |
|---|---|
| Dataframe | Dataframe of rows to append |
_delete_rows(new_datasetdf, databasedf, checkby)

Compares the new dataset with the database and determines which rows to delete.

| PARAMETER | DESCRIPTION |
|---|---|
| new_datasetdf | Input data dataframe |
| databasedf | Existing data dataframe |
| checkby | Column of values to compare |

| RETURNS | DESCRIPTION |
|---|---|
| Dataframe | Dataframe of rows to delete |
_create_update_rowsdf(updating_databasedf, updatesetdf, rowids, differentrows)

Creates the update dataset dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| updating_databasedf | Update database dataframe |
| updatesetdf | Update dataset dataframe |
| rowids | Row ids of the database (Synapse ROW_ID, ROW_VERSION) |
| differentrows | Vector of booleans for rows that need to be updated: True for update, False for not |

| RETURNS | DESCRIPTION |
|---|---|
| dataframe | Update dataframe |
_update_rows(new_datasetdf, databasedf, checkby)

Compares the new dataset with the database and determines which rows to update.

| PARAMETER | DESCRIPTION |
|---|---|
| new_datasetdf | Input data dataframe |
| databasedf | Existing data dataframe |
| checkby | Column of values to compare |

| RETURNS | DESCRIPTION |
|---|---|
| Dataframe | Dataframe of rows to update |
checkInt(element)

Checks if an item can become an integer.

| PARAMETER | DESCRIPTION |
|---|---|
| element | Any variable and type |

| RETURNS | DESCRIPTION |
|---|---|
| boolean | True/False |
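One reading of "can become an integer" is that `float()` succeeds and has no fractional part; treating "3.0" as integer-convertible is an assumption of this sketch:

```python
def checkInt(element) -> bool:
    # An element can become an integer when float() succeeds with no
    # fractional part (treating "3.0" as integer-like is an assumption)
    try:
        return float(element).is_integer()
    except (ValueError, TypeError):
        return False

checkInt("3")    # True
checkInt("3.0")  # True
checkInt("3.5")  # False
checkInt("abc")  # False
```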
check_col_and_values(df, col, possible_values, filename, na_allowed=False, required=False, sep=None)

Checks that the column exists and that the values in the column are valid.

| PARAMETER | DESCRIPTION |
|---|---|
| df | Input dataframe |
| col | Expected column name |
| possible_values | List of possible values |
| filename | Name of the file |
| required | If the column is required. DEFAULT: False |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | warning, error |
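A sketch of the two-stage check (column presence, then value validity). The message wording and the handling of `na_allowed`/`sep` are assumptions made for illustration:

```python
import pandas as pd

def check_col_and_values(df, col, possible_values, filename,
                         na_allowed=False, required=False, sep=None):
    warning, error = "", ""
    if col not in df.columns:
        # Missing column: error when required, otherwise only a warning
        if required:
            error = f"{filename}: Must have a {col} column.\n"
        else:
            warning = f"{filename}: Doesn't have a {col} column.\n"
        return warning, error
    values = df[col].dropna() if na_allowed else df[col]
    if sep:
        # A cell may hold several values joined by `sep` (assumed)
        values = values.astype(str).str.split(sep).explode()
    if not values.isin(possible_values).all():
        error = f"{filename}: {col} must be one of {possible_values}.\n"
    return warning, error

df = pd.DataFrame({"SEX": ["Male", "Female"]})
check_col_and_values(df, "SEX", ["Male", "Female"], "clinical.txt")  # ("", "")
```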
extract_oncotree_code_mappings_from_oncotree_json(oncotree_json, primary, secondary)
get_oncotree_code_mappings(oncotree_tumortype_api_endpoint_url)

Creates an oncotree dictionary mapping to primary, secondary, cancer type, and cancer description.
getCODE(mapping, key, useDescription=False)

getPrimary(code, oncotreeDict, primary)
synapse_login(debug=False)

Logs into Synapse if credentials are saved. If not, the user is prompted for a username and auth token.

| PARAMETER | DESCRIPTION |
|---|---|
| debug | Synapse debug feature. Defaults to False |

| RETURNS | DESCRIPTION |
|---|---|
| Synapse | Synapseclient object |
get_gdc_data_dictionary(filetype)

Uses the GDC API to get the values allowed for columns of different filetypes (e.g. disease_type in the case file).

| PARAMETER | DESCRIPTION |
|---|---|
| filetype | GDC file type (e.g. case, read_group) |

| RETURNS | DESCRIPTION |
|---|---|
| json | Dictionary of allowed columns for the filetype and allowed values for those columns |
_create_schema(syn, table_name, parentid, columns=None, annotations=None)

Creates a Table Schema.

| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| table_name | Name of the table |
| parentid | Project synapse id |
| columns | Columns of the Table. DEFAULT: None |
| annotations | Dictionary of annotations to add. DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| | Table Schema |
_update_database_mapping(syn, database_synid_mappingdf, database_mapping_synid, fileformat, new_tableid)

Updates the database-to-synapse-id mapping table.

| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| database_synid_mappingdf | Database to synapse id mapping dataframe |
| database_mapping_synid | Database to synapse id table id |
| fileformat | File format updated |
| new_tableid | New file format table id |

| RETURNS | DESCRIPTION |
|---|---|
| | Updated Table object |
_move_entity(syn, ent, parentid, name=None)

Moves an entity (works like Linux mv).

| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| ent | Synapse Entity |
| parentid | Synapse Project id |
| name | New Entity name, if a new name is desired. DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| | Moved Entity |
get_dbmapping(syn, projectid)

Gets database mapping information.

| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse connection |
| projectid | Project id where the new data lives |

| RETURNS | DESCRIPTION |
|---|---|
| dict | {'synid': database mapping syn id, 'df': database mapping pd.DataFrame} |
create_new_fileformat_table(syn, file_format, newdb_name, projectid, archive_projectid)

Creates a new database table based on the old database table and archives the old one.

| PARAMETER | DESCRIPTION |
|---|---|
| syn | Synapse object |
| file_format | File format to update |
| newdb_name | Name of the new database table |
| projectid | Project id where the new database should live |
| archive_projectid | Project id where the old database should be moved |

| RETURNS | DESCRIPTION |
|---|---|
| dict | {"newdb_ent": new database synapseclient.Table, "newdb_mappingdf": new database pd.DataFrame, "moved_ent": old database synapseclient.Table} |
create_missing_columns(dataset, schema)

Creates and fills missing columns with the relevant NA value for the given data type. Note that special handling is needed to allow NAs in integer-based columns in pandas: the integer column is converted to Int64, the pandas nullable integer data type.

| PARAMETER | DESCRIPTION |
|---|---|
| dataset | Input dataset to fill missing columns for |
| schema | The expected schema {column_name(str): data_type(str)} for the input dataset |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pd.Series: updated dataset |
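The Int64 handling can be sketched as below. The `NA_FILLS` mapping and data-type names are assumptions, and this sketch returns the whole dataframe for illustration:

```python
import pandas as pd

# Per-dtype NA fill values (illustrative; the real mapping may differ)
NA_FILLS = {"string": "", "integer": None, "float": float("nan"), "boolean": None}

def create_missing_columns(dataset: pd.DataFrame, schema: dict) -> pd.DataFrame:
    for col, dtype in schema.items():
        if col not in dataset.columns:
            dataset[col] = NA_FILLS[dtype]
        if dtype == "integer":
            # Int64 (capital I) is the pandas nullable integer dtype, so the
            # column can hold NAs without being upcast to float
            dataset[col] = dataset[col].astype("Int64")
    return dataset

df = create_missing_columns(pd.DataFrame({"a": [1]}), {"a": "integer", "b": "integer"})
str(df["b"].dtype)  # "Int64"
```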
check_values_in_column(df, col, values)

Checks if a column in a dataframe contains specific values.

| PARAMETER | DESCRIPTION |
|---|---|
| df | The clinical dataframe |
| col | The column name |
| values | Expected values in the column |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the column contains the specified values |
get_row_indices_for_invalid_column_values(df, col, possible_values, na_allowed=False, sep=None)

Checks the column values against possible_values and returns the row indices of invalid rows.

| PARAMETER | DESCRIPTION |
|---|---|
| df | Input dataframe |
| col | The column to be checked |
| possible_values | The list of possible values |
| na_allowed | If NA is allowed. Defaults to False |
| sep | The string separator. Defaults to None |

| RETURNS | DESCRIPTION |
|---|---|
| Index | pd.Index: the row indices of the rows with values that are not in possible_values |
get_message_for_invalid_column_value(col, filename, invalid_indices, possible_values)

Returns the error and warning messages if the target column has rows with invalid values.

| PARAMETER | DESCRIPTION |
|---|---|
| col | The column to be checked |
| filename | The file name |
| invalid_indices | The row indices of the rows with invalid values |
| possible_values | The list of possible values |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | warning, error |
check_column_and_values_row_specific(df, col, possible_values, filename, na_allowed=False, required=False, sep=None)

Checks if the column exists and if the values in the column are valid. Currently, this function is only used in assay.py.

| PARAMETER | DESCRIPTION |
|---|---|
| df | Input dataframe |
| col | The column to be checked |
| possible_values | The list of possible values |
| filename | The file name |
| na_allowed | If NA is allowed. Defaults to False |
| required | If the column is required. Defaults to False |
| sep | The string separator. Defaults to None |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | warning, error |
add_columns_to_data_gene_matrix(data_gene_matrix, sample_list, column_name)

Adds CNA and SV columns to the data gene matrix.

| PARAMETER | DESCRIPTION |
|---|---|
| data_gene_matrix | Data gene matrix |
| sample_list | The list of cna or sv samples |
| column_name | The column name to be added |