# Installation

remotes::install_github("Sage-Bionetworks/sageseqr")

The sageseqr package integrates the targets R package, the config package for R, and Synapse. The targets package tracks dependency relationships in the workflow and reruns a step only when its code or upstream data has changed. A config file allows inputs and parameters to be explicitly defined in one location. Synapse is a data repository that allows sensitive data to be stored and shared responsibly.
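As a rough illustration of keeping inputs and parameters in one place, the sketch below writes a small YAML file and reads it back with config::get(). The keys counts_id and outlier_threshold and the Synapse ID are placeholders invented for this example; they are not the schema sageseqr expects.

```r
# Hypothetical config file; keys and values are illustrative only.
cfg_path <- tempfile(fileext = ".yml")
writeLines(c(
  "default:",
  "  counts_id: syn0000000",
  "  outlier_threshold: 4"
), cfg_path)

# config::get() returns the active configuration ("default" here) as a list.
cfg <- config::get(file = cfg_path)

cfg$counts_id
cfg$outlier_threshold
```

Calling config::get() with the namespace prefix avoids attaching the package, which would otherwise mask base::get().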

The workflow takes RNA-seq gene counts and sample metadata as inputs, normalizes counts by conditional quantile normalization (CQN), removes outliers based on a user-defined threshold, empirically selects meaningful covariates, and returns differential expression analysis results. The data is also visualized in several ways to help you spot trends. The visualizations include a heatmap identifying highly correlated covariates, a sample-specific X and Y marker gene check, boxplots showing the distribution of continuous variables, and a principal component analysis (PCA) showing how samples cluster.
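To make the outlier step more concrete, here is a minimal base R sketch of one common approach: run a PCA on the samples and flag any sample that lies more than a chosen number of standard deviations from the mean on the first two components. This is an assumption about how such a threshold can be applied, not a description of sageseqr's implementation; the simulated data, the z_threshold name, and its value are all illustrative.

```r
set.seed(1)

# Simulated log-expression matrix: 200 genes x 20 samples (illustrative only).
expr <- matrix(
  rnorm(200 * 20), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:20))
)
# Shift one sample so that it stands apart from the rest.
expr[, "sample20"] <- expr[, "sample20"] + 6

# PCA with samples as observations.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Hypothetical user-defined threshold, in standard deviations.
z_threshold <- 3

# Standardize PC1 and PC2 scores and flag samples beyond the threshold on either.
z_scores <- scale(pca$x[, 1:2])
is_outlier <- apply(abs(z_scores) > z_threshold, 1, any)

outlier_names <- names(which(is_outlier))
filtered <- expr[, !is_outlier, drop = FALSE]
```

A lower threshold removes more samples, so it is worth checking the PCA plot before settling on a value.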

# The Targets

The series of steps that make up the workflow are called targets. The target objects are stored in a cache and can be either returned with tar_read or loaded into your environment with tar_load, both from the targets package. The show_source = TRUE argument belongs to the drake functions loadd and readd and is not accepted by tar_read or tar_load; to inspect the command behind each target, use tar_manifest().
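The difference between reading and loading a target can be seen on a toy pipeline. The sketch below builds two made-up targets (raw_values and mean_value, which are not part of sageseqr) in a temporary directory so that no real cache is touched.

```r
library(targets)

# Work in a throwaway directory so the toy cache stays isolated.
toy_dir <- tempfile("toy_pipeline_")
dir.create(toy_dir)
old_wd <- setwd(toy_dir)

# Toy two-target pipeline (hypothetical target names).
tar_script({
  list(
    tar_target(raw_values, c(1, 2, 3, 4)),
    tar_target(mean_value, mean(raw_values))
  )
})

tar_manifest()                        # lists each target and its command
tar_make()                            # builds the targets into _targets/

mean_value <- tar_read(mean_value)    # returns the stored value
tar_load(raw_values)                  # assigns raw_values in this environment

tar_make()                            # nothing changed, so both targets are skipped

setwd(old_wd)
```

tar_read is convenient for a quick look at one object, while tar_load is handier when several targets are needed by name in an interactive session.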

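Importantly, running tar_destroy() removes the data stored as targets from the local cache (but the data is never completely gone: the targets can be rebuilt by rerunning tar_make()). To remove specific targets by name, pass them to the tar_delete() function instead; tar_destroy() does not accept target names.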
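A short sketch of both kinds of cleanup, again on a toy pipeline in a temporary directory. The target names are invented for the example, and ask = FALSE is passed only to skip the interactive confirmation prompt.

```r
library(targets)

toy_dir <- tempfile("toy_cleanup_")
dir.create(toy_dir)
old_wd <- setwd(toy_dir)

# Toy two-target pipeline (hypothetical target names).
tar_script({
  list(
    tar_target(raw_values, c(1, 2, 3, 4)),
    tar_target(mean_value, mean(raw_values))
  )
})
tar_make()

# Delete the stored value of a single target by name.
tar_delete(mean_value)
value_file_exists <- file.exists(file.path("_targets", "objects", "mean_value"))

# Rerun the pipeline so the deleted value is restored.
tar_make()
rebuilt <- tar_read(mean_value)

# Remove the entire local data store (the _targets/ directory).
tar_destroy(ask = FALSE)
store_exists <- dir.exists("_targets")

setwd(old_wd)
```

Deleting by name is the gentler option when only one or two results need to be regenerated.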

The targets are built by the tar_make() function from the targets package and are:

Raw data:

- import_metadata - imports the raw metadata directly from Synapse.
- import_counts - imports the raw counts directly from Synapse.
- biomart_results - the complete list of genes with biomaRt annotations.

Exploratory data visualizations:

- gene_coexpression - the distribution of correlated gene counts.
- boxplots - the distribution of continuous variables.
- sex_plot - the distribution of samples by X and Y marker genes.
- sex_plot_pca - a PCA of sex-specific expression to visualize more dimensionality than sex_plot.
- correlation_plot - the correlation of covariates.
- significant_covariates_plot - the correlation of covariates to gene expression.
- outliers - the clustering of samples by PCA.
- plot_de_volcano - volcano plot of differentially expressed genes.

Transformed or normalized data:

- clean_md - metadata with factor and numeric types.
- filtered_counts - counts matrix with low gene expression removed.
- biotypes - gene proportions summarized by biotype.
- cqn_counts - CQN normalized counts.
- model - model selected by multivariate forward stepwise regression (evaluated by the Bayesian information criterion (BIC)).
- de - differential expression results including adjusted p-values and gene list.
- report - output markdown report rendered as HTML.
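The idea behind the model target, forward stepwise selection scored by BIC, can be sketched in base R with step(). This is a deliberately simplified univariate version on simulated data: it fits a single response rather than all genes jointly, and the covariate names are hypothetical, so it should not be read as the package's own procedure.

```r
set.seed(42)
n <- 100

# Simulated sample metadata (hypothetical covariates).
md <- data.frame(
  diagnosis = factor(sample(c("control", "case"), n, replace = TRUE)),
  age       = rnorm(n, mean = 70, sd = 8),
  rin       = rnorm(n, mean = 7, sd = 1),
  batch     = factor(sample(c("b1", "b2", "b3"), n, replace = TRUE))
)
# Simulated expression that truly depends on diagnosis and rin only.
md$expr <- 2 * (md$diagnosis == "case") + 0.5 * md$rin + rnorm(n, sd = 0.5)

# Start from the intercept-only model and add covariates one at a time.
null_model <- lm(expr ~ 1, data = md)

# k = log(n) makes step() penalize by BIC instead of the default AIC.
selected <- step(
  null_model,
  scope = ~ diagnosis + age + rin + batch,
  direction = "forward",
  k = log(n),
  trace = 0
)

selected_terms <- attr(terms(selected), "term.labels")
formula(selected)
```

Because BIC penalizes each extra parameter more heavily than AIC for all but very small samples, it tends to return a smaller set of covariates.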

# Access to Data

Anyone can create a Synapse account and access public data in a variety of disciplines, for example through the Alzheimer’s Disease Knowledge Portal and the CommonMind Consortium.