vignettes/using-the-dccvalidator-pec.Rmd
We’ve built the dccvalidator tool to streamline the process of data validation and QA/QC. As the PsychENCODE (PEC) Knowledge Portal has grown to more than 39 contributing labs and over 90,000 data files, we’ve realized a need to be more standardized in our approaches to data curation. Thus, we built an application that performs many of the routine data quality checks we previously conducted by hand, in the hope that it will help you, the data contributor, get your data checked, validated, and shared easily and quickly.
The application is hosted on a Shiny server here.
To use this application you must:
Some portions of the app submit data to Synapse. This allows curators at Sage to troubleshoot issues if needed. No one outside the Sage curation team will be able to download the data.
In order to contribute data to the PsychENCODE Knowledge Portal, signing this agreement is required for each data set. In this agreement, you will acknowledge:
You are the expert on the content of your data! As a data contributor, you are responsible for ensuring your data is compliant with relevant policies and does not disclose the identity of research participants.
If you discover you have released sensitive information in error, please follow the process in Synapse to flag affected data.
We are also interested in rapidly identifying data that does not conform to our quality control standards. If you discover that data is contaminated, or that sample identity is questionable (for example, following a concordance analysis), please reach out to the curation team at PEC_SageAdmin@synapse.org.
A biospecimen is a sample of material such as tissue, cells, DNA, RNA, or protein that has a unique identifier associated with it: `specimenID`. The same biospecimen may be characterized in multiple assay types. In this case, the unique identifier should remain the same. We strongly recommend you do not name specimens using individual identifiers. In the case where multiple sequencing libraries are prepared from a single biospecimen, `LibraryID` is an available key. Replicates are tracked using integers and the keys `technicalReplicate` and `sequencingReplicate`.
A manifest is a .tsv or .txt file in which each row describes one data file to be uploaded to Synapse. Details of a manifest are described in the Uploading and Downloading Data in Bulk Synapse User Guide. While a metadata file will be stored on Synapse as a flat file, with select variables added as file annotations, all variables in a manifest file will be stored as annotations on the file in that row. To successfully upload a file, you must specify the local `path` to the file and the Synapse ID of the folder in the `parent` column.
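As a minimal sketch, a manifest with these two columns could be written using Python's standard `csv` module. The file paths, the folder ID `syn00000000`, and the extra `assay` annotation column with its value are placeholders for illustration, not real values.

```python
import csv
import io

# Hypothetical rows: local file paths and a placeholder Synapse folder ID.
rows = [
    {"path": "/data/study/sample1.fastq.gz", "parent": "syn00000000", "assay": "rnaSeq"},
    {"path": "/data/study/sample2.fastq.gz", "parent": "syn00000000", "assay": "rnaSeq"},
]


def write_manifest(rows, handle):
    """Write rows as a tab-separated manifest with `path` and `parent` first."""
    # Any other keys become additional columns, which end up as annotations.
    extra = sorted({key for row in rows for key in row} - {"path", "parent"})
    writer = csv.DictWriter(
        handle,
        fieldnames=["path", "parent"] + extra,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

Every column other than `path` and `parent` in this sketch stands in for an annotation that would be attached to the file in that row.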
Yes, Synapse supports Provenance! Provenance can be leveraged to connect raw data to reprocessed or summarized data. Populate the `used` column in the manifest with the synID. To link multiple files, separate the synIDs with semicolons: `used = synID;synID`.
Yes, with Provenance. Populate the `executed` column with the URL of your GitHub repo.
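The two provenance columns can also be filled in programmatically. The sketch below joins several Synapse IDs with semicolons for `used` and stores a repository URL in `executed`. The helper name `provenance_fields`, the IDs, and the URL are all hypothetical placeholders.

```python
def provenance_fields(used_ids, code_url):
    """Return the `used` and `executed` manifest values for one file."""
    return {
        # Multiple input files are written as synID;synID
        "used": ";".join(used_ids),
        # Link to the code that produced the file
        "executed": code_url,
    }


fields = provenance_fields(
    ["syn11111111", "syn22222222"],  # placeholder synIDs of the input files
    "https://github.com/example-lab/example-pipeline",  # placeholder URL
)
print(fields["used"])
print(fields["executed"])
```

The returned dictionary can be merged into the row for the corresponding file before the manifest is written.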
Each study in PEC will have accompanying documentation in the PEC portal. Here is an example of study documentation in the Accelerating Medicines Partnership in Alzheimer’s Disease portal, developed by Sage Bionetworks.
You can submit your documentation through the dccvalidator app on the Documentation page. There should be a study description for the whole study, and an assay description for each of the assays that was performed. These can be in a single file, or you can upload multiple files to the assay description section.
For a new study, there may not yet be a Staging folder in the PEC Knowledge Portal. Please contact us at PEC_SageAdmin@synapse.org.
Each study should include metadata that would help a new researcher understand and reuse the data. In most cases, we will expect 4 files:
Metadata file templates are available in the PsychENCODE Knowledge Portal resources.
If you don’t see a template for the assay(s) in your study, please send a request for a new schema to PEC_SageAdmin@synapse.org. We depend on your expertise to develop schemas that capture the most pertinent metadata!
The data validation portion of the app allows you to upload metadata files (as .csv) and the manifest (as .tsv or .txt) and view the results of a series of automated checks.
Examples of the types of checks we perform are:

- All required columns from the templates are present
- Individuals and specimens have unique identifiers
- Metadata terms conform to a controlled vocabulary, where applicable
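To make these checks concrete, here is a rough sketch of how the three kinds of check could be expressed in Python. It is not the dccvalidator implementation: the required columns, the `organ` vocabulary, and the sample rows are assumptions chosen only for illustration.

```python
import csv
import io

# Hypothetical metadata; column names and values are illustrative assumptions.
metadata_csv = """individualID,specimenID,organ
ind1,spec1,brain
ind1,spec2,brain
ind2,spec2,liver
"""

REQUIRED_COLUMNS = {"individualID", "specimenID", "organ", "tissue"}  # assumed template
ALLOWED_ORGANS = {"brain"}  # assumed controlled vocabulary


def check_metadata(text):
    """Return a list of human-readable problems found in the metadata."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = set(reader.fieldnames or [])
    problems = []

    # Check 1: all required columns are present.
    missing = sorted(REQUIRED_COLUMNS - columns)
    if missing:
        problems.append("missing columns: " + ", ".join(missing))

    # Check 2: specimen identifiers are unique.
    seen, duplicates = set(), []
    for row in rows:
        specimen = row.get("specimenID")
        if specimen in seen and specimen not in duplicates:
            duplicates.append(specimen)
        seen.add(specimen)
    if duplicates:
        problems.append("duplicate specimenID: " + ", ".join(duplicates))

    # Check 3: values conform to the controlled vocabulary.
    invalid = sorted({row["organ"] for row in rows if row.get("organ")} - ALLOWED_ORGANS)
    if invalid:
        problems.append("organ values outside vocabulary: " + ", ".join(invalid))

    return problems


problems = check_metadata(metadata_csv)
for problem in problems:
    print(problem)
```

The app runs its own checks against the official templates, so a local check like this is only a convenience before uploading.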
The Data Analysis Core will reprocess common data types during PEC Phase II. Currently, common data types are RNASeq, ATACSeq, ChipSeq and all single cell data. For common data types, only fastq files are required. For other data types, please provide fastq and bam files.
Once data has passed validation and the PEC data curators have granted you edit permissions on the Staging folder, you will use your newly created manifest file to upload data using `syncToSynapse`. You can execute `syncToSynapse` in the Python client and the R client. The Synapse Python client supports multithreaded upload and will provide faster upload speeds than the Synapse R client. To get started with the Synapse programmatic clients, please visit our Synapse docs.
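With the Python client, the upload step might look like the sketch below. `synapseutils.syncToSynapse` and its `dryRun` argument belong to the Synapse Python client; the `check_paths` and `upload` helpers are hypothetical, and the login call assumes your Synapse credentials are already configured locally.

```python
import csv
import os


def check_paths(manifest_file):
    """Return manifest paths that do not exist locally (hypothetical helper)."""
    with open(manifest_file, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["path"] for row in reader if not os.path.isfile(row["path"])]


def upload(manifest_file, dry_run=True):
    """Validate (dry_run=True) or upload the files listed in the manifest."""
    # Imported here so check_paths works even without the client installed.
    import synapseclient
    import synapseutils

    syn = synapseclient.Synapse()
    syn.login()  # assumes cached credentials or a local config file
    # With dryRun=True the manifest is validated but nothing is uploaded.
    synapseutils.syncToSynapse(syn, manifest_file, dryRun=dry_run)
```

A cautious sequence would be to call `check_paths` on the manifest first, then `upload` with `dry_run=True`, and only then `upload` with `dry_run=False` once both steps report no problems.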
Data is uploaded to a Staging folder, private to each individual group. Once curated, data is moved to a PEC folder for a limited period of time, where all consortium members have access to the data via the PEC Team. Finally, data is made public to Synapse users in a Data folder. All data upload takes place in the PsychENCODE Knowledge Portal Synapse Project. While access to the project is public, restrictions are associated with the Staging and PEC folders to make sure the data remains private for the appropriate period of time.
Synapse IDs are always preserved (i.e. IDs remain associated to the file).
Please send questions to PEC_SageAdmin@synapse.org.