It is generally recommended to keep data, code, and outputs separate to avoid confusion. Although the exact structure can vary depending on the project, here are two approaches to organizing files.
Basic Projects
One or two data processing scripts; a simple pipeline with few steps.
.
├── InputData   <- Folder containing data that will be processed
├── OutputData  <- Folder containing data that has been processed
├── Figures     <- Folder containing figures or tables summarizing the results
├── Code        <- Folder containing the scripts or programs used for the analysis
├── LICENSE     <- File explaining the terms under which the data/code are made available
└── README.txt  <- File documenting the analysis and (ideally) the purpose of each file
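For a project at this scale, the layout can be created by hand or scripted. Here is a minimal sketch using only Python's standard library; the folder and file names match the listing above, and the README text is a placeholder to be replaced with real documentation:

```python
from pathlib import Path

def scaffold_basic_project(root: str = ".") -> None:
    """Create the basic project layout shown above."""
    base = Path(root)
    for folder in ("InputData", "OutputData", "Figures", "Code"):
        (base / folder).mkdir(parents=True, exist_ok=True)
    # Placeholder files: replace with real license text and with
    # documentation of the analysis and the purpose of each file.
    (base / "LICENSE").touch()
    if not (base / "README.txt").exists():
        (base / "README.txt").write_text("Document the analysis here.\n")

if __name__ == "__main__":
    scaffold_basic_project()
```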
Advanced Projects
Many kinds of input data, documentation, and code. For highly complex datasets, refer to the Bio-Lab Informatics System (BLIS).
This structure can be generated and auto-populated with the Reproducible Science template from Cookiecutter.
.
├── AUTHORS.md        <- File: List of people who contributed to the project (Markdown format)
├── LICENSE           <- File: Plain text file explaining the usage terms/license of the data/code (e.g., CC-BY, MIT)
├── README.md         <- File: Readme file (Markdown format)
├── bin               <- Folder: Your compiled model code can be stored here (not tracked by git)
├── config            <- Folder: Configuration files, e.g., for doxygen or for your model if needed
├── data              <- Folder: Data for this project
│   ├── external      <- Folder: Data from third-party sources
│   ├── interim       <- Folder: Intermediate data that has been transformed
│   ├── processed     <- Folder: The final, canonical data sets for modeling
│   └── raw           <- Folder: The original, immutable data dump
├── docs              <- Folder: Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks         <- Folder: IPython or R notebooks
├── reports           <- Folder: Manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures       <- Folder: Figures for the manuscript or reports
└── src               <- Folder: Source code for this project
    ├── data          <- Folder: Scripts and programs to process data
    ├── external      <- Folder: Any external source code, e.g., other projects or external libraries
    ├── models        <- Folder: Source code for your own model
    ├── tools         <- Folder: Any helper scripts
    └── visualization <- Folder: Visualization scripts, e.g., matplotlib or ggplot2 related
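As a sketch of generating this structure with Cookiecutter's Python API (assuming the cookiecutter package is installed and that the Reproducible Science template lives at the GitHub URL below; substitute the template your group actually uses if it differs):

```python
from cookiecutter.main import cookiecutter  # pip install cookiecutter

# Assumed template location for the Reproducible Science template;
# replace with your preferred fork or a local path if it differs.
cookiecutter(
    "https://github.com/mkrapp/cookiecutter-reproducible-science",
    no_input=False,  # prompt interactively for project name, author, license, etc.
)
```

The same template can also be applied from a shell by passing its URL to the cookiecutter command.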
LabArchives -- a suite of SaaS applications that helps scientists manage and document their research data, prove and protect discovery, collaborate securely, and manage inventory, samples, and lab resources. LabArchives is FERPA, HIPAA, and GDPR compliant and integrates with FigShare and other data repositories.
The following list provides examples of data types to be shared; it is not exhaustive.
"Omics" Data
Imaging Data
Biological Data
Phenotype Data
Additional Data
| Performance Criteria | Complete/detailed | Addressed issue, but incomplete | Did not address |
|---|---|---|---|
| 1.1 Describes what types of scientific data will be generated and/or used in the research | Clearly defines data type(s) | Some details about data types are included, but details are missing or would not be well understood by someone outside the project | No information about data types is included; fails to adequately describe the data types |
| 1.2 Describes data formats created or used during the project | Clearly describes data format standard(s) for the data | Describes some but not all data formats or data format standards; where standards do not exist, does not propose how this will be addressed | Does not include information about data format standards |
| 1.3 Identifies other relevant data and any associated documentation that will be made accessible | Clearly describes other relevant data and associated documentation | Missing some details regarding documentation, so the data would not be well understood by someone outside the project | No information about the data documentation |
| 1.4 Identifies how much data (volume) will be produced | Expected scale of data (GB, TB, etc.) is clearly specified | Expected scale of data is vaguely specified | Expected scale of data is not specified |
| Performance Criteria | Complete/detailed | Addressed issue, but incomplete | Did not address |
|---|---|---|---|
| 2.1 Describes what specialized or licensed software or tools are needed to access or manipulate data generated and/or used in the research | Clearly defines what software and tools are needed to access and manipulate data, and specifies which are proprietary, open source, and/or custom created by the researcher(s) | Some details about software or tools are included, but the DMP is missing details or would not be well understood by someone outside the project | No information about software or tools is included; fails to adequately describe software or tools |
| 2.2 Describes whether custom-created code or in-house software or tools are needed or will be created to access or manipulate data, and if/how this will be made available | Clearly defines whether custom code or tools will be needed to access and manipulate data, and if/how they will be made available | Missing some details about whether custom code or tools will be needed or created | No information about custom code or tools is included |