It is generally recommended to keep data, code, and outputs separate to avoid confusion. Although the exact structure can vary depending on the project, here are two approaches to organizing files.
Basic Projects
One or two data processing scripts; a simple pipeline with few steps.
.
├── InputData   <- Folder containing data that will be processed
├── OutputData  <- Folder containing data that has been processed
├── Figures     <- Folder containing figures or tables summarizing the results
├── Code        <- Folder containing the scripts or programs used for the analysis
├── LICENSE     <- File explaining the terms under which the data/code are made available
└── README.txt  <- File documenting the analysis and (ideally) the purpose of each file
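For a project at this scale, the layout can be created by hand or scripted. Here is a minimal sketch using only Python's standard library; the folder and file names match the listing above, and the README text is a placeholder to be replaced with real documentation:

```python
from pathlib import Path

def scaffold_basic_project(root: str = ".") -> None:
    """Create the basic project layout shown above."""
    base = Path(root)
    for folder in ("InputData", "OutputData", "Figures", "Code"):
        (base / folder).mkdir(parents=True, exist_ok=True)
    # Placeholder files: replace with real license text and with
    # documentation of the analysis and the purpose of each file.
    (base / "LICENSE").touch()
    if not (base / "README.txt").exists():
        (base / "README.txt").write_text("Document the analysis here.\n")

if __name__ == "__main__":
    scaffold_basic_project()
```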
Advanced Projects
Many kinds of input data, documentation, and code. For highly complex datasets, refer to the Bio-Lab Informatics System (BLIS).
This structure can be generated and auto-populated with the Reproducible Science template from Cookiecutter.
.
├── AUTHORS.md        <- File: List of people who contributed to the project (Markdown format)
├── LICENSE           <- File: Plain text file explaining the usage terms/license of the data/code (e.g., CC-BY, MIT)
├── README.md         <- File: Readme file (Markdown format)
├── bin               <- Folder: Your compiled model code can be stored here (not tracked by git)
├── config            <- Folder: Configuration files, e.g., for doxygen or for your model if needed
├── data              <- Folder: Data for this project
│   ├── external      <- Folder: Data from third-party sources
│   ├── interim       <- Folder: Intermediate data that has been transformed
│   ├── processed     <- Folder: The final, canonical data sets for modeling
│   └── raw           <- Folder: The original, immutable data dump
├── docs              <- Folder: Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks         <- Folder: IPython or R notebooks
├── reports           <- Folder: Manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures       <- Folder: Figures for the manuscript or reports
└── src               <- Folder: Source code for this project
    ├── data          <- Folder: Scripts and programs to process data
    ├── external      <- Folder: Any external source code, e.g., other projects or external libraries
    ├── models        <- Folder: Source code for your own model
    ├── tools         <- Folder: Any helper scripts
    └── visualization <- Folder: Visualization scripts, e.g., matplotlib or ggplot2 related
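As a sketch of generating this structure with Cookiecutter's Python API (assuming the cookiecutter package is installed and that the Reproducible Science template lives at the GitHub URL below; substitute the template your group actually uses if it differs):

```python
from cookiecutter.main import cookiecutter  # pip install cookiecutter

# Assumed template location for the Reproducible Science template;
# replace with your preferred fork or a local path if it differs.
cookiecutter(
    "https://github.com/mkrapp/cookiecutter-reproducible-science",
    no_input=False,  # prompt interactively for project name, author, license, etc.
)
```

The same template can also be applied from a shell by passing its URL to the cookiecutter command.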
LabArchives -- a suite of SaaS applications that helps scientists manage and document their research data, prove and protect discovery, collaborate securely, and manage inventory, samples, and lab resources. LabArchives is FERPA, HIPAA, and GDPR compliant and integrates with FigShare and other data repositories.
The following list provides examples of data types to be shared; it is not exhaustive.
"Omics" Data
Imaging Data
Biological Data
Phenotype Data
Additional Data
| Performance Criteria | Complete/detailed | Addressed issue, but incomplete | Did not address |
|---|---|---|---|
| 1.1 Describes what types of scientific data will be generated and/or used in the research | Clearly defines data type(s) | Some details about data types are included, but details are missing or would not be well understood by someone outside the project | No information about data types is included; fails to adequately describe the data types |
| 1.2 Describes data formats created or used during the project | Clearly describes data format standard(s) for the data | Describes some but not all data formats or data format standards; where standards do not exist, does not propose how this will be addressed | Does not include information about data format standards |
| 1.3 Identifies other relevant data and any associated documentation that will be made accessible | Clearly describes other relevant data and associated documentation | Missing some details regarding documentation, so the data would not be well understood by someone outside the project | No information about the data documentation |
| 1.4 Identifies how much data (volume) will be produced | Expected scale of data (GB, TB, etc.) is clearly specified | Expected scale of data is vaguely specified | Expected scale of data is not specified |
| Performance Criteria | Complete/detailed | Addressed issue, but incomplete | Did not address |
|---|---|---|---|
| 2.1 Describes what specialized or licensed software or tools are needed to access or manipulate data generated and/or used in the research | Clearly defines what software and tools are needed to access and manipulate data, and specifies which are proprietary, open source, and/or custom created by the researcher(s) | Some details about software or tools are included, but the DMP is missing details or would not be well understood by someone outside the project | No information about software or tools is included; fails to adequately describe software or tools |
| 2.2 Describes whether custom-created code or in-house software or tools are needed or will be created to access or manipulate data, and if/how this will be made available | Clearly defines whether custom code or tools will be needed to access and manipulate data, and if/how they will be made available | Missing some details about whether custom code or tools will be needed or created | No information about custom code or tools is included |