Edward G. Miner Library

Data Management: Organizing Data

This guide provides resources for managing and sharing your research data no matter the discipline.

Suggested Folder Structures 🗂️

It is generally recommended to keep data, code, and outputs separate to avoid confusion. Although the exact structure can vary depending on the project, here are two approaches to organizing files.

Basic Projects
One or two data processing scripts, simple pipeline with few steps

.
├── InputData          <- Folder containing data that will be processed
├── OutputData         <- Folder containing data that has been processed
├── Figures            <- Folder containing figures or tables summarizing the results
├── Code               <- Folder containing the scripts or programs that do the analysis
├── LICENSE            <- File explaining the terms under which data/code is being made available
└── README.txt         <- File documenting the analysis, and (ideally) the purpose of each file.
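
The basic layout above can be scaffolded with a few lines of Python. This is a minimal sketch rather than a prescribed tool: the function name `create_basic_project` is hypothetical, and the folder and file names are simply copied from the tree above.

```python
from pathlib import Path

# Folder and file names taken from the basic layout shown above.
FOLDERS = ["InputData", "OutputData", "Figures", "Code"]
FILES = ["LICENSE", "README.txt"]


def create_basic_project(root):
    """Create the basic project skeleton under `root` and return its path.

    Hypothetical helper for illustration; adapt the names to your project.
    """
    root = Path(root)
    for name in FOLDERS:
        # exist_ok=True makes the function safe to re-run on an existing project
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file and leaves existing content untouched
        (root / name).touch(exist_ok=True)
    return root
```

Calling `create_basic_project("my_project")` creates the four folders plus empty LICENSE and README.txt files, which still need to be filled in by hand.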

Advanced Projects
Many kinds of input data, documentation, and code. For highly complex datasets, refer to the Bio-Lab Informatics System (BLIS).

This structure can be generated and auto-populated with the Reproducible Science template from Cookiecutter.

.
├── AUTHORS.md         <- File: List of people that contributed to the project (Markdown format)
├── LICENSE            <- File: Plain text file explaining the usage terms/license of the data/code (e.g., CC-BY, MIT)
├── README.md          <- File: Readme file (Markdown format)
├── bin                <- Folder: Your compiled model code can be stored here (not tracked by git)
├── config             <- Folder: Configuration files, e.g., for doxygen or for your model if needed
├── data               <- Folder: Data for this project
│   ├── external       <- Folder: Data from third party sources.
│   ├── interim        <- Folder: Intermediate data that has been transformed.
│   ├── processed      <- Folder: The final, canonical data sets for modeling.
│   └── raw            <- Folder: The original, immutable data dump.
├── docs               <- Folder: Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks          <- Folder: IPython or R notebooks
├── reports            <- Folder: Manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures        <- Folder: Figures for the manuscript or reports
└── src                <- Folder: Source code for this project
    ├── data           <- Folder: Scripts and programs to process data
    ├── external       <- Folder: Any external source code, e.g., other projects, or external libraries
    ├── models         <- Folder: Source code for your own model
    ├── tools          <- Folder: Any helper scripts go here
    └── visualization  <- Folder: Visualisation scripts, e.g., matplotlib, ggplot2 related
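
The layout above describes data/raw as the original, immutable data dump. One way to check that raw files really stay unchanged is to record a checksum for each file and compare later. The sketch below assumes SHA-256 checksums held in a plain dictionary; the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical and not part of the Cookiecutter template.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Map each file under raw_dir (relative POSIX path) to its checksum."""
    raw_dir = Path(raw_dir)
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir, manifest):
    """Return sorted paths added, removed, or modified since the manifest was built."""
    current = build_manifest(raw_dir)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```

A typical use would be to call `build_manifest("data/raw")` once when the data arrive, store the result (for example as JSON next to the README), and run `changed_files` before each analysis; an empty list means the raw data are untouched.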

Electronic Lab Notebook



LabArchives: a suite of SaaS applications that help scientists manage and document their research data, prove & protect discovery, collaborate securely, manage inventory & samples, and manage lab resources. LabArchives is FERPA, HIPAA, and GDPR compliant and offers integration with FigShare and other data repositories.

Data Life-Cycle

The following list provides examples of data types to be shared. The list does not include all data types to be considered.

"Omics" Data

  • Genomics and epigenetics (e.g., WGS, WES, EPIC arrays, targeted panels)
  • Transcriptomics (e.g., RNA-seq)
  • Microbiomics (e.g., bacteria, virus)
  • Proteomics (e.g., mass spec, RPPA)
  • Metabolomics (e.g., mass spec)

Imaging Data

  • Medical imaging (e.g., ultrasound, MRI, CT)
  • Non-medical imaging (e.g., fluorescence microscopy)
  • Other (e.g., histopathology)

Biological Data

  • Electrophysiology (e.g., sensor data, ECG)
  • Biochemical (e.g., X-ray, NMR, AFM, FRET)
  • Pre-clinical (e.g., PDX growth curve)

Phenotype Data

  • Non-human (e.g., phenotypic features of animal models)
  • Human traits (e.g., blood type)
  • Demographics
  • Clinical data (including specimen information)

Additional Data

  • Clinical trial results
  • Epidemiology/surveillance
  • Administrative
  • Algorithm/Simulation
  • Social/Behavioral
  • Survey/questionnaire

Section 1 Rubric: Data Type

A general summary of the types and estimated amount of scientific data to be generated and/or used in the research.
Performance levels: Complete/detailed; Addressed issue, but incomplete; Did not address.

1.1 Describes what types of scientific data will be generated and/or used in the research

  • Complete/detailed: Clearly defines data type(s)
  • Addressed issue, but incomplete: Some details about data types are included, but details are missing or wouldn’t be well understood by someone outside of the project
  • Did not address: No information about data types is included; fails to adequately describe the data types

1.2 Describes data formats created or used during project

  • Complete/detailed: Clearly describes data format standard(s) for the data
  • Addressed issue, but incomplete: Describes some but not all data formats, or data format standards for the data. Where standards do not exist, does not propose how this will be addressed
  • Did not address: Does not include information about data format standards

1.3 Identifies relevant other data, and any associated documentation that will be made accessible

  • Complete/detailed: Clearly describes other relevant data and associated documentation
  • Addressed issue, but incomplete: Missing some details regarding documentation so that data wouldn't be well understood by someone outside of the project
  • Did not address: No information about the data documentation

1.4 Identifies how much data (volume) will be produced

  • Complete/detailed: Expected scale of data (GB, TB, etc.) is clearly specified
  • Addressed issue, but incomplete: Expected scale of data is vaguely specified
  • Did not address: Expected scale of data is not specified
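
Criteria 1.2 and 1.4 ask for data formats and expected volume. When pilot or existing data are already on disk, a rough inventory can inform those answers. The following is a sketch under stated assumptions: file extensions stand in for data formats (which is only approximate), sizes are reported in decimal units, and the function names `summarize_by_format` and `human_size` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def summarize_by_format(data_dir):
    """Return {extension: (file_count, total_bytes)} for files under data_dir."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(data_dir).rglob("*"):
        if path.is_file():
            # Lower-case the suffix so ".CSV" and ".csv" are counted together
            ext = path.suffix.lower() or "(no extension)"
            totals[ext][0] += 1
            totals[ext][1] += path.stat().st_size
    return {ext: (count, size) for ext, (count, size) in totals.items()}


def human_size(num_bytes):
    """Format a byte count using decimal units (1 GB = 1000**3 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        # Stop at the first unit where the value is below 1000, or at TB
        if size < 1000 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1000
```

Scaling the totals from a pilot dataset by the planned number of samples or sessions gives an estimate that can be stated in GB or TB, as the rubric asks.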

Section 2 Rubric: Related Tools, Software and/or Code

An indication of whether specialized tools are needed to access or manipulate shared scientific data to support replication or reuse, and name(s) of the needed tool(s) and software. If applicable, specify how needed tools can be accessed, (e.g., open source and freely available, generally available for a fee in the marketplace, available only from the research team) and, if known, whether such tools are likely to remain available for as long as the scientific data remain available.
Performance levels: Complete/detailed; Addressed issue, but incomplete; Did not address.

2.1 Describes what specialized or licensed software or tools are needed to access or manipulate data generated and/or used in the research

  • Complete/detailed: Clearly defines what software and tools are needed to access and manipulate data, and specifies which are proprietary, open source, and/or custom created by the researcher(s)
  • Addressed issue, but incomplete: Some details about software or tools are included, but DMP is missing details or wouldn’t be well understood by someone outside of the project
  • Did not address: No information about software or tools is included; fails to adequately describe software or tools.

2.2 Describes whether custom-created code or in-house software or tools are needed or will be created to access or manipulate data generated and/or used in the research and if/how this will be made available

  • Complete/detailed: Clearly defines whether custom code or tools will be needed to access and manipulate data, and if/how they will be made available.
  • Addressed issue, but incomplete: Missing some details about whether custom code or tools will be needed or created.
  • Did not address: No information about custom code or tools is included.
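
For projects whose analysis runs in Python, part of the software description for criterion 2.1 can be captured automatically. The sketch below is one possible approach, not a required one: it uses the standard-library `importlib.metadata` module to look up installed package versions, and the function name `describe_environment` is hypothetical.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": platform.python_version(), "packages": versions}
```

Passing the list of packages the analysis depends on returns a dictionary that can be pasted into the README or the data management plan. Licensing status (proprietary, open source, or custom) still has to be noted by hand.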