Working with open-source Human Microbiome Project Data phases 1 (HMP) and 2 (iHMP): Efficient Data Access and Analysis Workflow


Instructor names and contact information

  1. Levi Waldron, Graduate School of Public Health and Health Policy, City University of New York, New York, NY. email:

  2. Ni Zhao, Department of Biostatistics, Johns Hopkins University, Baltimore, MD. email:

  3. Ekaterina Smirnova, Department of Biostatistics, Virginia Commonwealth University, Richmond, VA. email:

Workshop Description

The composition of microbial species in the human body is essential for maintaining human health, and it is associated with a number of diseases, including obesity, inflammatory bowel disease, and bacterial vaginosis. Over the last decade, microbiome data analysis has shifted almost entirely towards sampling various sites of the human body directly and profiling a large number of microbes with 16S or whole-metagenome sequencing. This technological shift radically changed the data collected and led to the creation of the Human Microbiome Project (HMP) in 2008. With the growth and success of the microbiomics field, the size, complexity, and availability of microbiome data in any given experiment have grown dramatically. Publicly available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, use of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis.

The main goal of this workshop is to provide a much-needed tutorial on downloading, understanding, and analyzing the publicly available HMP data. We will describe two packages, HMP16SData and HMP2Data, which provide the data from Phase 1 and Phase 2 of the HMP, respectively, together with comprehensive tutorials on their analysis. These packages provide count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. As such, they are the first comprehensive publicly available tools for working with microbiome data that were collected at different body sites (e.g., fecal, nasal, vaginal), coupled with other omics measurements (e.g., cytokines, whole metagenome), and gathered longitudinally from multiple subjects at multiple visits. By removing the bioinformatic hurdles of data access and management, HMP16SData and HMP2Data enable researchers with only basic R skills to analyze HMP data quickly. This makes it possible to study the dynamics of microbial composition during disease progression, discover disease-specific microbial biomarkers, and understand the role of the microbiome in relation to other omics measurements.

This workshop will consist of a series of instructor-led live demos, presented together with brief lecture material and a review of statistical methods where necessary. Completely worked-out examples are available in the HMP16SData and HMP2Data package vignettes.


Prerequisites

  • Basic knowledge of R syntax
  • Familiarity with microbiome data
  • Familiarity with alpha and beta diversity in the context of microbiome data analysis
  • Familiarity with principal components and principal coordinates analysis

Background reading:

  • Schiffer L, Azhar R, Shepherd L, Ramos M, Geistlinger L, Huttenhower C, Dowd JB, Segata N, Waldron L (2019). “HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor.” American Journal of Epidemiology. doi: 10.1093/aje/kwz006.

Time outline

Activity                       Time
HMP16SData                     10m
HMP2Data                       10m
Microbiome Analyses            15m
Multi-omics data integration   15m

Workshop goals and objectives

The main purpose of this workshop is to help researchers interested in public microbiome data to start using the HMP16SData and HMP2Data packages.

Learning goals

  • understand the goals of the HMP projects and the studies available through the HMP DACC portal
  • download and use HMP data packages
  • understand the difference between phase 1 and phase 2 projects
  • understand principles of -omics data integration and visualization

Learning objectives

  • install HMP data packages and access vignettes
  • download HMP data to produce analyses illustrated in package vignettes
  • create sample summary statistics
  • create alpha and beta diversity plots
  • understand longitudinal patterns and examine microbiome diversity changes
  • integrate 16S and cytokine data using co-inertia analysis
  • learn to capture patterns across -omics data sets with different structures


All packages needed for this workshop can be installed with the following command:

BiocManager::install("waldronlab/MicrobiomeWorkshop", build_vignettes=TRUE, 

Load the workshop package:


Then you should be able to view the compiled vignette:
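A minimal sketch of these two steps, assuming the workshop package installed under the name MicrobiomeWorkshop (the name used in the install command above). The requireNamespace() guard and the workshop_installed variable are illustrative additions, not part of the workshop material:

```r
# Assumption: the installed package name matches the GitHub repository name.
workshop_installed <- requireNamespace("MicrobiomeWorkshop", quietly = TRUE)

if (workshop_installed) {
  # Attach the workshop package
  library(MicrobiomeWorkshop)
  # List its compiled vignettes (opens a browser page in interactive sessions)
  browseVignettes("MicrobiomeWorkshop")
} else {
  message("MicrobiomeWorkshop is not installed; run the install command first.")
}
```

If the vignette list comes back empty, the package was probably installed without building vignettes.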


Note that all required packages are listed in the DESCRIPTION file of the workshop package.

Also load some utility packages:


Source some miscellaneous R scripts for the HMP2Data component:

# Plotting helper scripts shipped with the workshop package
source(system.file("vignettes", "CIAPlots.R", package = "MicrobiomeWorkshop"))
source(system.file("vignettes", "ScreePlot.R", package = "MicrobiomeWorkshop"))


Bioconductor provides curated resources of microbiome data. Most microbiome data are generated either by targeted amplicon sequencing (usually of variable regions of the 16S ribosomal RNA gene) or by metagenomic shotgun sequencing (MGX). Data from these two approaches are processed by different sequence analysis tools, but downstream statistical and ecological analysis can involve any of the following types of data:

  • taxonomic abundance at different levels of the taxonomic hierarchy
  • phylogenetic distances and the phylogenetic tree of life
  • metabolic potential of the microbiome
  • abundance of microbial genes and gene families

A review of the types and properties of microbiome data is provided by Morgan and Huttenhower (2012).
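To make the first of these data types concrete, the sketch below collapses a small OTU count table to the genus and phylum levels in base R. The counts, OTU names, taxonomy assignments, and the aggregate_rank() helper are made up for illustration and are not taken from any HMP data set or package:

```r
# Toy OTU-by-sample count matrix (hypothetical values)
counts <- matrix(
  c(10, 0, 5,
     3, 7, 0,
     0, 2, 8,
     6, 1, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), paste0("sample", 1:3))
)

# Toy taxonomy table: one row per OTU (hypothetical assignments)
taxonomy <- data.frame(
  phylum = c("Firmicutes", "Firmicutes", "Bacteroidetes", "Bacteroidetes"),
  genus  = c("Lactobacillus", "Streptococcus", "Bacteroides", "Bacteroides"),
  row.names = rownames(counts)
)

# Sum the counts of all OTUs that share a label at the chosen rank
aggregate_rank <- function(counts, taxonomy, rank) {
  rowsum(counts, group = taxonomy[rownames(counts), rank])
}

genus_counts  <- aggregate_rank(counts, taxonomy, "genus")
phylum_counts <- aggregate_rank(counts, taxonomy, "phylum")

# Relative abundance: rescale so that each sample (column) sums to 1
phylum_relab <- sweep(phylum_counts, 2, colSums(phylum_counts), "/")
phylum_relab
```

The same aggregation applies at any rank for which the taxonomy table has a column; coarser ranks give fewer rows.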

curatedMetagenomicData: Curated and processed metagenomic data through ExperimentHub

curatedMetagenomicData (Pasolli et al. 2017) provides 6 types of processed data for >30 publicly available whole-metagenome shotgun sequencing datasets, including those from the Human Microbiome Project Phase 1:

  1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
  2. Presence of unique, clade-specific markers
  3. Abundance of unique, clade-specific markers
  4. Abundance of gene families
  5. Metabolic pathway coverage
  6. Metabolic pathway abundance

Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.

Currently, curatedMetagenomicData provides:

  • 8184 samples from 46 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
  • Processed data from whole-metagenome shotgun sequencing, with manually curated metadata, provided as integrated and documented Bioconductor ExpressionSet objects
  • ~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
  • Processing of data through the MetaPhlAn2 pipeline for taxonomic abundance, and HUMAnN2 pipeline for metabolic analysis
  • Together, these datasets represent ~100 TB of raw sequencing data, but the processed data provided are much smaller.

These datasets are documented in the reference manual.

This is an ExperimentHub package, and its main workhorse function is curatedMetagenomicData():

The manually curated metadata for all available samples are provided in a single table, combined_metadata:


The main function provides a list of ExpressionSet objects:

oral <- c("HMP_2012.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
#> Working on HMP_2012.metaphlan_bugs_list.oralcavity
#> snapshotDate(): 2019-06-20
#> see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache 
#>     'EH2491 : 2491'
#> List of length 1
#> names(1): HMP_2012.metaphlan_bugs_list.oralcavity

These ExpressionSet objects can also be converted to phyloseq objects with the ExpressionSet2phyloseq() function, for ecological analysis and for differential abundance analysis with the DESeq2 package:

ExpressionSet2phyloseq(esl[[1]], phylogenetictree = TRUE)
#> Loading required namespace: phyloseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 600 taxa and 415 samples ]
#> sample_data() Sample Data:       [ 415 samples by 18 sample variables ]
#> tax_table()   Taxonomy Table:    [ 600 taxa by 8 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 600 tips and 599 internal nodes ]

See the documentation of phyloseq for more on ecological and differential abundance analysis of the microbiome.
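As a rough illustration of what such ecological analyses compute, the base R sketch below calculates Shannon alpha diversity per sample and Bray-Curtis beta diversity between samples, followed by a principal coordinates analysis. The count matrix and the shannon() and bray_curtis() helpers are made up for illustration; in practice, functions from packages such as phyloseq would be used instead:

```r
# Toy taxa-by-sample count matrix (hypothetical values)
counts <- matrix(
  c(10,  0, 5,  5,
    10,  0, 5,  5,
     0, 20, 5,  0,
     0,  0, 5, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:4), paste0("sample", 1:4))
)

# Alpha diversity: Shannon index H = -sum(p * log(p)) over observed taxa
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}
alpha <- apply(counts, 2, shannon)
# sample2 contains a single taxon, so its Shannon index is 0

# Beta diversity: Bray-Curtis dissimilarity between two count vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

n <- ncol(counts)
beta <- matrix(0, n, n, dimnames = list(colnames(counts), colnames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    beta[i, j] <- bray_curtis(counts[, i], counts[, j])
  }
}

# Principal coordinates analysis (classical MDS) on the dissimilarities
pcoa <- cmdscale(as.dist(beta), k = 2)

alpha
round(beta, 2)
```

Plotting the columns of pcoa against each other gives the usual ordination plot, with each point representing one sample.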

HMP16SData: 16S rRNA Sequencing Data from the Human Microbiome Project

HMP16SData (Schiffer et al. 2019) is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data. Taxonomic count data files are provided as downloaded from the HMP Data Analysis and Coordination Center, where they were produced by its QIIME pipeline. Processed data are provided as SummarizedExperiment class objects via ExperimentHub. As in other ExperimentHub-based packages, a convenience function handles downloading, automatic local caching, and loading of the serialized Bioconductor data object. For example, the V13() function returns taxonomic counts from the V1-3 variable regions of the 16S rRNA gene, along with the unrestricted participant data and phylogenetic tree.
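A minimal sketch of such a call, assuming the HMP16SData and SummarizedExperiment packages are installed from Bioconductor and that network access is available for the first download. V13() is the package's accessor for the V1-3 data; the requireNamespace() guard and the variable names are illustrative additions:

```r
# Assumption: both packages were installed beforehand with BiocManager.
have_hmp16s <- requireNamespace("HMP16SData", quietly = TRUE) &&
  requireNamespace("SummarizedExperiment", quietly = TRUE)

if (have_hmp16s) {
  # Download the V1-3 data, or load it from the local ExperimentHub cache
  V13_se <- HMP16SData::V13()
  print(V13_se)

  # Taxonomic count matrix: OTUs in rows, samples in columns
  otu_counts <- SummarizedExperiment::assay(V13_se, "16SrRNA")

  # Unrestricted participant data, one row per sample
  participant_data <- SummarizedExperiment::colData(V13_se)

  # Phylogenetic tree stored in the object's metadata
  tree <- S4Vectors::metadata(V13_se)$phylogeneticTree
}
```

Printing the object gives a summary like the one below.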

#> snapshotDate(): 2019-06-20
#> see ?HMP16SData and browseVignettes('HMP16SData') for documentation
#> downloading 0 resources
#> loading from cache 
#>     'EH1117 : 1117'
#> class: SummarizedExperiment 
#> dim: 43140 2898 
#> metadata(2): experimentData phylogeneticTree
#> assays(1): 16SrRNA
#> rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
#> colnames(2898): 700013549 700014386 ... 700114963 700114965

This can also be converted to phyloseq for ecological and differential abundance analysis; see the HMP16SData vignette for details.


Integrative Human Microbiome Project (iHMP)

  • One of the largest open data resources for studying longitudinal microbiome changes and connection to disease

  • Novel data – first scientific results just published in the special collection of Nature and its sister journals (

  • Human Microbiome Project Data Portal (
  • Open data – but not straightforward to download and work with