Welcome to the EHRtemporalVariability Shiny App! (public demo version)

Variability in healthcare processes and protocols, or variability caused by system or human biases, can bias the reuse of Electronic Health Records (EHRs) by introducing unexpected batch effects. EHRtemporalVariability is an open-source R package and Shiny app for exploring and uncovering the effects of time on the statistical distributions of EHRs, namely dataset shifts. EHRtemporalVariability batches EHR data and visualizes its temporal evolution through dynamic heatmaps and non-parametric information-geometry plots for coded, numerical and multivariate EHR data.

R package GitHub repo: https://github.com/hms-dbmi/EHRtemporalVariability

Shiny app GitHub repo: https://github.com/hms-dbmi/EHRtemporalVariability-shiny

To begin, choose one of the following options:

  1. "Demo with real data" walks through a case study of the app's capabilities using the US National Hospital Discharge Survey open dataset.
  2. "NEW! COVID-19 PUBLIC DATASETS" offers newly added case studies with public COVID-19 datasets to help researchers delineate their temporal variability.
  3. "Load your .RData" takes an .RData file generated by the EHRtemporalVariability R package to display and explore its results.

For further details see our vignette and publications.

If you use EHRtemporalVariability, please cite:

Carlos Sáez, Alba Gutiérrez-Sacristán, Isaac Kohane, Juan M García-Gómez, Paul Avillach. EHRtemporalVariability: delineating temporal data-set shifts in Electronic Health Records. GigaScience, Volume 9, Issue 8, August 2020, giaa079. https://doi.org/10.1093/gigascience/giaa079

NHDS data

The National Hospital Discharge Survey (NHDS), which was conducted annually from 1965 to 2010, was a national probability survey designed to meet the need for information on characteristics of inpatients discharged from non-Federal short-stay hospitals in the United States. Data from the NHDS are available annually and are used to examine important topics of interest in public health and for a variety of activities by governmental, scientific, academic, and commercial institutions. This demo includes a total of 3,257,718 hospital discharges between 2000 and 2010. For more information visit https://www.cdc.gov/nchs/nhds/index.htm.

COVID-19 Open Research Dataset Challenge (CORD-19)

The CORD-19 dataset is provided by the White House and a coalition of leading research groups to apply natural language processing and other AI techniques to generate new insights in support of the ongoing fight against the new COVID-19 infectious disease. CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. We provide here the temporal variability of the Kaggle version of the dataset, specifically for the following variables: source_x, license, journal, title and abstract. For title and abstract we applied basic processing to extract their n-grams (nltk stopwords, R text2vec pruning [term_count_min = 10, doc_proportion_max = 0.99, doc_proportion_min = 0.001], n-grams from 1 to 4). Source: Kaggle - COVID-19 Open Research Dataset Challenge (CORD-19).

Tutorial Notebook: Applying EHRtemporalVariability to the CORD-19 dataset.

Epidemiological Data from the nCoV-2019 Outbreak (nCoV-2019)

A collection of publicly available information on worldwide cases confirmed during the ongoing nCoV-2019 outbreak. We analyzed the raw dataset released on 2020-03-31, which contains a total of 129,877 cases with valid dates. Further preprocessing of lists of symptoms and chronic diseases will be included. Source: nCoV2019 GitHub repository. More info: https://doi.org/10.1038/s41597-020-0448-0.

Additional info in our COVID-19 Subgroup Discovery and Exploration tool.

DS4C: Data Science for COVID-19 in South Korea

DS4C is a structured dataset based on the report materials of the Korea Centers for Disease Control & Prevention (KCDC) and local governments. We focus on the epidemiological data of COVID-19 patients and currently perform a weekly analysis with the confirmation date as the reference date. Source: Kaggle - Data Science for COVID-19 (DS4C).

Tutorial Notebook: Applying EHRtemporalVariability to the DS4C dataset.


Using the EHRtemporalVariability R package allows more flexibility in raw data formatting and variable preprocessing, such as formatting ICD-9 codes, selecting or deriving specific variables, or even reducing the dimensionality of the data. The resulting objects of classes 'DataTemporalMap' and 'IGTProjection' can be used as input for the Shiny app. You can save them in a file to be uploaded here by typing: save(dataTemporalMaps, igtProjections, file = "myResults.RData"). For more information read the accompanying vignette.
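The workflow described above can be sketched in R as follows. This is a minimal illustration, not the authors' reference pipeline: it assumes the EHRtemporalVariability package is installed, and the data frame toyData with its columns (date, age, sex) is synthetic and hypothetical, made up only so the example runs end to end. With real data you would first read and preprocess your own table and make sure the date column is of class Date.

```r
# Sketch only: assumes the EHRtemporalVariability package is installed.
library(EHRtemporalVariability)

# Synthetic toy data (hypothetical, not real patients): one Date column
# plus one numeric and one categorical variable.
set.seed(42)
n <- 2000
days <- seq(as.Date("2000-01-01"), as.Date("2001-12-31"), by = "day")
toyData <- data.frame(
  date = sample(days, n, replace = TRUE),
  age  = round(rnorm(n, mean = 50, sd = 15)),
  sex  = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Batch the data by month: one DataTemporalMap is estimated per variable.
dataTemporalMaps <- estimateDataTemporalMap(
  data           = toyData,
  dateColumnName = "date",
  period         = "month"
)

# Estimate an IGTProjection from each DataTemporalMap (default settings).
igtProjections <- sapply(dataTemporalMaps, estimateIGTProjection)
names(igtProjections) <- names(dataTemporalMaps)

# Save both result objects into a single file that can be uploaded to the app.
save(dataTemporalMaps, igtProjections, file = "myResults.RData")
```

The saved file holds only the aggregated result objects, not the rows of toyData, which is the kind of file the "Load your .RData" option expects.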

Do not upload patient-level data

To see what a correctly formatted dataset looks like, download the NHDS demo file below (limited to a 10% random subsample of the NHDS data between 2000 and 2010).

Download the NHDS RData here!
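Before uploading a file of your own, it can help to check what it actually contains. The sketch below uses only base R; inspectRData is a hypothetical helper written for this illustration (it is not part of the EHRtemporalVariability package), and the two empty lists are placeholders standing in for real package results.

```r
# Hypothetical helper (not part of EHRtemporalVariability): load an .RData
# file into a private environment and report the class of each stored object.
inspectRData <- function(path) {
  env <- new.env()
  objectNames <- load(path, envir = env)  # load() returns the object names
  vapply(objectNames,
         function(n) class(get(n, envir = env))[1],
         character(1))
}

# Demonstration with placeholder lists standing in for real package results.
dataTemporalMaps <- list()
igtProjections   <- list()
demoFile <- tempfile(fileext = ".RData")
save(dataTemporalMaps, igtProjections, file = demoFile)

# Named character vector: object name -> first class of that object.
contents <- inspectRData(demoFile)
print(contents)
```

If the listing shows a data.frame or any other object holding raw records, the file probably includes patient-level data and should not be uploaded.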