NHDS data

The National Hospital Discharge Survey (NHDS), conducted annually from 1965 to 2010, was a national probability survey designed to provide information on the characteristics of inpatients discharged from non-Federal short-stay hospitals in the United States. NHDS data are available by year and are used to examine public health topics and to support a variety of activities by governmental, scientific, academic, and commercial institutions. This demo includes a total of 3,257,718 hospital discharges between 2000 and 2010. For more information visit https://www.cdc.gov/nchs/nhds/index.htm.

NEW! Mexico Government COVID-19 dataset

The Mexico Government COVID-19 dataset is a collection of publicly available information on cases tested across Mexico during the ongoing COVID-19 pandemic. We analyzed the dataset release of 2020-11-02. This work is part of the COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool) project of the Biomedical Data Science Lab, Universitat Politècnica de València, Spain.

Tutorial Notebook: Temporal Variability of the Mexico Government COVID-19 dataset.

COVID-19 Open Research Dataset Challenge (CORD-19)

The CORD-19 dataset is provided by the White House and a coalition of leading research groups so that natural language processing and other AI techniques can be applied to generate new insights in support of the ongoing fight against the new COVID-19 infectious disease. CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. We provide here the temporal variability of the Kaggle version of the dataset, specifically for the following variables: source_x, license, journal, title and abstract. For title and abstract we applied basic processing to extract their n-grams (nltk stopwords, R text2vec pruning [term_count_min = 10, doc_proportion_max = 0.99, doc_proportion_min = 0.001], n-grams from 1 to 4). Source: Kaggle - COVID-19 Open Research Dataset Challenge (CORD-19).
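The n-gram extraction described above can be sketched with text2vec as follows. This is a minimal illustration and not necessarily the exact pipeline used for the app: the toy titles and the short stop_words vector are made-up placeholders (the original analysis used the nltk stopword list), and only the pruning thresholds and the 1 to 4 n-gram range come from the description above.

```r
library(text2vec)

# Toy corpus (placeholder data); each title is repeated so that some
# terms reach the term_count_min = 10 threshold used below.
titles <- rep(
  c("Clinical features of patients infected with the novel coronavirus",
    "Transmission dynamics of the novel coronavirus in Wuhan",
    "Deep learning for chest imaging"),
  times = 20
)

# Placeholder stopword list; substitute the full list you actually use.
stop_words <- c("of", "the", "with", "in", "for")

# Lowercase and tokenize the documents.
it <- itoken(titles,
             preprocessor = tolower,
             tokenizer    = word_tokenizer,
             progressbar  = FALSE)

# Build a vocabulary of 1- to 4-grams, skipping stopwords.
vocab <- create_vocabulary(it,
                           ngram     = c(ngram_min = 1L, ngram_max = 4L),
                           stopwords = stop_words)

# Prune rare and near-ubiquitous terms with the thresholds quoted above.
vocab <- prune_vocabulary(vocab,
                          term_count_min     = 10,
                          doc_proportion_max = 0.99,
                          doc_proportion_min = 0.001)

# Inspect the most frequent surviving n-grams.
head(vocab[order(-vocab$term_count), ])
```

The surviving n-grams can then be treated as the values of a categorical variable whose distribution is tracked over time.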

Tutorial Notebook: Applying EHRtemporalVariability to the CORD-19 dataset.

Epidemiological Data from the nCoV-2019 Outbreak (nCoV-2019)

A collection of publicly available information on worldwide cases confirmed during the ongoing nCoV-2019 outbreak. We analyzed the raw dataset release of 2020-03-31, which contains a total of 129,877 cases with valid dates. Further preprocessing of the lists of symptoms and chronic diseases will be included. Source: nCoV2019 GitHub repository. More info: https://doi.org/10.1038/s41597-020-0448-0.

Additional info in our COVID-19 Subgroup Discovery and Exploration tool.

DS4C: Data Science for COVID-19 in South Korea

DS4C is a structured dataset based on the report materials of the Korea Centers for Disease Control & Prevention (KCDC) and local governments. We focus on the epidemiological data of COVID-19 patients and currently run a weekly analysis using the confirmation date as the reference date. Source: Kaggle - Data Science for COVID-19 (DS4C).

Tutorial Notebook: Applying EHRtemporalVariability to the DS4C dataset.


Using the EHRtemporalVariability R package allows more flexibility in raw data formatting and variable preprocessing, such as formatting ICD-9 codes, selecting or deriving specific variables, or even reducing the dimensionality of the data. The resulting objects of classes 'DataTemporalMap' and 'IGTProjection' can be used as input for the Shiny app. You can save them in a file to be uploaded here by typing: save( dataTemporalMaps, igtProjections, file = "myResults.RData" ). For more information read the accompanying vignette.
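As a rough sketch of that workflow, the following script builds the two kinds of objects from a small synthetic data frame and saves them to a file. The data frame, its column names (date, sex, age) and the monthly period are assumptions made for illustration only; the vignette remains the reference for the available options.

```r
library(EHRtemporalVariability)

# Synthetic, non-patient data for illustration (hypothetical columns).
set.seed(1)
n    <- 2000
days <- seq(as.Date("2000-01-01"), as.Date("2001-12-31"), by = "day")
dataset <- data.frame(
  date = format(sample(days, n, replace = TRUE), "%Y-%m-%d"),  # dates as text
  sex  = factor(sample(c("F", "M"), n, replace = TRUE)),
  age  = round(runif(n, min = 0, max = 90)),
  stringsAsFactors = FALSE
)

# Convert the text date column to Date class.
datasetFormatted <- formatDate(
  input      = dataset,
  dateColumn = "date",
  dateFormat = "%Y-%m-%d"
)

# One DataTemporalMap per variable, batched by month.
dataTemporalMaps <- estimateDataTemporalMap(
  data           = datasetFormatted,
  dateColumnName = "date",
  period         = "month"
)

# One IGTProjection per DataTemporalMap, keeping the variable names.
igtProjections <- sapply(dataTemporalMaps, estimateIGTProjection)
names(igtProjections) <- names(dataTemporalMaps)

# Save both lists in a single RData file.
save(dataTemporalMaps, igtProjections, file = "myResults.RData")
```

Only these aggregated objects are written to the file; the row-level data frame is not included.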

Do not upload patient-level data

To see what a correctly formatted data set looks like, download the NHDS demo file below (limited to a 10% random subsample of the NHDS data between 2000 and 2010).

Download the NHDS RData here!