Getting Started with R and Python

The epidatr (R) and epidatpy (Python) packages provide access to all the endpoints of the Delphi Epidata API. They can be used to request specific signals on specific dates and in select geographic regions.

Table of Contents

  1. Setup
    1. Installation
    2. API Keys
  2. Basic Usage
  3. Getting versioned data
  4. Plotting
  5. Finding locations of interest
    1. International data
  6. Finding data sources and signals of interest
  7. Legacy Clients

Setup

Installation

You can install the stable version of epidatr from CRAN:

install.packages("epidatr")
# Or use pak/renv
pak::pkg_install("epidatr")

If you want the development version, install from GitHub:

remotes::install_github("cmu-delphi/epidatr", ref = "dev")

The epidatpy package will soon be available on PyPI as epidatpy. In the meantime, install it from GitHub:

pip install "git+https://github.com/cmu-delphi/epidatpy.git#egg=epidatpy"

API Keys

The Delphi API requires a (free) API key for full functionality. While most endpoints are available without one, there are limits on API usage for anonymous users, including a rate limit.

To generate your key, register for a pseudo-anonymous account.

  • R: See the save_api_key() function documentation for details on how to set up epidatr to use your API key.
  • Python: The epidatpy client will automatically look for this key in the environment variable DELPHI_EPIDATA_KEY. We recommend storing your key in a .env file, using python-dotenv to load it into your environment, and adding .env to your .gitignore file.
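As a minimal sketch of the Python setup described above, the snippet below checks whether DELPHI_EPIDATA_KEY is visible to the current process before any request is made. The helper name describe_key_status is hypothetical and not part of epidatpy. Loading a .env file with python-dotenv is only mentioned in a comment, since that package may not be installed.

```python
import os

ENV_VAR = "DELPHI_EPIDATA_KEY"  # variable name given in the text above


def describe_key_status(environ=None):
    """Report whether an API key is set, without printing the key itself."""
    # Hypothetical helper for illustration; not part of epidatpy.
    # If the key lives in a .env file, load it first (e.g. with python-dotenv).
    env = os.environ if environ is None else environ
    key = env.get(ENV_VAR, "").strip()
    if not key:
        return "no key found: requests will be anonymous and rate limited"
    return f"key found ({len(key)} characters)"


print(describe_key_status())
```

Reporting only the key length keeps the secret out of logs and notebooks.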

Note: Private endpoints (those prefixed with pvt_) require a separate key, which must be passed as an argument. Access to these endpoints requires a specific data use agreement.

Basic Usage

Fetching data from the Delphi Epidata API is simple. Suppose we are interested in the covidcast endpoint, which provides access to a wide range of data on COVID-19. Reviewing the endpoint documentation, we see that we need to specify a data source name, a signal name, a geographic level, a time resolution, and the location and times of interest.

The pub_covidcast() function lets us access the covidcast endpoint:

library(epidatr)
library(dplyr)

# Obtain the most up-to-date version of the smoothed covid-like illness (CLI)
# signal from the COVID-19 Trends and Impact survey for the US
epidata <- pub_covidcast(
  source = "fb-survey",
  signals = "smoothed_cli",
  geo_type = "nation",
  time_type = "day",
  geo_values = "us",
  time_values = epirange(20210105, 20210410)
)
head(epidata)

from epidatpy import CovidcastEpidata, EpiDataContext, EpiRange

# Initialize client (caching enabled for 7 days)
epidata = EpiDataContext(use_cache=True, cache_max_age_days=7)

# Obtain the most up-to-date version of the confirmed cumulative cases
# from JHU CSSE for the US
apicall = epidata.pub_covidcast(
    data_source="jhu-csse",
    signals="confirmed_cumulative_num",
    geo_type="nation",
    time_type="day",
    geo_values="us",
    time_values=EpiRange(20210405, 20210410),
)
print(apicall.df().head())

pub_covidcast() returns a tibble in R and an EpiDataCall (convertible to a pandas DataFrame via .df()) in Python. Each row represents one observation for one location on one day. The location abbreviation is given in the geo_value column and the date in the time_value column. The value column holds the requested signal: the smoothed estimate of the percentage of people with COVID-like illness in the R example, or the cumulative number of confirmed cases in the Python example. The stderr column holds its standard error, where one is available.
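To make the roles of value and stderr concrete, here is a small offline sketch that turns the two columns into an approximate 95% interval. The rows are made-up toy numbers, not API output. The normal approximation (value plus or minus 1.96 standard errors) is an assumption made for illustration, not something the API documentation prescribes.

```python
# Toy rows shaped like the columns described above; the numbers are made up.
rows = [
    {"geo_value": "us", "time_value": "2021-01-05", "value": 1.20, "stderr": 0.02},
    {"geo_value": "us", "time_value": "2021-01-06", "value": 1.18, "stderr": 0.02},
]


def normal_interval(value, stderr, z=1.96):
    """Approximate 95% interval, assuming a normal sampling distribution."""
    return (round(value - z * stderr, 4), round(value + z * stderr, 4))


for row in rows:
    low, high = normal_interval(row["value"], row["stderr"])
    print(row["geo_value"], row["time_value"], row["value"], (low, high))
```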

The Epidata API makes signals available at different geographic levels, depending on the endpoint. To request signals for all states instead of the entire US, we use the geo_type argument paired with * for the geo_values argument. (Only some endpoints allow for the use of * to access data at all locations. Check the help for a given endpoint to see if it supports *.)

# Obtain data for all states
pub_covidcast(
  source = "fb-survey",
  signals = "smoothed_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "*",
  time_values = epirange(20210105, 20210410)
)

# Obtain data for all states
epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="state",
    time_type="day",
    geo_values="*",
    time_values=EpiRange(20210405, 20210410),
).df()

Alternatively, we can fetch the full time series for a subset of states by listing the desired locations in the geo_values argument and using * in the time_values argument:

# Obtain data for PA, CA, and FL
pub_covidcast(
  source = "fb-survey",
  signals = "smoothed_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = c("pa", "ca", "fl"),
  time_values = "*"
)

# Obtain data for PA, CA, and FL
epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="state",
    time_type="day",
    geo_values="pa,ca,fl",
    time_values="*",
).df()
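The Python example above passes locations as a single comma-separated string. If your locations start out as a list, a small helper can build that string. This is a sketch only: to_geo_values is a hypothetical name, not an epidatpy function.

```python
def to_geo_values(locations):
    """Join location codes into the comma-separated form used above."""
    # Hypothetical helper: lowercases codes and drops blanks and duplicates
    # while preserving the original order.
    seen = []
    for loc in locations:
        code = loc.strip().lower()
        if code and code not in seen:
            seen.append(code)
    return ",".join(seen)


print(to_geo_values(["PA", "ca", " FL ", "pa"]))  # pa,ca,fl
```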

Getting versioned data

The Epidata API stores a historical record of all data, including corrections and updates, which is particularly useful for accurately backtesting forecasting models. To retrieve versioned data, we can use the as_of argument, which fetches the data as it was known on a specific date.

# Obtain the signal as it was on 2021-06-01
pub_covidcast(
  source = "fb-survey",
  signals = "smoothed_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "pa",
  time_values = epirange(20210105, 20210410),
  as_of = "2021-06-01"
)

# Obtain the signal as it was on 2021-06-01
epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="state",
    time_type="day",
    geo_values="pa",
    time_values=EpiRange(20210405, 20210410),
    as_of="2021-06-01",
).df()
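The idea behind as_of can be illustrated offline: for each date, keep the most recent version issued on or before the as_of date. The sketch below uses a made-up version history and a hypothetical snapshot function. It mirrors the concept only and is not the server's implementation.

```python
from datetime import date

# Toy version history for one location: (time_value, issue, value).
# The numbers are made up for illustration.
versions = [
    (date(2021, 4, 5), date(2021, 4, 6), 0.90),
    (date(2021, 4, 5), date(2021, 4, 20), 0.95),
    (date(2021, 4, 5), date(2021, 6, 15), 0.97),
    (date(2021, 4, 6), date(2021, 4, 7), 1.10),
]


def snapshot(records, as_of):
    """Return {time_value: value} using the latest issue on or before as_of."""
    latest = {}
    for time_value, issue, value in records:
        if issue > as_of:
            continue  # this version was not yet published on the as_of date
        if time_value not in latest or issue > latest[time_value][0]:
            latest[time_value] = (issue, value)
    return {t: v for t, (_, v) in latest.items()}


print(snapshot(versions, as_of=date(2021, 6, 1)))
```

Here the 2021-06-15 revision is ignored because it was issued after the as_of date, so the snapshot reports 0.95 for 2021-04-05.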

We can also request all versions of the data issued within a specific time period using the issues argument.

# See how the estimate for a SINGLE day (March 1, 2021) evolved
# by fetching all issues reported between March and April 2021.
pub_covidcast(
  source = "fb-survey",
  signals = "smoothed_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "pa",
  time_values = "2021-03-01",
  issues = epirange("2021-03-01", "2021-04-30")
)

# See how the estimate for a SINGLE day (March 1, 2021) evolved
# by fetching all issues reported between March and April 2021.
epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="state",
    time_type="day",
    geo_values="pa",
    time_values="2021-03-01",
    issues=EpiRange("2021-03-01", "2021-04-30"),
).df()
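Once several issues of the same day are in hand, a natural next step is to look at how much each revision changed the estimate. The following sketch does this on made-up data; revisions_between is a hypothetical helper, and the issue dates and values are invented for illustration.

```python
from datetime import date

# Toy issues for a single time_value (2021-03-01): (issue date, value).
# The numbers are made up for illustration.
issues = [
    (date(2021, 3, 5), 0.86),
    (date(2021, 3, 2), 0.80),
    (date(2021, 5, 3), 0.88),
]


def revisions_between(records, start, end):
    """List (issue, value, change since previous issue) for issues in [start, end]."""
    selected = sorted(r for r in records if start <= r[0] <= end)
    out = []
    previous = None
    for issue, value in selected:
        change = None if previous is None else round(value - previous, 4)
        out.append((issue, value, change))
        previous = value
    return out


for row in revisions_between(issues, date(2021, 3, 1), date(2021, 4, 30)):
    print(row)
```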

Finally, we can use the lag argument to request only data that was reported a certain number of days after the event.

# Fetch survey data for January 2021, but ONLY include data
# that was issued exactly 2 days after it was collected.
pub_covidcast(
  source = "fb-survey",
  signals = "smoothed_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "pa",
  time_values = epirange(20210101, 20210131),
  lag = 2
)

# Fetch survey data for January 2021, but ONLY include data
# that was issued exactly 2 days after it was collected.
epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="state",
    time_type="day",
    geo_values="pa",
    time_values=EpiRange(20210101, 20210131),
    lag=2,
).df()
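The lag of a record is the number of days between the date it describes (time_value) and the date it was issued. As a rough offline illustration, the sketch below filters a made-up set of records down to those with a lag of exactly two days; with_lag is a hypothetical helper, not a client function.

```python
from datetime import date

# Toy records: (time_value, issue, value). The numbers are made up.
records = [
    (date(2021, 1, 1), date(2021, 1, 2), 1.00),
    (date(2021, 1, 1), date(2021, 1, 3), 1.04),
    (date(2021, 1, 2), date(2021, 1, 4), 1.10),
    (date(2021, 1, 2), date(2021, 1, 9), 1.12),
]


def with_lag(rows, lag_days):
    """Keep rows issued exactly lag_days after their time_value."""
    return [r for r in rows if (r[1] - r[0]).days == lag_days]


for row in with_lag(records, 2):
    print(row)
```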

In R, see vignette("versioned-data") for details and more ways to specify versioned data.

Plotting

Because the output data is in a standard tibble (R) or DataFrame (Python) format, we can easily plot it using standard libraries like ggplot2 or matplotlib.

library(ggplot2)
ggplot(epidata, aes(x = time_value, y = value)) +
  geom_line() +
  labs(
    title = "Smoothed CLI from Facebook Survey",
    subtitle = "US, 2021",
    x = "Date",
    y = "CLI"
  )

import matplotlib.pyplot as plt

# Fetch data for PA, CA, FL
apicall = epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="state",
    geo_values="pa,ca,fl",
    time_type="day",
    time_values=EpiRange(20210405, 20210410),
)

# Plot
fig, ax = plt.subplots(figsize=(6, 5))
(
    apicall.df()
    .pivot_table(values="value", index="time_value", columns="geo_value")
    .plot(xlabel="Date", ylabel="CLI", ax=ax, linewidth=1.5)
)
plt.title("Smoothed CLI from Facebook Survey")
plt.show()
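The pivot_table step in the Python example reshapes the long result (one row per location and day) into a wide table with one column per location, which is what lets each state be drawn as its own line. The sketch below shows the same reshaping on made-up rows using only the standard library. Unlike pandas pivot_table, which averages duplicate entries by default, this simplified version keeps the last value seen.

```python
from collections import defaultdict

# Toy long-format rows: (time_value, geo_value, value). The numbers are made up.
long_rows = [
    ("2021-04-05", "pa", 0.5),
    ("2021-04-05", "ca", 0.4),
    ("2021-04-06", "pa", 0.6),
    ("2021-04-06", "ca", 0.3),
]


def pivot(rows):
    """Reshape long rows into {time_value: {geo_value: value}}."""
    wide = defaultdict(dict)
    for time_value, geo_value, value in rows:
        wide[time_value][geo_value] = value  # last value wins on duplicates
    return dict(wide)


for day, by_state in pivot(long_rows).items():
    print(day, by_state)
```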

Finding locations of interest

Most data is only available for the US. Select endpoints report other countries at the national and/or regional levels. Endpoint descriptions explicitly state when they cover non-US locations.

For endpoints that report US data, consult the geographic coding documentation for COVID-19 and for other diseases to see available geographic levels.

International data

International data is available via the following endpoints:

  • pub_dengue_nowcast (North and South America)
  • pub_ecdc_ili (Europe)
  • pub_kcdc_ili (Korea)
  • pub_nidss_dengue (Taiwan)
  • pub_nidss_flu (Taiwan)
  • pub_paho_dengue (North and South America)
  • pvt_dengue_sensors (North and South America)

Finding data sources and signals of interest

Above we used data from Delphi’s symptom surveys, but the Epidata API includes numerous data streams: medical claims data, cases and deaths, mobility, and many others. This can make it a challenge to find the data stream that you are most interested in.

The Epidata documentation lists all the data sources and signals available through the API for COVID-19 and for other diseases.

You can also use the client libraries to discover endpoints interactively:

# Get a table of endpoint functions
avail_endpoints()

# List sources available in the pub_covidcast endpoint
covidcast = CovidcastEpidata(use_cache=True)
print(covidcast.source_names())

# List signals available for a specific source (e.g., jhu-csse)
print(covidcast.signal_names("jhu-csse"))

Legacy Clients

Legacy clients are also available for Python, R, and JavaScript, but their use is discouraged.

The following samples show how to import the library and fetch Delphi’s COVID-19 Surveillance Streams from Facebook Survey CLI for county 06001 and days 20200401 and 20200405-20200414 (11 days total).

Install delphi-epidata from PyPI with pip install delphi-epidata.

from delphi_epidata import Epidata
# Configure API key, if needed.
#Epidata.auth = ('epidata', <your API key>)
res = Epidata.covidcast('fb-survey', 'smoothed_cli', 'day', 'county', [20200401, Epidata.range(20200405, 20200414)], '06001')
print(res['result'], res['message'], len(res['epidata']))

# Configure API key, if needed.
#options(epidata.auth = <your API key>)
source('delphi_epidata.R')
res <- Epidata$covidcast('fb-survey', 'smoothed_cli', 'day', 'county', list(20200401, Epidata$range(20200405, 20200414)), '06001')
cat(paste(res$result, res$message, length(res$epidata), "\n"))

The minimalist JavaScript client does not currently support API keys. If you need API key support in JavaScript, contact delphi-support+privacy@andrew.cmu.edu.

<script src="delphi_epidata.js"></script>
<script>
EpidataAsync.covidcast(
  "fb-survey",
  "smoothed_cli",
  "day",
  "county",
  [20200401, EpidataAsync.range(20200405, 20200414)],
  "06001"
).then((res) => {
  console.log(
    res.result,
    res.message,
    res.epidata != null ? res.epidata.length : 0
  );
});
</script>
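The date specification used in the samples above combines a single day (20200401) with an inclusive range (20200405 to 20200414). As a quick offline check of the "11 days total" figure, the sketch below expands such a specification into individual dates. The expand helper is hypothetical, and plain tuples stand in for the clients' range objects.

```python
from datetime import datetime, timedelta


def parse_yyyymmdd(n):
    """Convert an integer such as 20200401 into a date."""
    return datetime.strptime(str(n), "%Y%m%d").date()


def expand(spec):
    """Expand single days and (start, end) tuples into a list of dates."""
    days = []
    for item in spec:
        if isinstance(item, tuple):  # a tuple stands in for a client range
            start, end = (parse_yyyymmdd(x) for x in item)
            span = (end - start).days
            days.extend(start + timedelta(days=i) for i in range(span + 1))
        else:
            days.append(parse_yyyymmdd(item))
    return days


requested = expand([20200401, (20200405, 20200414)])
print(len(requested))  # 11
```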