
Exploring JSON in DuckDB with the openFDA Animal & Veterinary Adverse Events API

Updated
11 min read

A practical, end‑to‑end walkthrough using the DuckDB CLI with JSON

When working with modern data, JSON is everywhere — APIs, logs, manifests, telemetry, and data feeds. But working with JSON at scale is often painful: nested structures, arrays inside arrays, inconsistent schemas, and compressed files delivered over HTTP.

DuckDB changes that dynamic completely.

In this article, we’ll walk through a real‑world example using the openFDA Animal & Veterinary Adverse Events dataset, a public API that exposes:

  • a self‑describing JSON manifest

  • downloadable ZIP‑wrapped JSON partitions

  • deeply nested event structures

Using only the DuckDB CLI, we will:

  • Read JSON directly from URLs

  • Parse nested JSON structures

  • Extract ZIP‑compressed JSON over HTTPS

  • Dynamically generate SQL to build staging tables

  • Flatten complex arrays with UNNEST

  • Materialize clean relational tables:

    • events

    • event_reactions

    • event_drugs

  • Prepare the dataset for analytics or downstream modeling

  • Export the results to CSV and Parquet

This is the kind of workflow that would normally require a full ELT pipeline or a custom process — but DuckDB handles it locally, with no servers and no friction.

As a disclaimer, I want to point out that the main objective of this article is to illustrate DuckDB's capabilities.


1. Setup Environment

When working on a DuckDB project, I usually create a folder for the project and use VS Code. I create a commands.txt file to store my SQL commands. In this walkthrough we will be creating SQL files and a DuckDB database.

mkdir duckdb/openfda_demo
cd duckdb/openfda_demo
code .

2. Installing DuckDB

DuckDB has documentation on how to install the DuckDB CLI here. I used winget to install the DuckDB CLI.

winget install DuckDB.cli

To execute DuckDB, I normally open the terminal in VS Code and type my commands there. We will be creating a DuckDB database:

duckdb -ui openfda_animalandveterinary.duckdb

The -ui flag will download the DuckDB UI extension, load it, and open a browser session with a beautiful Jupyter-notebook-styled IDE. The exercises will use the terminal CLI, but the browser session has nice features worth looking into.

If you don't want to create a database file, you can just run this command:

duckdb -ui

That will use an in-memory (:memory:) database, and any tables you create will be lost when you exit the session.

DuckDB’s extension system is one of its superpowers.
For this project, we will be using the httpfs, json, and zipfs extensions. httpfs and json are core extensions included when DuckDB is installed; zipfs needs to be installed from the community repository. The zipfs extension allows DuckDB to read and decompress ZIP files, and httpfs allows DuckDB to read files directly from URLs.

INSTALL zipfs FROM community;
LOAD zipfs;

SET memory_limit = '40GB';

The memory limit is optional but helpful when working with large FDA datasets.
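Optionally, you can confirm that the extensions are installed and loaded by querying DuckDB's built-in duckdb_extensions() table function:

```sql
-- Sanity check: all three extensions should show installed = true, loaded = true
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('httpfs', 'json', 'zipfs');
```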


3. Loading the FDA Manifest

The openFDA API provides a single manifest file describing all datasets and their download locations. We will download the manifest into a table and access the specific nodes we are looking for.

DROP TABLE IF EXISTS manifest_json;

CREATE TABLE manifest_json AS
SELECT *
FROM read_json('https://api.fda.gov/download.json', maximum_object_size = 4294967295);

DuckDB stores JSON data as a STRUCT data type, which makes it easy to work with complex nested data. I highly recommend getting a hold of DuckDB in Action. It contains a treasure trove of information about DuckDB. MotherDuck has a promotion for the e-book if you subscribe to their newsletter.

The manifest_json table contains two columns, meta and results.
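Struct fields can be read with plain dot notation. For example, to pull a couple of fields out of the meta column:

```sql
-- Dot notation walks into struct fields
SELECT meta.last_updated, meta.license
FROM manifest_json;
```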

A quick schema check:

DESCRIBE manifest_json;

Below are the struct definitions of the two columns.

Meta Column

STRUCT(
  disclaimer    VARCHAR,
  terms         VARCHAR,
  license       VARCHAR,
  last_updated  DATE
)

Results Column

STRUCT(
  food STRUCT(
    enforcement STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    "event" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  animalandveterinary STRUCT(
    "event" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  transparency STRUCT(
    crl STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  tobacco STRUCT(
    problem STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  other STRUCT(
    historicaldocument STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    unii STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    nsde STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    substance STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  device STRUCT(
    classification STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    "510k" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    covid19serology STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    registrationlisting STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    enforcement STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    udi STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    "event" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    recall STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    pma STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  cosmetic STRUCT(
    "event" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  ),

  drug STRUCT(
    drugsfda STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    "label" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    enforcement STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    "event" STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    shortages STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    ),
    ndc STRUCT(
      export_date    DATE,
      partitions     STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
      total_records  BIGINT
    )
  )
)

At first this can be overwhelming. Today we are interested in animalandveterinary.event from the results column. Notice that each dataset in the manifest contains an export_date, partitions, and total_records.
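Dot notation reaches directly into the node we care about. For example (len() returns the length of a DuckDB list):

```sql
-- Summarize the node before extracting its partitions
SELECT
    results.animalandveterinary.event.export_date   AS export_date,
    results.animalandveterinary.event.total_records AS total_records,
    len(results.animalandveterinary.event.partitions) AS partition_count
FROM manifest_json;
```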


4. Extracting Dataset Partitions

If you navigate to the manifest link https://api.fda.gov/download.json and scroll down to the animalandveterinary → event node, you will see that this dataset has many partitions. Today we are interested in downloading the data from the partition called 2025 Q3 (all).

The manifest contains a nested structure:

results → animalandveterinary → event → partitions[]

Each partition represents a downloadable ZIP file containing JSON.

We extract them like this:

DROP TABLE if exists stg_partitions;

CREATE temp TABLE stg_partitions AS
SELECT 
results.animalandveterinary.event.export_date,
p.display_name,
p.file,
p.size_mb,
p.records
FROM manifest_json
CROSS JOIN unnest (results.animalandveterinary.event.partitions) as p(p);

Now we can inspect the available partitions:

SELECT * FROM stg_partitions;


5. Dynamically Generating SQL to Create the Staging Table

DuckDB’s CLI has a powerful trick:
You can generate SQL dynamically using .once, .read, and printf. More information can be found here on the Dot Commands.

First, generate a script that creates an empty staging table with the correct schema:

.header off
.mode tabs
.once tmp_create_stg_json.sql

SELECT printf(
    'DROP TABLE IF EXISTS stg_json;
     CREATE TABLE stg_json AS
     SELECT ''%s'' AS display_name, *
     FROM read_json(''zip://%s/*json'', maximum_object_size = 4294967295)
     WITH NO DATA;',
    display_name, file
)
FROM stg_partitions
LIMIT 1;

The above command will create the file tmp_create_stg_json.sql with the following content.

DROP TABLE IF EXISTS stg_json;

CREATE TABLE stg_json AS SELECT '2025 Q3 (all)' AS display_name, * FROM read_json('zip://https://download.open.fda.gov/animalandveterinary/event/2025q3/animalandveterinary-event-0001-of-0001.json.zip/*json', maximum_object_size = 4294967295) WITH NO DATA;

Notice that the file name from the partition ends with .json.zip, and we prepend zip:// and append /*json. This tells DuckDB to pass the file to the zipfs extension and unzip any files that end with json. We need this so DuckDB can read the JSON structure to create the table. The WITH NO DATA clause tells DuckDB to create just the schema, with no data.
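The path rewriting is mechanical enough to capture in a tiny helper. Here is an illustrative Python sketch (the function zipfs_path is my own name, purely for illustration, not part of DuckDB):

```python
def zipfs_path(url: str) -> str:
    # Prepend the zipfs scheme and append a glob so DuckDB reads
    # every JSON entry inside the remote ZIP archive.
    return f"zip://{url}/*json"

print(zipfs_path(
    "https://download.open.fda.gov/animalandveterinary/event/2025q3/"
    "animalandveterinary-event-0001-of-0001.json.zip"
))
```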

Then execute it:

.read tmp_create_stg_json.sql

This creates an empty table stg_json with the exact schema of that FDA JSON.


6. Populating the Staging Table

Now generate a script to load the actual data:

.header off
.mode tabs
.once tmp_populate_stg_json.sql
SELECT printf('TRUNCATE TABLE stg_json;')
UNION ALL
SELECT printf('INSERT INTO stg_json
SELECT ''%s'' display_name, * FROM read_json(''zip://%s/*json'', maximum_object_size = 4294967295);', display_name, file) AS create_statement
FROM stg_partitions
WHERE display_name = '2025 Q3 (all)';

The above command will create the following SQL file, tmp_populate_stg_json.sql. Today we will pull the partition with display_name = '2025 Q3 (all)'.

TRUNCATE TABLE stg_json;

INSERT INTO stg_json

SELECT '2025 Q3 (all)' display_name, * FROM read_json( 'zip://https://download.open.fda.gov/animalandveterinary/event/2025q3/animalandveterinary-event-0001-of-0001.json.zip/*json', maximum_object_size = 4294967295);

Execute it:

.read tmp_populate_stg_json.sql

At this point the stg_json table will contain the following structure, with one record. The column we are interested in is results, which contains the complete JSON file.

describe stg_json;
column_name = display_name
column_type = VARCHAR
       null = YES
        key = NULL
    default = NULL
      extra = NULL

column_name = meta
column_type = STRUCT(disclaimer VARCHAR, terms VARCHAR, license VARCHAR, last_updated DATE, results STRUCT("skip" BIGINT, "limit" BIGINT, total BIGINT))
       null = YES
        key = NULL
    default = NULL
      extra = NULL

column_name = results
column_type = STRUCT(reaction STRUCT(veddra_version VARCHAR, veddra_term_code VARCHAR, veddra_term_name VARCHAR)[], receiver STRUCT(organization VARCHAR, street_address VARCHAR, city VARCHAR, state VARCHAR, postal_code VARCHAR, country VARCHAR), unique_aer_id_number VARCHAR, original_receive_date VARCHAR, number_of_animals_affected VARCHAR, primary_reporter VARCHAR, number_of_animals_treated VARCHAR, drug STRUCT(route VARCHAR, brand_name VARCHAR, dosage_form VARCHAR, manufacturer STRUCT("name" VARCHAR, registration_number VARCHAR), atc_vet_code VARCHAR, active_ingredients STRUCT("name" VARCHAR, dose STRUCT(numerator VARCHAR, numerator_unit VARCHAR, denominator VARCHAR, denominator_unit VARCHAR))[], used_according_to_label VARCHAR, off_label_use VARCHAR, lot_number VARCHAR)[], health_assessment_prior_to_exposure STRUCT(assessed_by VARCHAR), onset_date VARCHAR, report_id VARCHAR, animal STRUCT(species VARCHAR, gender VARCHAR, female_animal_physiological_status VARCHAR, age STRUCT(min VARCHAR, unit VARCHAR, qualifier VARCHAR, max VARCHAR), weight STRUCT(qualifier VARCHAR, min VARCHAR, unit VARCHAR, max VARCHAR), breed STRUCT(is_crossbred VARCHAR, breed_component VARCHAR), reproductive_status VARCHAR), type_of_information VARCHAR, outcome STRUCT(medical_status VARCHAR, number_of_animals_affected VARCHAR)[])[]
       null = YES
        key = NULL
    default = NULL
      extra = NULL

7. Flattening the Nested JSON

The FDA event JSON is deeply nested:

  • results[]

  • reaction[]

  • drug[]

  • drug.active_ingredients[]

  • outcome[]

DuckDB’s UNNEST makes this manageable.

A key trick is aliasing the unnested column:

CROSS JOIN UNNEST(results) AS r(r)

This avoids the default unnest column name.
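Here is a self-contained toy example of the aliasing pattern, independent of the FDA data:

```sql
-- u(val) names both the unnested relation and its column
SELECT t.id, u.val
FROM (SELECT 1 AS id, ['a', 'b', 'c'] AS letters) t
CROSS JOIN UNNEST(t.letters) AS u(val);
```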

Here’s the flattening query:

SET memory_limit = '40GB';

DROP SEQUENCE IF EXISTS seq_eventid;
CREATE SEQUENCE seq_eventid START 1;

DROP TABLE IF EXISTS e;

CREATE TEMP TABLE e AS
SELECT
    nextval('seq_eventid') AS id,
    r.unique_aer_id_number,
    r.original_receive_date,
    r.number_of_animals_affected,
    r.primary_reporter,
    r.number_of_animals_treated,
    r.onset_date,
    r.report_id,
    r.type_of_information,

    -- receiver
    r.receiver,
    r.receiver.organization AS receiver_organization,
    r.receiver.street_address AS receiver_street_address,
    r.receiver.city AS receiver_city,
    r.receiver.state AS receiver_state,
    r.receiver.postal_code AS receiver_postal_code,
    r.receiver.country AS receiver_country,

    -- health assessment
    r.health_assessment_prior_to_exposure.assessed_by,

    -- animal
    r.animal,
    r.animal.species AS animal_species,
    r.animal.gender AS animal_gender,
    r.animal.female_animal_physiological_status AS animal_female_animal_physiological_status,
    r.animal.age.min AS animal_age_min,
    r.animal.age.max AS animal_age_max,
    r.animal.age.unit AS animal_age_unit,
    r.animal.age.qualifier AS animal_age_qualifier,
    r.animal.weight.min AS animal_weight_min,
    r.animal.weight.max AS animal_weight_max,
    r.animal.weight.unit AS animal_weight_unit,
    r.animal.weight.qualifier AS animal_weight_qualifier,
    r.animal.breed.is_crossbred AS animal_breed_is_crossbred,
    r.animal.breed.breed_component AS animal_breed_component,
    r.animal.reproductive_status AS animal_reproductive_status,

    -- nested arrays preserved for later normalization
    r.reaction,
    r.drug,
    r.outcome

FROM stg_json
CROSS JOIN UNNEST(results) AS r(r)
WHERE stg_json.display_name = '2025 Q3 (all)';

There are a large number of columns, so I will use the following command to show the record in line mode. The temp table e contains all the values we will need to materialize the relational tables.

.mode line
select * from e limit 1;


8. Materializing the Events, Reactions and Drugs Tables

Now that the event‑level table is flattened, we can normalize the nested arrays.

8.1 Events

DROP TABLE if exists events;

CREATE TABLE events AS 
SELECT 
  id
  ,unique_aer_id_number
  ,original_receive_date
  ,number_of_animals_affected
  ,primary_reporter
  ,number_of_animals_treated
  ,onset_date
  ,report_id
  ,type_of_information
  ,receiver_organization
  ,receiver_street_address
  ,receiver_city
  ,receiver_state
  ,receiver_postal_code
  ,receiver_country
  ,assessed_by
  ,animal_species
  ,animal_gender
  ,animal_female_animal_physiological_status
  ,animal_age_min
  ,animal_age_max
  ,animal_age_unit
  ,animal_age_qualifier
  ,animal_weight_min
  ,animal_weight_max
  ,animal_weight_unit
  ,animal_weight_qualifier
  ,animal_breed_is_crossbred
  ,animal_breed_component
  ,animal_reproductive_status
FROM e;

8.2 Event Reactions

DROP SEQUENCE if exists seq_event_reaction_id;
CREATE SEQUENCE seq_event_reaction_id START 1;

DROP TABLE if exists event_reactions;
CREATE TABLE event_reactions AS
SELECT 
nextval('seq_event_reaction_id') as id,
e.id as event_id,
e.unique_aer_id_number,
r.veddra_version,
r.veddra_term_code,
r.veddra_term_name
FROM e
CROSS JOIN unnest(e.reaction) as r(r);


8.3 Event Drug Details

DROP SEQUENCE if exists seq_event_drug_id;
CREATE SEQUENCE seq_event_drug_id START 1;

DROP TABLE if exists event_drugs;
CREATE TABLE event_drugs as
SELECT 
nextval('seq_event_drug_id') as id,
e.id as event_id,
e.unique_aer_id_number,
d.route,
d.brand_name,
d.dosage_form,
d.manufacturer.name as manufacturer_name,
d.manufacturer.registration_number as manufacturer_registration_number,
d.atc_vet_code,
d.used_according_to_label,
d.off_label_use,
d.lot_number,
d.active_ingredients -- kept as a nested array for further normalization
FROM e
CROSS JOIN unnest(e.drug) as d(d);
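With the normalized tables in place, typical analytic queries become one-liners. For example, counting the most frequently reported reactions from the event_reactions table built above:

```sql
-- Top 10 reactions by number of reports
SELECT veddra_term_name, count(*) AS reaction_count
FROM event_reactions
GROUP BY veddra_term_name
ORDER BY reaction_count DESC
LIMIT 10;
```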


9. Exporting our Relational Tables

DuckDB makes it easy to export data. More information can be found here.

COPY (SELECT * FROM events) TO 'events_exported.csv';
COPY (SELECT * FROM events) TO 'events_exported.parquet';
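DuckDB infers the output format from the file extension, but COPY also accepts explicit options, for example:

```sql
-- Explicit format, header, and compression options
COPY events TO 'events_exported.csv' (FORMAT csv, HEADER, DELIMITER ',');
COPY events TO 'events_exported.parquet' (FORMAT parquet, COMPRESSION zstd);
```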

10. Final Thoughts

This walkthrough demonstrates just how far DuckDB can go with complex JSON:

  • Read JSON directly from URLs

  • Parse nested structures

  • Extract ZIP‑wrapped JSON over HTTPS

  • Flatten arrays with UNNEST

  • Build staging and normalized tables

  • Automate SQL generation in the CLI

All of this runs locally, with no servers, clusters, or cloud infrastructure.

DuckDB continues to redefine what’s possible in local analytics — and JSON processing is one of its superpowers.