Exploring JSON in DuckDB with the openFDA Animal & Veterinary Adverse Events API

A practical, end‑to‑end walkthrough using the DuckDB CLI with JSON
JSON is everywhere in modern data work: APIs, logs, manifests, telemetry, and data feeds. But working with JSON at scale is often painful: nested structures, arrays inside arrays, inconsistent schemas, and compressed files delivered over HTTP.
DuckDB changes that dynamic completely.
In this article, we’ll walk through a real‑world example using the openFDA Animal & Veterinary Adverse Events dataset, a public API that exposes:
a self‑describing JSON manifest
downloadable ZIP‑wrapped JSON partitions
deeply nested event structures
Using only the DuckDB CLI, we will:
Read JSON directly from URLs
Parse nested JSON structures
Extract ZIP‑compressed JSON over HTTPS
Dynamically generate SQL to build staging tables
Flatten complex arrays with UNNEST
Materialize clean relational tables: events, event_reactions, event_drugs
Prepare the dataset for analytics or downstream modeling
Export the results to CSV and Parquet
This is the kind of workflow that would normally require a full ELT pipeline or a custom process, but DuckDB handles it locally, with no servers and no friction.
As a disclaimer, I want to point out that the main objective of this article is to illustrate DuckDB's capabilities.
1. Setup Environment
When working on a DuckDB project, I usually create a folder for it and open it in VS Code, keeping a commands.txt file to store my SQL commands. In this walkthrough we will create SQL files and a DuckDB database.
mkdir duckdb/openfda_demo
cd duckdb/openfda_demo
code .
1.1 Installing DuckDB
DuckDB's documentation covers how to install the DuckDB CLI here. I used winget to install it.
winget install DuckDB.cli
To execute DuckDB, I normally open the terminal in VS Code and type my commands there. We will create a DuckDB database file:
duckdb -ui openfda_animalandveterinary.duckdb
The -ui flag downloads the duckdb ui extension, loads it, and opens a browser session with a beautiful Jupyter-notebook-style IDE. The exercises here use the terminal CLI, but the browser session has nice features worth exploring.
If you prefer not to create a database file, you can just run this command:
duckdb -ui
That uses an in-memory (:memory:) database, and any tables you create will be lost when you exit the session.
DuckDB’s extension system is one of its superpowers.
For this project, we will use the httpfs, json, and zipfs extensions. httpfs and json are core extensions included with DuckDB. zipfs needs to be installed from the community repository; it allows DuckDB to read and decompress ZIP files, while httpfs lets DuckDB read directly from URLs.
INSTALL zipfs FROM community;
LOAD zipfs;
SET memory_limit = '40GB';
The memory limit is optional but helpful when working with large FDA datasets.
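After installing and loading, you can verify that everything is in place by querying DuckDB's built-in duckdb_extensions() table function:

```sql
-- Check that the extensions this walkthrough relies on are installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('httpfs', 'json', 'zipfs');
```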
2. Loading the FDA Manifest
The openFDA API provides a single manifest file describing all datasets and their download locations. We will download the manifest into a table and access the specific nodes we are looking for.
DROP TABLE IF EXISTS manifest_json;
CREATE TABLE manifest_json AS
SELECT *
FROM read_json('https://api.fda.gov/download.json', maximum_object_size = 4294967295);
DuckDB infers nested JSON objects as the STRUCT data type, which makes it easy to work with complex nested data. I highly recommend getting a copy of DuckDB in Action. It contains a treasure trove of information about DuckDB. MotherDuck has a promotion for the e-book if you subscribe to their newsletter.
The manifest_json table contains two columns, meta and results.
A quick schema check:
DESCRIBE manifest_json;
Below are the struct definitions of the two columns.
Meta Column
STRUCT(
disclaimer VARCHAR,
terms VARCHAR,
license VARCHAR,
last_updated DATE
)
Results Column
STRUCT(
food STRUCT(
enforcement STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
"event" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
animalandveterinary STRUCT(
"event" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
transparency STRUCT(
crl STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
tobacco STRUCT(
problem STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
other STRUCT(
historicaldocument STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
unii STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
nsde STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
substance STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
device STRUCT(
classification STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
"510k" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
covid19serology STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
registrationlisting STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
enforcement STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
udi STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
"event" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
recall STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
pma STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
cosmetic STRUCT(
"event" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
),
drug STRUCT(
drugsfda STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
"label" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
enforcement STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
"event" STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
shortages STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
),
ndc STRUCT(
export_date DATE,
partitions STRUCT(display_name VARCHAR, file VARCHAR, size_mb VARCHAR, records BIGINT)[],
total_records BIGINT
)
)
)
At first this can be overwhelming; today we are interested in animalandveterinary.event from the results column. Notice that each dataset in the manifest contains an export_date, partitions, and total_records.
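Struct fields can be addressed with plain dot notation, so pulling the summary fields for our dataset is a one-liner:

```sql
-- Summary fields for the animalandveterinary → event dataset
SELECT
    results.animalandveterinary.event.export_date   AS export_date,
    results.animalandveterinary.event.total_records AS total_records
FROM manifest_json;
```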
3. Extracting Dataset Partitions
If you navigate to the manifest at https://api.fda.gov/download.json and scroll down to the animalandveterinary → event node, you will see many partitions for this dataset. Today we are interested in downloading data from a partition called 2025 Q3 (all).
The manifest contains a nested structure:
results → animalandveterinary → event → partitions[]
Each partition represents a downloadable ZIP file containing JSON.
We extract them like this:
DROP TABLE if exists stg_partitions;
CREATE temp TABLE stg_partitions AS
SELECT
results.animalandveterinary.event.export_date,
p.display_name,
p.file,
p.size_mb,
p.records
FROM manifest_json
CROSS JOIN unnest (results.animalandveterinary.event.partitions) as p(p);
Now we can inspect the available partitions:
SELECT * FROM stg_partitions;
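As a quick sanity check, the partition counts should line up with the total_records reported in the manifest:

```sql
-- Partition count and summed record counts across all partitions
SELECT count(*) AS partition_count, sum(records) AS total_records
FROM stg_partitions;
```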

4. Dynamically Generating SQL to Create the Staging Table
DuckDB’s CLI has a powerful trick:
You can generate SQL dynamically using .once, .read, and printf. More information can be found here in the Dot Commands documentation.
First, generate a script that creates an empty staging table with the correct schema:
.header off
.mode tabs
.once tmp_create_stg_json.sql
SELECT printf(
'DROP TABLE IF EXISTS stg_json;
CREATE TABLE stg_json AS
SELECT ''%s'' AS display_name, *
FROM read_json(''zip://%s/*json'', maximum_object_size = 4294967295)
WITH NO DATA;',
display_name, file
)
FROM stg_partitions
LIMIT 1;
The above command creates the file tmp_create_stg_json.sql with the following content.
DROP TABLE IF EXISTS stg_json;
CREATE TABLE stg_json AS
SELECT '2025 Q3 (all)' AS display_name, *
FROM read_json('zip://https://download.open.fda.gov/animalandveterinary/event/2025q3/animalandveterinary-event-0001-of-0001.json.zip/*json', maximum_object_size = 4294967295)
WITH NO DATA;
Notice the file name from the partition ends with .json.zip, and we prepend zip:// and append /*json. This tells DuckDB to pass the file to the zipfs extension and read any archived files ending in json. We need this so DuckDB can infer the JSON structure to create the table. The WITH NO DATA clause tells DuckDB to create just the schema, with no data.
Then execute it:
.read tmp_create_stg_json.sql
This creates an empty table stg_json with the exact schema of that FDA JSON.
5. Populating the Staging Table
Now generate a script to load the actual data:
.header off
.mode tabs
.once tmp_populate_stg_json.sql
SELECT printf('TRUNCATE TABLE stg_json;')
UNION ALL
SELECT printf('INSERT INTO stg_json
SELECT ''%s'' display_name, * FROM read_json(''zip://%s/*json'', maximum_object_size = 4294967295);', display_name, file) AS create_statement
FROM stg_partitions
WHERE display_name = '2025 Q3 (all)';
The above command creates the SQL file tmp_populate_stg_json.sql. Today we will pull the partition with display_name = '2025 Q3 (all)'.
TRUNCATE TABLE stg_json;
INSERT INTO stg_json
SELECT '2025 Q3 (all)' display_name, * FROM read_json( 'zip://https://download.open.fda.gov/animalandveterinary/event/2025q3/animalandveterinary-event-0001-of-0001.json.zip/*json', maximum_object_size = 4294967295);
Execute it:
.read tmp_populate_stg_json.sql
At this point stg_json contains the following structure with one record. The column we are interested in is results, which contains the complete JSON file.
describe stg_json;
column_name = display_name
column_type = VARCHAR
null = YES
key = NULL
default = NULL
extra = NULL
column_name = meta
column_type = STRUCT(disclaimer VARCHAR, terms VARCHAR, license VARCHAR, last_updated DATE, results STRUCT("skip" BIGINT, "limit" BIGINT, total BIGINT))
null = YES
key = NULL
default = NULL
extra = NULL
column_name = results
column_type = STRUCT(reaction STRUCT(veddra_version VARCHAR, veddra_term_code VARCHAR, veddra_term_name VARCHAR)[], receiver STRUCT(organization VARCHAR, street_address VARCHAR, city VARCHAR, state VARCHAR, postal_code VARCHAR, country VARCHAR), unique_aer_id_number VARCHAR, original_receive_date VARCHAR, number_of_animals_affected VARCHAR, primary_reporter VARCHAR, number_of_animals_treated VARCHAR, drug STRUCT(route VARCHAR, brand_name VARCHAR, dosage_form VARCHAR, manufacturer STRUCT("name" VARCHAR, registration_number VARCHAR), atc_vet_code VARCHAR, active_ingredients STRUCT("name" VARCHAR, dose STRUCT(numerator VARCHAR, numerator_unit VARCHAR, denominator VARCHAR, denominator_unit VARCHAR))[], used_according_to_label VARCHAR, off_label_use VARCHAR, lot_number VARCHAR)[], health_assessment_prior_to_exposure STRUCT(assessed_by VARCHAR), onset_date VARCHAR, report_id VARCHAR, animal STRUCT(species VARCHAR, gender VARCHAR, female_animal_physiological_status VARCHAR, age STRUCT(min VARCHAR, unit VARCHAR, qualifier VARCHAR, max VARCHAR), weight STRUCT(qualifier VARCHAR, min VARCHAR, unit VARCHAR, max VARCHAR), breed STRUCT(is_crossbred VARCHAR, breed_component VARCHAR), reproductive_status VARCHAR), type_of_information VARCHAR, outcome STRUCT(medical_status VARCHAR, number_of_animals_affected VARCHAR)[])[]
null = YES
key = NULL
default = NULL
extra = NULL
6. Flattening the Nested JSON
The FDA event JSON is deeply nested:
results[]
reaction[]
drug[]
drug.active_ingredients[]
outcome[]
DuckDB’s UNNEST makes this manageable.
A key trick is aliasing the unnested column:
CROSS JOIN UNNEST(results) AS r(r)
This avoids the default unnest column name.
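The aliasing behavior is easy to see on a toy list of structs; a minimal, self-contained sketch:

```sql
-- Without AS r(r), the unnested struct column is named "unnest";
-- with the alias, fields can be referenced as r.x
SELECT r.x
FROM (SELECT [{'x': 1}, {'x': 2}] AS arr) t
CROSS JOIN UNNEST(arr) AS r(r);
```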
Here’s the flattening query:
SET memory_limit = '40GB';
DROP SEQUENCE IF EXISTS seq_eventid;
CREATE SEQUENCE seq_eventid START 1;
DROP TABLE IF EXISTS e;
CREATE TEMP TABLE e AS
SELECT
nextval('seq_eventid') AS id,
r.unique_aer_id_number,
r.original_receive_date,
r.number_of_animals_affected,
r.primary_reporter,
r.number_of_animals_treated,
r.onset_date,
r.report_id,
r.type_of_information,
-- receiver
r.receiver,
r.receiver.organization AS receiver_organization,
r.receiver.street_address AS receiver_street_address,
r.receiver.city AS receiver_city,
r.receiver.state AS receiver_state,
r.receiver.postal_code AS receiver_postal_code,
r.receiver.country AS receiver_country,
-- health assessment
r.health_assessment_prior_to_exposure.assessed_by,
-- animal
r.animal,
r.animal.species AS animal_species,
r.animal.gender AS animal_gender,
r.animal.female_animal_physiological_status AS animal_female_animal_physiological_status,
r.animal.age.min AS animal_age_min,
r.animal.age.max AS animal_age_max,
r.animal.age.unit AS animal_age_unit,
r.animal.age.qualifier AS animal_age_qualifier,
r.animal.weight.min AS animal_weight_min,
r.animal.weight.max AS animal_weight_max,
r.animal.weight.unit AS animal_weight_unit,
r.animal.weight.qualifier AS animal_weight_qualifier,
r.animal.breed.is_crossbred AS animal_breed_is_crossbred,
r.animal.breed.breed_component AS animal_breed_component,
r.animal.reproductive_status AS animal_reproductive_status,
-- nested arrays preserved for later normalization
r.reaction,
r.drug,
r.outcome
FROM stg_json
CROSS JOIN UNNEST(results) AS r(r)
WHERE stg_json.display_name = '2025 Q3 (all)';
There are a large number of columns, so I will use the following command to show the record in line mode. The temp table e contains all the values we will need to materialize the relational tables.
.mode line
select * from e limit 1;

7. Materializing the Events, Reactions and Drugs Tables
Now that the event‑level table is flattened, we can normalize the nested arrays.
7.1 Events
DROP TABLE if exists events;
CREATE TABLE events AS
SELECT
id
,unique_aer_id_number
,original_receive_date
,number_of_animals_affected
,primary_reporter
,number_of_animals_treated
,onset_date
,report_id
,type_of_information
,receiver_organization
,receiver_street_address
,receiver_city
,receiver_state
,receiver_postal_code
,receiver_country
,assessed_by
,animal_species
,animal_gender
,animal_female_animal_physiological_status
,animal_age_min
,animal_age_max
,animal_age_unit
,animal_age_qualifier
,animal_weight_min
,animal_weight_max
,animal_weight_unit
,animal_weight_qualifier
,animal_breed_is_crossbred
,animal_breed_component
,animal_reproductive_status
FROM e;
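With events materialized, ordinary analytics become straightforward. For example, a simple breakdown by species (the actual values depend on the partition pulled):

```sql
-- Top species by number of adverse event reports
SELECT animal_species, count(*) AS report_count
FROM events
GROUP BY animal_species
ORDER BY report_count DESC
LIMIT 10;
```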

7.2 Event Reactions
DROP SEQUENCE if exists seq_event_reaction_id;
CREATE SEQUENCE seq_event_reaction_id START 1;
DROP TABLE if exists event_reactions;
CREATE TABLE event_reactions AS
SELECT
nextval('seq_event_reaction_id') as id,
e.id as event_id,
e.unique_aer_id_number,
r.veddra_version,
r.veddra_term_code,
r.veddra_term_name
FROM e
CROSS JOIN unnest(e.reaction) as r(r);
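The reactions table joins back to events on event_id, and also supports direct aggregation; for instance, the most frequently reported VeDDRA terms:

```sql
-- Most common reaction terms across all events
SELECT veddra_term_name, count(*) AS occurrences
FROM event_reactions
GROUP BY veddra_term_name
ORDER BY occurrences DESC
LIMIT 10;
```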

7.3 Event Drug Details
DROP SEQUENCE if exists seq_event_drug_id;
CREATE SEQUENCE seq_event_drug_id START 1;
DROP TABLE if exists event_drugs;
CREATE TABLE event_drugs as
SELECT
nextval('seq_event_drug_id') as id,
e.id as event_id,
e.unique_aer_id_number,
d.route,
d.brand_name,
d.dosage_form,
d.manufacturer.name as manufacturer_name,
d.manufacturer.registration_number as manufacturer_registration_number,
d.atc_vet_code,
d.used_according_to_label,
d.off_label_use,
d.lot_number,
d.active_ingredients
FROM e
CROSS JOIN unnest(e.drug) as d(d);
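Each drug also carries a nested active_ingredients array. If you need ingredient-level rows, the same UNNEST pattern applies one level deeper; a sketch working directly from the temp table e:

```sql
-- One row per event / drug / active ingredient
SELECT
    e.id AS event_id,
    d.brand_name,
    ai.name AS ingredient_name,
    ai.dose.numerator,
    ai.dose.numerator_unit
FROM e
CROSS JOIN UNNEST(e.drug) AS d(d)
CROSS JOIN UNNEST(d.active_ingredients) AS ai(ai);
```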

8. Export our Relational Tables
DuckDB makes it easy to export data. More information can be found here.
COPY (Select * from events ) TO 'events_exported.csv';
COPY (Select * from events ) TO 'events_exported.parquet';
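DuckDB infers the format from the file extension, but COPY also accepts explicit options; for example, choosing the Parquet compression codec or CSV delimiter:

```sql
-- Explicit format options (both are optional)
COPY (SELECT * FROM events)
TO 'events_exported.parquet' (FORMAT parquet, COMPRESSION zstd);

COPY (SELECT * FROM event_reactions)
TO 'event_reactions_exported.csv' (HEADER, DELIMITER ',');
```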
9. Final Thoughts
This walkthrough demonstrates just how far DuckDB can go with complex JSON:
Read JSON directly from URLs
Parse nested structures
Extract ZIP‑wrapped JSON over HTTPS
Flatten arrays with UNNEST
Build staging and normalized tables
Automate SQL generation in the CLI
All of this runs locally, with no servers, clusters, or cloud infrastructure.
DuckDB continues to redefine what’s possible in local analytics — and JSON processing is one of its superpowers.