It is not uncommon for instrumentation, webhooks, and events to end up in large piles of (messy, terrifying, unusable) JSON.
But Buz makes it easy to create well-structured piles of JSON. 🐝
And DuckDB makes it easy to query said piles of JSON. 🦆
The combination of 🦆's and 🐝's is pretty cool.
Schematizing Data
A lot can be said about using schemas to describe data.
Schemas can be used to auto-generate tracking SDKs, seed data dictionaries and discovery mechanisms, version payloads, and more.
But schemas also provide significant value far downstream of data generation, data collection, and "organizational contracts".
Payload validation and annotation
Buz uses metadata attributes defined within schemas to validate and annotate payloads.
For example, this schema:
{
"$schema": "https://registry.buz.dev/s/io.silverton/buz/internal/meta/v1.0.json",
"$id": "io.silverton/buz/example/birdsBees/v1.0.json",
"title": "io.silverton/buz/example/birdsBees/v1.0.json",
"description": "Schema for birds and bees",
"owner": {
"org": "silverton",
"team": "buz",
"individual": "jakthom"
},
"self": {
"vendor": "io.silverton",
"namespace": "buz.example.birdsBees",
"version": "1.0"
},
"type": "object",
"properties": {
"uuid": {
"description": "The event uuid",
"type": "string",
"format": "uuid"
},
"event_name": {
"description": "The name of the event",
"type": "string"
},
"organization": {
"type": "object",
"properties": {
"name": {
"description": "The organization name",
"type": "string"
},
"id": {
"description": "The organization id",
"type": "number"
}
}
}
},
"required": [
"uuid",
"event_name",
"organization"
],
"additionalProperties": false
}
and this payload:
{
"uuid": "b06834d6-aea1-4ef2-9c00-8cd5aa76c8e2",
"event_name": "viewed",
"organization": {
"name": "bee corp",
"id": 10,
}
}
will result in this envelope:
[
{
"uuid": "28b0b531-285f-46f7-88e5-92b34d403fcd",
"timestamp": "2023-04-28T21:09:16.960953Z",
"buzTimestamp": "2023-04-28T21:09:16.960953Z",
"buzVersion": "x.x.dev",
"buzName": "buz-bootstrap",
"buzEnv": "development",
"protocol": "webhook",
"schema": "io.silverton/buz/example/birdsBees/v1.0.json",
"vendor": "io.silverton",
"namespace": "buz.example.birdsBees",
"version": "1.0",
"isValid": true,
"contexts": {
"io.silverton/buz/internal/contexts/httpHeaders/v1.0.json": {
"Accept": "*/*",
"Content-Length": "153",
"Content-Type": "application/json",
"User-Agent": "curl/7.86.0"
}
},
"payload": {
"event_name": "viewed",
"organization": {
"id": 10,
"name": "bee corp"
},
"uuid": "b06834d6-aea1-4ef2-9c00-8cd5aa76c8e2"
}
}
]
Schema attributes get appended to the envelope, which can be used to consistently and reliably answer questions like:
- Where was the payload sourced from?
- What does this payload represent?
- What version of the buz.example.birdsBees schema does this event conform to?
- Is the payload valid?
These attributes are also appended to sink-specific mechanisms like Kafka headers or Pub/Sub attributes, and they power automation far downstream of the point of collection.
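Once the envelopes land somewhere queryable, those questions collapse into simple group-bys. A minimal sketch, assuming the envelopes have already been loaded into a DuckDB table named envelopes (the table name is purely illustrative; loading from S3 is covered below):
D select
>     schema,             -- what the payload represents and which schema version it conforms to
>     protocol,           -- where the payload was sourced from (webhook, pixel, etc.)
>     isValid,            -- whether it passed validation
>     count(*) as events
> from
>     envelopes
> group by
>     1, 2, 3;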
Payload annotation-based partitioning
The example AWS deployment of Buz writes events directly to Kinesis Firehose, which then leverages dynamic partitioning to write incoming payloads to well-structured S3 paths.
The above buz.example.birdsBees payload will be written to an S3 path of:
isValid=true/vendor=io.silverton/namespace=buz.example.birdsBees/version=1.0/year=2023/month=4/day=28/$FILE.gz
Payload validity, vendors, namespaces, versions, etc. are all written to different paths, which also means every S3 path contains identically-structured payloads.
This is key.
Querying JSON directly from S3
DuckDB makes it easy to query your S3 pile (bucket? lake? lakehouse? idk).
To start querying this well-structured pile, first load and configure DuckDB's httpfs extension:
D install httpfs;
D load httpfs;
D set s3_region = 'us-east-1'; -- or wherever
D set s3_access_key_id = '$YOUR_ID';
D set s3_secret_access_key = '$YOUR_KEY';
Since the pile of JSON in S3 is consistently structured, DuckDB schema discovery is effectively free.
You get real tables with real types. And structs all the way down:
D create table local_pile as
select
*
from
read_json_auto('s3://bz-dev-events/isValid=true/vendor=io.silverton/namespace=buz.example.birdsBees/version=1.0/year=2023/month=4/day=28/*.json.gz', timestampformat='%Y-%m-%dT%H:%M:%S.%fZ');
D describe local_pile;
┌───────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name   │ column_type                                                                                                                                                      │ null    │ key     │ default │ extra │
│ varchar       │ varchar                                                                                                                                                          │ varchar │ varchar │ varchar │ int32 │
├───────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼───────┤
│ uuid          │ UUID                                                                                                                                                             │ YES     │         │         │       │
│ timestamp     │ TIMESTAMP                                                                                                                                                        │ YES     │         │         │       │
│ buzTimestamp  │ TIMESTAMP                                                                                                                                                        │ YES     │         │         │       │
│ buzVersion    │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ buzName       │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ buzEnv        │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ protocol      │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ schema        │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ vendor        │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ namespace     │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ version       │ VARCHAR                                                                                                                                                          │ YES     │         │         │       │
│ isValid       │ BOOLEAN                                                                                                                                                          │ YES     │         │         │       │
│ contexts      │ STRUCT("io.silverton/buz/internal/contexts/httpHeaders/v1.0.json" STRUCT("Accept" VARCHAR, "Content-Length" BIGINT, "Content-Type" VARCHAR, "User-Agent" VARCHAR)) │ YES     │         │         │       │
│ payload       │ STRUCT(event_name VARCHAR, organization STRUCT(id UBIGINT, "name" VARCHAR), uuid UUID)                                                                          │ YES     │         │         │       │
├───────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┤
│ 14 rows                                                                                                                                                                                                       6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
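And because the partition values are baked into the path itself, a wider glob picks up more than one day at a time. A sketch, where the april_pile name and the day=* pattern are assumptions based on the layout above:
D create table april_pile as
> select
>     *
> from
>     -- glob across every day in April that was written under this partition prefix
>     read_json_auto('s3://bz-dev-events/isValid=true/vendor=io.silverton/namespace=buz.example.birdsBees/version=1.0/year=2023/month=4/day=*/*.json.gz', timestampformat='%Y-%m-%dT%H:%M:%S.%fZ');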
Queries like this are possible out of the box:
D select
> date_trunc('day', timestamp) as day,
> payload.event_name,
> payload.organization.id as org_id,
> count(*) as events
> from
> local_pile
> group by
> 1,2,3;
┌────────────┬────────────┬────────┬────────┐
│    day     │ event_name │ org_id │ events │
│    date    │  varchar   │ uint64 │ int64  │
├────────────┼────────────┼────────┼────────┤
│ 2023-04-28 │ viewed     │     10 │ 210353 │
└────────────┴────────────┴────────┴────────┘
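The nested contexts are queryable too. A sketch that pulls the request User-Agent out of the httpHeaders context; the struct field names come straight from the describe output above and need quoting because of the slashes and dots:
D select
>     -- quoted struct keys: the context is keyed by its full schema path
>     contexts."io.silverton/buz/internal/contexts/httpHeaders/v1.0.json"."User-Agent" as user_agent,
>     count(*) as events
> from
>     local_pile
> group by
>     1;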
Well-structured piles of Parquet
A well-structured pile of parquet is sometimes preferable to a well-structured pile of JSON. Which is fine!
This too is easy with DuckDB:
D copy local_pile to 'local_pile.parquet' (format parquet);
To query the local parquet pile:
D select * from read_parquet('local_pile.parquet');
┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┬───────────────┬─────────────┬───┬──────────────┬─────────────────────┬─────────┬─────────┬─────────────────────┬─────────────────────┐
│ uuid                │ timestamp           │ buzTimestamp        │ buzVersion │ buzName       │ buzEnv      │ … │ vendor       │ namespace           │ version │ isValid │ contexts            │ payload             │
│ uuid                │ timestamp           │ timestamp           │ varchar    │ varchar       │ varchar     │   │ varchar      │ varchar             │ varchar │ boolean │ struct("io.silvert… │ struct(event_name … │
├─────────────────────┼─────────────────────┼─────────────────────┼────────────┼───────────────┼─────────────┼───┼──────────────┼─────────────────────┼─────────┼─────────┼─────────────────────┼─────────────────────┤
│ 813da4d2-c49b-4b99… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ dcbf6249-42e0-4abe… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ 3ae2d65c-f785-41b5… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ 6c7b3e35-6635-4eec… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ 8b3b4c8c-2ce5-4431… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ 8683ec4e-38a4-49c1… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ c098fdd6-7d0a-446a… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ dc47ec95-fe66-472d… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ a3494e5e-4709-4e62… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ a6c4fc0e-b03f-47ac… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ 55065c25-546a-4f26… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ 808bfe01-6489-44ae… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ ff6325aa-5235-4f94… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ b488fe95-50d6-4244… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
│ f7d63eb5-a1e5-41c5… │ 2023-04-28 21:34:0… │ 2023-04-28 21:34:0… │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB… │ 1.0     │ true    │ {'io.silverton/buz… │ {'event_name': vie… │
└─────────────────────┴─────────────────────┴─────────────────────┴────────────┴───────────────┴─────────────┴───┴──────────────┴─────────────────────┴─────────┴─────────┴─────────────────────┴─────────────────────┘
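Parquet compression is tunable as well; a sketch using zstd (the codec choice here is an assumption, not part of the original example):
D copy local_pile to 'local_pile.parquet' (format parquet, compression zstd);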
Or write it straight back to S3 as partitioned Parquet:
D copy local_pile to 's3://bz-dev-events/partitioned' (format parquet, partition_by (isValid, vendor, namespace, version));
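Reading the partitioned pile back in is just as straightforward. A sketch using hive_partitioning so the partition keys (isValid, vendor, namespace, version) come back as columns; the glob pattern is assumed from the partition_by above:
D select
>     isValid,
>     namespace,
>     version,
>     count(*) as events
> from
>     -- one wildcard per partition level, then the parquet files themselves
>     read_parquet('s3://bz-dev-events/partitioned/*/*/*/*/*.parquet', hive_partitioning=true)
> group by
>     1, 2, 3;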
In Conclusion
Tools like Buz and DuckDB make it easier than ever to create and use well-structured data sitting in S3. Your pile of JSON (or Parquet) doesn't need to be a mess!
And with functionality like registering file paths, it's only getting better.