ytpg: youtube as a postgres database
a postgres wire-protocol proxy whose storage backend is youtube. tables are videos, rows are caption cues.
i built a postgres server that stores its data in youtube videos.
you connect psql to it on localhost:5432 like any other postgres. under the hood: every table is an unlisted video, every row is a vtt caption cue, every commit updates a pinned comment with a merkle checkpoint. there is no local database. if the process crashes it rebuilds itself from youtube on the next boot.
it is, on purpose, one of the worst databases ever made. about 25 writes per day before you hit youtube quota. every operation takes 2 to 5 seconds. google can delete your data whenever they feel like it.
i think it’s great.
code: github.com/neonwalker/ytpg
the mapping
| postgres | youtube |
|---|---|
| database | channel |
| table registry | description of the ytpg::__manifest__ video (json) |
| table | unlisted video named ytpg::<name> |
| schema | the table video’s description (json) |
| row | a vtt caption cue with a json payload |
| insert / update / delete | append-only cues with an _op field |
| commit log / lsn | pinned comment on the table video |
| checkpoint | merkle root in that pinned comment |
a row looks like this:
00:00:00.002000 --> 00:00:00.003000
{"id":2,"email":"[email protected]","name":"Bob","_op":"INSERT","_lsn":42}
the current state of any table is: replay all cues in lsn order, fold the inserts/updates/deletes. the manifest video tells you which tables exist; if you lose the cached id you find it again with search.list.
what works
more than i expected, honestly.
- the usual sql: crud, joins of every flavour (
INNER,LEFT,RIGHT,FULL OUTER,CROSS, multi-way), the obviousWHEREops,ORDER BY,COUNT(*),LIMIT. selects are full scans, joins are nested-loop, a two-way join is twocaptions.downloadcalls. - DROP can’t actually delete (a fresh video costs 1,600 quota units). it recycles instead: renames the video, blanks captions, pushes the id onto a free list. the next CREATE pays 50.
- type coercion and real postgres sqlstate codes for
NOT NULLviolations and bad casts. - psql is happy:
\dt,\l,\dn,\d <table>work because i fake the catalog rows. - segmented tables when a vtt grows past one video. crash recovery from the lsn in the pinned comment. no local state.
- a small cli:
ytpg init,serve,status,dump,restore,shell.
what doesn’t, and probably won’t
- subqueries, transactions, foreign keys
USING/NATURAL JOIN,GROUP BY,HAVING, window functions, CTEsJOINinsideUPDATE/DELETE- indexes
- extended query protocol (parse/bind/execute)
- binary type encodings
- multi-process atomicity
- throughput past ~25 writes/day or ~50 reads/day
- youtube can delete your data
stack
python 3.12, sqlglot for parsing, webvtt-py for cues, google-api-python-client for the api, click for the cli, asyncio for the wire protocol. no sqlite, no redis, no cache file. nothing on disk.