Introduction
This article explains how the asset storage system in Penpot works, along with its internal processes. Before diving into implementation details, it’s important to understand why Penpot needs a dedicated asset storage system in the first place.
Unlike applications that simply store user-uploaded assets and delete them when the corresponding object is deleted, Penpot is an editor. This means users can create elements that contain images, then undo and redo those changes. These scenarios require decoupling the asset lifecycle from the editing process, making it asynchronous and eventually consistent.
Additionally, Penpot uses logical deletion for most objects, allowing users a window of a few days to undo destructive actions. This further supports the need for asset storage to be decoupled from the objects we manage.
Key Features
Key features of the asset storage system include:
- Logical and/or deferred deletion
- Asset categorization via buckets
- Asset deduplication within buckets
- Reference counting by bucket
- Support for multiple storage backends
The Data Model
Before diving into the features, we’ll provide an overview of the data model, focusing on the two most relevant objects.
storage_object
Every uploaded asset creates an entry in the `storage_object` table (within Penpot’s PostgreSQL database). This table stores basic metadata such as size, type, and the backend where the asset is stored.
Schema:

```
                       Table "public.storage_object"
   Column   |           Type           | Collation | Nullable |      Default
------------+--------------------------+-----------+----------+--------------------
 id         | uuid                     |           | not null | uuid_generate_v4()
 created_at | timestamp with time zone |           | not null | now()
 deleted_at | timestamp with time zone |           |          |
 size       | bigint                   |           | not null | 0
 backend    | text                     |           | not null |
 metadata   | jsonb                    |           |          |
 touched_at | timestamp with time zone |           |          |
```
Notable aspects:
- `backend`: Indicates where the physical asset is stored. Current options are `fs` and `s3`. A now-deprecated backend stored files in the database itself, which helped with backups but didn’t scale well.
- `bucket` (in `metadata`): Categorizes the asset semantically. Main buckets include `team-font-variant`, `file-media-object`, and `profile`.
- `hash` (in `metadata`): A BLAKE2b hash computed from the asset’s content, used for deduplication (within buckets).
- `touched_at`: Marks whether the object is pending reference analysis. A `NULL` value indicates no recent changes requiring reanalysis.
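As an illustration, such a content hash could be computed in a streaming fashion (a minimal Python sketch; the digest size and hex encoding are assumptions, not necessarily Penpot’s exact parameters):

```python
import hashlib

def compute_asset_hash(path: str) -> str:
    # Stream the file in chunks so large assets never need to fit in memory.
    # digest_size=32 and hex encoding are illustrative choices.
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()
```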
file_media_object
When a user uploads an image in the workspace, an entry is also created in the `file_media_object` table. This object is what Penpot files internally reference.
Schema:

```
                     Table "public.file_media_object"
    Column    |           Type           | Collation | Nullable |      Default
--------------+--------------------------+-----------+----------+--------------------
 id           | uuid                     |           | not null | uuid_generate_v4()
 created_at   | timestamp with time zone |           | not null | clock_timestamp()
 deleted_at   | timestamp with time zone |           |          |
 name         | text                     |           | not null |
 width        | integer                  |           | not null |
 height       | integer                  |           | not null |
 mtype        | text                     |           | not null |
 file_id      | uuid                     |           | not null |
 is_local     | boolean                  |           | not null | false
 media_id     | uuid                     |           | not null |
 thumbnail_id | uuid                     |           |          |
```
This table links a file to a `storage_object` through the `media_id` and `thumbnail_id` fields.

For instance, multiple `file_media_object` entries (e.g., from different templates) can reference the same `storage_object` thanks to deduplication. This also makes garbage collection (GC) more efficient, as it can query references directly without scanning file blobs.
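For instance, the reference check for the `file-media-object` bucket can be expressed as a direct query against this table (a hypothetical sketch using psycopg2-style placeholders; `is_referenced` is illustrative, not Penpot’s actual code):

```python
# An asset in the file-media-object bucket is referenced if any
# file_media_object row points at it as the media or the thumbnail.
REFERENCED_SQL = """
SELECT count(*) FROM file_media_object
 WHERE media_id = %(id)s OR thumbnail_id = %(id)s
"""

def is_referenced(conn, storage_object_id) -> bool:
    with conn.cursor() as cur:
        cur.execute(REFERENCED_SQL, {"id": storage_object_id})
        return cur.fetchone()[0] > 0
```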
Other objects, like fonts and profile photos, reference `storage_object` directly, along with their corresponding bucket names.
Processes
Image Uploads
When a user uploads an image (as-is or as a background), a multipart request is made to the API. Internally, the upload process performs the following operations:
- The BLAKE2b hash of the content is calculated.
- The system checks for an existing `storage_object` with the same hash and bucket (for example, `file-media-object`).
- A new entry is created in `file_media_object`, linking the asset and the file and storing metadata like size and MIME type.
- The API returns the `file_media_object` ID.
- The frontend updates the file with this ID.
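Put together, the server-side flow might look roughly like this (a simplified sketch; the function, the SQL, and the `fs` backend choice are illustrative, not Penpot’s actual implementation):

```python
import hashlib
import json
import uuid

def upload_image(conn, file_id, name, content, mtype, width, height):
    # Step 1: hash the content for deduplication.
    content_hash = hashlib.blake2b(content, digest_size=32).hexdigest()
    with conn.cursor() as cur:
        # Step 2: reuse an existing storage_object with the same hash + bucket.
        cur.execute(
            """SELECT id FROM storage_object
                WHERE metadata->>'bucket' = %s
                  AND metadata->>'hash' = %s
                  AND deleted_at IS NULL""",
            ("file-media-object", content_hash))
        row = cur.fetchone()
        if row:
            media_id = row[0]
        else:
            media_id = str(uuid.uuid4())
            cur.execute(
                """INSERT INTO storage_object (id, size, backend, metadata)
                   VALUES (%s, %s, %s, %s)""",
                (media_id, len(content), "fs",
                 json.dumps({"bucket": "file-media-object",
                             "hash": content_hash})))
            # ...and the bytes themselves are written to the fs/s3 backend here.
        # Step 3: create the file_media_object row linking asset and file.
        fmo_id = str(uuid.uuid4())
        cur.execute(
            """INSERT INTO file_media_object
                   (id, name, width, height, mtype, file_id, media_id)
               VALUES (%s, %s, %s, %s, %s, %s, %s)""",
            (fmo_id, name, width, height, mtype, file_id, media_id))
    conn.commit()
    return fmo_id  # Step 4: the API returns this ID to the frontend.
```

Note that deduplication happens before anything new is written: if the hash already exists in the bucket, only the lightweight `file_media_object` link is created.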
For fonts or other object types, there’s no intermediate relation; they reference the `storage_object` directly.
If the image is no longer used (e.g., after an undo), it remains referenced until the file becomes inactive (i.e., no recent modifications) and is processed by FileGC (see below).
Logical Deletes
Although not exclusive to asset storage, logical deletion is tightly related.
When a major object (file, project, team, profile) is deleted, it’s marked as deleted, and an asynchronous cascade process begins. This marks related objects as deleted with the same timestamp.
Actual removal is deferred to the Garbage Collection process, offering an approximately 7-day undo window.
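As an illustration, the cascade step for a file could propagate the file’s own timestamp to its media rows (a hypothetical sketch; the `file` table name and the shape of the cascade are assumptions):

```python
def cascade_delete_file(conn, file_id):
    with conn.cursor() as cur:
        # Mark the file's media rows as deleted, reusing the file's own
        # deleted_at so the whole cascade shares one timestamp.
        cur.execute(
            """UPDATE file_media_object fmo
                  SET deleted_at = f.deleted_at
                 FROM file f
                WHERE f.id = %s
                  AND fmo.file_id = f.id
                  AND fmo.deleted_at IS NULL""",
            (file_id,))
    conn.commit()
```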
Garbage Collection
We often joke that Penpot’s GC is like a Mark and Sweep garbage collector, with a marking phase (analysis) and a sweep phase (deletion).
There are four main GC processes:
FileGC
Cleans up old or unused image references in a file after a period of inactivity. It also performs many other file-cleanup operations, and consists of the following steps:
- Analyzes the file content.
- Deletes unused entries from `file_media_object`.
- Marks the corresponding `storage_object` entries via the `touched_at` field.
Runs periodically, only processing inactive files.
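In sketch form, the deletion and marking steps might look like this (hypothetical names; the set of still-used IDs comes from the content analysis in the first step):

```python
def clean_file_media(conn, file_id, used_ids):
    with conn.cursor() as cur:
        # Drop file_media_object rows the file content no longer references.
        cur.execute(
            """DELETE FROM file_media_object
                WHERE file_id = %s AND NOT (id = ANY(%s))
            RETURNING media_id, thumbnail_id""",
            (file_id, list(used_ids)))
        # Touch the underlying storage objects so StorageTouchedGC
        # re-analyzes their references later.
        ids = [i for row in cur.fetchall() for i in row if i is not None]
        if ids:
            cur.execute(
                "UPDATE storage_object SET touched_at = now() WHERE id = ANY(%s)",
                (ids,))
    conn.commit()
```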
ObjectsGC
Deletes all non-storage objects marked for deletion after ~7 days. If an object is linked to a `storage_object`, it sets the `touched_at` field for later analysis.
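The row deletion and the `touched_at` marking can even be combined into a single statement (a hypothetical sketch using a data-modifying CTE, shown here for `file_media_object`):

```python
# Delete expired rows and touch their storage objects in one round trip.
OBJECTS_GC_SQL = """
WITH expired AS (
    DELETE FROM file_media_object
     WHERE deleted_at IS NOT NULL
       AND deleted_at < now() - interval '7 days'
 RETURNING media_id, thumbnail_id
)
UPDATE storage_object so
   SET touched_at = now()
  FROM expired e
 WHERE so.id IN (e.media_id, e.thumbnail_id)
"""

def objects_gc(conn):
    with conn.cursor() as cur:
        cur.execute(OBJECTS_GC_SQL)
    conn.commit()
```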
StorageTouchedGC
Periodically scans `storage_object` entries with a non-null `touched_at`:
- Determines the reference strategy based on the object’s bucket.
- If no references exist, marks the object as deleted for final removal.
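One pass might look roughly like this (a hedged sketch; the per-bucket `check_references` dispatch is the essential idea, and for the `file-media-object` bucket it could be the `is_referenced` query shown earlier):

```python
def storage_touched_gc(conn, check_references, batch_size=100):
    # check_references(conn, id, bucket) -> bool is the bucket-specific
    # reference strategy, passed in for illustration.
    with conn.cursor() as cur:
        cur.execute(
            """SELECT id, metadata->>'bucket' FROM storage_object
                WHERE touched_at IS NOT NULL AND deleted_at IS NULL
                LIMIT %s""",
            (batch_size,))
        for oid, bucket in cur.fetchall():
            if check_references(conn, oid, bucket):
                # Still referenced: clear the mark and keep the asset.
                cur.execute(
                    "UPDATE storage_object SET touched_at = NULL WHERE id = %s",
                    (oid,))
            else:
                # Unreferenced: mark as deleted for StorageDeletedGC.
                cur.execute(
                    """UPDATE storage_object
                          SET deleted_at = now(), touched_at = NULL
                        WHERE id = %s""",
                    (oid,))
    conn.commit()
```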
StorageDeletedGC
Permanently deletes `storage_object` entries marked as deleted. It uses batch deletion to optimize throughput, which is particularly important for S3-type backends that may charge per API call.
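A batched pass could look like this (the `delete_many` backend call is hypothetical; with S3, a single DeleteObjects request can remove up to 1,000 keys, so batching directly reduces the number of billable calls):

```python
def storage_deleted_gc(conn, backend, batch_size=500):
    with conn.cursor() as cur:
        # Remove a bounded batch of rows and collect their IDs.
        cur.execute(
            """DELETE FROM storage_object
                WHERE id IN (SELECT id FROM storage_object
                              WHERE deleted_at IS NOT NULL
                              LIMIT %s)
            RETURNING id""",
            (batch_size,))
        ids = [r[0] for r in cur.fetchall()]
    if ids:
        # One bulk backend call per batch (hypothetical API).
        backend.delete_many(ids)
    conn.commit()
```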
Performance Considerations
These processes must run efficiently and incrementally, avoiding long-held locks. A slow GC process can block user actions like uploading images or templates, leading to performance issues or timeouts.
This is especially important with deduplication. If a GC process is analyzing a `storage_object` already used in a new template being uploaded, the upload may be blocked until GC completes.
To mitigate this, each GC process uses mini-transactions, reducing locking time and making operations virtually invisible to users.
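The pattern, reduced to a sketch (not Penpot’s actual scheduler):

```python
def run_gc_in_chunks(conn, gc_step, chunk_size=50):
    # Each call to gc_step processes at most chunk_size items and returns
    # how many it handled; committing between calls releases row locks
    # quickly, so user-facing writes are blocked only briefly.
    while True:
        handled = gc_step(conn, chunk_size)
        conn.commit()
        if handled == 0:
            break
```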