* 🎉 Add telemetry anonymous event collection
Rewrite the audit logging subsystem to support three operating modes and
add anonymous telemetry event collection:
Modes:
- A (audit-log only): events persisted with full context
- B (audit-log + telemetry): same as A, plus events are collected for
telemetry shipping
- C (telemetry-only): events stored anonymously with PII stripped,
telemetry flag active, audit-log flag inactive
Audit system refactoring (app.loggers.audit):
- Replace qualified map keys (::audit/name etc.) with plain keywords
- Rename submit! -> submit, insert! -> insert, prepare-event ->
prepare-rpc-event
- Add submit* as a lower-level public API
- Add process-event dispatch function that handles all three modes and
webhooks in a single tx-run!
- Add :id to event schema (auto-generated if omitted)
- Add filter-telemetry-props: anonymises event props per event type.
Keeps UUID/boolean/number values; for login/identify events preserves
lang, auth-backend, email-domain; for navigate events preserves route,
file-id, team-id, page-id; instance-start trigger passes through.
- Add filter-telemetry-context: retains only safe context keys.
Backend: version, initiator, client-version, client-user-agent.
Frontend: browser, os, locale, screen metrics, event-origin.
- Timestamps truncated to day precision via ct/truncate for telemetry
storage
- PII stripped: props emptied, ip-addr zeroed, session-linking and
access-token fields removed from context
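
The prop filtering and timestamp truncation described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Clojure implementation: the names `KEEP_BY_EVENT`, `filter_props` and `truncate_to_day` are hypothetical, and the instance-start pass-through is left out.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical allow-lists modelled on the commit description.
KEEP_BY_EVENT = {
    "login": {"lang", "auth-backend", "email-domain"},
    "identify": {"lang", "auth-backend", "email-domain"},
    "navigate": {"route", "file-id", "team-id", "page-id"},
}


def is_safe_value(value):
    """UUIDs, booleans and numbers carry no free-form text."""
    return isinstance(value, (uuid.UUID, bool, int, float))


def filter_props(event_name, props):
    """Keep safe-typed values plus the keys allow-listed for this event."""
    allowed = KEEP_BY_EVENT.get(event_name, set())
    return {k: v for k, v in props.items()
            if k in allowed or is_safe_value(v)}


def truncate_to_day(ts):
    """Drop the time-of-day part so stored timestamps have day precision."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)
```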
Config (app.config):
- Derive :enable-telemetry flag from telemetry-enabled config option
Email utilities (app.email):
- Add email/clean and email/get-domain helper functions for domain
extraction from email addresses
Setup (app.setup):
- Emit instance-start trigger event at system startup
- Simplify handle-instance-id (remove read-only check)
RPC layer (app.rpc):
- wrap-audit now activates when :telemetry flag is set
- Add :request-id to RPC params context for event correlation
RPC commands (management, teams_invitations, verify_token, OIDC auth,
webhooks): migrate all audit call sites to use the new plain-key API
SREPL (app.srepl.main):
- Migrate all audit/insert! calls to audit/insert with plain keys
Telemetry task (app.tasks.telemetry):
- Restructure legacy report into make-legacy-request; distinguish
payload type as :telemetry-legacy-report
- Add collect-and-send-audit-events: loops fetching up to 10,000 rows
per iteration, encodes and sends each page, deletes on success, and
stops immediately on failure for retry
- Add send-event-batch: POSTs fressian+zstd batch (base64 via
blob/encode-str) to the telemetry endpoint with instance-id per event
- Add gc-telemetry-events: enforces 100,000-row safety cap by dropping
oldest rows first
- Add delete-sent-events: deletes successfully shipped rows by id
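
The page-by-page shipping loop described above might look roughly like this. It is a sketch under assumptions: a plain list stands in for the audit_log table, and `send_batch` is a hypothetical callable that returns True on success.

```python
def collect_and_send(rows, send_batch, page_size=10_000):
    """Ship rows page by page and return how many were shipped and deleted.

    `rows` is a list standing in for the table; `send_batch` is a
    hypothetical sender returning True on success.
    """
    shipped = 0
    while rows:
        page = rows[:page_size]
        if not send_batch(page):
            break  # stop at the first failure; unsent rows stay for retry
        del rows[:len(page)]  # delete only what was shipped successfully
        shipped += len(page)
    return shipped
```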
Blob utilities (app.util.blob):
- Add encode-str/decode-str: combine fressian+zstd encoding with URL-
safe base64 for JSON-safe string transport
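
The idea behind encode-str/decode-str can be illustrated with a small round-trip sketch. To stay dependency-free it substitutes JSON for fressian and zlib for zstd, so it only shows the shape (serialize, compress, URL-safe base64), not the real wire format.

```python
import base64
import json
import zlib


def encode_str(data):
    """Serialize, compress, then emit URL-safe base64 text."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def decode_str(text):
    """Invert encode_str: base64-decode, decompress, deserialize."""
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(text)))
```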
Database:
- Add migration 0145: index on audit_log (source, created_at ASC) for
efficient telemetry batch collection queries
Frontend:
- Always initialize event system regardless of :audit-log flag
- Defer auth events (signin identify) to after profile is set
- Refactor event subsystem for telemetry support
Tests (21 test vars, 94 assertions in tasks-telemetry-test):
- Cover all code paths: disabled/enabled telemetry, no-events no-op,
happy-path batch send and delete, failure retention, payload anonymity,
context stripping, timestamp day precision, batch encoding round-trip,
multi-page iteration, GC cap enforcement, partial failure handling
- blob encode-str/decode-str round-trip tests (14 test vars)
- RPC audit integration tests (5 test vars)
Signed-off-by: Andrey Antukh <niwi@niwi.nz>
* 📎 Add pr feedback changes
---------
Signed-off-by: Andrey Antukh <niwi@niwi.nz>
The main idea behind this refactor is to make the
API less specialized for the specific use of our internal
submodules and make it more general and usable
for other purposes (for example, cache)
Replace general usage of virtual threads with platform threads
and use virtual threads for lightweight procs such as websocket
connections. This decision is made mainly because virtual threads
do not show up in thread dumps in an easy way, so debugging issues
becomes very difficult.
The thread requirements of penpot for serving http requests
are not very big, so this decision does not really affect
the resource usage.
This upgrade also includes the complete elimination of spec usage
from the backend codebase, completing the long-running migration
to fully use malli for validation and decoding.
The main issue was the long-running gc operation that
affects storage objects with deduplication. The long-running
transaction ends up locking some storage object rows, which as a
side effect made operations like import-binfile block indefinitely
on those same rows (because of deduplication).
The solution used in this commit is to split the operation into small
chunks so we no longer use long-running transactions that hold
too many locks. With this approach we open a window in which the
distinct operations that require locks on the same rows can work
concurrently.
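
The chunking approach can be illustrated with a small sketch. It uses Python's sqlite3 module and a hypothetical `storage_object` table with a `deleted` flag; the real task runs against PostgreSQL, so this only shows the "one short transaction per chunk" pattern.

```python
import sqlite3


def delete_in_chunks(conn, chunk_size=100):
    """Delete flagged rows in small batches, committing after each batch."""
    total = 0
    while True:
        with conn:  # commits on exit, so locks are released per chunk
            cur = conn.execute(
                "DELETE FROM storage_object WHERE id IN ("
                "SELECT id FROM storage_object WHERE deleted = 1 LIMIT ?)",
                (chunk_size,),
            )
            deleted = cur.rowcount
        if deleted == 0:
            break
        total += deleted
    return total
```

Between chunks no transaction is open, which is the window other operations can use to take their own locks.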
Before this commit, climit was heavily used inside
transactions, so heavily contended operations such as file thumbnail
creation could exhaust the db pool.
This commit fixes this issue by setting up a better resource limiting
mechanism that works outside of transactions, so contention will
no longer hold an open connection/transaction.
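
The ordering change can be sketched as follows: wait for a permit first, and only then take a connection. `Limiter`, `run_limited` and `open_tx` are hypothetical names for illustration; the actual climit implementation is not shown in this message.

```python
import threading
from contextlib import contextmanager


class Limiter:
    """A tiny concurrency limiter backed by a bounded semaphore."""

    def __init__(self, permits):
        self._sem = threading.BoundedSemaphore(permits)

    @contextmanager
    def limit(self):
        self._sem.acquire()
        try:
            yield
        finally:
            self._sem.release()


def run_limited(limiter, open_tx, work):
    """Acquire the permit before the transaction, never inside it."""
    with limiter.limit():       # callers queue here holding no connection
        with open_tx() as tx:   # a connection is taken only with a permit
            return work(tx)
```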
It also adds a general improvement to the traceability of the climit
mechanism: it now properly logs the profile-id that is currently
causing contention on specific resources.
It also adds a general/root climit that is applied to all requests,
so if someone starts making abusive requests, we can clearly detect
it.
The main objective is to prevent the deletion of objects that can leave
unreachable orphan objects which we are unable to correctly track.
Additionally, this commit includes:
1. Properly implement safe cascade deletion of all participating
tables on soft deletion in the objects-gc task;
2. Make the file thumbnail related tables also participate in the
touch/refcount mechanism, applying the same safety checks;
3. Add helper for db query lazy iteration using PostgreSQL support
for server side cursors;
4. Fix efficiency issues on gc related tasks using server side
cursors instead of custom chunked iteration for processing data.
The problem arose when a large set of rows has an identical
value in the deleted_at column and the chunk size is small (the
default); in that case the custom chunked iteration only reads the
first N items and skips the rest of the set until the next run.
This has caused many objects to remain pending elimination,
taking up space for longer than expected. The server side cursor
based iteration does not have this problem and iterates correctly
over all objects.
5. Fix refcount issues on font variant deletion RPC methods
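
The failure mode described in item 4 can be reproduced with a small sketch. It assumes the custom chunked reader paged by "deleted_at greater than the last value seen"; that detail is a guess for illustration, and both function names are hypothetical.

```python
def chunked_run(rows, chunk_size):
    """One run of a hypothetical chunked reader keyed on deleted_at."""
    ordered = sorted(rows, key=lambda r: r["deleted_at"])
    seen, last = [], None
    while True:
        chunk = [r for r in ordered
                 if last is None or r["deleted_at"] > last][:chunk_size]
        if not chunk:
            return seen
        seen.extend(chunk)
        # Rows sharing this deleted_at value beyond the chunk are skipped.
        last = chunk[-1]["deleted_at"]


def cursor_run(rows):
    """Cursor-style iteration: stream every row exactly once."""
    yield from sorted(rows, key=lambda r: r["deleted_at"])
```

With five rows sharing one deleted_at value and a chunk size of two, the chunked reader sees only two rows in a run, while the cursor-style iteration sees all five.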