Andrey Antukh 405a73e8ba
Add climit impl and config for file snapshot methods (#9722)
*  Add dedicated concurrency limit for restore-file-snapshot

This adds a dedicated climit configuration for the restore-file-snapshot
RPC method with :permits 1 per profile (plus queue of 2 and 60s timeout)
and a global limit of 3. Previously the method only used the generic
root/by-profile and root/global limits, allowing up to 7 concurrent
restore operations per profile which caused database row lock contention
on FOR UPDATE and connection pool exhaustion.

*  Skip locking on restore! to avoid blocking other operations

Changes the row lock acquisition in restore! from a blocking FOR UPDATE
to FOR UPDATE SKIP LOCKED. If the file row is already locked by another
concurrent operation (e.g., another restore or an update-file), the query
returns no rows and the caller fails fast with a clear conflict error
instead of blocking indefinitely holding a database connection.

*  Add queue and timeout limits to root/by-profile concurrency limit

Previously root/by-profile had no queue limit (unbounded Integer/MAX_VALUE)
and no timeout, allowing requests to pile up indefinitely behind a profile
whose permits were exhausted by long-running operations. This could lead
to memory pressure and cascading failures. Now limited to 30 queued
requests with a 30-second timeout so excess requests fail fast.

*  Move backup snapshot creation outside restore transaction

The backup snapshot (fsnap/create!) is now created in its own short-lived
connection before the actual restore transaction begins. This ensures the
backup is persisted independently of the restore outcome and reduces the
restore transaction window.

The restore itself runs inside a db/tx-run! block with an optimistic
locking check: it reads the file with FOR UPDATE and compares its revn
against the value captured at backup time. If the file was edited
concurrently, the restore aborts with a conflict error to prevent data
loss.

Co-dependent with the SKIP LOCKED change in restore! — the FOR UPDATE
acquired here is in the same transaction as restore!, so the SKIP LOCKED
inside restore! correctly sees the row as unlocked (same transaction).

* ♻️ Remove unused private function get-minimal-file

The local get-minimal-file function in file_snapshots.clj is no longer
used since restore! switched to direct exec-one! with FOR UPDATE SKIP
LOCKED. The sql:get-minimal-file SQL constant is still used directly.

*  Add minor improvements on db connection management

* ♻️ Refactor create-file-snapshot to use explicit transaction management

Remove automatic transaction wrapping (`::db/transaction true`) and
pass `cfg` through the call chain instead of destructured `conn`.
Wrap `fsnap/create!` in an explicit `db/tx-run!` for clearer
transaction boundaries.

Signed-off-by: Andrey Antukh <niwi@niwi.nz>

*  Add dedicated concurrency limit for create-file-snapshot

This adds a dedicated climit configuration for the create-file-snapshot
RPC method with :permits 1 per profile (plus queue of 2 and 60s timeout)
and a global limit of 3. Previously the method only used the generic
root/by-profile and root/global limits, allowing up to 10 concurrent
snapshot creation operations per profile which could cause database
contention and connection pool exhaustion.

Signed-off-by: Andrey Antukh <niwi@niwi.nz>

---------

Signed-off-by: Andrey Antukh <niwi@niwi.nz>
2026-05-19 14:30:44 +02:00

44 lines
1.5 KiB
Clojure

;; This Source Code Form is subject to the terms of the Mozilla Public
;; License, v. 2.0. If a copy of the MPL was not distributed with this
;; file, You can obtain one at http://mozilla.org/MPL/2.0/.
;;
;; Copyright (c) KALEIDOS INC
(ns backend-tests.db-test
(:require
[app.db :as db]
[backend-tests.helpers :as th]
[clojure.test :as t])
(:import
com.zaxxer.hikari.HikariConfig
com.zaxxer.hikari.HikariDataSource
java.sql.Connection))
(t/use-fixtures :once th/state-init)
(t/deftest pool-stats-returns-expected-keys
(let [stats (db/pool-stats th/*pool*)]
(t/testing "all expected keys are present"
(t/is (contains? stats :active-connections))
(t/is (contains? stats :idle-connections))
(t/is (contains? stats :threads-awaiting-connection))
(t/is (contains? stats :total-connections))
(t/is (contains? stats :maximum-pool-size))
(t/is (contains? stats :minimum-idle)))
(t/testing "values are non-negative integers"
(t/is (>= (:active-connections stats) 0))
(t/is (>= (:idle-connections stats) 0))
(t/is (>= (:threads-awaiting-connection stats) 0))
(t/is (>= (:total-connections stats) 0))
(t/is (>= (:maximum-pool-size stats) 0))
(t/is (>= (:minimum-idle stats) 0)))
(t/testing "total connections equals active + idle"
(t/is (= (:total-connections stats)
(+ (:active-connections stats)
(:idle-connections stats)))))
(t/testing "maximum pool size is reasonable"
(t/is (pos? (:maximum-pool-size stats))))))