Merge pull request 'Provide reference Grafana dashboard and improve docs related to monitoring+perf' (#966) from Oneric/akkoma:grafana_ref into develop

Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/966
2025-08-23 16:27:18 +00:00
commit 7c1a08913f
parent be2e014c60 c62a476289
6 changed files with 3765 additions and 55 deletions
--- a/docs/docs/administration/CLI_tasks/database.md
+++ b/docs/docs/administration/CLI_tasks/database.md
@ -26,13 +26,19 @@ Replaces embedded objects with references to them in the `objects` table. Only n

 ## Prune old remote posts from the database

-This will prune remote posts older than 90 days (configurable with [`config :pleroma, :instance, remote_post_retention_days`](../../configuration/cheatsheet.md#instance)) from the database. Pruned posts may be refetched in some cases.
+This will selectively prune remote posts older than 90 days (configurable with [`config :pleroma, :instance, remote_post_retention_days`](../../configuration/cheatsheet.md#instance)) from the database. Pruned posts may be refetched in some cases.

 !!! note
-    The disk space will only be reclaimed after a proper vacuum. By default, Postgresql does this for you on a regular basis, but if your instance has been running for a long time and there are many rows deleted, it may be advantageous to use `VACUUM FULL` (e.g. by using the `--vacuum` option).
+    The disk space used up by deleted rows only becomes usable for new data after a vacuum.
+    By default, PostgreSQL does this for you on a regular basis, but if you delete a lot at once
+    it might be advantageous to also manually kick off a vacuum and statistics update using `VACUUM ANALYZE`.
+
+    **However**, the freed up space is never returned to the operating system unless you run
+    the much heavier `VACUUM FULL` operation. This expensive but comprehensive vacuum mode
+    can be scheduled using the `--vacuum` option.

 !!! danger
-    You may run out of disk space during the execution of the task or vacuuming if you don't have about 1/3rds of the database size free. Vacuum causes a substantial increase in I/O traffic, and may lead to a degraded experience while it is running.
+    You may run out of disk space during the execution of the task or full vacuuming if you don't have about 1/3 of the database size free. `VACUUM FULL` causes a substantial increase in I/O traffic, needs full table locks and thus renders the instance basically unusable while it's running.
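The rule of thumb in the warning above can be turned into a quick pre-flight check. This is a minimal Python sketch, not part of Akkoma: the function name `enough_free_space` is hypothetical, and both sizes have to be supplied by you (e.g. the database size as reported by PostgreSQL and the free space on the volume holding its data directory).

```python
def enough_free_space(db_size_bytes, free_bytes):
    """Return True if free space is at least about one third of the database size."""
    # Integer arithmetic avoids rounding surprises: free >= size / 3  <=>  3 * free >= size
    return free_bytes * 3 >= db_size_bytes


# Hypothetical numbers: a 30 GiB database with 10 GiB and 9 GiB free respectively
GIB = 1024 ** 3
ok = enough_free_space(30 * GIB, 10 * GIB)
too_tight = enough_free_space(30 * GIB, 9 * GIB)
```

The free space could for example be read with `shutil.disk_usage(path).free`, where `path` is wherever your PostgreSQL data lives. The one-third figure is only the estimate given in the text, so treat a passing check as a hint rather than a guarantee.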

 === "OTP"

@ -48,6 +54,27 @@ This will prune remote posts older than 90 days (configurable with [`config :ple

 ### Options

+The recommended starting point and configuration for small and medium-sized instances is:
+```sh
+prune_objects --keep-followed posts --keep-threads --keep-non-public
+# followed by
+prune_orphaned_activities --no-singles
+prune_orphaned_activities --no-arrays
+# and finally, using psql to manually run:
+#   VACUUM ANALYZE;
+#   REINDEX TABLE objects;
+#   REINDEX TABLE activities;
+```
+This almost certainly won’t delete stuff you’re interested in and
+makes sure the database is immediately utilising the newly freed up space.
+If you need more aggressive database size reductions or if this proves too costly to run for you
+you can drop restrictions and/or use the `--limit` option.
+In the opposite case if everything goes through quickly,
+you can combine the three CLI tasks into one for future runs using `--prune-orphaned-activities`
+and perhaps even using a full vacuum (which implies a reindex) using `--vacuum` too.
+
+Full details below:
+
 - `--keep-followed <mode>` - If set to `posts` all posts and boosts of users with local follows will be kept.  
    If set to `full` it will additionally keep any posts such users interacted with; this requires `--keep-threads`.  
    By default this is set to `none` and followed users are not treated special.
@ -56,7 +83,8 @@ This will prune remote posts older than 90 days (configurable with [`config :ple
 - `--limit` - limits how many remote posts get pruned. This limit does **not** apply to any of the follow up jobs. If wanting to keep the database load in check it is thus advisable to run the standalone `prune_orphaned_activities` task with a limit afterwards instead of passing `--prune-orphaned-activities` to this task.
 - `--prune-orphaned-activities` - Also prune orphaned activities afterwards. Activities are things like Like, Create, Announce, Flag (aka reports)... They can significantly help reduce the database size.
 - `--prune-pinned` - Also prune pinned posts; keeping pinned posts does not suffice to protect their threads from pruning, even when using `--keep-threads`.  
-    Note, if using this option and pinned posts are pruned, they and their threads will just be refetched on the next user update. Therefore it usually doesn't bring much gain while incurring a heavy fetch load after pruning.
+    Note, if using this option and pinned posts are pruned, they and their threads will just be refetched on the next user update. Therefore it usually doesn't bring much gain while incurring a heavy fetch load after pruning.  
+    One exception to this is if you already need to use a relatively small `--limit` to keep downtime manageable or even to be able to run it without downtime. Retaining pinned posts adds a mostly constant overhead which will impact repeated runs with a small limit much more than one full prune run.
 - `--vacuum` - Run `VACUUM FULL` after the objects are pruned. This should not be used on a regular basis, but is useful if your instance has been running for a long time before pruning.

 ## Prune orphaned activities from the database
--- a/docs/docs/administration/monitoring.md
+++ b/docs/docs/administration/monitoring.md
@ -1,45 +1,275 @@
 # Monitoring Akkoma

-If you run akkoma, you may be inclined to collect metrics to ensure your instance is running smoothly,
-and that there's nothing quietly failing in the background.
+If you run Akkoma it’s a good idea to collect metrics to ensure your instance is running smoothly
+without anything silently failing and to aid troubleshooting if something actually goes wrong.

-To facilitate this, akkoma exposes a dashboard and prometheus metrics to be scraped.
+To facilitate this, Akkoma exposes Prometheus metrics to be scraped for long-term 24/7 monitoring
+as well as two built-in dashboards with ephemeral info about just the current status.
+Setting up Prometheus scraping is highly recommended.

 ## Prometheus

-See: [export\_prometheus\_metrics](../../configuration/cheatsheet#instance)
+This method gives a more or less complete overview and allows for 24/7 long-term monitoring.

-To scrape prometheus metrics, we need an oauth2 token with the `admin:metrics` scope.
+Prometheus metric export can be globally disabled if you really want to,
+but it doesn’t cause much overhead and is enabled by default: see the
+[export\_prometheus\_metrics](../../configuration/cheatsheet#instance) config option.

-Consider using [constanze](https://akkoma.dev/AkkomaGang/constanze) to make this easier -
+Akkoma only exposes the current state of all metrics; to make it actually useful
+an external scraper needs to regularly fetch and store those values.
+An overview of the necessary steps follows.
+
+### Step 1: generate a token
+
+Accessing Prometheus metrics requires an OAuth2 token with the `admin:metrics` (sub)scope.
+An access token with only this subscope will be unable to do anything at all _except_ looking at the exported metrics.
+
+Assuming your account has access to the `admin` scope category,
+a suitable metrics-only token can be conveniently generated using
+[constanze](https://akkoma.dev/AkkomaGang/constanze).
+If you didn’t already do so before, set up `constanze` by running `constanze configure`.
+Now getting the token is as simple as running the below command and following its instructions:

 ```bash
 constanze token --client-app --scopes "admin:metrics" --client-name "Prometheus"
 ```

-or see `scripts/create_metrics_app.sh` in the source tree for the process to get this token.
+Alternatively you may manually call into the token and app APIs;
+check `scripts/create_metrics_app.sh` in the source tree for how to do this.

-Once you have your token of the form `Bearer $ACCESS_TOKEN`, you can use that in your prometheus config:
+The resulting token will have the form `Bearer $ACCESS_TOKEN`;
+in the following replace occurrences of `$ACCESS_TOKEN` with the actual token string everywhere.  
+If you wish, you can now check the token works by manually using it to query the current metrics with `curl`:

-```yaml
- job_name: akkoma
-  scheme: https
-  authorization:
-    credentials: $ACCESS_TOKEN # this should have the bearer prefix removed
-  metrics_path: /api/v1/akkoma/metrics
-  static_configs:
-  - targets:
-    - example.com
+!!! note
+    After restarting the instance it may take a couple of minutes for content to show up in the metric endpoint.
+
+```sh
+curl -i -H 'Authorization: Bearer $ACCESS_TOKEN' https://myinstance.example/api/v1/akkoma/metrics | head -n 100
 ```

-## Dashboard
+### Step 2: set up a scraper
+
+You may use the eponymous [Prometheus](https://prometheus.io/)
+or anything compatible with it, e.g. [VictoriaMetrics](https://victoriametrics.com/).
+The latter claims better performance and storage efficiency.
+
+Both of them can usually be easily installed via distro packages or Docker.
+Depending on your distro or installation method the preferred way to change the CLI arguments and the location of config files may differ; consult the documentation of your chosen method to find out.  
+Of special interest is the location of the prometheus scraping config file
+and perhaps the maximal data retention period setting,
+to manage used disk space and make sure you keep records long enough for your purposes.
+It might also be a good idea to set up a minimal buffer of free disk space if you’re tight on that;
+with VictoriaMetrics this can be done via the `-storage.minFreeDiskSpaceBytes 1GiB` CLI flag.
+
+Ideally the scraper runs on a different machine than Akkoma to be able to
+distinguish Akkoma downtime from scraper downtime, but this is not strictly necessary.
+
+Once you’ve installed one of them, it’s time to add a job for scraping Akkoma.
+For Prometheus the `scrape_configs` section will usually be added to the main config file,
+for VictoriaMetrics this will be in the file passed via `-promscrape.config file_path`.
+In either case a `scrape_configs` with just one job for a single Akkoma instance will look like this:
+
+```yaml
+scrape_configs:
+  - job_name: 'akkoma_scrape_job'
+    scheme: https
+    metrics_path: /api/v1/akkoma/metrics
+    static_configs:
+    - targets: ['myinstance.example']
+    # reminder: no Bearer prefix here!
+    bearer_token: '$ACCESS_TOKEN'
+    # One minute should be frequent enough, but you can choose any value, or rely on the global default.
+    # However, for later use in Grafana you want to match this exactly, thus make note.
+    scrape_interval: '1m'
+```
+
+Now (re)start the scraper service, wait for a multiple of the scrape interval and check logs
+to make sure no errors occur.
+
+### Step 3: visualise the collected data
+
+At last it’s time to actually get a look at the collected data.
+There are many options working with Prometheus-compatible backends
+and even software which can act as both the scraper _and_ visualiser in one service.
+Here we’ll just deal with Grafana, since we ship a reference Grafana dashboard you can just import.
+
+There are again multiple options for [installing Grafana](https://grafana.com/docs/grafana/latest/setup-grafana/)
+and detailing all of them is out of scope here, but it’s nothing too complicated if you already set up Akkoma.
+
+Once you’ve got it running and are logged into Grafana,
+you first need to tell it about the scraper which acts as the “data source”.
+For this go to the “Connections” category and select “Data Sources”.
+Here click the shiny button for adding a new data source,
+select the “Prometheus” type and fill in the details
+matching how you set up the scraper itself.
+In particular, **use the same `Scrape Interval` value!**
+
+Now you’re ready to go to the “Dashboards” page.
+Click the “New” button, select “Import” and upload or copy the contents of
+the reference dashboard `installation/grafana_dashboard.json` from Akkoma’s source tree.
+It will now ask you to select the data source you just configured,
+as well as for the name of the job in your scraper config
+and your instance domain+port identifier.
+For the example settings from step 2 above
+the latter two are `akkoma_scrape_job` and `myinstance.example:443`.
+*(`443` is the default port for HTTPS)*
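The "domain+port identifier" asked for during import can be derived from the scrape target and scheme. Below is a minimal Python sketch of that derivation, matching the example values from step 2. The helper name `instance_label` is hypothetical, IPv6 literals are not handled, and whether your scraper actually appends the default port may depend on its version, so compare against the `instance` label your scraper really stores.

```python
# Default ports per scheme; used only when the target has no explicit port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def instance_label(target, scheme="https"):
    """Return 'host:port', adding the scheme's default port if none is given."""
    _host, sep, port = target.rpartition(":")
    # Keep an explicit numeric port untouched.
    if sep and port.isdigit():
        return target
    return f"{target}:{DEFAULT_PORTS[scheme]}"


label = instance_label("myinstance.example", "https")
custom = instance_label("myinstance.example:8443", "https")
```

With the example scrape config this yields `myinstance.example:443`, the value given above for the dashboard import.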
+
+That’s it, you’ve got a fancy dashboard with long-term, 24/7 metrics now!
+Updating the dashboard can be done by just repeating the import process.
+
+Here’s an example taken from a healthy, small instance where
+nobody was logged in for about the first half of the displayed time span:  
+![Full view of the reference dashboard as it looked at the time of writing](img/grafana_dashboard.webp)
+
+!!! note
+    By default the dashboard does not count downtime of the data source, e.g. the scraper,
+    towards instance downtime, but a commented out alternative query is provided in the
+    panel edit menu. If you host the scraper on the same machine as Akkoma you likely want to swap this out.
+
+### Remarks on interpreting the data
+
+What kind of load and even error ratio is normal or irregular can depend on
+the instance size, chosen configuration and federation partners.
+*(E.g. when following relays, many more activities will be received and the received activities will in turn kick off more internal work and also external fetches, raising the overall load)*
+
+Here the 24/7 nature of the metric history helps out, since we can just
+look at "known-good" time spans to get a feeling for what’s normal and good.
+If issues without an immediately clear origin crop up,
+we can look for deviations from this known-good pattern.
+
+Still there are some things to be aware of and some common guidelines.
+
+#### Panel-specific time ranges
+
+A select few panels are set to use a custom time range
+independent from what you chose for the dashboard as a whole.
+This is indicated with blue text next to the panel title.  
+Those custom times only take precedence over _relative_ global time ranges.
+If you choose fixed start and end dates in the global setting
+*(for example to look at a long-term trend after a specific change)*
+this will take precedence over custom panel times and everything follows the date range.
+
+In the image above e.g. the uptime percent gauge thus considers the entire last week
+while most other panels only display data for the last 6 hours.
+
+#### Long-term trends
+
+The lower section of the dashboard with 24h and 48h averages is particularly useful for observing long-term trends.
+E.g. how a patch, version upgrade or database `VACUUM`/`REINDEX` affects performance.
+
+For small time ranges you can still look at them to make sure the values are at a reasonable level,
+but the upper part is probably more interesting.
+
+#### Processing times total
+
+The actions counted by various “time per second” or “time per minute” stats are partially overlapping.
+E.g. the time to conclude an HTTP response includes the time it took to run whatever
+query was needed to fetch the necessary information from the database.
+However not all database queries originate from HTTP requests.
+
+But also, not all of the recorded time might have actually consumed CPU cycles.
+Some jobs, e.g. `RemoteFetcherWorker`, will need to fetch data over the network
+and often most of the time from job start to completion is just spent waiting
+for a reply from the remote server to arrive.  
+Even a few HTTP endpoints will need to fetch remote data before completing;
+e.g. `/inbox` needs to verify the signature of the submission, but if the signing key
+wasn’t encountered before it first needs to be fetched.
+Getting deliveries from such unknown users happens more often than you might initially assume
+due to e.g. Mastodon federating actor deletions to _every server it knows about_
+regardless of whether there was ever any contact with the deleted user.
+*(Meaning in the end the key lookup will just result in a `410 Gone` response and us dropping the superfluous `Delete`)*
+
+Thus if you just add up all timing stats you’ll count some actions multiple times
+and may end up consistently with more processing time being done than time elapsing on the wall clock
+even though your server is neither overloaded nor subject to relative time dilation.
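To make the double counting concrete, here is a small Python sketch with made-up timings (in milliseconds): an HTTP request that contains a database query, plus an unrelated job that mostly waits on the network. The function names and numbers are purely illustrative and do not correspond to actual Akkoma metrics.

```python
def summed_time(intervals):
    """Naively add up all recorded durations, as summing timing stats would."""
    return sum(end - start for start, end in intervals)


def wall_clock_time(intervals):
    """Length of the union of all intervals, i.e. time that actually elapsed."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged span.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping interval: extend the current merged span.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical: HTTP request 0-1000 ms, its DB query 200-700 ms, a fetch job 500-1500 ms
intervals = [(0, 1000), (200, 700), (500, 1500)]
naive = summed_time(intervals)
elapsed = wall_clock_time(intervals)
```

Here the naive sum is 2500 ms although only 1500 ms passed on the wall clock, and even that elapsed time says nothing about how much of it was spent waiting rather than computing.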
+
+For keeping track of CPU and Elixir-side(!) IO bottlenecks,
+the corresponding BEAM VM gauges are much better indicators.
+They should be zero most of the time and never exceed zero by much.
+
+!!! note
+    The BEAM VM (running our Elixir code) cannot know about
+    the database’s IO activity or CPU cycle consumption,
+    thus this gauge is no indicator for database bottlenecks.
+
+#### Job failures and federation
+
+Most jobs are automatically retried and may fail (“exception”) due to no fault of your own instance
+e.g. network issues or a remote server temporarily being overloaded.
+Thus seeing some failures here is normal and nothing to be concerned about;
+usually it will just resolve itself on the next retry.  
+Consistent failures and/or a relatively high failure-to-success ratio, though,
+are worth looking into using logs.
+
+Of particular importance are Publisher jobs;
+they handle delivering your local content to its intended remote recipients.
+Again some PublisherWorker exceptions are no cause for concern,
+but if all retries for a delivery fail, this means a remote recipient never
+received something they should’ve seen.  
+Due to its particular importance, such final delivery failures are
+recorded again in a separate metric.
+The reference dashboard shows it in the “AP delivery failures” panel.
+Everything listed there exhausted all retries without success.  
+Ideally this will always be empty and for small instances this should be the
+case most of the time.
+However, whenever a remote instance which once interacted with
+your local instance in the past is decommissioned, delivery failures will likely
+eventually show up in your metrics. For example:
+
+ - a local user might be followed by a user from the dead instance
+ - a local post was in the past fetched by the dead instance and this post is now deleted;
+    Akkoma will attempt to deliver the `Delete` to the dead instance even if there’s no follow relationship
+
+Delivery failures for such dead instances will typically list a reason like
+`http_410`, `http_502`, `http_520`-`http_530` (cloudflare’d instances), `econnrefused`, `nxdomain` or just `timeout`.
+
+If all deliveries to a given remote instance consistently fail for a longer time,
+Akkoma will mark it as permanently unreachable and stop even attempting to deliver
+to it, meaning the errors should go away after a while.
+*(If Akkoma sees activity from the presumed dead instance again it will resume deliveries for future content, but anything in the past will remain lost)*
+
+Large instances with many users are more likely to have (had) some relationship to
+such recently decommissioned instances and thus might see failures here more often
+even if nothing is wrong with the local Akkoma instance.
+If this makes too much noise, consider filtering out telltale delivery failures.
+
+On the opposite side of things, an `http_401` error for example is always worth looking into!
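If you want to filter out the telltale dead-instance failures, the reasons listed above can be encoded in a small classifier. This is a minimal Python sketch, assuming the failure reason is available as a plain string such as `http_410`. The function name is hypothetical and the reason list is just the one from the text, not an exhaustive or official set.

```python
import re

# Reasons the text names as typical for decommissioned instances.
DEAD_INSTANCE_REASONS = {"http_410", "http_502", "econnrefused", "nxdomain", "timeout"}


def is_typical_dead_instance_reason(reason):
    """Return True if the reason matches the dead-instance patterns listed above."""
    if reason in DEAD_INSTANCE_REASONS:
        return True
    # http_520 to http_530 are described as typical for Cloudflare-fronted instances.
    match = re.fullmatch(r"http_(\d{3})", reason)
    return bool(match) and 520 <= int(match.group(1)) <= 530


noise = is_typical_dead_instance_reason("http_525")
worth_checking = not is_typical_dead_instance_reason("http_401")
```

Anything that does not match, such as `http_401`, stays visible and deserves a closer look. Keep in mind that a matching reason can still indicate a real problem if it shows up for many otherwise healthy instances at once.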
+
+## Built-in Dashboard

 Administrators can access a live dashboard under `/phoenix/live_dashboard`
 giving an overview of uptime, software versions, database stats and more.

-The dashboard also includes a variation of the prometheus metrics, however
-they do not exactly match due to respective limitations of the dashboard
-and the prometheus exporter.
-Even more important, the dashboard collects metrics locally in the browser
-only while the page is open and cannot give a view on their past history.
-For proper monitoring it is recommended to set up prometheus.
+This dashboard can also show a limited subset of Prometheus metrics,
+however besides being limited it only starts collecting data when opening
+the corresponding page in the browser and the history only exists in ephemeral browser memory.
+When navigating away from the page, all history is gone.
+However, this is not this dashboard’s main purpose anyway.
+
+The usefulness of this built-in dashboard lies in its insights into the current state of
+the BEAM VM running Akkoma’s code and statistics about the database and its performance
+as well as database diagnostics.
+BEAM VM stats include detailed memory consumption breakdowns
+and a full list of running processes for example.
+
+## Oban Web
+
+This too requires administrator rights to access and can be found under `/akkoma/oban` if enabled.
+The exposed aggregate info is mostly redundant with job statistics already tracked in Prometheus,
+but it also:
+
+ - shows not-yet executed jobs in the backlog of queues
+ - shows full argument and meta details for each job
+ - allows interactively deleting or manually retrying jobs  
+  *(keep this in mind when granting people administrator rights!)*
+
+However, there are two caveats:
+1. Just as with the other built-in dashboard, data is not kept around
+    (although here a **short** backlog actually exists);
+    when you notice an issue during use and go here to check, it likely is already too late.
+    Job details and history only exist while the jobs are still in the database;
+    by default failed and succeeded jobs will disappear after about a minute.
+2. This dashboard comes with some seemingly constant-ish overhead.
+    For large instances this appears to be negligible, but small instances on weaker hardware might suffer.
+    Thus this dashboard can be disabled in the [config](../cheatsheet.md#oban-web).
--- a/docs/docs/configuration/cheatsheet.md
+++ b/docs/docs/configuration/cheatsheet.md
@ -1193,7 +1193,7 @@ Each job has these settings:
 * `:max_running` - max concurrently running jobs
 * `:max_waiting` - max waiting jobs

-### Translation Settings
+## Translation Settings

 Settings to automatically translate statuses for end users. Currently supported
 translation services are DeepL and LibreTranslate. The supported command line tool is [Argos Translate](https://github.com/argosopentech/argos-translate).
@ -1223,3 +1223,12 @@ Translations are available at `/api/v1/statuses/:id/translations/:language`, whe
 - `:command_argos_translate` - command for `argos-translate`. Can be the command if it's in your PATH, or the full path to the file (default: `argos-translate`).
 - `:command_argospm` - command for `argospm`. Can be the command if it's in your PATH, or the full path to the file (default: `argospm`).
 - `:strip_html` - Strip html from the post before translating it (default: `true`).
+
+## Oban Web
+
+The built-in Oban Web dashboard grants all administrators access to look at and modify the instance’s job queue.
+To enable or disable it the following setting can be set to `true` or `false` respectively:
+
+```
+config :oban_met, autostart: false
+```
--- a/docs/docs/configuration/optimisation/general.md
+++ b/docs/docs/configuration/optimisation/general.md
@ -0,0 +1,48 @@
+# General Performance and Optimisation Notes
+
+# Oban Web
+
+The built-in Oban Web dashboard has a seemingly constant-ish overhead
+irrelevant to large instances but potentially
+noticeable for small instances on low power systems.
+Thus if the latter applies to your case, you might want to disable it;
+see [the cheatsheet](../cheatsheet.md#oban-web).
+
+# Relays
+
+Subscribing to relays exposes your instance to a high volume flood of incoming activities.
+This does not just incur the cost of processing those activities themselves, but typically
+each activity may trigger additional work, like fetching ancestors and child posts to
+complete the thread, refreshing user profiles, etc.  
+Furthermore the larger the count of activities and objects in your database the costlier
+all database operations on these (highly important) tables get.
+
+Carefully consider whether this is worth the cost
+and if you experience performance issues unsubscribe from relays.
+
+Regularly pruning old remote posts and orphaned activities is also especially important
+when following relays or having only just unfollowed relays for performance reasons.
+
+# Pruning old remote data
+
+Over time your instance accumulates more and more remote data, mainly in the form of posts and activities.
+Chances are you and your local users do not actually care for the vast majority of those.
+Consider regularly *(frequency highly dependent on your individual setup)* pruning such old and irrelevant remote data; see
+[the corresponding `mix` tasks](../../../administration/CLI_tasks/database#prune-old-remote-posts-from-the-database).
+
+# Database Maintenance
+
+Akkoma’s performance is highly dependent on and often bottlenecked by the database.
+Taking good care of it pays off!
+See the dedicated [PostgreSQL page](../postgresql.md).
+
+# HTTP Request Cache
+
+If your instance is frequently getting _many_ `GET` requests from external 
+actors *(i.e. everyone except logged-in local users)* an additional
+*(Akkoma already has some caching built-in and so might your reverse proxy)*
+caching layer as described in the [Varnish Cache guide](varnish_cache.md)
+might help alleviate the impact.
+
+If this condition does **not** hold though,
+setting up such a cache likely only worsens latency and wastes memory.
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@ -1,26 +1,26 @@
-certifi==2022.9.24
-charset-normalizer==2.1.1
-click==8.1.3
-ghp-import==2.1.0
-idna==3.4
-importlib-metadata==4.12.0
-Jinja2==3.1.2
-Markdown==3.3.7
-markdown-include==0.7.0
-MarkupSafe==2.1.1
-mergedeep==1.3.4
-mkdocs==1.4.2
-mkdocs-material==8.5.9
-mkdocs-material-extensions==1.1
-packaging==21.3
-Pygments==2.13.0
-pymdown-extensions==9.8
-pyparsing==3.0.9
-python-dateutil==2.8.2
-PyYAML==6.0
-pyyaml_env_tag==0.1
-requests==2.28.1
-six==1.16.0
-urllib3==1.26.12
-watchdog==2.1.9
-zipp==3.8.0
+certifi
+charset-normalizer
+click
+ghp-import
+idna
+importlib-metadata
+Jinja2
+Markdown
+markdown-include
+MarkupSafe
+mergedeep
+mkdocs
+mkdocs-material
+mkdocs-material-extensions
+packaging
+Pygments
+pymdown-extensions
+pyparsing
+python-dateutil
+PyYAML
+pyyaml_env_tag
+requests
+six
+urllib3
+watchdog
+zipp
--- a/installation/grafana_dashboard.json
+++ b/installation/grafana_dashboard.json