Merge pull request 'Provide reference Grafana dashboard and improve docs related to monitoring+perf' (#966) from Oneric/akkoma:grafana_ref into develop

Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/966
Oneric 2025-08-23 16:27:18 +00:00
commit 7c1a08913f
6 changed files with 3765 additions and 55 deletions

@ -26,13 +26,19 @@ Replaces embedded objects with references to them in the `objects` table. Only n
## Prune old remote posts from the database
This will prune remote posts older than 90 days (configurable with [`config :pleroma, :instance, remote_post_retention_days`](../../configuration/cheatsheet.md#instance)) from the database. Pruned posts may be refetched in some cases.
This will selectively prune remote posts older than 90 days (configurable with [`config :pleroma, :instance, remote_post_retention_days`](../../configuration/cheatsheet.md#instance)) from the database. Pruned posts may be refetched in some cases.
!!! note
The disk space will only be reclaimed after a proper vacuum. By default, Postgresql does this for you on a regular basis, but if your instance has been running for a long time and there are many rows deleted, it may be advantageous to use `VACUUM FULL` (e.g. by using the `--vacuum` option).
The disk space used up by deleted rows only becomes usable for new data after a vacuum.
By default, PostgreSQL does this for you on a regular basis, but if you delete a lot at once
it might be advantageous to also manually kick off a vacuum and statistics update using `VACUUM ANALYZE`.
**However**, the freed up space is generally not returned to the operating system unless you run
the much heavier `VACUUM FULL` operation. This expensive but comprehensive vacuum mode
can be scheduled using the `--vacuum` option.
!!! danger
You may run out of disk space during the execution of the task or vacuuming if you don't have about 1/3rds of the database size free. Vacuum causes a substantial increase in I/O traffic, and may lead to a degraded experience while it is running.
You may run out of disk space during the execution of the task or full vacuuming if you don't have about 1/3 of the database size free. `VACUUM FULL` causes a substantial increase in I/O traffic, needs full table locks and thus renders the instance basically unusable while it's running.
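The one-third rule of thumb from the warning above can be checked before starting a prune or full vacuum. The following is only a minimal sketch: the helper names are hypothetical and not part of Akkoma, and it assumes you obtained the database size yourself, e.g. with `SELECT pg_database_size(current_database());` in `psql`.

```python
import shutil


def has_vacuum_headroom(db_size_bytes: int, free_bytes: int) -> bool:
    """Return True if free space is at least one third of the database size.

    Integer arithmetic avoids floating point rounding at the boundary.
    """
    return free_bytes * 3 >= db_size_bytes


def free_bytes_at(path: str) -> int:
    """Free bytes on the filesystem holding `path` (e.g. the PostgreSQL data directory)."""
    return shutil.disk_usage(path).free


# Hypothetical example: a 30 GiB database with 12 GiB of free disk space.
GIB = 1024**3
enough = has_vacuum_headroom(db_size_bytes=30 * GIB, free_bytes=12 * GIB)
```

In practice you would pass `free_bytes_at(...)` for the volume your database lives on; the threshold is only the rough estimate given above, not a guarantee.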
=== "OTP"
@ -48,6 +54,27 @@ This will prune remote posts older than 90 days (configurable with [`config :ple
### Options
The recommended starting point and configuration for small and medium-sized instances is:
```sh
prune_objects --keep-followed posts --keep-threads --keep-non-public
# followed by
prune_orphaned_activities --no-singles
prune_orphaned_activities --no-arrays
# and finally, using psql to manually run:
# VACUUM ANALYZE;
# REINDEX TABLE objects;
# REINDEX TABLE activities;
```
This almost certainly won't delete stuff you're interested in and
makes sure the database immediately utilises the newly freed up space.
If you need more aggressive database size reductions or if this proves too costly to run for you,
you can drop restrictions and/or use the `--limit` option.
In the opposite case, if everything goes through quickly,
you can combine the three CLI tasks into one for future runs using `--prune-orphaned-activities`
and perhaps even run a full vacuum (which implies a reindex) using `--vacuum` too.
Full details below:
- `--keep-followed <mode>` - If set to `posts` all posts and boosts of users with local follows will be kept.
If set to `full` it will additionally keep any posts such users interacted with; this requires `--keep-threads`.
By default this is set to `none` and followed users are not treated specially.
@ -56,7 +83,8 @@ This will prune remote posts older than 90 days (configurable with [`config :ple
- `--limit` - limits how many remote posts get pruned. This limit does **not** apply to any of the follow up jobs. If wanting to keep the database load in check it is thus advisable to run the standalone `prune_orphaned_activities` task with a limit afterwards instead of passing `--prune-orphaned-activities` to this task.
- `--prune-orphaned-activities` - Also prune orphaned activities afterwards. Activities are things like Like, Create, Announce, Flag (aka reports)... They can significantly help reduce the database size.
- `--prune-pinned` - Also prune pinned posts; keeping pinned posts does not suffice to protect their threads from pruning, even when using `--keep-threads`.
Note, if using this option and pinned posts are pruned, they and their threads will just be refetched on the next user update. Therefore it usually doesn't bring much gain while incurring a heavy fetch load after pruning.
Note, if using this option and pinned posts are pruned, they and their threads will just be refetched on the next user update. Therefore it usually doesn't bring much gain while incurring a heavy fetch load after pruning.
One exception to this is if you already need to use a relatively small `--limit` to keep downtime manageable or even to be able to run it without downtime. Retaining pinned posts adds a mostly constant overhead which will impact repeated runs with a small limit much more than one full prune run.
- `--vacuum` - Run `VACUUM FULL` after the objects are pruned. This should not be used on a regular basis, but is useful if your instance has been running for a long time before pruning.
## Prune orphaned activities from the database

@ -1,45 +1,275 @@
# Monitoring Akkoma
If you run akkoma, you may be inclined to collect metrics to ensure your instance is running smoothly,
and that there's nothing quietly failing in the background.
If you run Akkoma, it's a good idea to collect metrics to ensure your instance is running smoothly
without anything silently failing and to aid troubleshooting if something actually goes wrong.
To facilitate this, akkoma exposes a dashboard and prometheus metrics to be scraped.
To facilitate this, Akkoma exposes Prometheus metrics to be scraped for long-term 24/7 monitoring
as well as two built-in dashboards with ephemeral info about just the current status.
Setting up Prometheus scraping is highly recommended.
## Prometheus
See: [export\_prometheus\_metrics](../../configuration/cheatsheet#instance)
This method gives a more or less complete overview and allows for 24/7 long-term monitoring.
To scrape prometheus metrics, we need an oauth2 token with the `admin:metrics` scope.
Prometheus metric export can be globally disabled if you really want to,
but it doesn't cause much overhead and is enabled by default: see the
[export\_prometheus\_metrics](../../configuration/cheatsheet#instance) config option.
Consider using [constanze](https://akkoma.dev/AkkomaGang/constanze) to make this easier -
Akkoma only exposes the current state of all metrics; to make it actually useful
an external scraper needs to regularly fetch and store those values.
An overview of the necessary steps follows.
### Step 1: generate a token
Accessing Prometheus metrics requires an OAuth2 token with the `admin:metrics` (sub)scope.
An access token with only this subscope will be unable to do anything at all _except_ looking at the exported metrics.
Assuming your account has access to the `admin` scope category,
a suitable metrics-only token can be conveniently generated using
[constanze](https://akkoma.dev/AkkomaGang/constanze).
If you didn't already do so before, set up `constanze` by running `constanze configure`.
Now getting the token is as simple as running the below command and following its instructions:
```bash
constanze token --client-app --scopes "admin:metrics" --client-name "Prometheus"
```
or see `scripts/create_metrics_app.sh` in the source tree for the process to get this token.
Alternatively you may manually call into the token and app APIs;
check `scripts/create_metrics_app.sh` in the source tree for the process for this.
Once you have your token of the form `Bearer $ACCESS_TOKEN`, you can use that in your prometheus config:
The resulting token will have the form `Bearer $ACCESS_TOKEN`;
in the following replace occurrences of `$ACCESS_TOKEN` with the actual token string everywhere.
If you wish, you can now check the token works by manually using it to query the current metrics with `curl`:
```yaml
- job_name: akkoma
scheme: https
authorization:
credentials: $ACCESS_TOKEN # this should have the bearer prefix removed
metrics_path: /api/v1/akkoma/metrics
static_configs:
- targets:
- example.com
!!! note
After restarting the instance it may take a couple minutes for content to show up in the metric endpoint
```sh
curl -i -H 'Authorization: Bearer $ACCESS_TOKEN' https://myinstance.example/api/v1/akkoma/metrics | head -n 100
```
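To sanity-check what the endpoint returns, the response body can be parsed as Prometheus text exposition format. The sketch below only extracts the metric names from such a body; the sample text and the metric names in it are made up for illustration and are not Akkoma's actual metric names.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (start of labels) or space (start of value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Hypothetical sample body, e.g. saved from the curl call above.
sample = """\
# HELP example_requests_total Made-up counter for illustration.
# TYPE example_requests_total counter
example_requests_total{method="GET"} 42
example_requests_total{method="POST"} 7
example_queue_length 3
"""

names = metric_names(sample)
```

An empty result shortly after a restart is not necessarily an error, since it may take a couple of minutes for metrics to show up.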
## Dashboard
### Step 2: set up a scraper
You may use the eponymous [Prometheus](https://prometheus.io/)
or anything compatible with it like e.g. [VictoriaMetrics](https://victoriametrics.com/).
The latter claims better performance and storage efficiency.
Both of them can usually be easily installed via distro-packages or docker.
Depending on your distro or installation method the preferred way to change the CLI arguments and the location of config files may differ; consult the documentation of your chosen method to find out.
Of special interest is the location of the prometheus scraping config file
and perhaps the maximal data retention period setting,
to manage used disk space and make sure you keep records long enough for your purposes.
It might also be a good idea to set up a minimal buffer of free disk space if you're tight on that;
with VictoriaMetrics this can be done via the `-storage.minFreeDiskSpaceBytes 1GiB` CLI flag.
Ideally the scraper runs on a different machine than Akkoma to be able to
distinguish Akkoma downtime from scraper downtime, but this is not strictly necessary.
Once you've installed one of them, it's time to add a job for scraping Akkoma.
For Prometheus the `scrape_configs` section will usually be added to the main config file,
for VictoriaMetrics this will be in the file passed via `-promscrape.config file_path`.
In either case a `scrape_configs` with just one job for a single Akkoma instance will look like this:
```yaml
scrape_configs:
- job_name: 'akkoma_scrape_job'
scheme: https
metrics_path: /api/v1/akkoma/metrics
static_configs:
- targets: ['myinstance.example']
# reminder: no Bearer prefix here!
bearer_token: '$ACCESS_TOKEN'
# One minute should be frequent enough, but you can choose any value, or rely on the global default.
# However, for later use in Grafana you want to match this exactly, thus make note.
scrape_interval: '1m'
```
Now (re)start the scraper service, wait for a multiple of the scrape interval and check logs
to make sure no errors occur.
### Step 3: visualise the collected data
At last it's time to actually get a look at the collected data.
There are many options working with Prometheus-compatible backends
and even software which can act as both the scraper _and_ visualiser in one service.
Here we'll just deal with Grafana, since we ship a reference Grafana dashboard you can just import.
There are again multiple options for [installing Grafana](https://grafana.com/docs/grafana/latest/setup-grafana/)
and detailing all of them is out of scope here, but it's nothing too complicated if you already set up Akkoma.
Once you've got it running and are logged into Grafana,
you first need to tell it about the scraper which acts as the “data source”.
For this go to the “Connections” category and select “Data Sources”.
Here click the shiny button for adding a new data source,
select the “Prometheus” type and fill in the details
matching how you set up the scraper itself.
In particular, **use the same `Scrape Interval` value!**
Now you're ready to go to the “Dashboards” page.
Click the “New” button, select “Import” and upload or copy the contents of
the reference dashboard `installation/grafana_dashboard.json` from Akkoma's source tree.
It will now ask you to select the data source you just configured,
as well as for the name of the job in your scraper config
and your instance domain+port identifier.
For the example settings from step 2 above
the latter two are `akkoma_scrape_job` and `myinstance.example:443`.
*(`443` is the default port for HTTPS)*
That's it, you've got a fancy dashboard with long-term, 24/7 metrics now!
Updating the dashboard can be done by just repeating the import process.
Here's an example taken from a healthy, small instance where
nobody was logged in for about the first half of the displayed time span:
![Full view of the reference dashboard as it looked at the time of writing](img/grafana_dashboard.webp)
!!! note
By default the dashboard does not count downtime of the data source, e.g. the scraper,
towards instance downtime, but a commented out alternative query is provided in the
panel edit menu. If you host the scraper on the same machine as Akkoma you likely want to swap this out.
### Remarks on interpreting the data
What kind of load and even error ratio is normal or irregular can depend on
the instance size, chosen configuration and federation partners.
*(E.g. when following relays, many more activities will be received and the received activities will in turn kick off more internal work and also external fetches, raising the overall load)*
Here the 24/7 nature of the metric history helps out, since we can just
look at "known-good" time spans to get a feeling for what's normal and good.
If issues without an immediately clear origin crop up,
we can look for deviations from this known-good pattern.
Still there are some things to be aware of and some common guidelines.
#### Panel-specific time ranges
A select few panels are set to use a custom time range
independent from what you chose for the dashboard as a whole.
This is indicated with blue text next to the panel title.
Those custom times only take precedence over _relative_ global time ranges.
If you choose fixed start and end dates in the global setting
*(for example to look at a long-term trend after a specific change)*
this will take precedence over custom panel times and everything follows the date range.
In the image above e.g. the uptime percent gauge thus considers the entire last week
while most other panels only display data for the last 6 hours.
#### Long-term trends
The lower section of the dashboard with 24h and 48h averages is particularly useful for observing long-term trends.
E.g. how a patch, version upgrade or database `VACUUM`/`REINDEX` affects performance.
For small time ranges you can still look at them to make sure the values are at a reasonable level,
but the upper part is probably more interesting.
#### Processing times total
The actions counted by various “time per second” or “time per minute” stats are partially overlapping.
E.g. the time to conclude an HTTP response includes the time it took to run whatever
query was needed to fetch the necessary information from the database.
However not all database queries originate from HTTP requests.
But also, not all of the recorded time might have actually consumed CPU cycles.
Some jobs, e.g. `RemoteFetcherWorker`, will need to fetch data over the network
and often most of the time from job start to completion is just spent waiting
for a reply from the remote server to arrive.
Even a few HTTP endpoints will need to fetch remote data before completing;
e.g. `/inbox` needs to verify the signature of the submission, but if the signing key
wasn't encountered before it first needs to be fetched.
Getting deliveries from such unknown users happens more often than you might initially assume
due to e.g. Mastodon federating actor deletions to _every server it knows about_
regardless of whether there was ever any contact with the deleted user.
*(Meaning in the end the key lookup will just result in a `410 Gone` response and us dropping the superfluous `Delete`)*
Thus if you just add up all timing stats you'll count some actions multiple times
and may end up consistently with more processing time being done than time elapsing on the wall clock
even though your server is neither overloaded nor subject to relative time dilation.
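A small worked example makes the double counting concrete. All numbers below are made up for illustration and do not come from a real instance; the stat names are hypothetical labels, not actual metric names.

```python
# Hypothetical timings recorded during one 60-second window.
WALL_CLOCK_SECONDS = 60.0

recorded_seconds = {
    "http_responses": 40.0,    # includes time spent waiting on database queries
    "database_queries": 35.0,  # partly the same seconds already counted above
    "jobs": 20.0,              # mostly idle waiting for replies from remote servers
}


def naive_total(timings: dict[str, float]) -> float:
    """Add up all timing stats without accounting for any overlap."""
    return sum(timings.values())


total = naive_total(recorded_seconds)
exceeds_wall_clock = total > WALL_CLOCK_SECONDS
```

Here the naive sum is 95 seconds of "processing" within a 60-second window, even though nothing in this made-up scenario implies an overloaded server.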
For keeping track of CPU and elixir-side(!) IO bottlenecks,
the corresponding BEAM VM gauges are much better indicators.
They should be zero most of the time and never exceed zero by much.
!!! note
The BEAM VM (running our elixir code) cannot know about
the database's IO activity or CPU cycle consumption,
thus this gauge is no indicator for database bottlenecks.
#### Job failures and federation
Most jobs are automatically retried and may fail (“exception”) due to no fault of your own instance
e.g. network issues or a remote server temporarily being overloaded.
Thus seeing some failures here is normal and nothing to be concerned about;
usually it will just resolve itself on the next retry.
Consistent failures and/or a relatively high failure-to-success ratio, though,
are worth looking into using the logs.
Of particular importance are Publisher jobs;
they handle delivering your local content to its intended remote recipients.
Again some PublisherWorker exceptions are no cause for concern,
but if all retries for a delivery fail, this means a remote recipient never
received something they should've seen.
Due to its particular importance, such final delivery failures are
recorded again in a separate metric.
The reference dashboard shows it in the “AP delivery failures” panel.
Everything listed there exhausted all retries without success.
Ideally this will always be empty and for small instances this should be the
case most of the time.
However, whenever a remote instance which once interacted with
your local instance in the past is decommissioned, delivery failures will likely
eventually show up in your metrics. For example:
- a local user might be followed by a user from the dead instance
- a local post was in the past fetched by the dead instance and this post is now deleted;
Akkoma will attempt to deliver the `Delete` to the dead instance even if there's no follow relationship
Delivery failures for such dead instances will typically list a reason like
`http_410`, `http_502`, `http_520`-`http_530` (cloudflared instances), `econnrefused`, `nxdomain` or just `timeout`.
If all deliveries to a given remote instance consistently fail for a longer time,
Akkoma will mark it as permanently unreachable and stop even attempting to deliver
to it meaning the errors should go away after a while.
*(If Akkoma sees activity from the presumed dead instance again it will resume deliveries for future content, but anything in the past will remain lost)*
Large instances with many users are more likely to have (had) some relationship to
such recently decommissioned instances and thus might see failures here more often
even if nothing is wrong with the local Akkoma instance.
If this makes too much noise, consider filtering out telltale delivery failures.
On the opposite side of things, an `http_401` error for example is always worth looking into!
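If you post-process failure reasons yourself, the telltale reasons listed above can be separated from the ones worth investigating. This is only a sketch under the assumption that reasons arrive as plain strings like those shown above; the function name is hypothetical and this is not how the reference dashboard filters.

```python
import re

# Reasons named above as typical for decommissioned instances.
DEAD_INSTANCE_REASONS = {"http_410", "http_502", "econnrefused", "nxdomain", "timeout"}


def is_telltale_dead_instance(reason: str) -> bool:
    """Return True if the failure reason typically indicates a dead remote instance."""
    if reason in DEAD_INSTANCE_REASONS:
        return True
    # http_520 through http_530 are the Cloudflare-specific status codes.
    match = re.fullmatch(r"http_(\d{3})", reason)
    return match is not None and 520 <= int(match.group(1)) <= 530


# Hypothetical list of failure reasons collected over some time span.
reasons = ["nxdomain", "http_401", "http_525", "timeout"]
worth_investigating = [r for r in reasons if not is_telltale_dead_instance(r)]
```

Anything left over after filtering, such as `http_401`, deserves a closer look in the logs.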
## Built-in Dashboard
Administrators can access a live dashboard under `/phoenix/live_dashboard`
giving an overview of uptime, software versions, database stats and more.
The dashboard also includes a variation of the prometheus metrics, however
they do not exactly match due to respective limitations of the dashboard
and the prometheus exporter.
Even more important, the dashboard collects metrics locally in the browser
only while the page is open and cannot give a view on their past history.
For proper monitoring it is recommended to set up prometheus.
This dashboard can also show a limited subset of Prometheus metrics,
however besides being limited it only starts collecting data when opening
the corresponding page in the browser and the history only exists in ephemeral browser memory.
When navigating away from the page, all history is gone.
However, this is not this dashboard's main purpose anyway.
The usefulness of this built-in dashboard lies in the insights into the current state of
the BEAM VM running Akkoma's code and statistics about the database and its performance
as well as database diagnostics.
BEAM VM stats include detailed memory consumption breakdowns
and a full list of running processes for example.
## Oban Web
This too requires administrator rights to access and can be found under `/akkoma/oban` if enabled.
The exposed aggregate info is mostly redundant with job statistics already tracked in Prometheus,
but it additionally also:
- shows not-yet executed jobs in the backlog of queues
- shows full argument and meta details for each job
- allows interactively deleting or manually retrying jobs
*(keep this in mind when granting people administrator rights!)*
However, there are two caveats:
1. Just as with the other built-in dashboard, data is not kept around
(although here a **short** backlog actually exists);
when you notice an issue during use and go here to check, it is likely already too late.
Job details and history only exist while the jobs are still in the database;
by default failed and succeeded jobs will disappear after about a minute.
2. This dashboard comes with some seemingly constant-ish overhead.
For large instances this appears to be negligible, but small instances on weaker hardware might suffer.
Thus this dashboard can be disabled in the [config](../cheatsheet.md#oban-web).

@ -1193,7 +1193,7 @@ Each job has these settings:
* `:max_running` - max concurrently running jobs
* `:max_waiting` - max waiting jobs
### Translation Settings
## Translation Settings
Settings to automatically translate statuses for end users. Currently supported
translation services are DeepL and LibreTranslate. The supported command line tool is [Argos Translate](https://github.com/argosopentech/argos-translate).
@ -1223,3 +1223,12 @@ Translations are available at `/api/v1/statuses/:id/translations/:language`, whe
- `:command_argos_translate` - command for `argos-translate`. Can be the command if it's in your PATH, or the full path to the file (default: `argos-translate`).
- `:command_argospm` - command for `argospm`. Can be the command if it's in your PATH, or the full path to the file (default: `argospm`).
- `:strip_html` - Strip html from the post before translating it (default: `true`).
## Oban Web
The built-in Oban Web dashboard grants all administrators access to look at and modify the instance's job queue.
To enable or disable it the following setting can be set to `true` or `false` respectively:
```
config :oban_met, autostart: false
```

@ -0,0 +1,48 @@
# General Performance and Optimisation Notes
# Oban Web
The built-in Oban Web dashboard has a seemingly constant-ish overhead
irrelevant to large instances but potentially
noticeable for small instances on low power systems.
Thus if the latter applies to your case, you might want to disable it;
see [the cheatsheet](../cheatsheet.md#oban-web).
# Relays
Subscribing to relays exposes your instance to a high volume flood of incoming activities.
This does not just incur the cost of processing those activities themselves, but typically
each activity may trigger additional work, like fetching ancestors and child posts to
complete the thread, refreshing user profiles, etc.
Furthermore the larger the count of activities and objects in your database the costlier
all database operations on these (highly important) tables get.
Carefully consider whether this is worth the cost
and if you experience performance issues unsubscribe from relays.
Regularly pruning old remote posts and orphaned activities is also especially important
when following relays or just having unfollowed relays for performance reasons.
# Pruning old remote data
Over time your instance accumulates more and more remote data, mainly in form of posts and activities.
Chances are you and your local users do not actually care for the vast majority of those.
Consider regularly *(frequency highly dependent on your individual setup)* pruning such old and irrelevant remote data; see
[the corresponding `mix` tasks](../../../administration/CLI_tasks/database#prune-old-remote-posts-from-the-database).
# Database Maintenance
Akkoma's performance is highly dependent on and often bottle-necked by the database.
Taking good care of it pays off!
See the dedicated [PostgreSQL page](../postgresql.md).
# HTTP Request Cache
If your instance is frequently getting _many_ `GET` requests from external
actors *(i.e. everyone except logged-in local users)* an additional
*(Akkoma already has some caching built-in and so might your reverse proxy)*
caching layer as described in the [Varnish Cache guide](varnish_cache.md)
might help alleviate the impact.
If this condition does **not** hold though,
setting up such a cache likely only worsens latency and wastes memory.

@ -1,26 +1,26 @@
certifi==2022.9.24
charset-normalizer==2.1.1
click==8.1.3
ghp-import==2.1.0
idna==3.4
importlib-metadata==4.12.0
Jinja2==3.1.2
Markdown==3.3.7
markdown-include==0.7.0
MarkupSafe==2.1.1
mergedeep==1.3.4
mkdocs==1.4.2
mkdocs-material==8.5.9
mkdocs-material-extensions==1.1
packaging==21.3
Pygments==2.13.0
pymdown-extensions==9.8
pyparsing==3.0.9
python-dateutil==2.8.2
PyYAML==6.0
pyyaml_env_tag==0.1
requests==2.28.1
six==1.16.0
urllib3==1.26.12
watchdog==2.1.9
zipp==3.8.0
certifi
charset-normalizer
click
ghp-import
idna
importlib-metadata
Jinja2
Markdown
markdown-include
MarkupSafe
mergedeep
mkdocs
mkdocs-material
mkdocs-material-extensions
packaging
Pygments
pymdown-extensions
pyparsing
python-dateutil
PyYAML
pyyaml_env_tag
requests
six
urllib3
watchdog
zipp

File diff suppressed because it is too large