Commit graph

161 commits

Oneric
f576807f1b worker/receiver: don't retry unsupported actions
Observed for e.g. user delete Undos and Bite activities
2025-05-09 22:29:49 +02:00
Oneric
8cdfbf872d Merge pull request 'federation/out: tweak publish retry backoff' (#884) from Oneric/akkoma:publish_backoff into develop
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/884
2025-05-09 20:12:56 +00:00
Oneric
13940a558a Merge pull request 'Expose stats about finally failed AP deliveries in prometheus' (#882) from Oneric/akkoma:telemetry-failed-deliveries into develop
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/882
2025-05-09 20:12:01 +00:00
Oneric
d6f5f4db18 Merge pull request 'receiver_worker: prevent duplicate jobs' (#886) from Oneric/akkoma:receive_dedupe into develop
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/886
2025-05-09 19:13:14 +00:00
Oneric
0d38385d6f publisher: don't mangle between string and atom
Oban jobs can only have string args and there’s no reason to insist on atoms here.

Plus this used unchecked string_to_atom
2025-05-06 17:38:18 +02:00
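
A minimal sketch of keeping job args string-keyed end to end; module and argument names are illustrative, not Akkoma's actual publisher:

    defmodule MyApp.Workers.PublisherWorker do
      # Oban serialises job args to JSON, so keys and values come back as
      # strings; matching on string keys avoids any String.to_atom calls.
      use Oban.Worker, queue: :federator_outgoing

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"op" => "publish", "activity_id" => id}}) do
        publish(id)
      end

      defp publish(_id), do: :ok
    end
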
Oneric
2fee79e1f5 Use appropriate cancellation type for oban jobs
:discard marks jobs as "discarded", i.e. jobs which permanently failed
due to e.g. exhausting all retries or explicitly being discarded due to a
fatal error.
:cancel marks jobs as "cancelled" which does not imply failure.

While neither method counts as a job "exception" in the set of
telemetries we currently export via Prometheus, the different state
is visible in the (not-exported) metadata of oban job telemetry.
We can use handlers of those events to build bespoke statistics.

Ideally we'd like to distinguish in the receiver worker between
"invalid" and "already present or delete of unknown" documents,
but this is cumbersome to get right with a list of
free-form, human-readable descriptions of the violated constraints.
For now, just count both as a fatal error.
2025-04-15 19:40:26 +02:00
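
A bespoke handler along these lines could count terminal job states via Oban's job telemetry; the event name is Oban's documented job-stop event, but the re-emitted metric and the exact metadata shape are assumptions to verify against the Oban version in use:

    defmodule MyApp.ObanStats do
      # Counts terminal job states (e.g. :cancelled vs :discard) reported in the
      # metadata of Oban's [:oban, :job, :stop] event and re-emits them as a
      # bespoke metric that a Prometheus exporter could pick up.
      def attach do
        :telemetry.attach("oban-job-state-stats", [:oban, :job, :stop], &__MODULE__.handle_event/4, nil)
      end

      def handle_event(_event, _measurements, %{state: state, job: %Oban.Job{worker: worker}}, _config) do
        :telemetry.execute([:myapp, :oban, :job_state], %{count: 1}, %{worker: worker, state: state})
      end
    end
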
Oneric
195042bdc9 receiver_worker: prevent duplicate jobs
E.g. *oma federates (most) follower-only posts multiple times,
once to each personal inbox. This commonly leads to race conditions
where jobs for several copies run at the same time, all getting
past the initial "already known" check, but then all but
one crash with an exception from the unique db index.

Since the only special thing we do with copies anyway is to discard them,
just don't create such duplicate jobs in the first place.
For the same reason and since failed jobs don't count towards
duplicates, this should have virtually no effect on federation.
2025-03-18 03:46:33 +01:00
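
A minimal sketch of deduplicating at enqueue time with Oban's unique option; the period, keys and state list are illustrative assumptions, not the exact values used:

    defmodule MyApp.Workers.ReceiverWorker do
      # Failed/discarded jobs are left out of the unique states, so an earlier
      # failed copy never blocks a genuine retry.
      use Oban.Worker,
        queue: :federator_incoming,
        unique: [
          period: 300,
          keys: [:op, :params],
          states: [:available, :scheduled, :executing, :retryable, :completed]
        ]

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"op" => "incoming_ap_doc", "params" => _params}}), do: :ok
    end
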
Oneric
4011d20dbe federation/out: tweak publish retry backoff
With the current strategy, the individual
and cumulative backoff look like this
(the + part denotes the max extra random delay):

attempt  backoff_single   cumulative
   1      16+30                16+30
   2      47+60                63+90
   3     243+90  ≈ 4min       321+180
   4    1024+120 ≈17min      1360+300  ≈23+5min
   5    3125+150 ≈20min      4500+450  ≈75+8min
   6    7776+180 ≈ 2.1h    12291+630   ≈3.4h
   7   16807+210 ≈ 4.6h    29113+840   ≈8h
   8   32768+240 ≈ 9.1h    61896+1080  ≈17h
   9   59049+270 ≈16.4h   120960+1350  ≈33h
  10  100000+300 ≈27.7h   220975+1650  ≈61h

We default to 5 retries, meaning the last backoff runs with attempt=4.
Therefore outgoing activities might already be permanently dropped after
a downtime of only 23 minutes, which doesn't seem too implausible to occur.
Furthermore it seems excessive to retry this quickly and this often at the
beginning.
At the same time, we’d like to have at least one quick'ish retry to deal
with transient issues and maintain reasonable federation responsiveness.

If an admin wants to tolerate a one-day downtime of remotes,
retries would need to be almost doubled.

The new backoff strategy implemented in this commit instead
switches to an exponential after a few initial attempts:

attempt  backoff_single   cumulative
   1      16+30              16+30
   2     143+60             159+90
   3    2202+90  ≈37min    2361+180 ≈40min
   4    8160+120 ≈ 2.3h   10521+300 ≈ 3h
   5   77393+150 ≈21.5h   87914+450 ≈24h

Initial retries are still fast, but the same number of retries
now allows a remote downtime of at least 40 minutes. Customising
the retry count to 5 allows for whole-day downtimes.
2025-03-17 19:37:54 +01:00
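
For context, Oban lets a worker override its retry delay via the backoff/1 callback (the return value is in seconds). A sketch of the mechanism with purely illustrative constants, not the exact formula introduced here:

    defmodule MyApp.Workers.PublisherWorker do
      use Oban.Worker, queue: :federator_outgoing, max_attempts: 5

      # polynomial base plus a per-attempt random jitter; constants are made up
      @impl Oban.Worker
      def backoff(%Oban.Job{attempt: attempt}) do
        trunc(:math.pow(attempt, 4) + 15) + :rand.uniform(30 * attempt)
      end

      @impl Oban.Worker
      def perform(_job), do: :ok
    end
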
Floatingghost
f176294d6d elixir 1.18 formatting 2025-03-02 11:54:00 +00:00
Oneric
4701aa2a38 receiver_worker: log processes crashes
Oban catches crashes to handle job failure and retry,
so they never bubble up all the way and nothing is logged by default.
For better debugging, catch and log any crashes.
2025-02-14 18:46:19 +01:00
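
A sketch of such catch-and-log handling (module name illustrative; a rescue only covers exceptions, so the real change may also need to handle exits and throws):

    defmodule MyApp.Workers.ReceiverWorker do
      use Oban.Worker, queue: :federator_incoming
      require Logger

      @impl Oban.Worker
      def perform(%Oban.Job{args: args}) do
        process(args)
      rescue
        e ->
          # Oban records the failure itself and schedules a retry, but without
          # this the stacktrace never reaches the application log.
          Logger.error("ReceiverWorker crash: " <> Exception.format(:error, e, __STACKTRACE__))
          reraise e, __STACKTRACE__
      end

      defp process(_args), do: :ok
    end
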
Oneric
2c75600532 federation/incoming: improve link_resolve retry decision
To facilitate this ObjectValidator.fetch_actor_and_object is adapted to
return an informative error. Otherwise we’d be unable to make an
informed decision on retrying or not later. There’s no point in
retrying to fetch MRF-blocked stuff or private posts for example.
2025-01-07 20:27:28 +01:00
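
A sketch of the retry decision this enables; the error atoms and function names are assumptions for illustration, not Akkoma's exact ones:

    defmodule MyApp.Federator.RetryDecision do
      # A structured error reason lets the receiver pick between cancelling
      # (permanent failures) and retrying (possibly transient ones).
      def handle_incoming(data) do
        case fetch_actor_and_object(data) do
          {:ok, activity} ->
            {:ok, activity}

          # permanent: retrying cannot help (MRF-blocked, private posts, ...)
          {:error, reason} when reason in [:mrf_reject, :forbidden] ->
            {:cancel, reason}

          # possibly transient (remote down, object not yet federated): let Oban retry
          {:error, reason} ->
            {:error, reason}
        end
      end

      defp fetch_actor_and_object(_data), do: {:error, :not_found}
    end
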
Oneric
0cd4040db6 Error out earlier on missing mandatory reference
This is the only user of fetch_actor_and_object, which previously just
always pretended to be successful. For all the activity types handled
here, we absolutely need the referenced object to be able to process it
(other than Announce; whether processing those activity types for
unknown remote objects is desirable in the first place is up for debate).

All other users of the similar fetch_actor already properly check success.

Note, this currently lumps all resolve failure reasons together,
so even e.g. boosts of MRF-rejected posts will still exhaust all
retries. The following commit improves on this.
2025-01-07 20:27:28 +01:00
Oneric
0ba5c3649d federator: don't nest {:error, _} tuples
It makes decisions based on error sources harder since all possible
nesting levels need to be checked for. As shown by the return values
handled in the receiver worker something else still nests those,
but this is a first start.
2025-01-07 20:27:28 +01:00
Oneric
cbb0d4b0a8 receiver_worker: log unexpected errors
This can't handle process crash errors,
but hopefully those get a stacktrace logged by default.
2025-01-07 20:27:28 +01:00
Oneric
be2c857845 receiver_worker: don't reattempt invalid documents
Ideally we’d like to split this up more and count most invalid documents
as an error, but silently drop e.g. Deletes for unknown objects.
However, this is hard to extract from the changeset, jobs canceled
with :discard don't count as exceptions, and I'm not aware of an idiomatic
way to cancel further retries while retaining the exception status.

Thus at least keep a log, but since superfluous "Delete"s
seem kinda frequent, don't log at error, only info level.
2025-01-07 20:27:28 +01:00
Oneric
9f4d3a936f cosmetic/receiver_worker: reformat error cases
The next commit adds a multi-statement case
and then mix format will enforce this anyway
2025-01-07 20:27:28 +01:00
Oneric
f9724b5879 Don’t reattempt insertion of already known objects
Might happen if we receive e.g. a Like before the Note arrives
in our inbox and we thus already queried the Note ourselves.
2025-01-07 20:27:27 +01:00
Oneric
92544e8f99 Don't enqueue a plethora of unnecessary NodeInfoFetcher jobs
There were two issues leading to needless effort:
Most importantly, the use of AP IDs as "source_url" meant multiple
simultaneous jobs got scheduled for the same instance even with the
default unique settings.
Also jobs were scheduled unconditionally for each processed AP object,
meaning we incurred overhead from managing Oban jobs even if we knew it
wasn't necessary. By comparison the single query to check if an update
is needed should be cheaper overall.
2025-01-07 20:27:27 +01:00
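
A sketch of the enqueue-side fixes described above, with illustrative module names and an illustrative freshness check (not the actual Akkoma implementation):

    defmodule MyApp.Workers.NodeInfoFetcherWorker do
      # unique on the instance-level source_url rather than a per-actor AP id,
      # so simultaneous deliveries from one instance collapse into one job
      use Oban.Worker, queue: :nodeinfo_fetcher, unique: [period: 300, keys: [:source_url]]

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"source_url" => _url}}), do: :ok
    end

    defmodule MyApp.NodeInfo do
      # one cheap freshness check instead of unconditionally creating a job
      # for every processed AP object
      def maybe_enqueue_fetch(%{host: host, metadata_updated_at: updated_at}) do
        if is_nil(updated_at) or DateTime.diff(DateTime.utc_now(), updated_at) > 86_400 do
          %{"source_url" => "https://#{host}/"}
          |> MyApp.Workers.NodeInfoFetcherWorker.new()
          |> Oban.insert()
        else
          :ok
        end
      end
    end
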
Oneric
d283ac52c3 Don't create noop SearchIndexingWorker jobs for passive index 2025-01-07 20:27:27 +01:00
Oneric
ed4019e7a3 workers: make custom filtering ahead of enqueue possible 2025-01-07 20:27:27 +01:00
Haelwenn (lanodan) Monnier
c17681ae1e Purge obsolete ap_enabled indicator
It was used to migrate OStatus connections to ActivityPub if possible,
but support for OStatus was dropped long ago, all new actors are always AP,
and if anything wasn't migrated before, their instance is already marked
as unreachable anyway.

The associated logic was also buggy in several ways, and deleted users
got set to ap_enabled=false, also causing some issues.

This patch is a pretty direct port of the original Pleroma MR;
follow-up commits will further fix and clean up remaining issues.
Changes made (other than trivial merge conflict resolutions):
  - converted CHANGELOG format
  - adapted migration id for Akkoma’s timeline
  - removed ap_enabled from additional tests

Ported-from: https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3880
2025-01-07 20:27:26 +01:00
Oneric
e8bf4422ff Delay attachment deletion
Otherwise attachments have a high chance to disappear with akkoma-fe’s
“delete & redraft” feature when cleanup is enabled in the backend. Since
we don't know whether a deletion was intended to be part of a redraft
process, or whether the redraft was abandoned, we still have to
delete attachments eventually.
A thirty minute delay should provide sufficient time for redrafting.

Fixes: https://akkoma.dev/AkkomaGang/akkoma/issues/775
2025-01-03 20:49:11 +01:00
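
A sketch using Oban's schedule_in option to get such a delay; the worker name and args are illustrative, not necessarily Akkoma's exact ones:

    defmodule MyApp.Workers.AttachmentsCleanupWorker do
      use Oban.Worker, queue: :attachments_cleanup

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"urls" => _urls}}), do: :ok

      # Enqueue with a thirty minute delay so "delete & redraft" has time to
      # re-reference the files before cleanup actually runs.
      def enqueue(urls) when is_list(urls) do
        %{"urls" => urls}
        |> new(schedule_in: 30 * 60)
        |> Oban.insert()
      end
    end
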
Oneric
bcfbfbcff5 Don't try to cleanup remote attachments
The cleanup attachment worker was run for every deleted post,
even if it’s a remote post whose attachments we don't even store.
This was especially bad due to attachment cleanup involving a
particularly heavy query, wasting a bunch of database perf for nothing.

This was uncovered by comparing statistics from
https://akkoma.dev/AkkomaGang/akkoma/issues/784 and
https://akkoma.dev/AkkomaGang/akkoma/issues/765#issuecomment-12256
2025-01-03 20:48:46 +01:00
Mark Felder
5da9cbd8a5 RichMedia refactor
Rich Media parsing was previously handled on-demand with a 2 second HTTP request timeout and retained only in Cachex. Every time a Pleroma instance is restarted it will have to request and parse the data for each status with a URL detected. When fetching a batch of statuses they were processed in parallel to attempt to keep the maximum latency at 2 seconds, but often resulted in a timeline appearing to hang during loading due to a URL that could not be successfully reached. URLs which had image links that expire (Amazon AWS) were parsed and inserted with a TTL to ensure the image link would not break.

Rich Media data is now cached in the database and fetched asynchronously. Cachex is used as a read-through cache. When the data becomes available we stream an update to the clients. If the result is returned quickly the experience is almost seamless. Activities were already processed for their Rich Media data during ingestion to warm the cache, so users should not normally encounter the asynchronous loading of the Rich Media data.

Implementation notes:

- The async worker is a Task with a globally unique process name to prevent duplicate processing of the same URL
- The Task will attempt to fetch the data 3 times with increasing sleep time between attempts
- The HTTP request obeys the default HTTP request timeout value instead of 2 seconds
- URLs that cannot be successfully parsed due to an unexpected error receive a negative cache entry for 15 minutes
- URLs that fail with an expected error will receive a negative cache with no TTL
- Activities that have no detected URLs insert a nil value in the Cachex :scrubber_cache so we do not repeat parsing the object content with Floki every time the activity is rendered
- Expiring image URLs are handled with an Oban job
- There is no automatic cleanup of the Rich Media data in the database, but it is safe to delete at any time
- The post draft/preview feature makes the URL processing synchronous so the rendered post preview will have an accurate rendering

Overall performance of timelines and creating new posts which contain URLs is greatly improved.
2024-06-09 17:33:48 +01:00
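
A sketch of the read-through pattern with Cachex.fetch/3; the cache name and loader are illustrative:

    defmodule MyApp.RichMedia.Cache do
      # On a cache miss the fallback runs: {:commit, _} stores the value,
      # {:ignore, _} returns it without caching (used here for failures so
      # they can be retried later).
      def get(url) do
        Cachex.fetch(:rich_media_cache, url, fn _key ->
          case load(url) do
            {:ok, card} -> {:commit, card}
            {:error, _} = err -> {:ignore, err}
          end
        end)
      end

      # stand-in for "read from the database, else fetch and parse asynchronously"
      defp load(_url), do: {:error, :not_yet_parsed}
    end
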
Haelwenn (lanodan) Monnier
0c2f200b4d ReceiverWorker: Make sure non-{:ok, _} is returned as {:error, …}
Otherwise an error like `{:signature, {:error, {:error, :not_found}}}`
ends up considered a success.

Cherry-picked-from: a299ddb10e
2024-04-21 20:58:06 +02:00
Floatingghost
370576474c only consider :op and :id args in duplicate checks 2024-04-19 11:39:27 +01:00
Floatingghost
d2cee15c15 mix format says no 2024-04-16 03:07:28 +01:00
Floatingghost
d70fa16383 oban options should be a keyword list 2024-04-16 02:58:50 +01:00
Floatingghost
5043571084 Enable oban job uniqueness
by default just prevent job floods with a 1-second
uniqueness check, but override in RemoteFetcherWorker
for 5 minute uniqueness check over all states

:infinity is an option we can go for maybe at some point,
but that would prevent any refetches so maybe not idk.
2024-04-16 02:53:24 +01:00
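
A sketch of the per-worker override described above; the period and state list follow the commit message, everything else is illustrative:

    defmodule MyApp.Workers.RemoteFetcherWorker do
      # 5 minute uniqueness window over all job states
      use Oban.Worker,
        queue: :remote_fetcher,
        unique: [period: 300, states: Oban.Job.states()]

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"id" => _id}}), do: :ok

      def enqueue(ap_id) do
        # a duplicate within the window comes back as the existing job with
        # conflict?: true instead of a fresh insert
        %{"id" => ap_id}
        |> new()
        |> Oban.insert()
      end
    end
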
Floatingghost
b7dd739de1 Make sure we return the right format for oban 2024-04-16 02:35:21 +01:00
Floatingghost
2fc25980d1 fix pattern matching in fetch errors 2024-04-13 23:55:26 +01:00
Mark Felder
2e369aef71 Allow the Remote Fetcher to attempt fetching an unreachable instance 2024-04-12 20:33:21 +01:00
Mark Felder
fed7a78c77 Oban jobs should be discarded on permanent errors 2024-04-12 20:33:17 +01:00
Mark Felder
ff515c05c3 Prevent requeuing Remote Fetcher jobs that exceed thread depth 2024-04-12 20:32:31 +01:00
Mark Felder
7e5004b3e2 Leverage existing atoms as return errors for the object fetcher 2024-04-12 20:32:13 +01:00
Mark Felder
e2b04fac5a Skip remote fetch jobs for unreachable instances 2024-04-12 20:28:36 +01:00
Mark Felder
6d368808d3 Remove mistaken duplicate fetch 2024-04-12 20:28:31 +01:00
Mark Felder
132036f951 Cancel remote fetch jobs for deleted objects 2024-04-12 20:28:21 +01:00
Mark Felder
4c29366fe5 Mark instances as unreachable when returning a 403 from an object fetch
This is a definite sign the instance is blocked and they are enforcing authorized_fetch
2024-04-12 20:27:33 +01:00
Oneric
1a7839eaf2 Prune old Update activities
Once processed they serve no purpose anymore afaict.
Therefore, let's prune them like other transient activities
to avoid unnecessarily bloating the table.
2024-02-17 16:57:40 +01:00
floatingghost
6b882a2c0b Purge Rejected Follow requests in daily task (#334)
Co-authored-by: FloatingGhost <hannah@coffee-and-dreams.uk>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/334
2022-12-03 23:17:43 +00:00
floatingghost
db60640c5b Fixing up deletes a bit (#327)
Co-authored-by: FloatingGhost <hannah@coffee-and-dreams.uk>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/327
2022-12-01 15:00:53 +00:00
FloatingGhost
ee7059c9cf Spin off imports into n oban jobs 2022-11-27 21:45:41 +00:00
floatingghost
2a1f17e3ed and i yoink (#275)
Co-authored-by: Mark Felder <feld@feld.me>
Co-authored-by: FloatingGhost <hannah@coffee-and-dreams.uk>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/275
2022-11-14 15:07:26 +00:00
floatingghost
c1127e321b Add configurable timeline per oban job (#273)
Heavily inspired by https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3777

Co-authored-by: FloatingGhost <hannah@coffee-and-dreams.uk>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/273
2022-11-13 23:55:51 +00:00
floatingghost
b7e8ce2350 Scrape instance nodeinfo (#251)
Co-authored-by: FloatingGhost <hannah@coffee-and-dreams.uk>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/251
2022-11-06 22:49:39 +00:00
FloatingGhost
d3b9cfb03f use :discard instead of cancel 2022-08-11 19:17:50 +01:00
floatingghost
1245141779 treat rejections in MRF as a reject in federator (#155)
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/155
2022-08-08 15:47:57 +00:00
Tusooa Zhu
f08241c8ab Allow users to create backups without providing email address
Ref: backup-without-email
2022-08-02 22:16:54 -04:00
Ekaterina Vaartis
7aebff799b Fix meilisearch tests and jobs for oban 2022-06-29 20:49:45 +01:00