:discard marks jobs as "discarded", i.e. jobs which permanently failed
due to e.g. exhausting all retries or explicitly being discarded due to a
fatal error.
:cancel marks jobs as "cancelled" which does not imply failure.
While neither outcome counts as a job "exception" in the set of
telemetry events we currently export via Prometheus, the different state
is visible in the (not exported) metadata of Oban job telemetry.
We can use handlers of those events to build bespoke statistics.
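
A minimal sketch of such a handler, using only the standard library.
It assumes the outcome arrives as a `:state` key (`:discard` or
`:cancelled`) in the metadata of the `[:oban, :job, :stop]` event and
that a `:worker` key is present; the module name and the counting
scheme are made up for illustration.

```elixir
defmodule JobOutcomeStats do
  # Hypothetical module, not part of Oban or of this codebase.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Has the shape of a :telemetry handler: (event, measurements, metadata, config).
  # Assumption: discarded and cancelled jobs show up here with a :state key.
  def handle_event([:oban, :job, :stop], _measurements, %{state: state} = meta, _config)
      when state in [:discard, :cancelled] do
    worker = Map.get(meta, :worker, "unknown")
    Agent.update(__MODULE__, &Map.update(&1, {worker, state}, 1, fn n -> n + 1 end))
  end

  # Everything else is ignored.
  def handle_event(_event, _measurements, _metadata, _config), do: :ok

  def count(worker, state) do
    Agent.get(__MODULE__, &Map.get(&1, {worker, state}, 0))
  end
end

# Attaching it requires the :telemetry dependency, e.g.:
# :telemetry.attach("job-outcome-stats", [:oban, :job, :stop],
#                   &JobOutcomeStats.handle_event/4, nil)
```
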
Ideally we'd like to distinguish in the receiver worker between
"invalid" and "already present or delete of unknown" documents,
but this is cumbersome to get right with a list of
free-form, human-readable descriptions of the violated constraints.
For now, just count both as a fatal error.
E.g. \*oma federates (most) follower-only posts multiple times
to each personal inbox. This commonly leads to race conditions:
jobs for several copies run at the same time and all get
past the initial "already known" check, but later all but
one crash with an exception from the unique DB index.
Since the only special thing we do with copies anyway is to discard them,
just don't create such duplicate jobs in the first place.
For the same reason and since failed jobs don't count towards
duplicates, this should have virtually no effect on federation.
Oban catches crashes to handle job failure and retry,
so they never bubble up all the way and nothing is logged by default.
For better debugging, catch and log any crashes.
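
Roughly, the idea is to wrap the work, log whatever escapes and then
re-raise so the job runner still sees the failure. The helper below is
a standalone sketch; its name and the log wording are invented here
and do not mirror the actual change.

```elixir
defmodule CrashLogging do
  require Logger

  # Hypothetical helper: runs `fun`, logs anything that escapes it and
  # re-raises, so the caller's failure and retry handling is unaffected.
  def with_crash_log(label, fun) when is_function(fun, 0) do
    fun.()
  rescue
    err ->
      Logger.error("#{label} crashed:\n" <> Exception.format(:error, err, __STACKTRACE__))
      reraise err, __STACKTRACE__
  catch
    # Throws and exits are not exceptions, so they need their own clause.
    kind, value ->
      Logger.error("#{label} crashed:\n" <> Exception.format(kind, value, __STACKTRACE__))
      :erlang.raise(kind, value, __STACKTRACE__)
  end
end
```
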
To facilitate this ObjectValidator.fetch_actor_and_object is adapted to
return an informative error. Otherwise we’d be unable to make an
informed decision on retrying or not later. There’s no point in
retrying to fetch MRF-blocked stuff or private posts for example.
This is the only user of fetch_actor_and_object which previously just
always pretended to be successful. For all the activity types handled
here, we absolutely need the referenced object to be able to process it
(other than Announce, whether processing those activity types for
unknown remote objects is desirable in the first place is up for debate).
All other users of the similar fetch_actor already properly check success.
Note, this currently lumps all resolve failure reasons together,
so even e.g. boosts of MRF rejected posts will still exhaust all
retries. The following commit improves on this.
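
As an illustration of what telling failure reasons apart could look
like, here is a small sketch. The reason atoms and the module are
placeholders, not the values actually returned by the fetch code;
`{:cancel, reason}` is used as the "don't retry" return value.

```elixir
defmodule RetryDecision do
  # Hypothetical reasons for which a retry cannot succeed,
  # e.g. MRF-rejected or private objects.
  @permanent [:mrf_rejected, :private]

  # Success passes through unchanged.
  def to_job_result({:ok, _} = ok), do: ok

  # Permanent failures stop further retries.
  def to_job_result({:error, reason}) when reason in @permanent, do: {:cancel, reason}

  # Anything else stays an error and is retried as usual.
  def to_job_result({:error, reason}), do: {:error, reason}
end
```
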
It makes decisions based on error sources harder since all possible
nesting levels need to be checked for. As shown by the return values
handled in the receiver worker, something else still nests those,
but this is a start.
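
To show why nesting is a burden, this is the kind of unwrapping a
consumer would otherwise need before it can match on the reason. The
helper is purely illustrative and not part of the change.

```elixir
defmodule ErrorFlatten do
  # Hypothetical helper: peel off repeated {:error, {:error, ...}} wrappers.
  def flatten({:error, {:error, _} = inner}), do: flatten(inner)

  # Anything that is not a nested error tuple is returned as-is.
  def flatten(other), do: other
end
```
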
Ideally we’d like to split this up more and count most invalid documents
as an error, but silently drop e.g. Deletes for unknown objects.
However, this is hard to extract from the changeset, jobs cancelled
with :discard don’t count as exceptions, and I’m not aware of an idiomatic
way to cancel further retries while retaining the exception status.
Thus at least keep a log, but since superfluous "Delete"s
seem kinda frequent, log at info rather than error level.