With the current strategy, the individual and cumulative backoff looks
like this (values in seconds; the + part denotes the max extra random
delay):
attempt   backoff_single        cumulative
   1          16+30                 16+30
   2          47+60                 63+90
   3         243+90   ≈ 4min       321+180
   4        1024+120  ≈17min      1360+300   ≈23+5min
   5        3125+150  ≈52min      4500+450   ≈75+8min
   6        7776+180  ≈ 2.1h     12291+630   ≈ 3.4h
   7       16807+210  ≈ 4.6h     29113+840   ≈ 8h
   8       32768+240  ≈ 9.1h     61896+1080  ≈17h
   9       59049+270  ≈16.4h    120960+1350  ≈33h
  10      100000+300  ≈27.7h    220975+1650  ≈61h
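For context, these figures match a Sidekiq-style formula of the shape
attempt^5 + 15 plus per-attempt jitter. A sketch reconstructing it
(helper and default names are assumptions modelled on Pleroma's
WorkerHelper, not quoted from the tree):

    # Sketch of the current strategy, reconstructed from the table
    # above. Deterministic part: attempt^5 + 15; jitter: up to 30s
    # per attempt (the "+" column).
    defmodule BackoffSketch do
      def sidekiq_backoff(attempt, pow \\ 5, base_backoff \\ 15) do
        backoff =
          :math.pow(attempt, pow) +
            base_backoff +
            :rand.uniform(2 * base_backoff) * attempt

        trunc(backoff)
      end
    end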
We default to 5 retries, meaning the last backoff runs with attempt=4.
Therefore outgoing activities might already be permanently dropped by a
downtime of only 23 minutes, which doesn't seem too implausible to
occur. Furthermore, it seems excessive to retry this quickly and this
often at the beginning.
At the same time, we'd like to have at least one quick-ish retry to
deal with transient issues and maintain reasonable federation
responsiveness. If an admin wants to tolerate a one-day downtime of
remotes, the retry count needs to be almost doubled.
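For illustration, assuming the stock :pleroma, :workers config key for
retry counts (verify the exact key against config/description.exs
before copying):

    # Sketch: tolerating ~1 day of remote downtime under the current
    # strategy needs roughly 9 retries (cumulative ≈33h at attempt 9).
    config :pleroma, :workers,
      retries: [
        federator_incoming: 5,
        federator_outgoing: 9
      ]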
The new backoff strategy implemented in this commit instead switches
to an exponential curve after a few initial attempts:
attempt   backoff_single       cumulative
   1          16+30                16+30
   2         143+60               159+90
   3        2202+90   ≈37min     2361+180  ≈40min
   4        8160+120  ≈ 2.3h    10521+300  ≈ 3h
   5       77393+150  ≈21.5h    87914+450  ≈24h
Initial retries are still fast, but the same number of retries now
allows for a remote downtime of at least 40 minutes. Customising the
retry count to 5 allows for whole-day downtimes.
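The coefficients below are reconstructed from the table rather than
quoted from the code, so treat the cutover point and the bases as
assumptions; the sketch does reproduce all five rows:

    # Sketch of the new strategy: polynomial warm-up for the first few
    # attempts, exponential afterwards; jitter unchanged (up to 30s
    # per attempt).
    defmodule NewBackoffSketch do
      @base_backoff 15

      # 1^7=1, 2^7=128, 3^7=2187  (+15 -> 16, 143, 2202)
      def backoff(attempt) when attempt <= 3 do
        trunc(:math.pow(attempt, 7)) + @base_backoff + jitter(attempt)
      end

      # 9.5^4 ≈ 8145, 9.5^5 ≈ 77378  (+15 -> 8160, 77393)
      def backoff(attempt) do
        trunc(:math.pow(9.5, attempt)) + @base_backoff + jitter(attempt)
      end

      # the "+" column above: up to 30s extra per attempt
      defp jitter(attempt), do: :rand.uniform(2 * @base_backoff) * attempt
    end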
By default, just prevent job floods with a 1-second uniqueness check,
but override this in RemoteFetcherWorker with a 5-minute uniqueness
check over all job states.
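In Oban terms this corresponds to something like the sketch below
(periods taken from the text; the module names and the minimal
perform/1 bodies are placeholders):

    # Default workers: dedupe identical jobs enqueued within the same
    # second
    defmodule MyWorker do
      use Oban.Worker, unique: [period: 1]

      @impl Oban.Worker
      def perform(%Oban.Job{}), do: :ok
    end

    # RemoteFetcherWorker: dedupe for 5 minutes across every job state,
    # so completed or discarded fetches also suppress re-enqueues
    defmodule RemoteFetcherWorker do
      use Oban.Worker,
        queue: :remote_fetcher,
        unique: [period: 300, states: Oban.Job.states()]

      @impl Oban.Worker
      def perform(%Oban.Job{}), do: :ok
    end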
:infinity is an option we could go for at some point, but that would
prevent any refetches, so maybe not.