Exit code 28 (Invalid argument) indicates this user's aria2c does not
support the --fsync option. Remove it unconditionally; the stderr drain,
relaxed RPC timeouts, and increased retry coverage remain in place.

aria2's default --fsync=true calls fsync() after each write, which blocks
the entire single-threaded process on large files under Docker overlay.
Add --fsync=false to eliminate this source of blocking.

Relax the aiohttp session timeout from total=30 to sock_connect=10,
sock_read=60 so that transient I/O delays don't cut off legitimate
tellStatus RPCs.

Increase the retry parameters (4 attempts, 3s delay) to give aria2 more
recovery time when it is blocked on synchronous I/O.
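
A minimal sketch of a retry wrapper with those parameters. The name
call_with_retry and the set of exceptions treated as transient are
assumptions for illustration, not the code in this change:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    attempts: int = 4,
    delay: float = 3.0,
) -> T:
    """Await call() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        # Assumption: timeouts and OS-level I/O errors count as transient.
        # A real client would likely add its HTTP library's error type here.
        except (asyncio.TimeoutError, OSError):
            if attempt == attempts:
                raise  # out of attempts: surface the last failure unchanged
            await asyncio.sleep(delay)
    raise ValueError("attempts must be at least 1")
```

With the defaults, a call that keeps failing is given three 3s pauses
before the final error propagates.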

Root cause: the aria2c subprocess's stderr pipe (64 KB buffer) was never
drained. Once enough error/warning output had accumulated to fill it,
aria2's write() blocked, freezing the entire process, including its RPC
handler. The tellStatus call then timed out after 30s with
asyncio.TimeoutError(), whose str() is empty, which is why the message
'Failed to query aria2 download status: ' ended with nothing after the
colon.
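
The empty message can be reproduced in isolation. A small sketch,
assuming the message was built by formatting the exception directly;
describe_error is a hypothetical helper, not part of this change:

```python
import asyncio


def describe_error(exc: BaseException) -> str:
    """Hypothetical helper: never return an empty description."""
    text = str(exc)
    name = type(exc).__name__
    # Fall back to the class name when the exception carries no message.
    return f"{name}: {text}" if text else name


timeout = asyncio.TimeoutError()

# Formatting the bare exception leaves nothing after the colon.
raw_message = f"Failed to query aria2 download status: {timeout}"

# Including the exception type keeps the message informative.
safe_message = f"Failed to query aria2 download status: {describe_error(timeout)}"
```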

Fixes:
- Drain stderr in a background task so the pipe never fills up
- Retry get_status() RPC calls up to 3 times on transient failure
- In the failure path, preserve .safetensors when .aria2 is absent
  (the download was likely complete on disk)
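
The stderr drain can be sketched with asyncio roughly as follows. The
function names and the choice to keep only the last 200 lines are
assumptions for illustration, not the code in this change:

```python
import asyncio
from collections import deque


async def drain_stream(stream: asyncio.StreamReader, sink: deque) -> None:
    """Read the stream line by line until EOF so the pipe never fills up."""
    while True:
        line = await stream.readline()
        if not line:  # EOF: the child closed its end of the pipe
            break
        sink.append(line.decode(errors="replace").rstrip())


async def run_with_drained_stderr(*cmd: str, keep: int = 200) -> tuple[int, list[str]]:
    """Run cmd, draining stderr concurrently; return (exit code, last lines)."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    tail: deque = deque(maxlen=keep)  # bounded, so memory use stays flat
    # Start draining right away and keep a reference to the task so it is
    # not garbage collected while the child is still running.
    drain_task = asyncio.create_task(drain_stream(proc.stderr, tail))
    returncode = await proc.wait()
    await drain_task  # collect whatever was still buffered at exit
    return returncode, list(tail)
```

Without the background task, a child that writes more than the pipe
buffer to stderr would block in write() and proc.wait() would never
return.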

When certifi is available, pass its CA bundle path as --ca-certificate
to the aria2c subprocess so that aria2 downloads use the same
certificate store as Python aiohttp downloads. Fall back gracefully when
certifi is not installed.
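
One way to build the argument, as a sketch. It assumes the fallback is
simply to omit the option; ca_certificate_args and build_aria2c_command
are hypothetical names:

```python
def ca_certificate_args() -> list[str]:
    """Return aria2c arguments for certifi's CA bundle, or [] without certifi."""
    try:
        import certifi
    except ImportError:
        # Assumed fallback: omit the option and let aria2c use its default
        # certificate store.
        return []
    return [f"--ca-certificate={certifi.where()}"]


def build_aria2c_command(extra_args: list[str]) -> list[str]:
    """Assemble the subprocess command line (illustrative only)."""
    return ["aria2c", "--enable-rpc", *ca_certificate_args(), *extra_args]
```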