fix(aria2): drain stderr pipe to prevent aria2 freeze, retry RPC status on transient failure

Root cause: the stderr pipe of the aria2c subprocess (64 KB buffer) was
never drained. Once enough error/warning output accumulated, aria2's
write() blocked and froze the entire process, including its RPC handler.
The tellStatus call then timed out after 30s with asyncio.TimeoutError(),
whose str() is empty, which produced the empty error message in
'Failed to query aria2 download status: '.
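A minimal sketch of why the logged message ended in nothing, assuming the
error was formatted with the exception's str(). The names never_answers and
query_status are hypothetical stand-ins, not the project's code, and the
timeout is shortened from 30s so the example finishes quickly.

```python
import asyncio


async def never_answers():
    # Stand-in for an RPC handler that is frozen and never replies.
    await asyncio.sleep(3600)


async def query_status():
    try:
        # The real call reportedly waited 30s; 0.01s keeps the demo fast.
        await asyncio.wait_for(never_answers(), timeout=0.01)
    except asyncio.TimeoutError as exc:
        # wait_for raises TimeoutError without arguments, so str(exc) is "".
        return f"Failed to query aria2 download status: {exc}"
    return "ok"


message = asyncio.run(query_status())
print(repr(message))
```

Logging the exception type as well (for example with `{exc!r}`) would keep a
timeout distinguishable from other failures.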

Fixes:
- Drain stderr in a background task so the pipe never fills up
- Retry get_status() RPC calls up to 3 times on transient failure
- In the failure path, preserve the .safetensors file when its .aria2
  control file is absent (the download was likely complete on disk)
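The first two fixes can be sketched roughly as follows. This is an
illustration under stated assumptions, not the code from this commit:
drain_stream, call_with_retry, flaky_status and demo are hypothetical names,
a short Python child process stands in for aria2c, and the set of exceptions
treated as transient is a guess.

```python
import asyncio
import sys

# Child process that writes about 320 KB to stderr, well past a 64 KB pipe.
CHILD_CODE = "import sys\nfor _ in range(4000): sys.stderr.write('w' * 79 + '\\n')"


async def drain_stream(stream, chunks):
    # Keep reading until EOF so the writer never blocks on a full pipe.
    while True:
        data = await stream.read(65536)
        if not data:
            break
        chunks.append(data)


async def call_with_retry(rpc_call, attempts=3, delay=0.5):
    # Retry an awaitable factory on failures assumed to be transient.
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await rpc_call()
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            if attempt < attempts:
                await asyncio.sleep(delay)
    raise last_exc


async def demo():
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE, stderr=asyncio.subprocess.PIPE
    )
    chunks = []
    drain_task = asyncio.create_task(drain_stream(proc.stderr, chunks))
    # Without the drain task, the child could block in write() forever.
    returncode = await asyncio.wait_for(proc.wait(), timeout=30)
    await drain_task

    calls = {"n": 0}

    async def flaky_status():
        # Fails twice with a timeout, then succeeds on the third call.
        calls["n"] += 1
        if calls["n"] < 3:
            raise asyncio.TimeoutError()
        return {"status": "active"}

    status = await call_with_retry(flaky_status, attempts=3, delay=0)
    drained = sum(len(chunk) for chunk in chunks)
    return returncode, drained, status, calls["n"]


result = asyncio.run(demo())
print(result)
```

Reading in fixed-size chunks rather than line by line avoids the stream
reader's line-length limit if the child ever emits a very long line.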
Will Miao
2026-06-26 08:25:05 +08:00
parent 0ac10dfd42
commit 3a2941d751
3 changed files with 152 additions and 2 deletions

@@ -2029,7 +2029,21 @@ class DownloadManager:
 break
 last_error = result
-if os.path.exists(save_path):
+# For aria2: if the .aria2 control file is missing, aria2 considers
+# the download complete. A transient RPC failure may have made us
+# think the download failed even though the file is fully on disk.
+# Keep the file so a retry can find it already complete.
+if (
+transfer_backend == "aria2"
+and os.path.exists(save_path)
+and not os.path.exists(f"{save_path}.aria2")
+):
+logger.warning(
+"aria2 download reported failure but .aria2 file is absent "
+"for %s — the file is likely complete. Preserving it for retry.",
+save_path,
+)
+elif os.path.exists(save_path):
 try:
 os.remove(save_path)
 except Exception as e: