T379035 Consider lifting AssembleUploadChunks and PublishStashedFile out of the low-traffic consumer
Closed, Resolved (Public)

Description

In T378385, a large influx of flaggedrevs_CacheUpdate jobs led to isolation failure among jobs mapped to the "low-traffic" consumer in changeprop-jobqueue.

While T378609 tracks monitoring improvements to help detect similar scenarios in the future, we could also proactively shift one or more particularly latency-sensitive jobs out of low-traffic into a dedicated consumer (as has already happened for a handful of other jobs, see [0]).

Two such jobs that came up in that context are AssembleUploadChunks and PublishStashedFile, given that execution delay translates directly into waiting at the UI.

[0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/changeprop-jobqueue/values.yaml

Event Timeline

Change #1089313 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop: add latency_sensitive_jobs_config jobs

https://gerrit.wikimedia.org/r/1089313

Change #1089314 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: add AssembleUploadChunks rule

https://gerrit.wikimedia.org/r/1089314

Change #1089315 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: enable AssembleUploadChunks rule

https://gerrit.wikimedia.org/r/1089315

Looking at includes/api/ApiUpload.php, it seems there are three jobs in the critical path for various (async) upload use cases: AssembleUploadChunks, PublishStashedFile, and UploadFromUrl.

One option is to do what's proposed in the patch series starting at https://gerrit.wikimedia.org/r/1089313, where each will get its own consumer, in this case starting with AssembleUploadChunks. Once that's verified to work as expected, we'd add consumers for the other two similarly.

Another option would be to lump all three together into a combined latency-sensitive consumer - basically, similar to the way the low-traffic consumer works now, but rather than being a "kitchen sink" of all jobs not otherwise handled, it would be "invite only" (i.e., would only service jobs that are known to truly be low-traffic and latency-sensitive).

Really the only significant benefit of the second option would be reducing the number of consumers, though at the expense of these jobs being able to interfere with each other. Which is to say, unless there's a history of issues related to number-of-consumers (need to check on that), I think the simplest option is the first one.
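
To make the trade-off between the two options concrete, here is a minimal Python sketch of how job types might map to consumers under each. It is only an illustration: the function names and consumer labels are hypothetical and do not reflect the actual changeprop-jobqueue configuration schema, and flaggedrevs_CacheUpdate is assumed here to fall through to the catch-all consumer.

```python
# Hypothetical model of job-to-consumer routing. All names and labels are
# illustrative assumptions, not the real changeprop-jobqueue config.

UPLOAD_JOBS = ("AssembleUploadChunks", "PublishStashedFile", "UploadFromUrl")


def route_dedicated(job_type):
    """Option 1: each upload job gets its own consumer."""
    if job_type in UPLOAD_JOBS:
        return "dedicated:" + job_type
    # Everything not otherwise handled lands in the catch-all consumer.
    return "low-traffic"


def route_combined(job_type):
    """Option 2: one 'invite only' consumer shared by the upload jobs."""
    if job_type in UPLOAD_JOBS:
        return "latency-sensitive"
    return "low-traffic"


def consumers_used(route, job_types):
    """Return the distinct consumers that a routing function assigns."""
    return {route(job) for job in job_types}


if __name__ == "__main__":
    for job in (*UPLOAD_JOBS, "flaggedrevs_CacheUpdate"):
        print(job, "->", route_dedicated(job), "|", route_combined(job))
```

Under these assumptions, option 1 uses three consumers for the upload jobs and keeps them isolated from one another, while option 2 uses a single consumer that the three jobs share.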

The only advantage of the second option is that it's easier to reason about the max concurrency of all upload jobs, but I doubt we'll have a concurrency high enough to become a problem anyways.

I'd say go with either solution. The traffic on these jobs is indeed low, but we care about latency rather than volume, so in practice I don't expect the choice between them to matter much.

Change #1089315 abandoned by Scott French:

[operations/deployment-charts@master] changeprop-jobqueue: enable AssembleUploadChunks rule

Reason:

This is now enabled from the start in I4d6bfe2b8e537aec9e1558c494b1ce966cedb114

https://gerrit.wikimedia.org/r/1089315

Change #1089313 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add latency_sensitive_jobs_config (jobqueue)

https://gerrit.wikimedia.org/r/1089313

Change #1089314 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: add latency-sensitive upload jobs rules

https://gerrit.wikimedia.org/r/1089314

Thank you very much for the second opinion and the reviews on the patches, @Joe!

Since ~17:50 UTC today, we've been processing all three upload-related job types on dedicated consumers. No issues encountered so far, and I'll continue to keep an eye on this throughout the day.

Scott_French changed the task status from Open to In Progress. Nov 13 2024, 6:18 PM
Scott_French triaged this task as High priority.

No issues encountered throughout the rest of the day - i.e., job execution and backlog times on the new consumers continue to look good. I'll resolve this, as there are no remaining actions tracked here.