So apparently there's an issue with the Drone webservice that occurs rarely:
When a Manager calls insert_history_urls, on rare occasions the following happens:
- The corresponding HistoryUrl object gets created (GOOD)
- The corresponding CompletedUrl object gets created (GOOD)
- The corresponding PendingUrl object does NOT get deleted (unsure?) (BAD)
- The corresponding QueueUrl object does NOT get deleted (unsure?) (BAD)
Essentially, (it seems that) the URL gets visited by a host, but on rare occasions its row stays in the queue_urls table (along with the PendingUrl association). Even though this stale data exists, the Manager won't attempt to revisit the URL (which is good; otherwise we'd have an infinite loop).
So, the big issue is that we could slowly start building up stale data in the queue_urls and pending_urls tables.
This is a tough bug to replicate, because it seems to appear only after the system has handled more than 90,000 URLs, which suggests there may be some sort of weird timing issue.
This ticket is open just so that we can track it, in case the bug still exists.
I have a couple of ideas on how to possibly work around the issue. For example, in the get_new_queue_urls call, we could add some "stale data checking logic" at the end that would only execute when the rest of the function body was about to return an empty set of URLs. This logic could query the queue_urls table and check whether any of the already-assigned URLs have created_at times that are more than X minutes old (e.g., 15 minutes?). If so, the logic could simply reset those URLs' host_id field back to 0, so that they'll get assigned to other hosts.
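To make the idea concrete, here is a rough sketch in plain Ruby. It is not the actual Drone code: the QueueUrl struct, the in-memory array standing in for the queue_urls table, release_stale_urls, STALE_AFTER_SECONDS, and the method signatures are all made up for illustration. Only the host_id / created_at fields and the "check only when about to return nothing" rule come from the description above.

```ruby
# Hypothetical in-memory stand-in for a row in the queue_urls table.
QueueUrl = Struct.new(:url, :host_id, :created_at)

# Assumed threshold for "X minutes"; 15 minutes is only the example value.
STALE_AFTER_SECONDS = 15 * 60

# Resets host_id to 0 on every assigned row older than the threshold.
# Returns the rows that were reset.
def release_stale_urls(queue_urls, now: Time.now, stale_after: STALE_AFTER_SECONDS)
  stale = queue_urls.select do |q|
    q.host_id != 0 && (now - q.created_at) > stale_after
  end
  stale.each { |q| q.host_id = 0 }
  stale
end

# Simplified stand-in for get_new_queue_urls: hands unassigned rows to
# host_id, and runs the stale check only when there is nothing to hand out.
def get_new_queue_urls(queue_urls, host_id, limit: 10, now: Time.now)
  batch = queue_urls.select { |q| q.host_id == 0 }.first(limit)
  if batch.empty?
    # Nothing to return anyway, so spend the time sweeping for stale rows.
    release_stale_urls(queue_urls, now: now)
    return []
  end
  batch.each { |q| q.host_id = host_id }
  batch
end
```

In this sketch the released URLs are picked up on the next call rather than the current one. One thing it makes visible: since the check keys on created_at, a URL that has been released once would look stale again as soon as it is reassigned, so a real version might need a separate "assigned at" timestamp.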
The upshot of this potential logic is that if we had assigned URLs to a Manager and (for whatever reason) that Manager dies, then those stale URLs could get picked up by another (alive) Manager and not sit idle in the queue.
The downside is that this code could make our get_new_queue_urls call a bit slower. But since the call was going to return an empty set of URLs anyway, it's not like we're wasting any time that the Manager wouldn't otherwise have spent idle.
The only caveat to this approach is that this type of "stale data checking logic" would only run when the queue_urls table has no unassigned URLs left (i.e., when the only URLs remaining in the table are ones whose host_id is not 0). So, if the queue_urls table is (for whatever reason) consistently full, then this code never gets executed and we still have a (slow) build-up of stale data.
As such, it sounds like this type of "stale data checking logic" should really be handled in a background Ruby process that runs at regular intervals. Perhaps this could be an initial use of BackgroundRB. Not sure.
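As a rough illustration of the "regular intervals" part, here is a plain-Ruby sketch that uses a background thread. It does not use BackgroundRB's API; PeriodicSweeper and its methods are hypothetical names, and the block passed in is where the stale data check would go.

```ruby
# Hypothetical helper: calls the given block roughly every `interval`
# seconds on a background thread until #stop is called.
class PeriodicSweeper
  def initialize(interval, &sweep)
    @interval = interval
    @sweep = sweep
    @stopped = false
    @mutex = Mutex.new
    @cond = ConditionVariable.new
  end

  # Starts the background thread; the first sweep runs immediately.
  def start
    @thread = Thread.new do
      @mutex.synchronize do
        until @stopped
          @sweep.call
          # Sleeps up to @interval seconds, but wakes early if #stop signals.
          @cond.wait(@mutex, @interval)
        end
      end
    end
    self
  end

  # Asks the loop to finish and waits for the thread to exit.
  def stop
    @mutex.synchronize do
      @stopped = true
      @cond.signal
    end
    @thread.join
  end
end
```

Usage would look something like `sweeper = PeriodicSweeper.new(60) { run_stale_check }.start`, where run_stale_check is whatever ends up resetting the stale rows. Unlike the check inside get_new_queue_urls, this runs regardless of how full the queue is.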
Anyway, sorry for the lengthy brain dump.
— Darien