Friday, December 12, 2025

How I tracked down a thread block that looked like a database problem


Recently I had a web application that behaved as if it were quietly freezing from the inside out. Threads were piling up, each one waiting forever, like cars stuck behind an invisible traffic light that never turns green.

At first glance, the symptom looked straightforward: the threads were blocked at the database level. They were trying to update a table, couldn’t acquire a lock, and ended up waiting indefinitely. Because only operations related to a single element were stuck, I assumed it wasn’t a full table lock but a row lock. Unfortunately, I couldn’t confirm this, because by the time I saw the issue, the application had already been restarted, clearing all locks.

Still, the database block felt like a consequence, not the root cause. My job was to find the first thread that got stuck. That original thread triggered the row lock; all other threads simply joined the queue behind it.

So I needed a way to reconstruct the story from logs alone.

My method was simple:

  1. Export all logs from Graylog for the timeframe into a CSV.
  2. Instead of analysing it with Python/Pandas, I chose a quicker path: upload the CSV into a Postgres table with matching columns (a sketch of the import follows the list).
  3. Query the table to find the last time each thread wrote a log line. If a thread is blocked, you stop seeing it. And there it was: the first thread that went silent (the query is sketched after the list).
  4. Check what endpoint that thread called and with what parameters. A second voilà: the request was trying to move a folder under itself.
  5. In parallel, I checked the endpoint code with Cursor, and it clearly showed several recursive branches that could loop forever.
  6. Reproduce: call the endpoint with parameters that move a folder under itself. Instant block: the request spins in the recursion while holding the row lock, and every later update on that row queues behind it.
  7. Fix: add validation that rejects a move when the destination is the folder itself or one of its descendants (a sketch follows the list).
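
A minimal sketch of the import from step 2, assuming the Graylog export has a timestamp, a thread name, and a message column. The table name, column names, and file path here are made up; adjust them to the actual export:

    -- Hypothetical table mirroring the CSV columns.
    CREATE TABLE app_logs (
        log_time    timestamptz,
        thread_name text,
        message     text
    );

    -- Server-side load; from psql, \copy works the same way
    -- without needing file access on the database server.
    COPY app_logs FROM '/tmp/graylog_export.csv' WITH (FORMAT csv, HEADER true);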
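The "last sign of life" query from step 3 is then a plain GROUP BY: the thread whose last log line is earliest is the prime suspect. Step 4 just replays that thread's lines in order (same assumed table and columns as above; the thread name is a made-up example):

    -- Step 3: last log line per thread. Blocked threads stop
    -- logging, so the earliest last_seen is the first one stuck.
    SELECT thread_name, MAX(log_time) AS last_seen
    FROM app_logs
    GROUP BY thread_name
    ORDER BY last_seen
    LIMIT 20;

    -- Step 4: read the suspect thread's story in order.
    SELECT log_time, message
    FROM app_logs
    WHERE thread_name = 'http-exec-42'   -- hypothetical thread name
    ORDER BY log_time;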
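For the fix in step 7, the validation boils down to rejecting a move whose destination lies inside the subtree being moved. The real fix lived in the endpoint code, but assuming a hypothetical folders(id, parent_id) table, the check can be sketched as a recursive CTE:

    -- Collect the moved folder and all of its descendants; if the
    -- destination appears among them, the move would create a cycle.
    -- :moved_id and :dest_id are placeholders for the request parameters.
    WITH RECURSIVE subtree AS (
        SELECT id FROM folders WHERE id = :moved_id
        UNION ALL
        SELECT f.id
        FROM folders f
        JOIN subtree s ON f.parent_id = s.id
    )
    SELECT EXISTS (
        SELECT 1 FROM subtree WHERE id = :dest_id
    ) AS move_is_invalid;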
