After several hours of debugging with a larger task force yesterday and today, we finally figured out that we must have run into an XFS bug (it seems to work fine on ext4). An
ftruncate syscall hung forever, which left the process stuck in an uninterruptible sleep. This was the first time I ever saw
kill -9 fail to “work” (the signal is only acted on once the process leaves that state). But I learned a bunch of new stuff. I had never dug this deep into the kernel's guts before.
Some of you probably know that
/proc/$PID/syscall tells you the current system call the process is executing. And
/proc/$PID/stack returns the kernel stack trace. Awesome stuff!
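To make those two files a bit more concrete, here is a minimal Python sketch that reads them and splits the syscall line into its parts. It is not the tooling we used during the debugging session; the helper names parse_syscall and inspect_process are made up for illustration, and the field layout follows my reading of the proc(5) man page. Reading these files usually requires root (or at least ptrace-level access to the target process), so the sketch just reports a file it cannot read.

```python
from pathlib import Path


def parse_syscall(text):
    """Parse the contents of /proc/<pid>/syscall into a small dict.

    Hypothetical helper; the layout assumed here is the one described in proc(5).
    """
    fields = text.split()
    if not fields:
        # The file can come back empty, e.g. if the process just exited.
        return {"status": "unknown"}
    if fields[0] == "running":
        # The process is not blocked, so there is no syscall to report.
        return {"status": "running"}
    if fields[0] == "-1":
        # Blocked, but not inside a system call: only sp and pc follow.
        return {"status": "blocked", "number": None, "args": []}
    # Otherwise: syscall number, argument registers, then sp and pc.
    return {
        "status": "in_syscall",
        "number": int(fields[0]),
        "args": fields[1:-2],
    }


def inspect_process(pid):
    """Print the current syscall and the kernel stack of a process."""
    proc = Path("/proc") / str(pid)
    for name in ("syscall", "stack"):
        path = proc / name
        print(f"--- {path} ---")
        try:
            content = path.read_text()
        except OSError as err:
            # Typically a permission error, or the process is already gone.
            print(f"<unreadable: {err}>")
            continue
        print(content.strip())
        if name == "syscall":
            print(parse_syscall(content))
```

Calling inspect_process(1234) as root prints both files for PID 1234. The syscall number is architecture-specific, so it still has to be looked up in the syscall table for your platform to get a name.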
Here’s a wonderful article on the subject: https://tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/