After several hours of debugging with a larger group yesterday and today, we finally figured out that we must have run into an XFS bug (the same thing seems to work fine on ext4). An ftruncate syscall hung forever, which left the process stuck in an uninterruptible sleep. This was the first time I ever saw kill -9 fail to “work”. But I learned a bunch of new stuff. I had never dug this deep into the guts before.
Some of you probably know that /proc/$PID/syscall tells you which system call a process is currently executing, and that /proc/$PID/stack returns its kernel stack trace. Awesome stuff!
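For anyone who wants to poke at these two files from a script, here is a minimal Python sketch. The helper names (parse_syscall_line, read_proc_file, describe) are made up for illustration, and the sample line in the comment is not output from our actual incident. It assumes a Linux system with /proc mounted; reading /proc/$PID/stack usually requires root and a kernel built with stack trace support, so the sketch simply reports when a file cannot be read.

```python
from pathlib import Path


def parse_syscall_line(line):
    """Interpret the single line found in /proc/<pid>/syscall.

    Per proc(5), the line is either the word "running", or "-1" followed by
    the stack pointer and program counter (blocked, but not in a syscall),
    or the syscall number followed by six argument registers, sp and pc.
    Illustrative sample: "77 0x3 0x0 0x0 0x0 0x0 0x0 0x7ffd0000 0x7f000000"
    (77 happens to be ftruncate on x86-64).
    """
    fields = line.split()
    if not fields:
        return {"state": "unknown"}
    if fields[0] == "running":
        return {"state": "running"}
    if fields[0] == "-1":
        return {"state": "blocked outside a syscall"}
    return {
        "state": "in syscall",
        "number": int(fields[0]),  # syscall number is printed in decimal
        "args": fields[1:7],       # six argument registers, kept as hex strings
    }


def read_proc_file(pid, name):
    """Return the text of /proc/<pid>/<name>, or None if it cannot be read."""
    try:
        return Path(f"/proc/{pid}/{name}").read_text()
    except OSError:
        # Missing process, missing kernel support, or insufficient privileges.
        return None


def describe(pid):
    """Print the current syscall and kernel stack for a PID, where readable."""
    syscall = read_proc_file(pid, "syscall")
    if syscall is None:
        print("syscall: not readable")
    else:
        print("syscall:", parse_syscall_line(syscall))

    stack = read_proc_file(pid, "stack")
    print("stack:")
    print(stack if stack is not None else "  not readable (root is usually required)")
```

Calling describe(1234) for a hung process (1234 being a placeholder PID) prints the parsed syscall line, followed by the kernel stack if you have permission to read it. The syscall number still has to be mapped to a name for your architecture by hand.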
Here is a wonderful article on the topic: https://tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/