I've been investigating an application failure, and I have reason to believe the culprit lies somewhere in the database backend. To that end, I started collecting metrics from the backend MariaDB Galera cluster (currently running MariaDB 10.3.16), hoping that the failures would be reflected in the collected metrics.
Indeed, about 12 hours before the application started failing spectacularly, the values reported by the 'Master' node (i.e. the node to which the application directs the writes) for Innodb_row_lock_time started growing at a rate never recorded before. Here's a link to a graph demonstrating this fact over the past week:
Note that the graph displays the rate of change, not the current value of the metric. The MariaDB servers are polled every 90 seconds, and the datapoints in the graph represent change per minute. The big drop near the end of the graph marks the point when the MariaDB service was restarted on the 'Master' node.
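For reference, the metric in the graph comes from the standard InnoDB status counters, which can be checked directly on the node:

```sql
-- Cumulative InnoDB row-lock counters since server start
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';
-- Includes Innodb_row_lock_time (total ms spent waiting on row locks),
-- Innodb_row_lock_time_avg, Innodb_row_lock_time_max,
-- Innodb_row_lock_waits and Innodb_row_lock_current_waits
```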
My question is how to investigate this symptom further and possibly identify the culprit queries or operations. I also log the InnoDB Monitor output to log files, but I haven't been able to find anything out of the ordinary during the period when the lock wait time was growing rapidly (although I'm no DB expert).
Is there any other logging functionality I can enable to provide more information on this? And if the InnoDB Monitor output should already contain the information needed, what exactly should I be looking for? What kind of operations could lead to rows being locked for so long, given the sudden onset of the issue?
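In case it helps frame suggestions: my current plan, next time the wait time starts climbing, is to try to catch the blocking transactions live by joining the `information_schema` lock tables (which I understand are still present in MariaDB 10.3). A sketch of what I have in mind:

```sql
-- Show which transaction is waiting on which, with the statements involved
SELECT r.trx_id              AS waiting_trx,
       r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_id              AS blocking_trx,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
```

The obvious limitation is that this only shows waits happening at the moment the query runs, which is why I'm hoping there is some form of logging that captures them after the fact.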
Thank you in advance.