Describe the bug I get the folloing error after all

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hey there! With the release of <a href="https://github.com/mariadb-o

[Bug] MariaDB fails to restart after the whole cluster rebooted about mariadb-operator HOT 12 OPEN

outductor commented on June 15, 2024

[Bug] MariaDB fails to restart after the whole cluster rebooted

from mariadb-operator.

Comments (12)

outductor commented on June 15, 2024 1

@mmontes11 can you reopen this? :(

from mariadb-operator.

mmontes11 commented on June 15, 2024 1

Thanks for the detailed information!

This most likely means that the replicas are missing some binary log operations from the primary and therefore are unable to unable to apply some further operations.

We are looking into automatically fix this situation, adding support for physical backups the first thing:

#273

The idea is that the operator would detect when the replica Pods are unable to fully replicate and perform the following steps automatically to bring it back:

#141 (comment)

The same applies to adding new replicas to the cluster, we would need to setup the new replica based on a physical backup from the primary, as some log operations may have been expired in the primary i.e. by configuring expire_logs_days. See:

Related to:

#141

from mariadb-operator.

mmontes11 commented on June 15, 2024 1

Hey there!

With the release of v0.0.25, primary failover done by the operator has been superseded by MaxScale: a sophisticated database proxy, router, and load balancer designed specifically for and by MariaDB. I provides a more resilient primary failover based on MariaDB internals, I encourage you to take a look at the docs:

https://github.com/mariadb-operator/mariadb-operator/blob/main/docs/MAXSCALE.md

from mariadb-operator.

kvaps commented on June 15, 2024 1

Not stale. This is still in a row

from mariadb-operator.

github-actions commented on June 15, 2024

This issue is stale because it has been open 30 days with no activity.

from mariadb-operator.

github-actions commented on June 15, 2024

This issue was closed because it has been stalled for 5 days with no activity.

from mariadb-operator.

boedy commented on June 15, 2024

I might have run into the same issue. I'm not sure if the whole cluster rebooted, but two replicas are out of sync now. This is currently the case in my two MariaDB clusters I run on the same Kubernetes cluster.

I had to delete one of the replica pods today as it was promoted to master by the operator, which brought the issue to light. When I deleted the pod, the most up-to-date replica became master again, restoring the correct state of the cluster. When the deleted pod came back online, the logs showed:

2023-12-13 14:49:21 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-12-13 14:49:21 0 [Note] InnoDB: Number of pools: 1
2023-12-13 14:49:21 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-12-13 14:49:21 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-12-13 14:49:21 0 [Note] InnoDB: Using Linux native AIO
2023-12-13 14:49:21 0 [Note] InnoDB: Initializing buffer pool, total size = 8589934592, chunk size = 134217728
2023-12-13 14:49:21 0 [Note] InnoDB: Completed initialization of buffer pool
2023-12-13 14:49:21 0 [Note] InnoDB: 128 rollback segments are active.
2023-12-13 14:49:21 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-12-13 14:49:21 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-12-13 14:49:21 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-12-13 14:49:21 0 [Note] InnoDB: 10.6.13 started; log sequence number 16155782456; transaction id 13287581
2023-12-13 14:49:21 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2023-12-13 14:49:21 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-12-13 14:49:21 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-12-13 14:49:21 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-12-13 14:49:21 0 [Note] Server socket created on IP: '::'.
2023-12-13 14:49:21 3 [Note] Reading Master_info: '[email protected]'  Relay_info:'[email protected]'
2023-12-13 14:49:21 3 [Note] Initialized Master_info from '[email protected]'
2023-12-13 14:49:21 4 [Note] Master 'mariadb-operator': Slave I/O thread: Start asynchronous replication to master '[email protected]:3306' in log 'mariadb-bin.000004' at position 13297925
2023-12-13 14:49:21 3 [Note] Started replication for 'mariadb-operator'
2023-12-13 14:49:21 5 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'mariadb-bin.000004' at position 13297925, relay log './[email protected]' position: 4; GTID position ''
2023-12-13 14:49:21 0 [Note] mariadbd: ready for connections.
Version: '10.6.13-MariaDB-1:10.6.13+maria~ubu2004-log'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
2023-12-13 14:49:21 4 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position ''
2023-12-13 14:49:32 0 [Note] InnoDB: Buffer pool(s) load completed at 231213 14:49:32
2023-12-13 14:49:49 5 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 33348783; GTID position '0-10-30947', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:49 4 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000002', position 374705570; GTID position 0-10-1355222, master mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:49 491 [Note] Master connection name: 'mariadb-operator'  Master_info_file: '[email protected]'  Relay_info_file: '[email protected]'
2023-12-13 14:49:49 491 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000002', master_log_pos='374705570'. New state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-13 14:49:49 491 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-13 14:49:49 492 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master '[email protected]:3306' in log '' at position 4
2023-12-13 14:49:49 493 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position '0-10-30947'
2023-12-13 14:49:49 492 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position '0-10-30947'
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': mariadbd: Can't find record in 'view_track_stats'
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Update_rows_v1 event on table analytics.view_track_stats; Can't find record in 'view_track_stats', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.000001, end_log_pos 128851519, Gtid 0-10-118186, Internal MariaDB error code: 1032
2023-12-13 14:50:31 493 [Warning] Master 'mariadb-operator': Slave: Can't find record in 'view_track_stats' Error_code: 1032
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 128850327; GTID position '0-10-118185'
2023-12-13 14:50:31 493 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 128850327; GTID position '0-10-118185', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:51:42 1865 [Warning] Aborted connection 1865 to db: 'analytics' user: 'root' host: '127.0.0.1' (Got an error reading communication packets)

The second outdated replica has bit more logs to go on:

2023-12-03 14:09:07 24 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position '0-10-9'
2023-12-03 14:09:07 23 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position '0-10-9'
2023-12-03 14:10:06 24 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Write_rows_v1 event on table analytics.events; Duplicate entry '132079351' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000001, end_log_pos 2220, Gtid 0-10-10, Internal MariaDB error code: 1062
2023-12-03 14:10:06 24 [Warning] Master 'mariadb-operator': Slave: Duplicate entry '132079351' for key 'PRIMARY' Error_code: 1062
2023-12-03 14:10:06 24 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 1681; GTID position '0-10-9'
2023-12-03 14:10:06 24 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1681; GTID position '0-10-9', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-10  8:50:53 23 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000004' at position 13297925; GTID position '0-10-3011382'
2023-12-10  8:50:54 23 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Lost connection to server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)", Internal MariaDB error code: 2013
2023-12-10  8:51:37 23 [Note] Master 'mariadb-operator': Slave: connected to master '[email protected]:3306',replication resumed in log 'mariadb-bin.000004' at position 13297925
2023-12-10  8:51:40 23 [ERROR] Master 'mariadb-operator': Error reading packet from server: Lost connection to server during query (server_errno=2013)
2023-12-10  8:51:40 23 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000004', position 13666032; GTID position 0-10-3011653, master mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-10  8:51:40 117175 [Note] Master connection name: 'mariadb-operator'  Master_info_file: '[email protected]'  Relay_info_file: '[email protected]'
2023-12-10  8:51:40 117175 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000004', master_log_pos='13666032'. New state master_host='mariadb-2.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-10  8:51:40 117175 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-10  8:51:40 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master '[email protected]:3306' in log '' at position 4
2023-12-10  8:51:40 117177 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position ''
2023-12-10  8:51:40 117176 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position ''
2023-12-13  4:04:41 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13  4:05:01 117176 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Unknown server host 'mariadb-2.mariadb-internal.analytics.svc.cluster.local' (-3), Internal MariaDB error code: 2005
2023-12-13  4:05:31 117176 [Note] Master 'mariadb-operator': Slave: connected to master '[email protected]:3306',replication resumed in log 'mariadb-bin.000001' at position 1681
2023-12-13  4:09:02 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 14:49:07 117176 [Note] Master 'mariadb-operator': Slave: received end packet from server, apparent master shutdown: 
2023-12-13 14:49:07 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 14:49:07 117176 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Lost connection to server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)", Internal MariaDB error code: 2013
2023-12-13 14:49:17 117177 [Note] Master 'mariadb-operator': Error reading relay log event: slave SQL thread was killed
2023-12-13 14:49:17 117177 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391', master: mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:17 117176 [Note] Master 'mariadb-operator': Slave I/O thread killed during or after a reconnect done to recover from failed read
2023-12-13 14:49:17 117176 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000001', position 1681; GTID position 0-12-3011391, master mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:18 117176 [Note] Master 'mariadb-operator': cannot connect to master to kill slave io_thread's connection
2023-12-13 14:49:18 173310 [Note] Master connection name: 'mariadb-operator'  Master_info_file: '[email protected]'  Relay_info_file: '[email protected]'
2023-12-13 14:49:18 173310 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-2.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000001', master_log_pos='1681'. New state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-13 14:49:18 173310 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-13 14:49:18 173311 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master '[email protected]:3306' in log '' at position 4
2023-12-13 14:49:18 173312 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position ''
2023-12-13 14:49:18 173311 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position ''
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': mariadbd: Can't find record in 'view_user_stats'
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Update_rows_v1 event on table analytics.view_user_stats; Can't find record in 'view_user_stats', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.000001, end_log_pos 22844, Gtid 0-10-20, Internal MariaDB error code: 1032
2023-12-13 14:49:18 173312 [Warning] Master 'mariadb-operator': Slave: Can't find record in 'view_user_stats' Error_code: 1032
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 21411; GTID position '0-10-19'
2023-12-13 14:49:18 173312 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 21411; GTID position '0-10-19', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306

For now I'll purge my replica's and re-add them through this method: #141 (comment)

from mariadb-operator.

github-actions commented on June 15, 2024

This issue is stale because it has been open 30 days with no activity.

from mariadb-operator.

marc-cenon commented on June 15, 2024

Hello,
I have the same problems,
I am testing the mariadb operator (one master and 2 replicas) with Proxysql in front.
I am familiar with Proxysql and do not wish to move away from it and switch to MaxScale.

During my tests:

When I kill the master
a replica is elected master
the old master can't connect to the new master

I have the same errors (reconnecting to master with the repl user). I kill the same pod again and still the result is the same.

error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Can't connect to server on 'mariadb-0.mariadb-internal.mariadb.svc.cluster.local' (111 "Connection refused")

Thank you in advance for your help

from mariadb-operator.

github-actions commented on June 15, 2024

This issue is stale because it has been open 30 days with no activity.

from mariadb-operator.

github-actions commented on June 15, 2024

This issue is stale because it has been open 30 days with no activity.

from mariadb-operator.

github-actions commented on June 15, 2024

This issue is stale because it has been open 30 days with no activity.

from mariadb-operator.

[Bug] MariaDB fails to restart after the whole cluster rebooted about mariadb-operator HOT 12 OPEN

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent