Troubleshooting

Work in progress

This page calls out some scenarios we've identified that require special care. We expect to cover more common issues over time. In the meantime, if you need help with a particular issue, please reach out to us via Discord or Slack.

Restate Clusters

Handling missing snapshots

You observe a partition processor repeatedly crash-looping with a TrimGapEncountered error, or you see one of the following errors in the Restate server logs:

A log trim gap was encountered, but no snapshot repository is configured!

A log trim gap was encountered, but no snapshot is available for this partition!

The latest available snapshot is from an LSN before the target LSN!

These errors indicate that the local state available on the affected worker node does not allow it to resume from the log's trim point - either because the node is brand new, or because its applied partition state is behind the trim point of the partition log. If you are attempting to migrate from a single-node Restate deployment to a cluster, you can also refer to the migration guide.

To recover from this situation, you need to make a snapshot of the partition state available from another worker that is up to date with the log. This situation can arise if you have manually trimmed the log, if the node is missing its snapshot repository configuration, or if the snapshot repository is otherwise inaccessible. See Log trimming and Snapshots for more context on how logs, partitions, and snapshots are related.

Recovery procedure

Step 1: Identify whether a snapshot repository is configured and accessible

If a snapshot repository is set up on other nodes in the cluster, and simply not configured on the node where you are seeing the partition processor startup errors, correct the configuration on the affected node - refer to Configuring Snapshots. If you have not yet set up a snapshot repository, please do so now. If it is impossible to use an object store to host the snapshot repository, you can export snapshots to a local filesystem and manually transfer them to other nodes - skip to step 2b.

In your server configuration, you should have a snapshot path specified as follows:

[worker.snapshots]
destination = "s3://snapshots/prefix"

Confirm that this is consistent with other nodes in the cluster.
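As a quick consistency check, you can compare the configured destination across nodes. The sketch below assumes the server configuration lives at /etc/restate/config.toml and that the other nodes are reachable over SSH - adjust the path and hostnames to match your deployment:

for host in restate-1 restate-2 restate-3; do
  echo "== $host =="
  ssh "$host" 'grep -A 1 "\[worker.snapshots\]" /etc/restate/config.toml'
done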

Check the server logs for any access errors: does the node have the necessary credentials, and are those credentials authorized to access the snapshot destination?
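For example, if the repository is hosted in Amazon S3 and the AWS CLI is installed on the node with the same credentials the Restate server uses, you can verify access to the destination from the configuration example above:

# Check which identity the local credentials resolve to
aws sts get-caller-identity

# Check that the bucket and prefix are readable with those credentials
aws s3 ls s3://snapshots/prefix/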

Step 2: Publish a snapshot to the repository

Snapshots are produced periodically by partition processors on certain triggers, such as a given number of records being appended to the log. If you are seeing the errors above, check that snapshots are being written to the object store destination you have configured.

Verify that this partition has an active node:

restatectl partitions list

If you have lost all nodes which previously hosted this partition, you have permanent data loss - the partition state cannot be fully recovered. Get in touch with us for assistance with restarting the partition and accepting the data loss.

Request a snapshot for this partition:

restatectl snapshots create-snapshot {partition_id}

You can manually confirm that the snapshot was published to the expected destination. Within the specified snapshot bucket and prefix, you will find a partition-based tree structure. Navigate to the bucket path {prefix}/{partition_id} - you should see an entry for the new snapshot id matching the output of the create-snapshot command.
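For example, with the destination from step 1 and partition 0, a listing along these lines (again assuming Amazon S3 and the AWS CLI) should include the newly created snapshot:

aws s3 ls s3://snapshots/prefix/0/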

Step 2b (alternative): Manually transfer a snapshot from another node

If you are running a cluster but are unable to set up a snapshot repository in a shared object store destination, you can still recover node state by publishing a snapshot from a healthy node to the local filesystem and manually transferring it to the new node.

Experimenting with snapshots without an object store

Note that shared filesystems are not a supported target for cluster snapshots, and have known correctness risks. The file:// protocol does not support conditional updates, which makes it unsuitable for potentially contended operation.

Identify an up-to-date node which is running the partition by running:

restatectl partitions list

On this node, configure a local destination for the partition snapshot repository - make sure the target directory already exists, as shown below:

[worker.snapshots]
destination = "file:///mnt/restate-data/snapshots-repository"
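The directory must exist before the partition processor can write snapshots to it. For the example path above, you can create it with:

mkdir -p /mnt/restate-data/snapshots-repository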

Restart the node. If you have multiple nodes which may assume leadership for this partition, you will need to either repeat this on all of them, or temporarily shut them down. Create snapshot(s) for the affected partition(s):

restatectl snapshots create-snapshot {partition_id}

Copy the contents of the snapshot repository to the node experiencing issues, and configure that node to point at the copied snapshot repository. If you have multiple snapshots produced by multiple peer nodes, you can merge them all in the same location - each partition's snapshots will be written to a dedicated sub-directory for that partition.
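For example, assuming SSH access between the nodes and the repository path from the example above (the hostname is a placeholder), you could transfer the snapshots with:

rsync -av /mnt/restate-data/snapshots-repository/ restate-new-node:/mnt/restate-data/snapshots-repository/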

Step 3: Confirm that the affected node starts up and bootstraps its partition store from a snapshot

Once you have confirmed that a snapshot for the partition is available at the configured location, the configured repository access credentials have the necessary permissions, and the local node configuration is correct, you should see the partition processor start up and join the partition. If you have updated the Restate server configuration in the process, you should restart the server process to ensure that the latest changes are picked up.
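Once the node is back up, you can re-run the partition listing to confirm that the recovered node is now an active member of the partition:

restatectl partitions list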

Node id misconfiguration puts log server in data-loss state

If a misconfigured Restate node with the log server role attempts to join a cluster where the node id is already in use, you will observe that the newly started node aborts with an error:

ERROR restate_core::task_center: Shutting down: task 4 failed with: Node cannot start a log-server on N3, it has detected that it has lost its data. storage-state is `data-loss`

Restarting the existing node that previously owned this id will also cause it to stop with the same message. Follow these steps to return the original log server to service without losing its stored log segments.

First, prevent the misconfigured node from starting again until the configuration has been corrected. If this was a brand new node, there should be no data stored on it, and you may delete it altogether.
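For example, if the node runs as a systemd service (the unit name restate-server is a placeholder for your own setup), you can stop it and prevent it from restarting with:

systemctl stop restate-server
systemctl disable restate-server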

The reused node id has been marked with the data-loss storage state. This precaution tells the Restate control plane to avoid selecting this node as a member of new log nodesets. You can view the current status using the restatectl replicated-loglet tool:

restatectl replicated-loglet servers
Node configuration v21
Log chain v6
NODE  GEN   STORAGE-STATE  HISTORICAL LOGLETS  ACTIVE LOGLETS
N1    N1:5  read-write     8                   2
N2    N2:4  read-write     8                   2
N3    N3:6  data-loss      6                   0

You should also observe that the control plane is now avoiding using this node for log storage. This will result in reduced fault tolerance or even unavailability, depending on the configured minimum log replication:

restatectl logs list
Logs v3
└ Logs Provider: replicated
├ Log replication: {node: 2}
└ Nodeset size: 0
L-ID  FROM-LSN  KIND        LOGLET-ID  REPLICATION  SEQUENCER  NODESET
0     2         Replicated  0_1        {node: 2}    N2:1       [N1, N2]
1     2         Replicated  1_1        {node: 2}    N2:1       [N1, N2]

To restore the original node's ability to accept writes, we can update its metadata using the set-storage-state subcommand.

Dangerous operation

Only proceed if you are confident that you understand the reason why the node is in this state, and are certain that its locally stored data is still intact. Since Restate cannot automatically validate that it is safe to put this node back into service, we must use the --force flag to override the default state transition rules.

restatectl replicated-loglet set-storage-state --node-id 3 --storage-state 'read-write' --force
Node N3 storage-state updated from data-loss to read-write

You can validate that the server is once again being used for log storage using the logs list and replicated-loglet servers subcommands.
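For example:

restatectl replicated-loglet servers
restatectl logs list

N3 should now report the read-write storage state, and newly created loglets should again include it in their nodesets.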