SolrCloud node gets into unready state forever after Zookeeper goes down and recovers

I am deploying SolrCloud to a test cluster using the Helm chart for the operator, followed by the Helm chart for SolrCloud itself (which uses the operator).

I am running SolrCloud version 9.10, which is compatible with the chart according to the version matrix [here](https://apache.github.io/solr-operator/docs/upgrade-notes.html) (assuming 9.10 falls under 9.4+). The issue might therefore be specific to running with a version override (perhaps 9.10 behaves differently from the default 8.11 in the chart values), but I have no easy way of verifying that.

Pretty much all the settings are default, except storage, which is persistent for both ZK and SolrCloud. I run only 1 instance of ZK and 1 instance of SolrCloud on my test cluster, but 3 of each in production. I see the same issue in both; it's just more common on test because of the single-node setup.

Here is what happens:
1. ZK starts, SolrCloud starts, everything is good.
2. At some point, for whatever reason, ZK goes down temporarily, but auto-recovers a few minutes later.
3. SolrCloud goes into a sort of crashed state, but is still accessible. When I open the admin panel I see `KeeperErrorCode = Session expired for /aliases.json`.
4. It never recovers. I have to manually delete the pod to make it restart.
5. After the restart, it's fine again.

Note that this does not always happen. Sometimes SolrCloud reconnects to ZK just fine, and sometimes it doesn't. I cannot figure out why.

From some preliminary digging, I think I understand why the pod is not restarted in this case, but I wanted to verify that my assumptions are correct:

1. The default readinessProbe reads `/solr/admin/info/health`, which returns a 503 in this case. That causes the pod to be marked as not ready, but not restarted.
2. The default livenessProbe reads `/solr/admin/info/system`, which is still accessible and returns successfully, so it never triggers a restart.

I think the easiest fix would be to change the livenessProbe to also read the `/health` endpoint, so that the pod is restarted when that endpoint starts failing.
I have overridden this in my own deployment and it works as expected, but I wanted to raise it in case someone else runs into it, or in case I am missing something and there is a better fix. Here are my overrides:

```
podOptions:
  livenessProbe:
    httpGet:
      path: /solr/admin/info/health
      port: 8983
```
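
One possible refinement, as a sketch only: because ZK is described above as recovering after a few minutes, the liveness probe could be given enough slack that a short outage does not immediately restart the pod. `periodSeconds`, `timeoutSeconds`, and `failureThreshold` are standard Kubernetes probe fields, but the values below are illustrative assumptions and have not been tested against this chart. Whether the chart merges these fields with its defaults or replaces the whole probe is also an assumption that should be checked.

```
podOptions:
  livenessProbe:
    httpGet:
      path: /solr/admin/info/health
      port: 8983
    # Illustrative values (assumptions, not tested settings):
    periodSeconds: 20      # probe every 20 seconds
    timeoutSeconds: 5      # each probe attempt times out after 5 seconds
    failureThreshold: 15   # restart only after 15 consecutive failures (about 5 minutes)
```

The trade-off is that a node stuck in the expired-session state stays up for roughly `periodSeconds * failureThreshold` seconds before it is restarted, and a node would presumably also be restarted if ZK itself stayed unavailable for longer than that window.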

