Description
I am deploying solrcloud to a test cluster using the helm chart for the operator and then the helm chart for solrcloud itself (which uses the operator).
I am running solrCloud version 9.10, which is compatible with the chart according to the version matrix here (assuming 9.10 falls under 9.4+). This might mean the issue is specific to running with a version override (perhaps 9.10 behaves differently from the default 8.11 in the chart values), but I have no easy way of verifying that.
Pretty much all the settings are default, except storage, which is persistent for both zk and solrcloud. I also run only 1 instance of zk and 1 instance of solrcloud on my test cluster, but 3 of each on production and I see the same issue in both - it's just more common on test because of the single node setup.
Here is what happens:
- ZK starts, solrCloud starts, everything is good.
- At some point, for whatever reason, ZK goes down temporarily, but auto-recovers a few minutes later.
- solrCloud goes into a sort of crashed state, but is still accessible. When I open the admin panel I see KeeperErrorCode = Session expired for /aliases.json.
- It never recovers. I have to manually delete the pod to make it restart.
- After restart, it's fine again.
Note that this does not happen always. Sometimes solrCloud reconnects to zk just fine. Sometimes it doesn't. I cannot figure out why.
From some preliminary digging, I think I understand why it's not restarting in this case, but I wanted to verify that my assumptions are correct:
- The default readinessProbe reads /solr/admin/info/health, which returns a 503 in this case and causes the pod to be marked down, but not restarted.
- The default livenessProbe reads /solr/admin/info/system, which is still accessible and fine and doesn't cause any restart behaviour.
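For anyone who wants to check the same assumption on their own cluster, here is a rough Python sketch that requests both endpoints and reports what each probe would conclude. It is not part of the chart or the operator. The function names are made up for illustration, and the base URL is an assumption (for example, Solr reached on localhost:8983 through a port-forward). It relies on Kubernetes treating any HTTP status from 200 up to but not including 400 as a probe success.

```python
import urllib.error
import urllib.request


def probe_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request cannot connect."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as 503 arrive here; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def probe_passes(status):
    """Kubernetes HTTP probes pass on any status in the range 200..399."""
    return status is not None and 200 <= status < 400


def diagnose(health_status, system_status):
    """Classify the pod from the statuses of the two endpoints (hypothetical helper)."""
    ready = probe_passes(health_status)   # what the readinessProbe would see
    alive = probe_passes(system_status)   # what the default livenessProbe would see
    if ready and alive:
        return "ok"
    if alive and not ready:
        # The situation described in this issue: marked not ready, never restarted.
        return "stuck"
    return "down"


def check(base_url="http://localhost:8983"):
    """Query both endpoints on an assumed base URL and return the diagnosis."""
    health = probe_status(base_url + "/solr/admin/info/health")
    system = probe_status(base_url + "/solr/admin/info/system")
    return diagnose(health, system)
```

If `check()` returns "stuck" while the admin panel shows the session-expired error, that matches the explanation above: the readiness check fails, but the liveness check keeps passing, so the kubelet never restarts the container.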
I think the easiest fix would be to change the livenessProbe to also read the /solr/admin/info/health endpoint, so that it forces a pod restart when that endpoint starts failing.
I have overridden this in my own deployment and it works as expected, but I wanted to raise it in case someone else runs into it, or in case I am missing something and there is a better fix. Here are my overrides:
podOptions:
  livenessProbe:
    httpGet:
      path: /solr/admin/info/health
      port: 8983