TrueNAS Maintenance Logs 2025-10-01 - setting up Nextcloud preferences after the operating system deleted itself again

Commit a309c18db7 by NicholaiVogel, 2025-10-01 19:55:59 -06:00
3 changed files with 2238 additions and 0 deletions

# TrueNAS Maintenance Log
Date: 2025-10-01
## TL;DR
* Fixed Redis not starting due to bad container args. Set persistence and memory policy via env and verified.
* Stopped Postgres from ignoring tuned configs by removing the CLI override and explicitly setting sane values.
* Tuned ZFS dataset and host kernel settings for DB workloads.
* Verified results inside running pods.
---
## 1) Baseline snapshot script
Collected a fast system snapshot for Nextcloud troubleshooting.
```bash
sudo bash /tmp/nc_sysdump.sh
```
Why: one-shot view of OS, CPU, memory, ZFS, ARC, datasets, k3s pods, open ports, THP, swappiness, timers, and quick Redis/Postgres presence checks. [^snapshot]
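The script `/tmp/nc_sysdump.sh` itself is not reproduced in this log; a minimal sketch of the kind of checks it runs (the exact commands here are assumptions, not the real script):
```bash
#!/bin/sh
# Hypothetical reconstruction of a snapshot script like nc_sysdump.sh;
# the real one is not shown in this log.
uname -a; nproc; free -h                                  # OS, CPU, memory
zpool status -x                                           # pool health at a glance
zfs list -o name,used,avail,recordsize,compression | grep -Ei 'nextcloud|pg'
k3s kubectl get pods -A | grep -i nextcloud               # app platform state
ss -tlnp | head -n 20                                     # open listeners
cat /sys/kernel/mm/transparent_hugepage/enabled           # THP state
sysctl vm.swappiness
systemctl list-timers --no-pager | head -n 10
```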
---
## 2) ZFS and host tuning for Postgres
Applied ZFS dataset properties and kernel flags appropriate for OLTP.
```bash
PGDATA="Pool2/ix-applications/releases/nextcloud/volumes/ix_volumes/pgData"
sudo zfs set recordsize=8K atime=off compression=lz4 logbias=latency primarycache=all "$PGDATA"
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >/dev/null
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >/dev/null
# Persist THP disable and low swappiness
sudo tee /etc/systemd/system/disable-thp.service >/dev/null <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service
sudo sysctl vm.swappiness=1
echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-redis-db.conf >/dev/null
sudo sysctl --system
```
Why: 8K recordsize matches PG page size and reduces read-modify-write churn; logbias=latency reduces ZIL latency; THP off avoids latency spikes for PG; low swappiness keeps hot pages in RAM. [^zfs-pgdata] [^thp] [^swappiness]
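A quick way to confirm everything landed, worth rerunning after any reinstall:
```bash
zfs get recordsize,atime,compression,logbias,primarycache "$PGDATA"
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: always madvise [never]
sysctl vm.swappiness                              # expect: vm.swappiness = 1
```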
---
## 3) Redis: persistence and memory policy
The initial failure came from passing raw `--` args to the Bitnami entrypoint, which parsed them as shell options and crashed. Fixed by removing the args and configuring via environment variables instead.
**Bad args removed**
```bash
NS=ix-nextcloud
DEP=nextcloud-redis
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
{"op":"remove","path":"/spec/template/spec/containers/0/args"}
]'
```
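Worth confirming the args are actually gone before layering env config on top; empty output means the field was removed:
```bash
k3s kubectl -n $NS get deploy $DEP \
  -o jsonpath='{.spec.template.spec.containers[0].args}'; echo
```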
**Good settings applied via env**
```bash
k3s kubectl -n $NS set env deploy/$DEP \
REDIS_APPENDONLY=yes \
REDIS_APPENDFSYNC=everysec \
REDIS_MAXMEMORY=8gb \
REDIS_MAXMEMORY_POLICY=allkeys-lru
k3s kubectl -n $NS rollout restart deploy/$DEP
```
**Verification**
```bash
NS=ix-nextcloud
POD=$(k3s kubectl -n $NS get pods | awk '/nextcloud-redis/{print $1; exit}')
REDIS_PASS=$(k3s kubectl -n $NS get secret nextcloud-redis-creds -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d)
k3s kubectl -n $NS exec -it "$POD" -- sh -lc "/opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" INFO | egrep 'aof_enabled|maxmemory|maxmemory_policy'"
# Output:
# maxmemory:8589934592
# maxmemory_human:8.00G
# maxmemory_policy:allkeys-lru
# aof_enabled:1
```
Why: Bitnami Redis prefers env variables to configure persistence and memory policy. This avoids shell parsing issues and persists across restarts. [^redis-env] [^redis-aof] [^redis-policy]
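As a cross-check, `CONFIG GET` returns the values as Redis parsed them rather than INFO's derived fields (reusing `$POD` and `$REDIS_PASS` from the verification block above):
```bash
k3s kubectl -n $NS exec -it "$POD" -- sh -lc \
  "/opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" CONFIG GET maxmemory-policy; \
   /opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" CONFIG GET appendfsync"
```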
---
## 4) Postgres: stop the CLI override, then tune
Symptom: `shared_buffers` kept showing 1 GB and `pg_settings.source = 'command line'`. Root cause was a `-c shared_buffers=1024MB` passed via deployment. That always wins over `postgresql.conf`, `conf.d`, and `ALTER SYSTEM`.
**Remove or replace CLI args**
```bash
NS=ix-nextcloud
DEP=nextcloud-postgres
# Remove args if present
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
{"op":"remove","path":"/spec/template/spec/containers/0/args"}
]' || true
# Replace with tuned args explicitly
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
{"op":"add","path":"/spec/template/spec/containers/0/args","value":
["-c","shared_buffers=16GB",
"-c","max_connections=200",
"-c","wal_compression=on",
"-c","max_wal_size=8GB",
"-c","random_page_cost=1.25"]}]'
k3s kubectl -n $NS rollout restart deploy/$DEP
```
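Waiting for the rollout to finish before verifying avoids exec'ing into the old pod by accident:
```bash
k3s kubectl -n $NS rollout status deploy/$DEP --timeout=120s
```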
**Resource limit raised in App UI**
* Memory limit increased to 24 GiB to allow 16 GiB buffers without OOM.
**Verification inside pod**
```bash
SEC=nextcloud-postgres-creds
DBUSER=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_USER}' | base64 -d)
DBPASS=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
DBNAME=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_DB}' | base64 -d)
POD=$(k3s kubectl -n $NS get pods -o name | sed -n 's|pod/||p' | grep -E '^nextcloud-postgres' | head -1)
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -Atc \
\"select name,setting,unit,source from pg_settings
where name in ('shared_buffers','effective_cache_size','wal_compression','max_wal_size','random_page_cost')
order by name;\""
```
Expected results after the change:
* `shared_buffers` source should be command line with `16GB`
* `effective_cache_size` from conf.d set to 40 GB
* `wal_compression=on`, `max_wal_size=8GB`, `random_page_cost=1.25`
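With `-At` formatting the rows come back pipe-delimited in PostgreSQL's internal units (`shared_buffers` and `effective_cache_size` in 8 kB pages, `max_wal_size` in MB). Illustrative output for the values above:
```text
effective_cache_size|5242880|8kB|configuration file
max_wal_size|8192|MB|command line
random_page_cost|1.25||command line
shared_buffers|2097152|8kB|command line
wal_compression|on||command line
```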
**Cgroup limit check**
```bash
k3s kubectl -n $NS exec "$POD" -- sh -lc 'cat /sys/fs/cgroup/memory.max || cat /sys/fs/cgroup/memory/memory.limit_in_bytes'
# 25769803776
```
**Huge pages status**
```bash
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"psql -Atc \"show huge_pages;\" -U '$DBUSER' -h 127.0.0.1 -d '$DBNAME'"
# off
```
Why: Precedence is CLI args over config files. Removing or replacing the CLI flag is the only way to make buffers larger than 1 GB take effect in this chart. The resource limit must also allow it. [^pg-conf-order] [^pg-memory]
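To see the precedence in action: even `ALTER SYSTEM` loses to the flag. A demonstration, reusing the pod and credentials from the verification step (requires a superuser role, see [^pg-role]; values are illustrative):
```bash
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
  "PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -c \"ALTER SYSTEM SET shared_buffers = '2GB';\""
# Even after a restart, pg_settings.source for shared_buffers still reads
# 'command line': postgresql.auto.conf ranks below -c flags.
# Undo the experiment:
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
  "PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -c 'ALTER SYSTEM RESET shared_buffers;'"
```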
---
## 5) Small cleanups and guardrails
* Created a helper to reapply Redis tuning quickly:
```bash
cat >/root/reapply-redis-tuning.sh <<'EOF'
#!/bin/sh
# Reapply the Redis persistence and memory-policy settings, then restart
set -eu
NS=ix-nextcloud
DEP=nextcloud-redis
k3s kubectl -n $NS set env deploy/$DEP \
REDIS_APPENDONLY=yes \
REDIS_APPENDFSYNC=everysec \
REDIS_MAXMEMORY=8gb \
REDIS_MAXMEMORY_POLICY=allkeys-lru
k3s kubectl -n $NS rollout restart deploy/$DEP
EOF
chmod +x /root/reapply-redis-tuning.sh
```
* Verified Nextcloud's Redis password from the correct secret key `REDIS_PASSWORD` after earlier key-name misses (a safer lookup pattern is sketched below).
Why: quick reapply for tunables, fewer fat-fingered loops.
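A lookup pattern that avoids the key-name guessing: list the secret's keys first, then decode the confirmed one:
```bash
# Show which keys the secret actually contains
k3s kubectl -n ix-nextcloud get secret nextcloud-redis-creds -o jsonpath='{.data}'; echo
# Decode the confirmed key
k3s kubectl -n ix-nextcloud get secret nextcloud-redis-creds \
  -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d; echo
```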
---
## Validation snapshots
### Redis quick state
```text
connected_clients:11
used_memory_human:1.46M
maxmemory_human:8.00G
maxmemory_policy:allkeys-lru
aof_enabled:1
aof_last_write_status:ok
instantaneous_ops_per_sec:95
evicted_keys:0
role:master
```
### Postgres quick state
* `shared_buffers` now controlled via CLI and aligned with resource limit
* `effective_cache_size=40GB` from conf.d
* `wal_compression=on`, `max_wal_size=8GB`, `random_page_cost=1.25` confirmed
---
## Known gotchas encountered
* Exec'd into the wrong pods/containers repeatedly. Use the namespace and label selectors, and add `-c` only when the pod actually has multiple containers. [^k3s-pod]
* Bitnami Redis mishandles raw `--` args passed through `args` (see section 3). Use the env variables the chart supports instead.
* Postgres role confusion: default superuser is not always `postgres` in this chart. Use credentials from `nextcloud-postgres-creds`. [^pg-role]
---
## Next actions
* Optional: set `effective_io_concurrency=256` and `maintenance_work_mem=2GB` via conf.d only if they are not already pinned on the CLI, then reload (see the sketch after this list).
* Consider sizing `shared_buffers` at ~25% of cgroup memory for mixed workloads. The current 16 GB on a 24 GiB limit is aggressive but workable if the pod keeps headroom. [^pg-sizing]
* Keep `work_mem` moderate to avoid per-query memory explosion; the current `128MB` is aggressive if concurrency spikes.
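A minimal sketch of that conf.d drop-in, reusing `$PGDATA` from section 2; `/mnt/$PGDATA` is the dataset's host mountpoint on TrueNAS, but whether this chart reads a `conf.d` there is an assumption to verify first:
```bash
# Hypothetical location; verify the chart's conf.d mount before writing here,
# and create the directory first if it does not exist
cat <<'EOF' | sudo tee "/mnt/$PGDATA/conf.d/99-extra-tuning.conf" >/dev/null
effective_io_concurrency = 256   # more concurrent prefetch for bitmap scans
maintenance_work_mem = 2GB       # faster VACUUM and index builds
EOF
# Both settings are reloadable, no restart required:
# psql -c 'select pg_reload_conf();'
```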
---
## Footnotes
[^snapshot]: The snapshot script prints OS, CPU, memory, ZFS pools and ARC, datasets matching Nextcloud and DB, app platform state, network listeners, THP, swappiness, timers, and versions. Good first move before any tuning.
[^zfs-pgdata]: ZFS `recordsize=8K` matches Postgres 8 KB page size; `atime=off` avoids metadata writes; `compression=lz4` is typically net positive for WAL and heap; `logbias=latency` optimizes synchronous intent logging. These are standard PG-on-ZFS choices.
[^thp]: Transparent Huge Pages can cause latency spikes for memory alloc and compaction. PG recommends `never`. You persisted it with a systemd unit and verified `huge_pages=off` in PG.
[^swappiness]: `vm.swappiness=1` favors keeping hot working sets in memory. DB nodes typically set this low to avoid writeback storms.
[^redis-env]: The TrueNAS Bitnami chart maps well-known env vars like `REDIS_APPENDONLY` and `REDIS_MAXMEMORY_POLICY` into redis.conf, avoiding brittle `args` parsing.
[^redis-aof]: `appendonly yes` with `everysec` gives durability with good throughput. It is the sane default for NC caching plus locking patterns.
[^redis-policy]: `allkeys-lru` prevents unbounded memory growth and prioritizes hot keys. With `maxmemory 8gb`, eviction is predictable.
[^pg-conf-order]: Postgres configuration precedence: command-line `-c` flags override everything else, including `ALTER SYSTEM` (stored in `postgresql.auto.conf`), which in turn overrides `postgresql.conf` and any files it includes. If the container passes `-c shared_buffers=1024MB`, no config file can change that setting.
[^pg-memory]: With a 24 GiB cgroup limit, `shared_buffers=16GB` is aggressive but acceptable if app memory and FS cache are still healthy. Monitor `OOMKilled` events and PG memory stats.
[^k3s-pod]: When kubectl says “container not found,” the pod likely has a single container with a different name than you assumed. Use `kubectl -n NS get pod POD -o jsonpath='{.spec.containers[*].name}'` to confirm.
[^pg-role]: The Bitnami PG image often creates the app user as the primary DB user. The secret shows the authoritative `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` you should use.
[^pg-sizing]: Rule of thumb: `shared_buffers` at 20-25 percent of RAM for mixed workloads, higher only if the rest of the stack is memory-light and you monitor for OOM. `effective_cache_size` can be 2-3x buffers.
---
## Appendix: Handy one-liners
**Show who is forcing PG settings**
```sql
select name,setting,source,sourcefile
from pg_settings
where name in ('shared_buffers','effective_cache_size','wal_compression','max_wal_size','random_page_cost')
order by name;
```
**Show current pod memory limit**
```bash
cat /sys/fs/cgroup/memory.max || cat /sys/fs/cgroup/memory/memory.limit_in_bytes
```
**Redis sanity**
```bash
REDISCLI_AUTH="$REDIS_PASS" redis-cli INFO | egrep -i 'used_memory_human|maxmemory_human|maxmemory_policy|aof_enabled|evicted_keys'
```
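**Check whether Postgres got OOM-killed**

Per [^pg-memory], worth watching after raising `shared_buffers`; this reads the container's last termination reason:
```bash
POD=$(k3s kubectl -n ix-nextcloud get pods -o name | grep nextcloud-postgres | head -1)
k3s kubectl -n ix-nextcloud get "$POD" \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'; echo
# "OOMKilled" means the limit was hit; empty output means no prior termination
```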

Also in this commit: `Logs/2025-10-01/Log.md` (1945 lines; diff suppressed because it is too large) and `README.md`:

# TRUENAS SCALE MAINTENANCE LOGS
### A REPO OF FUCKING LOGS
(FUCK YOU)