TrueNAS Maintenance Logs 2025-10-01 - setting up Nextcloud preferences after the operating system deleted itself again

Commit a309c18db7 by NicholaiVogel, 2025-10-01 19:55:59 -06:00
3 changed files with 2238 additions and 0 deletions

# TrueNAS Maintenance Log
Date: 2025-10-01
## TL;DR
* Fixed Redis not starting due to bad container args. Set persistence and memory policy via env and verified.
* Stopped Postgres from ignoring tuned configs by removing the CLI override and explicitly setting sane values.
* Tuned ZFS dataset and host kernel settings for DB workloads.
* Verified results inside running pods.
---
## 1) Baseline snapshot script
Collected a fast system snapshot for Nextcloud troubleshooting.
```bash
sudo bash /tmp/nc_sysdump.sh
```
Why: one-shot view of OS, CPU, memory, ZFS, ARC, datasets, k3s pods, open ports, THP, swappiness, timers, and quick Redis/Postgres presence checks. [^snapshot]
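The script `/tmp/nc_sysdump.sh` itself is not reproduced in this log; a minimal sketch of the kind of checks it runs (the exact commands here are assumptions, not the real script):
```bash
#!/bin/sh
# Hypothetical reconstruction of a snapshot script like nc_sysdump.sh;
# the real one is not shown in this log.
uname -a; nproc; free -h                                  # OS, CPU, memory
zpool status -x                                           # pool health at a glance
zfs list -o name,used,avail,recordsize,compression | grep -Ei 'nextcloud|pg'
k3s kubectl get pods -A | grep -i nextcloud               # app platform state
ss -tlnp | head -n 20                                     # open listeners
cat /sys/kernel/mm/transparent_hugepage/enabled           # THP state
sysctl vm.swappiness
systemctl list-timers --no-pager | head -n 10
```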
---
## 2) ZFS and host tuning for Postgres
Applied ZFS dataset properties and kernel flags appropriate for OLTP.
```bash
PGDATA="Pool2/ix-applications/releases/nextcloud/volumes/ix_volumes/pgData"
sudo zfs set recordsize=8K atime=off compression=lz4 logbias=latency primarycache=all "$PGDATA"
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >/dev/null
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >/dev/null
# Persist THP disable and low swappiness
sudo tee /etc/systemd/system/disable-thp.service >/dev/null <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service
sudo sysctl vm.swappiness=1
echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-redis-db.conf >/dev/null
sudo sysctl --system
```
Why: 8K recordsize matches PG page size and reduces read-modify-write churn; logbias=latency reduces ZIL latency; THP off avoids latency spikes for PG; low swappiness keeps hot pages in RAM. [^zfs-pgdata] [^thp] [^swappiness]
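A quick way to confirm everything landed, worth rerunning after any reinstall:
```bash
zfs get recordsize,atime,compression,logbias,primarycache "$PGDATA"
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: always madvise [never]
sysctl vm.swappiness                              # expect: vm.swappiness = 1
```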
---
## 3) Redis: persistence and memory policy
The initial failure came from passing raw `--` args to the Bitnami entrypoint, which parsed them as shell options and crashed. Fixed by removing the args and configuring via environment variables instead.
**Bad args removed**
```bash
NS=ix-nextcloud
DEP=nextcloud-redis
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
{"op":"remove","path":"/spec/template/spec/containers/0/args"}
]'
```
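Worth confirming the args are actually gone before layering env config on top; empty output means the field was removed:
```bash
k3s kubectl -n $NS get deploy $DEP \
  -o jsonpath='{.spec.template.spec.containers[0].args}'; echo
```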
**Good settings applied via env**
```bash
k3s kubectl -n $NS set env deploy/$DEP \
REDIS_APPENDONLY=yes \
REDIS_APPENDFSYNC=everysec \
REDIS_MAXMEMORY=8gb \
REDIS_MAXMEMORY_POLICY=allkeys-lru
k3s kubectl -n $NS rollout restart deploy/$DEP
```
**Verification**
```bash
NS=ix-nextcloud
POD=$(k3s kubectl -n $NS get pods | awk '/nextcloud-redis/{print $1; exit}')
REDIS_PASS=$(k3s kubectl -n $NS get secret nextcloud-redis-creds -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d)
k3s kubectl -n $NS exec -it "$POD" -- sh -lc "/opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" INFO | egrep 'aof_enabled|maxmemory|maxmemory_policy'"
# Output:
# maxmemory:8589934592
# maxmemory_human:8.00G
# maxmemory_policy:allkeys-lru
# aof_enabled:1
```
Why: Bitnami Redis prefers env variables to configure persistence and memory policy. This avoids shell parsing issues and persists across restarts. [^redis-env] [^redis-aof] [^redis-policy]
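As a cross-check, `CONFIG GET` returns the values as Redis parsed them rather than INFO's derived fields (reusing `$POD` and `$REDIS_PASS` from the verification block above):
```bash
k3s kubectl -n $NS exec -it "$POD" -- sh -lc \
  "/opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" CONFIG GET maxmemory-policy; \
   /opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" CONFIG GET appendfsync"
```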
---
## 4) Postgres: stop the CLI override, then tune
Symptom: `shared_buffers` kept showing 1 GB and `pg_settings.source = 'command line'`. Root cause was a `-c shared_buffers=1024MB` passed via deployment. That always wins over `postgresql.conf`, `conf.d`, and `ALTER SYSTEM`.
**Remove or replace CLI args**
```bash
NS=ix-nextcloud
DEP=nextcloud-postgres
# Remove args if present
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
{"op":"remove","path":"/spec/template/spec/containers/0/args"}
]' || true
# Replace with tuned args explicitly
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
{"op":"add","path":"/spec/template/spec/containers/0/args","value":
["-c","shared_buffers=16GB",
"-c","max_connections=200",
"-c","wal_compression=on",
"-c","max_wal_size=8GB",
"-c","random_page_cost=1.25"]}]'
k3s kubectl -n $NS rollout restart deploy/$DEP
```
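Waiting for the rollout to finish before verifying avoids exec'ing into the old pod by accident:
```bash
k3s kubectl -n $NS rollout status deploy/$DEP --timeout=120s
```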
**Resource limit raised in App UI**
* Memory limit increased to 24 GiB to allow 16 GiB buffers without OOM.
**Verification inside pod**
```bash
SEC=nextcloud-postgres-creds
DBUSER=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_USER}' | base64 -d)
DBPASS=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
DBNAME=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_DB}' | base64 -d)
POD=$(k3s kubectl -n $NS get pods -o name | sed -n 's|pod/||p' | grep -E '^nextcloud-postgres' | head -1)
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -Atc \
\"select name,setting,unit,source from pg_settings
where name in ('shared_buffers','effective_cache_size','wal_compression','max_wal_size','random_page_cost')
order by name;\""
```
Expected results after the change:
* `shared_buffers` source should be command line with `16GB`
* `effective_cache_size` from conf.d set to 40 GB
* `wal_compression=on`, `max_wal_size=8GB`, `random_page_cost=1.25`
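With `-At` formatting the rows come back pipe-delimited in PostgreSQL's internal units (`shared_buffers` and `effective_cache_size` in 8 kB pages, `max_wal_size` in MB). Illustrative output for the values above:
```text
effective_cache_size|5242880|8kB|configuration file
max_wal_size|8192|MB|command line
random_page_cost|1.25||command line
shared_buffers|2097152|8kB|command line
wal_compression|on||command line
```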
**Cgroup limit check**
```bash
k3s kubectl -n $NS exec "$POD" -- sh -lc 'cat /sys/fs/cgroup/memory.max || cat /sys/fs/cgroup/memory/memory.limit_in_bytes'
# 25769803776
```
**Huge pages status**
```bash
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"psql -Atc \"show huge_pages;\" -U '$DBUSER' -h 127.0.0.1 -d '$DBNAME'"
# off
```
Why: Precedence is CLI args over config files. Removing or replacing the CLI flag is the only way to make buffers larger than 1 GB take effect in this chart. The resource limit must also allow it. [^pg-conf-order] [^pg-memory]
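To see the precedence in action: even `ALTER SYSTEM` loses to the flag. A demonstration, reusing the pod and credentials from the verification step (requires a superuser role, see [^pg-role]; values are illustrative):
```bash
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
  "PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -c \"ALTER SYSTEM SET shared_buffers = '2GB';\""
# Even after a restart, pg_settings.source for shared_buffers still reads
# 'command line': postgresql.auto.conf ranks below -c flags.
# Undo the experiment:
k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
  "PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -c 'ALTER SYSTEM RESET shared_buffers;'"
```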
---
## 5) Small cleanups and guardrails
* Created a helper to reapply Redis tuning quickly:
```bash
cat >/root/reapply-redis-tuning.sh <<'EOF'
#!/bin/sh
# Reapply the Redis persistence and memory-policy settings, then restart
set -eu
NS=ix-nextcloud
DEP=nextcloud-redis
k3s kubectl -n $NS set env deploy/$DEP \
REDIS_APPENDONLY=yes \
REDIS_APPENDFSYNC=everysec \
REDIS_MAXMEMORY=8gb \
REDIS_MAXMEMORY_POLICY=allkeys-lru
k3s kubectl -n $NS rollout restart deploy/$DEP
EOF
chmod +x /root/reapply-redis-tuning.sh
```
* Verified Nextcloud's Redis password from the correct secret key `REDIS_PASSWORD` after earlier key-name misses (a safer lookup pattern is sketched below).
Why: quick reapply for tunables, fewer fat-fingered loops.
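A lookup pattern that avoids the key-name guessing: list the secret's keys first, then decode the confirmed one:
```bash
# Show which keys the secret actually contains
k3s kubectl -n ix-nextcloud get secret nextcloud-redis-creds -o jsonpath='{.data}'; echo
# Decode the confirmed key
k3s kubectl -n ix-nextcloud get secret nextcloud-redis-creds \
  -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d; echo
```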
---
## Validation snapshots
### Redis quick state
```text
connected_clients:11
used_memory_human:1.46M
maxmemory_human:8.00G
maxmemory_policy:allkeys-lru
aof_enabled:1
aof_last_write_status:ok
instantaneous_ops_per_sec:95
evicted_keys:0
role:master
```
### Postgres quick state
* `shared_buffers` now controlled via CLI and aligned with resource limit
* `effective_cache_size=40GB` from conf.d
* `wal_compression=on`, `max_wal_size=8GB`, `random_page_cost=1.25` confirmed
---
## Known gotchas encountered
* Exec'd into the wrong pods/containers repeatedly. Use the namespace and label selectors, and add `-c` only when the pod actually has multiple containers. [^k3s-pod]
* Bitnami Redis mishandles raw `--` args passed through `args` (see section 3). Use the env variables the chart supports instead.
* Postgres role confusion: default superuser is not always `postgres` in this chart. Use credentials from `nextcloud-postgres-creds`. [^pg-role]
---
## Next actions
* Optional: set `effective_io_concurrency=256` and `maintenance_work_mem=2GB` via conf.d only if they are not already pinned on the CLI, then reload (see the sketch after this list).
* Consider sizing `shared_buffers` at ~25% of cgroup memory for mixed workloads. The current 16 GB on a 24 GiB limit is aggressive but workable if the pod keeps headroom. [^pg-sizing]
* Keep `work_mem` moderate to avoid per-query memory explosion; the current `128MB` is aggressive if concurrency spikes.
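A minimal sketch of that conf.d drop-in, reusing `$PGDATA` from section 2; `/mnt/$PGDATA` is the dataset's host mountpoint on TrueNAS, but whether this chart reads a `conf.d` there is an assumption to verify first:
```bash
# Hypothetical location; verify the chart's conf.d mount before writing here,
# and create the directory first if it does not exist
cat <<'EOF' | sudo tee "/mnt/$PGDATA/conf.d/99-extra-tuning.conf" >/dev/null
effective_io_concurrency = 256   # more concurrent prefetch for bitmap scans
maintenance_work_mem = 2GB       # faster VACUUM and index builds
EOF
# Both settings are reloadable, no restart required:
# psql -c 'select pg_reload_conf();'
```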
---
## Footnotes
[^snapshot]: The snapshot script prints OS, CPU, memory, ZFS pools and ARC, datasets matching Nextcloud and DB, app platform state, network listeners, THP, swappiness, timers, and versions. Good first move before any tuning.
[^zfs-pgdata]: ZFS `recordsize=8K` matches Postgres 8 KB page size; `atime=off` avoids metadata writes; `compression=lz4` is typically net positive for WAL and heap; `logbias=latency` optimizes synchronous intent logging. These are standard PG-on-ZFS choices.
[^thp]: Transparent Huge Pages can cause latency spikes for memory alloc and compaction. PG recommends `never`. You persisted it with a systemd unit and verified `huge_pages=off` in PG.
[^swappiness]: `vm.swappiness=1` favors keeping hot working sets in memory. DB nodes typically set this low to avoid writeback storms.
[^redis-env]: The TrueNAS Bitnami chart maps well-known env vars like `REDIS_APPENDONLY` and `REDIS_MAXMEMORY_POLICY` into redis.conf, avoiding brittle `args` parsing.
[^redis-aof]: `appendonly yes` with `everysec` gives durability with good throughput. It is the sane default for NC caching plus locking patterns.
[^redis-policy]: `allkeys-lru` prevents unbounded memory growth and prioritizes hot keys. With `maxmemory 8gb`, eviction is predictable.
[^pg-conf-order]: Postgres configuration precedence: command-line `-c` flags override everything else, including `ALTER SYSTEM` (stored in `postgresql.auto.conf`), which in turn overrides `postgresql.conf` and any files it includes. If the container passes `-c shared_buffers=1024MB`, no config file can change that setting.
[^pg-memory]: With a 24 GiB cgroup limit, `shared_buffers=16GB` is aggressive but acceptable if app memory and FS cache are still healthy. Monitor `OOMKilled` events and PG memory stats.
[^k3s-pod]: When kubectl says “container not found,” the pod likely has a single container with a different name than you assumed. Use `kubectl -n NS get pod POD -o jsonpath='{.spec.containers[*].name}'` to confirm.
[^pg-role]: The Bitnami PG image often creates the app user as the primary DB user. The secret shows the authoritative `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` you should use.
[^pg-sizing]: Rule of thumb: `shared_buffers` at 20-25 percent of RAM for mixed workloads, higher only if the rest of the stack is memory-light and you monitor for OOM. `effective_cache_size` can be 2-3x buffers.
---
## Appendix: Handy one-liners
**Show who is forcing PG settings**
```sql
select name,setting,source,sourcefile
from pg_settings
where name in ('shared_buffers','effective_cache_size','wal_compression','max_wal_size','random_page_cost')
order by name;
```
**Show current pod memory limit**
```bash
cat /sys/fs/cgroup/memory.max || cat /sys/fs/cgroup/memory/memory.limit_in_bytes
```
**Redis sanity**
```bash
REDISCLI_AUTH="$REDIS_PASS" redis-cli INFO | egrep -i 'used_memory_human|maxmemory_human|maxmemory_policy|aof_enabled|evicted_keys'
```
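**Check whether Postgres got OOM-killed**

Per [^pg-memory], worth watching after raising `shared_buffers`; this reads the container's last termination reason:
```bash
POD=$(k3s kubectl -n ix-nextcloud get pods -o name | grep nextcloud-postgres | head -1)
k3s kubectl -n ix-nextcloud get "$POD" \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'; echo
# "OOMKilled" means the limit was hit; empty output means no prior termination
```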

Also in this commit: `Logs/2025-10-01/Log.md` (1945 lines; diff suppressed because it is too large) and `README.md`:

# TRUENAS SCALE MAINTENANCE LOGS
### A REPO OF FUCKING LOGS
(FUCK YOU)