
TrueNAS Maintenance Log

Date: 2025-10-01

TL;DR

  • Fixed Redis not starting due to bad container args. Set persistence and memory policy via env and verified.
  • Stopped Postgres from ignoring tuned configs by removing the CLI override and explicitly setting sane values.
  • Tuned ZFS dataset and host kernel settings for DB workloads.
  • Verified results inside running pods.

1) Baseline snapshot script

Collected a fast system snapshot for Nextcloud troubleshooting.

sudo bash /tmp/nc_sysdump.sh

Why: one-shot view of OS, CPU, memory, ZFS, ARC, datasets, k3s pods, open ports, THP, swappiness, timers, and quick Redis/Postgres presence checks. 1
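
For orientation, a minimal sketch of the kind of commands such a snapshot script might run; the actual /tmp/nc_sysdump.sh is not reproduced here, so treat this as an approximation:

#!/bin/sh
# hypothetical sketch, not the real nc_sysdump.sh
uname -a; uptime
free -h
zpool list; zpool status -x
arc_summary | head -n 40                      # ARC size and hit-rate summary
zfs list -o name,used,avail,recordsize,compression | grep -Ei 'nextcloud|pgdata|redis'
k3s kubectl get pods -A -o wide
ss -tlnp                                      # listening sockets
cat /sys/kernel/mm/transparent_hugepage/enabled
sysctl vm.swappiness
systemctl list-timers --no-pager | head -n 20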


2) ZFS and host tuning for Postgres

Applied ZFS dataset properties and kernel flags appropriate for OLTP.

PGDATA="Pool2/ix-applications/releases/nextcloud/volumes/ix_volumes/pgData"
sudo zfs set recordsize=8K atime=off compression=lz4 logbias=latency primarycache=all "$PGDATA"
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >/dev/null
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >/dev/null

# Persist THP disable and low swappiness
sudo tee /etc/systemd/system/disable-thp.service >/dev/null <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service

sudo sysctl vm.swappiness=1
echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-redis-db.conf >/dev/null
sudo sysctl --system

Why: 8K recordsize matches PG page size and reduces read-modify-write churn; logbias=latency reduces ZIL latency; THP off avoids latency spikes for PG; low swappiness keeps hot pages in RAM. 2 3 4
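
To confirm the dataset properties and kernel flags actually took effect, a quick readback:

zfs get -o property,value recordsize,atime,compression,logbias,primarycache "$PGDATA"
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: always madvise [never]
cat /sys/kernel/mm/transparent_hugepage/defrag
sysctl vm.swappiness                              # expect: vm.swappiness = 1
systemctl is-enabled disable-thp.service          # expect: enabled

Note that recordsize applies only to newly written blocks, so existing data keeps its old block size until it is rewritten.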


3) Redis: persistence and memory policy

Initial failure was due to passing raw --option arguments to the Bitnami entrypoint, which treated them as shell options and crashed. Fixed by removing the args and configuring via environment variables instead.
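
Before patching, a quick way to see what the container is actually being started with (an empty result means no command/args override is set):

k3s kubectl -n ix-nextcloud get deploy nextcloud-redis \
  -o jsonpath='{.spec.template.spec.containers[0].command}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'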

Bad args removed

NS=ix-nextcloud
DEP=nextcloud-redis
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
 {"op":"remove","path":"/spec/template/spec/containers/0/args"}
]'

Good settings applied via env

k3s kubectl -n $NS set env deploy/$DEP \
  REDIS_APPENDONLY=yes \
  REDIS_APPENDFSYNC=everysec \
  REDIS_MAXMEMORY=8gb \
  REDIS_MAXMEMORY_POLICY=allkeys-lru
k3s kubectl -n $NS rollout restart deploy/$DEP

Verification

NS=ix-nextcloud
POD=$(k3s kubectl -n $NS get pods | awk '/nextcloud-redis/{print $1; exit}')
REDIS_PASS=$(k3s kubectl -n $NS get secret nextcloud-redis-creds -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d)

k3s kubectl -n $NS exec -it "$POD" -- sh -lc "/opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" INFO | egrep 'aof_enabled|maxmemory|maxmemory_policy'"
# Output:
# maxmemory:8589934592
# maxmemory_human:8.00G
# maxmemory_policy:allkeys-lru
# aof_enabled:1

Why: Bitnami Redis prefers env variables to configure persistence and memory policy. This avoids shell parsing issues and persists across restarts. 5 6 7
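
A second, independent check is to ask Redis for the live values via CONFIG GET rather than the INFO counters:

k3s kubectl -n $NS exec -it "$POD" -- sh -lc \
  "/opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" CONFIG GET 'maxmemory*'; \
   /opt/bitnami/redis/bin/redis-cli -a \"$REDIS_PASS\" CONFIG GET appendonly"
# Expect maxmemory 8589934592, maxmemory-policy allkeys-lru, appendonly yes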


4) Postgres: stop the CLI override, then tune

Symptom: shared_buffers kept showing 1 GB and pg_settings.source = 'command line'. Root cause was a -c shared_buffers=1024MB passed via deployment. That always wins over postgresql.conf, conf.d, and ALTER SYSTEM.
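
The override is visible straight from the deployment spec before touching anything:

k3s kubectl -n ix-nextcloud get deploy nextcloud-postgres \
  -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'
# the offending -c shared_buffers=1024MB shows up here while the override is still in place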

Remove or replace CLI args

NS=ix-nextcloud
DEP=nextcloud-postgres

# Remove args if present
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
 {"op":"remove","path":"/spec/template/spec/containers/0/args"}
]' || true

# Replace with tuned args explicitly
k3s kubectl -n $NS patch deploy $DEP --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/args","value":
   ["-c","shared_buffers=16GB",
    "-c","max_connections=200",
    "-c","wal_compression=on",
    "-c","max_wal_size=8GB",
    "-c","random_page_cost=1.25"]}]'
k3s kubectl -n $NS rollout restart deploy/$DEP

Resource limit raised in App UI

  • Memory limit increased to 24 GiB to allow 16 GiB buffers without OOM.
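
The UI change can be cross-checked from the deployment spec; the cgroup check further down confirms it inside the pod:

k3s kubectl -n $NS get deploy $DEP \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}'
# expect 24Gi or an equivalent value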

Verification inside pod

SEC=nextcloud-postgres-creds
DBUSER=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_USER}' | base64 -d)
DBPASS=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
DBNAME=$(k3s kubectl -n $NS get secret $SEC -o jsonpath='{.data.POSTGRES_DB}' | base64 -d)
POD=$(k3s kubectl -n $NS get pods -o name | sed -n 's|pod/||p' | grep -E '^nextcloud-postgres' | head -1)

k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -Atc \
\"select name,setting,unit,source from pg_settings
  where name in ('shared_buffers','effective_cache_size','wal_compression','max_wal_size','random_page_cost')
  order by name;\""

Expected results after change:

  • shared_buffers should be 16GB, with source still 'command line' (now from the tuned args)
  • effective_cache_size should be 40GB, sourced from conf.d
  • wal_compression=on, max_wal_size=8GB, random_page_cost=1.25

Cgroup limit check

k3s kubectl -n $NS exec "$POD" -- sh -lc 'cat /sys/fs/cgroup/memory.max || cat /sys/fs/cgroup/memory/memory.limit_in_bytes'
# 25769803776

Huge pages status

k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"psql -Atc \"show huge_pages;\" -U '$DBUSER' -h 127.0.0.1 -d '$DBNAME'"
# off

Why: Precedence is CLI args over config files. Removing or replacing the CLI flag is the only way to make buffers larger than 1 GB take effect in this chart. The resource limit must also allow it. 8 9
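
For completeness, pg_file_settings shows what the config files themselves contain, as opposed to the live values reported by pg_settings:

k3s kubectl -n $NS exec -it "$POD" -- bash -lc \
"PGPASSWORD='$DBPASS' psql -h 127.0.0.1 -U '$DBUSER' -d '$DBNAME' -Atc \
\"select sourcefile, sourceline, name, setting from pg_file_settings
  where name in ('shared_buffers','effective_cache_size') order by name;\""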


5) Small cleanups and guardrails

  • Created a helper to reapply Redis tuning quickly:

cat >/root/reapply-redis-tuning.sh <<'EOF'
#!/bin/sh
NS=ix-nextcloud
DEP=nextcloud-redis
k3s kubectl -n $NS set env deploy/$DEP \
  REDIS_APPENDONLY=yes \
  REDIS_APPENDFSYNC=everysec \
  REDIS_MAXMEMORY=8gb \
  REDIS_MAXMEMORY_POLICY=allkeys-lru
k3s kubectl -n $NS rollout restart deploy/$DEP
EOF
chmod +x /root/reapply-redis-tuning.sh
    
  • Verified Nextcloud's Redis password from the correct secret key REDIS_PASSWORD after earlier key-name misses.

Why: quick reapply for tunables, fewer fat-fingered loops.


Validation snapshots

Redis quick state

connected_clients:11
used_memory_human:1.46M
maxmemory_human:8.00G
maxmemory_policy:allkeys-lru
aof_enabled:1
aof_last_write_status:ok
instantaneous_ops_per_sec:95
evicted_keys:0
role:master

Postgres quick state

  • shared_buffers now controlled via CLI and aligned with resource limit
  • effective_cache_size=40GB from conf.d
  • wal_compression=on, max_wal_size=8GB, random_page_cost=1.25 confirmed

Known gotchas encountered

  • Exec'd into the wrong pods/containers repeatedly. Use namespace and label selectors, and pass -c only when the pod actually has multiple containers; see the lookup sketch after this list. 10
  • The Bitnami Redis entrypoint mishandles raw --option flags passed through the container args (here it crashed on startup). Use the env variables the chart supports instead.
  • Postgres role confusion: default superuser is not always postgres in this chart. Use credentials from nextcloud-postgres-creds. 11
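
A more robust pod lookup than grepping get pods output, assuming the chart sets the usual app.kubernetes.io labels (confirm with --show-labels and adjust the selector to whatever is actually there):

NS=ix-nextcloud
k3s kubectl -n $NS get pods --show-labels
# assumed label key/value; substitute what the chart really sets
POD=$(k3s kubectl -n $NS get pods -l app.kubernetes.io/name=redis \
      -o jsonpath='{.items[0].metadata.name}')
k3s kubectl -n $NS get pod "$POD" -o jsonpath='{.spec.containers[*].name}'   # container names, for -c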

Next actions

  • Optional: set effective_io_concurrency=256 and maintenance_work_mem=2GB via conf.d only if they are not already set on the command line, then restart; a sketch follows after this list.
  • Consider shared_buffers at roughly 25% of cgroup memory for mixed workloads. The current 16 GB on a 24 GiB limit is well above that and is fine only while the pod keeps headroom. 12
  • Keep work_mem moderate to avoid per-query explosion; current 128MB is aggressive if concurrency spikes.
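
A minimal sketch of that conf.d drop-in, assuming the include directory sits on the persistent pgData volume so it survives restarts. The path is hypothetical; check SHOW config_file and the include_dir line in postgresql.conf first.

# inside the postgres pod; CONF_D is a placeholder for whatever include_dir points at
CONF_D=/path/to/conf.d
cat > "$CONF_D/90-extra-tuning.conf" <<'EOF'
effective_io_concurrency = 256
maintenance_work_mem = 2GB
EOF

# back on the host: restart so the new file is read (it only wins if no -c flag sets these)
k3s kubectl -n ix-nextcloud rollout restart deploy/nextcloud-postgres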

Appendix: Handy one-liners

Show who is forcing PG settings

select name,setting,source,sourcefile
from pg_settings
where name in ('shared_buffers','effective_cache_size','wal_compression','max_wal_size','random_page_cost')
order by name;

Show current pod memory limit (run inside the pod)

cat /sys/fs/cgroup/memory.max || cat /sys/fs/cgroup/memory/memory.limit_in_bytes

Redis sanity

REDISCLI_AUTH="$REDIS_PASS" redis-cli INFO | egrep -i 'used_memory_human|maxmemory_human|maxmemory_policy|aof_enabled|evicted_keys'


Footnotes

  1. The snapshot script prints OS, CPU, memory, ZFS pools and ARC, datasets matching Nextcloud and DB, app platform state, network listeners, THP, swappiness, timers, and versions. Good first move before any tuning. ↩︎

  2. ZFS recordsize=8K matches Postgres 8 KB page size; atime=off avoids metadata writes; compression=lz4 is typically net positive for WAL and heap; logbias=latency optimizes synchronous intent logging. These are standard PG-on-ZFS choices. ↩︎

  3. Transparent Huge Pages can cause latency spikes for memory alloc and compaction. PG recommends never. You persisted it with a systemd unit and verified huge_pages=off in PG. ↩︎

  4. vm.swappiness=1 favors keeping hot working sets in memory. DB nodes typically set this low to avoid writeback storms. ↩︎

  5. The TrueNAS Bitnami chart maps well-known env vars like REDIS_APPENDONLY and REDIS_MAXMEMORY_POLICY into redis.conf, avoiding brittle args parsing. ↩︎

  6. appendonly yes with everysec gives durability with good throughput. It is the sane default for NC caching plus locking patterns. ↩︎

  7. allkeys-lru prevents unbounded memory growth and prioritizes hot keys. With maxmemory 8gb, eviction is predictable. ↩︎

  8. Postgres configuration precedence, highest to lowest: command-line -c flags, then ALTER SYSTEM (postgresql.auto.conf), then postgresql.conf and any included files. If the container passes -c shared_buffers=1024MB, it overrides everything else. ↩︎

  9. With a 24 GiB cgroup limit, shared_buffers=16GB is aggressive but acceptable if app memory and FS cache are still healthy. Monitor OOMKilled events and PG memory stats. ↩︎

  10. When kubectl says “container not found,” the pod likely has a single container with a different name than you assumed. Use kubectl -n NS get pod POD -o jsonpath='{.spec.containers[*].name}' to confirm. ↩︎

  11. The Bitnami PG image often creates the app user as the primary DB user. The secret shows the authoritative POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB you should use. ↩︎

  12. Rule of thumb: shared_buffers at 20-25 percent of RAM for mixed workloads, higher only if the rest of the stack is memory-light and you monitor for OOM. effective_cache_size can be 2-3x buffers. ↩︎