Merge pull request #3473 from janhq/dev

Release Cut 0.5.3
This commit is contained in:
Van Pham 2024-08-27 16:58:55 +07:00 committed by GitHub
commit c0ffd03f61
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
202 changed files with 3208 additions and 8412 deletions

37 .github/ISSUE_TEMPLATE/bug_report.md vendored Normal file
View File

@ -0,0 +1,37 @@
---
name: "🖋️ Report"
about: Create a report to help us improve Jan
title: 'bug: [DESCRIPTION]'
labels: 'type: bug'
assignees: ''
---
**Describe the bug**
A clear and concise description of what the bug is.
**Steps to reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your issue.
**Environment details**
- Operating System: [Specify your OS. e.g., macOS Sonoma 14.2.1, Windows 11, Ubuntu 22, etc]
- Jan Version: [e.g., 0.4.xxx nightly or manual]
- Processor: [e.g., Apple M1, Intel Core i7, AMD Ryzen 5, etc]
- RAM: [e.g., 8GB, 16GB]
- Any additional relevant hardware specifics: [e.g., Graphics card, SSD/HDD]
**Logs**
If the cause of the error is not clear, kindly provide your usage logs: https://jan.ai/docs/troubleshooting#how-to-get-error-logs
**Additional context**
Add any other context or information that could be helpful in diagnosing the problem.

View File

@ -10,7 +10,7 @@ on:
description: 'Public Provider'
options:
- none
- cloudflare-r2
- aws-s3
default: none
jobs:
@ -28,10 +28,10 @@ jobs:
echo "::set-output name=ref::${{ github.ref }}"
else
if [ "${{ github.event_name }}" == "schedule" ]; then
echo "::set-output name=public_provider::cloudflare-r2"
echo "::set-output name=public_provider::aws-s3"
echo "::set-output name=ref::refs/heads/dev"
elif [ "${{ github.event_name }}" == "push" ]; then
echo "::set-output name=public_provider::cloudflare-r2"
echo "::set-output name=public_provider::aws-s3"
echo "::set-output name=ref::${{ github.ref }}"
else
echo "::set-output name=public_provider::none"
@ -112,13 +112,13 @@ jobs:
cat ./latest-mac.yml
- name: Upload latest-mac.yml
if: ${{ needs.set-public-provider.outputs.public_provider == 'cloudflare-r2' }}
if: ${{ needs.set-public-provider.outputs.public_provider == 'aws-s3' }}
run: |
aws s3api put-object --endpoint-url https://${{ secrets.CLOUDFLARE_ACCOUNT_ID }}.r2.cloudflarestorage.com --bucket ${{ secrets.CLOUDFLARE_R2_BUCKET_NAME }} --key "latest/latest-mac.yml" --body "./latest-mac.yml"
aws s3 cp ./latest-mac.yml "s3://${{ secrets.DELTA_AWS_S3_BUCKET_NAME }}/latest/latest-mac.yml"
env:
AWS_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: auto
AWS_ACCESS_KEY_ID: ${{ secrets.DELTA_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DELTA_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: ${{ secrets.DELTA_AWS_REGION }}
AWS_EC2_METADATA_DISABLED: "true"
@ -147,7 +147,7 @@ jobs:
noti-discord-manual-and-update-url-readme:
needs: [build-macos-x64, build-macos-arm64, build-windows-x64, build-linux-x64, get-update-version, set-public-provider, combine-latest-mac-yml]
secrets: inherit
if: github.event_name == 'workflow_dispatch' && github.event.inputs.public_provider == 'cloudflare-r2'
if: github.event_name == 'workflow_dispatch' && github.event.inputs.public_provider == 'aws-s3'
uses: ./.github/workflows/template-noti-discord-and-update-url-readme.yml
with:
ref: refs/heads/dev

View File

@ -1,40 +0,0 @@
name: Docker Builder - Nightly / Manual
on:
push:
branches:
- main
- feature/helmchart-and-ci-jan-server
paths-ignore:
- 'README.md'
- 'docs/**'
schedule:
- cron: '0 21 * * 1,2,3' # At 9 PM UTC on Monday, Tuesday, and Wednesday, which is 4 AM UTC+7 on Tuesday, Wednesday, and Thursday
workflow_dispatch:
jobs:
# Job: create/update the app version based on the latest release tag with build number and save it to output
get-update-version:
uses: ./.github/workflows/template-get-update-version.yml
build-cpu:
uses: ./.github/workflows/template-build-jan-server.yml
permissions:
packages: write
secrets: inherit
needs: [get-update-version]
with:
dockerfile_path: ./Dockerfile
docker_image_tag: "ghcr.io/janhq/jan-server:dev-cpu-latest,ghcr.io/janhq/jan-server:dev-cpu-${{ needs.get-update-version.outputs.new_version }}"
build-gpu:
uses: ./.github/workflows/template-build-jan-server.yml
permissions:
packages: write
secrets: inherit
needs: [get-update-version]
with:
dockerfile_path: ./Dockerfile.gpu
docker_image_tag: "ghcr.io/janhq/jan-server:dev-cuda-12.2-latest,ghcr.io/janhq/jan-server:dev-cuda-12.2-${{ needs.get-update-version.outputs.new_version }}"

View File

@ -1,30 +0,0 @@
name: Docker Builder - Tag
on:
push:
tags: ["v[0-9]+.[0-9]+.[0-9]+"]
jobs:
# Job: create/update the app version based on the latest release tag with build number and save it to output
get-update-version:
uses: ./.github/workflows/template-get-update-version.yml
build-cpu:
permissions:
packages: write
uses: ./.github/workflows/template-build-jan-server.yml
secrets: inherit
needs: [get-update-version]
with:
dockerfile_path: ./Dockerfile
docker_image_tag: "ghcr.io/janhq/jan-server:cpu-latest,ghcr.io/janhq/jan-server:cpu-${{ needs.get-update-version.outputs.new_version }}"
build-gpu:
permissions:
packages: write
uses: ./.github/workflows/template-build-jan-server.yml
secrets: inherit
needs: [get-update-version]
with:
dockerfile_path: ./Dockerfile.gpu
docker_image_tag: "ghcr.io/janhq/jan-server:cuda-12.2-latest,ghcr.io/janhq/jan-server:cuda-12.2-${{ needs.get-update-version.outputs.new_version }}"

View File

@ -10,23 +10,21 @@ on:
required: true
type: string
default: none
description: 'none: build only, github: build and publish to github, cloudflare: build and publish to cloudflare'
description: 'none: build only, github: build and publish to github, aws s3: build and publish to aws s3'
new_version:
required: true
type: string
default: ''
cloudflare_r2_path:
aws_s3_prefix:
required: false
type: string
default: '/latest/'
secrets:
CLOUDFLARE_R2_BUCKET_NAME:
DELTA_AWS_S3_BUCKET_NAME:
required: false
CLOUDFLARE_R2_ACCESS_KEY_ID:
DELTA_AWS_ACCESS_KEY_ID:
required: false
CLOUDFLARE_R2_SECRET_ACCESS_KEY:
required: false
CLOUDFLARE_ACCOUNT_ID:
DELTA_AWS_SECRET_ACCESS_KEY:
required: false
jobs:
@ -58,7 +56,7 @@ jobs:
mv /tmp/package.json electron/package.json
jq --arg version "${{ inputs.new_version }}" '.version = $version' web/package.json > /tmp/package.json
mv /tmp/package.json web/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "bucket": "${{ secrets.CLOUDFLARE_R2_BUCKET_NAME }}", "region": "auto", "endpoint": "https://${{ secrets.CLOUDFLARE_ACCOUNT_ID }}.r2.cloudflarestorage.com", "path": "${{ inputs.cloudflare_r2_path }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "acl": null, "bucket": "${{ secrets.DELTA_AWS_S3_BUCKET_NAME }}", "region": "${{ secrets.DELTA_AWS_REGION}}", "path": "${{ inputs.aws_s3_prefix }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
mv /tmp/package.json electron/package.json
cat electron/package.json
@ -76,7 +74,7 @@ jobs:
env:
VERSION_TAG: ${{ inputs.new_version }}
- name: Build and publish app to cloudflare r2 or github artifactory
- name: Build and publish app to aws s3 or github artifactory
if: inputs.public_provider != 'github'
run: |
# check public_provider is true or not
@ -88,9 +86,10 @@ jobs:
fi
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_SECRET_ACCESS_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.DELTA_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DELTA_AWS_SECRET_ACCESS_KEY }}
AWS_EC2_METADATA_DISABLED: "true"
AWS_MAX_ATTEMPTS: "5"
- name: Build and publish app to github
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/') && inputs.public_provider == 'github'

View File

@ -10,23 +10,21 @@ on:
required: true
type: string
default: none
description: 'none: build only, github: build and publish to github, cloudflare: build and publish to cloudflare'
description: 'none: build only, github: build and publish to github, aws s3: build and publish to aws s3'
new_version:
required: true
type: string
default: ''
cloudflare_r2_path:
aws_s3_prefix:
required: false
type: string
default: '/latest/'
secrets:
CLOUDFLARE_R2_BUCKET_NAME:
DELTA_AWS_S3_BUCKET_NAME:
required: false
CLOUDFLARE_R2_ACCESS_KEY_ID:
DELTA_AWS_ACCESS_KEY_ID:
required: false
CLOUDFLARE_R2_SECRET_ACCESS_KEY:
required: false
CLOUDFLARE_ACCOUNT_ID:
DELTA_AWS_SECRET_ACCESS_KEY:
required: false
CODE_SIGN_P12_BASE64:
required: false
@ -70,7 +68,7 @@ jobs:
jq --arg version "${{ inputs.new_version }}" '.version = $version' web/package.json > /tmp/package.json
mv /tmp/package.json web/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "bucket": "${{ secrets.CLOUDFLARE_R2_BUCKET_NAME }}", "region": "auto", "endpoint": "https://${{ secrets.CLOUDFLARE_ACCOUNT_ID }}.r2.cloudflarestorage.com", "path": "${{ inputs.cloudflare_r2_path }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "acl": null, "bucket": "${{ secrets.DELTA_AWS_S3_BUCKET_NAME }}", "region": "${{ secrets.DELTA_AWS_REGION}}", "path": "${{ inputs.aws_s3_prefix }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
mv /tmp/package.json electron/package.json
jq --arg teamid "${{ secrets.APPLE_TEAM_ID }}" '.build.mac.notarize.teamId = $teamid' electron/package.json > /tmp/package.json
@ -107,7 +105,7 @@ jobs:
p12-file-base64: ${{ secrets.CODE_SIGN_P12_BASE64 }}
p12-password: ${{ secrets.CODE_SIGN_P12_PASSWORD }}
- name: Build and publish app to cloudflare r2 or github artifactory
- name: Build and publish app to aws s3 or github artifactory
if: inputs.public_provider != 'github'
run: |
# check public_provider is true or not
@ -126,10 +124,11 @@ jobs:
APPLE_APP_SPECIFIC_PASSWORD: ${{ secrets.APPLE_APP_SPECIFIC_PASSWORD }}
APP_PATH: "."
DEVELOPER_ID: ${{ secrets.DEVELOPER_ID }}
AWS_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_SECRET_ACCESS_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.DELTA_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DELTA_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: auto
AWS_EC2_METADATA_DISABLED: "true"
AWS_MAX_ATTEMPTS: "5"
- name: Build and publish app to github
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/') && inputs.public_provider == 'github'

View File

@ -10,23 +10,21 @@ on:
required: true
type: string
default: none
description: 'none: build only, github: build and publish to github, cloudflare: build and publish to cloudflare'
description: 'none: build only, github: build and publish to github, aws s3: build and publish to aws s3'
new_version:
required: true
type: string
default: ''
cloudflare_r2_path:
aws_s3_prefix:
required: false
type: string
default: '/latest/'
secrets:
CLOUDFLARE_R2_BUCKET_NAME:
DELTA_AWS_S3_BUCKET_NAME:
required: false
CLOUDFLARE_R2_ACCESS_KEY_ID:
DELTA_AWS_ACCESS_KEY_ID:
required: false
CLOUDFLARE_R2_SECRET_ACCESS_KEY:
required: false
CLOUDFLARE_ACCOUNT_ID:
DELTA_AWS_SECRET_ACCESS_KEY:
required: false
CODE_SIGN_P12_BASE64:
required: false
@ -70,7 +68,7 @@ jobs:
jq --arg version "${{ inputs.new_version }}" '.version = $version' web/package.json > /tmp/package.json
mv /tmp/package.json web/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "bucket": "${{ secrets.CLOUDFLARE_R2_BUCKET_NAME }}", "region": "auto", "endpoint": "https://${{ secrets.CLOUDFLARE_ACCOUNT_ID }}.r2.cloudflarestorage.com", "path": "${{ inputs.cloudflare_r2_path }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "acl": null, "bucket": "${{ secrets.DELTA_AWS_S3_BUCKET_NAME }}", "region": "${{ secrets.DELTA_AWS_REGION}}", "path": "${{ inputs.aws_s3_prefix }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
mv /tmp/package.json electron/package.json
jq --arg teamid "${{ secrets.APPLE_TEAM_ID }}" '.build.mac.notarize.teamId = $teamid' electron/package.json > /tmp/package.json
@ -107,7 +105,7 @@ jobs:
p12-file-base64: ${{ secrets.CODE_SIGN_P12_BASE64 }}
p12-password: ${{ secrets.CODE_SIGN_P12_PASSWORD }}
- name: Build and publish app to cloudflare r2 or github artifactory
- name: Build and publish app to aws s3 or github artifactory
if: inputs.public_provider != 'github'
run: |
# check public_provider is true or not
@ -126,10 +124,11 @@ jobs:
APPLE_APP_SPECIFIC_PASSWORD: ${{ secrets.APPLE_APP_SPECIFIC_PASSWORD }}
APP_PATH: "."
DEVELOPER_ID: ${{ secrets.DEVELOPER_ID }}
AWS_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_SECRET_ACCESS_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.DELTA_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DELTA_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: auto
AWS_EC2_METADATA_DISABLED: "true"
AWS_MAX_ATTEMPTS: "5"
- name: Build and publish app to github
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/') && inputs.public_provider == 'github'

View File

@ -10,23 +10,21 @@ on:
required: true
type: string
default: none
description: 'none: build only, github: build and publish to github, cloudflare: build and publish to cloudflare'
description: 'none: build only, github: build and publish to github, aws s3: build and publish to aws s3'
new_version:
required: true
type: string
default: ''
cloudflare_r2_path:
aws_s3_prefix:
required: false
type: string
default: '/latest/'
secrets:
CLOUDFLARE_R2_BUCKET_NAME:
DELTA_AWS_S3_BUCKET_NAME:
required: false
CLOUDFLARE_R2_ACCESS_KEY_ID:
DELTA_AWS_ACCESS_KEY_ID:
required: false
CLOUDFLARE_R2_SECRET_ACCESS_KEY:
required: false
CLOUDFLARE_ACCOUNT_ID:
DELTA_AWS_SECRET_ACCESS_KEY:
required: false
AZURE_KEY_VAULT_URI:
required: false
@ -71,7 +69,7 @@ jobs:
jq --arg version "${{ inputs.new_version }}" '.version = $version' web/package.json > /tmp/package.json
mv /tmp/package.json web/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "bucket": "${{ secrets.CLOUDFLARE_R2_BUCKET_NAME }}", "region": "auto", "endpoint": "https://${{ secrets.CLOUDFLARE_ACCOUNT_ID }}.r2.cloudflarestorage.com", "path": "${{ inputs.cloudflare_r2_path }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
jq '.build.publish = [{"provider": "generic", "url": "${{ secrets.CLOUDFLARE_R2_PUBLIC_URL }}", "channel": "latest"}, {"provider": "s3", "acl": null, "bucket": "${{ secrets.DELTA_AWS_S3_BUCKET_NAME }}", "region": "${{ secrets.DELTA_AWS_REGION}}", "path": "${{ inputs.aws_s3_prefix }}", "channel": "latest"}]' electron/package.json > /tmp/package.json
mv /tmp/package.json electron/package.json
jq '.build.win.sign = "./sign.js"' electron/package.json > /tmp/package.json
@ -99,7 +97,7 @@ jobs:
run: |
dotnet tool install --global AzureSignTool
- name: Build and publish app to cloudflare r2 or github artifactory
- name: Build and publish app to aws s3 or github artifactory
shell: bash
if: inputs.public_provider != 'github'
run: |
@ -116,10 +114,11 @@ jobs:
AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
AZURE_CERT_NAME: ${{ secrets.AZURE_CERT_NAME }}
AWS_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_SECRET_ACCESS_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.DELTA_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DELTA_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: auto
AWS_EC2_METADATA_DISABLED: "true"
AWS_MAX_ATTEMPTS: "5"
- name: Build app and publish app to github
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/') && inputs.public_provider == 'github'

2 .gitignore vendored
View File

@ -39,3 +39,5 @@ extensions/*-extension/bin/vulkaninfo
# Turborepo
.turbo
electron/test-data
electron/test-results

View File

@ -1,4 +1 @@
#!/usr/bin/env sh
. "$(dirname -- "$0")/_/husky.sh"
npx pretty-quick --staged
npm run lint --fix

View File

@ -1,60 +0,0 @@
FROM node:20-bookworm AS base
# 1. Install dependencies only when needed
FROM base AS builder
# Install g++ 11
RUN apt update && apt install -y gcc-11 g++-11 cpp-11 jq xsel && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install dependencies based on the preferred package manager
COPY . ./
RUN export NITRO_VERSION=$(cat extensions/inference-nitro-extension/bin/version.txt) && \
jq --arg nitroVersion $NITRO_VERSION '(.scripts."downloadnitro:linux" | gsub("\\${NITRO_VERSION}"; $nitroVersion)) | gsub("\r"; "")' extensions/inference-nitro-extension/package.json > /tmp/newcommand.txt && export NEW_COMMAND=$(sed 's/^"//;s/"$//' /tmp/newcommand.txt) && jq --arg newCommand "$NEW_COMMAND" '.scripts."downloadnitro:linux" = $newCommand' extensions/inference-nitro-extension/package.json > /tmp/package.json && mv /tmp/package.json extensions/inference-nitro-extension/package.json
RUN make install-and-build
# # 2. Rebuild the source code only when needed
FROM base AS runner
# Install g++ 11
RUN apt update && apt install -y gcc-11 g++-11 cpp-11 jq xsel && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy the package.json and yarn.lock of root yarn space to leverage Docker cache
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/node_modules ./node_modules/
COPY --from=builder /app/yarn.lock ./yarn.lock
# Copy the package.json, yarn.lock, and build output of server yarn space to leverage Docker cache
COPY --from=builder /app/core ./core/
COPY --from=builder /app/server ./server/
RUN cd core && yarn install && yarn run build
RUN yarn workspace @janhq/server install && yarn workspace @janhq/server build
COPY --from=builder /app/docs/openapi ./docs/openapi/
# Copy pre-install dependencies
COPY --from=builder /app/pre-install ./pre-install/
# Copy the package.json, yarn.lock, and output of web yarn space to leverage Docker cache
COPY --from=builder /app/joi ./joi/
COPY --from=builder /app/web ./web/
RUN yarn workspace @janhq/joi install && yarn workspace @janhq/joi build
RUN yarn workspace @janhq/web install
RUN npm install -g serve@latest
EXPOSE 1337 3000 3928
ENV JAN_API_HOST 0.0.0.0
ENV JAN_API_PORT 1337
ENV API_BASE_URL http://localhost:1337
CMD ["sh", "-c", "export NODE_ENV=production && yarn workspace @janhq/web build && cd web && npx serve out & cd server && node build/main.js"]
# docker build -t jan .
# docker run -p 1337:1337 -p 3000:3000 -p 3928:3928 jan

View File

@ -1,87 +0,0 @@
# Please change the base image to the appropriate CUDA version based on NVIDIA driver compatibility
# Run nvidia-smi to check the CUDA version and the corresponding driver version
# Then update the base image to the appropriate CUDA version; refer to https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04 AS base
# 1. Install dependencies only when needed
FROM base AS builder
# Install g++ 11
RUN apt update && apt install -y gcc-11 g++-11 cpp-11 jq xsel curl gnupg make python3-dev && curl -sL https://deb.nodesource.com/setup_20.x | bash - && apt install nodejs -y && rm -rf /var/lib/apt/lists/*
# Update alternatives for GCC and related tools
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 110 \
--slave /usr/bin/g++ g++ /usr/bin/g++-11 \
--slave /usr/bin/gcov gcov /usr/bin/gcov-11 \
--slave /usr/bin/gcc-ar gcc-ar /usr/bin/gcc-ar-11 \
--slave /usr/bin/gcc-ranlib gcc-ranlib /usr/bin/gcc-ranlib-11 && \
update-alternatives --install /usr/bin/cpp cpp /usr/bin/cpp-11 110
RUN npm install -g yarn
WORKDIR /app
# Install dependencies based on the preferred package manager
COPY . ./
RUN export NITRO_VERSION=$(cat extensions/inference-nitro-extension/bin/version.txt) && \
jq --arg nitroVersion $NITRO_VERSION '(.scripts."downloadnitro:linux" | gsub("\\${NITRO_VERSION}"; $nitroVersion)) | gsub("\r"; "")' extensions/inference-nitro-extension/package.json > /tmp/newcommand.txt && export NEW_COMMAND=$(sed 's/^"//;s/"$//' /tmp/newcommand.txt) && jq --arg newCommand "$NEW_COMMAND" '.scripts."downloadnitro:linux" = $newCommand' extensions/inference-nitro-extension/package.json > /tmp/package.json && mv /tmp/package.json extensions/inference-nitro-extension/package.json
RUN make install-and-build
# # 2. Rebuild the source code only when needed
FROM base AS runner
# Install g++ 11
RUN apt update && apt install -y gcc-11 g++-11 cpp-11 jq xsel curl gnupg make python3-dev && curl -sL https://deb.nodesource.com/setup_20.x | bash - && apt-get install nodejs -y && rm -rf /var/lib/apt/lists/*
# Update alternatives for GCC and related tools
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 110 \
--slave /usr/bin/g++ g++ /usr/bin/g++-11 \
--slave /usr/bin/gcov gcov /usr/bin/gcov-11 \
--slave /usr/bin/gcc-ar gcc-ar /usr/bin/gcc-ar-11 \
--slave /usr/bin/gcc-ranlib gcc-ranlib /usr/bin/gcc-ranlib-11 && \
update-alternatives --install /usr/bin/cpp cpp /usr/bin/cpp-11 110
RUN npm install -g yarn
WORKDIR /app
# Copy the package.json and yarn.lock of root yarn space to leverage Docker cache
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/node_modules ./node_modules/
COPY --from=builder /app/yarn.lock ./yarn.lock
# Copy the package.json, yarn.lock, and build output of server yarn space to leverage Docker cache
COPY --from=builder /app/core ./core/
COPY --from=builder /app/server ./server/
RUN cd core && yarn install && yarn run build
RUN yarn workspace @janhq/server install && yarn workspace @janhq/server build
COPY --from=builder /app/docs/openapi ./docs/openapi/
# Copy pre-install dependencies
COPY --from=builder /app/pre-install ./pre-install/
# Copy the package.json, yarn.lock, and output of web yarn space to leverage Docker cache
COPY --from=builder /app/joi ./joi/
COPY --from=builder /app/web ./web/
RUN yarn workspace @janhq/joi install && yarn workspace @janhq/joi build
RUN yarn workspace @janhq/web install
RUN npm install -g serve@latest
EXPOSE 1337 3000 3928
ENV LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda-12.0/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
ENV JAN_API_HOST 0.0.0.0
ENV JAN_API_PORT 1337
ENV API_BASE_URL http://localhost:1337
CMD ["sh", "-c", "export NODE_ENV=production && yarn workspace @janhq/web build && cd web && npx serve out & cd server && node build/main.js"]
# pre-requisites: nvidia-docker
# docker build -t jan-gpu . -f Dockerfile.gpu
# docker run -p 1337:1337 -p 3000:3000 -p 3928:3928 --gpus all jan-gpu

View File

@ -1,6 +0,0 @@
dependencies:
- name: common
repository: oci://ghcr.io/janhq/charts
version: 0.1.2
digest: sha256:35e98bde174130787755b0f8ea2359b7b6790d965a7157c2f7cabf1bc8c04471
generated: "2024-02-20T16:20:37.6530108+07:00"

View File

@ -1,10 +0,0 @@
apiVersion: v2
name: jan-server
description: A Helm chart for Kubernetes
type: application
version: 0.1.0
appVersion: '1.0.0'
dependencies:
- name: common
version: 0.1.2 # common-chart-version
repository: oci://ghcr.io/janhq/charts

View File

@ -1,4 +0,0 @@
{
"image-list": "server=ghcr.io/janhq/jan-server",
"platforms": "linux/amd64"
}

View File

@ -1,256 +0,0 @@
common:
imageTag: v0.4.6-cpu
# DO NOT CHANGE THE LINE ABOVE. MAKE ALL CHANGES BELOW
# Global pvc for all workload
pvc:
enabled: false
name: 'janroot'
accessModes: 'ReadWriteOnce'
storageClassName: ''
capacity: '50Gi'
# Global image pull secret
imagePullSecrets: []
externalSecret:
create: false
name: ''
annotations: {}
nameOverride: 'jan-server'
fullnameOverride: 'jan-server'
serviceAccount:
create: true
annotations: {}
name: 'jan-server-service-account'
podDisruptionBudget:
create: false
minAvailable: 1
workloads:
- name: server
image:
repository: ghcr.io/janhq/jan-server
pullPolicy: Always
command: ['/bin/sh', '-c']
args: ['cd server && node build/main.js']
replicaCount: 1
ports:
containerPort: 1337
strategy:
canary:
steps:
- setWeight: 50
- pause: { duration: 1m }
ingress:
enabled: true
className: 'nginx'
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: '100m'
nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
# cert-manager.io/cluster-issuer: 'jan-ai-dns01-cluster-issuer'
# nginx.ingress.kubernetes.io/force-ssl-redirect: 'true'
nginx.ingress.kubernetes.io/backend-protocol: HTTP
hosts:
- host: server.local
paths:
- path: /
pathType: Prefix
tls:
[]
# - hosts:
# - server-dev.jan.ai
# secretName: jan-server-prod-tls-v2
instrumentation:
enabled: false
podAnnotations: {}
podSecurityContext: {}
securityContext: {}
service:
externalLabel: {}
type: ClusterIP
port: 1337
targetPort: 1337
# If you want to use GPU, please uncomment the following lines and change imageTag to the one with GPU support
resources:
# limits:
# nvidia.com/gpu: 1
requests:
cpu: 2000m
memory: 8192M
# If you want to use pv, please uncomment the following lines and enable pvc.enabled
volumes:
[]
# - name: janroot
# persistentVolumeClaim:
# claimName: janroot
volumeMounts:
[]
# - name: janroot
# mountPath: /app/server/build/jan
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET_NAME, AWS_ENDPOINT, AWS_REGION should be mounted as secret env vars instead of being set in plain text here
# Change API_BASE_URL to your server's public domain
env:
- name: API_BASE_URL
value: 'http://server.local'
lifecycle: {}
autoscaling:
enabled: false
minReplicas: 2
maxReplicas: 3
targetCPUUtilizationPercentage: 95
targetMemoryUtilizationPercentage: 95
kedaScaling:
enabled: false # ignore if autoscaling.enable = true
cooldownPeriod: 30
pollingInterval: 2
minReplicas: 1
maxReplicas: 5
metricName: celery_queue_length
query: celery_queue_length{queue_name="myqueue"} # change queue_name here
serverAddress: http://prometheus-prod-kube-prome-prometheus.monitoring.svc:9090
threshold: '3'
nodeSelector: {}
tolerations: []
podSecurityGroup:
enabled: false
securitygroupid: []
# Reloader Option
reloader: 'false'
vpa:
enabled: false
- name: web
image:
repository: ghcr.io/janhq/jan-server
pullPolicy: Always
command: ['/bin/sh', '-c']
args:
[
'export NODE_ENV=production && yarn workspace @janhq/web build && cd web && npx serve out',
]
replicaCount: 1
ports:
containerPort: 3000
strategy:
canary:
steps:
- setWeight: 50
- pause: { duration: 1m }
ingress:
enabled: true
className: 'nginx'
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: '100m'
nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
# cert-manager.io/cluster-issuer: 'jan-ai-dns01-cluster-issuer'
# nginx.ingress.kubernetes.io/force-ssl-redirect: 'true'
nginx.ingress.kubernetes.io/backend-protocol: HTTP
hosts:
- host: web.local
paths:
- path: /
pathType: Prefix
tls:
[]
# - hosts:
# - server-dev.jan.ai
# secretName: jan-server-prod-tls-v2
instrumentation:
enabled: false
podAnnotations: {}
podSecurityContext: {}
securityContext: {}
service:
externalLabel: {}
type: ClusterIP
port: 3000
targetPort: 3000
resources:
limits:
cpu: 1000m
memory: 2048M
requests:
cpu: 50m
memory: 500M
volumes:
[]
# - name: janroot
# persistentVolumeClaim:
# claimName: janroot
volumeMounts:
[]
# - name: janroot
# mountPath: /app/server/build/jan
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET_NAME, AWS_ENDPOINT, AWS_REGION should be mounted as secret env vars instead of being set in plain text here
# Change API_BASE_URL to your server's public domain
env:
- name: API_BASE_URL
value: 'http://server.local'
lifecycle: {}
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 3
targetCPUUtilizationPercentage: 95
targetMemoryUtilizationPercentage: 95
kedaScaling:
enabled: false # ignore if autoscaling.enable = true
cooldownPeriod: 30
pollingInterval: 2
minReplicas: 1
maxReplicas: 5
metricName: celery_queue_length
query: celery_queue_length{queue_name="myqueue"} # change queue_name here
serverAddress: http://prometheus-prod-kube-prome-prometheus.monitoring.svc:9090
threshold: '3'
nodeSelector: {}
tolerations: []
podSecurityGroup:
enabled: false
securitygroupid: []
# Reloader Option
reloader: 'false'
vpa:
enabled: false

View File

@ -118,10 +118,21 @@ export abstract class BaseExtension implements ExtensionType {
setting.extensionName = this.name
})
try {
await fs.mkdir(extensionSettingFolderPath)
if (!(await fs.existsSync(extensionSettingFolderPath)))
await fs.mkdir(extensionSettingFolderPath)
const settingFilePath = await joinPath([extensionSettingFolderPath, this.settingFileName])
if (await fs.existsSync(settingFilePath)) return
// Persists new settings
if (await fs.existsSync(settingFilePath)) {
const oldSettings = JSON.parse(await fs.readFileSync(settingFilePath, 'utf-8'))
settings.forEach((setting) => {
// Keep setting value
if (setting.controllerProps && Array.isArray(oldSettings))
setting.controllerProps.value = oldSettings.find(
(e: any) => e.key === setting.key
)?.controllerProps?.value
})
}
await fs.writeFileSync(settingFilePath, JSON.stringify(settings, null, 2))
} catch (err) {
console.error(err)
@ -168,6 +179,7 @@ export abstract class BaseExtension implements ExtensionType {
])
try {
if (!(await fs.existsSync(settingPath))) return []
const content = await fs.readFileSync(settingPath, 'utf-8')
const settings: SettingComponentProps[] = JSON.parse(content)
return settings

View File

@ -89,6 +89,7 @@ export abstract class OAIEngine extends AIEngine {
model: model.id,
stream: true,
...model.parameters,
...(this.provider === 'nitro' ? { engine: 'cortex.llamacpp'} : {}),
}
if (this.transformPayload) {
requestBody = this.transformPayload(requestBody)
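
For illustration, a rough sketch of the body this step now produces for a local (provider === 'nitro') model; only model, stream, and engine come from the lines above, while the messages shape and parameter values are hypothetical:

const exampleRequestBody = {
  messages: [{ role: 'user', content: 'Hello' }], // assumed OpenAI-compatible chat history
  model: 'llama3-8b-instruct', // model.id (hypothetical)
  stream: true,
  temperature: 0.7, // spread in from model.parameters (hypothetical)
  engine: 'cortex.llamacpp', // appended only when this.provider === 'nitro'
}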

View File

@ -58,6 +58,15 @@ const appendFileSync = (...args: any[]) => globalThis.core.api?.appendFileSync(.
const copyFile: (src: string, dest: string) => Promise<void> = (src, dest) =>
globalThis.core.api?.copyFile(src, dest)
/**
* Gets the list of GGUF files from the given paths
*
* @param paths - The file or directory paths to scan.
* @returns {Promise<any>} - A promise that resolves with the lists of gguf and non-gguf files
*/
const getGgufFiles: (paths: string[]) => Promise<any> = (paths) =>
globalThis.core.api?.getGgufFiles(paths)
/**
* Gets the file's stats.
*
@ -84,4 +93,5 @@ export const fs = {
copyFile,
fileStat,
writeBlob,
getGgufFiles,
}
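
A minimal usage sketch of the new helper, assuming it runs where the @janhq/core fs module is importable; the dropped paths are hypothetical, and the return shape follows the node-side handler shown further below ({ supportedFiles, unsupportedFiles }, each entry { path, name, size }):

import { fs } from '@janhq/core'

const importDroppedModels = async (droppedPaths: string[]) => {
  // Separates *.gguf files from everything else (one directory level deep)
  const { supportedFiles, unsupportedFiles } = await fs.getGgufFiles(droppedPaths)
  console.log(`${supportedFiles.length} GGUF file(s) found, ${unsupportedFiles.length} skipped`)
  return supportedFiles
}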

View File

@ -77,8 +77,8 @@ export class App implements Processor {
port: args?.port,
isCorsEnabled: args?.isCorsEnabled,
isVerboseEnabled: args?.isVerboseEnabled,
schemaPath: join(await appResourcePath(), 'docs', 'openapi', 'jan.yaml'),
baseDir: join(await appResourcePath(), 'docs', 'openapi'),
schemaPath: join(appResourcePath(), 'docs', 'openapi', 'jan.yaml'),
baseDir: join(appResourcePath(), 'docs', 'openapi'),
prefix: args?.prefix,
})
}

View File

@ -42,7 +42,7 @@ export class Extension implements Processor {
* @returns An array of paths to the base extensions.
*/
async baseExtensions() {
const baseExtensionPath = join(await appResourcePath(), 'pre-install')
const baseExtensionPath = join(appResourcePath(), 'pre-install')
return readdirSync(baseExtensionPath)
.filter((file) => extname(file) === '.tgz')
.map((file) => join(baseExtensionPath, file))

View File

@ -1,7 +1,7 @@
import { join } from 'path'
import fs from 'fs'
import { basename, join } from 'path'
import fs, { readdirSync } from 'fs'
import { appResourcePath, normalizeFilePath, validatePath } from '../../helper/path'
import { getJanDataFolderPath, getJanDataFolderPath as getPath } from '../../helper'
import { defaultAppConfig, getJanDataFolderPath, getJanDataFolderPath as getPath } from '../../helper'
import { Processor } from './Processor'
import { FileStat } from '../../../types'
@ -28,9 +28,10 @@ export class FSExt implements Processor {
return appResourcePath()
}
// Handles the 'getUserHomePath' IPC event. This event is triggered to get the user home path.
// Handles the 'getUserHomePath' IPC event. This event is triggered to get the user app data path.
// CAUTION: This does not return the OS home path but the app data path.
getUserHomePath() {
return process.env[process.platform == 'win32' ? 'USERPROFILE' : 'HOME']
return defaultAppConfig().data_folder
}
// handle fs is directory here
@ -79,4 +80,53 @@ export class FSExt implements Processor {
})
})
}
async getGgufFiles(paths: string[]) {
const sanitizedFilePaths: {
path: string
name: string
size: number
}[] = []
for (const filePath of paths) {
const normalizedPath = normalizeFilePath(filePath)
const isExist = fs.existsSync(normalizedPath)
if (!isExist) continue
const fileStats = fs.statSync(normalizedPath)
if (!fileStats) continue
if (!fileStats.isDirectory()) {
const fileName = await basename(normalizedPath)
sanitizedFilePaths.push({
path: normalizedPath,
name: fileName,
size: fileStats.size,
})
} else {
// allowing only one level of directory
const files = await readdirSync(normalizedPath)
for (const file of files) {
const fullPath = await join(normalizedPath, file)
const fileStats = await fs.statSync(fullPath)
if (!fileStats || fileStats.isDirectory()) continue
sanitizedFilePaths.push({
path: fullPath,
name: file,
size: fileStats.size,
})
}
}
}
const unsupportedFiles = sanitizedFilePaths.filter(
(file) => !file.path.endsWith('.gguf')
)
const supportedFiles = sanitizedFilePaths.filter((file) =>
file.path.endsWith('.gguf')
)
return {
unsupportedFiles,
supportedFiles,
}
}
}

View File

@ -1,16 +1,16 @@
import { HttpServer } from '../HttpServer'
import { commonRouter } from './common'
import { downloadRouter } from './app/download'
import { handleRequests } from './app/handlers'
export const v1Router = async (app: HttpServer) => {
// MARK: Public API Routes
app.register(commonRouter)
// MARK: Internal Application Routes
handleRequests(app)
// DEPRECATED: possible vulnerability issues
// handleRequests(app)
// Expanded route for tracking download progress
// TODO: Replace by Observer Wrapper (ZeroMQ / Vanilla Websocket)
app.register(downloadRouter)
// DEPRECATED: Jan FE Docker deploy is deprecated
// app.register(downloadRouter)
}

View File

@ -1,25 +1,18 @@
import { AppConfiguration, SettingComponentProps } from '../../types'
import { join } from 'path'
import { join, resolve } from 'path'
import fs from 'fs'
import os from 'os'
import childProcess from 'child_process'
const configurationFileName = 'settings.json'
// TODO: do not specify app name in framework module
// TODO: do not default the os.homedir
const defaultJanDataFolder = join(os?.homedir() || '', 'jan')
const defaultAppConfig: AppConfiguration = {
data_folder: defaultJanDataFolder,
quick_ask: false,
}
/**
* Getting App Configurations.
*
* @returns {AppConfiguration} The app configurations.
*/
export const getAppConfigurations = (): AppConfiguration => {
const appDefaultConfiguration = defaultAppConfig()
if (process.env.CI === 'e2e') return appDefaultConfiguration
// Retrieve Application Support folder path
// Fallback to user home directory if not found
const configurationFile = getConfigurationFilePath()
@ -27,8 +20,8 @@ export const getAppConfigurations = (): AppConfiguration => {
if (!fs.existsSync(configurationFile)) {
// create default app config if we don't have one
console.debug(`App config not found, creating default config at ${configurationFile}`)
fs.writeFileSync(configurationFile, JSON.stringify(defaultAppConfig))
return defaultAppConfig
fs.writeFileSync(configurationFile, JSON.stringify(appDefaultConfiguration))
return appDefaultConfiguration
}
try {
@ -38,7 +31,7 @@ export const getAppConfigurations = (): AppConfiguration => {
return appConfigurations
} catch (err) {
console.error(`Failed to read app config, return default config instead! Err: ${err}`)
return defaultAppConfig
return defaultAppConfig()
}
}
@ -155,3 +148,22 @@ export const getEngineConfiguration = async (engineId: string) => {
full_url: fullUrl,
}
}
/**
* Default app configurations
* App Data Folder default to Electron's userData
* %APPDATA% on Windows
* $XDG_CONFIG_HOME or ~/.config on Linux
* ~/Library/Application Support on macOS
*/
export const defaultAppConfig = (): AppConfiguration => {
const { app } = require('electron')
const defaultJanDataFolder = join(app?.getPath('userData') ?? os?.homedir() ?? '', 'data')
return {
data_folder:
process.env.CI === 'e2e'
? (process.env.APP_CONFIG_PATH ?? resolve('./test-data'))
: defaultJanDataFolder,
quick_ask: false,
}
}
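
As a brief illustration, assuming a packaged build whose Electron userData folder is named Jan and a hypothetical macOS user alice, the returned configuration would be:

// Sketch only: app.getPath('userData') resolves to ~/Library/Application Support/Jan here
const exampleConfig = {
  data_folder: '/Users/alice/Library/Application Support/Jan/data',
  quick_ask: false,
}
// In e2e CI runs the data_folder instead falls back to APP_CONFIG_PATH or ./test-data.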

View File

@ -11,34 +11,41 @@ export function normalizeFilePath(path: string): string {
return path.replace(/^(file:[\\/]+)([^:\s]+)$/, '$2')
}
export async function appResourcePath(): Promise<string> {
let electron: any = undefined
/**
* App resources path
* Returns string - The current application directory.
*/
export function appResourcePath() {
try {
const moduleName = 'electron'
electron = await import(moduleName)
const electron = require('electron')
// electron
if (electron && electron.protocol) {
let appPath = join(electron.app.getAppPath(), '..', 'app.asar.unpacked')
if (!electron.app.isPackaged) {
// for development mode
appPath = join(electron.app.getAppPath())
}
return appPath
}
} catch (err) {
console.error('Electron is not available')
}
// electron
if (electron && electron.protocol) {
let appPath = join(electron.app.getAppPath(), '..', 'app.asar.unpacked')
if (!electron.app.isPackaged) {
// for development mode
appPath = join(electron.app.getAppPath())
}
return appPath
}
// server
return join(global.core.appPath(), '../../..')
}
export function validatePath(path: string) {
const janDataFolderPath = getJanDataFolderPath()
const appDataFolderPath = getJanDataFolderPath()
const resourcePath = appResourcePath()
const applicationSupportPath = global.core?.appPath() ?? resourcePath
const absolutePath = resolve(__dirname, path)
if (!absolutePath.startsWith(janDataFolderPath)) {
if (
![appDataFolderPath, resourcePath, applicationSupportPath].some((whiteListedPath) =>
absolutePath.startsWith(whiteListedPath)
)
) {
throw new Error(`Invalid path: ${absolutePath}`)
}
}
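
To illustrate the widened whitelist, a small sketch with hypothetical folders (data folder /home/alice/jan, unpacked resources /opt/Jan/resources/app.asar.unpacked):

// Sketch of the check above, not actual tests
validatePath('/home/alice/jan/models/llama.gguf') // ok: inside the Jan data folder
validatePath('/opt/Jan/resources/app.asar.unpacked/docs/openapi/jan.yaml') // ok: now whitelisted
// validatePath('/etc/passwd') // would throw Error('Invalid path: /etc/passwd')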

View File

@ -105,6 +105,7 @@ export enum FileManagerRoute {
getUserHomePath = 'getUserHomePath',
fileStat = 'fileStat',
writeBlob = 'writeBlob',
getGgufFiles = 'getGgufFiles',
}
export type ApiFunction = (...args: any[]) => any

View File

@ -25,6 +25,10 @@ export enum InferenceEngine {
triton_trtllm = 'triton_trtllm',
nitro_tensorrt_llm = 'nitro-tensorrt-llm',
cohere = 'cohere',
nvidia = 'nvidia',
cortex_llamacpp = 'cortex.llamacpp',
cortex_onnx = 'cortex.onnx',
cortex_tensorrtllm = 'cortex.tensorrt-llm',
}
export type ModelArtifact = {
@ -103,6 +107,9 @@ export type ModelMetadata = {
tags: string[]
size: number
cover?: string
// These settings to preserve model settings across threads
default_ctx_len?: number
default_max_tokens?: number
}
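
For example, a model that should keep a 4K context window and a 2K completion limit when reused across threads could carry (partial object, hypothetical values):

const exampleMetadata: Partial<ModelMetadata> = {
  tags: ['Featured'],
  size: 4_920_000_000,
  default_ctx_len: 4096, // preserved per-model context length
  default_max_tokens: 2048, // preserved per-model max output tokens
}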
/**

View File

@ -1,171 +0,0 @@
# Docker Compose file for setting up Minio, createbuckets, app_cpu, and app_gpu services
version: '3.7'
services:
# Minio service for object storage
minio:
image: minio/minio
volumes:
- minio_data:/data
ports:
- '9000:9000'
- '9001:9001'
environment:
# Set the root user and password for Minio
MINIO_ROOT_USER: minioadmin # This acts as AWS_ACCESS_KEY
MINIO_ROOT_PASSWORD: minioadmin # This acts as AWS_SECRET_ACCESS_KEY
command: server --console-address ":9001" /data
restart: always
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:9000/minio/health/live']
interval: 30s
timeout: 20s
retries: 3
networks:
vpcbr:
ipv4_address: 10.5.0.2
# createbuckets service to create a bucket and set its policy
createbuckets:
image: minio/mc
depends_on:
- minio
entrypoint: >
/bin/sh -c "
/usr/bin/mc alias set myminio http://minio:9000 minioadmin minioadmin;
/usr/bin/mc mb myminio/mybucket;
/usr/bin/mc policy set public myminio/mybucket;
exit 0;
"
networks:
vpcbr:
# app_cpu service for running the CPU version of the application
app_cpu_s3fs:
image: jan:latest
volumes:
- app_data_cpu_s3fs:/app/server/build/jan
build:
context: .
dockerfile: Dockerfile
environment:
# Set the AWS access key, secret access key, bucket name, endpoint, and region for app_cpu
AWS_ACCESS_KEY_ID: minioadmin
AWS_SECRET_ACCESS_KEY: minioadmin
S3_BUCKET_NAME: mybucket
AWS_ENDPOINT: http://10.5.0.2:9000
AWS_REGION: us-east-1
API_BASE_URL: http://localhost:1337
restart: always
profiles:
- cpu-s3fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.3
# app_gpu service for running the GPU version of the application
app_gpu_s3fs:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
image: jan-gpu:latest
volumes:
- app_data_gpu_s3fs:/app/server/build/jan
build:
context: .
dockerfile: Dockerfile.gpu
restart: always
environment:
# Set the AWS access key, secret access key, bucket name, endpoint, and region for app_gpu
AWS_ACCESS_KEY_ID: minioadmin
AWS_SECRET_ACCESS_KEY: minioadmin
S3_BUCKET_NAME: mybucket
AWS_ENDPOINT: http://10.5.0.2:9000
AWS_REGION: us-east-1
API_BASE_URL: http://localhost:1337
profiles:
- gpu-s3fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.4
app_cpu_fs:
image: jan:latest
volumes:
- app_data_cpu_fs:/app/server/build/jan
build:
context: .
dockerfile: Dockerfile
environment:
API_BASE_URL: http://localhost:1337
restart: always
profiles:
- cpu-fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.5
# app_gpu service for running the GPU version of the application
app_gpu_fs:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
image: jan-gpu:latest
volumes:
- app_data_gpu_fs:/app/server/build/jan
build:
context: .
dockerfile: Dockerfile.gpu
restart: always
environment:
API_BASE_URL: http://localhost:1337
profiles:
- gpu-fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.6
volumes:
minio_data:
app_data_cpu_s3fs:
app_data_gpu_s3fs:
app_data_cpu_fs:
app_data_gpu_fs:
networks:
vpcbr:
driver: bridge
ipam:
config:
- subnet: 10.5.0.0/16
gateway: 10.5.0.1
# Usage:
# - Run 'docker compose -f docker-compose-dev.yml --profile cpu-s3fs up -d' to start the app_cpu service
# - Run 'docker compose -f docker-compose-dev.yml --profile gpu-s3fs up -d' to start the app_gpu service
# - Run 'docker compose -f docker-compose-dev.yml --profile cpu-fs up -d' to start the app_cpu service
# - Run 'docker compose -f docker-compose-dev.yml --profile gpu-fs up -d' to start the app_gpu service

View File

@ -1,159 +0,0 @@
# Docker Compose file for setting up Minio, createbuckets, app_cpu, and app_gpu services
version: '3.7'
services:
# Minio service for object storage
minio:
image: minio/minio
volumes:
- minio_data:/data
ports:
- '9000:9000'
- '9001:9001'
environment:
# Set the root user and password for Minio
MINIO_ROOT_USER: minioadmin # This acts as AWS_ACCESS_KEY
MINIO_ROOT_PASSWORD: minioadmin # This acts as AWS_SECRET_ACCESS_KEY
command: server --console-address ":9001" /data
restart: always
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:9000/minio/health/live']
interval: 30s
timeout: 20s
retries: 3
networks:
vpcbr:
ipv4_address: 10.5.0.2
# createbuckets service to create a bucket and set its policy
createbuckets:
image: minio/mc
depends_on:
- minio
entrypoint: >
/bin/sh -c "
/usr/bin/mc alias set myminio http://minio:9000 minioadmin minioadmin;
/usr/bin/mc mb myminio/mybucket;
/usr/bin/mc policy set public myminio/mybucket;
exit 0;
"
networks:
vpcbr:
# app_cpu service for running the CPU version of the application
app_cpu_s3fs:
volumes:
- app_data_cpu_s3fs:/app/server/build/jan
image: ghcr.io/janhq/jan-server:dev-cpu-latest
environment:
# Set the AWS access key, secret access key, bucket name, endpoint, and region for app_cpu
AWS_ACCESS_KEY_ID: minioadmin
AWS_SECRET_ACCESS_KEY: minioadmin
S3_BUCKET_NAME: mybucket
AWS_ENDPOINT: http://10.5.0.2:9000
AWS_REGION: us-east-1
API_BASE_URL: http://localhost:1337
restart: always
profiles:
- cpu-s3fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.3
# app_gpu service for running the GPU version of the application
app_gpu_s3fs:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
image: ghcr.io/janhq/jan-server:dev-cuda-12.2-latest
volumes:
- app_data_gpu_s3fs:/app/server/build/jan
restart: always
environment:
# Set the AWS access key, secret access key, bucket name, endpoint, and region for app_gpu
AWS_ACCESS_KEY_ID: minioadmin
AWS_SECRET_ACCESS_KEY: minioadmin
S3_BUCKET_NAME: mybucket
AWS_ENDPOINT: http://10.5.0.2:9000
AWS_REGION: us-east-1
API_BASE_URL: http://localhost:1337
profiles:
- gpu-s3fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.4
app_cpu_fs:
image: ghcr.io/janhq/jan-server:dev-cpu-latest
volumes:
- app_data_cpu_fs:/app/server/build/jan
environment:
API_BASE_URL: http://localhost:1337
restart: always
profiles:
- cpu-fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.5
# app_gpu service for running the GPU version of the application
app_gpu_fs:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
image: ghcr.io/janhq/jan-server:dev-cuda-12.2-latest
volumes:
- app_data_gpu_fs:/app/server/build/jan
restart: always
environment:
API_BASE_URL: http://localhost:1337
profiles:
- gpu-fs
ports:
- '3000:3000'
- '1337:1337'
- '3928:3928'
networks:
vpcbr:
ipv4_address: 10.5.0.6
volumes:
minio_data:
app_data_cpu_s3fs:
app_data_gpu_s3fs:
app_data_cpu_fs:
app_data_gpu_fs:
networks:
vpcbr:
driver: bridge
ipam:
config:
- subnet: 10.5.0.0/16
gateway: 10.5.0.1
# Usage:
# - Run 'docker compose --profile cpu-s3fs up -d' to start the app_cpu service
# - Run 'docker compose --profile gpu-s3fs up -d' to start the app_gpu service
# - Run 'docker compose --profile cpu-fs up -d' to start the app_cpu service
# - Run 'docker compose --profile gpu-fs up -d' to start the app_gpu service

View File

@ -1,8 +1,10 @@
const DEFAULT_MIN_WIDTH = 400
const DEFAULT_MIN_HEIGHT = 600
export const mainWindowConfig: Electron.BrowserWindowConstructorOptions = {
skipTaskbar: false,
minWidth: DEFAULT_MIN_WIDTH,
minHeight: DEFAULT_MIN_HEIGHT,
show: true,
transparent: true,
frame: false,

View File

@ -12,9 +12,9 @@ import {
} from 'fs'
import Store from 'electron-store'
import {
getJanExtensionsPath,
getJanDataFolderPath,
appResourcePath,
getJanExtensionsPath,
} from '@janhq/core/node'
/**
@ -28,8 +28,9 @@ export async function migrate() {
if (store.get('migrated_version') !== app.getVersion()) {
console.debug('start migration:', store.get('migrated_version'))
// if (existsSync(getJanExtensionsPath()))
// rmdirSync(getJanExtensionsPath(), { recursive: true })
if (existsSync(getJanExtensionsPath()))
rmdirSync(getJanExtensionsPath(), { recursive: true })
await migrateThemes()
store.set('migrated_version', app.getVersion())
@ -43,9 +44,9 @@ async function migrateThemes() {
if (!existsSync(join(getJanDataFolderPath(), 'themes')))
mkdirSync(join(getJanDataFolderPath(), 'themes'), { recursive: true })
const themes = readdirSync(join(await appResourcePath(), 'themes'))
const themes = readdirSync(join(appResourcePath(), 'themes'))
for (const theme of themes) {
const themePath = join(await appResourcePath(), 'themes', theme)
const themePath = join(appResourcePath(), 'themes', theme)
if (existsSync(themePath) && !lstatSync(themePath).isDirectory()) {
continue
}

View File

@ -1,4 +1,16 @@
[
{
"key": "anthropic-api-key",
"title": "API Key",
"description": "The Anthropic API uses API keys for authentication. Visit your [API Keys](https://console.anthropic.com/settings/keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "https://api.anthropic.com/v1/messages",
"value": "https://api.anthropic.com/v1/messages"
}
},
{
"key": "anthropic-api-key",
"title": "API Key",
"description": "The Anthropic API uses API keys for authentication. Visit your [API Keys](https://console.anthropic.com/settings/keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]
]

View File

@ -1,4 +1,16 @@
[
{
"key": "cohere-api-key",
"title": "API Key",
"description": "The Cohere API uses API keys for authentication. Visit your [API Keys](https://dashboard.cohere.com/api-keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "https://api.cohere.ai/v1/chat",
"value": "https://api.cohere.ai/v1/chat"
}
},
{
"key": "cohere-api-key",
"title": "API Key",
"description": "The Cohere API uses API keys for authentication. Visit your [API Keys](https://dashboard.cohere.com/api-keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]

View File

@ -1,4 +1,16 @@
[
{
"key": "groq-api-key",
"title": "API Key",
"description": "The Groq API uses API keys for authentication. Visit your [API Keys](https://console.groq.com/keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "https://api.groq.com/openai/v1/chat/completions",
"value": "https://api.groq.com/openai/v1/chat/completions"
}
},
{
"key": "groq-api-key",
"title": "API Key",
"description": "The Groq API uses API keys for authentication. Visit your [API Keys](https://console.groq.com/keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]

View File

@ -1,4 +1,16 @@
[
{
"key": "martian-api-key",
"title": "API Key",
"description": "The Martian API uses API keys for authentication. Visit your [API Keys](https://withmartian.com/dashboard) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "https://withmartian.com/api/openai/v1/chat/completions",
"value": "https://withmartian.com/api/openai/v1/chat/completions"
}
},
{
"key": "martian-api-key",
"title": "API Key",
"description": "The Martian API uses API keys for authentication. Visit your [API Keys](https://withmartian.com/dashboard) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]

View File

@ -1,4 +1,16 @@
[
{
"key": "mistral-api-key",
"title": "API Key",
"description": "The Mistral API uses API keys for authentication. Visit your [API Keys](https://console.mistral.ai/api-keys/) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "https://api.mistral.ai/v1/chat/completions",
"value": "https://api.mistral.ai/v1/chat/completions"
}
},
{
"key": "mistral-api-key",
"title": "API Key",
"description": "The Mistral API uses API keys for authentication. Visit your [API Keys](https://console.mistral.ai/api-keys/) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]

View File

@ -1 +1 @@
0.4.20
0.5.0

View File

@ -1,3 +1,3 @@
@echo off
set /p CORTEX_VERSION=<./bin/version.txt
.\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64-avx2-cuda-12-0.tar.gz -e --strip 1 -o ./bin/win-cuda-12-0 && .\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64-avx2-cuda-11-7.tar.gz -e --strip 1 -o ./bin/win-cuda-11-7 && .\node_modules\.bin\download https://github.com/janhq/nitro/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64-avx2.tar.gz -e --strip 1 -o ./bin/win-cpu && .\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64-vulkan.tar.gz -e --strip 1 -o ./bin/win-vulkan
.\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64.tar.gz -e --strip 1 -o ./bin/win-cuda-12-0 && .\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64.tar.gz -e --strip 1 -o ./bin/win-cuda-11-7 && .\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64.tar.gz -e --strip 1 -o ./bin/win-cpu && .\node_modules\.bin\download https://github.com/janhq/cortex/releases/download/v%CORTEX_VERSION%/cortex-cpp-%CORTEX_VERSION%-windows-amd64.tar.gz -e --strip 1 -o ./bin/win-vulkan && .\node_modules\.bin\download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-windows-amd64-noavx-cuda-12-0.tar.gz -e --strip 1 -o ./bin/win-cuda-12-0/engines/cortex.llamacpp && .\node_modules\.bin\download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-windows-amd64-noavx-cuda-11-7.tar.gz -e --strip 1 -o ./bin/win-cuda-11-7/engines/cortex.llamacpp && .\node_modules\.bin\download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-windows-amd64-noavx.tar.gz -e --strip 1 -o ./bin/win-cpu/engines/cortex.llamacpp && .\node_modules\.bin\download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-windows-amd64-vulkan.tar.gz -e --strip 1 -o ./bin/win-vulkan/engines/cortex.llamacpp

View File

@ -1,7 +1,7 @@
{
"name": "@janhq/inference-cortex-extension",
"productName": "Cortex Inference Engine",
"version": "1.0.14",
"version": "1.0.15",
"description": "This extension embeds cortex.cpp, a lightweight inference engine written in C++. See https://nitro.jan.ai.\nAdditional dependencies could be installed to run without Cuda Toolkit installation.",
"main": "dist/index.js",
"node": "dist/node/index.cjs.js",
@ -10,8 +10,8 @@
"scripts": {
"test": "jest",
"build": "tsc --module commonjs && rollup -c rollup.config.ts",
"downloadnitro:linux": "CORTEX_VERSION=$(cat ./bin/version.txt) && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64-avx2.tar.gz -e --strip 1 -o ./bin/linux-cpu && chmod +x ./bin/linux-cpu/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64-avx2-cuda-12-0.tar.gz -e --strip 1 -o ./bin/linux-cuda-12-0 && chmod +x ./bin/linux-cuda-12-0/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64-avx2-cuda-11-7.tar.gz -e --strip 1 -o ./bin/linux-cuda-11-7 && chmod +x ./bin/linux-cuda-11-7/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64-vulkan.tar.gz -e --strip 1 -o ./bin/linux-vulkan && chmod +x ./bin/linux-vulkan/cortex-cpp",
"downloadnitro:darwin": "CORTEX_VERSION=$(cat ./bin/version.txt) && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-mac-arm64.tar.gz -o ./bin/ && mkdir -p ./bin/mac-arm64 && tar -zxvf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-arm64.tar.gz --strip-components=1 -C ./bin/mac-arm64 && rm -rf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-arm64.tar.gz && chmod +x ./bin/mac-arm64/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-mac-amd64.tar.gz -o ./bin/ && mkdir -p ./bin/mac-amd64 && tar -zxvf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-amd64.tar.gz --strip-components=1 -C ./bin/mac-amd64 && rm -rf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-amd64.tar.gz && chmod +x ./bin/mac-amd64/cortex-cpp",
"downloadnitro:linux": "CORTEX_VERSION=$(cat ./bin/version.txt) && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64.tar.gz -e --strip 1 -o ./bin/linux-cpu && chmod +x ./bin/linux-cpu/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64.tar.gz -e --strip 1 -o ./bin/linux-cuda-12-0 && chmod +x ./bin/linux-cuda-12-0/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64.tar.gz -e --strip 1 -o ./bin/linux-cuda-11-7 && chmod +x ./bin/linux-cuda-11-7/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-linux-amd64.tar.gz -e --strip 1 -o ./bin/linux-vulkan && chmod +x ./bin/linux-vulkan/cortex-cpp && download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-linux-amd64-noavx.tar.gz -e --strip 1 -o ./bin/linux-cpu/engines/cortex.llamacpp && download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-linux-amd64-noavx-cuda-12-0.tar.gz -e --strip 1 -o ./bin/linux-cuda-12-0/engines/cortex.llamacpp && download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-linux-amd64-noavx-cuda-11-7.tar.gz -e --strip 1 -o ./bin/linux-cuda-11-7/engines/cortex.llamacpp && download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-linux-amd64-vulkan.tar.gz -e --strip 1 -o ./bin/linux-vulkan/engines/cortex.llamacpp",
"downloadnitro:darwin": "CORTEX_VERSION=$(cat ./bin/version.txt) && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-mac-arm64.tar.gz -o ./bin/ && mkdir -p ./bin/mac-arm64 && tar -zxvf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-arm64.tar.gz --strip-components=1 -C ./bin/mac-arm64 && rm -rf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-arm64.tar.gz && chmod +x ./bin/mac-arm64/cortex-cpp && download https://github.com/janhq/cortex/releases/download/v${CORTEX_VERSION}/cortex-cpp-${CORTEX_VERSION}-mac-amd64.tar.gz -o ./bin/ && mkdir -p ./bin/mac-amd64 && tar -zxvf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-amd64.tar.gz --strip-components=1 -C ./bin/mac-amd64 && rm -rf ./bin/cortex-cpp-${CORTEX_VERSION}-mac-amd64.tar.gz && chmod +x ./bin/mac-amd64/cortex-cpp && download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-mac-arm64.tar.gz -e --strip 1 -o ./bin/mac-arm64/engines/cortex.llamacpp && download https://github.com/janhq/cortex.llamacpp/releases/download/v0.1.25/cortex.llamacpp-0.1.25-mac-amd64.tar.gz -e --strip 1 -o ./bin/mac-amd64/engines/cortex.llamacpp",
"downloadnitro:win32": "download.bat",
"downloadnitro": "run-script-os",
"build:publish:darwin": "rimraf *.tgz --glob && yarn build && npm run downloadnitro && ../../.github/scripts/auto-sign.sh && cpx \"bin/**\" \"dist/bin\" && npm pack && cpx *.tgz ../../pre-install",

View File

@ -1,20 +1,20 @@
{
"sources": [
{
"filename": "gemma-2b-it-q4_k_m.gguf",
"url": "https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf"
"filename": "gemma-1.1-2b-it-q4_k_m.gguf",
"url": "https://huggingface.co/bartowski/gemma-1.1-2b-it-GGUF/resolve/main/gemma-1.1-2b-it-Q4_K_M.gguf"
}
],
"id": "gemma-2b",
"id": "gemma-1.1-2b-it",
"object": "model",
"name": "Gemma 2B Q4",
"name": "Gemma 1.1 2B Q4",
"version": "1.3",
"description": "Gemma is built from the same technology with Google's Gemini.",
"format": "gguf",
"settings": {
"ctx_len": 8192,
"prompt_template": "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model",
"llama_model_path": "gemma-2b-it-q4_k_m.gguf",
"llama_model_path": "gemma-1.1-2b-it-Q4_K_M.gguf",
"ngl": 19
},
"parameters": {
@ -29,7 +29,7 @@
"metadata": {
"author": "Google",
"tags": ["2B", "Finetuned", "Tiny"],
"size": 1500000000
"size": 1630000000
},
"engine": "nitro"
}

View File

@ -1,20 +1,20 @@
{
"sources": [
{
"filename": "gemma-7b-it-q4_K_M.gguf",
"url": "https://huggingface.co/mmnga/gemma-7b-it-gguf/resolve/main/gemma-7b-it-q4_K_M.gguf"
"filename": "gemma-1.1-7b-it-q4_K_M.gguf",
"url": "https://huggingface.co/bartowski/gemma-1.1-7b-it-GGUF/resolve/main/gemma-1.1-7b-it-Q4_K_M.gguf"
}
],
"id": "gemma-7b",
"id": "gemma-1.1-7b-it",
"object": "model",
"name": "Gemma 7B Q4",
"name": "Gemma 1.1 7B Q4",
"version": "1.2",
"description": "Google's Gemma is built for multilingual purpose",
"format": "gguf",
"settings": {
"ctx_len": 8192,
"prompt_template": "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model",
"llama_model_path": "gemma-7b-it-q4_K_M.gguf",
"llama_model_path": "gemma-1.1-7b-it-q4_K_M.gguf",
"ngl": 29
},
"parameters": {

View File

@ -0,0 +1,42 @@
{
"sources": [
{
"filename": "gemma-2-27b-it-Q4_K_M.gguf",
"url": "https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/resolve/main/gemma-2-27b-it-Q4_K_M.gguf"
}
],
"id": "gemma-2-27b-it",
"object": "model",
"name": "Gemma 2 27B Q4",
"version": "1.0",
"description": "Gemma is built from the same technology with Google's Gemini.",
"format": "gguf",
"settings": {
"ctx_len": 8192,
"prompt_template": "<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n<end_of_turn>\n<start_of_turn>model\n",
"llama_model_path": "gemma-2-27b-it-Q4_K_M.gguf",
"ngl": 47
},
"parameters": {
"temperature": 0.7,
"top_p": 0.95,
"stream": true,
"max_tokens": 8192,
"stop": [
"<end_of_turn>"
],
"frequency_penalty": 0,
"presence_penalty": 0
},
"metadata": {
"author": "Google",
"tags": [
"27B",
"Conversational",
"Text-generation",
"Featured"
],
"size": 16600000000
},
"engine": "nitro"
}

View File

@ -0,0 +1,43 @@
{
"sources": [
{
"filename": "gemma-2-2b-it-Q4_K_M.gguf",
"url": "https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_M.gguf"
}
],
"id": "gemma-2-2b-it",
"object": "model",
"name": "Gemma 2 2B Q4",
"version": "1.0",
"description": "Gemma is built from the same technology with Google's Gemini.",
"format": "gguf",
"settings": {
"ctx_len": 8192,
"prompt_template": "<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n<end_of_turn>\n<start_of_turn>model\n",
"llama_model_path": "gemma-2-2b-it-Q4_K_M.gguf",
"ngl": 27
},
"parameters": {
"temperature": 0.7,
"top_p": 0.95,
"stream": true,
"max_tokens": 8192,
"stop": [
"<end_of_turn>"
],
"frequency_penalty": 0,
"presence_penalty": 0
},
"metadata": {
"author": "Google",
"tags": [
"2B",
"Tiny",
"Conversational",
"Text-generation",
"Featured"
],
"size": 1710000000
},
"engine": "nitro"
}

View File

@ -0,0 +1,42 @@
{
"sources": [
{
"filename": "gemma-2-9b-it-Q4_K_M.gguf",
"url": "https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-Q4_K_M.gguf"
}
],
"id": "gemma-2-9b-it",
"object": "model",
"name": "Gemma 2 9B Q4",
"version": "1.0",
"description": "Gemma is built from the same technology with Google's Gemini.",
"format": "gguf",
"settings": {
"ctx_len": 8192,
"prompt_template": "<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n<end_of_turn>\n<start_of_turn>model\n",
"llama_model_path": "gemma-2-9b-it-Q4_K_M.gguf",
"ngl": 43
},
"parameters": {
"temperature": 0.7,
"top_p": 0.95,
"stream": true,
"max_tokens": 8192,
"stop": [
"<end_of_turn>"
],
"frequency_penalty": 0,
"presence_penalty": 0
},
"metadata": {
"author": "Google",
"tags": [
"9B",
"Conversational",
"Text-generation",
"Featured"
],
"size": 5760000000
},
"engine": "nitro"
}

View File

@ -2,7 +2,7 @@
"sources": [
{
"filename": "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
"url": "https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"
"url": "https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"
}
],
"id": "llama3-8b-instruct",
@ -28,7 +28,7 @@
},
"metadata": {
"author": "MetaAI",
"tags": ["7B", "Featured"],
"tags": ["8B", "Featured"],
"size": 4920000000
},
"engine": "nitro"

View File

@ -0,0 +1,42 @@
{
"sources": [
{
"filename": "Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",
"url": "https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf"
}
],
"id": "llama3.1-70b-instruct",
"object": "model",
"name": "Llama 3.1 70B Q4 Instruct",
"version": "1.0",
"description": "Meta's Llama 3.1 excels at general usage situations, including chat, general world knowledge, and coding.",
"format": "gguf",
"settings": {
"ctx_len": 131072,
"prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"llama_model_path": "Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",
"ngl": 33
},
"parameters": {
"temperature": 0.7,
"top_p": 0.95,
"stream": true,
"max_tokens": 8192,
"stop": [
"<|end_of_text|>",
"<|eot_id|>",
"<|eom_id|>"
],
"frequency_penalty": 0,
"presence_penalty": 0
},
"metadata": {
"author": "MetaAI",
"tags": [
"70B",
"Featured"
],
"size": 42500000000
},
"engine": "nitro"
}

View File

@ -0,0 +1,42 @@
{
"sources": [
{
"filename": "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
"url": "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
}
],
"id": "llama3.1-8b-instruct",
"object": "model",
"name": "Llama 3.1 8B Q4 Instruct",
"version": "1.0",
"description": "Meta's Llama 3.1 excels at general usage situations, including chat, general world knowledge, and coding.",
"format": "gguf",
"settings": {
"ctx_len": 131072,
"prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"llama_model_path": "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",
"ngl": 33
},
"parameters": {
"temperature": 0.7,
"top_p": 0.95,
"stream": true,
"max_tokens": 8192,
"stop": [
"<|end_of_text|>",
"<|eot_id|>",
"<|eom_id|>"
],
"frequency_penalty": 0,
"presence_penalty": 0
},
"metadata": {
"author": "MetaAI",
"tags": [
"8B",
"Featured"
],
"size": 4920000000
},
"engine": "nitro"
}
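
For readers unfamiliar with these model configs, the `prompt_template` fields above use simple `{system_message}` and `{prompt}` placeholders. A minimal sketch of how such a template gets filled before the text reaches the engine (the helper name is hypothetical and not part of this diff):

```ts
// Minimal sketch: substitute the placeholders used by Jan model templates.
// `applyPromptTemplate` is an illustrative helper, not code from this commit.
function applyPromptTemplate(
  template: string,
  prompt: string,
  systemMessage = ''
): string {
  return template
    .replace('{system_message}', systemMessage)
    .replace('{prompt}', prompt)
}

// Example with the Llama 3.1 template above:
const template =
  '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>' +
  '<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>' +
  '<|start_header_id|>assistant<|end_header_id|>\n\n'

console.log(applyPromptTemplate(template, 'Hello!', 'You are a helpful assistant.'))
```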

View File

@ -12,8 +12,8 @@ const codeninja7bJson = require('./resources/models/codeninja-1.0-7b/model.json'
const commandr34bJson = require('./resources/models/command-r-34b/model.json')
const deepseekCoder13bJson = require('./resources/models/deepseek-coder-1.3b/model.json')
const deepseekCoder34bJson = require('./resources/models/deepseek-coder-34b/model.json')
const gemma2bJson = require('./resources/models/gemma-2b/model.json')
const gemma7bJson = require('./resources/models/gemma-7b/model.json')
const gemma112bJson = require('./resources/models/gemma-1.1-2b/model.json')
const gemma117bJson = require('./resources/models/gemma-1.1-7b/model.json')
const llama2Chat70bJson = require('./resources/models/llama2-chat-70b/model.json')
const llama2Chat7bJson = require('./resources/models/llama2-chat-7b/model.json')
const llamacorn1bJson = require('./resources/models/llamacorn-1.1b/model.json')
@ -40,7 +40,11 @@ const aya35bJson = require('./resources/models/aya-23-35b/model.json')
const phimediumJson = require('./resources/models/phi3-medium/model.json')
const codestralJson = require('./resources/models/codestral-22b/model.json')
const qwen2Json = require('./resources/models/qwen2-7b/model.json')
const llama318bJson = require('./resources/models/llama3.1-8b-instruct/model.json')
const llama3170bJson = require('./resources/models/llama3.1-70b-instruct/model.json')
const gemma22bJson = require('./resources/models/gemma-2-2b/model.json')
const gemma29bJson = require('./resources/models/gemma-2-9b/model.json')
const gemma227bJson = require('./resources/models/gemma-2-27b/model.json')
export default [
{
@ -60,8 +64,8 @@ export default [
commandr34bJson,
deepseekCoder13bJson,
deepseekCoder34bJson,
gemma2bJson,
gemma7bJson,
gemma112bJson,
gemma117bJson,
llama2Chat70bJson,
llama2Chat7bJson,
llamacorn1bJson,
@ -87,7 +91,12 @@ export default [
aya8bJson,
aya35bJson,
codestralJson,
qwen2Json
qwen2Json,
llama318bJson,
llama3170bJson,
gemma22bJson,
gemma29bJson,
gemma227bJson
]),
NODE: JSON.stringify(`${packageJson.name}/${packageJson.node}`),
DEFAULT_SETTINGS: JSON.stringify(defaultSettingJson),

View File

@ -260,9 +260,14 @@ function loadLLMModel(settings: any): Promise<Response> {
async function validateModelStatus(modelId: string): Promise<void> {
// Send a POST request to the validation URL.
// Retry the request up to 3 times if it fails, with a delay of 500 milliseconds between retries.
log(`[CORTEX]::Debug: Validating model ${modelId}`)
return fetchRetry(NITRO_HTTP_VALIDATE_MODEL_URL, {
method: 'POST',
body: JSON.stringify({ model: modelId }),
body: JSON.stringify({
model: modelId,
// TODO: force to use cortex llamacpp by default
engine: 'cortex.llamacpp'
}),
headers: {
'Content-Type': 'application/json',
},
@ -288,8 +293,9 @@ async function validateModelStatus(modelId: string): Promise<void> {
return Promise.resolve()
}
}
const errorBody = await res.text()
log(
`[CORTEX]::Debug: Validate model state failed with response ${JSON.stringify(
`[CORTEX]::Debug: Validate model state failed with response ${errorBody} and status is ${JSON.stringify(
res.statusText
)}`
)
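
The comment above describes `fetchRetry`'s behaviour (up to 3 attempts, 500 ms apart). The helper itself lives elsewhere in the extension; the following is only a rough sketch of that behaviour under those assumptions, returning the last response so callers can inspect the error body the way `validateModelStatus` does:

```ts
// Rough sketch of a retrying fetch, matching the behaviour described above.
// Not the extension's actual helper: the retry count, delay, and error
// handling here are assumptions for illustration.
async function fetchRetry(
  url: string,
  init: RequestInit,
  retries = 3,
  delayMs = 500
): Promise<Response> {
  let res = await fetch(url, init)
  for (let attempt = 1; attempt < retries && !res.ok; attempt++) {
    // Wait before the next attempt.
    await new Promise((resolve) => setTimeout(resolve, delayMs))
    res = await fetch(url, init)
  }
  // The last response is returned even when it is not OK, so the caller can
  // read the body and status, as the validation code above does.
  return res
}
```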

View File

@ -1,4 +1,16 @@
[
{
"key": "nvidia-api-key",
"title": "API Key",
"description": "The NVIDIA API uses API keys for authentication. Visit your [API Keys](https://org.ngc.nvidia.com/setup/personal-keys) page to retrieve the API key you'll use in your requests..",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,17 +20,5 @@
"placeholder": "https://integrate.api.nvidia.com/v1/chat/completions",
"value": "https://integrate.api.nvidia.com/v1/chat/completions"
}
},
{
"key": "nvidia-api-key",
"title": "API Key",
"description": "The NVIDIA API uses API keys for authentication. Visit your [API Keys](https://org.ngc.nvidia.com/setup/personal-keys) page to retrieve the API key you'll use in your requests..",
"controllerType": "input",
"controllerProps": {
"placeholder": "nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
}
]
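
Settings entries like the two above are looked up by their `key` on the extension side. A minimal sketch of that pattern, mirroring the `getSetting` calls visible in the OpenRouter extension further down; this example class and its default values are illustrative, not code from this diff:

```ts
// Illustrative only: reads the two settings declared above by their keys.
// The class name and defaults are assumptions; the getSetting pattern follows
// the OpenRouter extension shown later in this diff.
import { RemoteOAIEngine } from '@janhq/core'

export default class ExampleNvidiaExtension extends RemoteOAIEngine {
  inferenceUrl: string = ''
  provider: string = 'nvidia'

  override async onLoad(): Promise<void> {
    super.onLoad()
    // Keys must match the "key" fields in settings.json.
    this.apiKey = await this.getSetting<string>('nvidia-api-key', '')
    this.inferenceUrl = await this.getSetting<string>(
      'chat-completions-endpoint',
      'https://integrate.api.nvidia.com/v1/chat/completions'
    )
  }
}
```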

View File

@ -1,4 +1,16 @@
[
{
"key": "openai-api-key",
"title": "API Key",
"description": "The OpenAI API uses API keys for authentication. Visit your [API Keys](https://platform.openai.com/account/api-keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "https://api.openai.com/v1/chat/completions",
"value": "https://api.openai.com/v1/chat/completions"
}
},
{
"key": "openai-api-key",
"title": "API Key",
"description": "The OpenAI API uses API keys for authentication. Visit your [API Keys](https://platform.openai.com/account/api-keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]

View File

@ -1,8 +1,20 @@
[
{
"key": "openrouter-api-key",
"title": "API Key",
"description": "The OpenRouter API uses API keys for authentication. Visit your [API Keys](https://openrouter.ai/keys) page to retrieve the API key you'll use in your requests.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
"description": "The endpoint to use for chat completions. See the [OpenRouter API documentation](https://openrouter.ai/docs) for more information.",
"description": "The endpoint to use for chat completions. See the [OpenRouter API documentation](https://openrouter.ai/docs/requests) for more information.",
"controllerType": "input",
"controllerProps": {
"placeholder": "https://openrouter.ai/api/v1/chat/completions",
@ -10,14 +22,13 @@
}
},
{
"key": "openrouter-api-key",
"title": "API Key",
"description": "The OpenRouter API uses API keys for authentication. Visit your [API Keys](https://openrouter.ai/keys) page to retrieve the API key you'll use in your requests.",
"key": "openrouter-model",
"title": "Model",
"description": "If the model parameter is omitted, the user or payer's default is used. Otherwise, remember to select a value for model from the [supported models](https://openrouter.ai/docs/models) or API, and include the organization prefix.",
"controllerType": "input",
"controllerProps": {
"placeholder": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
"placeholder": "Leave empty for default model",
"value": ""
}
}
]

View File

@ -8,22 +8,16 @@
import { RemoteOAIEngine } from '@janhq/core'
import { PayloadType } from '@janhq/core'
import { ChatCompletionRole } from '@janhq/core'
declare const SETTINGS: Array<any>
declare const MODELS: Array<any>
enum Settings {
apiKey = 'openrouter-api-key',
model = 'openrouter-model',
chatCompletionsEndPoint = 'chat-completions-endpoint',
}
enum RoleType {
user = 'USER',
chatbot = 'CHATBOT',
system = 'SYSTEM',
}
/**
* A class that implements the InferenceExtension interface from the @janhq/core package.
* The class provides methods for initializing and stopping a model, and for making inference requests.
@ -32,6 +26,7 @@ enum RoleType {
export default class JanInferenceOpenRouterExtension extends RemoteOAIEngine {
inferenceUrl: string = ''
provider: string = 'openrouter'
model?: string | undefined
override async onLoad(): Promise<void> {
super.onLoad()
@ -45,6 +40,9 @@ export default class JanInferenceOpenRouterExtension extends RemoteOAIEngine {
Settings.chatCompletionsEndPoint,
''
)
this.model = await this.getSetting<string>(Settings.model, '')
// OpenRouter falls back to its default model when no model param is set
if (!this.model?.length) this.model = undefined
if (this.inferenceUrl.length === 0) {
SETTINGS.forEach((setting) => {
if (setting.key === Settings.chatCompletionsEndPoint) {
@ -54,6 +52,14 @@ export default class JanInferenceOpenRouterExtension extends RemoteOAIEngine {
}
}
override async headers(): Promise<HeadersInit> {
return {
'Content-Type': 'application/json',
'HTTP-Referer': 'https://jan.ai',
'Authorization': `Bearer ${this.apiKey}`,
}
}
onSettingUpdate<T>(key: string, value: T): void {
if (key === Settings.apiKey) {
this.apiKey = value as string
@ -69,8 +75,14 @@ export default class JanInferenceOpenRouterExtension extends RemoteOAIEngine {
} else {
this.inferenceUrl = value
}
} else if (key === Settings.model) {
this.model =
typeof value === 'string' && value.length > 0 ? value : undefined
}
}
transformPayload = (payload: PayloadType)=>({...payload,model:"openrouter/auto"})
transformPayload = (payload: PayloadType) => ({
...payload,
model: this.model,
})
}
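
With this change the request body only carries a `model` field when the new `openrouter-model` setting is filled in; otherwise `model` stays `undefined` and OpenRouter applies the account default (previously `openrouter/auto` was always forced). A small illustration of the transform, with a hypothetical payload:

```ts
// Hypothetical payload, only to illustrate the transformPayload change above.
const payload = { messages: [{ role: 'user', content: 'Hi' }], stream: true }

// When the "openrouter-model" setting is empty, this.model is undefined,
// so JSON.stringify drops the key and OpenRouter picks its default model.
const withDefault = { ...payload, model: undefined }

// When the setting is filled in, the chosen model is forwarded as-is.
const withExplicitModel = { ...payload, model: 'meta-llama/llama-3.1-8b-instruct' }

console.log(JSON.stringify(withDefault)) // no "model" key in the body
console.log(JSON.stringify(withExplicitModel))
```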

View File

@ -1,4 +1,16 @@
[
{
"key": "tritonllm-api-key",
"title": "API Key",
"description": "The Triton LLM API uses API keys for authentication.",
"controllerType": "input",
"controllerProps": {
"placeholder": "Insert API Key",
"value": "",
"type": "password",
"inputActions": ["unobscure", "copy"]
}
},
{
"key": "chat-completions-endpoint",
"title": "Chat Completions Endpoint",
@ -8,16 +20,5 @@
"placeholder": "http://localhost:8000/v2/models/tensorrt_llm_bls/generate",
"value": "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"
}
},
{
"key": "tritonllm-api-key",
"title": "Triton LLM API Key",
"description": "The Triton LLM API uses API keys for authentication.",
"controllerType": "input",
"controllerProps": {
"placeholder": "xxxxxxxxxxxxxxxxxxxx",
"value": "",
"type": "password"
}
}
]

View File

@ -1,3 +0,0 @@
@echo off
set /p LLAMA_CPP_VERSION=<./scripts/version.txt
.\node_modules\.bin\download https://github.com/ggerganov/llama.cpp/archive/refs/tags/%LLAMA_CPP_VERSION%.tar.gz -o . --filename ./scripts/llama.cpp.tar.gz && tar -xzf .\scripts\llama.cpp.tar.gz "llama.cpp-%LLAMA_CPP_VERSION%/convert.py" "llama.cpp-%LLAMA_CPP_VERSION%/convert-hf-to-gguf.py" "llama.cpp-%LLAMA_CPP_VERSION%/gguf-py" && cpx "./llama.cpp-%LLAMA_CPP_VERSION%/**" "scripts" && rimraf "./scripts/llama.cpp.tar.gz" && rimraf "./llama.cpp-%LLAMA_CPP_VERSION%"

View File

@ -9,31 +9,25 @@
"license": "AGPL-3.0",
"scripts": {
"build": "tsc --module commonjs && rollup -c rollup.config.ts --configPlugin @rollup/plugin-typescript --bundleConfigAsCjs",
"download:llama": "run-script-os",
"download:llama:linux": "LLAMA_CPP_VERSION=$(cat ./scripts/version.txt) && download https://github.com/ggerganov/llama.cpp/archive/refs/tags/${LLAMA_CPP_VERSION}.tar.gz -o . --filename ./scripts/llama.cpp.tar.gz && tar -xzf ./scripts/llama.cpp.tar.gz --wildcards '*/convert.py' '*/convert-hf-to-gguf.py' '*/gguf-py' && cpx \"./llama.cpp-$LLAMA_CPP_VERSION/**\" \"scripts\" && rimraf \"./scripts/llama.cpp.tar.gz\" && rimraf \"./llama.cpp-$LLAMA_CPP_VERSION\"",
"download:llama:darwin": "LLAMA_CPP_VERSION=$(cat ./scripts/version.txt) && download https://github.com/ggerganov/llama.cpp/archive/refs/tags/${LLAMA_CPP_VERSION}.tar.gz -o . --filename ./scripts/llama.cpp.tar.gz && tar -xzf ./scripts/llama.cpp.tar.gz '*/convert.py' '*/convert-hf-to-gguf.py' '*/gguf-py' && cpx \"./llama.cpp-$LLAMA_CPP_VERSION/**\" \"scripts\" && rimraf \"./scripts/llama.cpp.tar.gz\" && rimraf \"./llama.cpp-$LLAMA_CPP_VERSION\"",
"download:llama:win32": "download.bat",
"build:publish:linux": "rimraf *.tgz --glob && yarn build && yarn download:llama && cpx \"scripts/**\" \"dist/scripts\" && cpx \"bin/**\" \"dist/bin\" && npm pack && cpx *.tgz ../../pre-install",
"build:publish:darwin": "rimraf *.tgz --glob && yarn build && yarn download:llama && cpx \"scripts/**\" \"dist/scripts\" && cpx \"bin/**\" \"dist/bin\" && ../../.github/scripts/auto-sign.sh && npm pack && cpx *.tgz ../../pre-install",
"build:publish:win32": "rimraf *.tgz --glob && yarn build && yarn download:llama && cpx \"scripts/**\" \"dist/scripts\" && cpx \"bin/**\" \"dist/bin\" && npm pack && cpx *.tgz ../../pre-install",
"build:publish": "run-script-os"
"build:publish": "rimraf *.tgz --glob && yarn build && npm pack && cpx *.tgz ../../pre-install"
},
"devDependencies": {
"cpx": "^1.5.0",
"download-cli": "^1.1.1",
"rimraf": "^3.0.2",
"ts-loader": "^9.5.0",
"typescript": "5.3.3",
"@rollup/plugin-commonjs": "^25.0.7",
"@rollup/plugin-json": "^6.1.0",
"@rollup/plugin-node-resolve": "^15.2.3",
"@rollup/plugin-replace": "^5.0.5",
"@rollup/plugin-typescript": "^11.1.6",
"@types/pdf-parse": "^1.1.4",
"cpx": "^1.5.0",
"download-cli": "^1.1.1",
"rimraf": "^3.0.2",
"rollup": "^2.38.5",
"rollup-plugin-define": "^1.0.1",
"rollup-plugin-sourcemaps": "^0.6.3",
"rollup-plugin-typescript2": "^0.36.0"
"rollup-plugin-typescript2": "^0.36.0",
"run-script-os": "^1.1.6",
"ts-loader": "^9.5.0",
"typescript": "5.3.3"
},
"files": [
"dist/*",
@ -41,8 +35,15 @@
"README.md"
],
"dependencies": {
"@janhq/core": "file:../../core",
"@huggingface/gguf": "^0.0.11",
"@huggingface/jinja": "^0.3.0",
"@janhq/core": "file:../../core",
"hyllama": "^0.2.2",
"python-shell": "^5.0.0"
}
},
"bundleDependencies": [
"hyllama",
"@huggingface/gguf",
"@huggingface/jinja"
]
}
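
The dependency changes above swap the bundled llama.cpp Python conversion scripts for JS-side GGUF tooling (`hyllama`, `@huggingface/gguf`, `@huggingface/jinja`). As a rough sketch of what reading GGUF metadata from JS can look like, based on `@huggingface/gguf`'s documented API (the exact call shape is an assumption here, and the URL is simply the Llama 3.1 file referenced earlier in this diff):

```ts
// Sketch only: read GGUF metadata and tensor info from a remote model file.
// The gguf() helper and its return shape follow @huggingface/gguf's docs;
// treat both as assumptions rather than code from this commit.
import { gguf } from '@huggingface/gguf'

async function inspectModel(url: string): Promise<void> {
  const { metadata, tensorInfos } = await gguf(url)
  console.log('architecture:', metadata['general.architecture'])
  console.log('tensor count:', tensorInfos.length)
}

inspectModel(
  'https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf'
)
```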

View File

@ -3,7 +3,7 @@ import sourceMaps from 'rollup-plugin-sourcemaps'
import typescript from 'rollup-plugin-typescript2'
import json from '@rollup/plugin-json'
import replace from '@rollup/plugin-replace'
import commonjs from '@rollup/plugin-commonjs'
const settingJson = require('./resources/settings.json')
const packageJson = require('./package.json')
const defaultModelJson = require('./resources/default-model.json')
@ -39,6 +39,39 @@ export default [
browser: true,
}),
// Resolve source maps to the original source
sourceMaps(),
],
},
{
input: `src/node/index.ts`,
output: [
{
file: 'dist/node/index.cjs.js',
format: 'cjs',
sourcemap: true,
inlineDynamicImports: true,
},
],
// Indicate here external modules you don't wanna include in your bundle (i.e.: 'lodash')
external: ['@janhq/core/node'],
watch: {
include: 'src/node/**',
},
plugins: [
// Allow json resolution
json(),
// Compile TypeScript files
typescript({ useTsconfigDeclarationDir: true }),
// Allow bundling cjs modules (unlike webpack, rollup doesn't understand cjs)
commonjs(),
// Allow node_modules resolution, so you can use 'external' to control
// which external modules to include in the bundle
// https://github.com/rollup/rollup-plugin-node-resolve#usage
resolve({
extensions: ['.ts', '.js', '.json'],
}),
// Resolve source maps to the original source
sourceMaps(),
],

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@ -1,21 +0,0 @@
MIT License
Copyright (c) 2023 Georgi Gerganov
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@ -1,81 +0,0 @@
## gguf
This is a Python package for writing binary files in the [GGUF](https://github.com/ggerganov/ggml/pull/302)
(GGML Universal File) format.
See [convert-llama-hf-to-gguf.py](https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py)
as an example for its usage.
## Installation
```sh
pip install gguf
```
## API Examples/Simple Tools
[examples/writer.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/examples/writer.py) — Generates `example.gguf` in the current directory to demonstrate generating a GGUF file. Note that this file cannot be used as a model.
[scripts/gguf-dump.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf-dump.py) — Dumps a GGUF file's metadata to the console.
[scripts/gguf-set-metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf-set-metadata.py) — Allows changing simple metadata values in a GGUF file by key.
[scripts/gguf-convert-endian.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf-convert-endian.py) — Allows converting the endianness of GGUF files.
## Development
Maintainers who participate in development of this package are advised to install it in editable mode:
```sh
cd /path/to/llama.cpp/gguf-py
pip install --editable .
```
**Note**: This may require upgrading your Pip installation, with a message saying that editable installation currently requires `setup.py`.
In this case, upgrade Pip to the latest:
```sh
pip install --upgrade pip
```
## Automatic publishing with CI
There's a GitHub workflow to make a release automatically upon creation of tags in a specified format.
1. Bump the version in `pyproject.toml`.
2. Create a tag named `gguf-vx.x.x` where `x.x.x` is the semantic version number.
```sh
git tag -a gguf-v1.0.0 -m "Version 1.0 release"
```
3. Push the tags.
```sh
git push origin --tags
```
## Manual publishing
If you want to publish the package manually for any reason, you need to have `twine` and `build` installed:
```sh
pip install build twine
```
Then, follow these steps to release a new version:
1. Bump the version in `pyproject.toml`.
2. Build the package:
```sh
python -m build
```
3. Upload the generated distribution archives:
```sh
python -m twine upload dist/*
```
## TODO
- [ ] Add tests
- [ ] Include conversion scripts as command line entry points in this package.

View File

@ -1,40 +0,0 @@
#!/usr/bin/env python3
import sys
from pathlib import Path
import numpy as np
# Necessary to load the local gguf package
sys.path.insert(0, str(Path(__file__).parent.parent))
from gguf import GGUFWriter # noqa: E402
# Example usage:
def writer_example() -> None:
# Example usage with a file
gguf_writer = GGUFWriter("example.gguf", "llama")
gguf_writer.add_architecture()
gguf_writer.add_block_count(12)
gguf_writer.add_uint32("answer", 42) # Write a 32-bit integer
gguf_writer.add_float32("answer_in_float", 42.0) # Write a 32-bit float
gguf_writer.add_custom_alignment(64)
tensor1 = np.ones((32,), dtype=np.float32) * 100.0
tensor2 = np.ones((64,), dtype=np.float32) * 101.0
tensor3 = np.ones((96,), dtype=np.float32) * 102.0
gguf_writer.add_tensor("tensor1", tensor1)
gguf_writer.add_tensor("tensor2", tensor2)
gguf_writer.add_tensor("tensor3", tensor3)
gguf_writer.write_header_to_file()
gguf_writer.write_kv_data_to_file()
gguf_writer.write_tensors_to_file()
gguf_writer.close()
if __name__ == '__main__':
writer_example()

View File

@ -1,5 +0,0 @@
from .constants import *
from .gguf_reader import *
from .gguf_writer import *
from .tensor_mapping import *
from .vocab import *

View File

@ -1,665 +0,0 @@
from __future__ import annotations
import sys
from enum import Enum, IntEnum, auto
from typing import Any
#
# constants
#
GGUF_MAGIC = 0x46554747 # "GGUF"
GGUF_VERSION = 3
GGUF_DEFAULT_ALIGNMENT = 32
#
# metadata keys
#
class Keys:
class General:
ARCHITECTURE = "general.architecture"
QUANTIZATION_VERSION = "general.quantization_version"
ALIGNMENT = "general.alignment"
NAME = "general.name"
AUTHOR = "general.author"
URL = "general.url"
DESCRIPTION = "general.description"
LICENSE = "general.license"
SOURCE_URL = "general.source.url"
SOURCE_HF_REPO = "general.source.huggingface.repository"
FILE_TYPE = "general.file_type"
class LLM:
CONTEXT_LENGTH = "{arch}.context_length"
EMBEDDING_LENGTH = "{arch}.embedding_length"
BLOCK_COUNT = "{arch}.block_count"
FEED_FORWARD_LENGTH = "{arch}.feed_forward_length"
USE_PARALLEL_RESIDUAL = "{arch}.use_parallel_residual"
TENSOR_DATA_LAYOUT = "{arch}.tensor_data_layout"
EXPERT_COUNT = "{arch}.expert_count"
EXPERT_USED_COUNT = "{arch}.expert_used_count"
class Attention:
HEAD_COUNT = "{arch}.attention.head_count"
HEAD_COUNT_KV = "{arch}.attention.head_count_kv"
MAX_ALIBI_BIAS = "{arch}.attention.max_alibi_bias"
CLAMP_KQV = "{arch}.attention.clamp_kqv"
KEY_LENGTH = "{arch}.attention.key_length"
VALUE_LENGTH = "{arch}.attention.value_length"
LAYERNORM_EPS = "{arch}.attention.layer_norm_epsilon"
LAYERNORM_RMS_EPS = "{arch}.attention.layer_norm_rms_epsilon"
class Rope:
DIMENSION_COUNT = "{arch}.rope.dimension_count"
FREQ_BASE = "{arch}.rope.freq_base"
SCALING_TYPE = "{arch}.rope.scaling.type"
SCALING_FACTOR = "{arch}.rope.scaling.factor"
SCALING_ORIG_CTX_LEN = "{arch}.rope.scaling.original_context_length"
SCALING_FINETUNED = "{arch}.rope.scaling.finetuned"
class Tokenizer:
MODEL = "tokenizer.ggml.model"
LIST = "tokenizer.ggml.tokens"
TOKEN_TYPE = "tokenizer.ggml.token_type"
SCORES = "tokenizer.ggml.scores"
MERGES = "tokenizer.ggml.merges"
BOS_ID = "tokenizer.ggml.bos_token_id"
EOS_ID = "tokenizer.ggml.eos_token_id"
UNK_ID = "tokenizer.ggml.unknown_token_id"
SEP_ID = "tokenizer.ggml.seperator_token_id"
PAD_ID = "tokenizer.ggml.padding_token_id"
ADD_BOS = "tokenizer.ggml.add_bos_token"
ADD_EOS = "tokenizer.ggml.add_eos_token"
ADD_PREFIX = "tokenizer.ggml.add_space_prefix"
HF_JSON = "tokenizer.huggingface.json"
RWKV = "tokenizer.rwkv.world"
CHAT_TEMPLATE = "tokenizer.chat_template"
#
# recommended mapping of model tensor names for storage in gguf
#
class MODEL_ARCH(IntEnum):
LLAMA = auto()
FALCON = auto()
BAICHUAN = auto()
GPT2 = auto()
GPTJ = auto()
GPTNEOX = auto()
MPT = auto()
STARCODER = auto()
PERSIMMON = auto()
REFACT = auto()
BERT = auto()
BLOOM = auto()
STABLELM = auto()
QWEN = auto()
QWEN2 = auto()
PHI2 = auto()
PLAMO = auto()
CODESHELL = auto()
ORION = auto()
INTERNLM2 = auto()
MINICPM = auto()
class MODEL_TENSOR(IntEnum):
TOKEN_EMBD = auto()
TOKEN_EMBD_NORM = auto()
TOKEN_TYPES = auto()
POS_EMBD = auto()
OUTPUT = auto()
OUTPUT_NORM = auto()
ROPE_FREQS = auto()
ATTN_Q = auto()
ATTN_K = auto()
ATTN_V = auto()
ATTN_QKV = auto()
ATTN_OUT = auto()
ATTN_NORM = auto()
ATTN_NORM_2 = auto()
ATTN_ROT_EMBD = auto()
FFN_GATE_INP = auto()
FFN_NORM = auto()
FFN_GATE = auto()
FFN_DOWN = auto()
FFN_UP = auto()
FFN_ACT = auto()
FFN_GATE_EXP = auto()
FFN_DOWN_EXP = auto()
FFN_UP_EXP = auto()
ATTN_Q_NORM = auto()
ATTN_K_NORM = auto()
MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
MODEL_ARCH.LLAMA: "llama",
MODEL_ARCH.FALCON: "falcon",
MODEL_ARCH.BAICHUAN: "baichuan",
MODEL_ARCH.GPT2: "gpt2",
MODEL_ARCH.GPTJ: "gptj",
MODEL_ARCH.GPTNEOX: "gptneox",
MODEL_ARCH.MPT: "mpt",
MODEL_ARCH.STARCODER: "starcoder",
MODEL_ARCH.PERSIMMON: "persimmon",
MODEL_ARCH.REFACT: "refact",
MODEL_ARCH.BERT: "bert",
MODEL_ARCH.BLOOM: "bloom",
MODEL_ARCH.STABLELM: "stablelm",
MODEL_ARCH.QWEN: "qwen",
MODEL_ARCH.QWEN2: "qwen2",
MODEL_ARCH.PHI2: "phi2",
MODEL_ARCH.PLAMO: "plamo",
MODEL_ARCH.CODESHELL: "codeshell",
MODEL_ARCH.ORION: "orion",
MODEL_ARCH.INTERNLM2: "internlm2",
MODEL_ARCH.MINICPM: "minicpm",
}
TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.TOKEN_EMBD: "token_embd",
MODEL_TENSOR.TOKEN_EMBD_NORM: "token_embd_norm",
MODEL_TENSOR.TOKEN_TYPES: "token_types",
MODEL_TENSOR.POS_EMBD: "position_embd",
MODEL_TENSOR.OUTPUT_NORM: "output_norm",
MODEL_TENSOR.OUTPUT: "output",
MODEL_TENSOR.ROPE_FREQS: "rope_freqs",
MODEL_TENSOR.ATTN_NORM: "blk.{bid}.attn_norm",
MODEL_TENSOR.ATTN_NORM_2: "blk.{bid}.attn_norm_2",
MODEL_TENSOR.ATTN_QKV: "blk.{bid}.attn_qkv",
MODEL_TENSOR.ATTN_Q: "blk.{bid}.attn_q",
MODEL_TENSOR.ATTN_K: "blk.{bid}.attn_k",
MODEL_TENSOR.ATTN_V: "blk.{bid}.attn_v",
MODEL_TENSOR.ATTN_OUT: "blk.{bid}.attn_output",
MODEL_TENSOR.ATTN_ROT_EMBD: "blk.{bid}.attn_rot_embd",
MODEL_TENSOR.ATTN_Q_NORM: "blk.{bid}.attn_q_norm",
MODEL_TENSOR.ATTN_K_NORM: "blk.{bid}.attn_k_norm",
MODEL_TENSOR.FFN_GATE_INP: "blk.{bid}.ffn_gate_inp",
MODEL_TENSOR.FFN_NORM: "blk.{bid}.ffn_norm",
MODEL_TENSOR.FFN_GATE: "blk.{bid}.ffn_gate",
MODEL_TENSOR.FFN_DOWN: "blk.{bid}.ffn_down",
MODEL_TENSOR.FFN_UP: "blk.{bid}.ffn_up",
MODEL_TENSOR.FFN_ACT: "blk.{bid}.ffn",
MODEL_TENSOR.FFN_GATE_EXP: "blk.{bid}.ffn_gate.{xid}",
MODEL_TENSOR.FFN_DOWN_EXP: "blk.{bid}.ffn_down.{xid}",
MODEL_TENSOR.FFN_UP_EXP: "blk.{bid}.ffn_up.{xid}",
}
MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_ARCH.LLAMA: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_GATE_INP,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
MODEL_TENSOR.FFN_GATE_EXP,
MODEL_TENSOR.FFN_DOWN_EXP,
MODEL_TENSOR.FFN_UP_EXP,
],
MODEL_ARCH.GPTNEOX: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.FALCON: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_NORM_2,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.BAICHUAN: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.STARCODER: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.POS_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.BERT: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.TOKEN_TYPES,
MODEL_TENSOR.POS_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.MPT: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
MODEL_TENSOR.FFN_ACT,
],
MODEL_ARCH.GPTJ: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.PERSIMMON: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
MODEL_TENSOR.ATTN_Q_NORM,
MODEL_TENSOR.ATTN_K_NORM,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.REFACT: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.BLOOM: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.TOKEN_EMBD_NORM,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.STABLELM: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.QWEN: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.QWEN2: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.PLAMO: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.GPT2: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.POS_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.PHI2: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.CODESHELL: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.POS_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.ORION: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.INTERNLM2: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.MINICPM: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,
MODEL_TENSOR.FFN_GATE_INP,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
MODEL_TENSOR.FFN_GATE_EXP,
MODEL_TENSOR.FFN_DOWN_EXP,
MODEL_TENSOR.FFN_UP_EXP,
],
# TODO
}
# tensors that will not be serialized
MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_ARCH.LLAMA: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.BAICHUAN: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.PERSIMMON: [
MODEL_TENSOR.ROPE_FREQS,
],
MODEL_ARCH.QWEN: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.CODESHELL: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
MODEL_ARCH.ORION: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],
}
#
# types
#
class TokenType(IntEnum):
NORMAL = 1
UNKNOWN = 2
CONTROL = 3
USER_DEFINED = 4
UNUSED = 5
BYTE = 6
class RopeScalingType(Enum):
NONE = 'none'
LINEAR = 'linear'
YARN = 'yarn'
class GGMLQuantizationType(IntEnum):
F32 = 0
F16 = 1
Q4_0 = 2
Q4_1 = 3
Q5_0 = 6
Q5_1 = 7
Q8_0 = 8
Q8_1 = 9
Q2_K = 10
Q3_K = 11
Q4_K = 12
Q5_K = 13
Q6_K = 14
Q8_K = 15
class GGUFEndian(IntEnum):
LITTLE = 0
BIG = 1
class GGUFValueType(IntEnum):
UINT8 = 0
INT8 = 1
UINT16 = 2
INT16 = 3
UINT32 = 4
INT32 = 5
FLOAT32 = 6
BOOL = 7
STRING = 8
ARRAY = 9
UINT64 = 10
INT64 = 11
FLOAT64 = 12
@staticmethod
def get_type(val: Any) -> GGUFValueType:
if isinstance(val, (str, bytes, bytearray)):
return GGUFValueType.STRING
elif isinstance(val, list):
return GGUFValueType.ARRAY
elif isinstance(val, float):
return GGUFValueType.FLOAT32
elif isinstance(val, bool):
return GGUFValueType.BOOL
elif isinstance(val, int):
return GGUFValueType.INT32
# TODO: need help with 64-bit types in Python
else:
print("Unknown type:", type(val))
sys.exit()
# Note: Does not support GGML_QKK_64
QK_K = 256
# Items here are (block size, type size)
GGML_QUANT_SIZES = {
GGMLQuantizationType.F32: (1, 4),
GGMLQuantizationType.F16: (1, 2),
GGMLQuantizationType.Q4_0: (32, 2 + 16),
GGMLQuantizationType.Q4_1: (32, 2 + 2 + 16),
GGMLQuantizationType.Q5_0: (32, 2 + 4 + 16),
GGMLQuantizationType.Q5_1: (32, 2 + 2 + 4 + 16),
GGMLQuantizationType.Q8_0: (32, 2 + 32),
GGMLQuantizationType.Q8_1: (32, 4 + 4 + 32),
GGMLQuantizationType.Q2_K: (256, 2 + 2 + QK_K // 16 + QK_K // 4),
GGMLQuantizationType.Q3_K: (256, 2 + QK_K // 4 + QK_K // 8 + 12),
GGMLQuantizationType.Q4_K: (256, 2 + 2 + QK_K // 2 + 12),
GGMLQuantizationType.Q5_K: (256, 2 + 2 + QK_K // 2 + QK_K // 8 + 12),
GGMLQuantizationType.Q6_K: (256, 2 + QK_K // 2 + QK_K // 4 + QK_K // 16),
GGMLQuantizationType.Q8_K: (256, 4 + QK_K + QK_K // 8),
}
# Aliases for backward compatibility.
# general
KEY_GENERAL_ARCHITECTURE = Keys.General.ARCHITECTURE
KEY_GENERAL_QUANTIZATION_VERSION = Keys.General.QUANTIZATION_VERSION
KEY_GENERAL_ALIGNMENT = Keys.General.ALIGNMENT
KEY_GENERAL_NAME = Keys.General.NAME
KEY_GENERAL_AUTHOR = Keys.General.AUTHOR
KEY_GENERAL_URL = Keys.General.URL
KEY_GENERAL_DESCRIPTION = Keys.General.DESCRIPTION
KEY_GENERAL_LICENSE = Keys.General.LICENSE
KEY_GENERAL_SOURCE_URL = Keys.General.SOURCE_URL
KEY_GENERAL_SOURCE_HF_REPO = Keys.General.SOURCE_HF_REPO
KEY_GENERAL_FILE_TYPE = Keys.General.FILE_TYPE
# LLM
KEY_CONTEXT_LENGTH = Keys.LLM.CONTEXT_LENGTH
KEY_EMBEDDING_LENGTH = Keys.LLM.EMBEDDING_LENGTH
KEY_BLOCK_COUNT = Keys.LLM.BLOCK_COUNT
KEY_FEED_FORWARD_LENGTH = Keys.LLM.FEED_FORWARD_LENGTH
KEY_USE_PARALLEL_RESIDUAL = Keys.LLM.USE_PARALLEL_RESIDUAL
KEY_TENSOR_DATA_LAYOUT = Keys.LLM.TENSOR_DATA_LAYOUT
# attention
KEY_ATTENTION_HEAD_COUNT = Keys.Attention.HEAD_COUNT
KEY_ATTENTION_HEAD_COUNT_KV = Keys.Attention.HEAD_COUNT_KV
KEY_ATTENTION_MAX_ALIBI_BIAS = Keys.Attention.MAX_ALIBI_BIAS
KEY_ATTENTION_CLAMP_KQV = Keys.Attention.CLAMP_KQV
KEY_ATTENTION_LAYERNORM_EPS = Keys.Attention.LAYERNORM_EPS
KEY_ATTENTION_LAYERNORM_RMS_EPS = Keys.Attention.LAYERNORM_RMS_EPS
# RoPE
KEY_ROPE_DIMENSION_COUNT = Keys.Rope.DIMENSION_COUNT
KEY_ROPE_FREQ_BASE = Keys.Rope.FREQ_BASE
KEY_ROPE_SCALING_TYPE = Keys.Rope.SCALING_TYPE
KEY_ROPE_SCALING_FACTOR = Keys.Rope.SCALING_FACTOR
KEY_ROPE_SCALING_ORIG_CTX_LEN = Keys.Rope.SCALING_ORIG_CTX_LEN
KEY_ROPE_SCALING_FINETUNED = Keys.Rope.SCALING_FINETUNED
# tokenization
KEY_TOKENIZER_MODEL = Keys.Tokenizer.MODEL
KEY_TOKENIZER_LIST = Keys.Tokenizer.LIST
KEY_TOKENIZER_TOKEN_TYPE = Keys.Tokenizer.TOKEN_TYPE
KEY_TOKENIZER_SCORES = Keys.Tokenizer.SCORES
KEY_TOKENIZER_MERGES = Keys.Tokenizer.MERGES
KEY_TOKENIZER_BOS_ID = Keys.Tokenizer.BOS_ID
KEY_TOKENIZER_EOS_ID = Keys.Tokenizer.EOS_ID
KEY_TOKENIZER_UNK_ID = Keys.Tokenizer.UNK_ID
KEY_TOKENIZER_SEP_ID = Keys.Tokenizer.SEP_ID
KEY_TOKENIZER_PAD_ID = Keys.Tokenizer.PAD_ID
KEY_TOKENIZER_HF_JSON = Keys.Tokenizer.HF_JSON
KEY_TOKENIZER_RWKV = Keys.Tokenizer.RWKV
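
For reference, the `(block size, type size)` pairs in `GGML_QUANT_SIZES` above are exactly what the reader further down uses to turn a tensor's element count into bytes (`n_bytes = n_elems * type_size // block_size`). A small worked sketch of that arithmetic, with values copied from the table; the helper itself is illustrative:

```ts
// Illustrative arithmetic only, using values from GGML_QUANT_SIZES above.
// Q4_K: block size 256, type size 2 + 2 + 256/2 + 12 = 144 bytes per block.
const QUANT_SIZES: Record<string, [number, number]> = {
  F32: [1, 4],
  F16: [1, 2],
  Q8_0: [32, 34],
  Q4_K: [256, 144],
}

function tensorByteSize(quant: string, nElements: number): number {
  const [blockSize, typeSize] = QUANT_SIZES[quant]
  // Same formula the GGUF reader below applies per tensor.
  return Math.floor((nElements * typeSize) / blockSize)
}

// A 4096 x 4096 weight stored as Q4_K takes exactly 9 MiB:
console.log(tensorByteSize('Q4_K', 4096 * 4096)) // 9437184
```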

View File

@ -1,15 +0,0 @@
# This file left for compatibility. If you want to use the GGUF API from Python
# then don't import gguf/gguf.py directly. If you're looking for examples, see the
# examples/ directory for gguf-py
import importlib
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
# Compatibility for people trying to import gguf/gguf.py directly instead of as a package.
importlib.invalidate_caches()
import gguf # noqa: E402
importlib.reload(gguf)

View File

@ -1,264 +0,0 @@
#
# GGUF file reading/modification support. For API usage information,
# please see the files scripts/ for some fairly simple examples.
#
from __future__ import annotations
import os
from collections import OrderedDict
from typing import Any, Literal, NamedTuple, TypeVar, Union
import numpy as np
import numpy.typing as npt
if __name__ == "__main__":
import sys
from pathlib import Path
# Allow running file in package as a script.
sys.path.insert(0, str(Path(__file__).parent.parent))
from gguf.constants import (
GGML_QUANT_SIZES,
GGUF_DEFAULT_ALIGNMENT,
GGUF_MAGIC,
GGUF_VERSION,
GGMLQuantizationType,
GGUFValueType,
)
READER_SUPPORTED_VERSIONS = [2, GGUF_VERSION]
class ReaderField(NamedTuple):
# Offset to start of this field.
offset: int
# Name of the field (not necessarily from file data).
name: str
# Data parts. Some types have multiple components, such as strings
# that consist of a length followed by the string data.
parts: list[npt.NDArray[Any]] = []
# Indexes into parts that we can call the actual data. For example
# an array of strings will be populated with indexes to the actual
# string data.
data: list[int] = [-1]
types: list[GGUFValueType] = []
class ReaderTensor(NamedTuple):
name: str
tensor_type: GGMLQuantizationType
shape: npt.NDArray[np.uint32]
n_elements: int
n_bytes: int
data_offset: int
data: npt.NDArray[Any]
field: ReaderField
class GGUFReader:
# I - same as host, S - swapped
byte_order: Literal['I' | 'S'] = 'I'
alignment: int = GGUF_DEFAULT_ALIGNMENT
# Note: Internal helper, API may change.
gguf_scalar_to_np: dict[GGUFValueType, type[np.generic]] = {
GGUFValueType.UINT8: np.uint8,
GGUFValueType.INT8: np.int8,
GGUFValueType.UINT16: np.uint16,
GGUFValueType.INT16: np.int16,
GGUFValueType.UINT32: np.uint32,
GGUFValueType.INT32: np.int32,
GGUFValueType.FLOAT32: np.float32,
GGUFValueType.UINT64: np.uint64,
GGUFValueType.INT64: np.int64,
GGUFValueType.FLOAT64: np.float64,
GGUFValueType.BOOL: np.bool_,
}
def __init__(self, path: os.PathLike[str] | str, mode: Literal['r' | 'r+' | 'c'] = 'r'):
self.data = np.memmap(path, mode = mode)
offs = 0
if self._get(offs, np.uint32, override_order = '<')[0] != GGUF_MAGIC:
raise ValueError('GGUF magic invalid')
offs += 4
temp_version = self._get(offs, np.uint32)
if temp_version[0] & 65535 == 0:
# If we get 0 here that means it's (probably) a GGUF file created for
# the opposite byte order of the machine this script is running on.
self.byte_order = 'S'
temp_version = temp_version.newbyteorder(self.byte_order)
version = temp_version[0]
if version not in READER_SUPPORTED_VERSIONS:
raise ValueError(f'Sorry, file appears to be version {version} which we cannot handle')
self.fields: OrderedDict[str, ReaderField] = OrderedDict()
self.tensors: list[ReaderTensor] = []
offs += self._push_field(ReaderField(offs, 'GGUF.version', [temp_version], [0], [GGUFValueType.UINT32]))
temp_counts = self._get(offs, np.uint64, 2)
offs += self._push_field(ReaderField(offs, 'GGUF.tensor_count', [temp_counts[:1]], [0], [GGUFValueType.UINT64]))
offs += self._push_field(ReaderField(offs, 'GGUF.kv_count', [temp_counts[1:]], [0], [GGUFValueType.UINT64]))
tensor_count, kv_count = temp_counts
offs = self._build_fields(offs, kv_count)
offs, tensors_fields = self._build_tensors_fields(offs, tensor_count)
new_align = self.fields.get('general.alignment')
if new_align is not None:
if new_align.types != [GGUFValueType.UINT32]:
raise ValueError('Bad type for general.alignment field')
self.alignment = new_align.parts[-1][0]
padding = offs % self.alignment
if padding != 0:
offs += self.alignment - padding
self._build_tensors(offs, tensors_fields)
_DT = TypeVar('_DT', bound = npt.DTypeLike)
# Fetch a key/value metadata field by key.
def get_field(self, key: str) -> Union[ReaderField, None]:
return self.fields.get(key, None)
# Fetch a tensor from the list by index.
def get_tensor(self, idx: int) -> ReaderTensor:
return self.tensors[idx]
def _get(
self, offset: int, dtype: npt.DTypeLike, count: int = 1, override_order: None | Literal['I' | 'S' | '<'] = None,
) -> npt.NDArray[Any]:
count = int(count)
itemsize = int(np.empty([], dtype = dtype).itemsize)
end_offs = offset + itemsize * count
return (
self.data[offset:end_offs]
.view(dtype = dtype)[:count]
.newbyteorder(override_order or self.byte_order)
)
def _push_field(self, field: ReaderField, skip_sum: bool = False) -> int:
if field.name in self.fields:
raise KeyError(f'Duplicate {field.name} already in list at offset {field.offset}')
self.fields[field.name] = field
return 0 if skip_sum else sum(int(part.nbytes) for part in field.parts)
def _get_str(self, offset: int) -> tuple[npt.NDArray[np.uint64], npt.NDArray[np.uint8]]:
slen = self._get(offset, np.uint64)
return slen, self._get(offset + 8, np.uint8, slen[0])
def _get_field_parts(
self, orig_offs: int, raw_type: int,
) -> tuple[int, list[npt.NDArray[Any]], list[int], list[GGUFValueType]]:
offs = orig_offs
types: list[GGUFValueType] = []
gtype = GGUFValueType(raw_type)
types.append(gtype)
# Handle strings.
if gtype == GGUFValueType.STRING:
sparts: list[npt.NDArray[Any]] = list(self._get_str(offs))
size = sum(int(part.nbytes) for part in sparts)
return size, sparts, [1], types
# Check if it's a simple scalar type.
nptype = self.gguf_scalar_to_np.get(gtype)
if nptype is not None:
val = self._get(offs, nptype)
return int(val.nbytes), [val], [0], types
# Handle arrays.
if gtype == GGUFValueType.ARRAY:
raw_itype = self._get(offs, np.uint32)
offs += int(raw_itype.nbytes)
alen = self._get(offs, np.uint64)
offs += int(alen.nbytes)
aparts: list[npt.NDArray[Any]] = [raw_itype, alen]
data_idxs: list[int] = []
for idx in range(alen[0]):
curr_size, curr_parts, curr_idxs, curr_types = self._get_field_parts(offs, raw_itype[0])
if idx == 0:
types += curr_types
idxs_offs = len(aparts)
aparts += curr_parts
data_idxs += (idx + idxs_offs for idx in curr_idxs)
offs += curr_size
return offs - orig_offs, aparts, data_idxs, types
# We can't deal with this one.
raise ValueError('Unknown/unhandled field type {gtype}')
def _get_tensor(self, orig_offs: int) -> ReaderField:
offs = orig_offs
name_len, name_data = self._get_str(offs)
offs += int(name_len.nbytes + name_data.nbytes)
n_dims = self._get(offs, np.uint32)
offs += int(n_dims.nbytes)
dims = self._get(offs, np.uint64, n_dims[0])
offs += int(dims.nbytes)
raw_dtype = self._get(offs, np.uint32)
offs += int(raw_dtype.nbytes)
offset_tensor = self._get(offs, np.uint64)
offs += int(offset_tensor.nbytes)
return ReaderField(
orig_offs,
str(bytes(name_data), encoding = 'utf-8'),
[name_len, name_data, n_dims, dims, raw_dtype, offset_tensor],
[1, 3, 4, 5],
)
def _build_fields(self, offs: int, count: int) -> int:
for _ in range(count):
orig_offs = offs
kv_klen, kv_kdata = self._get_str(offs)
offs += int(kv_klen.nbytes + kv_kdata.nbytes)
raw_kv_type = self._get(offs, np.uint32)
offs += int(raw_kv_type.nbytes)
parts: list[npt.NDArray[Any]] = [kv_klen, kv_kdata, raw_kv_type]
idxs_offs = len(parts)
field_size, field_parts, field_idxs, field_types = self._get_field_parts(offs, raw_kv_type[0])
parts += field_parts
self._push_field(ReaderField(
orig_offs,
str(bytes(kv_kdata), encoding = 'utf-8'),
parts,
[idx + idxs_offs for idx in field_idxs],
field_types,
), skip_sum = True)
offs += field_size
return offs
def _build_tensors_fields(self, offs: int, count: int) -> tuple[int, list[ReaderField]]:
tensor_fields = []
for _ in range(count):
field = self._get_tensor(offs)
offs += sum(int(part.nbytes) for part in field.parts)
tensor_fields.append(field)
return offs, tensor_fields
def _build_tensors(self, start_offs: int, fields: list[ReaderField]) -> None:
tensors = []
for field in fields:
_name_len, name_data, _n_dims, dims, raw_dtype, offset_tensor = field.parts
ggml_type = GGMLQuantizationType(raw_dtype[0])
n_elems = np.prod(dims)
block_size, type_size = GGML_QUANT_SIZES[ggml_type]
n_bytes = n_elems * type_size // block_size
data_offs = int(start_offs + offset_tensor[0])
item_type: npt.DTypeLike
if ggml_type == GGMLQuantizationType.F32:
item_count = n_elems
item_type = np.float32
elif ggml_type == GGMLQuantizationType.F16:
item_count = n_elems
item_type = np.float16
else:
item_count = n_bytes
item_type = np.uint8
tensors.append(ReaderTensor(
name = str(bytes(name_data), encoding = 'utf-8'),
tensor_type = ggml_type,
shape = dims,
n_elements = n_elems,
n_bytes = n_bytes,
data_offset = data_offs,
data = self._get(data_offs, item_type, item_count),
field = field,
))
self.tensors = tensors

View File

@ -1,427 +0,0 @@
from __future__ import annotations
import os
import shutil
import struct
import tempfile
from enum import Enum, auto
from io import BufferedWriter
from typing import IO, Any, Sequence
import numpy as np
from .constants import (
GGUF_DEFAULT_ALIGNMENT,
GGUF_MAGIC,
GGUF_VERSION,
GGMLQuantizationType,
GGUFEndian,
GGUFValueType,
Keys,
RopeScalingType,
TokenType,
)
class WriterState(Enum):
EMPTY = auto()
HEADER = auto()
KV_DATA = auto()
TI_DATA = auto()
class GGUFWriter:
fout: BufferedWriter
temp_file: tempfile.SpooledTemporaryFile[bytes] | None
tensors: list[np.ndarray[Any, Any]]
_simple_value_packing = {
GGUFValueType.UINT8: "B",
GGUFValueType.INT8: "b",
GGUFValueType.UINT16: "H",
GGUFValueType.INT16: "h",
GGUFValueType.UINT32: "I",
GGUFValueType.INT32: "i",
GGUFValueType.FLOAT32: "f",
GGUFValueType.UINT64: "Q",
GGUFValueType.INT64: "q",
GGUFValueType.FLOAT64: "d",
GGUFValueType.BOOL: "?",
}
def __init__(
self, path: os.PathLike[str] | str, arch: str, use_temp_file: bool = True,
endianess: GGUFEndian = GGUFEndian.LITTLE,
):
self.fout = open(path, "wb")
self.arch = arch
self.endianess = endianess
self.offset_tensor = 0
self.data_alignment = GGUF_DEFAULT_ALIGNMENT
self.kv_data = bytearray()
self.kv_data_count = 0
self.ti_data = bytearray()
self.ti_data_count = 0
self.use_temp_file = use_temp_file
self.temp_file = None
self.tensors = []
print("gguf: This GGUF file is for {0} Endian only".format(
"Big" if self.endianess == GGUFEndian.BIG else "Little",
))
self.state = WriterState.EMPTY
self.add_architecture()
def write_header_to_file(self) -> None:
if self.state is not WriterState.EMPTY:
raise ValueError(f'Expected output file to be empty, got {self.state}')
self._write_packed("<I", GGUF_MAGIC, skip_pack_prefix = True)
self._write_packed("I", GGUF_VERSION)
self._write_packed("Q", self.ti_data_count)
self._write_packed("Q", self.kv_data_count)
self.flush()
self.state = WriterState.HEADER
def write_kv_data_to_file(self) -> None:
if self.state is not WriterState.HEADER:
raise ValueError(f'Expected output file to contain the header, got {self.state}')
self.fout.write(self.kv_data)
self.flush()
self.state = WriterState.KV_DATA
def write_ti_data_to_file(self) -> None:
if self.state is not WriterState.KV_DATA:
raise ValueError(f'Expected output file to contain KV data, got {self.state}')
self.fout.write(self.ti_data)
self.flush()
self.state = WriterState.TI_DATA
def add_key(self, key: str) -> None:
self.add_val(key, GGUFValueType.STRING, add_vtype=False)
def add_uint8(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.UINT8)
def add_int8(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.INT8)
def add_uint16(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.UINT16)
def add_int16(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.INT16)
def add_uint32(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.UINT32)
def add_int32(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.INT32)
def add_float32(self, key: str, val: float) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.FLOAT32)
def add_uint64(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.UINT64)
def add_int64(self, key: str, val: int) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.INT64)
def add_float64(self, key: str, val: float) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.FLOAT64)
def add_bool(self, key: str, val: bool) -> None:
self.add_key(key)
self.add_val(val, GGUFValueType.BOOL)
def add_string(self, key: str, val: str) -> None:
if not val:
return
self.add_key(key)
self.add_val(val, GGUFValueType.STRING)
def add_array(self, key: str, val: Sequence[Any]) -> None:
if not isinstance(val, Sequence):
raise ValueError("Value must be a sequence for array type")
self.add_key(key)
self.add_val(val, GGUFValueType.ARRAY)
def add_val(self, val: Any, vtype: GGUFValueType | None = None, add_vtype: bool = True) -> None:
if vtype is None:
vtype = GGUFValueType.get_type(val)
if add_vtype:
self.kv_data += self._pack("I", vtype)
self.kv_data_count += 1
pack_fmt = self._simple_value_packing.get(vtype)
if pack_fmt is not None:
self.kv_data += self._pack(pack_fmt, val, skip_pack_prefix = vtype == GGUFValueType.BOOL)
elif vtype == GGUFValueType.STRING:
encoded_val = val.encode("utf8") if isinstance(val, str) else val
self.kv_data += self._pack("Q", len(encoded_val))
self.kv_data += encoded_val
elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and val:
ltype = GGUFValueType.get_type(val[0])
if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
raise ValueError("All items in a GGUF array should be of the same type")
self.kv_data += self._pack("I", ltype)
self.kv_data += self._pack("Q", len(val))
for item in val:
self.add_val(item, add_vtype=False)
else:
raise ValueError("Invalid GGUF metadata value type or value")
@staticmethod
def ggml_pad(x: int, n: int) -> int:
return ((x + n - 1) // n) * n
def add_tensor_info(
self, name: str, tensor_shape: Sequence[int], tensor_dtype: np.dtype[np.float16] | np.dtype[np.float32],
tensor_nbytes: int, raw_dtype: GGMLQuantizationType | None = None,
) -> None:
if self.state is not WriterState.EMPTY:
raise ValueError(f'Expected output file to be empty, got {self.state}')
if raw_dtype is None and tensor_dtype not in (np.float32, np.float16):
raise ValueError("Only F32 and F16 tensors are supported for now")
encoded_name = name.encode("utf8")
self.ti_data += self._pack("Q", len(encoded_name))
self.ti_data += encoded_name
n_dims = len(tensor_shape)
self.ti_data += self._pack("I", n_dims)
for i in range(n_dims):
self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])
if raw_dtype is None:
dtype = GGMLQuantizationType.F32 if tensor_dtype == np.float32 else GGMLQuantizationType.F16
else:
dtype = raw_dtype
self.ti_data += self._pack("I", dtype)
self.ti_data += self._pack("Q", self.offset_tensor)
self.offset_tensor += GGUFWriter.ggml_pad(tensor_nbytes, self.data_alignment)
self.ti_data_count += 1
def add_tensor(
self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None,
raw_dtype: GGMLQuantizationType | None = None,
) -> None:
if self.endianess == GGUFEndian.BIG:
tensor.byteswap(inplace=True)
if self.use_temp_file and self.temp_file is None:
fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256 * 1024 * 1024)
fp.seek(0)
self.temp_file = fp
shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)
if self.temp_file is None:
self.tensors.append(tensor)
return
tensor.tofile(self.temp_file)
self.write_padding(self.temp_file, tensor.nbytes)
def write_padding(self, fp: IO[bytes], n: int, align: int | None = None) -> None:
pad = GGUFWriter.ggml_pad(n, align if align is not None else self.data_alignment) - n
if pad != 0:
fp.write(bytes([0] * pad))
def write_tensor_data(self, tensor: np.ndarray[Any, Any]) -> None:
if self.state is not WriterState.TI_DATA:
raise ValueError(f'Expected output file to contain tensor info, got {self.state}')
if self.endianess == GGUFEndian.BIG:
tensor.byteswap(inplace=True)
self.write_padding(self.fout, self.fout.tell())
tensor.tofile(self.fout)
self.write_padding(self.fout, tensor.nbytes)
def write_tensors_to_file(self) -> None:
self.write_ti_data_to_file()
self.write_padding(self.fout, self.fout.tell())
if self.temp_file is None:
while True:
try:
tensor = self.tensors.pop(0)
except IndexError:
break
tensor.tofile(self.fout)
self.write_padding(self.fout, tensor.nbytes)
return
self.temp_file.seek(0)
shutil.copyfileobj(self.temp_file, self.fout)
self.flush()
self.temp_file.close()
def flush(self) -> None:
self.fout.flush()
def close(self) -> None:
self.fout.close()
def add_architecture(self) -> None:
self.add_string(Keys.General.ARCHITECTURE, self.arch)
def add_author(self, author: str) -> None:
self.add_string(Keys.General.AUTHOR, author)
def add_tensor_data_layout(self, layout: str) -> None:
self.add_string(Keys.LLM.TENSOR_DATA_LAYOUT.format(arch=self.arch), layout)
def add_url(self, url: str) -> None:
self.add_string(Keys.General.URL, url)
def add_description(self, description: str) -> None:
self.add_string(Keys.General.DESCRIPTION, description)
def add_source_url(self, url: str) -> None:
self.add_string(Keys.General.SOURCE_URL, url)
def add_source_hf_repo(self, repo: str) -> None:
self.add_string(Keys.General.SOURCE_HF_REPO, repo)
def add_file_type(self, ftype: int) -> None:
self.add_uint32(Keys.General.FILE_TYPE, ftype)
def add_name(self, name: str) -> None:
self.add_string(Keys.General.NAME, name)
def add_quantization_version(self, quantization_version: GGMLQuantizationType) -> None:
self.add_uint32(
Keys.General.QUANTIZATION_VERSION, quantization_version)
def add_custom_alignment(self, alignment: int) -> None:
self.data_alignment = alignment
self.add_uint32(Keys.General.ALIGNMENT, alignment)
def add_context_length(self, length: int) -> None:
self.add_uint32(Keys.LLM.CONTEXT_LENGTH.format(arch=self.arch), length)
def add_embedding_length(self, length: int) -> None:
self.add_uint32(Keys.LLM.EMBEDDING_LENGTH.format(arch=self.arch), length)
def add_block_count(self, length: int) -> None:
self.add_uint32(Keys.LLM.BLOCK_COUNT.format(arch=self.arch), length)
def add_feed_forward_length(self, length: int) -> None:
self.add_uint32(Keys.LLM.FEED_FORWARD_LENGTH.format(arch=self.arch), length)
def add_parallel_residual(self, use: bool) -> None:
self.add_bool(Keys.LLM.USE_PARALLEL_RESIDUAL.format(arch=self.arch), use)
def add_head_count(self, count: int) -> None:
self.add_uint32(Keys.Attention.HEAD_COUNT.format(arch=self.arch), count)
def add_head_count_kv(self, count: int) -> None:
self.add_uint32(Keys.Attention.HEAD_COUNT_KV.format(arch=self.arch), count)
def add_key_length(self, length: int) -> None:
self.add_uint32(Keys.Attention.KEY_LENGTH.format(arch=self.arch), length)
def add_value_length(self, length: int) -> None:
self.add_uint32(Keys.Attention.VALUE_LENGTH.format(arch=self.arch), length)
def add_max_alibi_bias(self, bias: float) -> None:
self.add_float32(Keys.Attention.MAX_ALIBI_BIAS.format(arch=self.arch), bias)
def add_clamp_kqv(self, value: float) -> None:
self.add_float32(Keys.Attention.CLAMP_KQV.format(arch=self.arch), value)
def add_expert_count(self, count: int) -> None:
self.add_uint32(Keys.LLM.EXPERT_COUNT.format(arch=self.arch), count)
def add_expert_used_count(self, count: int) -> None:
self.add_uint32(Keys.LLM.EXPERT_USED_COUNT.format(arch=self.arch), count)
def add_layer_norm_eps(self, value: float) -> None:
self.add_float32(Keys.Attention.LAYERNORM_EPS.format(arch=self.arch), value)
def add_layer_norm_rms_eps(self, value: float) -> None:
self.add_float32(Keys.Attention.LAYERNORM_RMS_EPS.format(arch=self.arch), value)
def add_rope_dimension_count(self, count: int) -> None:
self.add_uint32(Keys.Rope.DIMENSION_COUNT.format(arch=self.arch), count)
def add_rope_freq_base(self, value: float) -> None:
self.add_float32(Keys.Rope.FREQ_BASE.format(arch=self.arch), value)
def add_rope_scaling_type(self, value: RopeScalingType) -> None:
self.add_string(Keys.Rope.SCALING_TYPE.format(arch=self.arch), value.value)
def add_rope_scaling_factor(self, value: float) -> None:
self.add_float32(Keys.Rope.SCALING_FACTOR.format(arch=self.arch), value)
def add_rope_scaling_orig_ctx_len(self, value: int) -> None:
self.add_uint32(Keys.Rope.SCALING_ORIG_CTX_LEN.format(arch=self.arch), value)
def add_rope_scaling_finetuned(self, value: bool) -> None:
self.add_bool(Keys.Rope.SCALING_FINETUNED.format(arch=self.arch), value)
def add_tokenizer_model(self, model: str) -> None:
self.add_string(Keys.Tokenizer.MODEL, model)
def add_token_list(self, tokens: Sequence[str] | Sequence[bytes] | Sequence[bytearray]) -> None:
self.add_array(Keys.Tokenizer.LIST, tokens)
def add_token_merges(self, merges: Sequence[str] | Sequence[bytes] | Sequence[bytearray]) -> None:
self.add_array(Keys.Tokenizer.MERGES, merges)
def add_token_types(self, types: Sequence[TokenType] | Sequence[int]) -> None:
self.add_array(Keys.Tokenizer.TOKEN_TYPE, types)
def add_token_scores(self, scores: Sequence[float]) -> None:
self.add_array(Keys.Tokenizer.SCORES, scores)
def add_bos_token_id(self, id: int) -> None:
self.add_uint32(Keys.Tokenizer.BOS_ID, id)
def add_eos_token_id(self, id: int) -> None:
self.add_uint32(Keys.Tokenizer.EOS_ID, id)
def add_unk_token_id(self, id: int) -> None:
self.add_uint32(Keys.Tokenizer.UNK_ID, id)
def add_sep_token_id(self, id: int) -> None:
self.add_uint32(Keys.Tokenizer.SEP_ID, id)
def add_pad_token_id(self, id: int) -> None:
self.add_uint32(Keys.Tokenizer.PAD_ID, id)
def add_add_bos_token(self, value: bool) -> None:
self.add_bool(Keys.Tokenizer.ADD_BOS, value)
def add_add_eos_token(self, value: bool) -> None:
self.add_bool(Keys.Tokenizer.ADD_EOS, value)
def add_add_space_prefix(self, value: bool) -> None:
self.add_bool(Keys.Tokenizer.ADD_PREFIX, value)
def add_chat_template(self, value: str) -> None:
self.add_string(Keys.Tokenizer.CHAT_TEMPLATE, value)
def _pack(self, fmt: str, value: Any, skip_pack_prefix: bool = False) -> bytes:
pack_prefix = ''
if not skip_pack_prefix:
pack_prefix = '<' if self.endianess == GGUFEndian.LITTLE else '>'
return struct.pack(f'{pack_prefix}{fmt}', value)
def _write_packed(self, fmt: str, value: Any, skip_pack_prefix: bool = False) -> None:
self.fout.write(self._pack(fmt, value, skip_pack_prefix))
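
A minimal, hedged sketch of the writer workflow above (the architecture string, key values and tensor are placeholders; the call order follows the WriterState checks in the class):

    import numpy as np
    from gguf import GGUFWriter

    gw = GGUFWriter("example.gguf", arch="llama")
    gw.add_name("tiny-example")
    gw.add_context_length(2048)
    # Tensors must be registered while the writer is still in the EMPTY state
    gw.add_tensor("token_embd.weight", np.zeros((16, 8), dtype=np.float32))
    # Header -> KV data -> tensor info + tensor data, as enforced by WriterState
    gw.write_header_to_file()
    gw.write_kv_data_to_file()
    gw.write_tensors_to_file()
    gw.close()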

View File

@ -1,332 +0,0 @@
from __future__ import annotations
from typing import Sequence
from .constants import MODEL_ARCH, MODEL_TENSOR, MODEL_TENSORS, TENSOR_NAMES
class TensorNameMap:
mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
# Token embeddings
MODEL_TENSOR.TOKEN_EMBD: (
"gpt_neox.embed_in", # gptneox
"transformer.wte", # gpt2 gpt-j mpt refact qwen
"transformer.word_embeddings", # falcon
"word_embeddings", # bloom
"model.embed_tokens", # llama-hf
"tok_embeddings", # llama-pth
"embeddings.word_embeddings", # bert
"language_model.embedding.word_embeddings", # persimmon
"wte", # gpt2
"transformer.embd.wte", # phi2
"model.tok_embeddings", # internlm2
),
# Token type embeddings
MODEL_TENSOR.TOKEN_TYPES: (
"embeddings.token_type_embeddings", # bert
),
# Normalization of token embeddings
MODEL_TENSOR.TOKEN_EMBD_NORM: (
"word_embeddings_layernorm", # bloom
),
# Position embeddings
MODEL_TENSOR.POS_EMBD: (
"transformer.wpe", # gpt2
"embeddings.position_embeddings", # bert
"wpe", # gpt2
),
# Output
MODEL_TENSOR.OUTPUT: (
"embed_out", # gptneox
"lm_head", # gpt2 mpt falcon llama-hf baichuan qwen
"output", # llama-pth bloom internlm2
"word_embeddings_for_head", # persimmon
"lm_head.linear", # phi2
),
# Output norm
MODEL_TENSOR.OUTPUT_NORM: (
"gpt_neox.final_layer_norm", # gptneox
"transformer.ln_f", # gpt2 gpt-j falcon
"model.norm", # llama-hf baichuan internlm2
"norm", # llama-pth
"embeddings.LayerNorm", # bert
"transformer.norm_f", # mpt
"ln_f", # refact bloom qwen gpt2
"language_model.encoder.final_layernorm", # persimmon
"model.final_layernorm", # persimmon
"lm_head.ln", # phi2
),
# Rope frequencies
MODEL_TENSOR.ROPE_FREQS: (
"rope.freqs", # llama-pth
),
}
block_mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
# Attention norm
MODEL_TENSOR.ATTN_NORM: (
"gpt_neox.layers.{bid}.input_layernorm", # gptneox
"transformer.h.{bid}.ln_1", # gpt2 gpt-j refact qwen
"transformer.blocks.{bid}.norm_1", # mpt
"transformer.h.{bid}.input_layernorm", # falcon7b
"h.{bid}.input_layernorm", # bloom
"transformer.h.{bid}.ln_mlp", # falcon40b
"model.layers.{bid}.input_layernorm", # llama-hf
"layers.{bid}.attention_norm", # llama-pth
"encoder.layer.{bid}.attention.output.LayerNorm", # bert
"language_model.encoder.layers.{bid}.input_layernorm", # persimmon
"model.layers.{bid}.ln1", # yi
"h.{bid}.ln_1", # gpt2
"transformer.h.{bid}.ln", # phi2
"model.layers.layers.{bid}.norm", # plamo
"model.layers.{bid}.attention_norm", # internlm2
),
# Attention norm 2
MODEL_TENSOR.ATTN_NORM_2: (
"transformer.h.{bid}.ln_attn", # falcon40b
),
# Attention query-key-value
MODEL_TENSOR.ATTN_QKV: (
"gpt_neox.layers.{bid}.attention.query_key_value", # gptneox
"transformer.h.{bid}.attn.c_attn", # gpt2 qwen
"transformer.blocks.{bid}.attn.Wqkv", # mpt
"transformer.h.{bid}.self_attention.query_key_value", # falcon
"h.{bid}.self_attention.query_key_value", # bloom
"language_model.encoder.layers.{bid}.self_attention.query_key_value", # persimmon
"model.layers.{bid}.self_attn.query_key_value", # persimmon
"h.{bid}.attn.c_attn", # gpt2
"transformer.h.{bid}.mixer.Wqkv", # phi2
),
# Attention query
MODEL_TENSOR.ATTN_Q: (
"model.layers.{bid}.self_attn.q_proj", # llama-hf
"layers.{bid}.attention.wq", # llama-pth
"encoder.layer.{bid}.attention.self.query", # bert
"transformer.h.{bid}.attn.q_proj", # gpt-j
"model.layers.layers.{bid}.self_attn.q_proj", # plamo
"model.layers.{bid}.attention.wq" # internlm2
),
# Attention key
MODEL_TENSOR.ATTN_K: (
"model.layers.{bid}.self_attn.k_proj", # llama-hf
"layers.{bid}.attention.wk", # llama-pth
"encoder.layer.{bid}.attention.self.key", # bert
"transformer.h.{bid}.attn.k_proj", # gpt-j
"model.layers.layers.{bid}.self_attn.k_proj", # plamo
"model.layers.{bid}.attention.wk" # internlm2
),
# Attention value
MODEL_TENSOR.ATTN_V: (
"model.layers.{bid}.self_attn.v_proj", # llama-hf
"layers.{bid}.attention.wv", # llama-pth
"encoder.layer.{bid}.attention.self.value", # bert
"transformer.h.{bid}.attn.v_proj", # gpt-j
"model.layers.layers.{bid}.self_attn.v_proj", # plamo
"model.layers.{bid}.attention.wv" # internlm2
),
# Attention output
MODEL_TENSOR.ATTN_OUT: (
"gpt_neox.layers.{bid}.attention.dense", # gptneox
"transformer.h.{bid}.attn.c_proj", # gpt2 refact qwen
"transformer.blocks.{bid}.attn.out_proj", # mpt
"transformer.h.{bid}.self_attention.dense", # falcon
"h.{bid}.self_attention.dense", # bloom
"model.layers.{bid}.self_attn.o_proj", # llama-hf
"layers.{bid}.attention.wo", # llama-pth
"encoder.layer.{bid}.attention.output.dense", # bert
"transformer.h.{bid}.attn.out_proj", # gpt-j
"language_model.encoder.layers.{bid}.self_attention.dense", # persimmon
"model.layers.{bid}.self_attn.dense", # persimmon
"h.{bid}.attn.c_proj", # gpt2
"transformer.h.{bid}.mixer.out_proj", # phi2
"model.layers.layers.{bid}.self_attn.o_proj", # plamo
"model.layers.{bid}.attention.wo", # internlm2
),
# Rotary embeddings
MODEL_TENSOR.ATTN_ROT_EMBD: (
"model.layers.{bid}.self_attn.rotary_emb.inv_freq", # llama-hf
"layers.{bid}.attention.inner_attention.rope.freqs", # llama-pth
"model.layers.layers.{bid}.self_attn.rotary_emb.inv_freq", # plamo
"transformer.h.{bid}.attn.rotary_emb.inv_freq", # codeshell
),
# Feed-forward norm
MODEL_TENSOR.FFN_NORM: (
"gpt_neox.layers.{bid}.post_attention_layernorm", # gptneox
"transformer.h.{bid}.ln_2", # gpt2 refact qwen
"h.{bid}.post_attention_layernorm", # bloom
"transformer.blocks.{bid}.norm_2", # mpt
"model.layers.{bid}.post_attention_layernorm", # llama-hf
"layers.{bid}.ffn_norm", # llama-pth
"encoder.layer.{bid}.output.LayerNorm", # bert
"language_model.encoder.layers.{bid}.post_attention_layernorm", # persimmon
"model.layers.{bid}.ln2", # yi
"h.{bid}.ln_2", # gpt2
"model.layers.{bid}.ffn_norm", # internlm2
),
MODEL_TENSOR.FFN_GATE_INP: (
"layers.{bid}.feed_forward.gate", # mixtral
"model.layers.{bid}.block_sparse_moe.gate", # mixtral
),
# Feed-forward up
MODEL_TENSOR.FFN_UP: (
"gpt_neox.layers.{bid}.mlp.dense_h_to_4h", # gptneox
"transformer.h.{bid}.mlp.c_fc", # gpt2
"transformer.blocks.{bid}.ffn.up_proj", # mpt
"transformer.h.{bid}.mlp.dense_h_to_4h", # falcon
"h.{bid}.mlp.dense_h_to_4h", # bloom
"model.layers.{bid}.mlp.up_proj", # llama-hf refact
"layers.{bid}.feed_forward.w3", # llama-pth
"encoder.layer.{bid}.intermediate.dense", # bert
"transformer.h.{bid}.mlp.fc_in", # gpt-j
"language_model.encoder.layers.{bid}.mlp.dense_h_to_4h", # persimmon
"model.layers.{bid}.mlp.dense_h_to_4h", # persimmon
"transformer.h.{bid}.mlp.w1", # qwen
"h.{bid}.mlp.c_fc", # gpt2
"transformer.h.{bid}.mlp.fc1", # phi2
"model.layers.{bid}.mlp.fc1", # phi2
"model.layers.layers.{bid}.mlp.up_proj", # plamo
"model.layers.{bid}.feed_forward.w3", # internlm2
),
MODEL_TENSOR.FFN_UP_EXP: (
"layers.{bid}.feed_forward.experts.{xid}.w3", # mixtral
"model.layers.{bid}.block_sparse_moe.experts.{xid}.w3", # mixtral
),
# AWQ-activation gate
MODEL_TENSOR.FFN_ACT: (
"transformer.blocks.{bid}.ffn.act", # mpt
),
# Feed-forward gate
MODEL_TENSOR.FFN_GATE: (
"model.layers.{bid}.mlp.gate_proj", # llama-hf refact
"layers.{bid}.feed_forward.w1", # llama-pth
"transformer.h.{bid}.mlp.w2", # qwen
"model.layers.layers.{bid}.mlp.gate_proj", # plamo
"model.layers.{bid}.feed_forward.w1", # internlm2
),
MODEL_TENSOR.FFN_GATE_EXP: (
"layers.{bid}.feed_forward.experts.{xid}.w1", # mixtral
"model.layers.{bid}.block_sparse_moe.experts.{xid}.w1", # mixtral
),
# Feed-forward down
MODEL_TENSOR.FFN_DOWN: (
"gpt_neox.layers.{bid}.mlp.dense_4h_to_h", # gptneox
"transformer.h.{bid}.mlp.c_proj", # gpt2 refact qwen
"transformer.blocks.{bid}.ffn.down_proj", # mpt
"transformer.h.{bid}.mlp.dense_4h_to_h", # falcon
"h.{bid}.mlp.dense_4h_to_h", # bloom
"model.layers.{bid}.mlp.down_proj", # llama-hf
"layers.{bid}.feed_forward.w2", # llama-pth
"encoder.layer.{bid}.output.dense", # bert
"transformer.h.{bid}.mlp.fc_out", # gpt-j
"language_model.encoder.layers.{bid}.mlp.dense_4h_to_h", # persimmon
"model.layers.{bid}.mlp.dense_4h_to_h", # persimmon
"h.{bid}.mlp.c_proj", # gpt2
"transformer.h.{bid}.mlp.fc2", # phi2
"model.layers.{bid}.mlp.fc2", # phi2
"model.layers.layers.{bid}.mlp.down_proj", # plamo
"model.layers.{bid}.feed_forward.w2", # internlm2
),
MODEL_TENSOR.FFN_DOWN_EXP: (
"layers.{bid}.feed_forward.experts.{xid}.w2", # mixtral
"model.layers.{bid}.block_sparse_moe.experts.{xid}.w2", # mixtral
),
MODEL_TENSOR.ATTN_Q_NORM: (
"language_model.encoder.layers.{bid}.self_attention.q_layernorm",
"model.layers.{bid}.self_attn.q_layernorm", # persimmon
),
MODEL_TENSOR.ATTN_K_NORM: (
"language_model.encoder.layers.{bid}.self_attention.k_layernorm",
"model.layers.{bid}.self_attn.k_layernorm", # persimmon
),
MODEL_TENSOR.ROPE_FREQS: (
"language_model.encoder.layers.{bid}.self_attention.rotary_emb.inv_freq", # persimmon
),
}
mapping: dict[str, tuple[MODEL_TENSOR, str]]
def __init__(self, arch: MODEL_ARCH, n_blocks: int):
self.mapping = {}
for tensor, keys in self.mappings_cfg.items():
if tensor not in MODEL_TENSORS[arch]:
continue
tensor_name = TENSOR_NAMES[tensor]
self.mapping[tensor_name] = (tensor, tensor_name)
for key in keys:
self.mapping[key] = (tensor, tensor_name)
for bid in range(n_blocks):
for tensor, keys in self.block_mappings_cfg.items():
if tensor not in MODEL_TENSORS[arch]:
continue
# TODO: make this configurable
n_experts = 8
for xid in range(n_experts):
tensor_name = TENSOR_NAMES[tensor].format(bid = bid, xid = xid)
self.mapping[tensor_name] = (tensor, tensor_name)
for key in keys:
key = key.format(bid = bid, xid = xid)
self.mapping[key] = (tensor, tensor_name)
def get_type_and_name(self, key: str, try_suffixes: Sequence[str] = ()) -> tuple[MODEL_TENSOR, str] | None:
result = self.mapping.get(key)
if result is not None:
return result
for suffix in try_suffixes:
if key.endswith(suffix):
result = self.mapping.get(key[:-len(suffix)])
if result is not None:
return result[0], result[1] + suffix
return None
def get_name(self, key: str, try_suffixes: Sequence[str] = ()) -> str | None:
result = self.get_type_and_name(key, try_suffixes = try_suffixes)
if result is None:
return None
return result[1]
def get_type(self, key: str, try_suffixes: Sequence[str] = ()) -> MODEL_TENSOR | None:
result = self.get_type_and_name(key, try_suffixes = try_suffixes)
if result is None:
return None
return result[0]
def __getitem__(self, key: str) -> str:
try:
return self.mapping[key][1]
except KeyError:
raise KeyError(key)
def __contains__(self, key: str) -> bool:
return key in self.mapping
def __repr__(self) -> str:
return repr(self.mapping)
def get_tensor_name_map(arch: MODEL_ARCH, n_blocks: int) -> TensorNameMap:
return TensorNameMap(arch, n_blocks)
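
A short sketch of how this mapping is typically used by converters (MODEL_ARCH.LLAMA is assumed to be defined in gguf.constants, and the Hugging Face tensor name is illustrative):

    import gguf

    tmap = gguf.get_tensor_name_map(gguf.MODEL_ARCH.LLAMA, n_blocks=32)
    # Resolve a framework-specific name to the canonical GGUF tensor name,
    # optionally stripping and re-appending a suffix such as ".weight"
    name = tmap.get_name("model.layers.0.self_attn.q_proj.weight",
                         try_suffixes=(".weight", ".bias"))
    print(name)  # expected: "blk.0.attn_q.weight"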

View File

@ -1,185 +0,0 @@
from __future__ import annotations
import json
import os
import sys
from pathlib import Path
from typing import Any, Callable
from .gguf_writer import GGUFWriter
class SpecialVocab:
merges: list[str]
add_special_token: dict[str, bool]
special_token_ids: dict[str, int]
chat_template: str | None
def __init__(
self, path: str | os.PathLike[str], load_merges: bool = False,
special_token_types: tuple[str, ...] | None = None,
n_vocab: int | None = None,
):
self.special_token_ids = {}
self.add_special_token = {}
self.n_vocab = n_vocab
self.load_merges = load_merges
self.merges = []
self.chat_template = None
if special_token_types is not None:
self.special_token_types = special_token_types
else:
self.special_token_types = ('bos', 'eos', 'unk', 'sep', 'pad')
self._load(Path(path))
def __repr__(self) -> str:
return '<SpecialVocab with {} merges, special tokens {}, add special tokens {}>'.format(
len(self.merges), self.special_token_ids or "unset", self.add_special_token or "unset",
)
def add_to_gguf(self, gw: GGUFWriter, quiet: bool = False) -> None:
if self.merges:
if not quiet:
print(f'gguf: Adding {len(self.merges)} merge(s).')
gw.add_token_merges(self.merges)
elif self.load_merges:
print(
'gguf: WARNING: Adding merges requested but no merges found, output may be non-functional.',
file = sys.stderr,
)
for typ, tokid in self.special_token_ids.items():
id_handler: Callable[[int], None] | None = getattr(gw, f'add_{typ}_token_id', None)
if id_handler is None:
print(
f'gguf: WARNING: No handler for special token type {typ} with id {tokid} - skipping',
file = sys.stderr,
)
continue
if not quiet:
print(f'gguf: Setting special token type {typ} to {tokid}')
id_handler(tokid)
for typ, value in self.add_special_token.items():
add_handler: Callable[[bool], None] | None = getattr(gw, f'add_add_{typ}_token', None)
if add_handler is None:
print(
f'gguf: WARNING: No handler for add_{typ}_token with value {value} - skipping',
file = sys.stderr,
)
continue
if not quiet:
print(f'gguf: Setting add_{typ}_token to {value}')
add_handler(value)
if self.chat_template is not None:
if not quiet:
print(f'gguf: Setting chat_template to {self.chat_template}')
gw.add_chat_template(self.chat_template)
def _load(self, path: Path) -> None:
self._try_load_from_tokenizer_json(path)
self._try_load_from_config_json(path)
if self.load_merges and not self.merges:
self._try_load_merges_txt(path)
def _try_load_merges_txt(self, path: Path) -> bool:
merges_file = path / 'merges.txt'
if not merges_file.is_file():
return False
with open(merges_file, 'r', encoding = 'utf-8') as fp:
first_line = next(fp, '').strip()
if not first_line.startswith('#'):
fp.seek(0)
line_num = 0
else:
line_num = 1
merges = []
for line in fp:
line_num += 1
line = line.strip()
if not line:
continue
parts = line.split(None, 3)
if len(parts) != 2:
print(
f'gguf: WARNING: {merges_file.name}: Line {line_num}: Entry malformed, ignoring',
file = sys.stderr,
)
continue
merges.append(f'{parts[0]} {parts[1]}')
self.merges = merges
return True
def _set_special_token(self, typ: str, tid: Any) -> None:
if not isinstance(tid, int):
return
if tid < 0:
raise ValueError(f'invalid value for special token type {typ}: {tid}')
if self.n_vocab is None or tid < self.n_vocab:
if typ in self.special_token_ids:
return
self.special_token_ids[typ] = tid
return
print(
f'gguf: WARNING: Special token type {typ}, id {tid} out of range, must be under {self.n_vocab} - skipping',
file = sys.stderr,
)
def _try_load_from_tokenizer_json(self, path: Path) -> bool:
tokenizer_file = path / 'tokenizer.json'
if tokenizer_file.is_file():
with open(tokenizer_file, encoding = 'utf-8') as f:
tokenizer = json.load(f)
if self.load_merges:
merges = tokenizer.get('model', {}).get('merges')
if isinstance(merges, list) and merges and isinstance(merges[0], str):
self.merges = merges
added_tokens = tokenizer.get('added_tokens', {})
else:
added_tokens = {}
tokenizer_config_file = path / 'tokenizer_config.json'
if not tokenizer_config_file.is_file():
return True
with open(tokenizer_config_file, encoding = 'utf-8') as f:
tokenizer_config = json.load(f)
chat_template = tokenizer_config.get('chat_template')
if chat_template is None or isinstance(chat_template, str):
self.chat_template = chat_template
else:
print(
f'gguf: WARNING: Bad type for chat_template field in {tokenizer_config_file!r} - ignoring',
file = sys.stderr
)
for typ in self.special_token_types:
add_entry = tokenizer_config.get(f'add_{typ}_token')
if isinstance(add_entry, bool):
self.add_special_token[typ] = add_entry
if not added_tokens:
# We will need this to get the content for the token, so if it's empty
# may as well just give up.
continue
entry = tokenizer_config.get(f'{typ}_token')
if isinstance(entry, str):
tc_content = entry
elif isinstance(entry, dict):
entry_content = entry.get('content')
if not isinstance(entry_content, str):
continue
tc_content = entry_content
else:
continue
# We only need the first match here.
maybe_token_id = next(
(atok.get('id') for atok in added_tokens if atok.get('content') == tc_content),
None,
)
self._set_special_token(typ, maybe_token_id)
return True
def _try_load_from_config_json(self, path: Path) -> bool:
config_file = path / 'config.json'
if not config_file.is_file():
return False
with open(config_file, encoding = 'utf-8') as f:
config = json.load(f)
for typ in self.special_token_types:
self._set_special_token(typ, config.get(f'{typ}_token_id'))
return True
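
A hedged sketch of wiring SpecialVocab into a writer (the model directory is a placeholder expected to contain the tokenizer.json / tokenizer_config.json / config.json files handled above):

    from gguf import GGUFWriter, SpecialVocab

    gw = GGUFWriter("out.gguf", arch="llama")
    sv = SpecialVocab("/path/to/hf-model", load_merges=True, n_vocab=32000)
    # Emits the tokenizer.ggml.* merges, special token ids, add_*_token flags
    # and chat_template onto the writer
    sv.add_to_gguf(gw)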

View File

@ -1,35 +0,0 @@
[tool.poetry]
name = "gguf"
version = "0.7.0"
description = "Read and write ML models in GGUF for GGML"
authors = ["GGML <ggml@ggml.ai>"]
packages = [
{include = "gguf"},
{include = "gguf/py.typed"},
{include = "scripts"},
]
readme = "README.md"
homepage = "https://ggml.ai"
repository = "https://github.com/ggerganov/llama.cpp"
keywords = ["ggml", "gguf", "llama.cpp"]
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
]
[tool.poetry.dependencies]
python = ">=3.8"
numpy = ">=1.17"
[tool.poetry.dev-dependencies]
pytest = "^5.2"
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.poetry.scripts]
gguf-convert-endian = "scripts:gguf_convert_endian_entrypoint"
gguf-dump = "scripts:gguf_dump_entrypoint"
gguf-set-metadata = "scripts:gguf_set_metadata_entrypoint"

View File

@ -1,12 +0,0 @@
import os
from importlib import import_module
os.environ["NO_LOCAL_GGUF"] = "TRUE"
gguf_convert_endian_entrypoint = import_module("scripts.gguf-convert-endian").main
gguf_dump_entrypoint = import_module("scripts.gguf-dump").main
gguf_set_metadata_entrypoint = import_module("scripts.gguf-set-metadata").main
del import_module, os

View File

@ -1,112 +0,0 @@
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import os
import sys
from pathlib import Path
import numpy as np
# Necessary to load the local gguf package
if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
sys.path.insert(0, str(Path(__file__).parent.parent))
import gguf
def convert_byteorder(reader: gguf.GGUFReader, args: argparse.Namespace) -> None:
if np.uint32(1) == np.uint32(1).newbyteorder("<"):
# Host is little endian
host_endian = "little"
swapped_endian = "big"
else:
# Sorry PDP or other weird systems that don't use BE or LE.
host_endian = "big"
swapped_endian = "little"
if reader.byte_order == "S":
file_endian = swapped_endian
else:
file_endian = host_endian
order = host_endian if args.order == "native" else args.order
print(f"* Host is {host_endian.upper()} endian, GGUF file seems to be {file_endian.upper()} endian")
if file_endian == order:
print(f"* File is already {order.upper()} endian. Nothing to do.")
sys.exit(0)
print("* Checking tensors for conversion compatibility")
for tensor in reader.tensors:
if tensor.tensor_type not in (
gguf.GGMLQuantizationType.F32,
gguf.GGMLQuantizationType.F16,
gguf.GGMLQuantizationType.Q8_0,
):
raise ValueError(f"Cannot handle type {tensor.tensor_type.name} for tensor {repr(tensor.name)}")
print(f"* Preparing to convert from {file_endian.upper()} to {order.upper()}")
if args.dry_run:
return
print("\n*** Warning *** Warning *** Warning **")
print("* This conversion process may damage the file. Ensure you have a backup.")
if order != host_endian:
print("* Requested endian differs from host, you will not be able to load the model on this machine.")
print("* The file will be modified immediately, so if conversion fails or is interrupted")
print("* the file will be corrupted. Enter exactly YES if you are positive you want to proceed:")
response = input("YES, I am sure> ")
if response != "YES":
print("You didn't enter YES. Okay then, see ya!")
sys.exit(0)
print(f"\n* Converting fields ({len(reader.fields)})")
for idx, field in enumerate(reader.fields.values()):
print(f"- {idx:4}: Converting field {repr(field.name)}, part count: {len(field.parts)}")
for part in field.parts:
part.byteswap(inplace=True)
print(f"\n* Converting tensors ({len(reader.tensors)})")
for idx, tensor in enumerate(reader.tensors):
print(
f" - {idx:4}: Converting tensor {repr(tensor.name)}, type={tensor.tensor_type.name}, "
f"elements={tensor.n_elements}... ",
end="",
)
tensor_type = tensor.tensor_type
for part in tensor.field.parts:
part.byteswap(inplace=True)
if tensor_type != gguf.GGMLQuantizationType.Q8_0:
tensor.data.byteswap(inplace=True)
print()
continue
# A Q8_0 block consists of a f16 delta followed by 32 int8 quants, so 34 bytes
block_size = 34
n_blocks = len(tensor.data) // block_size
for block_num in range(n_blocks):
block_offs = block_num * block_size
# I know I said f16, but it doesn't matter here - any simple 16 bit type works.
delta = tensor.data[block_offs:block_offs + 2].view(dtype=np.uint16)
delta.byteswap(inplace=True)
if block_num % 100000 == 0:
print(f"[{(n_blocks - block_num) // 1000}K]", end="")
sys.stdout.flush()
print()
print("* Completion")
def main() -> None:
parser = argparse.ArgumentParser(description="Convert GGUF file byte order")
parser.add_argument(
"model", type=str,
help="GGUF format model filename",
)
parser.add_argument(
"order", type=str, choices=['big', 'little', 'native'],
help="Requested byte order",
)
parser.add_argument(
"--dry-run", action="store_true",
help="Don't actually change anything",
)
args = parser.parse_args(None if len(sys.argv) > 1 else ["--help"])
print(f'* Loading: {args.model}')
reader = gguf.GGUFReader(args.model, 'r' if args.dry_run else 'r+')
convert_byteorder(reader, args)
if __name__ == "__main__":
main()

View File

@ -1,117 +0,0 @@
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import os
import sys
from pathlib import Path
from typing import Any
import numpy as np
# Necessary to load the local gguf package
if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
sys.path.insert(0, str(Path(__file__).parent.parent))
from gguf import GGUFReader, GGUFValueType # noqa: E402
def get_file_host_endian(reader: GGUFReader) -> tuple[str, str]:
host_endian = 'LITTLE' if np.uint32(1) == np.uint32(1).newbyteorder("<") else 'BIG'
if reader.byte_order == 'S':
file_endian = 'BIG' if host_endian == 'LITTLE' else 'LITTLE'
else:
file_endian = host_endian
return (host_endian, file_endian)
# For more information about what field.parts and field.data represent,
# please see the comments in the modify_gguf.py example.
def dump_metadata(reader: GGUFReader, args: argparse.Namespace) -> None:
host_endian, file_endian = get_file_host_endian(reader)
print(f'* File is {file_endian} endian, script is running on a {host_endian} endian host.')
print(f'\n* Dumping {len(reader.fields)} key/value pair(s)')
for n, field in enumerate(reader.fields.values(), 1):
if not field.types:
pretty_type = 'N/A'
elif field.types[0] == GGUFValueType.ARRAY:
nest_count = len(field.types) - 1
pretty_type = '[' * nest_count + str(field.types[-1].name) + ']' * nest_count
else:
pretty_type = str(field.types[-1].name)
print(f' {n:5}: {pretty_type:10} | {len(field.data):8} | {field.name}', end = '')
if len(field.types) == 1:
curr_type = field.types[0]
if curr_type == GGUFValueType.STRING:
print(' = {0}'.format(repr(str(bytes(field.parts[-1]), encoding='utf8')[:60])), end = '')
elif field.types[0] in reader.gguf_scalar_to_np:
print(' = {0}'.format(field.parts[-1][0]), end = '')
print()
if args.no_tensors:
return
print(f'\n* Dumping {len(reader.tensors)} tensor(s)')
for n, tensor in enumerate(reader.tensors, 1):
prettydims = ', '.join('{0:5}'.format(d) for d in list(tensor.shape) + [1] * (4 - len(tensor.shape)))
print(f' {n:5}: {tensor.n_elements:10} | {prettydims} | {tensor.tensor_type.name:7} | {tensor.name}')
def dump_metadata_json(reader: GGUFReader, args: argparse.Namespace) -> None:
import json
host_endian, file_endian = get_file_host_endian(reader)
metadata: dict[str, Any] = {}
tensors: dict[str, Any] = {}
result = {
"filename": args.model,
"endian": file_endian,
"metadata": metadata,
"tensors": tensors,
}
for idx, field in enumerate(reader.fields.values()):
curr: dict[str, Any] = {
"index": idx,
"type": field.types[0].name if field.types else 'UNKNOWN',
"offset": field.offset,
}
metadata[field.name] = curr
if field.types[:1] == [GGUFValueType.ARRAY]:
curr["array_types"] = [t.name for t in field.types][1:]
if not args.json_array:
continue
itype = field.types[-1]
if itype == GGUFValueType.STRING:
curr["value"] = [str(bytes(field.parts[idx]), encoding="utf-8") for idx in field.data]
else:
curr["value"] = [pv for idx in field.data for pv in field.parts[idx].tolist()]
elif field.types[0] == GGUFValueType.STRING:
curr["value"] = str(bytes(field.parts[-1]), encoding="utf-8")
else:
curr["value"] = field.parts[-1].tolist()[0]
if not args.no_tensors:
for idx, tensor in enumerate(reader.tensors):
tensors[tensor.name] = {
"index": idx,
"shape": tensor.shape.tolist(),
"type": tensor.tensor_type.name,
"offset": tensor.field.offset,
}
json.dump(result, sys.stdout)
def main() -> None:
parser = argparse.ArgumentParser(description="Dump GGUF file metadata")
parser.add_argument("model", type=str, help="GGUF format model filename")
parser.add_argument("--no-tensors", action="store_true", help="Don't dump tensor metadata")
parser.add_argument("--json", action="store_true", help="Produce JSON output")
parser.add_argument("--json-array", action="store_true", help="Include full array values in JSON output (long)")
args = parser.parse_args(None if len(sys.argv) > 1 else ["--help"])
if not args.json:
print(f'* Loading: {args.model}')
reader = GGUFReader(args.model, 'r')
if args.json:
dump_metadata_json(reader, args)
else:
dump_metadata(reader, args)
if __name__ == '__main__':
main()

View File

@ -1,90 +0,0 @@
#!/usr/bin/env python3
import argparse
import os
import sys
from pathlib import Path
# Necessary to load the local gguf package
if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
sys.path.insert(0, str(Path(__file__).parent.parent))
from gguf import GGUFReader # noqa: E402
def minimal_example(filename: str) -> None:
reader = GGUFReader(filename, 'r+')
field = reader.fields['tokenizer.ggml.bos_token_id']
if field is None:
return
part_index = field.data[0]
field.parts[part_index][0] = 2 # Set tokenizer.ggml.bos_token_id to 2
#
# So what's this field.data thing? It's helpful because field.parts contains
# _every_ part of the GGUF field. For example, tokenizer.ggml.bos_token_id consists
# of:
#
# Part index 0: Key length (27)
# Part index 1: Key data ("tokenizer.ggml.bos_token_id")
# Part index 2: Field type (4, the id for GGUFValueType.UINT32)
# Part index 3: Field value
#
# Note also that each part is an NDArray slice, so even a part that
is only a single value like the key length will be an NDArray of
# the key length type (numpy.uint32).
#
# The .data attribute in the Field is a list of relevant part indexes
# and doesn't contain internal GGUF details like the key length part.
# In this case, .data will be [3] - just the part index of the
# field value itself.
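# A hedged recap of the layout above (the value itself is hypothetical):
#   field.parts -> [key_len (27), key bytes, field type (4 = UINT32), value]
#   field.data  -> [3]
# so field.parts[field.data[0]][0] reads or writes the value without
# hard-coding the part index.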
def set_metadata(reader: GGUFReader, args: argparse.Namespace) -> None:
field = reader.get_field(args.key)
if field is None:
print(f'! Field {repr(args.key)} not found', file = sys.stderr)
sys.exit(1)
# Note that field.types is a list of types. This is because the GGUF
# format supports arrays. For example, an array of UINT32 would
# look like [GGUFValueType.ARRAY, GGUFValueType.UINT32]
handler = reader.gguf_scalar_to_np.get(field.types[0]) if field.types else None
if handler is None:
print(
f'! This tool only supports changing simple values, {repr(args.key)} has unsupported type {field.types}',
file = sys.stderr,
)
sys.exit(1)
current_value = field.parts[field.data[0]][0]
new_value = handler(args.value)
print(f'* Preparing to change field {repr(args.key)} from {current_value} to {new_value}')
if current_value == new_value:
print(f'- Key {repr(args.key)} already set to requested value {current_value}')
sys.exit(0)
if args.dry_run:
sys.exit(0)
if not args.force:
print('*** Warning *** Warning *** Warning **')
print('* Changing fields in a GGUF file can make it unusable. Proceed at your own risk.')
print('* Enter exactly YES if you are positive you want to proceed:')
response = input('YES, I am sure> ')
if response != 'YES':
print("You didn't enter YES. Okay then, see ya!")
sys.exit(0)
field.parts[field.data[0]][0] = new_value
print('* Field changed. Successful completion.')
def main() -> None:
parser = argparse.ArgumentParser(description="Set a simple value in GGUF file metadata")
parser.add_argument("model", type=str, help="GGUF format model filename")
parser.add_argument("key", type=str, help="Metadata key to set")
parser.add_argument("value", type=str, help="Metadata value to set")
parser.add_argument("--dry-run", action="store_true", help="Don't actually change anything")
parser.add_argument("--force", action="store_true", help="Change the field without confirmation")
args = parser.parse_args(None if len(sys.argv) > 1 else ["--help"])
print(f'* Loading: {args.model}')
reader = GGUFReader(args.model, 'r' if args.dry_run else 'r+')
set_metadata(reader, args)
if __name__ == '__main__':
main()

View File

@ -1,7 +0,0 @@
import gguf # noqa: F401
# TODO: add tests
def test_write_gguf() -> None:
pass

View File

@ -1,14 +0,0 @@
import subprocess
import sys
deps = [
'numpy~=1.24.4',
'sentencepiece~=0.1.98',
'transformers>=4.35.2,<5.0.0',
'gguf>=0.1.0',
'protobuf>=4.21.0,<5.0.0',
'torch~=2.1.1',
'packaging>=20.0',
'tiktoken~=0.5.0'
]
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', '--force-reinstall', *deps])

View File

@ -1 +0,0 @@
b2106

View File

@ -19,8 +19,6 @@ import {
DownloadRequest,
executeOnMain,
HuggingFaceRepoData,
Quantization,
log,
getFileSize,
AllQuantizations,
ModelEvent,
@ -353,7 +351,7 @@ export default class JanModelExtension extends ModelExtension {
}
/**
* Saves a machine learning model.
* Saves a model file.
* @param model - The model to save.
* @returns A Promise that resolves when the model is saved.
*/
@ -565,6 +563,19 @@ export default class JanModelExtension extends ModelExtension {
}
const defaultModel = (await this.getDefaultModel()) as Model
const metadata = await executeOnMain(
NODE,
'retrieveGGUFMetadata',
await joinPath([
await getJanDataFolderPath(),
'models',
dirName,
binaryFileName,
])
)
const eos_id = metadata?.['tokenizer.ggml.eos_token_id']
if (!defaultModel) {
console.error('Unable to find default model')
return
@ -581,8 +592,20 @@ export default class JanModelExtension extends ModelExtension {
filename: binaryFileName,
},
],
parameters: {
...defaultModel.parameters,
stop: eos_id
? [metadata['tokenizer.ggml.tokens'][eos_id] ?? '']
: defaultModel.parameters.stop,
},
settings: {
...defaultModel.settings,
prompt_template:
metadata?.parsed_chat_template ??
defaultModel.settings.prompt_template,
ctx_len:
metadata?.['llama.context_length'] ?? defaultModel.settings.ctx_len,
ngl: (metadata?.['llama.block_count'] ?? 32) + 1,
llama_model_path: binaryFileName,
},
created: Date.now(),
@ -657,6 +680,13 @@ export default class JanModelExtension extends ModelExtension {
return
}
const metadata = await executeOnMain(
NODE,
'retrieveGGUFMetadata',
modelBinaryPath
)
const eos_id = metadata?.['tokenizer.ggml.eos_token_id']
const binaryFileName = await baseName(modelBinaryPath)
const model: Model = {
@ -669,8 +699,21 @@ export default class JanModelExtension extends ModelExtension {
filename: binaryFileName,
},
],
parameters: {
...defaultModel.parameters,
stop: eos_id
? [metadata?.['tokenizer.ggml.tokens'][eos_id] ?? '']
: defaultModel.parameters.stop,
},
settings: {
...defaultModel.settings,
prompt_template:
metadata?.parsed_chat_template ??
defaultModel.settings.prompt_template,
ctx_len:
metadata?.['llama.context_length'] ?? defaultModel.settings.ctx_len,
ngl: (metadata?.['llama.block_count'] ?? 32) + 1,
llama_model_path: binaryFileName,
},
created: Date.now(),
@ -710,9 +753,17 @@ export default class JanModelExtension extends ModelExtension {
const updatedModel: Model = {
...model,
...modelInfo,
parameters: {
...model.parameters,
...modelInfo.parameters,
},
settings: {
...model.settings,
...modelInfo.settings,
},
metadata: {
...model.metadata,
tags: modelInfo.metadata?.tags ?? [],
...modelInfo.metadata,
},
}
@ -826,218 +877,4 @@ export default class JanModelExtension extends ModelExtension {
importedModels
)
}
private getGgufFileList(
repoData: HuggingFaceRepoData,
selectedQuantization: Quantization
): string[] {
return repoData.siblings
.map((file) => file.rfilename)
.filter((file) => file.indexOf(selectedQuantization) !== -1)
.filter((file) => file.endsWith('.gguf'))
}
private getFileList(repoData: HuggingFaceRepoData): string[] {
// SafeTensors first, if not, then PyTorch
const modelFiles = repoData.siblings
.map((file) => file.rfilename)
.filter((file) =>
JanModelExtension._safetensorsRegexs.some((regex) => regex.test(file))
)
if (modelFiles.length === 0) {
repoData.siblings.forEach((file) => {
if (
JanModelExtension._pytorchRegexs.some((regex) =>
regex.test(file.rfilename)
)
) {
modelFiles.push(file.rfilename)
}
})
}
const vocabFiles = [
'tokenizer.model',
'vocab.json',
'tokenizer.json',
].filter((file) =>
repoData.siblings.some((sibling) => sibling.rfilename === file)
)
const etcFiles = repoData.siblings
.map((file) => file.rfilename)
.filter(
(file) =>
(file.endsWith('.json') && !vocabFiles.includes(file)) ||
file.endsWith('.txt') ||
file.endsWith('.py') ||
file.endsWith('.tiktoken')
)
return [...modelFiles, ...vocabFiles, ...etcFiles]
}
private async getModelDirPath(repoID: string): Promise<string> {
const modelName = repoID.split('/').slice(1).join('/')
return joinPath([await getJanDataFolderPath(), 'models', modelName])
}
private async getConvertedModelPath(repoID: string): Promise<string> {
const modelName = repoID.split('/').slice(1).join('/')
const modelDirPath = await this.getModelDirPath(repoID)
return joinPath([modelDirPath, modelName + '.gguf'])
}
private async getQuantizedModelPath(
repoID: string,
quantization: Quantization
): Promise<string> {
const modelName = repoID.split('/').slice(1).join('/')
const modelDirPath = await this.getModelDirPath(repoID)
return joinPath([
modelDirPath,
modelName + `-${quantization.toLowerCase()}.gguf`,
])
}
private getCtxLength(config: {
max_sequence_length?: number
max_position_embeddings?: number
n_ctx?: number
}): number {
if (config.max_sequence_length) return config.max_sequence_length
if (config.max_position_embeddings) return config.max_position_embeddings
if (config.n_ctx) return config.n_ctx
return 2048
}
/**
* Converts a Hugging Face model to GGUF.
* @param repoID - The repo ID of the model to convert.
* @returns A promise that resolves when the conversion is complete.
*/
async convert(repoID: string): Promise<void> {
if (this.interrupted) return
const modelDirPath = await this.getModelDirPath(repoID)
const modelOutPath = await this.getConvertedModelPath(repoID)
if (!(await fs.existsSync(modelDirPath))) {
throw new Error('Model dir not found')
}
if (await fs.existsSync(modelOutPath)) return
await executeOnMain(NODE, 'installDeps')
if (this.interrupted) return
try {
await executeOnMain(
NODE,
'convertHf',
modelDirPath,
modelOutPath + '.temp'
)
} catch (err) {
log(`[Conversion]::Debug: Error using hf-to-gguf.py, trying convert.py`)
let ctx = 2048
try {
const config = await fs.readFileSync(
await joinPath([modelDirPath, 'config.json']),
'utf8'
)
const configParsed = JSON.parse(config)
ctx = this.getCtxLength(configParsed)
configParsed.max_sequence_length = ctx
await fs.writeFileSync(
await joinPath([modelDirPath, 'config.json']),
JSON.stringify(configParsed, null, 2)
)
} catch (err) {
log(`${err}`)
// ignore missing config.json
}
const bpe = await fs.existsSync(
await joinPath([modelDirPath, 'vocab.json'])
)
await executeOnMain(
NODE,
'convert',
modelDirPath,
modelOutPath + '.temp',
{
ctx,
bpe,
}
)
}
await executeOnMain(
NODE,
'renameSync',
modelOutPath + '.temp',
modelOutPath
)
for (const file of await fs.readdirSync(modelDirPath)) {
if (
modelOutPath.endsWith(file) ||
(file.endsWith('config.json') && !file.endsWith('_config.json'))
)
continue
await fs.unlinkSync(await joinPath([modelDirPath, file]))
}
}
/**
* Quantizes a GGUF model.
* @param repoID - The repo ID of the model to quantize.
* @param quantization - The quantization to use.
* @returns A promise that resolves when the quantization is complete.
*/
async quantize(repoID: string, quantization: Quantization): Promise<void> {
if (this.interrupted) return
const modelDirPath = await this.getModelDirPath(repoID)
const modelOutPath = await this.getQuantizedModelPath(repoID, quantization)
if (!(await fs.existsSync(modelDirPath))) {
throw new Error('Model dir not found')
}
if (await fs.existsSync(modelOutPath)) return
await executeOnMain(
NODE,
'quantize',
await this.getConvertedModelPath(repoID),
modelOutPath + '.temp',
quantization
)
await executeOnMain(
NODE,
'renameSync',
modelOutPath + '.temp',
modelOutPath
)
await fs.unlinkSync(await this.getConvertedModelPath(repoID))
}
/**
* Cancels the convert of current Hugging Face model.
* @param repoID - The repository ID to cancel.
* @param repoData - The repository data to cancel.
* @returns {Promise<void>} A promise that resolves when the download has been cancelled.
*/
async cancelConvert(
repoID: string,
repoData: HuggingFaceRepoData
): Promise<void> {
this.interrupted = true
const modelDirPath = await this.getModelDirPath(repoID)
const files = this.getFileList(repoData)
for (const file of files) {
const filePath = file
const localPath = await joinPath([modelDirPath, filePath])
await abortDownload(localPath)
}
executeOnMain(NODE, 'killProcesses')
}
}

View File

@ -1,182 +1,47 @@
import { PythonShell } from 'python-shell'
import { spawn, ChildProcess } from 'child_process'
import { resolve as presolve, join as pjoin } from 'path'
import { log, Quantization } from '@janhq/core/node'
import { statSync } from 'fs'
export { renameSync } from 'fs'
import { closeSync, openSync, readSync } from 'fs'
import { Template } from '@huggingface/jinja'
/**
* This is to retrieve the metadata from a GGUF file
* It uses hyllama and jinja from @huggingface module
*/
export const retrieveGGUFMetadata = async (ggufPath: string) => {
try {
const { ggufMetadata } = await import('hyllama')
// Read the first 10 MB of the GGUF file
const fd = openSync(ggufPath, 'r')
const buffer = new Uint8Array(10_000_000)
readSync(fd, buffer, 0, 10_000_000, 0)
closeSync(fd)
let pythonShell: PythonShell | undefined = undefined
let quantizeProcess: ChildProcess | undefined = undefined
// Parse metadata and tensor info
const { metadata } = ggufMetadata(buffer.buffer)
export const getSize = (path: string): number => statSync(path).size
export const killProcesses = () => {
if (pythonShell) {
pythonShell.kill()
pythonShell = undefined
}
if (quantizeProcess) {
quantizeProcess.kill()
quantizeProcess = undefined
const template = new Template(metadata['tokenizer.chat_template'])
const eos_id = metadata['tokenizer.ggml.eos_token_id']
const bos_id = metadata['tokenizer.ggml.bos_token_id']
const eos_token = metadata['tokenizer.ggml.tokens'][eos_id]
const bos_token = metadata['tokenizer.ggml.tokens'][bos_id]
// Parse jinja template
const renderedTemplate = template.render({
add_generation_prompt: true,
eos_token,
bos_token,
messages: [
{
role: 'system',
content: '{system_message}',
},
{
role: 'user',
content: '{prompt}',
},
],
})
return {
...metadata,
parsed_chat_template: renderedTemplate,
}
} catch (e) {
console.log('[MODEL_EXT]', e)
}
}
export const getQuantizeExecutable = (): string => {
let binaryFolder = pjoin(__dirname, '..', 'bin') // Current directory by default
let binaryName = 'quantize'
/**
* The binary folder is different for each platform.
*/
if (process.platform === 'win32') {
binaryFolder = pjoin(binaryFolder, 'win')
binaryName = 'quantize.exe'
} else if (process.platform === 'darwin') {
/**
For macOS: mac-universal covers both Apple Silicon and Intel
*/
binaryFolder = pjoin(binaryFolder, 'mac-universal')
} else {
binaryFolder = pjoin(binaryFolder, 'linux-cpu')
}
return pjoin(binaryFolder, binaryName)
}
export const installDeps = (): Promise<void> => {
return new Promise((resolve, reject) => {
const _pythonShell = new PythonShell(
presolve(__dirname, '..', 'scripts', 'install_deps.py')
)
_pythonShell.on('message', (message) => {
log(`[Install Deps]::Debug: ${message}`)
})
_pythonShell.on('stderr', (stderr) => {
log(`[Install Deps]::Error: ${stderr}`)
})
_pythonShell.on('error', (err) => {
pythonShell = undefined
log(`[Install Deps]::Error: ${err}`)
reject(err)
})
_pythonShell.on('close', () => {
const exitCode = _pythonShell.exitCode
pythonShell = undefined
log(
`[Install Deps]::Debug: Deps installation exited with code: ${exitCode}`
)
exitCode === 0 ? resolve() : reject(exitCode)
})
})
}
export const convertHf = async (
modelDirPath: string,
outPath: string
): Promise<void> => {
return await new Promise<void>((resolve, reject) => {
const _pythonShell = new PythonShell(
presolve(__dirname, '..', 'scripts', 'convert-hf-to-gguf.py'),
{
args: [modelDirPath, '--outfile', outPath],
}
)
pythonShell = _pythonShell
_pythonShell.on('message', (message) => {
log(`[Conversion]::Debug: ${message}`)
})
_pythonShell.on('stderr', (stderr) => {
log(`[Conversion]::Error: ${stderr}`)
})
_pythonShell.on('error', (err) => {
pythonShell = undefined
log(`[Conversion]::Error: ${err}`)
reject(err)
})
_pythonShell.on('close', () => {
const exitCode = _pythonShell.exitCode
pythonShell = undefined
if (exitCode !== 0) {
log(`[Conversion]::Debug: Conversion exited with code: ${exitCode}`)
reject(exitCode)
} else {
resolve()
}
})
})
}
export const convert = async (
modelDirPath: string,
outPath: string,
{ ctx, bpe }: { ctx?: number; bpe?: boolean }
): Promise<void> => {
const args = [modelDirPath, '--outfile', outPath]
if (ctx) {
args.push('--ctx')
args.push(ctx.toString())
}
if (bpe) {
args.push('--vocab-type')
args.push('bpe')
}
return await new Promise<void>((resolve, reject) => {
const _pythonShell = new PythonShell(
presolve(__dirname, '..', 'scripts', 'convert.py'),
{
args,
}
)
_pythonShell.on('message', (message) => {
log(`[Conversion]::Debug: ${message}`)
})
_pythonShell.on('stderr', (stderr) => {
log(`[Conversion]::Error: ${stderr}`)
})
_pythonShell.on('error', (err) => {
pythonShell = undefined
log(`[Conversion]::Error: ${err}`)
reject(err)
})
_pythonShell.on('close', () => {
const exitCode = _pythonShell.exitCode
pythonShell = undefined
if (exitCode !== 0) {
log(`[Conversion]::Debug: Conversion exited with code: ${exitCode}`)
reject(exitCode)
} else {
resolve()
}
})
})
}
export const quantize = async (
modelPath: string,
outPath: string,
quantization: Quantization
): Promise<void> => {
return await new Promise<void>((resolve, reject) => {
const quantizeExecutable = getQuantizeExecutable()
const _quantizeProcess = spawn(quantizeExecutable, [
modelPath,
outPath,
quantization,
])
quantizeProcess = _quantizeProcess
_quantizeProcess.stdout?.on('data', (data) => {
log(`[Quantization]::Debug: ${data}`)
})
_quantizeProcess.stderr?.on('data', (data) => {
log(`[Quantization]::Error: ${data}`)
})
_quantizeProcess.on('close', (code) => {
if (code !== 0) {
log(`[Quantization]::Debug: Quantization exited with code: ${code}`)
reject(code)
} else {
resolve()
}
})
})
}

View File

@ -1,8 +1,8 @@
[
{
"key": "log-enabled",
"title": "App Logging Enabled",
"description": "We recommend enabling this setting to help us improve the app. Your data will be kept private on your computer, and you can opt out at any time.",
"title": "Enable App Logs",
"description": "Saves app logs locally on your computer. This enables you to send us crash reports.",
"controllerType": "checkbox",
"controllerProps": {
"value": true
@ -11,7 +11,7 @@
{
"key": "log-cleaning-interval",
"title": "Log Cleaning Interval",
"description": "Log cleaning interval in milliseconds.",
"description": "Automatically delete local logs after a certain time interval (in milliseconds).",
"controllerType": "input",
"controllerProps": {
"value": "120000",
@ -19,4 +19,4 @@
"textAlign": "right"
}
}
]
]

View File

@ -2,17 +2,30 @@ import React, { ReactNode, forwardRef } from 'react'
import { twMerge } from 'tailwind-merge'
import './styles.scss'
import { Cross2Icon } from '@radix-ui/react-icons'
export interface Props extends React.InputHTMLAttributes<HTMLInputElement> {
textAlign?: 'left' | 'right'
prefixIcon?: ReactNode
suffixIcon?: ReactNode
onClick?: () => void
clearable?: boolean
onClear?: () => void
}
const Input = forwardRef<HTMLInputElement, Props>(
(
{ className, type, textAlign, prefixIcon, suffixIcon, onClick, ...props },
{
className,
type,
textAlign,
prefixIcon,
suffixIcon,
onClick,
onClear,
clearable,
...props
},
ref
) => {
return (
@ -27,6 +40,11 @@ const Input = forwardRef<HTMLInputElement, Props>(
{suffixIcon}
</div>
)}
{clearable && (
<div className="input__clear-icon" onClick={onClear}>
<Cross2Icon className="text-red-200" />
</div>
)}
<input
type={type}
className={twMerge(
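Below is a hedged usage sketch of the new `clearable` / `onClear` props added to this Input; the import path and the state wiring are assumptions for illustration only.

```tsx
// Hypothetical consumer of the extended Input component shown above.
import React, { useState } from 'react'
import { Input } from '@janhq/joi' // assumed import path

const ModelSearchBox = () => {
  const [query, setQuery] = useState('')

  return (
    <Input
      value={query}
      placeholder="Search models..."
      clearable={query.length > 0}               // show the clear icon only while there is text
      onChange={(e) => setQuery(e.target.value)}
      onClear={() => setQuery('')}               // clicking the Cross2Icon empties the field
    />
  )
}

export default ModelSearchBox
```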

View File

@ -40,4 +40,11 @@
padding-right: 32px;
}
}
&__clear-icon {
@apply absolute right-3 top-1/2 -translate-y-1/2 cursor-pointer;
color: hsla(var(--input-icon));
+ .input {
padding: 0 32px;
}
}
}

View File

@ -33,13 +33,16 @@ const Modal = ({
<DialogPrimitive.Portal>
<DialogPrimitive.Overlay className="modal__overlay" />
<DialogPrimitive.Content
aria-describedby={undefined}
className={twMerge(
'modal__content',
fullPage && 'modal__content--fullpage',
className
)}
>
<div className="modal__title">{title}</div>
<DialogPrimitive.Title className="modal__title">
{title}
</DialogPrimitive.Title>
{content}
{!hideClose && (
<ModalClose asChild>

View File

@ -42,7 +42,7 @@ fieldset,
}
&__title {
@apply line-clamp-1;
@apply leading-relaxed;
margin: 0 0 8px 0;
padding-right: 16px;
font-weight: 600;

View File

@ -9,7 +9,7 @@ const ScrollArea = React.forwardRef<
React.ComponentPropsWithoutRef<typeof ScrollAreaPrimitive.Root>
>(({ className, children, onScroll, ...props }, ref) => (
<ScrollAreaPrimitive.Root
type="scroll"
type="auto"
className={twMerge('scroll-area__root', className)}
{...props}
>

View File

@ -53,8 +53,8 @@
}
::-webkit-scrollbar {
width: 6px;
height: 6px;
width: 8px;
height: 8px;
}
::-webkit-scrollbar-track,
::-webkit-scrollbar-thumb {

View File

@ -10,7 +10,7 @@
animation-timing-function: cubic-bezier(0.16, 1, 0.3, 1);
will-change: transform, opacity;
font-weight: 500;
z-index: 100;
z-index: 999999999;
max-width: 240px;
@apply text-sm leading-normal;
}

View File

@ -41,14 +41,16 @@
"build": "yarn build:web && yarn build:electron",
"build:publish": "yarn copy:assets && yarn build:web && yarn workspace jan build:publish",
"dev:joi": "yarn workspace @janhq/joi install && yarn workspace @janhq/joi dev",
"build:joi": "yarn workspace @janhq/joi install && yarn workspace @janhq/joi build"
"build:joi": "yarn workspace @janhq/joi install && yarn workspace @janhq/joi build",
"prepare": "husky"
},
"devDependencies": {
"concurrently": "^8.2.1",
"cpx": "^1.5.0",
"husky": "^9.1.5",
"rimraf": "^3.0.2",
"wait-on": "^7.0.1",
"run-script-os": "^1.1.6"
"run-script-os": "^1.1.6",
"wait-on": "^7.0.1"
},
"version": "0.0.0"
}

View File

@ -1,14 +0,0 @@
spec:
@echo "Initiating a Spec..."
@last_number=$$(ls $(CURDIR)/jan-[0-9][0-9][0-9]-* | sort -V | tail -n 1 | cut -d '-' -f 2); \
last_number=$$(echo $$last_number | sed 's/^0*//'); \
next_number=$$(printf "%03d" $$(( $$last_number + 1 ))); \
read -p "Enter Spec title: " title; \
title=$$(echo $$title | tr ' ' '-'); \
cp $(CURDIR)/spec-template.md $(CURDIR)/jan-$$next_number-$$title.md; \
date=$$(date +%Y-%m-%d); \
usernames=$$(git config user.name); \
sed -i '' 's/{SPEC-NUM}/'$$next_number'/g' $(CURDIR)/jan-$$next_number-$$title.md; \
sed -i '' 's/{TITLE}/'$$title'/g' $(CURDIR)/jan-$$next_number-$$title.md; \
sed -i '' 's/{DATE}/'$$date'/g' $(CURDIR)/jan-$$next_number-$$title.md; \
sed -i '' 's/{USERNAMES}/'$$usernames'/g' $(CURDIR)/jan-$$next_number-$$title.md

188
specs/QA-checklist.md Normal file
View File

@ -0,0 +1,188 @@
# Regression test
**Release Version:** v0.6.0
**Operating System:**
---
## A. Installation, Update, and Uninstallation
### 1. Users install app (New user flow)
- [ ] :rocket: Installation package is not corrupted and passes all security checks.
- [ ] :key: App launches successfully after installation.
### 2. Users update app (Existing user flow)
- [ ] :key: Validate that the update does not corrupt user data or settings.
- [ ] :key: App restarts or prompts the user to restart after an update.
- [ ] When updating the app, check whether the JSON/YML files in the `/models` directory change according to the update.
- [ ] Updating the app also updates extensions correctly; test any resulting functionality changes.
### 3. Users uninstall / close app
- [ ] :key: After closing the app, all models are unloaded.
- [ ] :key::warning: Uninstallation process removes the app successfully from the system.
- [ ] Clean the data folder and open the app to check if it creates all the necessary folders, especially models and extensions.
## B. Overview
### 1. Shortcut key
- [ ] :key: Test each shortcut key to confirm it works as described (My models, navigating, opening, closing, etc.).
### 2. Users check the `active model`
- [ ] :key: The app correctly displays the state of the loading model (e.g., loading, ready, error).
- [ ] :key: Confirm that the app allows users to switch between models if multiple are available.
- [ ] Check that the app provides feedback or instructions if the model fails to load.
- [ ] Verify the troubleshooting assistant correctly captures hardware / log info [#1784](https://github.com/janhq/jan/issues/1784)
## C. Thread
### 1. Users can chat with Jan, the default assistant
- [ ] :key: Sending a message enables users to receive responses from model.
- [ ] :key: Conversation thread is maintained without any loss of data upon sending multiple messages.
- [ ] Users should be able to edit a message, and the assistant will re-generate the answer based on the edited version of the message.
- [ ] Test for the ability to send different types of messages (e.g., text, emojis, code blocks).
- [ ] Check the output format of the AI (code blocks, JSON, markdown, ...).
- [ ] :key: Validate the scroll functionality in the chat window for lengthy conversations.
- [ ] User can copy / delete the response.
- [ ] :key: Check that the `clear message` / `delete entire chat` buttons work.
- [ ] Deleting the entire chat retains the model instructions and settings.
- [ ] :key: Appropriate error handling and messaging if the assistant fails to respond.
- [ ] Test assistant's ability to maintain context over multiple exchanges.
- [ ] :key: Check the `create new chat` button; a new conversation should get an automatically generated thread title based on the user's message.
- [ ] The app can handle changing `models` mid-thread.
- [ ] Check that the `regenerate` button renews the response (single / multiple times).
- [ ] Check that the `Instructions` are applied correctly after the user updates them mid-thread.
### 2. Users can customize chat settings like model parameters via both the GUI & model.yml
- [ ] Adjust model parameters (e.g., Temperature, Top K, Top P) from the GUI and verify they are reflected in the chat behavior.
- [ ] :key: Changes can be saved and persisted between sessions.
- [ ] Users can access and modify the model.yml file.
- [ ] :key: Changes made in model.yml are correctly applied to the chat session upon reload or restart.
- [ ] Check the maximum and minimum limits of the adjustable parameters and how they affect the assistant's responses.
- [ ] :key: The app can handle users switching between threads that use different models.
### 3. Model dropdown
- [ ] :key: The model list should highlight recommended models based on the user's RAM (note: this may actually be based on a static formula rather than measured RAM).
- [ ] Model size should display (for both installed and imported models)
### 4. Users can click on a history thread
- [ ] Chat window displays the entire conversation from the selected history thread without any missing messages.
- [ ] Historical threads reflect the exact state of the chat at that time, including settings.
- [ ] :key: Ability to delete or clean old threads.
- [ ] Changing the title of the thread updates correctly.
### 5. Users can configure instructions for the assistant.
- [ ] Instructions set by the user are being followed by the assistant in subsequent conversations.
- [ ] :key: Changes to instructions are updated in real time and do not require a restart of the application or session.
- [ ] :key: Ability to reset instructions to default or clear them completely.
- [ ] :key: RAG - Users can import documents and the system should process queries about the uploaded file, providing accurate and appropriate responses in the conversation thread.
- [ ] :key: Jan can see - Users can import an image, and a model with vision can generate responses (e.g. the LLaVa model). [#294](https://github.com/janhq/jan/issues/294)
## D. Hub
### 1. Users can discover recommended models
- [ ] :key: Each model's recommendations are consistent with the user's activity and preferences.
- [ ] Search for models and verify the results / actions on the results.
### 2. Users can download models suitable for their devices, e.g. compatible with their RAM
- [ ] Model list should be in order: Featured > Remote > Local
- [ ] :key: Ensure that models are labeled with RAM requirements.
- [ ] :key: Check the download model functionality and validate if the cancel download feature works correctly.
### 3. Users can download models via a HuggingFace URL [#1740](https://github.com/janhq/jan/issues/1740)
- [ ] :key: Import via a Hugging Face ID / full HuggingFace URL, and check that the progress bar reflects download progress
- [ ] :key: Test deeplink import [#2876](https://github.com/janhq/jan/issues/2876)
- [ ] :key: Users can use / remove the imported model.
### 4. Users can import new models to the Hub
- [ ] :key: Ensure models import successfully via drag / drop or GGUF upload.
- [ ] :key: Verify that the Move model binary file / Keep Original Files & Symlink options work
- [ ] Users can add more info to the imported model / edit its name
- [ ] :key: Ensure the new model updates after restarting the app.
### 5. Users can use the model as they want
- [ ] :key: Check that the `start` / `stop` / `delete` buttons do exactly what they say.
- [ ] Check that starting another model entirely stops the currently running model.
- [ ] :rocket: Navigate to `hub` > Click the `Use` button to use a model. Expect to jump to the thread and see the model in the model selector dropdown.
- [ ] :key: Check that deleting a model removes all of its files from the user's computer.
- [ ] :warning: The recommended tags should display correctly for the user's hardware.
### 6. Users can Integrate With a Remote Server
- [ ] :key: Import an OpenAI GPT model (https://jan.ai/guides/using-models/integrate-with-remote-server/) and verify the model is displayed in the Hub / Thread dropdown
- [ ] Users can use the remote model properly (OpenAI GPT, Groq)
## E. System Monitor
### 1. Users can see disk and RAM utilization
- [ ] :key: Verify that the RAM and VRAM utilization graphs are accurately reported in real time.
- [ ] :key: Validate that the utilization percentages reflect the actual usage compared to the system's total available resources.
- [ ] :key: Ensure that the system monitor updates dynamically as models run and stop.
### 2. Users can start and stop models based on system health
- [ ] :key: Verify that the `Start/Stop` action for a model is reflected in system resource usage.
- [ ] Confirm that any changes in model status (start/stop) are logged or reported to the user for transparency.
- [ ] :key: Check the functionality of `App log` to ensure it opens the correct folder in the system file explorer.
## F. Settings
### 1. Appearance
- [ ] :key: Test the `Light`, `Dark`, and `System` theme settings to ensure they are functioning as expected.
- [ ] Confirm that the application saves the theme preference and persists it across sessions.
- [ ] Validate that all elements of the UI are compatible with the theme changes and maintain legibility and contrast.
### 2. Extensions [TBU]
- [ ] Validate the `Install Extensions` process by selecting and installing a plugin file.
- [ ] Enable / disable extensions and verify the UI reflects the change accordingly
### 3. Extension group
- [ ] :key: Users can set valid Endpoint and API Key to use remote models
- [ ] The Monitoring extension should allow users to enable / disable logging and set the log cleaning interval
### 4. Advanced settings
- [ ] :key: Test the `Experimental Mode` toggle to confirm it enables or disables experimental features as intended.
- [ ] :key: Check the functionality of `Open App Directory` to ensure it opens the correct folder in the system file explorer.
- [ ] Users can move **Jan data folder**
- [ ] Validate that changes in advanced settings are applied immediately or provide appropriate instructions if a restart is needed.
- [ ] Attempt to download a model from the Hub using an **HTTP Proxy** [guideline](https://github.com/janhq/jan/pull/1562)
- [ ] Logs that are older than 7 days or exceed 1MB in size will be automatically cleared upon starting the application.
- [ ] Users can click on Reset button to **factory reset** app settings to its original state & delete all usage data.
- [ ] Keep the current app data location
- [ ] Reset the current app data location
- [ ] Users can enable the setting and chat using Quick Ask.
### 5. Engine
- [ ] :key: TensorRT Engine - Users are able to chat with the model
- [ ] :key: Onnx Engine - Users are able to chat with the model
- [ ] :key: Other remote Engines - Users are able to chat with the model
## G. Local API server
### 1. Local Server Usage with Server Options
- [ ] :key: Explore the API Reference: Swagger API for sending/receiving requests (see the request sketch after this checklist)
- [ ] Use default server option
- [ ] Configure and use custom server options
- [ ] Test starting/stopping the local API server with different models / model settings
- [ ] Server logs are captured with the correct server options provided
- [ ] Verify functionality of Open logs/Clear feature
- [ ] Ensure that threads and other functions impacting the model are disabled while the local server is running
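As a reference point for the checks above, a minimal request sketch against the local API server; the port, route, and payload shape are assumptions based on an OpenAI-compatible endpoint and should be verified against the Swagger page.

```typescript
// Hypothetical smoke test for the local API server; confirm the real values in the API Reference.
async function pingLocalServer(): Promise<void> {
  const response = await fetch('http://localhost:1337/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3-8b-instruct', // assumed model id
      messages: [{ role: 'user', content: 'Hello from the QA checklist' }],
      stream: false,
    }),
  })

  if (!response.ok) {
    throw new Error(`Server responded with ${response.status}`)
  }
  console.log(await response.json())
}
```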

View File

@ -1,19 +0,0 @@
# Jan Improvement Proposals
This is a repo of key architecture decisions for Jan.
[Read more about ADRs](https://github.com/joelparkerhenderson/architecture-decision-record)
### Get started:
```sh
# In root:
make newadr
```
### Template:
- **Status**: `pending`, `approved`, or `rejected`
- **Context**: a clearly defined problem/goal
- **Decisions**: the proposed architecture choices & changes
- **Consequences**: pros and cons of the decision
- **References**: any relevant materials to read

View File

@ -1,54 +0,0 @@
# ADR #001: Jan deployable cloud-native
## Changelog
- 23.10.03: Initial unfinished draft
- 23.10.16: Remove authentication
## Authors
- @nam-john-ho
- @louis
## Context
### Status Quo
* User doesn't have a local GPU machine but wants to run Jan on a rented server
* User wants a quick, fast way to experiment with Jan on a rented GPU
* https://github.com/janhq/jan/issues/255
## Decision
* This ADR aims to outline design decisions for deploying Jan in cloud-native environments such as Runpod, AWS, Azure, and GCP in a fast and simple way.
* The current code-base should not change too much.
* The current plugins must be reusable across environments (Desktop, Cloud-native).
### Key Design Decisions
![Key Design](images/adr-001-02.png "Key Design")
#### Why middleware
* The /web codebase needs to operate in both browser and electron environments
* The /web codebase needs to route plugin routes accordingly, either to /server or /electron
* Middleware takes care of this
* We will have a /server codebase that takes care of routing to plugins
#### Unsuitable Alternatives
* Not possible to just run electron headless
* /web is on a different chromium window
* Does not have all the electron handlers
* Does not have the IPC handler
## Alternative Approaches
A separate server process runs alongside Electron. https://github.com/janhq/jan/pull/184/commits/6005409a945bb0e80a61132b9eb77f47f19d0aa6
## Considerations
* Due to the limitation of accessing the file system in web browsers, the first version of the web app will load all the current plugins by default, and users will not be able to add, remove, or update plugins.
* Simple authentication will be implemented as a plugin.
## References
- https://www.runpod.io/console/templates
- https://repost.aws/articles/ARQ0Tz9eorSL6EAus7XPMG-Q/how-to-install-textgen-webui-on-aws
- https://www.youtube.com/watch?v=_59AsSyMERQ
- https://gpus.llm-utils.org/running-llama-2-on-runpod-with-oobaboogas-text-generation-webui/
- https://medium.com/@jarimh1984/installing-oobabooga-and-oobabooga-api-to-runpod-cloud-step-by-step-tutorial-47457974dfa5

View File

@ -1,55 +0,0 @@
# ADR #002: Jan AI apps
## Changelog
- Oct 4th 2023: Initial draft
- Oct 6th 2023: Update sample API
## Authors
- @vuonghoainam - Hiro
- @louis-jan
## Status
Proposed
## Context
### Business context
Jan can be a platform and let builders build their own `AI app` using existing tools
- Use-case 1: Medical AI startup uploads "case notes" to Jan, wants to ask it questions (i.e. medical audit)
- Use-case 2: Legal e-discovery: very large amount of documents (~10-15k pages) are uploaded, data is very private and cannot be leaked
- Use-case 3: Jan wants to use Jan to have a QnA chatbot to answer questions on docs
- Use-case 4: Jan wants to use Jan to have a codellama RAG on its own codebase, to generate new PRs
### Extra context
- There are many use cases that the community can develop and sell to users through Jan as plugins. Jan needs to streamline this higher value chain.
- This brings more value and more options to all kinds of users
- This can help build the ecosystem and streamline value end to end (Jan, plugin/model creators, Jan users - enterprise/individual)
- We at Jan cannot build every plugin on our own, but this one should serve as a featured example, much as the [OpenAI Retrieval plugin](https://github.com/openai/chatgpt-retrieval-plugin) does.
- [#232](https://github.com/janhq/jan/issues/232)
## Decision
- Users can browse and install plugins (with a recommended model - llama2, claude, openai …) - This requires plugin dependencies.
- Jan provides a consistent interface for plugin developers to use:
  - Use an LLM (switchable at runtime) - e.g., develop against llama2-7b while the user runs llama2-70b or another model
  - A plugin can have an API for CRUD operations on indices in a vector DB / DB, and Jan only exposes the corresponding data to the app
  - A place for a plugin to store files for persistence
- This works seamlessly on the desktop / Jan-hosted version thanks to the Jan API abstraction.
### Simple UX
![UX](images/adr-002-01.png "UX")
### Component design
![Component design](images/adr-002-02.png "Component design")
## API
- `jan.plugin.<plugin_name>.<function_name>(**args)`
- `jan.core.db.sql.command()` -> CRUD/ query
- `jan.plugin.vectra.<function_name>(**args)` -> CRUD/ query for
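To make the proposed surface more concrete, here is a hypothetical TypeScript sketch of what calling it from an app might look like; the namespace object, the `query` method, and the SQL table are illustrative assumptions drawn from the bullets above, not an implemented interface.

```typescript
// Hypothetical shape of the proposed `jan.*` API; nothing here is implemented as-is.
declare const jan: {
  core: { db: { sql: { command: (query: string, params?: unknown[]) => Promise<unknown> } } }
  plugin: Record<string, Record<string, (...args: unknown[]) => Promise<unknown>>>
}

async function answerFromCaseNotes(question: string): Promise<unknown> {
  // Persist the question through the core SQL surface (table name is made up).
  await jan.core.db.sql.command('INSERT INTO questions (text) VALUES (?)', [question])

  // Ask the vectra plugin to run a similarity query over the uploaded documents.
  return jan.plugin.vectra.query(question, { topK: 5 })
}
```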
## Consequences
- Jan users can build their own AI apps (and buy from others too) in an easy way
- Clear design for plugin and Jan platform development
## Reference
- [ADR-003](adr-003-jan-plugins.md)

View File

@ -1,65 +0,0 @@
# ADR 003: JAN PLUGINS
## Changelog
- Oct 5th 2023: Initial draft
## Status
Accepted
## Context
Modular Architecture w/ Plugins:
- Jan will have an architecture similar to VSCode or k8Lens
- "Desktop Application" whose functionality can be extended thru plugins
- Jan's architecture will need to accommodate plugins for (a) Persistence, (b) IAM, (c) Teams and RBAC, (d) Policy engines, (e) "Apps" (i.e. higher-order business logic), (f) Themes (UI)
- Nitro's architecture will need to accommodate plugins for different "model backends": (a) llama.cpp, (b) rkwk (and others), (c) 3rd-party AIs
## Decision
![Architecture](./images/adr-003-01.png)
## Consequences
What becomes easier or more difficult to do because of this change?
## CoreService API
Jan frontend components will communicate with plugin functions via Service Interfaces:
All of the available APIs are listed in [CoreService](../web/shared/coreService.ts)
- Data Service:
- GET_CONVERSATIONS: retrieve all of the conversations
- CREATE_CONVERSATION: start a new conversation
- DELETE_CONVERSATION: delete an existing conversation
- GET_CONVERSATION_MESSAGES: retrieve a certain conversation messages
- CREATE_MESSAGE: store a new message (both sent & received)
- UPDATE_MESSAGE: update an existing message (streaming)
- STORE_MODEL: store new model information (when clicking download)
- UPDATE_FINISHED_DOWNLOAD: mark a model as downloaded
- GET_UNFINISHED_DOWNLOAD_MODELS: retrieve all unfinished downloading models (TBD)
- GET_FINISHED_DOWNLOAD_MODELS: retrieve all finished downloading models (TBD)
- DELETE_DOWNLOAD_MODEL: delete a model (TBD)
- GET_MODEL_BY_ID: retrieve model information by its ID
- Inference Service:
- INFERENCE_URL: retrieve inference endpoint served by plugin
- INIT_MODEL: runs a model
- STOP_MODEL: stop a running model
- Model Management Service: (TBD)
- GET_AVAILABLE_MODELS: retrieve available models (deprecate soon)
- GET_DOWNLOADED_MODELS: (deprecated)
- DELETE_MODEL: (deprecated)
- DOWNLOAD_MODEL: start to download a model
- SEARCH_MODELS: explore models with search query on HuggingFace (TBD)
- Monitoring service:
- GET_RESOURCES_INFORMATION: retrieve total & used memory information
- GET_CURRENT_LOAD_INFORMATION: retrieve CPU load information
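For illustration, a hypothetical sketch of how a frontend component might call one of these interfaces; the `invokeService` helper and the `Conversation` shape are stand-ins, not the actual CoreService signatures.

```typescript
// Hypothetical invocation of a Data Service interface listed above.
type ServiceName = 'GET_CONVERSATIONS' | 'CREATE_CONVERSATION' | 'DELETE_CONVERSATION'

declare function invokeService<T>(name: ServiceName, ...args: unknown[]): Promise<T>

interface Conversation {
  id: string
  title: string
}

async function loadConversations(): Promise<Conversation[]> {
  // The call is routed to whichever plugin registered GET_CONVERSATIONS.
  return invokeService<Conversation[]>('GET_CONVERSATIONS')
}
```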

View File

@ -1,52 +0,0 @@
# ADR 004: UI Service
## Changelog
- 10 Oct 2023: initial vision @dan-jan @0xSage
## Status
Proposed
## Context
Plugin devs need an API to change the Jan UI. Before we layer on more features, let's ensure good devex for feature building.
## Decision
![Jan UI Framework](./images/jan-ui-framework.png)
- Side-Ribbon: Jan Apps
- This is a protected area, for Apps
- Apps can define Left Panel, Center, and Right Panel
- We will only have 1 App for now (no need to build this abstraction yet)
- Future: Server mode (see LMStudio), Art Studio (Stable Diffusion)
- Side-Ribbon: Global Settings
- These will all open in a modal
- Currently: Model Store, Running Models
- Currently: User Login, Settings
- Main Window and Right Panel
- These will mainly be session-based
- Console: production logs
## UiService API
We need a UI API for Plugins
- e.g. Model Store plugin -> Registers "Global Settings" Icon, defines what will show up in the Modal
- e.g. Model Runner plugin -> Inference Parameters
## Consequences
- Increased code complexity
## Reference
- VSCode
- Obsidian

View File

@ -1,48 +0,0 @@
# ADR 005: model-installation
## Changelog
- 2023-10-18: Initial draft
## Authors
- 0xSage
## Status
Proposed
## Context
There are a few issues with our current model installation method (hardcoding jsons in /models repo):
- Users want to add their own model binaries
- Maintaining /models is too manual
## Decision
Let users download models on their own & manually import them to Jan via an "add a model" UI
Links:
- Github issue: https://github.com/janhq/jan/issues/359
- Related issue: https://github.com/janhq/jan/issues/304
- Designs: https://www.figma.com/file/JdK7cNIBeVdYeHxKiYeWtk/JAN---Web?type=design&node-id=4092-58218&mode=design&t=8OmFSG0E6I8Y3IjY-0
## Consequences
Closed alternate solutions:
- https://github.com/janhq/jan/issues/328
## Alternatives
Thinking through the model selection experience, there are a few possibilities:
1. [current] We hardcode models (via Github) to show up in Explore Models => unnecessarily manual, missing models users want
1. We mirror HF models for a faster download => users can also do nitro add llama2
1. [CHOSEN] Users download models on their own & manually import them to Jan via an "add a model" UI => I like this option actually
1. [LATER] Users paste in a HF link and download the model in Explore Models => do we still render model cards for them?
1. Users manage their own models folder, e.g. /Users/nicole/models, then they set folder path in Jan. => this one needs a lot of designs/fe work
## Reference

View File

@ -1,36 +0,0 @@
# ADR 006: jan-core-module
## Changelog
- 2023-10-19: Initial draft
## Authors
- Louis
## Status
Accepted
## Context
Currently, developers face several challenges while writing a plugin, which include:
- Registering functions using the function name as a string
- Invoking anonymous functions
- No access to native APIs or common functions for data insertion or retrieval
- Lack of communication between the app and plugins.
## Decision
Let developers install and import an npm module to make plugin development easier.
Upon boot, the web app plugs the core modules into the window; its components and plugins can then import the core to access the exposed functions.
![Jan Core Module](./images/jan-core-module.png)
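A hypothetical sketch of the developer experience this decision targets; the helper names and the registered function are illustrative assumptions, not the published core API.

```typescript
// Hypothetical core-module surface; the declarations below are stand-ins, not a real package.
declare function registerFunction(
  name: string,
  fn: (...args: unknown[]) => Promise<unknown>
): void
declare function invokeNative<T>(name: string, ...args: unknown[]): Promise<T>

// A plugin registers a named function instead of passing function names around as raw strings.
registerFunction('summarizeConversation', async (conversationId) => {
  // Data retrieval goes through a core-exposed API rather than ad-hoc IPC.
  const messages = await invokeNative<string[]>('getConversationMessages', conversationId)
  return messages.join('\n').slice(0, 500)
})
```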
## Consequences
Separate PRs should be created for updating the core and app. For instance, if a new app enhancement requires the core module to expose a new API, a new core update must be published on npm to prevent CI failure.
## Alternatives
## Reference

View File

@ -1,35 +0,0 @@
# ADR 007: jan-plugin-catalog
## Changelog
- 2023-10-19: Initial draft
## Authors
- Louis
## Status
Proposed
## Context
Users should be able to explore plugins, and developers need a channel to publish their plugins
Lesson learned from the Model Catalog: we hosted everything on GitHub and attempted to retrieve it anonymously, which cost us a lot of effort and led to a rate limit issue. Say there are N items in the catalog; we attempted to send N+1 requests at a time, which was costly and hit the API rate limit.
## Decision
1. Combine all JSON items in the catalog into one JSON catalog. Now we just need to work with one catalog file, which means only one request, but the rate limit issue still exists.
2. CDN - there are cool services out there which support OSS projects, such as [JSDELIVR](https://www.jsdelivr.com).
3. Downloading a JSON file is not a good approach, though. Exporting a module works better. Webpack + DefinePlugin should work.
4. Since we have created a new module, we want to publish it as well. Let's publish it on npm so everyone can install and use it. This is also to add a versioning feature.
5. Installing this npm module would require the user to update their app to the latest version. Instead, let's import the remote module via CDN, which requires just a few lines of code.
![Jan Plugin Catalog](./images/jan-plugin-catalog.png)
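A hypothetical sketch of option 5, importing the catalog as a remote module from a CDN; the URL, package name, and export shape are assumptions for illustration.

```typescript
// Hypothetical remote import of the plugin catalog (option 5 above); URL and shape are made up.
interface PluginCatalogEntry {
  name: string
  version: string
  downloadUrl: string
}

async function loadPluginCatalog(): Promise<PluginCatalogEntry[]> {
  // Keeping the specifier in a variable avoids bundling it; the catalog can then be
  // updated on the CDN without shipping a new app version.
  const catalogUrl = 'https://cdn.jsdelivr.net/npm/jan-plugin-catalog@latest/+esm'
  const mod = await import(/* webpackIgnore: true */ catalogUrl)
  return mod.default as PluginCatalogEntry[]
}
```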
## Consequences
## Alternatives
## Reference

Some files were not shown because too many files have changed in this diff.