feat: add prompt evaluation script for develop and deploy

Parent: b6813f1c7a
Commit: 0225bb6b1e

@@ -140,8 +140,6 @@ MAX_TURNS=50 DELAY_BETWEEN_TESTS=5 python main.py

## Migration Testing Arguments

| Argument | Environment Variable | Default | Description |
|----------|---------------------|---------|-------------|
| `--enable-migration-test` | `ENABLE_MIGRATION_TEST` | `false` | Enable migration testing mode |

@@ -150,6 +148,15 @@ MAX_TURNS=50 DELAY_BETWEEN_TESTS=5 python main.py
| `--old-version` | `OLD_VERSION` | - | Path to old version installer |
| `--new-version` | `NEW_VERSION` | - | Path to new version installer |

## Reliability Testing Arguments

| Argument | Environment Variable | Default | Description |
|----------|---------------------|---------|-------------|
| `--enable-reliability-test` | `ENABLE_RELIABILITY_TEST` | `false` | Enable reliability testing mode |
| `--reliability-phase` | `RELIABILITY_PHASE` | `development` | Testing phase: development (5 runs) or deployment (20 runs) |
| `--reliability-runs` | `RELIABILITY_RUNS` | `0` | Custom number of runs (overrides phase setting) |
| `--reliability-test-path` | `RELIABILITY_TEST_PATH` | - | Specific test file path for reliability testing |

**Examples:**

```bash
# Basic migration test

@@ -216,6 +223,52 @@ python main.py \
--enable-reportportal \
--rp-token "YOUR_TOKEN" \
--rp-project "jan_migration_tests"

# Reliability testing - deployment phase with ReportPortal
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--reliability-test-path "tests/base/default-jan-assistant.txt" \
--max-turns 50 \
--enable-reportportal \
--rp-token "YOUR_TOKEN" \
--rp-project "jan_reliability_tests"
```

### Reliability Testing

```bash
# Development phase reliability test (5 runs)
python main.py \
--enable-reliability-test \
--reliability-phase development \
--max-turns 40

# Deployment phase reliability test (20 runs)
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--max-turns 40

# Custom number of runs
python main.py \
--enable-reliability-test \
--reliability-runs 10 \
--max-turns 40

# Test specific file with reliability testing
python main.py \
--enable-reliability-test \
--reliability-phase development \
--reliability-test-path "tests/base/default-jan-assistant.txt" \
--max-turns 40

# Reliability testing with ReportPortal
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--enable-reportportal \
--rp-token "YOUR_TOKEN" \
--max-turns 40
```

### Advanced Configuration

@@ -265,13 +318,19 @@ python main.py \
- `TESTS_DIR`: Test files directory
- `DELAY_BETWEEN_TESTS`: Delay between tests

### Migration Testing
- `ENABLE_MIGRATION_TEST`: Enable migration mode
- `MIGRATION_TEST_CASE`: Migration test case
- `MIGRATION_BATCH_MODE`: Use batch mode
- `OLD_VERSION`: Old installer path
- `NEW_VERSION`: New installer path

### Reliability Testing
- `ENABLE_RELIABILITY_TEST`: Enable reliability testing mode
- `RELIABILITY_PHASE`: Testing phase (development/deployment)
- `RELIABILITY_RUNS`: Custom number of runs
- `RELIABILITY_TEST_PATH`: Specific test file path

## Help and Information

### Get Help

@@ -46,6 +46,28 @@ python main.py \
--max-turns 75
```

### 4. Reliability Testing

```bash
# Development phase (5 runs)
python main.py \
--enable-reliability-test \
--reliability-phase development \
--max-turns 40

# Deployment phase (20 runs)
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--max-turns 40

# Custom number of runs
python main.py \
--enable-reliability-test \
--reliability-runs 10 \
--max-turns 40
```

## Test Types

### Base Test Cases

@@ -61,6 +83,11 @@ python main.py \
- **`assistants`**: Test custom assistants persist after upgrade
- **`assistants-complete`**: Test both creation and chat functionality

### Reliability Testing
- **Development Phase**: Run test 5 times to verify basic stability (≥80% success rate)
- **Deployment Phase**: Run test 20 times to verify production readiness (≥90% success rate)
- **Custom Runs**: Specify custom number of runs for specific testing needs

## Common Commands

### Basic Workflow

@@ -101,7 +128,18 @@ python main.py \
--migration-batch-mode \
--old-version "path/to/old.exe" \
--new-version "path/to/new.exe"

# Test reliability - development phase
python main.py \
--enable-reliability-test \
--reliability-phase development \
--max-turns 40

# Test reliability - deployment phase
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--max-turns 40
```

## Configuration Options

@@ -130,6 +168,14 @@ python main.py \
| `--rp-endpoint` | RP endpoint URL | No |
| `--rp-project` | RP project name | No |

### Reliability Testing Arguments
| Argument | Description | Required |
|----------|-------------|----------|
| `--enable-reliability-test` | Enable reliability mode | Yes |
| `--reliability-phase` | Testing phase (development/deployment) | No |
| `--reliability-runs` | Custom number of runs | No |
| `--reliability-test-path` | Specific test file path | No |

## Environment Variables

```bash

@@ -179,6 +225,19 @@ python main.py \
--rp-project "jan_migration_tests"
```

### Example 4: Reliability Testing
```bash
# Test reliability with deployment phase
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--reliability-test-path "tests/base/default-jan-assistant.txt" \
--max-turns 50 \
--enable-reportportal \
--rp-token "YOUR_TOKEN" \
--rp-project "jan_reliability_tests"
```

## Troubleshooting

### Common Issues

@@ -12,6 +12,7 @@
- 🎯 **Flexible Configuration**: Command-line arguments and environment variables
- 🌐 **Cross-platform**: Windows, macOS, and Linux support
- 📁 **Test Discovery**: Automatically scans test files from directory
- 🧪 **Reliability Testing**: Run tests multiple times to verify stability (development: 5 runs, deployment: 20 runs)

## Prerequisites

@@ -74,6 +75,25 @@ python main.py \
--rp-token "YOUR_API_TOKEN"
```

### Reliability Testing

```bash
# Development phase (5 runs) - verify basic stability
python main.py --enable-reliability-test --reliability-phase development

# Deployment phase (20 runs) - verify production readiness
python main.py --enable-reliability-test --reliability-phase deployment

# Custom number of runs
python main.py --enable-reliability-test --reliability-runs 10

# Test specific file with reliability testing
python main.py \
--enable-reliability-test \
--reliability-phase development \
--reliability-test-path "tests/base/default-jan-assistant.txt"
```

## Configuration

### Command Line Arguments

autoqa/RELIABILITY_TESTING.md (new file, 296 lines)
@@ -0,0 +1,296 @@

# AutoQA Reliability Testing Guide

🚀 Comprehensive guide for running reliability tests with AutoQA to verify test case stability and reliability.

## Overview

Reliability testing is designed to verify that your test cases are stable and reliable by running them multiple times. This helps identify flaky tests and ensures consistent behavior before deploying to production.

## Two Testing Phases

### 1. Development Phase
- **Purpose**: Verify basic stability during development
- **Runs**: 5 times
- **Success Rate Requirement**: ≥80%
- **Use Case**: During development to catch obvious stability issues

### 2. Deployment Phase
- **Purpose**: Verify production readiness
- **Runs**: 20 times
- **Success Rate Requirement**: ≥90%
- **Use Case**: Before deploying to production to ensure reliability (see the sketch below)
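
The two phases boil down to a run count plus a minimum success rate. The sketch below is illustrative only: `PHASE_SETTINGS` and `meets_requirement` are not names from the AutoQA codebase (the real logic lives in `reliability_runner.py`), but the 5/20 run counts and the 80%/90% thresholds mirror the requirements above.

```python
# Illustrative only: phase name -> (number of runs, minimum success rate in percent).
PHASE_SETTINGS = {
    "development": (5, 80.0),   # quick stability check while a test is being written
    "deployment": (20, 90.0),   # stricter bar before a test gates production
}


def meets_requirement(phase: str, successful_runs: int, completed_runs: int) -> bool:
    """Return True if the observed success rate satisfies the phase requirement."""
    _, min_rate = PHASE_SETTINGS[phase]
    rate = (successful_runs / completed_runs) * 100 if completed_runs else 0.0
    return rate >= min_rate
```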

## Command Line Usage

### Basic Reliability Testing

```bash
# Development phase (5 runs)
python main.py --enable-reliability-test --reliability-phase development

# Deployment phase (20 runs)
python main.py --enable-reliability-test --reliability-phase deployment
```

### Custom Configuration

```bash
# Custom number of runs
python main.py --enable-reliability-test --reliability-runs 10

# Specific test file
python main.py --enable-reliability-test --reliability-test-path "tests/base/default-jan-assistant.txt"

# Custom max turns
python main.py --enable-reliability-test --reliability-phase development --max-turns 50
```

### With ReportPortal Integration

```bash
# Development phase with ReportPortal
python main.py \
--enable-reliability-test \
--reliability-phase development \
--enable-reportportal \
--rp-token "YOUR_TOKEN" \
--rp-project "jan_reliability_tests"

# Deployment phase with ReportPortal
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--enable-reportportal \
--rp-token "YOUR_TOKEN" \
--rp-project "jan_reliability_tests"
```

## Environment Variables

```bash
# Enable reliability testing
export ENABLE_RELIABILITY_TEST=true

# Set phase
export RELIABILITY_PHASE=deployment

# Custom runs (overrides phase)
export RELIABILITY_RUNS=15

# Specific test path
export RELIABILITY_TEST_PATH="tests/base/my-test.txt"

# Run with environment variables
python main.py --enable-reliability-test
```

## Command Line Arguments

| Argument | Environment Variable | Default | Description |
|----------|---------------------|---------|-------------|
| `--enable-reliability-test` | `ENABLE_RELIABILITY_TEST` | `false` | Enable reliability testing mode |
| `--reliability-phase` | `RELIABILITY_PHASE` | `development` | Testing phase: development or deployment |
| `--reliability-runs` | `RELIABILITY_RUNS` | `0` | Custom number of runs (overrides phase) |
| `--reliability-test-path` | `RELIABILITY_TEST_PATH` | - | Specific test file path |

## Test Execution Flow

### Single Test Reliability Testing

1. **Load Test File**: Read the specified test file
2. **Run Multiple Times**: Execute the test the specified number of times
3. **Track Results**: Monitor success/failure for each run
4. **Calculate Success Rate**: Determine overall reliability
5. **Generate Report**: Provide detailed results and statistics (see the sketch after this list)
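
As a rough sketch of this flow (the real implementation is `run_reliability_test` in `reliability_runner.py`, which also records per-run timing, logs progress, waits between runs, and can upload results to ReportPortal; `run_once` below is a hypothetical stand-in for the actual test runner):

```python
# Simplified sketch of the single-test flow; not the actual AutoQA implementation.
def reliability_summary(test_path: str, target_runs: int, run_once) -> dict:
    with open(test_path, "r", encoding="utf-8") as f:   # 1. load the test file
        prompt = f.read()

    successful_runs = 0
    for _ in range(target_runs):                        # 2. run it multiple times
        if run_once(prompt):                            # 3. track success/failure per run
            successful_runs += 1

    success_rate = (successful_runs / target_runs) * 100    # 4. calculate the success rate
    return {                                                 # 5. report the results
        "test_path": test_path,
        "successful_runs": successful_runs,
        "completed_runs": target_runs,
        "success_rate": success_rate,
    }
```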

### Multiple Tests Reliability Testing

1. **Scan Test Files**: Find all test files in the specified directory
2. **Run Reliability Tests**: Execute reliability testing on each test file
3. **Aggregate Results**: Combine results from all tests
4. **Overall Assessment**: Determine if the entire test suite is reliable (see the sketch below)
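
The multi-test flow is essentially an outer loop around the single-test sketch above: run every discovered file, then treat the suite as reliable only if no file falls below its phase threshold, which mirrors how `run_reliability_tests` marks the suite failed as soon as any test fails. `reliability_summary` and `meets_requirement` refer to the illustrative sketches above, not to real AutoQA helpers.

```python
# Illustrative outer loop; reuses the sketches above rather than real AutoQA code.
def suite_is_reliable(test_paths, phase: str, target_runs: int, run_once) -> bool:
    results = [reliability_summary(path, target_runs, run_once) for path in test_paths]
    # The whole suite passes only if every individual test meets the phase requirement.
    return all(
        meets_requirement(phase, r["successful_runs"], r["completed_runs"])
        for r in results
    )
```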

## Output and Results

### Success Rate Calculation

```
Success Rate = (Successful Runs / Total Runs) × 100
```
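
For example, 4 successful runs out of 5 completed runs (the numbers in the sample output below) give (4 / 5) × 100 = 80.0%, which meets the development requirement of ≥80% but would fall short of the deployment requirement of ≥90%.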

### Development Phase Requirements
- **Target**: 5 runs
- **Minimum Success Rate**: 80%
- **Result**: PASS if ≥80%, FAIL if <80%

### Deployment Phase Requirements
- **Target**: 20 runs
- **Minimum Success Rate**: 90%
- **Result**: PASS if ≥90%, FAIL if <90%

### Sample Output

```
==========================================
RELIABILITY TEST SUMMARY
==========================================
Test: tests/base/default-jan-assistant.txt
Phase: DEVELOPMENT
Completed runs: 5/5
Successful runs: 4
Failed runs: 1
Success rate: 80.0%
Total duration: 125.3 seconds
Average duration per run: 25.1 seconds
Overall result: ✅ PASSED
Development phase requirement: ≥80% success rate
```

## Use Cases

### 1. New Test Development
```bash
# Test a new test case for basic stability
python main.py \
--enable-reliability-test \
--reliability-phase development \
--reliability-test-path "tests/base/my-new-test.txt"
```

### 2. Pre-Production Validation
```bash
# Verify test suite is production-ready
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--tests-dir "tests/base"
```

### 3. Flaky Test Investigation
```bash
# Run a potentially flaky test multiple times
python main.py \
--enable-reliability-test \
--reliability-runs 25 \
--reliability-test-path "tests/base/flaky-test.txt"
```

### 4. CI/CD Integration
```bash
# Automated reliability testing in CI/CD
ENABLE_RELIABILITY_TEST=true \
RELIABILITY_PHASE=deployment \
python main.py --max-turns 40
```

## Best Practices

### 1. Start with Development Phase
- Begin with 5 runs to catch obvious issues
- Use during active development
- Quick feedback on test stability

### 2. Use Deployment Phase for Production
- Run 20 times before production deployment
- Ensures high reliability standards
- Catches intermittent failures

### 3. Custom Runs for Specific Needs
- Use custom run counts for special testing scenarios
- Investigate flaky tests with higher run counts
- Balance between thoroughness and execution time

### 4. Monitor Execution Time
- Reliability testing takes longer than single runs
- Plan accordingly for CI/CD pipelines
- Consider parallel execution for multiple test files

## Troubleshooting

### Common Issues

#### 1. Test File Not Found
```bash
# Ensure test path is correct
python main.py \
--enable-reliability-test \
--reliability-test-path "tests/base/existing-test.txt"
```

#### 2. Low Success Rate
- Check test environment stability
- Verify test dependencies
- Review test logic for race conditions

#### 3. Long Execution Time
- Reduce max turns if appropriate
- Use development phase for quick feedback
- Consider running fewer test files

### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
export PYTHONPATH=.

# Run with verbose output
python main.py --enable-reliability-test --reliability-phase development
```

## Integration with Existing Workflows

### Migration Testing
```bash
# Run reliability tests on migration test cases
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--tests-dir "tests/migration"
```

### Base Testing
```bash
# Run reliability tests on base test cases
python main.py \
--enable-reliability-test \
--reliability-phase development \
--tests-dir "tests/base"
```

### Custom Test Directories
```bash
# Run reliability tests on custom test directory
python main.py \
--enable-reliability-test \
--reliability-phase deployment \
--tests-dir "my_custom_tests"
```

## Performance Considerations

### Execution Time
- **Development Phase**: ~5x single test execution time
- **Deployment Phase**: ~20x single test execution time
- **Multiple Tests**: Multiply by number of test files (worked example below)
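
As a rough worked example, at the ~25 seconds per run reported in the sample output above, a development pass costs about 5 × 25 s ≈ 2 minutes per test file and a deployment pass about 20 × 25 s ≈ 8–9 minutes, before the short fixed delays between runs and between test files.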

### Resource Usage
- Screen recordings for each run
- Trajectory data for each run
- ReportPortal uploads (if enabled)

### Optimization Tips
- Use development phase for quick feedback
- Run deployment phase during off-peak hours
- Consider parallel execution for multiple test files
- Clean up old recordings and trajectories regularly

## Next Steps

1. **Start Simple**: Begin with development phase on single test files
2. **Scale Up**: Move to deployment phase for critical tests
3. **Automate**: Integrate into CI/CD pipelines
4. **Monitor**: Track reliability trends over time
5. **Improve**: Use results to identify and fix flaky tests

For more information, see the main [README.md](README.md), [QUICK_START.md](QUICK_START.md), and explore the test files in the `tests/` directory.

@@ -209,7 +209,7 @@ async def run_batch_migration_test(computer, old_version_path, new_version_path,
                test_case_setup_success = False
                continue

            with open(setup_test_path, "r", encoding="utf-8") as f:
                setup_content = f.read()

            setup_test_data = {

@@ -331,7 +331,7 @@ async def run_batch_migration_test(computer, old_version_path, new_version_path,
                test_case_verify_success = False
                continue

            with open(verify_test_path, "r", encoding="utf-8") as f:
                verify_content = f.read()

            verify_test_data = {

@@ -101,7 +101,7 @@ async def run_individual_migration_test(computer, test_case_key, old_version_pat
    if not os.path.exists(setup_test_path):
        raise FileNotFoundError(f"Setup test file not found: {setup_test_path}")

    with open(setup_test_path, "r", encoding="utf-8") as f:
        setup_content = f.read()

    setup_test_data = {

@@ -151,7 +151,7 @@ async def run_individual_migration_test(computer, test_case_key, old_version_pat
    if not os.path.exists(verify_test_path):
        raise FileNotFoundError(f"Verification test file not found: {verify_test_path}")

    with open(verify_test_path, "r", encoding="utf-8") as f:
        verify_content = f.read()

    verify_test_data = {

autoqa/main.py (110 changed lines)

@@ -13,6 +13,7 @@ from reportportal_client.helpers import timestamp
from utils import scan_test_files
from test_runner import run_single_test_with_timeout
from individual_migration_runner import run_individual_migration_test, run_all_migration_tests, MIGRATION_TEST_CASES
from reliability_runner import run_reliability_test, run_reliability_tests

# Configure logging
logging.basicConfig(

@@ -184,8 +185,21 @@ Examples:
  # Run with different model
  python main.py --model-name "gpt-4" --model-base-url "https://api.openai.com/v1"

  # Reliability testing - development phase (5 runs)
  python main.py --enable-reliability-test --reliability-phase development

  # Reliability testing - deployment phase (20 runs)
  python main.py --enable-reliability-test --reliability-phase deployment

  # Reliability testing - custom number of runs
  python main.py --enable-reliability-test --reliability-runs 10

  # Reliability testing - specific test file
  python main.py --enable-reliability-test --reliability-test-path "tests/base/default-jan-assistant.txt"

  # Using environment variables
  ENABLE_REPORTPORTAL=true RP_TOKEN=xxx MODEL_NAME=gpt-4 python main.py
  ENABLE_RELIABILITY_TEST=true RELIABILITY_PHASE=deployment python main.py
"""
    )

@@ -321,6 +335,32 @@ Examples:
        help='List available migration test cases and exit'
    )

    # Reliability testing arguments
    reliability_group = parser.add_argument_group('Reliability Testing Configuration')
    reliability_group.add_argument(
        '--enable-reliability-test',
        action='store_true',
        default=os.getenv('ENABLE_RELIABILITY_TEST', 'false').lower() == 'true',
        help='Enable reliability testing mode (env: ENABLE_RELIABILITY_TEST, default: false)'
    )
    reliability_group.add_argument(
        '--reliability-phase',
        choices=['development', 'deployment'],
        default=os.getenv('RELIABILITY_PHASE', 'development'),
        help='Reliability testing phase: development (5 runs) or deployment (20 runs) (env: RELIABILITY_PHASE, default: development)'
    )
    reliability_group.add_argument(
        '--reliability-runs',
        type=int,
        default=int(os.getenv('RELIABILITY_RUNS', '0')),
        help='Custom number of runs for reliability testing (overrides phase setting) (env: RELIABILITY_RUNS, default: 0)'
    )
    reliability_group.add_argument(
        '--reliability-test-path',
        default=os.getenv('RELIABILITY_TEST_PATH'),
        help='Specific test file path for reliability testing (env: RELIABILITY_TEST_PATH, if not specified, uses --tests-dir)'
    )

    args = parser.parse_args()

    # Handle list migration tests

@@ -407,6 +447,17 @@ async def main():
    if args.enable_migration_test:
        logger.info(f"Old version installer: {args.old_version}")
        logger.info(f"New version installer: {args.new_version}")
    logger.info(f"Reliability testing: {'ENABLED' if args.enable_reliability_test else 'DISABLED'}")
    if args.enable_reliability_test:
        logger.info(f"Reliability phase: {args.reliability_phase}")
        if args.reliability_runs > 0:
            logger.info(f"Custom runs: {args.reliability_runs}")
        else:
            logger.info(f"Phase runs: {5 if args.reliability_phase == 'development' else 20}")
        if args.reliability_test_path:
            logger.info(f"Specific test path: {args.reliability_test_path}")
        else:
            logger.info(f"Tests directory: {args.tests_dir}")
    logger.info("======================")

    # Initialize ReportPortal client only if enabled

@@ -463,8 +514,65 @@ async def main():
    await computer.run()
    logger.info("Computer environment ready")

    # Check if reliability testing is enabled
    if args.enable_reliability_test:
        logger.info("=" * 60)
        logger.info("RELIABILITY TESTING MODE ENABLED")
        logger.info("=" * 60)
        logger.info(f"Phase: {args.reliability_phase}")
        if args.reliability_runs > 0:
            logger.info(f"Custom runs: {args.reliability_runs}")
        else:
            logger.info(f"Phase runs: {5 if args.reliability_phase == 'development' else 20}")

        # Determine test paths for reliability testing
        if args.reliability_test_path:
            # Use specific test path
            if not os.path.exists(args.reliability_test_path):
                logger.error(f"Reliability test file not found: {args.reliability_test_path}")
                final_exit_code = 1
                return final_exit_code
            test_paths = [args.reliability_test_path]
            logger.info(f"Running reliability test on specific file: {args.reliability_test_path}")
        else:
            # Use tests directory
            test_files = scan_test_files(args.tests_dir)
            if not test_files:
                logger.warning(f"No test files found in directory: {args.tests_dir}")
                return
            test_paths = [test_data['path'] for test_data in test_files]
            logger.info(f"Running reliability tests on {len(test_paths)} test files from: {args.tests_dir}")

        # Run reliability tests
        reliability_results = await run_reliability_tests(
            computer=computer,
            test_paths=test_paths,
            rp_client=rp_client,
            launch_id=launch_id,
            max_turns=args.max_turns,
            jan_app_path=args.jan_app_path,
            jan_process_name=args.jan_process_name,
            agent_config=agent_config,
            enable_reportportal=args.enable_reportportal,
            phase=args.reliability_phase,
            runs=args.reliability_runs if args.reliability_runs > 0 else None
        )

        # Handle reliability test results
        if reliability_results and reliability_results.get("overall_success", False):
            logger.info("[SUCCESS] Reliability testing completed successfully!")
            final_exit_code = 0
        else:
            logger.error("[FAILED] Reliability testing failed!")
            if reliability_results and reliability_results.get("error_message"):
                logger.error(f"Error: {reliability_results['error_message']}")
            final_exit_code = 1

        # Skip regular test execution in reliability mode
        logger.info("Reliability testing completed. Skipping regular test execution.")

    # Check if migration testing is enabled
    elif args.enable_migration_test:
        logger.info("=" * 60)
        logger.info("MIGRATION TESTING MODE ENABLED")
        logger.info("=" * 60)

autoqa/reliability_runner.py (new file, 334 lines)
@@ -0,0 +1,334 @@

import asyncio
import logging
import os
import time
from datetime import datetime
from pathlib import Path

from test_runner import run_single_test_with_timeout
from utils import scan_test_files

logger = logging.getLogger(__name__)


async def run_reliability_test(computer, test_path, rp_client=None, launch_id=None,
                               max_turns=30, jan_app_path=None, jan_process_name="Jan.exe",
                               agent_config=None, enable_reportportal=False,
                               phase="development", runs=None):
    """
    Run a single test case multiple times to verify reliability and stability

    Args:
        computer: Computer agent instance
        test_path: Path to the test file to run
        rp_client: ReportPortal client (optional)
        launch_id: ReportPortal launch ID (optional)
        max_turns: Maximum turns per test
        jan_app_path: Path to Jan application
        jan_process_name: Jan process name for monitoring
        agent_config: Agent configuration
        enable_reportportal: Whether to upload to ReportPortal
        phase: "development" (5 runs) or "deployment" (20 runs)
        runs: Number of runs to execute (overrides phase if specified)

    Returns:
        dict with reliability test results
    """
    # Determine number of runs: an explicit runs value overrides the phase defaults
    if runs:
        target_runs = runs
    elif phase == "development":
        target_runs = 5
    elif phase == "deployment":
        target_runs = 20
    else:
        target_runs = 5

    logger.info("=" * 100)
    logger.info(f"RELIABILITY TESTING: {test_path.upper()}")
    logger.info("=" * 100)
    logger.info(f"Phase: {phase.upper()}")
    logger.info(f"Target runs: {target_runs}")
    logger.info(f"Test file: {test_path}")
    logger.info("")

    # Load test content
    if not os.path.exists(test_path):
        raise FileNotFoundError(f"Test file not found: {test_path}")

    with open(test_path, "r", encoding="utf-8") as f:
        test_content = f.read()

    test_data = {
        "path": test_path,
        "prompt": test_content
    }

    # Initialize results tracking
    reliability_results = {
        "test_path": test_path,
        "phase": phase,
        "target_runs": target_runs,
        "completed_runs": 0,
        "successful_runs": 0,
        "failed_runs": 0,
        "run_details": [],
        "start_time": datetime.now(),
        "end_time": None,
        "success_rate": 0.0,
        "overall_success": False
    }

    logger.info(f"Starting reliability testing with {target_runs} runs...")
    logger.info("=" * 80)

    try:
        for run_number in range(1, target_runs + 1):
            logger.info(f"Run {run_number}/{target_runs}")
            logger.info("-" * 40)

            run_start_time = datetime.now()

            try:
                # Run the test
                test_result = await run_single_test_with_timeout(
                    computer=computer,
                    test_data=test_data,
                    rp_client=rp_client,
                    launch_id=launch_id,
                    max_turns=max_turns,
                    jan_app_path=jan_app_path,
                    jan_process_name=jan_process_name,
                    agent_config=agent_config,
                    enable_reportportal=enable_reportportal
                )

                # Extract success status
                success = False
                if test_result:
                    if isinstance(test_result, dict):
                        success = test_result.get('success', False)
                    elif isinstance(test_result, bool):
                        success = test_result
                    elif hasattr(test_result, 'success'):
                        success = getattr(test_result, 'success', False)
                    else:
                        success = bool(test_result)

                run_end_time = datetime.now()
                run_duration = (run_end_time - run_start_time).total_seconds()

                # Record run result
                run_result = {
                    "run_number": run_number,
                    "success": success,
                    "start_time": run_start_time,
                    "end_time": run_end_time,
                    "duration_seconds": run_duration,
                    "test_result": test_result
                }

                reliability_results["run_details"].append(run_result)
                reliability_results["completed_runs"] += 1

                if success:
                    reliability_results["successful_runs"] += 1
                    logger.info(f"✅ Run {run_number}: SUCCESS ({run_duration:.1f}s)")
                else:
                    reliability_results["failed_runs"] += 1
                    logger.error(f"❌ Run {run_number}: FAILED ({run_duration:.1f}s)")

                # Calculate current success rate
                current_success_rate = (reliability_results["successful_runs"] / reliability_results["completed_runs"]) * 100
                logger.info(f"Current success rate: {reliability_results['successful_runs']}/{reliability_results['completed_runs']} ({current_success_rate:.1f}%)")

            except Exception as e:
                run_end_time = datetime.now()
                run_duration = (run_end_time - run_start_time).total_seconds()

                # Record failed run
                run_result = {
                    "run_number": run_number,
                    "success": False,
                    "start_time": run_start_time,
                    "end_time": run_end_time,
                    "duration_seconds": run_duration,
                    "error": str(e)
                }

                reliability_results["run_details"].append(run_result)
                reliability_results["completed_runs"] += 1
                reliability_results["failed_runs"] += 1

                logger.error(f"❌ Run {run_number}: EXCEPTION ({run_duration:.1f}s) - {e}")

                # Calculate current success rate
                current_success_rate = (reliability_results["successful_runs"] / reliability_results["completed_runs"]) * 100
                logger.info(f"Current success rate: {reliability_results['successful_runs']}/{reliability_results['completed_runs']} ({current_success_rate:.1f}%)")

            # Add delay between runs (except for the last run)
            if run_number < target_runs:
                delay_seconds = 5
                logger.info(f"Waiting {delay_seconds} seconds before next run...")
                await asyncio.sleep(delay_seconds)

        # Final calculations
        reliability_results["end_time"] = datetime.now()
        total_duration = (reliability_results["end_time"] - reliability_results["start_time"]).total_seconds()
        reliability_results["total_duration_seconds"] = total_duration

        if reliability_results["completed_runs"] > 0:
            reliability_results["success_rate"] = (reliability_results["successful_runs"] / reliability_results["completed_runs"]) * 100

        # Determine overall success based on phase
        if phase == "development":
            # Development phase: 80% success rate required
            reliability_results["overall_success"] = reliability_results["success_rate"] >= 80.0
        else:
            # Deployment phase: 90% success rate required
            reliability_results["overall_success"] = reliability_results["success_rate"] >= 90.0

        # Print final summary
        logger.info("=" * 80)
        logger.info("RELIABILITY TEST SUMMARY")
        logger.info("=" * 80)
        logger.info(f"Test: {test_path}")
        logger.info(f"Phase: {phase.upper()}")
        logger.info(f"Completed runs: {reliability_results['completed_runs']}/{target_runs}")
        logger.info(f"Successful runs: {reliability_results['successful_runs']}")
        logger.info(f"Failed runs: {reliability_results['failed_runs']}")
        logger.info(f"Success rate: {reliability_results['success_rate']:.1f}%")
        logger.info(f"Total duration: {total_duration:.1f} seconds")
        logger.info(f"Average duration per run: {total_duration / reliability_results['completed_runs']:.1f} seconds")
        logger.info(f"Overall result: {'✅ PASSED' if reliability_results['overall_success'] else '❌ FAILED'}")

        # Phase-specific requirements
        if phase == "development":
            logger.info("Development phase requirement: ≥80% success rate")
        else:
            logger.info("Deployment phase requirement: ≥90% success rate")

        return reliability_results

    except Exception as e:
        logger.error(f"Reliability testing failed with exception: {e}")
        reliability_results["end_time"] = datetime.now()
        reliability_results["error_message"] = str(e)
        return reliability_results


async def run_reliability_tests(computer, test_paths, rp_client=None, launch_id=None,
                                max_turns=30, jan_app_path=None, jan_process_name="Jan.exe",
                                agent_config=None, enable_reportportal=False,
                                phase="development", runs=None):
    """
    Run reliability tests for multiple test files

    Args:
        computer: Computer agent instance
        test_paths: List of test file paths or single path
        rp_client: ReportPortal client (optional)
        launch_id: ReportPortal launch ID (optional)
        max_turns: Maximum turns per test
        jan_app_path: Path to Jan application
        jan_process_name: Jan process name for monitoring
        agent_config: Agent configuration
        enable_reportportal: Whether to upload to ReportPortal
        phase: "development" (5 runs) or "deployment" (20 runs)
        runs: Number of runs to execute (overrides phase if specified)

    Returns:
        dict with overall reliability test results
    """
    # Convert single path to list
    if isinstance(test_paths, str):
        test_paths = [test_paths]

    logger.info("=" * 100)
    logger.info("RELIABILITY TESTING SUITE")
    logger.info("=" * 100)
    logger.info(f"Phase: {phase.upper()}")
    logger.info(f"Test files: {len(test_paths)}")
    logger.info(f"Test paths: {', '.join(test_paths)}")
    logger.info("")

    overall_results = {
        "phase": phase,
        "total_tests": len(test_paths),
        "completed_tests": 0,
        "passed_tests": 0,
        "failed_tests": 0,
        "test_results": {},
        "start_time": datetime.now(),
        "end_time": None,
        "overall_success": False
    }

    try:
        for i, test_path in enumerate(test_paths, 1):
            logger.info(f"Starting reliability test {i}/{len(test_paths)}: {test_path}")

            test_result = await run_reliability_test(
                computer=computer,
                test_path=test_path,
                rp_client=rp_client,
                launch_id=launch_id,
                max_turns=max_turns,
                jan_app_path=jan_app_path,
                jan_process_name=jan_process_name,
                agent_config=agent_config,
                enable_reportportal=enable_reportportal,
                phase=phase,
                runs=runs
            )

            overall_results["test_results"][test_path] = test_result
            overall_results["completed_tests"] += 1

            if test_result and test_result.get("overall_success", False):
                overall_results["passed_tests"] += 1
                logger.info(f"✅ Test {i} PASSED: {test_path}")
            else:
                overall_results["failed_tests"] += 1
                logger.error(f"❌ Test {i} FAILED: {test_path}")

            # Add delay between tests (except for the last test)
            if i < len(test_paths):
                delay_seconds = 10
                logger.info(f"Waiting {delay_seconds} seconds before next test...")
                await asyncio.sleep(delay_seconds)

        # Final calculations
        overall_results["end_time"] = datetime.now()
        total_duration = (overall_results["end_time"] - overall_results["start_time"]).total_seconds()
        overall_results["total_duration_seconds"] = total_duration

        if overall_results["completed_tests"] > 0:
            overall_results["overall_success"] = overall_results["failed_tests"] == 0

        # Print overall summary
        logger.info("=" * 100)
        logger.info("RELIABILITY TESTING SUITE SUMMARY")
        logger.info("=" * 100)
        logger.info(f"Phase: {phase.upper()}")
        logger.info(f"Total tests: {overall_results['total_tests']}")
        logger.info(f"Completed tests: {overall_results['completed_tests']}")
        logger.info(f"Passed tests: {overall_results['passed_tests']}")
        logger.info(f"Failed tests: {overall_results['failed_tests']}")
        logger.info(f"Total duration: {total_duration:.1f} seconds")
        logger.info(f"Overall result: {'✅ PASSED' if overall_results['overall_success'] else '❌ FAILED'}")

        # Individual test results
        logger.info("")
        logger.info("Individual Test Results:")
        for test_path, test_result in overall_results["test_results"].items():
            if test_result:
                status = "✅ PASSED" if test_result.get("overall_success", False) else "❌ FAILED"
                success_rate = test_result.get("success_rate", 0.0)
                logger.info(f"  {test_path}: {status} ({success_rate:.1f}% success rate)")
            else:
                logger.info(f"  {test_path}: ❌ ERROR (no result)")

        return overall_results

    except Exception as e:
        logger.error(f"Reliability testing suite failed with exception: {e}")
        overall_results["end_time"] = datetime.now()
        overall_results["error_message"] = str(e)
        return overall_results

@@ -49,7 +49,7 @@ Step-by-step instructions:
- Choose: `jan-nano-gguf` under the `Llama.Cpp` section.

5. Send a test message:
- Type: `Hello world` and press Enter or click send message (button with right arrow). You should click at the center of the button.
- Wait up to 1–2 minutes for the model to load and respond.

6. Verify the model responds: