Local Farm Agents fail in Strategy Tester - page 2

 
I'm resurrecting this thread because this problem has come up from time to time over the years; I've definitely seen it in other builds. The EA in this case is very complex and is testing 6 symbols over 25 years on M1 OHLC. Local agents are very stable, but local network agents last about one and a half iterations on average with today's problem. Both the network agents and the terminal are on build 5640. I've tried disabling all AV and reinstalling the network agents after completely deleting the relevant folders. Manually killing the stuck task allows a new one to be launched, and it works again for another iteration. I am not running out of RAM (usage is stable at 85%).


0 07:04:46.899 Tester optimization pass 12 started (batch of 3 tasks)
CS 0 07:04:46.899 192.168.2.18 prepare for shutdown
CS 0 07:04:46.899 192.168.2.18 shutdown finished
CS 0 07:04:56.863 192.168.2.18 login (build 5640)
CS 0 07:04:56.935 Tester account info found with currency USD
CS 0 07:04:56.935 Tester successfully initialized
CS 0 07:04:56.935 Network 171 bytes of total initialization data received
CS 0 07:04:56.935 Tester AMD Ryzen 9 3950X 16-Core, 130983 MB
CS 0 07:04:56.944 Tester optimization pass 12 started (batch of 3 tasks)
CS 0 07:04:56.944 192.168.2.18 prepare for shutdown
CS 0 07:04:56.944 192.168.2.18 shutdown finished
CS 0 07:05:06.849 192.168.2.18 login (build 5640)
CS 0 07:05:06.860 Tester account info found with currency USD
CS 0 07:05:06.874 Network 146628 bytes of input parameters loaded
CS 0 07:05:06.874 Tester successfully initialized
CS 0 07:05:06.874 Network 5954 bytes of total initialization data received
CS 0 07:05:06.874 Tester AMD Ryzen 9 3950X 16-Core, 130983 MB
CS 0 07:05:06.884 Tester optimization pass 70 started
CS 0 07:05:06.884 192.168.2.18 prepare for shutdown
CS 0 07:05:06.884 192.168.2.18 shutdown finished
CS 0 07:05:16.846 192.168.2.18 login (build 5640)
CS 0 07:05:16.856 Tester account info found with currency USD
CS 0 07:05:16.875 Tester program file added: Indicators\\b_kaufman_efficiency_ratio.ex5. 9999 bytes loaded
CS 0 07:05:16.875 Tester program file added: Indicators\\ForceIndexBollingerBands.ex5. 26311 bytes loaded
CS 0 07:05:16.876 Tester successfully initialized
CS 0 07:05:16.876 Network 34 Kb of total initialization data received
CS 0 07:05:16.876 Tester AMD Ryzen 9 3950X 16-Core, 130983 MB
CS 0 07:05:16.896 Tester optimization pass 70 started (batch of 3 tasks)
CS 0 07:05:16.897 192.168.2.18 prepare for shutdown
CS 0 07:05:16.897 192.168.2.18 shutdown finished
CS 0 07:05:26.851 192.168.2.18 login (build 5640)
CS 0 07:05:26.859 Tester account info found with currency USD
CS 0 07:05:26.867 Network 146628 bytes of input parameters loaded
CS 0 07:05:26.867 Tester successfully initialized
CS 0 07:05:26.867 Network 5948 bytes of total initialization data received
CS 0 07:05:26.867 Tester AMD Ryzen 9 3950X 16-Core, 130983 MB
CS 0 07:05:26.876 Tester optimization pass 84 started (batch of 3 tasks)
CS 0 07:05:26.876 192.168.2.18 prepare for shutdown
CS 0 07:05:26.876 192.168.2.18 shutdown finished
CS 0 07:05:36.856 192.168.2.18 login (build 5640)
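
The pattern in this log is that the agent at 192.168.2.18 logs in, is handed a pass, shuts down and then logs in again roughly every 10 seconds. For anyone who wants to check their own journal for the same cycle, here is a minimal Python sketch. It assumes the `HH:MM:SS.mmm <ip> login (build N)` layout seen in the log above, works whether or not the entries are on separate lines, and does not handle a log that crosses midnight. The name `login_gaps` is made up for illustration.

```python
import re
from datetime import datetime

# Matches "<time> <ip> login (build N)" as it appears in the pasted log.
# Assumption: the timestamp is always HH:MM:SS.mmm and the agent is shown by IPv4 address.
LOGIN_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3})\s+(\d{1,3}(?:\.\d{1,3}){3})\s+login \(build (\d+)\)"
)

def login_gaps(log_text):
    """Return the seconds between consecutive agent logins found in log_text."""
    times = [
        datetime.strptime(m.group(1), "%H:%M:%S.%f")
        for m in LOGIN_RE.finditer(log_text)
    ]
    # Pair each login with the next one and take the difference.
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]

# Three login entries copied from the log above, left on a single line on purpose.
sample = (
    "CS 0 07:04:56.863 192.168.2.18 login (build 5640) "
    "CS 0 07:05:06.849 192.168.2.18 login (build 5640) "
    "CS 0 07:05:16.846 192.168.2.18 login (build 5640)"
)

print(login_gaps(sample))
```

Gaps that stay close to 10 seconds, as they do here, suggest the agent is being restarted on every pass rather than completing any work.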

I'm sharing my workaround: a PowerShell script that detects zombie agents and kills them so that they can restart. Use at your own risk; you may need to adjust the thresholds for your own setup. Run it directly from an elevated PowerShell terminal with -Loop.

PS F:\> .\kill_zombie_agents.ps1 -Loop



# kill_zombie_agents.ps1
# Run on the remote farm PC to kill hung metatester64.exe processes
# Usage:
#   .\kill_zombie_agents.ps1              # Kill zombies (under 0.1% CPU in two consecutive 30s samples)
#   .\kill_zombie_agents.ps1 -All         # Kill ALL metatester64 processes
#   .\kill_zombie_agents.ps1 -Loop        # Monitor continuously every 60s
#   .\kill_zombie_agents.ps1 -Loop -IntervalSec 30

param(
    [switch]$All,
    [switch]$Loop,
    [int]$IntervalSec = 60,
    [int]$ZombieThresholdSec = 30
)

function Get-MetaTesterProcesses {
    Get-Process -Name "metatester64" -ErrorAction SilentlyContinue
}

function Get-SuspectPIDs {
    param([int]$ThresholdSec)

    $filter = "Name LIKE 'metatester64%'"
    $snap1 = @{}
    try {
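        # -ErrorAction Stop turns query failures into terminating errors so the catch block runs
        $wmi1 = Get-CimInstance Win32_Process -Filter $filter -ErrorAction Stop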
    } catch {
        Write-Host ("  WMI query failed: " + $_) -ForegroundColor Red
        return @()
    }
    if (-not $wmi1) { return @() }

    foreach ($w in $wmi1) {
        # Note: despite the "Ns" name, Win32_Process reports these times in 100-nanosecond units
        $totalNs = [long]$w.UserModeTime + [long]$w.KernelModeTime
        $snap1[$w.ProcessId] = $totalNs
    }
    Write-Host ("  Tracking " + $snap1.Count + " processes for " + $ThresholdSec + "s...") -ForegroundColor DarkGray

    Start-Sleep -Seconds $ThresholdSec

    $suspects = @()
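    # Sample length in 100-nanosecond units (10,000,000 per second), the same unit as the CPU times
    $wallNs = [long]$ThresholdSec * 10000000
    try {
        # -ErrorAction Stop turns query failures into terminating errors so the catch block runs
        $wmi2 = Get-CimInstance Win32_Process -Filter $filter -ErrorAction Stop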
    } catch {
        Write-Host ("  WMI query failed: " + $_) -ForegroundColor Red
        return @()
    }

    foreach ($w in $wmi2) {
        $procId = $w.ProcessId
        $before = $snap1[$procId]
        if ($null -eq $before) {
            Write-Host ("  PID " + $procId + ": new process (spawned during sample)") -ForegroundColor DarkGray
            continue
        }
        $after = [long]$w.UserModeTime + [long]$w.KernelModeTime
        $deltaNs = $after - $before
        $cpuPct = 0
        if ($wallNs -gt 0) { $cpuPct = ($deltaNs / $wallNs) * 100 }
        $memMB = [math]::Round($w.WorkingSetSize / 1MB)
        $pctStr = [math]::Round($cpuPct, 2).ToString()

        if ($cpuPct -lt 0.1) {
            Write-Host ("  PID " + $procId + ": " + $pctStr + "% CPU, " + $memMB + " MB - SUSPECT") -ForegroundColor Yellow
            $suspects += $procId
        } else {
            Write-Host ("  PID " + $procId + ": " + $pctStr + "% CPU, " + $memMB + " MB - alive") -ForegroundColor Green
        }
    }
    return $suspects
}

function Show-Status {
    $procs = Get-MetaTesterProcesses
    $count = 0
    $totalMB = 0
    if ($procs) {
        $count = @($procs).Count
        $totalMB = [math]::Round(($procs | Measure-Object WorkingSet64 -Sum).Sum / 1MB)
    }
    $ts = Get-Date -Format "HH:mm:ss"
    Write-Host ("[$ts] metatester64 processes: $count | Total RAM: $totalMB MB") -ForegroundColor Cyan
}

function Kill-Processes {
    param($Processes, [string]$Reason)
    foreach ($p in $Processes) {
        $memMB = [math]::Round($p.WorkingSet64 / 1MB)
        Write-Host ("  Killing PID " + $p.Id + " (" + $memMB + " MB) - " + $Reason) -ForegroundColor Red
        $result = taskkill /F /PID $p.Id 2>&1
        if ($LASTEXITCODE -ne 0) {
            Write-Host ("  Failed: " + $result) -ForegroundColor Red
        }
    }
}

# --- Main ---
$isAdmin = ([Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)
if (-not $isAdmin) {
    Write-Host "WARNING: Not running as Administrator. Kill commands may fail." -ForegroundColor Red
    Write-Host "Right-click PowerShell -> Run as Administrator" -ForegroundColor Red
    Write-Host ""
}
Write-Host "=== MetaTester Zombie Killer ===" -ForegroundColor Green
Write-Host ""

if ($All) {
    Show-Status
    $procs = Get-MetaTesterProcesses
    if ($procs) {
        $count = @($procs).Count
        Write-Host ("Killing ALL " + $count + " metatester64 processes...") -ForegroundColor Red
        Kill-Processes -Processes $procs -Reason "forced kill-all"
        Start-Sleep -Seconds 2
        Show-Status
    } else {
        Write-Host "No metatester64 processes found." -ForegroundColor Green
    }
    exit
}

if ($Loop) {
    Write-Host ("Monitoring mode: checking every " + $IntervalSec + "s, zombie threshold " + $ZombieThresholdSec + "s")
    Write-Host "Press Ctrl+C to stop"
    Write-Host ""

    $prevSuspects = @()
    while ($true) {
        Show-Status
        $procs = Get-MetaTesterProcesses
        if ($procs -and @($procs).Count -gt 0) {
            $suspects = @(Get-SuspectPIDs -ThresholdSec $ZombieThresholdSec)
            if ($prevSuspects.Count -gt 0) {
                $recovered = @()
                foreach ($prev in $prevSuspects) {
                    if ($suspects -notcontains $prev) { $recovered += $prev }
                }
                if ($recovered.Count -gt 0) {
                    Write-Host ("  " + $recovered.Count + " previous suspect(s) recovered: " + ($recovered -join ", ")) -ForegroundColor DarkGreen
                }
            }
            if ($suspects.Count -gt 0 -and $prevSuspects.Count -gt 0) {
                $toKill = @()
                foreach ($s in $suspects) {
                    if ($prevSuspects -contains $s) {
                        $p = Get-Process -Id $s -ErrorAction SilentlyContinue
                        if ($p) { $toKill += $p }
                    }
                }
                if ($toKill.Count -gt 0) {
                    Write-Host ("  " + $toKill.Count + " confirmed zombie(s) (2 strikes)! Killing...") -ForegroundColor Red
                    Kill-Processes -Processes $toKill -Reason "zombie (2 consecutive checks)"
                } else {
                    Write-Host ("  " + $suspects.Count + " new suspect(s) on strike 1 - will recheck next round") -ForegroundColor DarkYellow
                }
            } elseif ($suspects.Count -gt 0) {
                Write-Host ("  " + $suspects.Count + " suspect(s) on strike 1 - will recheck next round") -ForegroundColor DarkYellow
            } else {
                Write-Host "  All processes healthy" -ForegroundColor Green
            }
            $prevSuspects = $suspects
        } else {
            $prevSuspects = @()
        }

        $sleepRemaining = $IntervalSec - $ZombieThresholdSec
        if ($sleepRemaining -gt 0) {
            Start-Sleep -Seconds $sleepRemaining
        }
    }
} else {
    Show-Status
    $procs = Get-MetaTesterProcesses
    if (-not $procs -or @($procs).Count -eq 0) {
        Write-Host "No metatester64 processes found." -ForegroundColor Green
        exit
    }

    Write-Host ("Check 1 of 2 (" + $ZombieThresholdSec + "s sample)...")
    $suspects1 = @(Get-SuspectPIDs -ThresholdSec $ZombieThresholdSec)
    if ($suspects1.Count -eq 0) {
        Write-Host "No zombies detected - all processes are using CPU" -ForegroundColor Green
        exit
    }
    Write-Host ("  " + $suspects1.Count + " suspect(s) found. Rechecking in " + $ZombieThresholdSec + "s...")
    Write-Host ""
    Write-Host ("Check 2 of 2 (" + $ZombieThresholdSec + "s sample)...")
    $suspects2 = @(Get-SuspectPIDs -ThresholdSec $ZombieThresholdSec)
    $toKill = @()
    foreach ($s in $suspects2) {
        if ($suspects1 -contains $s) {
            $p = Get-Process -Id $s -ErrorAction SilentlyContinue
            if ($p) { $toKill += $p }
        }
    }
    if ($toKill.Count -gt 0) {
        Write-Host ("Confirmed " + $toKill.Count + " zombie(s) (failed both checks):") -ForegroundColor Red
        Kill-Processes -Processes $toKill -Reason "zombie (2 consecutive checks)"
        Start-Sleep -Seconds 2
        Show-Status
    } else {
        Write-Host "No confirmed zombies - suspects recovered on second check" -ForegroundColor Green
    }
}
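
For anyone adjusting the thresholds, the detection in the script above comes down to two small pieces of logic: CPU use is the change in `UserModeTime + KernelModeTime` (which `Win32_Process` reports in 100-nanosecond units) divided by the length of the sample, and a process is only killed if it falls under 0.1% in two consecutive samples. Here is a minimal sketch of that arithmetic in Python. It is not part of the script, and the names `cpu_percent` and `confirmed_zombies` are made up for illustration.

```python
TICKS_PER_SECOND = 10_000_000  # Win32_Process CPU times are in 100-nanosecond units

def cpu_percent(ticks_before, ticks_after, sample_sec):
    """CPU used during the sample, as a percentage of ONE core.

    A multi-threaded process can therefore exceed 100.
    """
    wall_ticks = sample_sec * TICKS_PER_SECOND
    if wall_ticks <= 0:
        return 0.0
    return (ticks_after - ticks_before) / wall_ticks * 100

def confirmed_zombies(first_suspects, second_suspects):
    """PIDs flagged in both consecutive samples (the two-strike rule)."""
    return sorted(set(first_suspects) & set(second_suspects))

# 1.5 s of CPU time over a 30 s sample is 5% of one core: treated as alive.
print(cpu_percent(0, 15_000_000, 30))

# 10 ms of CPU time over a 30 s sample is about 0.033%: under the 0.1% cutoff.
print(cpu_percent(0, 100_000, 30) < 0.1)

# The PIDs here are invented. Only the one flagged in both rounds is confirmed.
print(confirmed_zombies([101, 202], [202, 303]))
```

Because the percentage is relative to a single core, an agent that is genuinely working on a pass should sit far above the 0.1% cutoff. Raising `-ZombieThresholdSec` lengthens each sample, which makes a false positive on a briefly idle agent less likely.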





 

Benjamin Dixon #:
I'm resurrecting this thread because this problem has come up from time to time over the years; I've definitely seen it in other builds. The EA in this case is very complex and is testing 6 symbols over 25 years on M1 OHLC. Local agents are very stable, but local network agents last about one and a half iterations on average with today's problem. Both the network agents and the terminal are on build 5640. I've tried disabling all AV and reinstalling the network agents after completely deleting the relevant folders. Manually killing the stuck task allows a new one to be launched, and it works again for another iteration. I am not running out of RAM (usage is stable at 85%).

It's hard to understand what you are reporting. An issue with agents from a local network, but one that occurs randomly?


Posting a log like this is useless: it's unreadable, and nobody will go to the trouble of separating the log lines. You should post the untouched log file instead.

I'm sharing my workaround: a PowerShell script that detects zombie agents and kills them so that they can restart. Use at your own risk; you may need to adjust the thresholds for your own setup. Run it directly from an elevated PowerShell terminal with -Loop.

What are you expecting from this post?

 
Alain Verleyen #:

It's hard to understand what you are reporting. An issue with agents from a local network, but one that occurs randomly?

Posting a log like this is useless: it's unreadable, and nobody will go to the trouble of separating the log lines. You should post the untouched log file instead.

What are you expecting from this post?

Just sharing the workaround script was the goal of the post.