Hello,
Customer is trying to run SmartState Analysis on an instance of AWS running CentOS.
In the user tasks this analysis fails with: "job timed out after 3060.770851409 seconds of inactivity. Inactivity threshold [3000 seconds]"
This is Infrastructure Management running as virtual appliance on VMWare.
In the provided logs (TS004217864/2020-09-27/ dir on ecurep) I can see:
{"@timestamp":"2020-10-07T14:04:52.176002 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb296b6e48","level":"debug","message":"[Aws::SQS::Client 400 0.057859 0 retries] get_queue_url(queue_name:\"smartstate_extract_reply-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::SQS::Errors::NonExistentQueue The specified queue does not exist for this wsdl version."}
{"@timestamp":"2020-10-07T14:05:02.577138 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb257d795c","level":"debug","message":"[Aws::S3::Client 404 0.20764 0 retries] head_bucket(bucket:\"smartstate-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::S3::Errors::NotFound "}
--> process agent coordinator running
12190 root 21 1 395768 204608 14740 R 15.4 0.8 0:09.24 MIQ: Amazon::CloudManager::EventCatcher id: 76, queue: ems_3
12188 root 30 10 361160 202108 13100 R 15.3 0.8 0:09.19 MIQ: Amazon::AgentCoordinatorWorker id: 75, queue: ems_agent_coordinator
12194 root 23 3 361408 195428 13128 R 14.7 0.8 0:08.85 MIQ: Amazon::CloudManager::MetricsCollectorWorker id: 78, queue: amazon
--> The related task info with GUID
[----] I, [2020-09-27T15:03:16.993192 #2228:2abfbf325970] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker#start) Worker started: ID [75], PID [12188], GUID [5b493b88-02ff-4570-8697-6e393122260b]
--> I have looked at ews.log for GUID but, the only timeout error I found is the following:
[----] I, [2020-09-27T15:59:18.098238 #12188:2aeeb017595c] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner#do_work) Alive agents in EMS(guid=1c9078f4-1231-4f90-9b63-29fbc205ce72): [].
.....
and then:
[----] I, [2020-09-27T15:59:23.405619 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5879], Delivering...
[----] W, [2020-09-27T15:59:23.409676 #2684:2ad3ca51b968] WARN -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#timeout!) Job: guid: [17167f79-8f9d-428f-a382-4cb29ee8e498], job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds], aborting
[----] I, [2020-09-27T15:59:23.416496 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5882], id: [], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"]
[----] I, [2020-09-27T15:59:23.416693 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5879], State: [ok], Delivered in [0.01107655] seconds
[----] I, [2020-09-27T15:59:23.422359 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5881], MiqWorker id: [64], Zone: [], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqQueue.check_for_timeout], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [3.302457898] seconds
[----] I, [2020-09-27T15:59:23.422617 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5881], Delivering...
[----] I, [2020-09-27T15:59:23.423784 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5881], State: [ok], Delivered in [0.001177885] seconds
[----] I, [2020-09-27T15:59:28.470684 #2228:2abfbf325970] INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 1 miq_queue rows for queue_name=generic, wcount=4, priority=200
[----] I, [2020-09-27T15:59:29.506630 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5882], MiqWorker id: [64], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"], Dequeued in: [6.094388538] seconds
[----] I, [2020-09-27T15:59:29.506939 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5882], Delivering...
[----] I, [2020-09-27T15:59:29.656524 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5883], id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ManageIQ::Providers::Amazon::CloudManager::Vm", :object_id=>34, :attrs=>{:event_type=>"vm_scan_abort", "VmOrTemplate::vm"=>34, :vm_id=>34, :host=>nil, "MiqEvent::miq_event"=>308, :miq_event_id=>308, "EventStream::event_stream"=>308, :event_stream_id=>308}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>1, :tenant_id=>1, :automate_message=>nil}]
[----] E, [2020-09-27T15:59:29.656639 #2684:2ad3ca51b968] ERROR -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#process_abort) job aborting, job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]
[----] E, [2020-09-27T15:59:29.662790 #2684:2ad3ca51b968] ERROR -- : MIQ(MiqQueue#deliver) Message id: [5882], Error: [finish is not permitted at state aborting]
[----] E, [2020-09-27T15:59:29.662909 #2684:2ad3ca51b968] ERROR -- : [RuntimeError]: finish is not permitted at state aborting Method:[block (2 levels) in class:LogProxy]
[----] E, [2020-09-27T15:59:29.662994 #2684:2ad3ca51b968] ERROR -- : /var/www/miq/vmdb/app/models/job/state_machine.rb:50:in `signal'
Not sure if this is related to the issue.