microsoft / Docker-Provider
Azure Monitor for Containers
License: Other
Version: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01312022
Platform: AKS
I'm using the proposed configuration template [1], and it seems to get loaded (according to the logs). However, the omsagents still do not respect the arrays of excluded namespaces. If I disable export of stdout and stderr entirely, that takes effect, but filtering namespaces (while collection is enabled) does not work.
This is a duplicate of #737, in which I posted a comment (as no solution is mentioned there) but got no answer yet.
Since it is a closed issue, I don't think many people will look into it, so let me post a new issue.
I have an AKS cluster set up with Container Insights enabled.
My Log Analytics workspace contains a lot of logs that I don't use, so I want to limit the collected logs.
I created the ConfigMap, based on this template, in the kube-system namespace of my cluster.
When calling kubectl edit configmap container-azm-ms-agentconfig -n kube-system, I get the following:
I have 3 separate namespaces: kube-system, grafana-namespace, and apps-namespace.
I only want to capture the last one's logs.
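For clarity, the relevant part of my ConfigMap follows the public template and looks roughly like this (a sketch; the full file contains more settings):

log-data-collection-settings: |-
  # Keep collection enabled, but exclude the two noisy namespaces
  [log_collection_settings]
    [log_collection_settings.stdout]
      enabled = true
      exclude_namespaces = ["kube-system", "grafana-namespace"]
    [log_collection_settings.stderr]
      enabled = true
      exclude_namespaces = ["kube-system", "grafana-namespace"]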
While checking one of the omsagent-... pods' logs, I get the following:
Both stdout & stderr log collection are turned off for namespaces: '*.csv2,*_kube-system_*.log,*_grafana-namespace_*.log'
****************End Config Processing********************
****************Start Config Processing********************
config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values
config::Successfully parsed mounted config map
So the configuration itself has no errors in it, and is applied properly.
Now when I check the Log Analytics workspace, I still see data in there referring to kube-system and grafana-namespace.
So, for example, this query returns results for the last 5 minutes, while the ConfigMap has already been deployed for a week or so:
KubePodInventory
| where Namespace == "kube-system"
According to the Microsoft docs, the ConfigMap should reduce log ingestion if you exclude a namespace.
The main question is: what did I do wrong in the configuration, or am I wrong in thinking that KubePodInventory shouldn't contain any data for the excluded namespaces?
We are getting the below error when we run:
curl -sL https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/scripts/onboarding/aksengine/kubernetes/AddMonitoringOnboardingTags.sh | bash -s
No k8s-master VMs or VMSSes found in the specified resource group:
Where it says "Create and use a parameters file as a JSON" on the following page, the linked doc is a review page, not the public docs.
https://github.com/microsoft/Docker-Provider/tree/ci_dev/alerts/recommended_alerts_ARM
It links to:
https://review.docs.microsoft.com/en-us/azure/azure-resource-manager/templates/parameter-files
It should link to:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/parameter-files
Been experiencing this issue for some time, and not just on one client.
Running Sitecore Containers in Azure.
After raising a support request, we were advised that the response from MS Support was:
"After discussed with our container product team, seems it’s an known issue for windows containerd. And now the windows contained is an opensource and maintained by community, which means any issue regarding contained issue, we have to raise an issue to the community for the tracking. Thanks for your understanding!"
So I am raising the issue here for help. There are obviously issues with restarts, as you can see in the image, though it seems that is being investigated separately by MS.
We run more than a thousand short-lived jobs on our cluster every day. These jobs stay in the "Completed" state for some time. As a result, we reach our Log Analytics size quota much earlier than expected, because performance metrics are written every minute for every completed job along with other pods (as far as I understand, in_kube_perfinventory.rb is responsible for that).
Can completed pods be excluded from performance traces?
Thanks
From the "About" and also the contents, this repo focuses on AzMon for containers. This makes the current name "Docker-Provider" quite bizarre and confusing.
Hi,
Please add to Container_HostInventory the additional properties returned by the generic Docker API: Containers, ContainersRunning, ContainersPaused, ContainersStopped, and Images.
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
I am currently configuring the ConfigMap that is used by the OMS-agent pods. What I want to achieve is sending Prometheus metrics to a log analytics workspace.
For this I am following this Microsoft docs page.
On that page we can see this:
prometheus.io/scrape: "true"
prometheus.io/path: "/mymetrics"
prometheus.io/port: "8000"
prometheus.io/scheme: "http"
And then in the table under Cluster-wide, we have keys like these:
So, my understanding is that a user can specify, in the annotations of an application pod, which port the OMS agent has to look at.
In my case, I have a pod with the annotation prometheus.io/port=8900, and the default mentioned in the documentation is 9102.
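For context, the pod spec with that annotation looks roughly like this (a sketch; the names are made up):

apiVersion: v1
kind: Pod
metadata:
  name: my-app                        # hypothetical pod name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8900"        # the port I expect the agent to scrape
spec:
  containers:
    - name: my-app
      image: myregistry/my-app:latest # hypothetical image
      ports:
        - containerPort: 8900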
I tried to specify the following TOML in the ConfigMap:
prometheus-data-collection-settings: |-
[prometheus_data_collection_settings.cluster]
interval = "1m"
fieldpass = ["platform_user_sessions", "platform_connection_bus"]
monitor_kubernetes_pods = true
monitor_kubernetes_pods_namespaces = ["dev-group-apps"]
prometheus.io/port = 8900
Once the ConfigMap is read by the OMS-agent, I get the following error in the logs of the pod:
"config::error::Exception while parsing config map for prometheus config: \nparse error on value \"/\" (error), using defaults, please check config map for errors"
When I comment out the prometheus.io/port = 8900 line, it is parsed successfully.
I started to look in the source code to find the error, and what it does when it successfully parses the configmap.
There I bumped into these statements:
There is no prometheus key read from the parsedConfig, so I am definitely doing something wrong.
How can we specify that the OMS agent has to look at a different port via the prometheus.io/port annotation, or is my understanding of this completely wrong?
We've seen cases where Kubernetes life-cycle events for Pods (e.g. Killed) could be seen in the output from kubectl get events, but did not show up in OMS Log Analytics.
It looks like there may be an issue with the way previously seen events are being tracked. In in_kube_events.rb, the uuid from the event is used to track which events have already been seen. If the uuid is already in the KubeEventsStateFile, then the event is skipped; otherwise it's routed to the registered outputs.
The issue is that for some events (e.g. Pod events), the uuid does not change when the event occurs again for the same Pod. Instead, the count and lastTimestamp property values are updated.
Here's an example of a Pod where there were multiple Killing events. In this case the uuid is fb90522d-a65a-11e7-bafb-000d3a36fbf1. The first event has count: 970.
{
"metadata": {
"name": "liveness-exec.14e9557aa4eccb1d",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/events/liveness-exec.14e9557aa4eccb1d",
"uid": "fb90522d-a65a-11e7-bafb-000d3a36fbf1",
"resourceVersion": "3815919",
"creationTimestamp": "2017-10-01T03:45:35Z"
},
"involvedObject": {
"kind": "Pod",
"namespace": "default",
"name": "liveness-exec",
"uid": "9089e7fd-a3be-11e7-9da0-000d3a36fbf1",
"apiVersion": "v1",
"resourceVersion": "3302183",
"fieldPath": "spec.containers{liveness}"
},
"reason": "Killing",
"message": "(events with common reason combined)",
"source": {
"component": "kubelet",
"host": "k8s-agentpool1-39011252-2"
},
"firstTimestamp": "2017-10-01T03:45:35Z",
"lastTimestamp": "2017-10-05T00:37:45Z",
"count": 970,
"type": "Normal"
}
A subsequent Killing event for the same Pod (count: 971) followed, but would have been skipped because it has the same uuid as the first event.
{
"metadata": {
"name": "liveness-exec.14e9557aa4eccb1d",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/events/liveness-exec.14e9557aa4eccb1d",
"uid": "fb90522d-a65a-11e7-bafb-000d3a36fbf1",
"resourceVersion": "3816442",
"creationTimestamp": "2017-10-01T03:45:35Z"
},
"involvedObject": {
"kind": "Pod",
"namespace": "default",
"name": "liveness-exec",
"uid": "9089e7fd-a3be-11e7-9da0-000d3a36fbf1",
"apiVersion": "v1",
"resourceVersion": "3302183",
"fieldPath": "spec.containers{liveness}"
},
"reason": "Killing",
"message": "(events with common reason combined)",
"source": {
"component": "kubelet",
"host": "k8s-agentpool1-39011252-2"
},
"firstTimestamp": "2017-10-01T03:45:35Z",
"lastTimestamp": "2017-10-05T00:43:30Z",
"count": 971,
"type": "Normal"
}
We've tried an experiment: concatenating the count property with the uuid property to construct the eventId used for tracking seen events. That seems to resolve the issue of events being skipped. (Another option would have been to use the lastTimestamp property.) However, we also see periods of time where events are not forwarded at all, but that seems unrelated to this.
The bash script to enable monitoring for Arc clusters fails if the output format is not json.
An example when the output format is set to table:
$ bash enable-monitoring.sh --resource-id $azureArcClusterResourceId --kube-context $kubeContext --workspace-id $logAnalyticsWorkspaceResourceId
...
validating cluster identity
cluster identity type: result -------------- systemassigned
-e only supported cluster identity is systemassigned for Azure ARC K8s cluster type
The az command formatting only works for single-line output.
In addition, a script does not fit our automation workflows; it would be useful to have ARM/Terraform code to configure the workspace, followed by instructions to install the Helm chart natively.
In an AKS environment, logs are lost when containers output a large volume of logs.
I forced a change to Mem_Buf_Limit from its default value in the container of omsagent's DaemonSet, and this improved the situation.
It would be very helpful if the Mem_Buf_Limit setting of td-agent-bit could be changed using the ConfigMap; see the sketch after the config dump below.
# cat /etc/opt/microsoft/docker-cimprov/td-agent-bit.conf
[INPUT]
Name tail
Tag oms.container.log.la.*
Path ${AZMON_LOG_TAIL_PATH}
DB /var/log/omsagent-fblogs.db
DB.Sync Off
Parser cri
Mem_Buf_Limit 10m <------------------------------- I would like to change this parameter
Rotate_Wait 20
Refresh_Interval 30
Path_Key filepath
Skip_Long_Lines On
Ignore_Older 5m
Exclude_Path ${AZMON_CLUSTER_LOG_TAIL_EXCLUDE_PATH}
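Something along these lines in the ConfigMap would work for us (a purely hypothetical sketch of the setting I am asking for; the section and key names are my suggestion, not an existing schema):

agent-settings: |-
  # hypothetical section - proposed, not currently supported
  [agent_settings.fbit_config]
    tail_mem_buf_limit_megabytes = "50"   # would replace the hardcoded Mem_Buf_Limit 10m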
Hello,
Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (a DaemonSet with 2 instances and a ReplicaSet with 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.
Reopening because the last issue was closed automatically.
#624
Hello!
Is there a way to provide a custom Prometheus config (and relabel_config in particular) to the OMS agent? The AWS equivalent of the OMS agent, the ADOT Collector, has this feature. We'd rather not self-host Prometheus since the OMS agent is so convenient, but this is a bit of a blocker for us.
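For reference, the kind of relabel_config we would want to supply looks like this in a standard Prometheus setup (a sketch with made-up names, not something the OMS agent accepts today):

scrape_configs:
  - job_name: my-app                # hypothetical job
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy the pod's "app" label onto every scraped metric
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      # keep only targets from dev namespaces
      - source_labels: [__meta_kubernetes_namespace]
        regex: dev-.*
        action: keep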
Thanks!
Michael
From this doc: https://github.com/microsoft/Docker-Provider/blob/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml#L70
we can configure the endpoint so that the agent collects metrics from it. But my endpoint is secured, something like this:
curl -u admin:password http://automation-controller-service.ansible-automation-platform:8080/api/v2/metrics
I need to use the username and password to access it, and I cannot find any configuration option to do this. Do you know how to provide the auth info for my endpoint? Thanks.
It appears that when an image has a tag, a warning is logged in InventoryQuery::SetImageRepositoryImageTag at syslog(LOG_WARNING, "Container image name (%s) is improperly formed and could not be parsed in SetRepositoryImageTag", properties.c_str());
The syslog message does not contain the image name: "Container image name () is improperly...". All the Docker images have names, so the name is available and should be displayed in the message.
I am trying to run fluentd with a custom configuration instead of the predefined one from https://github.com/microsoft/Docker-Provider/blob/ci_prod/build/linux/installer/conf/kube.conf. In particular, I am interested in changing the run_interval for the kube_events plugin.
I was thinking that the configuration from https://github.com/microsoft/Docker-Provider/blob/ci_prod/charts/azuremonitor-containers/templates/ama-logs-rs-configmap.yaml#L5 is the way to go. However, I couldn’t see any place in which that configuration is used, apart from some if-statements.
What is the purpose of that configmap? And is there any way to change the fluentd configuration?
What I did in the end was to change the run_interval directly in /etc/fluent/kube.conf, but I think there should be a better way to achieve this without a workaround.
I am unable to see metrics from the insights/containers & insights/pods namespaces in the Metrics explorer for my AKS cluster.
We are using the mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod08052021 image version.
In the past I was able to see metrics, and I was using oomKilledContainerCount.
Looking over the release notes I didn't see specific changes targeting this area.
Thanks in advance.
Following up on #645, are the Insights.Container/nodes and Insights.Container/containers namespaces deprecated? I too am not seeing any telemetry for them. Specifically, it seems that the following
namespace: Insights.Container/nodes
metrics: cpuUsagePercentage, memoryWorkingSetPercentage
are replaced in favor of
namespace: Microsoft.ContainerService/managedClusters
metrics: node_cpu_usage_percentage, node_memory_rss_percentage
The OMS agent stopped collecting Prometheus metrics with this log:
2021-04-20T02:17:49Z E! [inputs.prometheus] Unable to watch resources: Get "https://XXXXXXXXXXXX.hcp.westeurope.azmk8s.io:443/api/v1/pods?watch=true": context canceled
2021-04-20T02:17:49Z E! [telegraf] Error running agent: input plugins recorded 1 errors
End Telegraf Run in Test Mode**********
starting fluent-bit and setting telegraf conf file for replicaset
nodename: aks-main-38921269-vmss000000
replacing nodename in telegraf config
File Doesnt Exist. Creating file...
Fluent Bit v1.6.8
Telegraf 1.18.0 (git: HEAD ac5c7f6a)
2021-04-20T02:17:49Z I! Starting Telegraf 1.18.0
td-agent-bit 1.6.8
stopping rsyslog...
image tag: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03262021
We had to manually restart the pod to see Prometheus metrics being collected again.
It would be great if you added Start, Stop and Delete methods to the Container_ContainerInventory class.
Hi, I am part of an InfoSec team and we detected the following critical vulnerability in image: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021
Please validate. Thanks in advance!
Critical vulnerability (CVE-2017-10906)
Description: Escape sequence injection vulnerability in Fluentd versions 0.12.29 through 0.12.40 may allow an attacker to change the terminal UI or execute arbitrary commands on the device via unspecified vectors.
Installed Resource: fluentd 0.12.40
Fixed Version: 0.12.41
Published by NVD: 2017-12-08
CVSS Score: NVD CVSSv2 10.0
Remediation: Upgrade package fluentd to version 0.12.41 or above.
Full Path: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.4.0/specifications/fluentd-0.12.40.gemspec
Is it possible to scrape AKS API server metrics using https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration?
As far as I know, authentication (a bearer token) is required to get /metrics from the API server, and I cannot see how this can be set in the monitoring agent config file (https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration#prometheus-scraping-settings).
For a standard Prometheus deployment this can be configured via the bearer_token_file setting.
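For comparison, a standard Prometheus scrape config for the API server looks roughly like this (a sketch using the in-cluster service account token):

scrape_configs:
  - job_name: kubernetes-apiservers   # hypothetical job name
    scheme: https
    kubernetes_sd_configs:
      - role: endpoints
    # authenticate with the pod's service account token
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt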
Hello,
Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (a DaemonSet with 2 instances and a ReplicaSet with 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.
Why is the ReplicaSet even required? It just adds one more instance to a node where the DaemonSet already created one.
Reopening because the last issue was closed without a solution (the posted solution was for a problem that occurred later than the original issue).
#694
Collecting all kube events does not work correctly: only the first 4,000 events are recorded once Normal events are taken into account. The loop that fetches events is missing the logic for when collect_all_kube_events is enabled.
The code at this line: https://github.com/microsoft/Docker-Provider/blob/ci_prod/source/plugins/ruby/in_kube_events.rb#L115 should look like:
if @collectAllKubeEvents
continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}")
else
continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?fieldSelector=type!=Normal&limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}")
end
Following the instructions from the charts/azuremonitor-containers URL, the Helm deployment does not onboard Azure Monitor for containers.
Expected behaviour: following the instructions would automatically onboard Azure Monitor for containers.
Environment: AKS
Kubernetes: 1.19.11
Step 1 and 2 complete successfully with a "Log Analytics Workspace" and "ContainerInsights(iob-dev-westeurope-akstest-workspace)" solution created in the same resource group as the AKS cluster.
Step 3 fails with the error "No k8s-master VMs or VMSSes found in the specified resource group:iob-dev-westeurope-akstest-rg-aks", but looking at the script I am not sure this step applies to AKS.
Helm deployment completes without error.
helm upgrade --install --values=values.yaml azmon-containers microsoft/azuremonitor-containers --namespace kube-system
Release "azmon-containers" does not exist. Installing it now.
W0721 09:12:19.676421 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0721 09:12:19.834125 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0721 09:12:22.908504 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0721 09:12:23.088834 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
NAME: azmon-containers
LAST DEPLOYED: Wed Jul 21 09:12:18 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES: azmon-containers deployment is complete.
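For completeness, the values.yaml passed to Helm is along these lines (a sketch based on my reading of the chart's documented keys; secrets redacted):

omsagent:
  secret:
    wsid: "<log-analytics-workspace-id>"   # workspace GUID (redacted)
    key: "<log-analytics-workspace-key>"   # primary shared key (redacted)
  env:
    clusterName: "akstest"                 # hypothetical cluster name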
Log output from omsagent
kubectl logs omsagent-48t4s -n kube-system
not setting customResourceId
Making curl request to oms endpint with domain: opinsights.azure.com
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl request to oms endpoint succeeded.
****************Start Config Processing********************
Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log'
****************End Config Processing********************
****************Start Config Processing********************
****************Start NPM Config Processing********************
config::npm::Successfully substituted the NPM placeholders into /etc/opt/microsoft/docker-cimprov/telegraf.conf file for DaemonSet
config::Starting to substitute the placeholders in td-agent-bit.conf file for log collection
config::Successfully substituted the placeholders in td-agent-bit.conf file
****************Start Prometheus Config Processing********************
config::No configmap mounted for prometheus custom config, using defaults
****************End Prometheus Config Processing********************
****************Start MDM Metrics Config Processing********************
****************End MDM Metrics Config Processing********************
****************Start Metric Collection Settings Processing********************
****************End Metric Collection Settings Processing********************
Making wget request to cadvisor endpoint with port 10250
Wget request using port 10250 succeeded. Using 10250
Making curl request to cadvisor endpoint /pods with port 10250 to get the configured container runtime on kubelet
configured container runtime on kubelet is : containerd
set caps for ruby process to read container env from proc
aks-system1-34726002-vmss000000
* Starting periodic command scheduler cron
...done.
docker-cimprov 16.0.0.0
DOCKER_CIMPROV_VERSION=16.0.0.0
*** activating oneagent in legacy auth mode ***
setting mdsd workspaceid & key for workspace:68299338-cb11-46a8-a42e-977e476105e4
azure-mdsd 1.10.1-build.master.213
starting mdsd in legacy auth mode in main container...
*** starting fluentd v1 in daemonset
starting fluent-bit and setting telegraf conf file for daemonset
since container run time is containerd update the container log fluentbit Parser to cri from docker
nodename: aks-system1-34726002-vmss000000
replacing nodename in telegraf config
checking for listener on tcp #25226 and waiting for 30 secs if not..
File Doesnt Exist. Creating file...
Fluent Bit v1.6.8
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
waitforlisteneronTCPport found listener on port:25226 in 5 secs
checking for listener on tcp #25228 and waiting for 30 secs if not..
Routing container logs thru v2 route...
waitforlisteneronTCPport found listener on port:25228 in 10 secs
Telegraf 1.18.0 (git: HEAD ac5c7f6a)
2021-07-21T08:12:52Z I! Starting Telegraf 1.18.0
td-agent-bit 1.6.8
stopping rsyslog...
* Stopping enhanced syslogd rsyslogd
...done.
getting rsyslog status...
* rsyslogd is not running
Can you confirm that the on-boarding of Azure Monitor for containers should have occurred and how to troubleshoot this further?
Hi, we have two clusters running Container Insights and are paying hundreds of pounds each month for the Log Analytics bill. It has become the most expensive part of our Azure bill. Looking into this, the ContainerInventory table is flooded with many messages a second. On the surface, OMS agent appears to be sending the same messages many times. This seems to be happening on both clusters.
Please could you let us know how we can reduce the volume of data that OMS agent sends to Log Analytics for this table?
The OMS version in use is microsoft/oms:ciprod10162018-2.
Example:
The following Log Analytics query shows the ContainerInventory table contains almost 3 times as much data as any other table.
Usage
| where IsBillable == true
| summarize Quantity=sum(Quantity) by SourceSystem, DataType, Solution
| order by Quantity desc
The result shows the top row has a SourceSystem of OMS, a DataType of ContainerInventory, and a Solution of ContainerInsights. We're ingesting 8 GB per week on our test environment just to that table.
Running the following query on that table helps show the suspected duplicate entries. Note that kube-dns was picked as a common example but the problem occurs for every container.
ContainerInventory
| where TimeGenerated > ago(1d) and ContainerHostname startswith "kube-dns"
| order by TimeGenerated desc
The result shows lots of rows with unique values in the TimeGenerated [UTC] column, but duplicate values in all other columns. A good indicator is to check the values of the CreatedTime [UTC] and StartedTime [UTC] columns at the end - they seem to be exactly the same for many different values of TimeGenerated. This implies that the same Kubernetes events are being reported by the OMS agent to Log Analytics many times over.
Please could you let us know how we can reduce the volume of data that OMS agent sends to Log Analytics for this ContainerInventory table as the cost impact is currently a problem?
We are deploying the OMS agent using the addon in the ARM template:
"addonProfiles": {
"omsagent": {
"enabled": true,
"config": {
"logAnalyticsWorkspaceResourceID": "[parameters('logAnalyticsWorkspaceResourceId')]"
}
}
}
We have five OMS pods running as a result (there are currently four nodes in the test cluster this is taken from):
omsagent-d9cvb 1/1 Running 1 20h
omsagent-drz85 1/1 Running 3 7d
omsagent-p6jts 1/1 Running 4 7d
omsagent-rs-ccf8b9699-9976m 1/1 Running 0 2d
omsagent-swwpq 1/1 Running 4 7d
Thanks!
This might not be the right repo for this issue... please do point me at a more appropriate place if not!
Is there a way to filter application logs from an AKS cluster based on the log contents - so that logs with, e.g., a specific JSON field can go to a different Azure Monitor instance?
We have a deployment where we need to send application audit logs - which just go into the container logstream with a specific flag in the JSON log body - to a space with tighter access controls and longer log retention than the main bulk of the application logs.
As far as I can tell, all container logs just get forwarded through the OMS agent into a single table in a configured workspace - there's no way to customize this on an AKS cluster, apart from the config options in kubernetes/container-azm-ms-agentconfig.yaml, which only allow stripping logs from specific namespaces.
Is there a way to get the agent to fork the container logs based on, e.g., fluentd configuration? Or to deploy some additional containers onto the cluster to intercept the logs and do this filtering?
I work on an internal security team, and one of our tools flagged an older critical vulnerability for:
mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03262021
Bundler 1.x might allow remote attackers to inject arbitrary Ruby code into an application by leveraging a gem name collision on a secondary source.
Installed Resource: bundler 1.10.6
Fixed Version: 1.11.0rc1
Published by NVD 2016-12-22
CVSS Score NVD CVSSv3: 9.8
Remediation
Upgrade package bundler to version 1.11.0rc1 or above.
Would it be possible for this project to upgrade the bundler to 1.11.0rc1 or above to remediate this issue?
I have a bunch of Grafana dashboards showing graphs for container_cpu_usage_seconds_total and container_memory_usage_bytes filtered by our applications. I've had a look, and while I'm exporting the Prometheus data into Log Analytics, I don't think the OMS agent exports any container_* stats?
I am trying to create a metric chart on a dashboard which shows container utilisation - is this data retrievable from Container Insights? I can't seem to find anything at the container level, only the node level.
Thanks
My pods have had the following annotations for a few weeks:
I have deployed a ConfigMap with the following settings, also a few weeks ago:
prometheus-data-collection-settings: |-
[prometheus_data_collection_settings.cluster]
interval = "1m"
fieldpass = ["mendix_concurrent_user_sessions", "mendix_connection_bus", "mendix_current_request_duration_seconds_bucket", "mendix_current_request_duration_seconds_count", "mendix_current_request_duration_seconds_sum", "mendix_jvm_memory_bytes", "mendix_jvm_memory_pool_bytes", "mendix_license_count", "mendix_named_users", "mendix_runtime_requests_total", "mendix_threadpool_handling_external_requests"]
monitor_kubernetes_pods = true
monitor_kubernetes_pods_namespaces = ["dev-apps"]
[prometheus_data_collection_settings.node]
interval = "1m"
This Microsoft Learn article tells me where I need to look to query Prometheus logs, which in turn points me to this article.
When I write my query in the Log Analytics Workspace e.g.:
InsightsMetrics
| where Namespace contains "prometheus"
| summarize by Name
I only get the following results:
So, where are the other metrics that can be seen in my ConfigMap's fieldpass property? Am I missing something?
We should add retries in main.sh here:
#Setting environment variable for CAdvisor metrics to use port 10255/10250 based on curl request
echo "Making wget request to cadvisor endpoint with port 10250"
#Defaults to use port 10255
cAdvisorIsSecure=false
RET_CODE=`wget --server-response https://$NODE_IP:10250/stats/summary --no-check-certificate --header="Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" 2>&1 | awk '/^ HTTP/{print $2}'`
if [ $RET_CODE -eq 200 ]; then
cAdvisorIsSecure=true
fi
Hello, the monitoring of metrics per container is so expensive for us that we would either not use it at all, or preferably tune down the monitoring amount. We don't really need metrics every minute; every 2 minutes would be fine for us. So it would be nice if there were a way to configure the interval at which container metrics are logged.
Hello,
Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (a DaemonSet with 2 instances and a ReplicaSet with 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.
I'm facing an issue with Helm while installing the Azure Arc enabled Kubernetes agent in my Kubernetes cluster, following the document below:
https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-enable-arc-enabled-clusters
At the end of the ps1 installation script, it shows a not-found message like Error: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2: not found.
I assume the ps1 script (and also the bash script) has the wrong version number.
After changing the version number to 2.8.1 by editing the script, the installation finished properly.
The log output of installation:
...
Helm version : version.BuildInfo{Version:"v3.5.3", GitCommit:"041ce5a2c17a58be0fcd5f5e16fb3e7e95fea622", GitTreeState:"dirty", GoVersion:"go1.15.8"}
Installing or upgrading if exists, Azure Monitor for containers HELM chart ...
pull the chart from mcr.microsoft.com
pull the chart from mcr.microsoft.com
2.8.2: Pulling from mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers
Error: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2: not found
export the chart from local cache to current directory
Error: Chart not found: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2
helmChartRepoPath is : ./azuremonitor-containers
using provided kube-context: minikube
Release "azmon-containers-release-1" does not exist. Installing it now.
Error: path "./azuremonitor-containers" not found
Successfully enabled Azure Monitor for containers for cluster: /subscriptions/************/resourceGroups/************/providers/Microsoft.Kubernetes/connectedClusters/************
Proceed to https://aka.ms/azmon-containers to view your newly onboarded Azure Managed cluster
And here is how I downloaded the ps1 script (same as the document's guide):
Invoke-WebRequest https://aka.ms/enable-monitoring-powershell-script -OutFile enable-monitoring.ps1
PS:
If Azure Arc enabled Kubernetes is already GA, I believe the URL doesn't have to contain preview anymore.
I'd appreciate it if this URL were fixed properly.
Thank you.
I raised an issue here: oliver006/redis_exporter#573, which looks to be caused by "Input.prometheus plugin unable to parse #HELP comment" (influxdata/telegraf#8366, influxdata/telegraf#8545).
Hello,
I am running Container Insights on a machine that does not have Kubernetes, and I want to disable sending logs from the stdout stream.
In the documentation (https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-agent-config), the only way to do that is by using a Kubernetes ConfigMap. As I do not have Kubernetes installed, I tried to configure this by setting the environment variable AZMON_LOG_EXCLUSION_REGEX_PATTERN=stdout (referenced in some files at /build/windows/installer/conf/), but had no success.
I also tried to write a settings file at /etc/config/settings/log-data-collection-settings, but had no success.
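For reference, the content I wrote into that settings file mirrors the documented ConfigMap schema (a sketch):

[log_collection_settings]
  [log_collection_settings.stdout]
    # disable collecting container stdout logs
    enabled = false
  [log_collection_settings.stderr]
    enabled = true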
Is there any way that I can disable sending stdout logs to azure workspace?
When fluentd is run, it uses /etc/fluent/kube.conf (Docker-Provider/kubernetes/linux/main.sh, line 769 in a697f5a). Should it use /etc/config/kube.conf instead?
The provider currently gives:
CLASS=Container_ContainerInventory:CIM_ManagedElement
CLASS=Container_ContainerStatistics:CIM_StatisticalData:CIM_ManagedElement
CLASS=Container_DaemonEvent:CIM_ManagedElement
CLASS=Container_ImageInventory:CIM_ManagedElement
CLASS=Container_ContainerLog:CIM_ManagedElement
CLASS=Container_HostInventory:CIM_ManagedElement
CLASS=Container_Process:CIM_ManagedElement
It would be nice to monitor swarm services
/etc/opt/omi/conf# docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
dkq2zy0opjvg monitoring_telegraf global 2/2 telegraf:latest
/etc/opt/omi/conf# docker service ps dkq2zy0opjvg
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
hzstnm31un4y monitoring_telegraf.mgnljv2zuh74cfhn3n3z0znck telegraf:latest node01 Running Running 11 days ago
qms85linbux6 monitoring_telegraf.8k70kl8rzbz0zqu8345wivmpo telegraf:latest node02 Running Running 11 days ago
We have noticed that the OMI provider for Docker tends to cause the Docker daemon (dockerd) to spin at 100% CPU.
We think the issue is related to the statistics metrics being queried too aggressively, as this is a known pitfall of the docker stats system.
The omiagent process keeps crashing every minute and is filling up the filesystem.
I have traced the problem to the docker-cimprov-1.0.0-32 provider. Please see below for details.
Version:
# /opt/omi/bin/omiserver -v
/opt/omi/bin/omiserver: OMI-1.4.2-5 - Wed Jul 25 10:59:15 PDT 2018
omiserver.log:
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40012 Priority=INFO (S)Socket: 0x1b1d850, closing connection (mask 2)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40033 Priority=INFO Selector_RemoveHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40011 Priority=INFO (E)done with receiving msg(0x1a52618:4099:EnumerateInstancesReq:e005)
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40039 Priority=INFO New request received: command=(EnumerateInstancesReq), namespace=(root/cimv2), class=(Container_DaemonEvent)
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5dff88, handler=0x1a56c40, name=null
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40005 Priority=INFO _SendRequestToAgent msg(0x1a576a8:15:BinProtocolNotification:13), from original operationId: 0 to 13
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40005 Priority=INFO _SendRequestToAgent msg(0x1a576a8:4099:EnumerateInstancesReq:14), from original operationId: e005 to 14
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40011 Priority=INFO (S)done with receiving msg(0x1b2db28:36:VerifySocketConn:0)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40011 Priority=INFO (S)done with receiving msg(0x1b2ed68:34:CreateAgentMsg:0)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40012 Priority=INFO (S)Socket: 0x1b1d850, closing connection (mask 2)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40033 Priority=INFO Selector_RemoveHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92025,92025] WARNING: null(0): EventId=30209 Priority=WARNING child process with PID=[92081] terminated abnormally
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40011 Priority=INFO (E)done with receiving msg(0x1a54908:4:PostResultMsg:14)
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5dff88, handler=0x1a525e0, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40028 Priority=INFO (E)Socket: 0x1a54aa0, Connection Closed while reading header
core.92081 :
[New LWP 92081]
[New LWP 92082]
[New LWP 92083]
[New LWP 92085]
[New LWP 92089]
[New LWP 92088]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/omi/bin/omiagent 9 10 --destdir / --providerdir /opt/omi/lib --loglevel IN'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fc0fdb13240 in std::allocator<std::pair<std::string const, unsigned long long> >::~allocator() () from /opt/omi/lib/libcontainer.so
(gdb) bt
#0 0x00007fc0fdb13240 in std::allocator<std::pair<std::string const, unsigned long long> >::~allocator() () from /opt/omi/lib/libcontainer.so
#1 0x00007fc0fdb13fcb in std::_Miter_base<mi::Container_ContainerStatistics_Class*>::iterator_type std::__miter_base<mi::Container_ContainerStatistics_Class*>(mi::Container_ContainerStatistics_Cla
ss*) () from /opt/omi/lib/libcontainer.so
#2 0x00007fc0fdb12986 in __gnu_cxx::__normal_iterator<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > >*, std::v
ector<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > >, std::allocator<std::map<std::string, unsigned long long,
std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > > > > >::operator*() const () from /opt/omi/lib/libcontainer.so
#3 0x00007fc0fdb31db8 in ensure(printbuffer*, unsigned long) () from /opt/omi/lib/libcontainer.so
#4 0x0000000000408707 in ?? ()
#5 0x0000000000404952 in ?? ()
#6 0x0000000000468f9d in ?? ()
#7 0x0000000000465227 in ?? ()
#8 0x00000000004632d3 in ?? ()
#9 0x000000000044c092 in ?? ()
#10 0x000000000046205b in ?? ()
#11 0x00000000004632d3 in ?? ()
#12 0x000000000044f37a in ?? ()
#13 0x000000000044fff8 in ?? ()
#14 0x000000000046b23d in ?? ()
#15 0x0000000000404ffc in ?? ()
#16 0x00007fc10845c3d5 in __libc_start_main (main=0x4052b0, argc=9, ubp_av=0x7ffc84310a08, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc843109f8)
at ../csu/libc-start.c:274
#17 0x0000000000404699 in ?? ()
#18 0x00007ffc843109f8 in ?? ()
#19 0x000000000000001c in ?? ()
#20 0x0000000000000009 in ?? ()
#21 0x00007ffc84310f36 in ?? ()
#22 0x00007ffc84310f4c in ?? ()
#23 0x00007ffc84310f4e in ?? ()
#24 0x0000000000000000 in ?? ()
I installed Docker on Windows Server 2019 with DockerProvider, using this code:
Install-Module DockerProvider
Install-Package Docker -ProviderName DockerProvider -RequiredVersion preview
[Environment]::SetEnvironmentVariable("LCOW_SUPPORTED", "1", "Machine")
After that, I installed Docker Compose with this code:
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
Invoke-WebRequest "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-Windows-x86_64.exe" -UseBasicParsing -OutFile $Env:ProgramFiles\Docker\docker-compose.exe
After that, I used this docker-compose file:
version: "3.5"
services:
rabbitmq:
# restart: always
image: rabbitmq:3-management
container_name: rabbitmq
ports:
- 5672:5672
- 15672:15672
networks:
- myname
# network_mode: host
volumes:
- rabbitmq:/var/lib/rabbitmq
networks:
myname:
name: myname-network
volumes:
rabbitmq:
driver: local
Everything is OK up to here, but after I call the http://localhost:15672/ URL in my browser, rabbitmq crashes and I see this error in docker logs <container-id>:
Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only
This .yml file works correctly in Docker for Windows, but after running the file on Windows Server, I see this error.
Hi,
In a project I'm part of, there is a security concern regarding the omsagent pods, deployed into an AKS cluster, running as the root user. The agent maps /var/log from the kubelet (node), accessing the logs, effectively running as a root process on the node. We understand that consuming /var/log requires root.
The question is, how much additional hardening has Microsoft done with the omsagent, and can we apply additional hardening that makes it "secure enough"? It would be nice to get a point of view on the matter.
The "attack vector" is through the /var/log
filesystem. If it manages to mount up files into this directory somehow. It would require an attacker to break into the omsagent.
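For context, the kind of additional hardening we would normally apply to a pod looks like this (a generic sketch, not something we know the omsagent supports):

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]   # the agent would presumably still need root (or specific capabilities) for /var/log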
Trace-data:
rune@Azure:~$ kubectl get pods -n kube-system | grep -i omsagent
omsagent-7cb7z 1/1 Running 0 18h
omsagent-rs-7c7b6c8d5b-h2zvh 1/1 Running 0 18h
rune@Azure:~$ kubectl exec omsagent-rs-7c7b6c8d5b-h2zvh -n kube-system -- id
uid=0(root) gid=0(root) groups=0(root)
rune@Azure:~$ kubectl exec -it -n kube-system omsagent-rs-7c7b6c8d5b-h2zvh -- /bin/sh
# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 18504 3140 ? Ss Jun15 0:00 /bin/bash /opt/main.sh
root 30 0.0 0.0 6704 116 ? S Jun15 0:00 inotifywait /etc/config/settings --daemon --recursive --outfile /opt/inotifyoutput.txt --event create,delete --format %e : %T --timefmt +%s
syslog 223 0.0 0.0 129672 4208 ? Ssl Jun15 0:02 /usr/sbin/rsyslogd
omsagent 263 0.6 0.7 394724 55132 ? Sl Jun15 6:48 /opt/microsoft/omsagent/ruby/bin/ruby /opt/microsoft/omsagent/bin/omsagent-32ed830c-9fbe-4a37-8b4d-a990f2a873f8 -d /var/opt/microsoft/omsagent/32ed830c-9fbe-4a37-8b4d-a990f2a873f8/run/omsagent.pid --no-supervisor -o /var/opt/micros
root 294 0.0 0.0 28356 2676 ? Ss Jun15 0:00 /usr/sbin/cron
root 338 0.0 0.6 150128 46848 ? Sl Jun15 0:04 /opt/td-agent-bit/bin/td-agent-bit -c /etc/opt/microsoft/docker-cimprov/td-agent-bit-rs.conf -e /opt/td-agent-bit/bin/out_oms.so
root 348 0.0 0.5 198028 38512 ? Sl Jun15 0:21 /opt/telegraf --config /etc/opt/microsoft/docker-cimprov/telegraf-rs.conf
root 369 0.0 0.0 4536 768 ? S Jun15 0:00 sleep inf
root 57390 0.0 0.0 4628 772 pts/0 Ss+ 06:26 0:00 /bin/sh
root 59470 0.0 0.0 4628 820 pts/1 Ss 07:02 0:00 /bin/sh
root 59477 0.0 0.0 34404 2856 pts/1 R+ 07:03 0:00 ps -aux
# exit
rune@Azure:~$
I am not sure if this is the correct repository to open this issue, but:
It would be great if it were possible to somehow configure a priority class for the omsagent within an AKS cluster.
Currently, that does not seem possible - or at least I could find no documentation whatsoever about it.
In our scenario, we want to give the cluster monitoring a somewhat higher priority than most of the other services - so that in case of an error, we won't be flying blind.
The only workaround I can think of is to use a global default class - however, that is not really feasible, as there might be other, unimportant pods without a PriorityClass within the cluster that would suddenly be ranked way higher than they should be. For illustration, see the sketch below.
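The global-default workaround would look something like this (a sketch; the name and value are mine):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-default   # hypothetical name
value: 100000
globalDefault: true          # every pod without an explicit priorityClassName inherits this
description: "Workaround so the omsagent pods inherit an elevated priority"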
Is such a feature possible?
I am in the process of implementing the alarms based on the ARM templates. Comparing the alarms created via the template with those from the portal, it seems the metrics chosen and the thresholds are different from what the portal creates for the corresponding alarms in the "Recommended alerts (Preview)" pane.
Maybe it is not wrong, but I would like to understand why the difference.
Examples:
Alarm on the portal: "(New) Container CPU %"
Description: "Average CPU percent is greater than the configured threshold (default is 95%)"
Metric used on the portal: cpuThresholdExceeded > 0
Metric used on the ARM template: cpuExceededPercentage > 95
Alarm on the portal: "(New) Container working set memory %"
Description: "Average working set memory percent is greater the configured threshold (default is 95%)"
Metric used on the portal: memoryworkingsetthresholdviolated > 0
Metric used on the ARM template: memoryWorkingSetExceededPercentage > 95
Aggregation Type / Aggregation Granularity are not populating on the portal page, and emails are not triggering, but I see values in the exported template.
The alerts were created using an SPN account, and I am trying to view them through my enterprise subscription. I guess I am missing some access, or there is a restriction at my organization level. Kindly let me know whether any additional RBAC is needed at the cluster level.
Hi guys,
I am getting the below error when I try to install the azuremonitor-containers Helm chart on Arc enabled Kubernetes v1.16.
Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "AzureClusterIdentityRequest" in version "clusterconfig.azure.com/v1beta1"
It looks like it does not support Kubernetes version 1.16. Could you please help with this? Thanks.
Due to CVE-2021-38645, CVE-2021-38649, CVE-2021-38648, and CVE-2021-38647 ( https://msrc-blog.microsoft.com/2021/09/16/additional-guidance-regarding-omi-vulnerabilities-within-azure-vm-management-extensions/ ), a new version of the ciprod-Docker-Image was released ( mcr.microsoft.com/azuremonitor/containerinsights/ciprod:microsoft-oms-latest , from the same website).
The Helm chart Docker-Provider/charts/azuremonitor-containers/ (version 2.8.1) is incompatible with this image. There are (at least) two problems:
The Helm chart in the ci_prod branch (2.8.3) also points to a vulnerable version of ciprod (ciprod04222021).
Because of this, we need an updated Helm chart URGENTLY.
Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10052020 (windows)
Should it be "Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10272020"?
@vishiy