microsoft / Docker-Provider
Azure Monitor for Containers
License: Other
Version: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01312022
Platform: AKS
I'm using the proposed configuration template [1], and it seems to get loaded (according to the logs). However, the omsagents still do not respect the arrays of excluded namespaces. If I disable export of stdout and stderr entirely, that takes effect, but filtering namespaces (while collection is enabled) does not work.
This is a duplicate of #737, in which I posted a comment (as no solution is mentioned there) but got no answer yet.
Since it is a closed issue, I don't think many people will look into it, so let me post a new issue.
I have an AKS cluster set up with Container Insights enabled.
My Log Analytics workspace contains a lot of logs that I don't use, so I want to limit the collected logs.
I created the ConfigMap, based on this template, in the kube-system namespace of my cluster.
When calling kubectl edit configmap container-azm-ms-agentconfig -n kube-system, I get the following:
I have 3 separate namespaces: kube-system, grafana-namespace, and apps-namespace.
I only want to capture the last one's logs.
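For clarity, the relevant part of my ConfigMap follows the public template and looks roughly like this (a sketch; the full file contains more settings):

log-data-collection-settings: |-
  # Keep collection enabled, but exclude the two noisy namespaces
  [log_collection_settings]
    [log_collection_settings.stdout]
      enabled = true
      exclude_namespaces = ["kube-system", "grafana-namespace"]
    [log_collection_settings.stderr]
      enabled = true
      exclude_namespaces = ["kube-system", "grafana-namespace"]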
While checking one of the omsagent-... pods' logs, I get the following:
Both stdout & stderr log collection are turned off for namespaces: '*.csv2,*_kube-system_*.log,*_grafana-namespace_*.log'
****************End Config Processing********************
****************Start Config Processing********************
config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values
config::Successfully parsed mounted config map
So the configuration itself has no errors in it, and is applied properly.
Now when I check the Log Analytics workspace, I still see data in there referring to kube-system and grafana-namespace.
So, for example, this query returns results for the last 5 minutes, while the ConfigMap has already been deployed for a week or so:
KubePodInventory
| where Namespace == "kube-system"
According to the Microsoft docs, the ConfigMap should reduce log ingestion if you exclude a namespace.
The main question is: what did I do wrong in the configuration, or am I wrong in thinking that KubePodInventory shouldn't contain any data for the excluded namespaces?
We are getting the below error when we run:
curl -sL https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/scripts/onboarding/aksengine/kubernetes/AddMonitoringOnboardingTags.sh | bash -s
No k8s-master VMs or VMSSes found in the specified resource group:
Where it says "Create and use a parameters file as a JSON" on the following page, the linked doc is a review page, not the public docs.
https://github.com/microsoft/Docker-Provider/tree/ci_dev/alerts/recommended_alerts_ARM
It links to:
https://review.docs.microsoft.com/en-us/azure/azure-resource-manager/templates/parameter-files
It should link to:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/parameter-files
Been experiencing this issue for some time, and not just on one client.
Running Sitecore Containers in Azure.
After raising a support request, we were advised that the response from MS Support was:
"After discussed with our container product team, seems it’s an known issue for windows containerd. And now the windows contained is an opensource and maintained by community, which means any issue regarding contained issue, we have to raise an issue to the community for the tracking. Thanks for your understanding!"
So I am raising the issue here for help. There are obviously issues with restarts, as you can see in the image, though it seems that is being investigated separately by MS.
We run more than a thousand short-lived jobs on our cluster every day. These jobs stay in the "Completed" state for some time. As a result, we reach our Log Analytics size quota much earlier than expected, because performance metrics are written every minute for every completed job along with other pods (as far as I understand, in_kube_perfinventory.rb is responsible for that).
Can completed pods be excluded from performance traces?
Thanks
From the "About" and also the contents, this repo focuses on AzMon for containers. This makes the current name "Docker-Provider" quite bizarre and confusing.
Hi,
Please add to Container_HostInventory the additional properties returned by the generic Docker API: Containers, ContainersRunning, ContainersPaused, ContainersStopped, and Images.
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
I am currently configuring the ConfigMap that is used by the OMS-agent pods. What I want to achieve is sending Prometheus metrics to a log analytics workspace.
For this I am following this Microsoft docs page.
On that page we can see this:
prometheus.io/scrape: "true"
prometheus.io/path: "/mymetrics"
prometheus.io/port: "8000"
prometheus.io/scheme: "http"
And then in the table under Cluster-wide, we have keys like these:
So, my understanding is that a user can specify, in the annotations of an application pod, which port the OMS agent has to look at.
In my case, I have a pod with the annotation prometheus.io/port=8900, and the default mentioned in the documentation is 9102.
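For context, the pod spec with that annotation looks roughly like this (a sketch; the names are made up):

apiVersion: v1
kind: Pod
metadata:
  name: my-app                        # hypothetical pod name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8900"        # the port I expect the agent to scrape
spec:
  containers:
    - name: my-app
      image: myregistry/my-app:latest # hypothetical image
      ports:
        - containerPort: 8900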
I tried to specify the following TOML in the ConfigMap:
prometheus-data-collection-settings: |-
[prometheus_data_collection_settings.cluster]
interval = "1m"
fieldpass = ["platform_user_sessions", "platform_connection_bus"]
monitor_kubernetes_pods = true
monitor_kubernetes_pods_namespaces = ["dev-group-apps"]
prometheus.io/port = 8900
Once the ConfigMap is read by the OMS-agent, I get the following error in the logs of the pod:
"config::error::Exception while parsing config map for prometheus config: \nparse error on value \"/\" (error), using defaults, please check config map for errors"
When I comment out the prometheus.io/port = 8900 line, it is parsed successfully.
I started to look in the source code to find the error, and what it does when it successfully parses the configmap.
There I bumped into these statements:
There is no prometheus key read from the parsedConfig, so I am definitely doing something wrong.
How can we specify that the OMS agent has to look at a different port via the prometheus.io/port annotation, or is my understanding of this completely wrong?
We've seen cases where Kubernetes life-cycle events for Pods (e.g. Killed) could be seen in the output from kubectl get events, but did not show up in OMS Log Analytics.
It looks like there may be an issue with the way previously seen events are being tracked. In in_kube_events.rb, the uuid from the event is used to track which events have already been seen. If the uuid is already in the KubeEventsStateFile, then the event is skipped; otherwise it's routed to the registered outputs.
The issue is that for some events (e.g. Pod events), the uuid does not change when the event occurs again for the same Pod. Instead, the count and lastTimestamp property values are updated.
Here's an example of a Pod where there were multiple Killing events. In this case the uuid is fb90522d-a65a-11e7-bafb-000d3a36fbf1. The first event has count: 970.
{
"metadata": {
"name": "liveness-exec.14e9557aa4eccb1d",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/events/liveness-exec.14e9557aa4eccb1d",
"uid": "fb90522d-a65a-11e7-bafb-000d3a36fbf1",
"resourceVersion": "3815919",
"creationTimestamp": "2017-10-01T03:45:35Z"
},
"involvedObject": {
"kind": "Pod",
"namespace": "default",
"name": "liveness-exec",
"uid": "9089e7fd-a3be-11e7-9da0-000d3a36fbf1",
"apiVersion": "v1",
"resourceVersion": "3302183",
"fieldPath": "spec.containers{liveness}"
},
"reason": "Killing",
"message": "(events with common reason combined)",
"source": {
"component": "kubelet",
"host": "k8s-agentpool1-39011252-2"
},
"firstTimestamp": "2017-10-01T03:45:35Z",
"lastTimestamp": "2017-10-05T00:37:45Z",
"count": 970,
"type": "Normal"
}
A subsequent Killing event for the same Pod (count: 971) followed, but would have been skipped because it has the same uuid as the first event.
{
"metadata": {
"name": "liveness-exec.14e9557aa4eccb1d",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/events/liveness-exec.14e9557aa4eccb1d",
"uid": "fb90522d-a65a-11e7-bafb-000d3a36fbf1",
"resourceVersion": "3816442",
"creationTimestamp": "2017-10-01T03:45:35Z"
},
"involvedObject": {
"kind": "Pod",
"namespace": "default",
"name": "liveness-exec",
"uid": "9089e7fd-a3be-11e7-9da0-000d3a36fbf1",
"apiVersion": "v1",
"resourceVersion": "3302183",
"fieldPath": "spec.containers{liveness}"
},
"reason": "Killing",
"message": "(events with common reason combined)",
"source": {
"component": "kubelet",
"host": "k8s-agentpool1-39011252-2"
},
"firstTimestamp": "2017-10-01T03:45:35Z",
"lastTimestamp": "2017-10-05T00:43:30Z",
"count": 971,
"type": "Normal"
}
We've tried an experiment: concatenating the count property with the uuid property to construct the eventId used for tracking seen events. That seems to resolve the issue of events being skipped. (Another option would have been to use the lastTimestamp property.) However, we also see periods of time where events are not forwarded at all, but that seems unrelated to this.
The bash script to enable monitoring for Arc clusters fails if the output format is not json.
An example when the output format is set to table:
$ bash enable-monitoring.sh --resource-id $azureArcClusterResourceId --kube-context $kubeContext --workspace-id $logAnalyticsWorkspaceResourceId
...
validating cluster identity
cluster identity type: result -------------- systemassigned
-e only supported cluster identity is systemassigned for Azure ARC K8s cluster type
The az command formatting only works for single-line output.
In addition, a script does not fit our automation workflows; it would be useful to have ARM/Terraform code to configure the workspace, followed by instructions to install the Helm chart natively.
In an AKS environment, logs are lost when containers output a large volume of logs.
I forced a change to Mem_Buf_Limit from its default value in the container of omsagent's DaemonSet, and this improved the situation.
It would be very helpful if the Mem_Buf_Limit setting of td-agent-bit could be changed using the ConfigMap; see the sketch after the config dump below.
# cat /etc/opt/microsoft/docker-cimprov/td-agent-bit.conf
[INPUT]
Name tail
Tag oms.container.log.la.*
Path ${AZMON_LOG_TAIL_PATH}
DB /var/log/omsagent-fblogs.db
DB.Sync Off
Parser cri
Mem_Buf_Limit 10m <------------------------------- I would like to change this parameter
Rotate_Wait 20
Refresh_Interval 30
Path_Key filepath
Skip_Long_Lines On
Ignore_Older 5m
Exclude_Path ${AZMON_CLUSTER_LOG_TAIL_EXCLUDE_PATH}
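Something along these lines in the ConfigMap would work for us (a purely hypothetical sketch of the setting I am asking for; the section and key names are my suggestion, not an existing schema):

agent-settings: |-
  # hypothetical section - proposed, not currently supported
  [agent_settings.fbit_config]
    tail_mem_buf_limit_megabytes = "50"   # would replace the hardcoded Mem_Buf_Limit 10m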
Hello,
Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (a DaemonSet with 2 instances and a ReplicaSet with 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.
Reopening because the last issue was closed automatically.
#624
Hello!
Is there a way to provide a custom Prometheus config (and relabel_config in particular) to the OMS agent? The AWS equivalent of the OMS agent, the ADOT Collector, has this feature. We'd rather not self-host Prometheus since the OMS agent is so convenient, but this is a bit of a blocker for us.
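For reference, the kind of relabel_config we would want to supply looks like this in a standard Prometheus setup (a sketch with made-up names, not something the OMS agent accepts today):

scrape_configs:
  - job_name: my-app                # hypothetical job
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy the pod's "app" label onto every scraped metric
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      # keep only targets from dev namespaces
      - source_labels: [__meta_kubernetes_namespace]
        regex: dev-.*
        action: keep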
Thanks!
Michael
From this doc: https://github.com/microsoft/Docker-Provider/blob/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml#L70
we can configure the endpoint so that the agent collects metrics from it. But my endpoint is secured, something like this:
curl -u admin:password http://automation-controller-service.ansible-automation-platform:8080/api/v2/metrics
I need to use the username and password to access it, and I cannot find any configuration option to do this. Do you know how to provide the auth info for my endpoint? Thanks.
It appears that when an image has a tag, a warning is logged in InventoryQuery::SetImageRepositoryImageTag at syslog(LOG_WARNING, "Container image name (%s) is improperly formed and could not be parsed in SetRepositoryImageTag", properties.c_str());
The syslog message does not contain the image name: "Container image name () is improperly...". All the Docker images have names, so the name is available and should be displayed in the message.
I am trying to run fluentd with a custom configuration instead of the predefined one from https://github.com/microsoft/Docker-Provider/blob/ci_prod/build/linux/installer/conf/kube.conf. In particular, I am interested in changing the run_interval for the kube_events plugin.
I was thinking that the configuration from https://github.com/microsoft/Docker-Provider/blob/ci_prod/charts/azuremonitor-containers/templates/ama-logs-rs-configmap.yaml#L5 is the way to go. However, I couldn’t see any place in which that configuration is used, apart from some if-statements.
What is the purpose of that configmap? And is there any way to change the fluentd configuration?
What I did in the end was to change the run_interval directly in /etc/fluent/kube.conf, but I think there should be a better way to achieve this without a workaround.
I am unable to see metrics from the insights/containers & insights/pods namespaces in the Metrics explorer for my AKS cluster.
We are using the mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod08052021 image version.
In the past I was able to see metrics, and I was using oomKilledContainerCount.
Looking over the release notes I didn't see specific changes targeting this area.
Thanks in advance.
Following up on #645, are the Insights.Container/nodes and Insights.Container/containers namespaces deprecated? I too am not seeing any telemetry for them. Specifically, it seems that the following
namespace: Insights.Container/nodes
metrics: cpuUsagePercentage, memoryWorkingSetPercentage
are replaced in favor of
namespace: Microsoft.ContainerService/managedClusters
metrics: node_cpu_usage_percentage, node_memory_rss_percentage
The OMS agent stopped collecting Prometheus metrics with this log:
2021-04-20T02:17:49Z E! [inputs.prometheus] Unable to watch resources: Get "https://XXXXXXXXXXXX.hcp.westeurope.azmk8s.io:443/api/v1/pods?watch=true": context canceled
2021-04-20T02:17:49Z E! [telegraf] Error running agent: input plugins recorded 1 errors
End Telegraf Run in Test Mode**********
starting fluent-bit and setting telegraf conf file for replicaset
nodename: aks-main-38921269-vmss000000
replacing nodename in telegraf config
File Doesnt Exist. Creating file...
Fluent Bit v1.6.8
Telegraf 1.18.0 (git: HEAD ac5c7f6a)
2021-04-20T02:17:49Z I! Starting Telegraf 1.18.0
td-agent-bit 1.6.8
stopping rsyslog...
image tag: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03262021
We had to manually restart the pod to see Prometheus metrics being collected again.
It would be great if you added Start, Stop and Delete methods to the Container_ContainerInventory class.
Hi, I am part of an InfoSec team and we detected the following critical vulnerability in image: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021
Please validate. Thanks in advance!
Critical vulnerability (CVE-2017-10906)
Description: Escape sequence injection vulnerability in Fluentd versions 0.12.29 through 0.12.40 may allow an attacker to change the terminal UI or execute arbitrary commands on the device via unspecified vectors.
Installed Resource: fluentd 0.12.40
Fixed Version: 0.12.41
Published by NVD: 2017-12-08
CVSS Score: NVD CVSSv2 10.0
Remediation: Upgrade package fluentd to version 0.12.41 or above.
Full Path: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.4.0/specifications/fluentd-0.12.40.gemspec
Is it possible to scrape AKS API server metrics using https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration?
As far as I know, authentication (a bearer token) is required to get /metrics from the API server, and I cannot see how this can be set in the monitoring agent config file (https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration#prometheus-scraping-settings).
For a standard Prometheus deployment this can be configured via the bearer_token_file setting.
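For comparison, a standard Prometheus scrape config for the API server looks roughly like this (a sketch using the in-cluster service account token):

scrape_configs:
  - job_name: kubernetes-apiservers   # hypothetical job name
    scheme: https
    kubernetes_sd_configs:
      - role: endpoints
    # authenticate with the pod's service account token
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt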
Hello,
Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (a DaemonSet with 2 instances and a ReplicaSet with 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.
Why is the ReplicaSet even required? It just adds one more instance to a node where the DaemonSet already created one.
Reopening because the last issue was closed without a solution (the posted solution was for a problem that occurred later than the original issue).
#694
Collecting all kube events does not work correctly: only the first 4,000 events are recorded once Normal events are taken into account. The loop that fetches events is missing the logic for when collect_all_kube_events is enabled.
The code at this line: https://github.com/microsoft/Docker-Provider/blob/ci_prod/source/plugins/ruby/in_kube_events.rb#L115 should look like:
if @collectAllKubeEvents
continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}")
else
continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?fieldSelector=type!=Normal&limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}")
end
Following the instructions from the charts/azuremonitor-containers URL, the Helm deployment does not onboard Azure Monitor for containers.
Expected behaviour: following the instructions would automatically onboard Azure Monitor for containers.
Environment: AKS
Kubernetes: 1.19.11
Step 1 and 2 complete successfully with a "Log Analytics Workspace" and "ContainerInsights(iob-dev-westeurope-akstest-workspace)" solution created in the same resource group as the AKS cluster.
Step 3 fails with the error "No k8s-master VMs or VMSSes found in the specified resource group:iob-dev-westeurope-akstest-rg-aks", but looking at the script I am not sure this step applies to AKS.
Helm deployment completes without error.
helm upgrade --install --values=values.yaml azmon-containers microsoft/azuremonitor-containers --namespace kube-system
Release "azmon-containers" does not exist. Installing it now.
W0721 09:12:19.676421 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0721 09:12:19.834125 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0721 09:12:22.908504 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0721 09:12:23.088834 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
NAME: azmon-containers
LAST DEPLOYED: Wed Jul 21 09:12:18 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES: azmon-containers deployment is complete.
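For completeness, the values.yaml passed to Helm is along these lines (a sketch based on my reading of the chart's documented keys; secrets redacted):

omsagent:
  secret:
    wsid: "<log-analytics-workspace-id>"   # workspace GUID (redacted)
    key: "<log-analytics-workspace-key>"   # primary shared key (redacted)
  env:
    clusterName: "akstest"                 # hypothetical cluster name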
Log output from omsagent
kubectl logs omsagent-48t4s -n kube-system
not setting customResourceId
Making curl request to oms endpint with domain: opinsights.azure.com
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl request to oms endpoint succeeded.
****************Start Config Processing********************
Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log'
****************End Config Processing********************
****************Start Config Processing********************
****************Start NPM Config Processing********************
config::npm::Successfully substituted the NPM placeholders into /etc/opt/microsoft/docker-cimprov/telegraf.conf file for DaemonSet
config::Starting to substitute the placeholders in td-agent-bit.conf file for log collection
config::Successfully substituted the placeholders in td-agent-bit.conf file
****************Start Prometheus Config Processing********************
config::No configmap mounted for prometheus custom config, using defaults
****************End Prometheus Config Processing********************
****************Start MDM Metrics Config Processing********************
****************End MDM Metrics Config Processing********************
****************Start Metric Collection Settings Processing********************
****************End Metric Collection Settings Processing********************
Making wget request to cadvisor endpoint with port 10250
Wget request using port 10250 succeeded. Using 10250
Making curl request to cadvisor endpoint /pods with port 10250 to get the configured container runtime on kubelet
configured container runtime on kubelet is : containerd
set caps for ruby process to read container env from proc
aks-system1-34726002-vmss000000
* Starting periodic command scheduler cron
...done.
docker-cimprov 16.0.0.0
DOCKER_CIMPROV_VERSION=16.0.0.0
*** activating oneagent in legacy auth mode ***
setting mdsd workspaceid & key for workspace:68299338-cb11-46a8-a42e-977e476105e4
azure-mdsd 1.10.1-build.master.213
starting mdsd in legacy auth mode in main container...
*** starting fluentd v1 in daemonset
starting fluent-bit and setting telegraf conf file for daemonset
since container run time is containerd update the container log fluentbit Parser to cri from docker
nodename: aks-system1-34726002-vmss000000
replacing nodename in telegraf config
checking for listener on tcp #25226 and waiting for 30 secs if not..
File Doesnt Exist. Creating file...
Fluent Bit v1.6.8
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
waitforlisteneronTCPport found listener on port:25226 in 5 secs
checking for listener on tcp #25228 and waiting for 30 secs if not..
Routing container logs thru v2 route...
waitforlisteneronTCPport found listener on port:25228 in 10 secs
Telegraf 1.18.0 (git: HEAD ac5c7f6a)
2021-07-21T08:12:52Z I! Starting Telegraf 1.18.0
td-agent-bit 1.6.8
stopping rsyslog...
* Stopping enhanced syslogd rsyslogd
...done.
getting rsyslog status...
* rsyslogd is not running
Can you confirm that the on-boarding of Azure Monitor for containers should have occurred and how to troubleshoot this further?
Hi, we have two clusters running Container Insights and are paying hundreds of pounds each month for the Log Analytics bill. It has become the most expensive part of our Azure bill. Looking into this, the ContainerInventory table is flooded with many messages a second. On the surface, OMS agent appears to be sending the same messages many times. This seems to be happening on both clusters.
Please could you let us know how we can reduce the volume of data that OMS agent sends to Log Analytics for this table?
The OMS version in use is microsoft/oms:ciprod10162018-2.
Example:
The following Log Analytics query shows the ContainerInventory table contains almost 3 times as much data as any other table.
Usage
| where IsBillable == true
| summarize Quantity=sum(Quantity) by SourceSystem, DataType, Solution
| order by Quantity desc
The result shows the top row has a SourceSystem of OMS, a DataType of ContainerInventory, and a Solution of ContainerInsights. We're ingesting 8 GB per week on our test environment just to that table.
Running the following query on that table helps show the suspected duplicate entries. Note that kube-dns was picked as a common example but the problem occurs for every container.
ContainerInventory
| where TimeGenerated > ago(1d) and ContainerHostname startswith "kube-dns"
| order by TimeGenerated desc
The result shows lots of rows with unique values in the TimeGenerated [UTC] column, but duplicate values in all other columns. A good indicator is to check the values of the CreatedTime [UTC] and StartedTime [UTC] columns at the end - they seem to be exactly the same for many different values of TimeGenerated. This implies that the same Kubernetes events are being reported by the OMS agent to Log Analytics many times over.
Please could you let us know how we can reduce the volume of data that OMS agent sends to Log Analytics for this ContainerInventory table as the cost impact is currently a problem?
We are deploying the OMS agent using the addon in the ARM template:
"addonProfiles": {
"omsagent": {
"enabled": true,
"config": {
"logAnalyticsWorkspaceResourceID": "[parameters('logAnalyticsWorkspaceResourceId')]"
}
}
}
We have five OMS pods running as a result (there are currently four nodes in the test cluster this is taken from):
omsagent-d9cvb 1/1 Running 1 20h
omsagent-drz85 1/1 Running 3 7d
omsagent-p6jts 1/1 Running 4 7d
omsagent-rs-ccf8b9699-9976m 1/1 Running 0 2d
omsagent-swwpq 1/1 Running 4 7d
Thanks!
This might not be the right repo for this issue... please do point me at a more appropriate place if not!
Is there a way to filter application logs from an AKS cluster based on the log contents - so that logs with, e.g., a specific JSON field can go to a different Azure Monitor instance?
We have a deployment where we need to send application audit logs - which just go into the container logstream with a specific flag in the JSON log body - to a space with tighter access controls and longer log retention than the main bulk of the application logs.
As far as I can tell, all container logs just get forwarded through the OMS agent into a single table in a configured workspace - there's no way to customize this on an AKS cluster, apart from the config options in kubernetes/container-azm-ms-agentconfig.yaml, which only allow stripping logs from specific namespaces.
Is there a way to get the agent to fork the container logs based on, e.g., fluentd configuration? Or to deploy some additional containers onto the cluster to intercept the logs and do this filtering?
I work on an internal security team, and one of our tools flagged an older critical vulnerability for:
mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03262021
Bundler 1.x might allow remote attackers to inject arbitrary Ruby code into an application by leveraging a gem name collision on a secondary source.
Installed Resource: bundler 1.10.6
Fixed Version: 1.11.0rc1
Published by NVD 2016-12-22
CVSS Score NVD CVSSv3: 9.8
Remediation
Upgrade package bundler to version 1.11.0rc1 or above.
Would it be possible for this project to upgrade the bundler to 1.11.0rc1 or above to remediate this issue?
I have a bunch of Grafana dashboards showing graphs for container_cpu_usage_seconds_total and container_memory_usage_bytes filtered by our applications. I've had a look, and while I'm exporting the Prometheus data into Log Analytics, I don't think the OMS agent exports any container_* stats?
I am trying to create a metric chart on a dashboard which shows container utilisation - is this data retrievable from Container Insights? I can't seem to find anything at the container level, only the node level.
Thanks
My pods have had the following annotations for a few weeks:
I have deployed a ConfigMap with the following settings, also a few weeks ago:
prometheus-data-collection-settings: |-
[prometheus_data_collection_settings.cluster]
interval = "1m"
fieldpass = ["mendix_concurrent_user_sessions", "mendix_connection_bus", "mendix_current_request_duration_seconds_bucket", "mendix_current_request_duration_seconds_count", "mendix_current_request_duration_seconds_sum", "mendix_jvm_memory_bytes", "mendix_jvm_memory_pool_bytes", "mendix_license_count", "mendix_named_users", "mendix_runtime_requests_total", "mendix_threadpool_handling_external_requests"]
monitor_kubernetes_pods = true
monitor_kubernetes_pods_namespaces = ["dev-apps"]
[prometheus_data_collection_settings.node]
interval = "1m"
This Microsoft Learn article tells me where I need to look to query Prometheus logs, which in turn points me to this article.
When I write my query in the Log Analytics Workspace e.g.:
InsightsMetrics
| where Namespace contains "prometheus"
| summarize by Name
I only get the following results:
So, where are the other metrics that can be seen in my ConfigMap's fieldpass property? Am I missing something?
We should add retries in main.sh here:
#Setting environment variable for CAdvisor metrics to use port 10255/10250 based on curl request
echo "Making wget request to cadvisor endpoint with port 10250"
#Defaults to use port 10255
cAdvisorIsSecure=false
RET_CODE=`wget --server-response https://$NODE_IP:10250/stats/summary --no-check-certificate --header="Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" 2>&1 | awk '/^ HTTP/{print $2}'`
if [ $RET_CODE -eq 200 ]; then
cAdvisorIsSecure=true
fi
Hello, the monitoring of metrics per container is so expensive for us that we would either not use it at all, or preferably tune down the monitoring amount. We don't really need metrics every minute; every 2 minutes would be fine for us. So it would be nice if there were a way to configure the interval at which container metrics are logged.
Hello,
Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (a DaemonSet with 2 instances and a ReplicaSet with 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.
I'm facing an issue with Helm while installing the Azure Arc enabled Kubernetes agent in my Kubernetes cluster, following the document below:
https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-enable-arc-enabled-clusters
At the end of the ps1 installation script, it shows a not-found message like Error: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2: not found.
I assume the ps1 script (and also the bash script) has the wrong version number.
After changing the version number to 2.8.1 by editing the script, the installation finished properly.
The log output of installation:
...
Helm version : version.BuildInfo{Version:"v3.5.3", GitCommit:"041ce5a2c17a58be0fcd5f5e16fb3e7e95fea622", GitTreeState:"dirty", GoVersion:"go1.15.8"}
Installing or upgrading if exists, Azure Monitor for containers HELM chart ...
pull the chart from mcr.microsoft.com
pull the chart from mcr.microsoft.com
2.8.2: Pulling from mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers
Error: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2: not found
export the chart from local cache to current directory
Error: Chart not found: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2
helmChartRepoPath is : ./azuremonitor-containers
using provided kube-context: minikube
Release "azmon-containers-release-1" does not exist. Installing it now.
Error: path "./azuremonitor-containers" not found
Successfully enabled Azure Monitor for containers for cluster: /subscriptions/************/resourceGroups/************/providers/Microsoft.Kubernetes/connectedClusters/************
Proceed to https://aka.ms/azmon-containers to view your newly onboarded Azure Managed cluster
And here is how I downloaded the ps1 script (same as the document's guide):
Invoke-WebRequest https://aka.ms/enable-monitoring-powershell-script -OutFile enable-monitoring.ps1
PS:
If Azure Arc enabled Kubernetes is already GA, I believe the URL doesn't have to contain preview anymore.
I'd appreciate it if this URL were fixed properly.
Thank you.
I raised an issue here: oliver006/redis_exporter#573, which looks to be caused by "Input.prometheus plugin unable to parse #HELP comment" (influxdata/telegraf#8366, influxdata/telegraf#8545).
Hello,
I am running Container Insights on a machine that does not have Kubernetes, and I want to disable sending logs from the stdout stream.
In the documentation (https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-agent-config), the only way to do that is by using a Kubernetes ConfigMap. As I do not have Kubernetes installed, I tried to configure this by setting the environment variable AZMON_LOG_EXCLUSION_REGEX_PATTERN=stdout (referenced in some files at /build/windows/installer/conf/), but had no success.
I also tried to write a settings file at /etc/config/settings/log-data-collection-settings, but had no success.
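For reference, the content I wrote into that settings file mirrors the documented ConfigMap schema (a sketch):

[log_collection_settings]
  [log_collection_settings.stdout]
    # disable collecting container stdout logs
    enabled = false
  [log_collection_settings.stderr]
    enabled = true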
Is there any way that I can disable sending stdout logs to azure workspace?
When fluentd is run, it uses /etc/fluent/kube.conf (Docker-Provider/kubernetes/linux/main.sh, line 769 in a697f5a). Should it use /etc/config/kube.conf instead?
The provider currently gives:
CLASS=Container_ContainerInventory:CIM_ManagedElement
CLASS=Container_ContainerStatistics:CIM_StatisticalData:CIM_ManagedElement
CLASS=Container_DaemonEvent:CIM_ManagedElement
CLASS=Container_ImageInventory:CIM_ManagedElement
CLASS=Container_ContainerLog:CIM_ManagedElement
CLASS=Container_HostInventory:CIM_ManagedElement
CLASS=Container_Process:CIM_ManagedElement
It would be nice to monitor swarm services
/etc/opt/omi/conf# docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
dkq2zy0opjvg monitoring_telegraf global 2/2 telegraf:latest
/etc/opt/omi/conf# docker service ps dkq2zy0opjvg
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
hzstnm31un4y monitoring_telegraf.mgnljv2zuh74cfhn3n3z0znck telegraf:latest node01 Running Running 11 days ago
qms85linbux6 monitoring_telegraf.8k70kl8rzbz0zqu8345wivmpo telegraf:latest node02 Running Running 11 days ago
We have noticed that the OMI provider for Docker tends to cause the Docker daemon (dockerd) to spin at 100% CPU.
We think the issue is related to the statistics metrics being queried too aggressively, as this is a known pitfall of the docker stats system.
The omiagent process keeps crashing every minute and is filling up the filesystem.
I have traced the problem to the docker-cimprov-1.0.0-32 provider. Please see below for details.
Version:
# /opt/omi/bin/omiserver -v
/opt/omi/bin/omiserver: OMI-1.4.2-5 - Wed Jul 25 10:59:15 PDT 2018
omiserver.log:
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40012 Priority=INFO (S)Socket: 0x1b1d850, closing connection (mask 2)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40033 Priority=INFO Selector_RemoveHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40011 Priority=INFO (E)done with receiving msg(0x1a52618:4099:EnumerateInstancesReq:e005)
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40039 Priority=INFO New request received: command=(EnumerateInstancesReq), namespace=(root/cimv2), class=(Container_DaemonEvent)
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5dff88, handler=0x1a56c40, name=null
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40005 Priority=INFO _SendRequestToAgent msg(0x1a576a8:15:BinProtocolNotification:13), from original operationId: 0 to 13
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40005 Priority=INFO _SendRequestToAgent msg(0x1a576a8:4099:EnumerateInstancesReq:14), from original operationId: e005 to 14
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40011 Priority=INFO (S)done with receiving msg(0x1b2db28:36:VerifySocketConn:0)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40011 Priority=INFO (S)done with receiving msg(0x1b2ed68:34:CreateAgentMsg:0)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40012 Priority=INFO (S)Socket: 0x1b1d850, closing connection (mask 2)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40033 Priority=INFO Selector_RemoveHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92025,92025] WARNING: null(0): EventId=30209 Priority=WARNING child process with PID=[92081] terminated abnormally
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40011 Priority=INFO (E)done with receiving msg(0x1a54908:4:PostResultMsg:14)
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5dff88, handler=0x1a525e0, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40028 Priority=INFO (E)Socket: 0x1a54aa0, Connection Closed while reading header
core.92081 :
[New LWP 92081]
[New LWP 92082]
[New LWP 92083]
[New LWP 92085]
[New LWP 92089]
[New LWP 92088]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/omi/bin/omiagent 9 10 --destdir / --providerdir /opt/omi/lib --loglevel IN'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fc0fdb13240 in std::allocator<std::pair<std::string const, unsigned long long> >::~allocator() () from /opt/omi/lib/libcontainer.so
(gdb) bt
#0 0x00007fc0fdb13240 in std::allocator<std::pair<std::string const, unsigned long long> >::~allocator() () from /opt/omi/lib/libcontainer.so
#1 0x00007fc0fdb13fcb in std::_Miter_base<mi::Container_ContainerStatistics_Class*>::iterator_type std::__miter_base<mi::Container_ContainerStatistics_Class*>(mi::Container_ContainerStatistics_Cla
ss*) () from /opt/omi/lib/libcontainer.so
#2 0x00007fc0fdb12986 in __gnu_cxx::__normal_iterator<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > >*, std::v
ector<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > >, std::allocator<std::map<std::string, unsigned long long,
std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > > > > >::operator*() const () from /opt/omi/lib/libcontainer.so
#3 0x00007fc0fdb31db8 in ensure(printbuffer*, unsigned long) () from /opt/omi/lib/libcontainer.so
#4 0x0000000000408707 in ?? ()
#5 0x0000000000404952 in ?? ()
#6 0x0000000000468f9d in ?? ()
#7 0x0000000000465227 in ?? ()
#8 0x00000000004632d3 in ?? ()
#9 0x000000000044c092 in ?? ()
#10 0x000000000046205b in ?? ()
#11 0x00000000004632d3 in ?? ()
#12 0x000000000044f37a in ?? ()
#13 0x000000000044fff8 in ?? ()
#14 0x000000000046b23d in ?? ()
#15 0x0000000000404ffc in ?? ()
#16 0x00007fc10845c3d5 in __libc_start_main (main=0x4052b0, argc=9, ubp_av=0x7ffc84310a08, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc843109f8)
at ../csu/libc-start.c:274
#17 0x0000000000404699 in ?? ()
#18 0x00007ffc843109f8 in ?? ()
#19 0x000000000000001c in ?? ()
#20 0x0000000000000009 in ?? ()
#21 0x00007ffc84310f36 in ?? ()
#22 0x00007ffc84310f4c in ?? ()
#23 0x00007ffc84310f4e in ?? ()
#24 0x0000000000000000 in ?? ()
I installed Docker on Windows Server 2019 with DockerProvider, using this code:
Install-Module DockerProvider
Install-Package Docker -ProviderName DockerProvider -RequiredVersion preview
[Environment]::SetEnvironmentVariable("LCOW_SUPPORTED", "1", "Machine")
After that, I installed Docker Compose with this code:
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
Invoke-WebRequest "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-Windows-x86_64.exe" -UseBasicParsing -OutFile $Env:ProgramFiles\Docker\docker-compose.exe
After that, I used this docker-compose file:
version: "3.5"
services:
rabbitmq:
# restart: always
image: rabbitmq:3-management
container_name: rabbitmq
ports:
- 5672:5672
- 15672:15672
networks:
- myname
# network_mode: host
volumes:
- rabbitmq:/var/lib/rabbitmq
networks:
myname:
name: myname-network
volumes:
rabbitmq:
driver: local
Everything is OK up to here, but after I call the http://localhost:15672/ URL in my browser, rabbitmq crashes and I see this error in docker logs <container-id>:
Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only
This .yml file works correctly in Docker for Windows, but after running the file on Windows Server, I see this error.
Hi,
In a project I'm part of, there is a security concern regarding the omsagent pods, deployed into an AKS cluster, running as the root user. The agent maps /var/log from the kubelet (node), accessing the logs, effectively running as a root process on the node. We understand that consuming /var/log requires root.
The question is, how much additional hardening has Microsoft done with the omsagent, and can we apply additional hardening that makes it "secure enough"? It would be nice to get a point of view on the matter.
The "attack vector" is through the /var/log
filesystem. If it manages to mount up files into this directory somehow. It would require an attacker to break into the omsagent.
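For context, the kind of additional hardening we would normally apply to a pod looks like this (a generic sketch, not something we know the omsagent supports):

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]   # the agent would presumably still need root (or specific capabilities) for /var/log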
Trace-data:
rune@Azure:~$ kubectl get pods -n kube-system | grep -i omsagent
omsagent-7cb7z 1/1 Running 0 18h
omsagent-rs-7c7b6c8d5b-h2zvh 1/1 Running 0 18h
rune@Azure:~$ kubectl exec omsagent-rs-7c7b6c8d5b-h2zvh -n kube-system -- id
uid=0(root) gid=0(root) groups=0(root)
rune@Azure:~$ kubectl exec -it -n kube-system omsagent-rs-7c7b6c8d5b-h2zvh -- /bin/sh
# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 18504 3140 ? Ss Jun15 0:00 /bin/bash /opt/main.sh
root 30 0.0 0.0 6704 116 ? S Jun15 0:00 inotifywait /etc/config/settings --daemon --recursive --outfile /opt/inotifyoutput.txt --event create,delete --format %e : %T --timefmt +%s
syslog 223 0.0 0.0 129672 4208 ? Ssl Jun15 0:02 /usr/sbin/rsyslogd
omsagent 263 0.6 0.7 394724 55132 ? Sl Jun15 6:48 /opt/microsoft/omsagent/ruby/bin/ruby /opt/microsoft/omsagent/bin/omsagent-32ed830c-9fbe-4a37-8b4d-a990f2a873f8 -d /var/opt/microsoft/omsagent/32ed830c-9fbe-4a37-8b4d-a990f2a873f8/run/omsagent.pid --no-supervisor -o /var/opt/micros
root 294 0.0 0.0 28356 2676 ? Ss Jun15 0:00 /usr/sbin/cron
root 338 0.0 0.6 150128 46848 ? Sl Jun15 0:04 /opt/td-agent-bit/bin/td-agent-bit -c /etc/opt/microsoft/docker-cimprov/td-agent-bit-rs.conf -e /opt/td-agent-bit/bin/out_oms.so
root 348 0.0 0.5 198028 38512 ? Sl Jun15 0:21 /opt/telegraf --config /etc/opt/microsoft/docker-cimprov/telegraf-rs.conf
root 369 0.0 0.0 4536 768 ? S Jun15 0:00 sleep inf
root 57390 0.0 0.0 4628 772 pts/0 Ss+ 06:26 0:00 /bin/sh
root 59470 0.0 0.0 4628 820 pts/1 Ss 07:02 0:00 /bin/sh
root 59477 0.0 0.0 34404 2856 pts/1 R+ 07:03 0:00 ps -aux
# exit
rune@Azure:~$
I am not sure if this is the correct repository to open this issue, but:
It would be great if it were possible to somehow configure a priority class for the omsagent within an AKS cluster.
Currently, that does not seem possible - or at least I could find no documentation whatsoever about it.
In our scenario, we want to give the cluster monitoring a somewhat higher priority than most of the other services - so that in case of an error, we won't be flying blind.
The only workaround I can think of is to use a global default class - however, that is not really feasible, as there might be other, unimportant pods without a PriorityClass within the cluster that would suddenly be ranked way higher than they should be. For illustration, see the sketch below.
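The global-default workaround would look something like this (a sketch; the name and value are mine):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-default   # hypothetical name
value: 100000
globalDefault: true          # every pod without an explicit priorityClassName inherits this
description: "Workaround so the omsagent pods inherit an elevated priority"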
Is such a feature possible?
I am in the process of implementing the alarms based on the ARM templates. Comparing the alarms created via the template with those from the portal, it seems the metrics chosen and the thresholds are different from what the portal creates for the corresponding alarms in the "Recommended alerts (Preview)" pane.
Maybe it is not wrong, but I would like to understand why the difference.
Examples:
Alarm on the portal: "(New) Container CPU %"
Description: "Average CPU percent is greater than the configured threshold (default is 95%)"
Metric used on the portal: cpuThresholdExceeded > 0
Metric used on the ARM template: cpuExceededPercentage > 95
Alarm on the portal: "(New) Container working set memory %"
Description: "Average working set memory percent is greater the configured threshold (default is 95%)"
Metric used on the portal: memoryworkingsetthresholdviolated > 0
Metric used on the ARM template: memoryWorkingSetExceededPercentage > 95
Aggregation Type / Aggregation Granularity are not populating on the portal page, and emails are not triggering, but I see values in the exported template.
The alerts were created using an SPN account, and I am trying to view them through my enterprise subscription. I guess I am missing some access, or there is a restriction at my organization level. Kindly let me know whether any additional RBAC is needed at the cluster level.
Hi guys,
I am getting the below error when I try to install the azuremonitor-containers Helm chart on Arc enabled Kubernetes v1.16.
Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "AzureClusterIdentityRequest" in version "clusterconfig.azure.com/v1beta1"
It looks like it does not support Kubernetes version 1.16. Could you please help with this? Thanks.
Due to CVE-2021-38645, CVE-2021-38649, CVE-2021-38648, and CVE-2021-38647 ( https://msrc-blog.microsoft.com/2021/09/16/additional-guidance-regarding-omi-vulnerabilities-within-azure-vm-management-extensions/ ), a new version of the ciprod-Docker-Image was released ( mcr.microsoft.com/azuremonitor/containerinsights/ciprod:microsoft-oms-latest , from the same website).
The Helm chart Docker-Provider/charts/azuremonitor-containers/ (version 2.8.1) is incompatible with this image. There are (at least) two problems:
The Helm chart in the ci_prod branch (2.8.3) also points to a vulnerable version of ciprod (ciprod04222021).
Because of this, we need an updated Helm chart URGENTLY.
Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10052020 (windows)
Should it be "Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10272020"?
@vishiy