
The DNS Operator

The DNS Operator deploys and manages CoreDNS to provide a name resolution service to pods that enables DNS-based Kubernetes Service discovery in OpenShift.

The operator tries to be useful out of the box by creating a working default deployment based on the cluster's configuration.

  • The default cluster domain is cluster.local.
  • Limited configuration of the CoreDNS Corefile or kubernetes plugin is supported.

How it works

The DNS Operator deploys CoreDNS using a DaemonSet, which means that each node has a local CoreDNS pod replica. This topology provides scalability as the cluster grows or shrinks and resilience in case a node becomes temporarily isolated from other nodes.

The DaemonSet's pod template specifies a "dns" container with CoreDNS and a "dns-node-resolver" container with a process that adds the cluster image registry service's DNS name to the host node's /etc/hosts file (see below).

In order to resolve cluster service DNS names, the operator configures CoreDNS with the kubernetes plugin. This plugin resolves DNS names of the form <service name>.<namespace>.svc.cluster.local to the identified Service's corresponding endpoints.
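
For example, a lookup from inside a pod might look like the following (a sketch; the DNS service address and the answer shown are illustrative):

# illustrative query against the cluster DNS service
$ dig +short kubernetes.default.svc.cluster.local @172.30.0.10
172.30.0.1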

In order to resolve external DNS names, the operator configures CoreDNS to forward to the upstream name servers configured in the node host's /etc/resolv.conf (typically these name servers come from DHCP or are injected into a custom VM image). The operator allows the user to configure additional upstreams to use for specific zones; see https://github.com/openshift/enhancements/blob/master/enhancements/dns/plugins.md.
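
For example, a sketch of per-zone forwarding through the DNS operator API described in the linked enhancement (the server name, zone, and upstream addresses here are illustrative):

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: example-upstream   # illustrative name
    zones:
    - example.com            # queries in this zone use the upstreams below
    forwardPlugin:
      upstreams:
      - 10.0.0.53            # illustrative addresses; port defaults to 53
      - 10.0.0.54:5353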

The operator also creates a Service with a fixed IP address. This address is derived from the service network CIDR, namely by taking the tenth address in the address space. For example, if the service network CIDR is 172.30.0.0/16, then the DNS service's address is 172.30.0.10.

When a pod is created, the kubelet injects a nameserver entry with the DNS service's IP address into the pod's /etc/resolv.conf file (unless the pod overrides the default behavior with spec.dnsPolicy; see DNS for Services and Pods: Pod's DNS Policy).
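
For example, a pod's /etc/resolv.conf might then look like the following (a sketch assuming the example service CIDR above; <namespace> stands in for the pod's namespace):

# nameserver is the cluster DNS service IP (illustrative CIDR)
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
nameserver 172.30.0.10
options ndots:5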

Within a pod, the flow of a DNS query varies depending on whether the DNS name to be resolved is for a cluster service DNS name or for an external DNS name. A query for a cluster service DNS name flows from the pod process via the service proxy to a randomly chosen CoreDNS instance, which itself resolves the name. A query for an external DNS name flows from the pod process via the service proxy to a CoreDNS instance, which forwards the request to an upstream name server; this name server may be on a network that is external to the cluster, possibly the Internet.

The foregoing describes the behavior for pods that use container networking. If a pod is configured to use the host network, or if a process runs directly on a node, it uses the name servers configured in the host node's /etc/resolv.conf file. This means queries from host-network pods or processes flow from the process to the name server that is specified in /etc/resolv.conf (which typically is on an external network or the Internet).

In general, DNS names for Services will not resolve from the node host, as the node itself is not configured to use CoreDNS as its name server. For example, the container runtime runs directly on the node host, so it cannot resolve cluster service DNS names, with one exception: a process in the DNS DaemonSet's "dns-node-resolver" container adds the registry service's DNS name, image-registry.openshift-image-registry.svc, to the node's /etc/hosts file so that the container runtime and kubelet can resolve the registry service's DNS name.
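
The appended entry might look roughly like this (the address is illustrative; the real value is the registry service's ClusterIP):

# illustrative /etc/hosts entry
172.30.56.12 image-registry.openshift-image-registry.svc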

Troubleshooting DNS issues may require tools such as strace, tcpdump, dropwatch, and other low-level network diagnostic tools.
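
For example, capturing DNS traffic on a node might start with an invocation like this (illustrative; the interface and filters will vary):

# illustrative capture of DNS traffic on all interfaces
$ tcpdump -i any -n port 53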

How to help

See HACKING.md for development topics.

Reporting issues

Bugs are tracked in the Red Hat Issue Tracker.

cluster-dns-operator's Issues

Forwarded zone cache

Hi, forwarded-zone caching was enabled in this Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2006803. I would like to know whether it is possible to enable the cache plugin's serve_stale feature for the forwarded zone; serve_stale could help improve application resilience during a DNS resolver or authoritative DNS outage.

https://coredns.io/plugins/cache/

serve_stale: when serve_stale is set, cache will always serve an expired entry to a client if there is one available. When this happens, cache will attempt to refresh the cache entry after sending the expired cache entry to the client. The responses have a TTL of 0. DURATION is how far back to consider stale responses as fresh. The default duration is 1h.
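
For reference, enabling the feature in a Corefile would look roughly like this (a sketch based on the linked plugin docs; the operator does not currently expose this option):

cache 30 {
    # serve expired entries for up to 1h while refreshing in the background
    serve_stale 1h
}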

Thanks.

custom forward options?

Platform: OpenShift version 4.5.7
Hardware: 12 worker nodes, VMware vSphere, 8 CPU cores per node
Node uname (via oc debug node): Linux ocp-w-01 4.18.0-193.14.3.el8_2.x86_64 #1 SMP Mon Jul 20 15:02:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Hi! We are experiencing 5-second timeouts on DNS and believe we are hitting the NAT issue described in Bugzilla 1661928. The cluster-dns-operator is in use, with a ClusterIP service and NAT to the rest of the network. No additional zones are defined, so the resulting CoreDNS configuration is:

.:5353 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 30
    reload
}

Node /etc/resolv.conf contains two name servers on the local network, 10.150.13.2:53 and 10.150.13.3:53. The DNS pods have I/O errors in the log output:

[ERROR] plugin/errors: 2 elasticsearch.openshift-logging.svc.cluster.local.ocp-prd.regsys.brreg.no. A: read udp 10.129.6.2:42578->10.150.13.2:53: i/o timeout

As I understand it, setting force_tcp in the forward options could resolve the trouble, regardless of where the packet drop happens (NAT iptables race condition or a dropped packet on the network).
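
For reference, the resulting forward stanza with the option enabled would look roughly like this (force_tcp is a standard CoreDNS forward option, but the operator does not currently render it):

forward . /etc/resolv.conf {
    policy sequential
    # force_tcp is not currently rendered by the operator
    force_tcp
}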

You state in the readme:

Limited configuration of the CoreDNS Corefile or kubernetes plugin is supported.

Would you be open to supporting more configuration? For example:

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  forwardOptions: "raw string of options here"

A long request to the DNS server

There is a problem with long requests to the DNS server, shown in the example below. The request is sent from a pod, and the problem manifests itself inside the pod. I tried to solve the problem with forwarding, but it didn't help.

curl -w "dns_resolution: %{time_namelookup}, tcp_established: %{time_connect}, ssl_handshake_done: %{time_appconnect}, TTFB: %{time_starttransfer}, time_redirect: %{time_redirect}, time_total: %{time_total}s\n" -o /dev/null -s https://ya.ru
dns_resolution: 5.073743, tcp_established: 5.087487, ssl_handshake_done: 5.103775, TTFB: 5.166707, time_redirect: 0.000000, time_total: 5.193673s

I attach an example of the forwarding configuration:

servers:
- name:
  zones:
  - "*.ru"
  - "*.com"
  forwardPlugin:
    upstreams:
    - dns1
    - dns2:5353
cluster version 4.5.0-0.okd-2020-10-15-235428

Appending /etc/hosts is triggering master reboot cycles

Since #56, this operator has been adding content to /etc/hosts. But that file is managed by the machine-config daemon, and when the MCD detects the altered content, it reboots the node. The node comes back up, tries to restore the expected pods, and presumably the whole cycle repeats. This makes it harder for external clients to connect to the Kubernetes and OpenShift API servers (openshift/origin#21612), and it shows up in failed aws-e2e CI runs as random, unrelated flakes. I'm not sure what the /etc/hosts additions are for, but can you find another approach that accomplishes the same goal? Or find some way to ask the MCD to append your content, instead of reaching around the MCD and touching the file directly?

CC @abhinavdahiya, @ironcladlou, @pravisankar

Node DNS sync container takes 30-60s to shut down because it doesn't handle SIGTERM properly

Bash processes run as pid 1 need to follow a certain set of patterns in order to avoid being unresponsive when the container runtime invokes them.

Right now, the sidecar sync container ignores SIGTERM, so pod deletion takes however long a single sleep lasts instead of happening near instantly. This is frustrating, delays upgrades, and can lead to extended DNS downtime on the node.

Specifically, you need to (see the sketch after this list):

  1. Listen for SIGTERM explicitly: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_node_group/files/sync.yaml#L59
  2. Always sleep X & wait instead of just sleep X: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_node_group/files/sync.yaml#L83
  3. If you do run sub processes, be careful to wait on the right object (some examples exist, I don't have them at hand).
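
A minimal sketch of the pattern, assuming a bash entrypoint running as pid 1 (the sleep interval and sync work are placeholders):

#!/bin/bash
# Minimal SIGTERM-aware sleep loop for a bash process running as pid 1.
# The trap makes the shell exit promptly when the container runtime
# sends SIGTERM, instead of blocking in a foreground sleep.
trap 'exit 0' TERM

while true; do
  # (do the periodic sync work here)
  # Sleep in the background and wait on it: bash runs traps between
  # commands, and the wait builtin is interruptible by a trapped
  # signal while a foreground sleep is not.
  sleep 60 &
  wait $!
done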

Operator can't rely on cluster DNS

If CoreDNS becomes completely unavailable, as in during bootstrapping or after an outage, the operator needs to be able to bring up DNS. This means the operator itself can't rely on cluster DNS to talk to the API server.

Host networking, co-location with the apiserver, and communication with the apiserver over loopback could be the solution (see openshift/cluster-version-operator#25).

Assign a priority class to pods

Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority

Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.
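
For example, a minimal fragment of a DaemonSet pod template with the suggested class (a sketch; field placement follows the standard Kubernetes pod spec):

spec:
  template:
    spec:
      # suggested class for core operators and their pods
      priorityClassName: system-cluster-critical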

logging in dns.operator

How do I enable logging in dns.operator/default? Updating the Corefile in configmap/dns-default gets overwritten by the operator.
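
For reference, later releases of the operator added spec.logLevel and spec.operatorLogLevel to the DNS resource; on a release that has those fields, enabling debug logging would look roughly like this (a sketch, not applicable to releases without the fields):

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  logLevel: Debug          # CoreDNS query logging (field availability varies by release)
  operatorLogLevel: Debug  # the operator's own logging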

DNS pods are not started on tainted/dedicated nodes

This is an issue caused by 'Bug 1753059: Don't start DNS on NotReady nodes'.

After that change, DNS pods cannot be started on nodes that have taints (applied for dedicated deployments).

In my case, we deployed some pods (with tolerations) that we developed on the OCP environment, and tainted some of the OCP nodes (for example: oc adm taint node xx dedicated=myapp:NoSchedule).
The app works in OCP 4.2.0, but after upgrading the environment to 4.2.21, we had problems getting DNS pods scheduled.
The problem was introduced by Bug 1753059, which blocks DNS pods from being deployed on the tainted nodes.

We didn't find a way to make the DNS DaemonSet tolerate our taint key; any tolerations we add are updated/deleted by the DNS operator (cluster-dns-operator/pkg/operator/controller/controller_dns_daemonset.go).

Please help: either tolerate all taint keys as before Bug 1753059, or support customized tolerations.
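
For reference, later releases added a spec.nodePlacement field with customizable tolerations; on a release that has it, tolerating the taint above would look roughly like this (a sketch; field availability varies by release):

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    tolerations:
    # matches the illustrative taint dedicated=myapp:NoSchedule above
    - key: dedicated
      operator: Equal
      value: myapp
      effect: NoSchedule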


-- thanks & best wishes
