sunet / cnaas-nms
Campus Network as-a-Service - Network Management System (Campus network automation software)
License: Other
Is your feature request related to a problem? Please describe.
We need the ability to run IPv6 in linknets. Currently it is only possible to add IPv4 addresses in linknets.
Describe the solution you'd like
It needs to be possible to add these settings to linknets:
Describe alternatives you've considered
We've considered deriving the IPv6 address from the IPv4 address, but this is not optimal for several reasons. Most of all, it is not possible if we want to go for an IPv6-only underlay.
Describe the bug
When running a repository refresh for settings, we get this error message:
{
"status": "error",
"message": "Error in repository: local variable 'e' referenced before assignment"
}
Using the logs we traced it to settings.py line 255 (in check_settings_syntax). We believe it is caused by the nested try-except blocks using the same variable name e. The outer try-except contains a for loop, and inside the for loop is the second try-except.
In Python, the variable bound to the exception (e in this case) is deleted when the except block exits. So if the inner try-except triggers, it rebinds e to its own exception and deletes it on exiting its block. At that point e no longer exists, so on the next iteration of the loop an UnboundLocalError occurs when e is referenced on line 255.
As of writing, line 255 is the code below:
if len(e.errors()) == 2 and num == 1 and error['type'] == 'type_error.none.allowed':
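A minimal, self-contained sketch of the failure mode (this is not the actual check_settings_syntax code): Python implicitly deletes the name bound in an except clause when that block exits, so an inner except reusing the same name leaves the outer name unbound.

```python
# Sketch of the bug, not the real settings.py code: an inner
# try-except reuses the outer exception variable name 'e'.
def outer():
    try:
        raise ValueError("outer error")
    except ValueError as e:
        for num in range(2):
            try:
                raise KeyError("inner error")
            except KeyError as e:  # rebinds 'e'; Python deletes it on exit
                pass
            # 'e' is already unbound here after the first iteration
        return len(e.args)  # raises UnboundLocalError

try:
    outer()
except UnboundLocalError as exc:
    print(exc)  # message wording varies by Python version
```

Renaming the inner exception variable (e.g. except KeyError as inner_e) avoids the clobbering, which matches the proposed change below.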
To Reproduce
I'm not sure how to give concrete steps, as you might need a similar settings_fields.py override file and settings repository (explained more at the bottom of the report).
Steps to reproduce the behavior:
alias curlJ curl -k -s -H "Authorization: Bearer $JWT_AUTH_TOKEN" -H "Content-Type: application/json"
curlJ https://hostname/api/v1.0/repository/settings -d '{"action": "refresh"}' -X PUT | jq
Expected behavior
Return message from the API should give the real reason why the settings refresh did not work, i.e. what was wrong with the settings/settings_fields.
Screenshots
Not a screenshot, but the relevant lines from the log
2021-11-15 14:50:23,726 DEBG 'uwsgi' stdout output:
[2021-11-15 14:50:23,725] INFO in git: Trying to acquire lock for devices to run refresh repo: 64
2021-11-15 14:50:23,726 DEBG 'uwsgi' stdout output:
2021-11-15 14:50:24,061 DEBG 'uwsgi' stdout output:
[2021-11-15 14:50:24,061] DEBUG in git: Clearing redis-lru cache for settings
2021-11-15 14:50:24,089 DEBG 'uwsgi' stdout output:
[2021-11-15 14:50:24,088] DEBUG in settings: unhashable type: 'dict'
2021-11-15 14:50:24,089 DEBG 'uwsgi' stdout output:
[2021-11-15 14:50:24,089] ERROR in git: Exception while scheduling job for refresh repo: local variable 'e' referenced before assignment
Traceback (most recent call last):
File "/opt/cnaas/venv/lib/python3.7/site-packages/redis_lru/lru.py", line 46, in inner
return self[key]
File "/opt/cnaas/venv/lib/python3.7/site-packages/redis_lru/lru.py", line 71, in __getitem__
raise KeyError()
KeyError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./cnaas_nms/db/settings.py", line 247, in check_settings_syntax
ret_dict = f_root(**settings_dict).dict()
File "pydantic/main.py", line 362, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 4 validation errors for f_root
vrfs
none is not an allowed value (type=type_error.none.not_allowed)
vxlans -> ztp -> dns_servers_6
field required (type=value_error.missing)
vxlans -> ansatt -> dns_servers_6
value is not a valid list (type=type_error.list)
slaac_dns_servers
value is not a valid list (type=type_error.list)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./cnaas_nms/db/git.py", line 88, in refresh_repo
result = _refresh_repo_task(repo_type)
File "./cnaas_nms/db/git.py", line 170, in _refresh_repo_task
get_settings()
File "/opt/cnaas/venv/lib/python3.7/site-packages/redis_lru/lru.py", line 48, in inner
result = func(*args, **kwargs)
File "./cnaas_nms/db/settings.py", line 613, in get_settings
verified_settings = check_settings_syntax(settings, settings_origin)
File "./cnaas_nms/db/settings.py", line 255, in check_settings_syntax
if len(e.errors()) == 2 and num == 1 and error['type'] == 'type_error.none.allowed':
UnboundLocalError: local variable 'e' referenced before assignment
2021-11-15 14:50:24,092 DEBG 'uwsgi' stdout output:
[2021-11-15 14:50:24,092] INFO in git: Releasing lock for devices from refresh repo job: 64
2021-11-15 14:50:24,156 DEBG 'uwsgi' stdout output:
[2021-11-15 14:50:24,156] INFO in app: User: cnaas, Method: PUT, Status: 500, URL: https://hostname/api/v1.0/repository/settings, JSON: {'action': 'refresh'}
2021-11-15 14:50:24,160 DEBG 'uwsgi' stdout output:
[pid: 17|app: 0|req: 5/5] 172.30.0.1 () {40 vars in 813 bytes} [Mon Nov 15 14:50:23 2021] PUT /api/v1.0/repository/settings => generated 103 bytes in 444 msecs (HTTP/1.1 500) 4 headers in 198 bytes (3 switches on core 999)
Additional context
This happened while overriding settings_fields.py, so it's likely that there was something wrong with this file that caused exceptions to occur in the first place. The settings repo also contains a lot of changes that were not tested, so there are likely issues there as well, reflected by the logs containing several errors with the settings. This error occurred while trying to test the settings_fields override file with our settings repository and figure out what is wrong, so I can't say for certain whether both are broken or not.
I think this local variable error will only occur if there is something wrong with either the settings or the settings_fields. If there are no errors, the exceptions won't happen in the first place and there won't be a chance for the e variable to be referenced.
Proposed change
Make the inner try-except block use a different variable name for the exception than the outer block.
Describe the bug
SQLAlchemy QueuePool limit reached
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No exception should be raised; the job should just slow down.
Outputs
File "./cnaas_nms/confpush/sync_devices.py", line 339, in push_sync_device
dev: Device = session.query(Device).filter(Device.hostname == hostname).one()
...
sqlalchemy.exc.TimeoutError: QueuePool limit of size 50 overflow 0 reached, connection timed out, timeout 30 (Background on this error at: http://sqlalche.me/e/13/3o7r)
Environment:
Additional context
Possibly related to quite a big job database?
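A stdlib-only sketch of the desired behaviour (illustrative names, not SQLAlchemy's API): a bounded pool where a checkout past the limit blocks until a connection is returned, instead of raising a TimeoutError the way QueuePool does once size plus overflow is exhausted.

```python
# Sketch: a bounded pool whose checkout blocks rather than raising.
import queue
import threading
import time

class BlockingPool:
    def __init__(self, size):
        self._q = queue.Queue(maxsize=size)
        for i in range(size):
            self._q.put(f"conn-{i}")  # placeholder connection objects

    def checkout(self):
        return self._q.get()  # blocks until a connection is available

    def checkin(self, conn):
        self._q.put(conn)

pool = BlockingPool(size=1)
c1 = pool.checkout()

def worker(results):
    results.append(pool.checkout())  # waits until c1 is returned

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
time.sleep(0.1)
pool.checkin(c1)   # release; the waiting worker now proceeds
t.join(timeout=2)
print(results)     # ['conn-0']
```

With SQLAlchemy itself, the usual knobs are pool_size, max_overflow and pool_timeout on create_engine; whether blocking indefinitely is actually desirable depends on the job model.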
Is your feature request related to a problem? Please describe.
From the dashboard view you can see the number of unsynchronized devices, but there is no way to sync only these devices.
Describe the solution you'd like
It would be nice if there was a way to synchronize just these devices without having to navigate to "groups" and then find a relevant group or groups to sync. Maybe there could be a "Sync" button next to the unsynchronized devices in the dashboard view.
Additional context
Is your feature request related to a problem? Please describe.
We are now at Python 3.7. We would like to move to the most recent version possible.
Describe the solution you'd like
Preferably Python 3.12; otherwise 3.11 would also be good.
Is your feature request related to a problem? Please describe.
We use the fcntl module, which is Unix-only. This creates a problem for making the Docker images more lightweight and for people developing on non-Linux machines. fcntl locks specific files, and it is only used in scheduler.py.
Describe the solution you'd like
We could replace the entire scheduler with the package apscheduler.
Other solutions retaining the scheduler
We could use a different package; online I found the following suggestions: portalocker, waitress. We could also check whether the file locking is necessary at all, or whether we would prefer to solve it in a different way.
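As an illustration of what a portable lock could look like without a new dependency (portalocker wraps the same primitives), here is a hedged stdlib-only sketch, assuming only a non-blocking advisory lock is needed:

```python
# Sketch: cross-platform non-blocking file lock using only the stdlib.
import os
import tempfile

if os.name == "nt":
    import msvcrt

    def try_lock(fh):
        """Non-blocking exclusive lock on Windows (locks 1 byte)."""
        try:
            msvcrt.locking(fh.fileno(), msvcrt.LK_NBLCK, 1)
            return True
        except OSError:
            return False
else:
    import fcntl

    def try_lock(fh):
        """Non-blocking exclusive lock on POSIX."""
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError:
            return False

# The scheduler would hold the lock for its whole lifetime.
lockfile = os.path.join(tempfile.gettempdir(), "scheduler.lock")
fh = open(lockfile, "w")
print(try_lock(fh))  # True if no other process holds the lock
```

Note that POSIX lockf locks are per-process, so a second try_lock within the same process still succeeds; the exclusion only applies across processes, which is what the scheduler check needs.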
The NAPALM -> PyEZ -> ncclient dependency chain seems to support Python 3.11/3.12 since recently:
napalm-automation/napalm#2020
Juniper/py-junos-eznc#1276
Is your feature request related to a problem? Please describe.
We have had issues regarding the session.py module. The module automatically runs code for connecting to the database on import. So when we want to test just a function (a unit test, if you will) and the module under test imports from session.py, the program will crash if there is no database to connect to. This of course also happens if the module you test imports another module which imports session.py, so it can be quite frustrating.
Describe the solution you'd like
We want session.py to not automatically attempt to connect to the database when imported. The simplest solution would be to put the code in a function and call it when it's needed. As far as I can see, the only code that references any of the variables created is the sqla_session function, so creating a function and calling it from sqla_session is a solution.
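A sketch of the proposed shape (names and the engine stand-in are illustrative, not the real session.py): the connection setup moves into a function that is called on first use, so importing the module touches no database.

```python
# Sketch of lazy engine initialisation instead of import-time setup.
from contextlib import contextmanager

_engine = None

def get_engine():
    """Create the database engine on first call, not at import time."""
    global _engine
    if _engine is None:
        # In the real module this would wrap sqlalchemy.create_engine(...)
        _engine = object()
    return _engine

@contextmanager
def sqla_session():
    engine = get_engine()  # connecting happens here, not on import
    yield engine

# Importing this module no longer requires a running database;
# unit tests that never call sqla_session() need no db at all.
```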
Is your feature request related to a problem? Please describe.
I want to configure LACP with two links to servers or other things connected to my access switches.
Describe the solution you'd like
I need a way to define which access ports should be in the same LACP group. A simple integer ID should be enough: if it's defined, use the ID for the LACP config; if it's undefined, don't configure LACP. If it has the special value -1, configure an MLAG based on the port number of the switch.
Describe alternatives you've considered
Match something in the interface description to generate LACP config
Additional context
At least two customers wanted this functionality
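The proposed semantics could be sketched like this (function and key names are hypothetical, not an existing CNaaS-NMS API):

```python
def lacp_settings(port_num, lacp_id=None):
    """Derive LACP config from an optional integer ID on an access port."""
    if lacp_id is None:
        return None                             # no LACP configured
    if lacp_id == -1:
        return {"mlag": True, "id": port_num}   # MLAG keyed on port number
    return {"mlag": False, "id": lacp_id}       # explicit LACP group ID

print(lacp_settings(7))        # None
print(lacp_settings(7, -1))    # {'mlag': True, 'id': 7}
print(lacp_settings(7, 12))    # {'mlag': False, 'id': 12}
```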
os_version for some Cisco devices can exceed 64 chars and won't fit into the database (max 64 chars)
Example os_version: "C2960X Software (C2960X-UNIVERSALK9-M), Version 15.2(2)E7, RELEASE SOFTWARE (fc3)" (81 characters)
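Until the column is widened, a defensive truncation at write time would avoid the database error; a quick sketch (the helper name is hypothetical):

```python
OS_VERSION_MAX = 64  # current database column width

def safe_os_version(raw: str, max_len: int = OS_VERSION_MAX) -> str:
    """Truncate the reported os_version so it fits the database column."""
    return raw if len(raw) <= max_len else raw[:max_len]

v = 'C2960X Software (C2960X-UNIVERSALK9-M), Version 15.2(2)E7, RELEASE SOFTWARE (fc3)'
print(len(v))                   # 81
print(len(safe_os_version(v)))  # 64
```

The long-term fix is of course to widen the column instead, e.g. via a migration that alters it to a longer string type.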
Is your feature request related to a problem? Please describe.
When copy-pasting part of a switch name to search under "Devices", it is sometimes in capital letters and then I get no hit on my search.
Describe the solution you'd like
It would be nice if the search was case insensitive instead.
Additional context
Is your feature request related to a problem? Please describe.
We sometimes need to make experimental changes to our templates or settings, and test these in single instances of CNaaS-NMS. This would be much easier if we could follow a normal Git workflow and make separate branches for these experiments. However, the current version of CNaaS-NMS will only check out the default branch of any referenced repo, which forces us to take the more laborious route of forking a separate repo for this purpose.
Describe the solution you'd like
We would like a solution where we can specify a different branch name than the default as part of the GITREPO_TEMPLATES or GITREPO_SETTINGS environment variables.
An example of a URL pattern we would like: GITREPO_SETTINGS=https://git.example.org/cnaas/settings.git#alternate-branch should cause CNaaS-NMS to clone the settings repo from the alternate-branch branch of the repo at https://git.example.org/cnaas/settings.git
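Splitting such a value could be done with a URL fragment, sketched here (the real CNaaS-NMS config parsing may differ):

```python
# Sketch: split an optional '#branch' fragment off a git repo URL.
from urllib.parse import urlsplit

def split_repo_url(url):
    """Return (repo_url, branch); branch is None when no #fragment is given."""
    parts = urlsplit(url)
    branch = parts.fragment or None
    repo = parts._replace(fragment="").geturl()
    return repo, branch

print(split_repo_url("https://git.example.org/cnaas/settings.git#alternate-branch"))
# ('https://git.example.org/cnaas/settings.git', 'alternate-branch')
print(split_repo_url("https://git.example.org/cnaas/settings.git"))
# ('https://git.example.org/cnaas/settings.git', None)
```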
Is your feature request related to a problem? Please describe.
The docker containers are now quite bulky; they are based on full Linux images. We would like to use something that takes less space.
Describe the solution you'd like
We would like to use a python:3.12-slim base image or something similar.
Is your feature request related to a problem? Please describe.
Integration tests take a long time to run, and issues with them are hard to debug.
Describe the solution you'd like
It would also be good if a short summary of errors could be posted to the GitHub PR somehow.
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
I want to define custom behavior for some VXLANs, like special multicast behavior only on specific VXLANs. It would be nice to have a well-defined way to define "tags" that specify custom behavior/config per VXLAN.
Describe the solution you'd like
Allow a list of alphanumeric strings to be added per VXLAN in global/vxlans.yml
Describe alternatives you've considered
Matching on description/name/vrf
Additional context
Is your feature request related to a problem? Please describe.
We use the field "infra_ip" for devices to generate the lo0 IP address. We also need to be able to set a similar field for IPv6.
Describe the solution you'd like
We want an additional field to be available for devices, for instance called "infra_ip6". This field should be of type IPv6 address. The field should of course also be available as a device setting in templates as "infra_ip6".
Describe alternatives you've considered
We've considered automatically generating IPv6 address from the IPv4 address, but this does not work for us.
Additional context
The attached image shows a snippet from the documentation and marks the field that we want added as an IPv6 field.
The file lock used by scheduler/scheduler.py to detect whether a scheduler is already running does not work properly when flask/werkzeug is started with debug=True. In debug mode the scheduler module seems to be loaded once and then immediately reloaded, without releasing the file lock in between. This results in the scheduler starting with "is_mule = True", since it thinks a wsgi mule has already acquired the file lock. This is not a problem when running in wsgi mode (in docker etc), only when running in standalone API mode. Possible solutions:
Workaround: Set debug=False when starting werkzeug
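Another possible guard, sketched below: werkzeug's reloader sets WERKZEUG_RUN_MAIN=true in the reloaded child process, so the lock could be taken only there (the function name is hypothetical):

```python
# Sketch: only acquire the scheduler lock in the process that will
# actually serve requests under werkzeug's debug reloader.
import os

def should_acquire_lock(debug: bool, environ=None) -> bool:
    environ = os.environ if environ is None else environ
    if not debug:
        return True  # no reloader, single process
    # Under the reloader, only the child has WERKZEUG_RUN_MAIN set.
    return environ.get("WERKZEUG_RUN_MAIN") == "true"

print(should_acquire_lock(False, {}))                            # True
print(should_acquire_lock(True, {}))                             # False (reloader parent)
print(should_acquire_lock(True, {"WERKZEUG_RUN_MAIN": "true"}))  # True (reloaded child)
```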
Is your feature request related to a problem? Please describe.
We constantly find ourselves writing new template tags to make our template logic more concise. Some of these functions are accepted upstream, while some of them are very specific to our own setup and do not belong upstream.
Currently, we need to patch CNaaS-NMS locally to add our own template tags, which becomes rather tedious every time we need to do it, or every time we need to upgrade CNaaS-NMS.
Describe the solution you'd like
It would be much better if there was a mechanism or option to configure a list of third party Python modules with template tag functions, to import at runtime and make available when rendering Jinja2 templates.
Describe alternatives you've considered
Keep using the tedious manual patch management.
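One shape such a mechanism could take, sketched with a fabricated in-process plugin module (module and attribute names are invented; CNaaS-NMS has no such option today):

```python
# Sketch: load template filter functions from configured plugin modules.
import importlib
import sys
import types

def load_template_filters(module_names):
    """Import each named module and collect the callables it lists in TEMPLATE_FILTERS."""
    filters = {}
    for name in module_names:
        mod = importlib.import_module(name)
        for fname in getattr(mod, "TEMPLATE_FILTERS", []):
            filters[fname] = getattr(mod, fname)
    return filters

# Fabricate a third-party plugin module just for this demo.
plugin = types.ModuleType("my_site_filters")
plugin.TEMPLATE_FILTERS = ["shout"]
plugin.shout = lambda s: s.upper()
sys.modules["my_site_filters"] = plugin

filters = load_template_filters(["my_site_filters"])
print(filters["shout"]("hello"))  # HELLO
```

The resulting dict could then be merged into the Jinja2 environment, e.g. env.filters.update(filters), before templates are rendered.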
Describe the bug
When doing a dry-run against a group of switches and one of them is unreachable, you get a long Python stack trace in the NMS GUI.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Instead of getting 50 lines of stack trace, it would be preferable to handle this with some kind of message saying the device could not be reached.
Environment:
CNaaS-NMS version: 1.2.1
While working on #179, I ran hadolint against all Dockerfiles.
Not much came up, and this warning looks like it's an easy fix against unpleasant surprises.
$ hadolint docker/redis/Dockerfile
redis/Dockerfile:1 DL3007 warning: Using latest is prone to errors if the image will ever update. Pin the version explicitly to a release tag
While testing the automated dev-setup for the frontend I wound up taking a look at the swagger-interface (runs on localhost:443) to the backend API. I have some (pedantic) comments.
In most cases you have two endpoints to work with one type of data, for instance /devices to list all, and /device/<something>. But not for groups; there you only have /groups.
Furthermore, the descriptions of several of the endpoints seem copy-pasted: /job, /jobs and /joblock are all described as "API for handling jobs".
Third, why /joblocks instead of /job_locks, when you have /device_<something>? I read it as "job blocks" at first.
Fourth, there is no example response body, which would be very useful for somebody trying to figure out the API without having possibly dangerous access. I assume the swagger is auto-generated and that it is a hassle to extend it to do the right thing, but in my experience it is very much worth it.
Fifth, how are you planning to do versioning of the API? I assure you, you'll want to have a way to version the API.
Is your feature request related to a problem? Please describe.
There is still too much configuration by hand when setting up a new CNaaS configuration on the database side. We could automate this if a database schema definition is part of the Alembic migrations.
Describe the solution you'd like
Add the database schema to the Alembic migrations
Using the default docker-compose setup, the cnaas-nms API is unable to detect its own version.
Steps to reproduce the behavior:
cd docker
Create a docker-compose.override.yaml file with custom settings. This override sets BUILDBRANCH=uninett/lab-deploy, among other things.
docker-compose build, docker-compose up -d
curl -k -s -H "Authorization: Bearer $JWT_AUTH_TOKEN" $CNAAS_URL/api/v1.0/system/version | jq
Expected: git_version shows a commit hash, branch and timestamp. Instead, the API returns:
{
"status": "success",
"data": {
"version": "1.3.0dev1",
"git_version": "Unhandled exception"
}
}
With some manual debugging, I was able to trace the issue back to a file permission/ownership problem on the .git directory in the container.
I will suggest a patch to fix this.
Describe the bug
Through conversation with @indy-independence, we have learned that any environment variable prefixed with TEMPLATE_SECRET_, exported to the CNaaS-NMS API process, will be made available as a variable in Jinja templates.
This fact is not documented in the official documentation, but it should be.
Additional context
#215 further modifies how these variables are made available to templates.
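For documentation purposes, the behaviour amounts to something like the following sketch (the exact key naming inside CNaaS-NMS may differ):

```python
# Sketch: collect TEMPLATE_SECRET_* environment variables for templates.
import os

PREFIX = "TEMPLATE_SECRET_"

def collect_template_secrets(environ=None):
    """Expose TEMPLATE_SECRET_* env vars, keyed by the rest of the name."""
    environ = os.environ if environ is None else environ
    return {k[len(PREFIX):]: v for k, v in environ.items() if k.startswith(PREFIX)}

env = {"TEMPLATE_SECRET_BGP_PASSWORD": "hunter2", "PATH": "/usr/bin"}
print(collect_template_secrets(env))  # {'BGP_PASSWORD': 'hunter2'}
```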
Is your feature request related to a problem? Please describe.
At this point in time the user must set up their own certificate chain to verify tokens produced by the auth-poc-service. This is not very secure, as a lot of deployments probably use the built-in certificates.
Describe the solution you'd like
Refactor the jwt_security decorator to use the dynamically retrieved configuration and certificates used in issue #310.
In various Dockerfiles, you install 3rd party Python libraries like this:
$ pip install dependency1 dependency2 ..
Sooner or later, these dependencies will need different versions of a sub-dependency, and then pip will fail because of a version conflict.
I recommend that you put these dependencies in requirements-files with version numbers and install them like this:
$ pip install -r requirements-for-subsystem-1.txt
You can generate a requirements file with pip freeze, or better: pip-compile from pip-tools.
By using the same requirements-files everywhere you can make reproducible deploys, and also test on the actual versions that will be deployed. Much recommended!
Is your feature request related to a problem? Please describe.
Logging is done via the log.py file, where get_logger is used to get a logger object. This object has a handler for writing to redis. If redis is not running, logging will not work in cnaas-nms, as it will crash when trying to connect to redis. So if you try to run tests that exercise code that logs, you are required to have a redis session.
An overall problem is that library code that logs also sets up logging handlers. Handlers should be set up by the program using the library, for instance the main program.
Describe the solution you'd like
Remove the setup of handlers in get_logger(), and instead require that handlers are set up by the main program using the code. This way, tests can simply decide not to set up the redis logging, and can then run without needing a redis session.
Describe the bug
Dist ZTP init fails with "Neighbor device <> not synchronized" even though neighbors were synchronized before starting init.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
ZTP init success
Outputs
"Neighbor device <> not synchronized"
Environment:
Additional context
We have all of our CNaaS equipment in the .c.uninett.no domain. This is not supported by CNaaS-NMS, rendering us unable to use any setting that is validated by Pydantic using host_schema fields.
The source of the problem is this:
>>> import re
>>> from cnaas_nms.db import settings_fields as sf
>>> pattern = re.compile(sf.FQDN_REGEX)
>>> pattern.match("test.c.uninett.no")
>>> pattern.match("test.cd.uninett.no")
<re.Match object; span=(0, 18), match='test.cd.uninett.no'>
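A label pattern that accepts single-character labels would fix this; a sketch follows (the actual FQDN_REGEX in settings_fields.py is different, this is only an illustration):

```python
# Sketch: an FQDN pattern that allows single-character labels like "c".
import re

# Each label: one alphanumeric char, optionally followed by up to 61
# chars of [a-z0-9-] ending in an alphanumeric; final TLD of 2+ letters.
FQDN_RE = re.compile(
    r"^([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE
)

print(bool(FQDN_RE.match("test.c.uninett.no")))   # True
print(bool(FQDN_RE.match("test.cd.uninett.no")))  # True
print(bool(FQDN_RE.match("-bad.example.org")))    # False
```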
./integrationtests.sh fails with an error while building the API container:
ERROR: Cannot install -r requirements.txt (line 12) and napalm==3.1.0 because these package versions have conflicting dependencies.
The conflict is caused by:
The user requested napalm==3.1.0
nornir 2.4.0 depends on napalm<3 and >=2
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
ERROR: Service 'cnaas_api' failed to build : The command '/bin/sh -c /opt/cnaas/cnaas-setup.sh $BUILDBRANCH' returned a non-zero code: 1
The error seems related to changes in the pip dependency resolver. See https://pip.pypa.io/en/latest/user_guide/#changes-to-the-pip-dependency-resolver-in-20-3-2020
A temporary fix is adding --use-deprecated=legacy-resolver to the pip call.
A long-term fix seems to be updating requirements.txt with newer, compatible versions.
Problem statement
It should be possible to dynamically configure the CNaaS-NMS API and frontend application to be compatible with any OIDC provider. This could be Google, Authy, Facebook, Microsoft and/or SURFconext. Usually this is done by feeding the application a .well-known endpoint that configures the auth(n|z) of your application.
Describe the solution you'd like
Implement the authlib library to configure the api: https://authlib.org/
The dynamic configuration should look something like this
oauth.register(
"connext",
server_metadata_url=settings.OIDC_CONF_WELL_KNOWN_URL,
client_id=settings.OIDC_CLIENT_ID,
client_secret=settings.OIDC_CLIENT_SECRET,
client_kwargs={"scope": "openid"},
response_type="id_token token",
response_mode="query",
)
The Authlib library provides a mechanism to dynamically retrieve tokens and download the verification certificates into the application.
Is your feature request related to a problem? Please describe.
CNaaS-NMS has an option for configuring dhcp_relays. Because (for example) Juniper has a different config tree for DHCPv6, we have to 'detect' in Jinja whether a dhcp relay is IPv4 or IPv6. This makes the Jinja2 templates hacky and ugly.
Describe the solution you'd like
That is why it would be cleaner to just include an option dhcpv6_relays as a list of IPv6 addresses.
Describe alternatives you've considered
The alternative is to solve it in Jinja2 templates, but that can become hacky and cumbersome, at least for Juniper.
Additional context
None
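Even without separate lists, the detection the templates currently do could at least move out of Jinja into Python, sketched here with the stdlib ipaddress module (the helper name is hypothetical):

```python
# Sketch: split a mixed dhcp_relays list into IPv4 and IPv6 lists.
import ipaddress

def split_relays(relays):
    """Return (v4_relays, v6_relays) from a mixed list of address strings."""
    v4 = [r for r in relays if ipaddress.ip_address(r).version == 4]
    v6 = [r for r in relays if ipaddress.ip_address(r).version == 6]
    return v4, v6

print(split_relays(["10.0.0.1", "2001:db8::1", "192.0.2.5"]))
# (['10.0.0.1', '192.0.2.5'], ['2001:db8::1'])
```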
Is your feature request related to a problem? Please describe.
We need the ability to run IPv6 in the management domain. The management domain does not allow setting ipv6_gw as a field, so we are unable to generate a management VXLAN with IPv6 addresses.
Describe the solution you'd like
It should be possible to add an ipv6_gw field to the management domain.
Describe alternatives you've considered
Deriving ipv6_gw from the IPv4 address (ipv4_gw) is not a good solution for us.