On the system I tested, the issue got triggered when grabbing `vlanCktEp` as well as `fvAEPg` when the child objects are included.
It is a bit of trial and error, but a page size of 1000 seems to work fine for both use cases. I also doubt anyone will have a fabric that big... I tested against one of the fabrics we use for scale testing, which has something like 15K EPGs, 750K static port bindings and 500 switches.
Perhaps set a default of 1000 with an option to override it?
from aci-exporter.
Alright, I found the real issue! The APIC-cookie keeps getting re-added on each page, so on page 11 there are 11 APIC-cookie values concatenated.
I found this out by making the HTTP request and taking a sniffer trace:
Cookie: APIC-cookie=<DATA>; APIC-cookie=<DATA>; APIC-cookie=<DATA>
etc....
I looked at the code but I don't see why the cookiejar would do this, and the APIC sends Set-Cookie: APIC-cookie= only on the login method.
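As an illustration of the symptom (not the exporter's actual code), a small Go sketch of how `Add`-ing the Cookie header on every page request accumulates values, whereas `Set` replaces the previous one:

```go
package main

import (
	"fmt"
	"net/http"
)

// buildHeaders simulates re-adding the auth cookie on every page request.
// http.Header is a map[string][]string: Add appends another value under
// the same key, Set replaces whatever was there before.
func buildHeaders(pages int, useSet bool) http.Header {
	h := http.Header{}
	for i := 0; i < pages; i++ {
		cookie := "APIC-cookie=<DATA>"
		if useSet {
			h.Set("Cookie", cookie)
		} else {
			h.Add("Cookie", cookie)
		}
	}
	return h
}

func main() {
	// With Add, page 11 carries 11 concatenated cookie values on the wire.
	fmt.Println(len(buildHeaders(11, false)["Cookie"])) // 11
	fmt.Println(len(buildHeaders(11, true)["Cookie"]))  // 1
}
```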
@thenodon I have 62 switches and 111 EPGs. I confirm that this feature is quite important :). The accuracy of monitoring depends a lot on the response time. I'll keep you informed.
@thenodon Yes, I merged master into issue_37 taking issue_37 as reference.
OK, I'm going to roll back to the master branch while the changes are made.
@camrossi do you have any recommendations on what would be good settings? Or do you think it should be defined at a query level?
@camrossi I have implemented what I believe is a solution to paging, but... just setting `page` and `page-size` and iterating over the result until all pages are consumed is not enough. For this to work you also need the `order-by` query parameter to get a consistent response; otherwise the final result includes duplicates and misses objects in a random way. Currently the `order-by` is set in an implicit way in the `doGet` method, but it assumes that the class has a `dn` attribute. Not sure `dn` is an attribute on all classes, but it looks like it works for all the test queries I have. So the query attribute added is `order-by=<class_name of the query>.dn`.
Would be great if you can test it on your large cluster. Checkout branch `issue_37`. The paging size is hard-coded to 1000, but can be made configurable. Not sure if the `order-by` has performance implications, but maybe you know.
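The resulting request can be sketched in Go; `pagedURL` and its parameters are illustrative names, but `order-by`, `page` and `page-size` are the APIC query parameters described above:

```go
package main

import (
	"fmt"
	"net/url"
)

// pagedURL builds the query string for one page of a class query, adding
// the order-by parameter so every page is returned in a stable order.
// A page size of 1000 matches the hard-coded default discussed above.
func pagedURL(base, class string, page, pageSize int) string {
	q := url.Values{}
	q.Set("order-by", fmt.Sprintf("%s.dn", class))
	q.Set("page", fmt.Sprintf("%d", page))
	q.Set("page-size", fmt.Sprintf("%d", pageSize))
	return base + "/api/class/" + class + ".json?" + q.Encode()
}

func main() {
	// url.Values.Encode sorts keys alphabetically.
	fmt.Println(pagedURL("https://apic", "fvAEPg", 0, 1000))
	// https://apic/api/class/fvAEPg.json?order-by=fvAEPg.dn&page=0&page-size=1000
}
```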
Hi @thenodon !
I made a few tests: on a high-scale fabric we need to also pass an API timeout or the APIC might stop processing on its side. I just added `&timeout=180` where you set `page` and `page-size`. 180 is the timeout we use in the APIC UI as well, so it should be a safe bet.
This was mostly needed for the `count` queries, but it does not hurt to have it set for all the queries.
I am still getting a few queries failing, but I think that is due to some other issue... I am also having some trouble using the `queries=xxx` syntax to execute only some queries.
If I do `&queries=compound_queries` it doesn't execute any queries, and the same happens with `&queries=object_count`. Do you support filtering like that?
Anyhow, the filtering works with `class_queries`, so I managed to do this test so I could focus on a single request that is failing:
curl -s 'http://localhost:9643/probe?target=scale2-fabric&queries=epg_port_vlan_binding'
This will get ~9 pages of data over 110 seconds, but then the APIC will refuse the next request with an HTTP 400.
There are ~20 pages to be retrieved, and what I am not really clear on at the moment is why this is happening... I did the exact same thing with curl, doing 20 requests (without re-authenticating), and it worked just fine.
Hi @camrossi - first, the `queries=` query attribute should be set to the name of the "query" in the config file. A query is always one of the types:
- class_queries
- compound_queries
- group_class_queries
So from the `example_config.yml`, the query `interface_info` is a `class_queries`. You cannot set the query parameter `queries=` to one of the type names, only to a named query. Not sure if this is clear in the documentation.
When it comes to the problem that you do not get all the pages, I can also reproduce it. I have to set the page-size very low. I ran the query `interface_info` that returns 98 entries (totalCount=98). I set the page-size to 3, which means that I have to iterate 33 times to get all entries. But what happens is that after 10 successful requests, the 11th gets a 400. So I tried to start on a different offset than 0. I get the same result setting the offset to 10, page=10, but when trying request page=20 a 400 is returned. So only 10 requests are successful, independent of the page-size. I also tried to remove the `order-by`, but it did not make any difference.
So it does not matter what offset I start from (page=X): only 10 requests are successful and the next gives a 400.
A 400 means a bad request, but the response does not include any information about what is "bad". Do you have access to any logs on the APIC side that could give some more insight into what the problem relates to? You can run this on a smaller cluster; just change line 44 in aci_connection.go to something small, as long as the response requires more than 10 iterations, like:
var pageCount uint64 = 3
@camrossi I think I found it. Tested with curl and it worked as expected, like you also said. The only difference is that in aci-exporter I set the header `Content-Type: application/json`, and when I ran with curl I did not set that header. I removed the header in aci-exporter and it works! I cannot understand why this problem only happens on call number 11 in the iteration. It would still be great if you have access to logs to see what the issue was on the APIC side.
You can do a pull on the `issue_37` branch and test. I have also added to the logs the number of "pages" that were collected based on the page-size:
INFO[0013] api call fabric class=ethpmPhysIf exec_time=286413 fabric=cam length=119622 method=GET pages=3 requestid=2WWFkMYD5BcK4CWV1B5J4FeA3Rk status=0 uri="https://173.36.219.190/api/class/ethpmPhysIf.json"
@camrossi related to the timeout issue. When the aci-exporter creates the http client, the default timeout is 0, which means no timeout. You can define it in the config.yml like:
httpclient:
  insecurehttps: true
  # Default is 15 seconds
  keepalive: 15
  # Zero means no timeout
  timeout: 0
If you run from Prometheus you must define the Prometheus scrape timeout against the exporter.
@camrossi Interesting findings, and it helped me understand the real problem. It had nothing to do with whether the header `Content-Type: application/json` was set or not. It was my header logic that was buggy.
I have now, hopefully, fixed this and tested with different page-sizes and with the original Content-Type header set. The page-size can be set in the config as:
httpclient:
  pagesize: 100
  ....
Default is 1000.
Looking forward to your verification on your monster cluster. The branch is still `issue_37`.
@thenodon success, it works.... Now I am not sure if Prometheus will work. The query I am testing consists of pulling all the static port bindings (765327), and that takes 6 minutes. If that is too long, would it make sense to return data to Prometheus on a per-page basis? I did pull 19 pages in those 6 minutes.
I can't even fault the APIC too much, as we are pulling static port bindings on a fabric that is pretty much at the highest multi-dimensional scale you can go.
Small note: I had to increase the page size to 10K or it was going to take too long. I think we would eventually want to support setting this per query; the limit is the size (in bytes), so depending on what query you do, 10K is too much or you could go even higher.
@camrossi good to hear that it worked, but I think we need to do some thinking for this large-fabric use case. A scrape time higher than a minute is bad in the context of Prometheus, and I do not see Prometheus managing paging in the scrape logic. That would involve complex discovery, and the exporter would have to expose the page number or page range in the api. And how many pages would have to be scraped?
I think the way forward is one of these alternatives:
- Make every page fetch parallel. Currently they are done sequentially. The question is, does this burst of api calls against the apic become a resource problem for the apic? This should be fairly simple to implement, and I think we should try that.
- Do async fetching of metrics and cache the result. That could be implemented in the aci-exporter or as an external proxy cache. I have developed a number of exporters that use this pattern, where the exporter collects data on its own "scrape" interval and caches it, and Prometheus can scrape every minute, but the metrics in Prometheus will stay the same during the interval of the exporter's scrape/fetch interval. This means lower resolution of the metrics in Prometheus, but that may be okay. This is often the scenario for streaming metrics, where the device/platform pushes its metrics every X minutes.
I think we should test the first alternative first, if you do not have any objection. If my math is not wrong, you did 19 pages in 360 sec, which would mean each page took approx 19 sec to complete. So with a parallel approach the total time should be in the range of x sec (a single call to get the `totalCount` using a page size of 1) + 19 sec (the remaining pages in parallel). The question is how it affects the apic.
@camrossi I have implemented the parallel page fetch. It will be interesting to see how it affects the total processing time on your large cluster. On my small cluster I ran `interface_info` from the example_config.yml with a pagesize=1:
- sequential - ~24 sec
- parallel - ~3 sec
The fix is in the `issue_37` branch.
I will give it a try next time I get access to the cluster.
Anyway this is not super urgent :D
@camrossi I forgot to mention, but you can disable parallel processing in the config:
httpclient:
  # Below are the default values
  pagesize: 1000
  # If false, paging will be done sequentially for the query
  parallel_paging: true
It would be nice to try to find the "sweet spot" for the normal ACI customer in how to fetch data. Or even better, be able to do some analysis of the cluster on startup and from that set the different configuration parameters for how to process results. I can imagine that the number of pods, spines and leaves define the size, but maybe also more "virtual" objects like EPGs etc. Ideas?
@camrossi any feedback?
Hi @thenodon, @camrossi,
I can test the parallel page fetch with my fabric. Today I am up to date with the master branch. I will merge with issue_37 and give you feedback.
Hi @ahmedaall, not sure if it will merge without conflicts, but give it a try. I hope to be able to continue to focus on this soon, since for a large fabric it makes so much sense. Looking forward to your feedback and test results. It would be great if you can share the size of the fabric if you do comparison tests.
@thenodon Did you get the same output during the docker build?
./aci-api.go:69:26: cannot use *newAciConnection(ctx, fabricConfig) (variable of type AciConnection) as *AciConnection value in struct literal
./aci-connection.go:91:2: jar declared and not used
./aci-connection.go:91:12: undefined: cookiejar
./aci-connection.go:117:21: undefined: responseTime
./aci-connection.go:120:2: undefined: connectionCache
./aci-connection.go:121:9: undefined: connectionCache
./aci-connection.go:127:17: c.tokenProcessing undefined (type *AciConnection has no field or method tokenProcessing)
./aci-connection.go:137:4: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:138:10: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:171:4: c.token undefined (type *AciConnection has no field or method token)
./aci-connection.go:171:4: too many errors
ERROR: process "/bin/sh -c go get -u && go mod tidy && go build -o aci-exporter *.go" did not complete successfully: exit code: 1
RUN go get -u && go mod tidy && go build -o aci-exporter *.go:
./aci-connection.go:91:2: jar declared and not used
./aci-connection.go:91:12: undefined: cookiejar
./aci-connection.go:117:21: undefined: responseTime
./aci-connection.go:120:2: undefined: connectionCache
./aci-connection.go:121:9: undefined: connectionCache
./aci-connection.go:127:17: c.tokenProcessing undefined (type *AciConnection has no field or method tokenProcessing)
./aci-connection.go:137:4: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:138:10: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:171:4: c.token undefined (type *AciConnection has no field or method token)
./aci-connection.go:171:4: too many errors
@ahmedaall I am not using docker when developing. What branch are you on? Have you tried to merge master into branch issue_37? There are many things changed between master and branch issue_37. I think it needs more work than just a merge. I will see if I get some time this week. I have a clear idea of what needs to change, but will do it from the master branch.
@ahmedaall and @camrossi I have now used the current master and created branch `pagination_issue_37` with the parallel page handling that was done in branch `issue_37`. The README.md is updated; search for "Large fabric configuration". Some important notes:
- The class used in the query will automatically be added to the final request url as `order-by=<class_name>.dn`, so you do not need to add it in the configuration of the query. Without the order-by you will not get a consistent response.
- When using parallel paging, the first request has to be done in a "single" way to get the total number of entities that will be returned. That number is then used to calculate the number of requests to do in parallel with the right offset.
- I detected that if the url query includes `rsp-subtree-include=count` it will not work in parallel mode, since the response is the count of items returned. So if that "string" is in the query, a single page request is used. Not sure if there are any more corner cases.
So checkout `pagination_issue_37` and give it a spin. Looking forward to your feedback, and especially what, hopefully, latency decrease it gives on a larger fabric.
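The two notes about the initial count request and the `rsp-subtree-include=count` corner case can be sketched as follows; the function names are illustrative, not the exporter's actual identifiers:

```go
package main

import (
	"fmt"
	"strings"
)

// useParallelPaging mirrors the corner case described above: a query that
// contains rsp-subtree-include=count returns a count object, not pages of
// objects, so it must be fetched as a single request.
func useParallelPaging(queryParameter string) bool {
	return !strings.Contains(queryParameter, "rsp-subtree-include=count")
}

// remainingPages computes how many page requests are left after the first
// "single" request has returned totalCount.
func remainingPages(totalCount, pageSize int) int {
	pages := (totalCount + pageSize - 1) / pageSize // ceiling division
	if pages < 1 {
		return 0
	}
	return pages - 1
}

func main() {
	fmt.Println(useParallelPaging("?rsp-subtree-include=count")) // false
	fmt.Println(useParallelPaging("?rsp-subtree=children"))      // true
	// 765327 static port bindings at page-size 10000 -> 77 pages, 76 after the first.
	fmt.Println(remainingPages(765327, 10000)) // 76
}
```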
Hi @thenodon
I managed to successfully merge pagination_issue_37.
With the default parameters the scraping time doesn't seem to change.
Maybe I missed something. Here is my config:
# Exporter port
port: 9643
# Configuration file name default without postfix
config: config
# The prefix of the metrics
prefix:
fabrics:
  # cisco_sandbox:
  #   username: admin
  #   password: ""
  #   apic:
  #     - https://sandboxapicdc.cisco.com
  MY_FABRIC:
    # Apic username
    username: ACI_USERNAME
    # Apic password.
    password: ACI_PASSWORD
    # The available apic controllers
    # The aci-exporter will use the first apic it can successfully login to, starting with the first in the list
    apic:
      - https://LB-FABRIC
# Http client settings used to access apic
# Below is the default values, where 0 is no timeout
httpclient:
  # insecurehttps: true
  # keepalive: 15
  # timeout: 10
  # this is the max and also the default value
  pagesize: 1000
  # enable parallel paging, default is false
  parallel_paging: true
# Http server settings - this is for the web server aci-exporter expose
# Below is the default values, where 0 is no timeout
httpserver:
  # read_timeout: 0
  # write_timeout: 0
# Define the output format should be in openmetrics format - deprecated from future version after 0.4.0, use below metric_format
#openmetrics: true
metric_format:
  # Output in openmetrics format, default false
  openmetrics: false
  # Transform all label keys to lower case format, default false. E.g. oobMgmtAddr will be oobmgmtaddr
  label_key_to_lower_case: true
  # Transform all label keys to snake case format, default false. E.g. oobMgmtAddr will be oob_mgmt_addr
  label_key_to_snake_case: false
@ahmedaall - you should run the branch `pagination_issue_37`; you do not have to merge. Are you sure you got the latest changes to that branch? What queries do you run?
@thenodon I misspoke. I run `pagination_issue_37` with my custom Dockerfile and my custom yaml config.
Here is my main query:
class_queries:
  interface_info:
    class_name: l1PhysIf
    query_parameter: "?rsp-subtree=children&rsp-subtree-include=stats&rsp-subtree-class=ethpmPhysIf,eqptIngrBytes5min,eqptEgrBytes5min,eqptIngrDropPkts5min,eqptEgrDropPkts5min&query-target-filter=and(ne( l1PhysIf.adminSt, \"down\"))"
    metrics:
      - name: interface_speed_temp
        value_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        type: gauge
        help: The current operational speed of the interface, in bits per second.
        # value_transform:
        #   'unknown': 0
        #   '100M': 100000000
        #   '1G': 1000000000
        #   '10G': 10000000000
        #   '25G': 25000000000
        #   '40G': 40000000000
        #   '100G': 100000000000
        #   '400G': 400000000000
      - name: interface_admin_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: l1PhysIf.attributes.adminSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current admin state of the interface.
        value_transform:
          'down': 0 ## ~ disabled interfaces
          'up': 1 ## ~ enabled interfaces
      - name: interface_oper_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
        # A string to float64 transform table of the value
        value_transform:
          "down": 0 ## ~ disabled interfaces
          "up": 1 ## ~ enabled interfaces
    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: l1PhysIf.attributes.dn
        # The regex where the string enclosed in the P<xyz> is the label name
        regex: "^topology/pod-(?P<pod_id>[1-9][0-9]*)/node-(?P<node_id>[1-9][0-9]*)/sys/phys-\\[(?P<interface_name>[^\\]]+)\\]"
      # Add the descr attribute to the fields
      - property_name: l1PhysIf.attributes.descr
        regex: "^(?P<interface_description>.*)"
      - property_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        regex: "^(?P<speed_temp>.*)"
  interface_info_more:
    class_name: l1PhysIf
    query_parameter: "?rsp-subtree=children&rsp-subtree-include=stats&rsp-subtree-class=ethpmPhysIf,eqptIngrBytes5min,eqptEgrBytes5min,eqptIngrDropPkts5min,eqptEgrDropPkts5min&query-target-filter=and(ne( l1PhysIf.adminSt, \"down\"))"
    metrics:
      - name: interface_rx_unicast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.unicastCum
        type: counter
        help: The number of unicast bytes received on the interface since it was integrated into the fabric.
        unit: bytes
      - name: interface_rx_multicast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.multicastCum
        type: counter
        unit: bytes
        help: The number of multicast bytes received on the interface since it was integrated into the fabric.
      - name: interface_rx_broadcast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.floodCum
        type: counter
        unit: bytes
        help: The number of broadcast bytes received on the interface since it was integrated into the fabric.
      - name: interface_rx_buffer_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.bufferCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a buffer overrun while receiving since it was integrated into the fabric.
      - name: interface_rx_error_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.errorCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a packet error while receiving since it was integrated into the fabric.
      - name: interface_rx_forwarding_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.forwardingCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a forwarding issue while receiving since it was integrated into the fabric.
      - name: interface_rx_loadbal_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.lbCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a load balancing issue while receiving since it was integrated into the fabric.
      - name: interface_tx_unicast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.unicastCum
        type: counter
        help: The number of unicast bytes transmitted on the interface since it was integrated into the fabric.
        unit: bytes
      - name: interface_tx_multicast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.multicastCum
        type: counter
        unit: bytes
        help: The number of multicast bytes transmitted on the interface since it was integrated into the fabric.
      - name: interface_tx_broadcast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.floodCum
        type: counter
        unit: bytes
        help: The number of broadcast bytes transmitted on the interface since it was integrated into the fabric.
      - name: interface_tx_queue_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.afdWredCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface during queue management while transmitting since it was integrated into the fabric.
      - name: interface_tx_buffer_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.bufferCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a buffer overrun while transmitting since it was integrated into the fabric.
      - name: interface_tx_error_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.errorCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a packet error while transmitting since it was integrated into the fabric.
    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: l1PhysIf.attributes.dn
        # The regex where the string enclosed in the P<xyz> is the label name
        regex: "^topology/pod-(?P<pod_id>[1-9][0-9]*)/node-(?P<node_id>[1-9][0-9]*)/sys/phys-\\[(?P<interface_name>[^\\]]+)\\]"
      # Add the descr attribute to the fields
      - property_name: l1PhysIf.attributes.descr
        regex: "^(?P<interface_description>.*)"
      - property_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        regex: "^(?P<speed_temp>.*)"
Hi @thenodon. Do you have any clue about some misconfiguration that I could have made for parallel pagination?
For information, the syntax that you gave me in #41 works pretty well to reduce scraping time.
@ahmedaall sorry, but I have not had the time yet. I hope that @camrossi will also have a chance to test on his mega fabric. I do not see that parallel paging will have a major impact on a normal-sized fabric: if the payload fits in the max page size, paging will not improve the latency. In the end, the biggest impact on the latency is the time spent in the apic responding to the api call. The suggestion I made in #41 will not directly improve the latency, but instead of everything waiting for the query that takes the longest time, each separate query will finish without waiting for the query with the largest latency.
@thenodon I have currently been swallowed by a different black hole, but this is still on my to-do list... :)