mfdlabs / grid-bot
The underlying code used for the MFDLABS Grid Bot.
Home Page: https://grid-bot.ops.vmminfra.net
License: Apache License 2.0
The current codeowners file has the following problems:
We can fix this by creating the teams and determining on the code-owners registry who owns what.
So currently we have someone manually deploy files to the nodes.
We are thinking of writing dedicated software that automates this (only for machines not hooked onto our network, else we can just use ARCBD)
The app will poll this repository's releases and determine whether there is a new version by comparing against a registry key. By default it will not deploy releases marked as "pre-release" unless we put a very specific string into the release title that overrides this (only do this in very specific cases, as it will deploy to every available node).
To do this we may also need to create a bridge TCP or UDP server on the bot that can accept maintenance commands from an external source, so that we can enable maintenance automatically before deployment. This will also kill every arbitered instance so we have enough memory to complete the deploy.
When it finds a new release, it will download the release and decompress it into the predefined path from its configuration (check vault). It will then Set-Location to that path, run the unpacker script, and Set-Location into the newly created folder. If the settings say so, it will copy the configuration files from the previous deployment over the new deployment's configuration files.
Finally it will run a script that:
It will then persist this version in the registry.
This auto deployer will have a timespan setting that determines the polling interval.
WARNING: If we can somehow get a webhook for this that listens for changes instead of polling, do that instead.
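The release selection described above could be sketched roughly as follows. This is only an illustration under assumptions: the override marker `[FORCE-DEPLOY]` is hypothetical (the actual string is not stated here), `deployed_version` stands in for the value read from the registry key, and the release dictionaries are assumed to carry the `tag_name`, `name` and `prerelease` fields in newest-first order.

```python
from typing import Dict, List, Optional

# Hypothetical marker; the notes above only say "a very specific string".
OVERRIDE_MARKER = "[FORCE-DEPLOY]"


def pick_release(releases: List[Dict], deployed_version: Optional[str]) -> Optional[Dict]:
    """Return the newest deployable release, or None if there is nothing new.

    `releases` is assumed to be ordered newest first.
    `deployed_version` stands in for the version persisted in the registry.
    """
    for release in releases:
        is_forced = OVERRIDE_MARKER in (release.get("name") or "")
        if release.get("prerelease") and not is_forced:
            continue  # pre-releases are skipped unless explicitly forced
        if release["tag_name"] == deployed_version:
            return None  # the newest deployable release is already deployed
        return release
    return None
```

The deployer would call this once per polling interval and persist the returned `tag_name` only after the deploy script succeeds.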
Add some extensions to Task
The fix for this?
QueueUserWorkItem on CrashHandler.Upload()
This issue and milestone aims to integrate the LOVE_ALL_ENVIRONMENTS flag.
Right now it is tedious as hell to set this up on any machine that doesn't already have the requirements.
OpSec: SEC-04-LAE
Integrations of Windows Docker images, AWS images and better tooling scripts for easy deployment.
This also ties into SEC-13-ADP with a possibility of a new script on deployments that bootstraps setups for new devices.
Also tie this into SEC-04-LAP with integration of Linux based environments.
AS WELL as tying into SEC-10-ARBITERS to integrate remote management of arbiter instances. Which may also integrate its own gRPC service that integrates its own local arbiter service.
TODO: draft implementation.
Relates to #34
Fix a little issue with uploading artifacts.
Currently we have all of our dependencies in the project. I want to move this back to Nuget packages.
Allow clients backed by vault provider to be used without vault, as in load plaintext configurations correctly.
Fixes some issues with grid-bot
The following deployment:
https://mfdlabs-infrastructure.s3.amazonaws.com/ops/deployments/MFDLABS/MFDLabs.Grid.Bot/2022/HasErrors/2022.01.28-00.56.39_master_1986de0-net472-Release.mfdlabs-archive
Cannot be run because the settings never load, for an unknown reason, when trying to use vault.
Introduces changes to privacy policy and ToS to better alleviate ambiguities with each.
Fix the issue with Logger when the args is empty (it throws).
Extends #74
In here I, and @mfdlabs/grid-team will think about ideas for SEC-17-EASTER
I was trying to deploy temporarily to EC2 but had binding failure issues.
This repository needs more labels that reflect not only opsecs but also a Key ID of work.
SEC-04-REPOOPS
Platforms, despite the only available platform being Windows, should have their own labels.
OPSECS, despite having milestones should have their own labels to show relationships. The milestone represents the primary OPSEC for an issue or PR.
Priority should have labels:
P2 - Key Deliverable (Highest priority)
P1 - Deliverable (Mid priority)
P0 - Stretch Goals (Lowest priority, can be spread over multiple quarters without putting into at-risk)
Status should have labels despite the project:
Not Started
Research
Opportunistic
Deferred
On Track
At Risk
Blocked
Delayed
Partial Release
Complete
Cancelled
Motivated Areas should have their own labels; they act like opsecs but cover a broad area of implementation rather than a single idea.
Issue kind should have its own label:
feature - single feature
fix - bug fix
enhancement - feature modification to existing product
ops - overhaul
dev - team branch
Text parsing for log files and RCCService process outputs, to be used to support SEC-05-WCF.
Right now we only support the rendering of Avatars and Closeups of users, we would like to extend this functionality by creating new ways of rendering.
Currently, when a timeout occurs on the Render Queue instance, the instance is gone forever until you manually kill it. Every subsequent request after the timeout is instantly dropped because that Job will still be running, and you cannot execute multiple Jobs at the same time on the same grid server instance (there may be a setting that allows you to, but who knows 🤷).
What this change will introduce is the following:
There will be a property on GridServerArbiter.GridServerInstance that determines whether the instance can perform auto recovery, along with a Type array of Exception types that will invoke the auto recovery. If an auto recovery is invoked, it will close and reopen the instance, then execute the command again (if the setting that says to do so is enabled). If the number of recoveries exceeds a specific number, it will permanently terminate the instance.
This will make it so HA integrity can be maintained on instances that need a long life cycle such as the render queue, or the up and coming shared user instances.
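A minimal sketch of that recovery loop, written in Python for brevity rather than in the bot's own language. `RecoverableInstance`, `restart` and the parameter names are hypothetical stand-ins for whatever ends up on GridServerArbiter.GridServerInstance; only the behaviour (recover on listed exception types, optionally retry, terminate after too many recoveries) comes from the description above.

```python
from typing import Callable, Tuple, Type


class RecoverableInstance:
    """Hypothetical stand-in for GridServerArbiter.GridServerInstance."""

    def __init__(self,
                 recoverable: Tuple[Type[BaseException], ...] = (TimeoutError,),
                 max_recoveries: int = 3,
                 retry_after_recovery: bool = True):
        self.recoverable = recoverable            # exception types that trigger recovery
        self.max_recoveries = max_recoveries      # limit before permanent termination
        self.retry_after_recovery = retry_after_recovery
        self.recoveries = 0
        self.terminated = False

    def restart(self) -> None:
        """Close and reopen the underlying process (a no-op in this sketch)."""

    def execute(self, command: Callable[[], object]) -> object:
        while True:
            if self.terminated:
                raise RuntimeError("instance permanently terminated")
            try:
                return command()
            except self.recoverable:
                self.recoveries += 1
                if self.recoveries > self.max_recoveries:
                    self.terminated = True    # too many recoveries: give up for good
                    raise
                self.restart()
                if not self.retry_after_recovery:
                    raise                     # recovered, but the caller must retry
```

Exceptions outside `recoverable` propagate untouched, so only the listed failure modes count towards the recovery limit.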
There may be multiple PRs related to this, so keep an eye on it.
From the WOTS on the matter:
We keep getting spammed on backlog with the exceptions of there being another job running already. If we can sort of fix it by adding something to kill it when it finds these exceptions that would be amazing!!
Right now, MFDLabs.Analytics.Google.Client is UA only. I want to create another client for GA4 (Metrics Protocol) and use both, so we are prepared for when UA is ultimately removed in July 2023.
MFDLabs.Analytics.Google.UniversalAnalytics.Client
MFDLabs.Analytics.Google.MetricsProtocol.Client
Implements the precise cases for unknown version.
Currently it considers the version unknown if the remote errors or times out; we only want it to consider the version unknown if the remote returns a 404.
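The intended distinction can be sketched as a small classifier. The names `VersionState` and `classify_lookup` are hypothetical, and treating any 2xx response as "known" is an assumption; the only rule taken from the description is that a 404 means unknown, while errors and timeouts do not.

```python
from enum import Enum
from typing import Optional


class VersionState(Enum):
    KNOWN = "known"
    UNKNOWN = "unknown"              # remote answered 404: the version does not exist
    INDETERMINATE = "indeterminate"  # error or timeout: no conclusion, try again later


def classify_lookup(status_code: Optional[int]) -> VersionState:
    """`status_code` is None when the request timed out or never connected."""
    if status_code == 404:
        return VersionState.UNKNOWN
    if status_code is not None and 200 <= status_code < 300:
        return VersionState.KNOWN
    return VersionState.INDETERMINATE
```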
Allow the ability to upload build symbols before deployment on release candidates.
This will allow the -- args like --install for WCF Service Host Apps
Add ratelimits to the Grid Deployer so it doesn't crash for no reason because of this:
https://github.com/mfdlabs/stacktraces/blob/master/grid-autodeployer-graphql-null-response.txt
I want to change the base namespace from MFDLabs to something else because I just do not like the look of this name.
Please refer to mfdlabs/grid-bot-support#3
Within the grid bot's infrastructure, we use 3 Expiring task threads (1 Async). This code is considered legacy and breaks the development rule of HATE_SINGLETON.
If we can get rid of TaskThreads for AsyncWorkQueues, that would be better.
Report to InfraAPI (either HTTP API or gRPC API) on when a specific node is acting as a grid deployer and when a grid deployer opens an instance etc.
This will try to fix the major issue with the logger, namely that ANSI characters don't render on a non-ANSI console.
Also I will migrate a few things to P/Invoke etc.
This integrates OPS-4.
Clears up unused GitHub files.
Cleanups to the config code
Is your feature request related to a problem? Please describe.
We need to migrate to the latest discord.net as a maintainer suggests that the latest version fixes the timeouts
Describe the solution you'd like
A fix to timeouts
Describe alternatives you've considered
Making all callbacks multithreaded
Is your feature request related to a problem? Please describe.
This will restrict the IP addresses allowed to access the counter server.
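As a rough illustration of such a restriction, the check could look like the sketch below. The allowed networks are placeholders (the real ranges are not given here) and `is_allowed` is a hypothetical helper name.

```python
import ipaddress

# Placeholder allow-list; the real ranges are not stated in this issue.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("127.0.0.1/32"),
]


def is_allowed(remote_address: str) -> bool:
    """Return True only if the caller's address falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(remote_address)
    except ValueError:
        return False  # malformed addresses are rejected outright
    return any(address in network for network in ALLOWED_NETWORKS)
```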
There's an issue with the auto release uploader that causes it to upload corrupted archives.
I may fix this by either uploading form-data or raw bytes.
So we are having a major issue lately where the bot will die and never recover. The cause has been traced to a "Heartbeat missed" exception:
Error Type: Discord.WebSocket.GatewayReconnectException
Error Detail: Server missed last heartbeat
Inner Exception:
Exception Stack Trace:
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext() in C:\git\MFDLABS\MFDLabs.Grid\Dependencies\Discord.Net.WebSocket\ConnectionManager.cs:line 79
Exception Source: mscorlib
Exception TargetSite: Void Throw()
Exception Data: System.Collections.ListDictionaryInternal
We've raised this with the developers of Discord.Net here: discord-net/Discord.Net#2126
One fix we may implement is the following: instead of running the initial checks synchronously in the MessageReceived handler, post them to a WorkQueue for background processing so we don't block the gateway task.
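The idea can be sketched as below, in Python rather than against Discord.Net itself. `BackgroundWorkQueue` is a hypothetical name; the point is only that the gateway callback enqueues the message and returns at once, while a separate worker thread runs the checks.

```python
import queue
import threading
from typing import Callable


class BackgroundWorkQueue:
    """Runs a handler on a worker thread so the caller returns immediately."""

    def __init__(self, handler: Callable[[str], None]):
        self._handler = handler
        self._items = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def post(self, message: str) -> None:
        # Called from the gateway's message callback: enqueue only, never block.
        self._items.put(message)

    def _run(self) -> None:
        while True:
            message = self._items.get()
            try:
                self._handler(message)
            except Exception as error:  # one bad message must not kill the worker
                print(f"handler failed: {error!r}")
            finally:
                self._items.task_done()

    def join(self) -> None:
        """Block until every posted message has been handled."""
        self._items.join()
```

With a single worker, messages are still processed in arrival order; a slow check only delays later messages, not the heartbeat.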
Integrates some new features and fixes to the auto deployer.
Is your feature request related to a problem? Please describe.
This will create 2 commands that will be useful for debugging.
Describe the solution you'd like
2 commands to log the deployment ID and debug file name.
Describe alternatives you've considered
Removing the option from support ticket
So there’s a new issue with the frontend user that causes all single instance and arbitered grid server requests to timeout.
It has to do with the WebServer that drives the backend for the GridServer; the grid server needs this backend to download settings for its own operation, and will crash if it fails (I could do a file based settings thing so I can avoid the web server, but it’s there so just keep it.)
Screenshots (images not included): 02/11/2021, 03/11/2021
The subdomain api.sitetest4.robloxlabs.com causes it, and it may be due to a request being left open (a middleware not calling next()).
How do I want to go about fixing this? What I can do is remove EVERY API other than ClientSettings, Avatar, and Version Compatibility. That will also speed up the start time of the web server drastically, but it comes with the downside of dropping support for some features like game persistence etc.
All in all, if I seriously want to fix it, I could use something extremely streamlined like C++ or C# instead of JavaScript, but I don’t want to go through the pain of rewriting it.
Edit:
MFDLabs.Internal.RbxAvatar.Site
MFDLabs.Internal.RbxClientSettings.Site
MFDLabs.Internal.RbxVersionCompatibility.Site
May become things :check_mark:
Ripped from:
https://backlog.mfdlabs.local/ui/grid/mfdlabs.grid.bot/issues/7/?t=no&focusSummary=true
This development branch will aim to add a comment to the top of every file owned proprietarily by MFDLABS and the teams within its own scope.
/* Copyright © MFDLABS Corporation 2001-. All rights reserved. */
This text will also be placed within the C# project file or Assembly Info file Copyright section.
Extends #74
Currently with logs, there's a massive prefix on the log string, and half the data on it is not cached, such as DNS resolution and IP lookup.
I want to make it so it will cache the data that doesn't change (Anything other than the current date, uptime and thread ID)
An example of this abomination is as follows:
That prefix is nearly 200 characters long and contains 13 data sets that are fetched every time a log is written, which makes logging inefficient.
I can shorten it to only around 66 characters in length.
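A minimal sketch of the caching idea, assuming for illustration that the unchanging fields are the host name and process ID (the real prefix has more fields, which are not listed here). Only the date, uptime and thread ID are recomputed on each call; `static_prefix` and `log_prefix` are hypothetical names.

```python
import functools
import os
import socket
import threading
import time

_START = time.monotonic()


@functools.lru_cache(maxsize=1)
def static_prefix() -> str:
    """Fields that never change while the process runs; computed once."""
    return f"[{socket.gethostname()}][{os.getpid()}]"


def log_prefix() -> str:
    """Cached static part plus the three fields that do change per call."""
    now = time.strftime("%Y-%m-%dT%H:%M:%S")
    uptime = time.monotonic() - _START
    return f"[{now}][{uptime:.3f}][{threading.get_ident()}]{static_prefix()}"
```

Expensive lookups such as DNS resolution would sit inside `static_prefix`, so they run on the first log call only.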
Please watch for changes on fix/shorter-log-prefixes-and-faster-logging
Add better build scripts