nuget / insights
Gather insights about public NuGet.org package data
License: Apache License 2.0
The PackageManifestRecord contains raw package tags as a string. Ideally these should be pre-processed into a parsed list for easier analysis. Currently you need to split tags with something like:
JverPackageManifests
| where ResultType == "Available"
| extend Tags = translate(
" ", " ", translate(
",", " ", translate(
";", " ", translate(
"\t", " ", Tags))))
| mv-expand split(Tags, " ")
| where isnotempty(Tags)
Tag splitting algorithm: https://github.com/NuGet/NuGet.Jobs/blob/376ac06e6e07d4ee8d1e28f6b2346e3891487496/src/Catalog/Helpers/Utils.cs#L37-L48
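The pre-processing asked for above could look roughly like the following. This is a minimal sketch that assumes the separators are exactly the ones replaced in the Kusto query (space, comma, semicolon, tab); the function name split_tags is hypothetical and the linked Utils.cs remains the authority on the real algorithm.

```python
import re

# Separators taken from the Kusto query above: space, comma, semicolon, tab.
# Runs of separators are treated as a single split point.
_TAG_SEPARATORS = re.compile(r"[ ,;\t]+")


def split_tags(raw_tags):
    """Split a raw tags string into a list of non-empty tags (hypothetical helper)."""
    if not raw_tags:
        return []
    return [tag for tag in _TAG_SEPARATORS.split(raw_tags) if tag]


print(split_tags("json, serialization;linq\tdotnet"))
```

Storing the parsed list in its own column would let queries use mv-expand directly instead of repeating the translate and split steps.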
We found that transient network errors (e.g., network timeout or packet drop) may happen when releasing the lease of a blob via the ReleaseAsync SDK API, and there is no handling for these transient errors.
More precisely, if the release has actually taken effect on the server but the response of leaseClient.ReleaseAsync() does not reach the client in time because of a transient network error, the retry logic in the Azure Storage SDK sends the release request again. Since the first request already released the lease on the server, the second request fails with a Conflict error.
The default value of shouldThrow is false in TryReleaseAsync, so the catch block returns false without any other handling. However, the operation actually succeeded, so it should return true instead of false. The following lines of code show the details:
...
try
{
    await leaseClient.ReleaseAsync();
    return true;
}
catch (RequestFailedException ex) when (ex.Status == (int)HttpStatusCode.Conflict)
{
    if (shouldThrow)
    {
        throw new InvalidOperationException(StorageLeaseResult.AcquiredBySomeoneElse, ex);
    }
    else
    {
        return false;
...
As shown above, when the first release request sent by leaseClient.ReleaseAsync() in TryReleaseAsync succeeds on the server but a transient error hides the response, the retried request leads to a Conflict error, which is handled incorrectly.
We should distinguish between errors caused by contention and errors caused by transient failures. For transient ones, we can ignore the exception and return true. We believe a similar bug also happens when shouldThrow is true, because InvalidOperationException should not be thrown when transient errors happen.
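One possible way to separate the two cases is to remember whether an earlier attempt ended in a transient error. The sketch below is not the Azure SDK: FakeLeaseClient, TransientError, ConflictError and try_release are hypothetical in-memory stand-ins, and the retry loop is written out by hand purely so the attempt history is visible.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict response."""


class TransientError(Exception):
    """Hypothetical stand-in for a timeout where the response was lost."""


class FakeLeaseClient:
    """In-memory lease; optionally applies the first release but loses its response."""

    def __init__(self, drop_first_response):
        self.held = True
        self.drop_first_response = drop_first_response

    def release(self):
        if not self.held:
            raise ConflictError("lease is not held by this caller")
        self.held = False  # the effect takes place on the "server"
        if self.drop_first_response:
            self.drop_first_response = False
            raise TransientError("response lost")


def try_release(client, should_throw=False, max_attempts=3):
    """Return True if this caller released the lease, False if it was lost."""
    saw_transient_error = False
    for _ in range(max_attempts):
        try:
            client.release()
            return True
        except TransientError:
            saw_transient_error = True  # retry; the release may already have applied
        except ConflictError as ex:
            if saw_transient_error:
                # Assumption: a conflict right after a lost response means
                # the earlier attempt already released the lease.
                return True
            if should_throw:
                raise RuntimeError("lease acquired by someone else") from ex
            return False
    raise TransientError("release did not complete")


print(try_release(FakeLeaseClient(drop_first_response=True)))
```

The assumption in the conflict branch is not airtight: the lease could also have expired and been taken by another caller between the two attempts, so a real fix may still want to inspect the error details.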
The link in the readme's Queries section is broken :'(
For the Blob service, GetUserDelegationKeyAsync is used in GetServiceClientsAsync to get the delegation key, which is then used to sign the SAS token in GetBlobReadUrlAsync via BlobSasBuilder.
But I notice that the token expiration time is 1 hour and that it is refreshed at the halfway point (i.e., 30 minutes), so I wonder whether a single operation could take longer than 30 minutes. I think IKustoQueuedIngestClient.IngestFromStorageAsync uses the Blob SAS token to access the remote data. If a large file takes more than 30 minutes to ingest, it is possible to encounter an unexpected HTTP 403 Forbidden error, right?
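The arithmetic behind the question can be written down as a small sketch. It assumes only the two numbers stated above (1 hour lifetime, refresh at 30 minutes); the function names and the 45 minute ingestion time are hypothetical.

```python
from datetime import timedelta


def min_remaining_validity(token_lifetime, refresh_after):
    """Worst-case validity left on a token handed out just before the next refresh."""
    return token_lifetime - refresh_after


def may_expire_mid_operation(token_lifetime, refresh_after, operation_duration):
    """True if an operation of this length can outlive a token in the worst case."""
    return operation_duration > min_remaining_validity(token_lifetime, refresh_after)


lifetime = timedelta(hours=1)
refresh = timedelta(minutes=30)

print(min_remaining_validity(lifetime, refresh))  # 0:30:00
# A hypothetical 45 minute ingestion could outlive a token issued just before refresh.
print(may_expire_mid_operation(lifetime, refresh, timedelta(minutes=45)))  # True
```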
While developing locally you need to run both the ExplorePackage.Website website and the ExplorePackages.Worker Azure Function. It may be worthwhile to use Project Tye to launch both of these projects together:
Per @khalidabuhakmeh:
https://twitter.com/buhakmeh/status/1394990191740346370
Blocked by:
We found that transient network errors (e.g., network timeout or packet drop) may happen during the execution of Azure SDK APIs, and the retry mechanism of the SDK would then lead to Conflict/Not Found errors. There are several places where we found the errors may happen: AddEntity, DeleteEntity, UpdateEntity, AddEntity, UpdateEntity, UpdateEntity, UpdateEntity.
More precisely, if the operation has actually taken effect on the server but the response does not reach the client in time because of a transient network error, the retry logic in the Azure Storage SDK sends the request again. Since the first operation already completed on the server, the second request fails with a Conflict or Not Found error: DeleteEntity will lead to 404, and AddEntity will encounter 409.
A better practice could be to wrap these APIs in a try-catch block and handle the RequestFailedException exceptions, as is done for lease acquisition. We are willing to contribute this; however, we are not sure how to do it in an elegant way. Is it possible to wrap each of them in a try-catch block?
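Rather than repeating a try-catch at every call site, one option is a single wrapper that knows which status code is harmless after a retry for each kind of operation. The sketch below only covers the two mappings named above (add gives 409, delete gives 404); RequestFailed, Timeout, run_idempotently and the dictionary-backed table are hypothetical stand-ins, not Azure SDK types.

```python
class RequestFailed(Exception):
    """Hypothetical stand-in for a failed request carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class Timeout(Exception):
    """Hypothetical stand-in for a lost response."""


# Status that is benign when it follows a lost response, per operation kind.
BENIGN_AFTER_RETRY = {"add": 409, "delete": 404}


def run_idempotently(kind, operation, max_attempts=3):
    """Run operation, treating the expected status after a retry as success."""
    retried = False
    for _ in range(max_attempts):
        try:
            return operation()
        except Timeout:
            retried = True
        except RequestFailed as ex:
            if retried and ex.status == BENIGN_AFTER_RETRY.get(kind):
                return None  # the earlier attempt already applied the change
            raise
    raise Timeout("operation did not complete")


# Usage with an in-memory table whose first add succeeds but loses its response.
table = {}
lose_response = [True]


def add_entity():
    if "row-1" in table:
        raise RequestFailed(409)
    table["row-1"] = {"value": 1}
    if lose_response[0]:
        lose_response[0] = False
        raise Timeout("response lost")


run_idempotently("add", add_entity)
print(table)
```

A conflict on the very first attempt is still re-raised, so genuine contention is not hidden. UpdateEntity is left out here because the text above does not say which status it produces on a retry.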
Things to ensure:
Meaning same resulting target framework, of course.
See dotnet/runtime#57531 (comment) for details.
The code in question is here:
This makes it less confusing to check if one of the timer-based data sets has been updated recently.
Currently running dotnet restore and dotnet build on macOS ends with the following failure.
➜ dotnet build
Microsoft (R) Build Engine version 16.10.0-preview-21181-07+073022eb4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.
Determining projects to restore...
All projects are up-to-date for restore.
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
SourceGenerator -> /Users/khalidabuhakmeh/Projects/Dotnet/Insights/artifacts/SourceGenerator/bin/Debug/netstandard2.0/NuGet.Insights.SourceGenerator.dll
Logic -> /Users/khalidabuhakmeh/Projects/Dotnet/Insights/artifacts/Logic/bin/Debug/netcoreapp3.1/NuGet.Insights.Logic.dll
/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/CatalogScan/Drivers/NuGetPackageExplorerToCsv/NuGetPackageExplorerFileRecord.cs(5,15): error CS0234: The type or namespace name 'AssemblyMetadata' does not exist in the namespace 'NuGetPe' (are you missing an assembly reference?) [/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/Worker.Logic.csproj]
/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/CatalogScan/Drivers/NuGetPackageExplorerToCsv/NuGetPackageExplorerFileRecord.cs(42,16): error CS0246: The type or namespace name 'PdbType' could not be found (are you missing a using directive or an assembly reference?) [/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/Worker.Logic.csproj]
Logic.Test -> /Users/khalidabuhakmeh/Projects/Dotnet/Insights/artifacts/Logic.Test/bin/Debug/netcoreapp3.1/NuGet.Insights.Logic.Test.dll
Build FAILED.
/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/CatalogScan/Drivers/NuGetPackageExplorerToCsv/NuGetPackageExplorerFileRecord.cs(5,15): error CS0234: The type or namespace name 'AssemblyMetadata' does not exist in the namespace 'NuGetPe' (are you missing an assembly reference?) [/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/Worker.Logic.csproj]
/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/CatalogScan/Drivers/NuGetPackageExplorerToCsv/NuGetPackageExplorerFileRecord.cs(42,16): error CS0246: The type or namespace name 'PdbType' could not be found (are you missing a using directive or an assembly reference?) [/Users/khalidabuhakmeh/Projects/Dotnet/Insights/src/Worker.Logic/Worker.Logic.csproj]
0 Warning(s)
2 Error(s)
The NuGet Insights project lets us understand the .NET ecosystem. This has been a huge win for the NuGet team.
We should also index VS Code extensions to better understand VS Code's ecosystem.
Add a table to answer questions like:
This table should contain all certificates used to author sign, repository sign, or timestamp packages. This certificate data should be joinable against JverPackageSignatures.
Consider reusing: https://github.com/NuGet/NuGet.Jobs/blob/main/src/Validation.PackageSigning.ValidateCertificate/OnlineCertificateVerifier.cs
Today we re-ingest ALL of the Kusto data every X hours. This is costly. We could consider Kusto soft delete to improve this.
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/data-soft-delete
Places that have undefined order (not an exhaustive list):
CatalogLeafItemRecord.Deprecation
CatalogLeafItemRecord.Vulnerabilities
PackageAssembly.CustomAttributes
PackageAssembly.CustomAttributesFailedDecode
Basically any property that has [KustoType("dynamic")] is suspect.
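If the concern is that these dynamic columns serialize in a different order from run to run, one way to make them stable is to canonicalize the value before writing it. The sketch below is an assumption about the intended fix, not the project's approach; canonicalize is a hypothetical helper and the sample records are made up.

```python
import json


def canonicalize(value):
    """Recursively sort object keys and array items so equal data serializes identically."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        items = [canonicalize(item) for item in value]
        # Order array items by their own canonical JSON text.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


# Two hypothetical records holding the same data in a different order.
first = [{"severity": "1", "url": "https://example.test/b"}, {"url": "https://example.test/a"}]
second = [{"url": "https://example.test/a"}, {"url": "https://example.test/b", "severity": "1"}]

print(json.dumps(canonicalize(first)) == json.dumps(canonicalize(second)))  # True
```

Sorting array items is only appropriate where the order carries no meaning; properties with a meaningful order would need only the key sorting.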
From time to time the ingestion pipeline gets blocked because a Kusto validation query fails. Example:
A Kusto validation query failed.
Validation label: full outer set comparison of NiCatalogLeafItems_Temp.Identity and NiPackageSignatures_Temp.Identity
Error: The set of values in the Identity columns in the NiCatalogLeafItems_Temp and NiPackageSignatures_Temp tables do not match.
Identity values in NiCatalogLeafItems_Temp but not NiPackageSignatures_Temp:
- Count: 1
- Sample: ["drewsubmissiontest/1.0.0"]
Identity values in NiPackageSignatures_Temp but not NiCatalogLeafItems_Temp:
- Count: 0
- Sample: []
NiCatalogLeafItems_Temp
| distinct Identity
| join kind=fullouter (
NiPackageSignatures_Temp
| distinct Identity
) on Identity
| where isempty(Identity) or isempty(Identity1)
| summarize
LeftOnlyCount = countif(isnotempty(Identity)),
LeftOnlySample = make_set_if(Identity, isnotempty(Identity), 5),
RightOnlyCount = countif(isnotempty(Identity1)),
RightOnlySample = make_set_if(Identity1, isnotempty(Identity1), 5)
I think some race condition causes this to happen sometimes.
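For readers less familiar with Kusto, the full outer set comparison performed by the validation query can be sketched in a few lines. The function name compare_identity_sets is hypothetical, the second identity in the example is made up, and the samples are sorted only to keep the output deterministic (make_set_if gives no such guarantee).

```python
def compare_identity_sets(left, right, sample_size=5):
    """Report identities present on only one side, with counts and small samples."""
    left_only = sorted(set(left) - set(right))
    right_only = sorted(set(right) - set(left))
    return {
        "LeftOnlyCount": len(left_only),
        "LeftOnlySample": left_only[:sample_size],
        "RightOnlyCount": len(right_only),
        "RightOnlySample": right_only[:sample_size],
    }


catalog_leaf_items = ["drewsubmissiontest/1.0.0", "example.package/2.0.0"]
package_signatures = ["example.package/2.0.0"]

print(compare_identity_sets(catalog_leaf_items, package_signatures))
```

Validation passes only when both counts are zero, which is why a single identity missing from one table is enough to block ingestion.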
We should have an easy way to abort the current Kusto ingestion and re-run the whole workflow from the beginning.
Currently the "package downloads" report does not have all package identities. This report should be joined with kind=leftouter.
Ideally, all tables should have the same set of distinct Identity values. We can consider using the "package versions" report to fill missing records with sensible default data. For example, if a package exists in "package versions" but not in "package downloads", we could insert a record for the missing package with downloads of 0.
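The default-filling idea amounts to a left outer join from "package versions" to "package downloads". A minimal sketch follows, with a hypothetical function name and made-up identities.

```python
def fill_missing_downloads(version_identities, downloads):
    """Left outer join: every known identity gets a download count, defaulting to 0."""
    return {identity: downloads.get(identity, 0) for identity in version_identities}


# Hypothetical data: the second identity is absent from the downloads report.
package_versions = ["example.package/1.0.0", "example.package/2.0.0"]
package_downloads = {"example.package/1.0.0": 1250}

print(fill_missing_downloads(package_versions, package_downloads))
```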
Currently the Package Compatibility table only contains target frameworks that the package explicitly supports. We should add a column for frameworks compatible with the package using nuget.org's FrameworkCompatibilityService.
See https://en.wikipedia.org/wiki/Extended_Validation_Certificate.
I don't think it's important to know if a certificate is domain validation (DV) or organization validation (OV).
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/api/netfx/controlling-tracing
Hook the trace events into App Insights.
This is to ease the SemVer sorting complexity on the data query side. If there is a simple integer that says what position a version is in the list of ordered versions in that ID, things could be easier.
| Version | VersionIndex |
|---|---|
| 2.0.0 | 0 |
| 2.0.1 | 1 |
| 10.0.0 | 2 |
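The index in the table above could be computed along these lines. This is a simplified sketch: semver_key approximates SemVer 2.0.0 precedence (build metadata ignored, prerelease sorts before release, numeric prerelease identifiers sort before alphanumeric ones) and pads to four numeric parts. It is not NuGet's own version comparer, and both function names are hypothetical.

```python
def semver_key(version):
    """Sort key approximating SemVer 2.0.0 precedence (simplified)."""
    version = version.split("+", 1)[0]  # build metadata does not affect ordering
    core, _, prerelease = version.partition("-")
    numbers = tuple(int(part) for part in core.split("."))
    numbers += (0,) * (4 - len(numbers))  # treat 1.0 and 1.0.0.0 alike
    if not prerelease:
        return (numbers, 1, ())  # a release sorts after its prereleases
    identifiers = tuple(
        (0, int(part), "") if part.isdigit() else (1, 0, part.lower())
        for part in prerelease.split(".")
    )
    return (numbers, 0, identifiers)


def version_indexes(versions):
    """Map each version of one package ID to its position in sorted order."""
    ordered = sorted(versions, key=semver_key)
    return {version: index for index, version in enumerate(ordered)}


print(version_indexes(["10.0.0", "2.0.1", "2.0.0"]))
```

With the index stored as a column, a query can order or compare versions with plain integer operations instead of parsing version strings.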
This can be used to clean up broken state or reset the data for a rebuild (e.g. a new column is added)
This data is needed by the .NET Template team.
Implementation should be similar to OwnersToCsvUpdate: use search auxiliary data to determine reserved namespaces.
Some data on NuGet.org can be updated without a corresponding catalog leaf getting added. For example:
Maybe every week or two every package should be checked for updates in these regards.
As we improve the .NET ecosystem we need to understand the adoption of new features and best practices. This will help us make informed engineering investments. This could be done using public Power BI dashboards of NuGet Insights data. For example:
Microsoft employees can play with my dashboard prototype here: https://msit.powerbi.com/groups/me/reports/0c673992-f323-44b5-be81-f8e75afbaee0/ReportSection24621557a494a04b6f43
This is similar to the ASP.NET Core team's public Power BI dashboards: https://aka.ms/aspnet/benchmarks
We can use Power BI's data refresh feature to keep these dashboards up-to-date: https://docs.microsoft.com/en-us/power-bi/connect-data/refresh-data#data-refresh
[xUnit.net 00:00:04.97] Knapcode.ExplorePackages.CatalogCommitTimestampProviderTest.ReturnsExpectedFirstTimestamps [FAIL]
Data collector 'Blame' message: All tests finished running, Sequence file will not be generated.
Failed Knapcode.ExplorePackages.CatalogCommitTimestampProviderTest.ReturnsExpectedFirstTimestamps [2 s]
Error Message:
System.InvalidProgramException : The JIT compiler encountered invalid IL code or an internal limitation.
Stack Trace:
at Azure.Data.Tables.Queryable.ExpressionWriter.ConvertExpressionToString(Expression e)
at Azure.Data.Tables.Queryable.ExpressionWriter.ExpressionToString(Expression e)
at Azure.Data.Tables.Queryable.ExpressionParser.VisitLambda(LambdaExpression lambda)
at Azure.Data.Tables.Queryable.LinqExpressionVisitor.Visit(Expression exp)
at Azure.Data.Tables.Queryable.ExpressionParser.Translate(Expression e)
at Azure.Data.Tables.TableClient.Bind(Expression expression)
at Azure.Data.Tables.TableServiceClient.QueryAsync(Expression`1 filter, Nullable`1 maxPerPage, CancellationToken cancellationToken)
at Knapcode.ExplorePackages.TableExtensions.QueryAsync(TableServiceClient client, String prefix) in C:\Users\XXX\Insights-consistency\src\Logic\Storage\TableExtensions.cs:line 16
at Knapcode.ExplorePackages.BaseLogicIntegrationTest.DisposeAsync() in C:\Users\XXX\Insights-consistency\test\Logic.Test\TestSupport\BaseLogicIntegrationTest.cs:line 289
Standard Output Messages:
[INF] GET https://api.nuget.org/v3/catalog0/index.json
[INF] OK https://api.nuget.org/v3/catalog0/index.json 350ms
[INF] GET https://api.nuget.org/v3/catalog0/page0.json
[INF] OK https://api.nuget.org/v3/catalog0/page0.json 77ms
[INF] Using the configured storage connection string.
[INF] Blob endpoint: http://127.0.0.1:10000/devstoreaccount1
[INF] Queue endpoint: http://127.0.0.1:10001/devstoreaccount1
This is one example of the error System.InvalidProgramException : The JIT compiler encountered invalid IL code or an internal limitation. I hit it when running this release version, and I have encountered many similar errors when running both the latest version and the old release version. I have not found any helpful solutions when searching for this on Google. Could you please give me some advice on how to solve this?