mavensecrets's Introduction

MavenSecrets

MavenSecrets analyzes projects hosted on the Maven Central repository. It consists of the main application, the analyzer, which downloads and extracts data from Maven Central, and separate components that process this raw data. This research was performed as part of the Research Project of the TU Delft CSE bachelor programme of 2022/2023.

Analyzer

The application that picks packages to analyze and extracts data from them; see README.

PyScripts

Scripts to process the raw data from running the analyzer, and produce meaningful results, see README.

Java Build Aspects

A standalone application that processes build-aspect data; see README.

Space Extractor

Due to dependency conflicts, the space extractor has to be run differently from the rest of the application; see README.

mavensecrets's People

Contributors

ashkboos, cryocz, nielstomassen, pri-cod, st33lphoenix, tvelican, vel1khan

mavensecrets's Issues

Revise selection

We need to take a sample of packages that have the timestamp property in the POM (and at least 1 release after 2019?). The rest we know for sure are not reproducible, because the timestamps of their build artefacts differ on each build. This means re-running everything for these new packages!

Fix indexer reader

Fix the indexer reader to accept only executable packaging types.
Change the database query accordingly.

Categorise failure types

Build successful, partially reproducible

  • Only pom and resource files reproducible, not jar (Maybe wrong JDK used?)
  • Reproducible files belong to other packages in the same repo
    org.apache.maven.wagon,wagon-http-lightweight,3.5.0

Build successful, not reproducible

  • Non-default build params?
  • Wrong JDK used

Build fail

  • Not built with Maven
  • Correct JDK Toolchain not present
  • Build command should include location of pom file (common in repos with multiple projects)
  • Other packages/dependencies in same repository failing (we need to ONLY build our package)
    org.apache.atlas,impala-bridge,2.3.0
  • (Maven Enforcer Plugin) Detected Maven Version: 3.6.3 is not in the allowed range [3.8, )
    eu.maveniverse.maven.mima,context,1.0.3
  • Some dependencies not in Maven Central
    com.adobe.dx,core,0.0.12
  • Repository empty (project split into multiple smaller repositories)
    com.alibaba.lindorm,lucene-facet,8.9.0
  • Failed to set up Docker container environment
    de.jflex,benchmark,1.9.0 failed because of a missing apt dependency but we can build it normally
  • Builds correctly but other packages in same repo that come after it fail. But the .buildcompare should STILL be there!!?
  • Failed to clone (possibly network downtime)
    org.simplify4u,pgp-keys-map,2021.07.02

Conclusions

  • Ideally we tell Maven to build only our package (+ only required dependencies in the same repo)
  • We might want to keep the build cache so that we can run diffoscope. Otherwise, we have to rebuild each project we want to inspect. Alternatively, keep the build cache only for non-/partially-reproducible packages to save space.

Detect whether a project uses Maven, exclude otherwise

For packages that have a repository URL in the database, search the contents of their repository for build.gradle and build.xml. This logic should be implemented in the Python code, after the repository is cloned and before the build starts. We can also store two boolean columns in the database for this: isGradle and isAnt.
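A minimal sketch of that check, assuming the repository has already been cloned to a local directory. The function name `detect_build_tools` is hypothetical; the dictionary keys mirror the proposed isGradle and isAnt columns.

```python
from pathlib import Path


def detect_build_tools(repo_dir):
    """Look for Gradle and Ant build files anywhere under repo_dir."""
    root = Path(repo_dir)

    def has_file(name):
        # rglob walks the whole tree; stop at the first match.
        return next(root.rglob(name), None) is not None

    return {
        "isGradle": has_file("build.gradle"),
        "isAnt": has_file("build.xml"),
    }
```

A repository can contain both a pom.xml and one of these files, so the flags are probably better treated as a signal to inspect than as proof that Maven was not used.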

Tag Correctness

Pick a random sample of packages and check the correctness of the tags found.
Make an exhaustive list of correct common tag naming conventions found in the dataset and remove all that don't match.

Builds failing due to Maven Enforcer Plugin

We could write an extractor to get the Maven version required for the build.

Examples:

  • com.io7m.calino,com.io7m.calino.vanilla,0.0.1
    Detected Maven Version: 3.6.3 is not in the allowed range [3.8.2,4.0.0).
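As a stopgap before a real extractor exists, the allowed range can be recovered from the failure message itself. The sketch below parses the Enforcer message format shown in the example above; it does not read the POM, and the function name and result keys are hypothetical.

```python
import re

# Matches the message format quoted in the example above.
ENFORCER_RE = re.compile(
    r"Detected Maven Version: (?P<detected>[\w.\-]+) "
    r"is not in the allowed range (?P<range>[\[(][^\])]*[\])])"
)


def parse_enforcer_failure(log_text):
    """Return the detected version and allowed range, or None if absent."""
    match = ENFORCER_RE.search(log_text)
    if match is None:
        return None
    spec = match.group("range")
    lower, sep, upper = spec[1:-1].partition(",")
    if not sep:
        # A range without a comma, such as [3.8], pins one exact version.
        upper = lower
    return {
        "detected": match.group("detected"),
        "range": spec,
        "min_version": lower.strip() or None,
        "min_inclusive": spec[0] == "[",
        "max_version": upper.strip() or None,
        "max_inclusive": spec[-1] == "]",
    }
```

Only a single interval is handled here; a message listing several comma-separated ranges would need more work.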

Include runtime dependencies in final jar

Running the artifact in a Docker container does not work because the required class definitions from its dependencies are missing.
Configure the Maven Shade Plugin to include dependencies in the final jar, or add them to the classpath when running the jar.

[OLD] (Package Builder)

Bare minimum

  • Insert build params and result to db
  • Change db query to resume from unprocessed (and add on conflict ignore)
  • Look for .buildcompare, if exists, build was successful
  • Read .buildcompare, save okFiles and koFiles
  • Parse and convert major java versions to format accepted in .buildspec
  • Check if any file corresponds to the jar of the package
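A rough sketch of the .buildcompare steps above. It assumes the file is a list of key=value lines in which okFiles and koFiles hold space-separated, double-quoted file names; that layout is an assumption and should be checked against real output. It also assumes the package jar is named artifactId-version.jar. Both function names are hypothetical.

```python
def parse_buildcompare(text):
    """Return (ok_files, ko_files) parsed from .buildcompare content."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values are assumed to be optionally wrapped in double quotes.
        props[key.strip()] = value.strip().strip('"')
    ok_files = props.get("okFiles", "").split()
    ko_files = props.get("koFiles", "").split()
    return ok_files, ko_files


def jar_is_reproducible(artifact_id, version, ok_files):
    """Check whether the package's own jar is among the reproducible files."""
    return f"{artifact_id}-{version}.jar" in ok_files
```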

Extra

Calculate Reproducibility score

Devise a strategy to calculate a reproducibility score.
Initial idea: For each archive, calculate the MD5 of its constituents and compare to the reference version. Save the number of reproducible files and their respective types. We might need to change the pkey of the builds table to a generated id and create a new table with an entry per file type for each build.

We can assign different weights to file types, e.g. more weight to class files than to resource files.
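One possible shape for this score, as a sketch only. It hashes every entry of the rebuilt and reference archives with MD5 and returns the weighted share of entries that match. The weight values and function names are placeholders, not agreed numbers.

```python
import hashlib
import zipfile
from pathlib import PurePosixPath

# Placeholder weights: class files count more than everything else.
WEIGHTS = {".class": 3.0}
DEFAULT_WEIGHT = 1.0


def entry_digests(archive):
    """Map each file entry in a jar/zip to the MD5 of its content."""
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: hashlib.md5(zf.read(info)).hexdigest()
            for info in zf.infolist()
            if not info.is_dir()
        }


def reproducibility_score(rebuilt, reference):
    """Weighted fraction of entries that are identical in both archives."""
    ours = entry_digests(rebuilt)
    theirs = entry_digests(reference)
    total = matched = 0.0
    for name in ours.keys() | theirs.keys():
        weight = WEIGHTS.get(PurePosixPath(name).suffix, DEFAULT_WEIGHT)
        total += weight
        # Entries missing on either side count as not reproducible.
        if name in ours and ours[name] == theirs.get(name):
            matched += weight
    # Two empty archives are treated as identical by convention.
    return matched / total if total else 1.0
```

Counting entries that exist on only one side as mismatches keeps the score from rewarding a rebuild that silently drops files.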

Change builder package fetch query

Currently, if any build for a specific package is already present in the builds db, it is not fetched anymore. Since we are now building each package multiple times, if the experiment is stopped halfway or crashes, any builds left for the package are not resumed.

How to fix:

  1. Fetch ALL packages from tags table matching our requirements
  2. Let all build parameters be determined, but right before building check whether a build with those exact parameters already exists. If yes, skip this build.
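The two steps could look roughly like this. The sketch uses an in-process SQLite database as a stand-in for the real one, and the builds table layout, column names, and function names are all assumptions made for illustration.

```python
import sqlite3


def ensure_schema(conn):
    # Hypothetical layout: one row per package and exact build parameters.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS builds (
               group_id TEXT, artifact_id TEXT, version TEXT,
               jdk TEXT, command TEXT, status TEXT,
               UNIQUE (group_id, artifact_id, version, jdk, command)
           )"""
    )


def build_exists(conn, params):
    """True if a build with exactly these parameters is already stored."""
    row = conn.execute(
        "SELECT 1 FROM builds WHERE group_id=? AND artifact_id=? "
        "AND version=? AND jdk=? AND command=?",
        params,
    ).fetchone()
    return row is not None


def run_pending(conn, candidates, build):
    """Run build(params) for every candidate that has no stored result yet."""
    for params in candidates:
        if build_exists(conn, params):
            continue  # already built with these exact parameters
        status = build(params)
        # OR IGNORE is SQLite's form of "on conflict ignore".
        conn.execute(
            "INSERT OR IGNORE INTO builds VALUES (?,?,?,?,?,?)",
            (*params, status),
        )
        conn.commit()
```

Committing after every build means a crash loses at most the build that was in progress, so a restart resumes from the first missing parameter set.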

Not all packages from RC can find their buildspec

Why? Multi-module projects are stored under the artifactId of an arbitrary submodule, due to a simplification in the RC project.
If our package isn't exactly this arbitrary artifactId, then we don't find its buildspec EVEN though another submodule in the same project might have the buildspec for the entire project.

This requires an overhaul of how the buildspec is searched for; parsing the READMEs is probably the best way to go about it.

Failure Analysis

Figure out a way to extract the reason for:

  • Build failures
  • Non-reproducibility
  • Partial reproducibility
  1. Try to condense stdout and stderr for each build to make them human-readable and see if there are common failure outputs that we can group together for further analysis. #89
  2. Find only the relevant files for each package if it is part of a larger repository. (perhaps we can find metadata containing all the filenames on Maven Central) (#link-issue)
  3. Use diffoscope on non/partially-reproducible packages to figure out which files are not the same (#link-issue)

Update VCExtractor

I realised that some developers might do strange things and put their repo URL in other fields instead of scm.url. We need to add the following fields to VCExtractor:

  • scm.developerConnection
  • scm.connection
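The fallback order can be illustrated as follows. This is a Python sketch of the idea and not the VCExtractor code itself; the function name and the order of preference (url, then connection, then developerConnection) are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed order of preference among the scm fields.
SCM_FIELDS = ("url", "connection", "developerConnection")


def _local(tag):
    # Drop the "{namespace}" prefix ElementTree adds to namespaced tags.
    return tag.rsplit("}", 1)[-1]


def extract_repo_url(pom_xml):
    """Return the first non-empty scm field from a POM, or None."""
    root = ET.fromstring(pom_xml)
    scm = next((child for child in root if _local(child.tag) == "scm"), None)
    if scm is None:
        return None
    values = {_local(child.tag): (child.text or "").strip() for child in scm}
    for field in SCM_FIELDS:
        value = values.get(field, "")
        if not value:
            continue
        if value.startswith("scm:"):
            # Connection strings look like scm:<provider>:<url>.
            parts = value.split(":", 2)
            if len(parts) == 3:
                value = parts[2]
        return value
    return None
```

Values inherited from a parent POM or containing unresolved ${...} properties are not handled by this sketch.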

Fix data selection

The SQL query is broken: it currently only eliminates different versions of the same package within the same year, not overall.

Unresolved packages

Some packages are listed in the index as having the packagingType 'module', but they actually do have a .jar executable. The code fails because it tries to open the .module file instead of the .jar.

Exclude non-Maven projects from builder

Packages where line_ending_lf and/or line_ending_crlf are null in the 'packages' table implicitly don't have pom.properties. As such, they are 99% likely not built with Maven.

  • Remove existing non-Maven builds
  • Change builder SQL fetch query

Fix Docker compose setup

The database image is started after the executable is run, resulting in database connections timing out or failing.
