
chatgpt-source-watch's Issues

Update README to align with current scripts/process

I haven't really updated the 'Helper Scripts' / 'Getting Started' sections of the README for quite a while, so they aren't fully aligned with how I'm actually doing things these days (they're more based on the older manual-ish methods, or maybe the first iteration of semi-automation).

It would be good to figure out what is completely outdated, what is still relevant but the 'old way' of doing things, and what is the 'new/current way' of doing things; and then update the README to capture that knowledge rather than it being locked up in my head/similar.


At a very high level, off the top of my head, my current process is basically:

  • Load ChatGPT and let my userscript check/notify me if there are any new script files
  • If there are new scripts, use the 'Copy ChatGPT Script data to clipboard' menu option in Tampermonkey
  • Run the following script to get a filtered list of the JSON (with dates) and a list of URLs to be downloaded:
    • pbpaste | ./scripts/filter-urls-not-in-changelog.js --json-with-urls
    • # Example
      ⇒ pbpaste | ./scripts/filter-urls-not-in-changelog.js --json-with-urls
      {
        url: 'https://cdn.oaistatic.com/_next/static/chunks/pages/_app-783c9d3d0c38be69.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      {
        url: 'https://cdn.oaistatic.com/_next/static/chunks/webpack-2e4c364289bb4774.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      {
        url: 'https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_buildManifest.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      {
        url: 'https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_ssgManifest.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      https://cdn.oaistatic.com/_next/static/chunks/pages/_app-783c9d3d0c38be69.js
      https://cdn.oaistatic.com/_next/static/chunks/webpack-2e4c364289bb4774.js
      https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_buildManifest.js
      https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_ssgManifest.js
  • Copy the output of this command and paste it into SublimeText as a scratch pad/reference
  • Copy the list of URLs, then run the following command
    • pbpaste | ./scripts/add-new-build-v2.sh 2>&1 | subl
    • This does the bulk of the automation: checking/downloading the URLs, extracting additional URLs from the _buildManifest.js and webpack.js + downloading those, unpacking/formatting the downloaded files, generating a copy/pasteable CHANGELOG entry, etc.
    • Note that as part of running this script, it will ask for the date of the build (from the above JSON) to be input at one point, before the CHANGELOG entry is generated
  • Manually copy/paste the generated CHANGELOG entry into the CHANGELOG.md file, generate and add the updated link in the Table of Contents, then modify the entry to add manual analysis notes, etc
  • Commit/push the downloaded files + updated CHANGELOG
  • Potentially write a tweet about the update linking back to the CHANGELOG update, etc
    • If we do, then we should also edit the CHANGELOG again to add a link to that Tweet/thread.
    • Sometimes I will also make a crossposted update on Reddit / LinkedIn / HackerNews / etc; if I do, I tend to also link to those posts in the Tweet thread (and maybe sometimes in the CHANGELOG as well, but I don't think I have bothered with that much lately)
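As a rough illustration of what the filtering step above boils down to (a hypothetical simplification; the real `filter-urls-not-in-changelog.js` also emits the JSON-with-dates view shown above):

```javascript
// Sketch: given candidate script URLs and the CHANGELOG contents, keep only
// URLs whose filename doesn't already appear somewhere in the CHANGELOG.
// (Hypothetical simplification of filter-urls-not-in-changelog.js.)
function filterNewUrls(urls, changelogText) {
  return urls.filter((url) => {
    const filename = url.split('/').pop();
    return !changelogText.includes(filename);
  });
}

// Example usage (hypothetical data):
const changelog = 'Added `webpack-2e4c364289bb4774.js` in the last build.';
const urls = [
  'https://cdn.oaistatic.com/_next/static/chunks/webpack-2e4c364289bb4774.js',
  'https://cdn.oaistatic.com/_next/static/chunks/pages/_app-783c9d3d0c38be69.js',
];
console.log(filterNewUrls(urls, changelog)); // keeps only the not-yet-seen _app-….js URL
```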

There might be bits in there that aren't perfectly documented, or little snippets of nuance that I've missed, but that is roughly my current process.


In manually reviewing the diffs to add my 'manual analysis' to the CHANGELOG, there is often a lot of 'diff churn' noise from the minified variable names changing between webpack builds/etc. I've been working on some new scripts that help minimise that; I haven't pushed them yet, but you can see some of my notes about them in this issue:

Currently, I sort of roughly/hackily run them with a command similar to this:

# Line count of the raw diff for a file:
diffmin-wc-raw () { git diff --diff-algorithm=patience $1 | wc -l; };

# Line count after passing the diff through the minimiser:
diffmin-wc () { git diff --diff-algorithm=patience $1 | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l; };

# View the minimised diff with delta:
diffmin () { git diff --diff-algorithm=patience $1 | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | delta; };

# diffmin-wc-raw unpacked/_next/static/chunks/pages/_app.js
# diffmin-wc unpacked/_next/static/chunks/pages/_app.js
diffmin unpacked/_next/static/chunks/pages/_app.js
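The core idea of the (not-yet-pushed) minimiser can be sketched as: split the unified diff into hunks, normalise away short minified identifiers on the removed/added lines, and drop any hunk where the two sides then match. This regex version is a crude stand-in for the real AST-based approach, and the function names here are hypothetical:

```javascript
// Sketch: drop diff hunks whose only change is minified identifier renames.
// The regex normalisation is a crude approximation of an AST-based rename check;
// it will occasionally over-merge (eg. short keywords like 'in' also match).
function normalise(line) {
  // Strip the +/- marker, then replace 1-2 char identifiers (typical
  // minified names) with a placeholder.
  return line.slice(1).replace(/\b[a-zA-Z_]{1,2}\b/g, '_');
}

function minimiseDiff(diffText) {
  const hunks = diffText.split(/(?=^@@ )/m);
  return hunks
    .filter((hunk) => {
      if (!hunk.startsWith('@@')) return true; // keep any preamble/headers
      const lines = hunk.split('\n');
      const removed = lines.filter((l) => l.startsWith('-')).map(normalise);
      const added = lines.filter((l) => l.startsWith('+')).map(normalise);
      // Keep the hunk only if something beyond identifier renames changed.
      return removed.join('\n') !== added.join('\n');
    })
    .join('');
}
```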

See Also

Script to identify language/translation files + list them + diff them

Currently it's a manual process to identify which of the chunk files are language/translation files, list them in the CHANGELOG, identify the English translation file, extract/parse the JSON within it, and then do a sorted/JSON diff to determine what changed (while also minimising the noise of renamed keys/etc.)

It would be good to write a script to automate this process.
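One possible heuristic for the identification step, sketched below (hypothetical; the threshold and edge cases would need tuning against real chunks): a chunk is likely a translation file when string literals dominate its source text.

```javascript
// Sketch: flag a chunk as a likely language/translation file when string
// literals make up most of its source text. Heuristic only; threshold is a guess.
function looksLikeTranslationFile(source, threshold = 0.6) {
  const stringRe = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g;
  const stringChars = (source.match(stringRe) || []).join('').length;
  return stringChars / source.length >= threshold;
}
```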

This is an entry that had language file changes:

Simple diff analysis based on strings and identifiers

This is an idea for a type of analysis of code diffs. This issue is just for tracking notes and ideas

Example input 1

+            guidance: function (e) {
+              var t = e.isSearchable,
+                n = e.isMulti,
+                r = e.isDisabled,
+                i = e.tabSelectsValue;
+              switch (e.context) {
+                case "menu":
+                  return "Use Up and Down to choose options"
+                    .concat(
+                      r
+                        ? ""
+                        : ", press Enter to select the currently focused option",
+                      ", press Escape to exit the menu",
+                    )
+                    .concat(
+                      i
+                        ? ", press Tab to select the option and exit the menu"
+                        : "",
+                      ".",
+                    );

Here the extraction might be

guidance
e
t
isSearchable
n
isMulti
isDisabled
i
tabSelectsValue
context
"menu"
"Use Up and Down to choose options"
concat
r
""
", press Enter to select the currently focused option"
", press Escape to exit the menu"
", press Tab to select the option and exit the menu"
"."

Of course, this gives you far less information than the original, but I think it could be a good trade-off in cases where you want to look at the diff a little bit but don't have time to see everything.
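A dependency-free sketch of the extraction itself (a real implementation would more likely use a proper JS parser, as discussed further below; this regex tokeniser is only a rough approximation and will mis-handle things like regex literals and template strings):

```javascript
// Sketch: pull out identifiers and string literals from a JS source chunk,
// de-duplicated in order of first appearance. Regex-based approximation only.
function extractNames(source) {
  const seen = new Set();
  const out = [];
  const tokenRe = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[A-Za-z_$][A-Za-z0-9_$]*/g;
  for (const token of source.match(tokenRe) || []) {
    if (!seen.has(token)) {
      seen.add(token);
      out.push(token);
    }
  }
  return out;
}
```

A later step would diff this output between builds, showing only names/strings not seen previously.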

Example input 2

Since common generic ones like e, t, n, "", and "." would show up frequently in any context, they would already have been seen in the past and therefore be filtered out. You'd focus more on strings like "Use Up and Down to choose options", with some kind of convenient way to jump back and see them in-context in the code.

For input like the following:

+      var a = n(72843);
+      function s(e, t) {
+        for (var n = 0; n < t.length; n++) {
+          var r = t[n];
+          (r.enumerable = r.enumerable || !1),
+            (r.configurable = !0),
+            "value" in r && (r.writable = !0),
+            Object.defineProperty(e, (0, a.Z)(r.key), r);
+        }
+      }
+      function l(e, t, n) {
+        return (
+          t && s(e.prototype, t),
+          n && s(e, n),
+          Object.defineProperty(e, "prototype", { writable: !1 }),
+          e
+        );
+      }

For input like that, none of the names or strings would likely be new, and so you wouldn't see it at all. This is intended: I can't glean any conclusions from looking at it, and would prefer not to see it.

Glenn's comments

https://twitter.com/_devalias/status/1770284997385277554

I think given the size of a lot of the JS files, and the diffs themselves, it would probably end up being a LOT of strings; which might be confusing when removed from the context of the surrounding code.

For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on its own, which could lead to something good in-context.

It should be fairly easy to prototype a script using the Babel parser and Babel traverse though.
You would add a rule or two to the traverse so that it matches whatever strings appear in the AST, and then output them to the console or a file or similar.

Haven't worked with Babel, but some relevant docs seem to be

Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required

Then you would just diff that output file of strings between one build and the next.
If code moves around between builds it might introduce its own form of noise (but maybe git diff --color-moved would handle that anyway)

I haven't seen enough diffs to anticipate exactly how these would look, but there might be different solutions like --color-moved that could work depending on how it goes

I also noticed you liked some of my tweets about my more generalised diff minimiser, which would reduce the noise of things a fair bit overall as well.
I still need to polish that and commit/upload it; I've been super busy lately and haven't had a chance yet.

Related:

Feel free to open an issue on the ChatGPT Source watch repo about the string extractor idea + link back to these tweets/copy the relevant info in.
I’d be happy to give some more pointers about it and/or include it in the repo if you wanted to work on it.

Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient approach might be to integrate with a text editor, which makes it harder to replicate

Explore AST based diff tools

There can be a lot of 'noise' when diffing minified bundled code, as the bundler will often change the minified variable names it uses between builds (even if the rest of the code hasn't changed)

We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:

⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
  116000

⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35826

⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35835

⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35844

Musings

⭐ Suggestion

It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)

Ideally the output would still be text (not the AST tree), but the actual diffing could be done at the AST level.

💻 Use Cases

This would be really useful for minimising the noise when diffing minified source builds, looking for the 'real changes' between builds (not just minified variable name churn, etc.)

Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.

Some alternative potential workarounds I've considered are pre-processing the files to standardize their variable/function names, and/or post-processing the diff output to try to detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js

Originally posted by @0xdevalias in ast-grep/ast-grep#901

See Also

Automate checking for new builds with GitHub action/similar

Currently the process of checking for new builds is somewhat of a 'manual assisted' process: browsing to the ChatGPT site, letting the chatgpt-web-app-script-update-notifier user script check if any of the script files have changed, then potentially reacting to that notification with more manual steps.

You can see the full manual steps outlined on this issue:

But the core initial steps are summarised below:

At a very high level, off the top of my head, my current process is basically:

Originally posted by @0xdevalias in #7 (comment)

Because the notifier currently only runs when the ChatGPT app is accessed, it is easy both to miss updates (eg. if updates happen but the ChatGPT app isn't accessed), and to get distracted from the task that ChatGPT was originally opened for by the fact that there is a new update (leading to a tantalising procrastination/avoidance 'treat' when the task at hand brings less dopamine)

The proposed solution would be to use GitHub Actions or similar to schedule an 'update check' at a regular interval (eg. once per hour). The following are some notes I made in an initial prompt to ChatGPT for exploring/implementing this:

Can you plan out and create a github action that will:

- run on a schedule (eg. every 1hr)
- check the HTML on a specified webpage and extract some .js script URLs related to a bundled webpack/next.js app
- check (against a cache or similar? not sure of the best way to implement this on github actions) if those URLs have been previously recorded
- if they are new URLs, notify the user and/or kick off further processing (this will probably involve executing one or more scripts that will then download/process the URLs)

That describes the most basic features this should be able to handle (off the top of my head), but the ideal plan is that the solution will be expandable to be able to handle and automate more of the process in future. Some ideas for future features would be:

  • being able to open a Pull Request for each new build, that contains the downloaded files, and the results of various scripts being run on them. This PR would also serve as an interface to prompt the user with any manual actions that are required of them, and some 'bot commands'/workflow for finalising the updates to the CHANGELOG/etc (eg. rebase the PR)
  • etc
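As a very rough sketch of what the scheduled workflow could look like (the page URL, file names, and extraction regex below are all hypothetical placeholders, not a tested implementation):

```yaml
name: check-for-new-builds
on:
  schedule:
    - cron: '0 * * * *' # hourly
  workflow_dispatch: {}

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch page and extract script URLs
        run: |
          curl -s https://chatgpt.com/ \
            | grep -oE 'https://cdn\.oaistatic\.com/[^"]+\.js' \
            | sort -u > current-urls.txt
      - name: Compare against previously seen URLs
        run: |
          # A seen-urls.txt committed to the repo acts as the 'cache'
          comm -13 <(sort seen-urls.txt) current-urls.txt > new-urls.txt
          if [ -s new-urls.txt ]; then
            echo "New build detected:" && cat new-urls.txt
            # TODO: notify and/or kick off the download/processing scripts here
          fi
```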

See Also

Explore running CodeQL queries against the extracted/unpacked webpack source

From a chat with a friend:

Dunno how well it will work in reality.. but apparently I can run codeql against a random site's webpacked frontend code that I downloaded locally (in this case chatgpt)

codeql database create ~/Desktop/chatgpt-codeql-test-db --language=javascript --source-root ./unpacked

And I could use Chrome Devtools Protocol (CDP) to watch a site for when scripts are parsed, and then to access the source of those parsed scripts (which I could then automagically save locally/similar, and then run codeql on)

codeql database analyze ~/Desktop/chatgpt-codeql-test-db --format=csv --output=./chatgpt-codeql-output.csv --download codeql/javascript-queries


Huh.. it actually worked and output a bunch of warnings. They could be false positives/irrelevant/etc., and I'd need to manually look closer to understand more about them and whether they are actually interesting.. but the fact that it worked at all on webpacked code (that had only been run through prettier to format it) is pretty neat

"Improper code sanitization","Escaping code as HTML does not provide protection against code injection.","error","Code construction depends on an [[""improperly sanitized value""|""relative:///_next/static/chunks/pages/_app.js:28576:35:28576:52""]].","/_next/static/chunks/pages/_app.js","28576","21","28576","60"
"Improper code sanitization","Escaping code as HTML does not provide protection against code injection.","error","Code construction depends on an [[""improperly sanitized value""|""relative:///_next/static/chunks/pages/_app.js:28581:35:28581:52""]].","/_next/static/chunks/pages/_app.js","28581","21","28581","60"
"Incomplete URL substring sanitization","Security checks on the substrings of an unparsed URL are often vulnerable to bypassing.","warning","'[[""slack.com""|""relative:///_next/static/chunks/496.js:8801:33:8801:43""]]' can be anywhere in the URL, and arbitrary hosts may come before or after it.","/_next/static/chunks/496.js","8801","11","8801","44"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [&'()*+,\-.\/0-9:;].","/_next/static/chunks/653.js","42385","18","42385","20"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [?@A-Z].","/_next/static/chunks/653.js","42385","22","42385","24"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].","/_next/static/chunks/653.js","48571","30","48571","32"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].","/_next/static/chunks/653.js","52124","34","52124","36"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""*"".","/_next/static/chunks/1f110208.js","7333","17","7333","33"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""\\"".","/_next/static/chunks/1f110208.js","8042","33","8042","51"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""\\"".","/_next/static/chunks/1f110208.js","8048","33","8048","52"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This does not escape backslash characters in the input.","/_next/static/chunks/653.js","55568","32","55568","40"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of /%3A/i.","/_next/static/chunks/main.js","5109","18","5109","46"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""#"".","/_next/static/chunks/main.js","5130","18","5130","26"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of /[\]]/.","/_next/static/chunks/pages/_app.js","24434","20","24434","50"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of /[[]/.","/_next/static/chunks/pages/_app.js","24434","20","24434","28"
"Prototype-polluting function","Functions recursively assigning properties on objects may be the cause of accidental modification of a built-in prototype object.","warning","The property chain [[""here""|""relative:///_next/static/chunks/pages/_app.js:38412:19:38412:22""]] is recursively assigned to [[""Y""|""relative:///_next/static/chunks/pages/_app.js:38414:46:38414:46""]] without guarding against prototype pollution.","/_next/static/chunks/pages/_app.js","38414","46","38414","46"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4811","29","4811","38"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4812","31","4812","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4819","29","4819","38"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4820","31","4820","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4828","31","4828","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4829","33","4829","42"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4837","29","4837","38"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4838","31","4838","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4850","31","4850","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4851","33","4851","42"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","5079","25","5079","34"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","5080","25","5080","34"

Explore creating a 'reverse engineered' records.json / stats.json file from a webpack build

This is an idea I've had in passing a few times, but keep forgetting to document it:

  • https://medium.com/@songawee/long-term-caching-using-webpack-records-9ed9737d96f2
    • there are many factors that go into getting consistent filenames. Using Webpack records helps generate longer lasting filenames (cacheable for a longer period of time) by reusing metadata, including module/chunk information, between successive builds. This means that as each build runs, modules won’t be re-ordered and moved to another chunk as often which leads to less cache busting.

    • The first step is achieved by a Webpack configuration setting: recordsPath: path.resolve(__dirname, './records.json')
      This configuration setting instructs Webpack to write out a file containing build metadata to a specified location after a build is completed.

    • It keeps track of a variety of metadata including module and chunk ids which are useful to ensure modules do not move between chunks on successive builds when the content has not changed.

    • With the configuration in place, we can now enjoy consistent file hashes across builds!

    • In the following example, we are adding a dependency (superagent) to the vendor-two chunk.

      We can see that all of the chunks change. This is due to the module ids changing. This is not ideal as it forces users to re-download content that has not changed.

      The following example adds the same dependency, but uses Webpack records to keep module ids consistent across the builds. We can see that only the vendor-two chunk and the runtime changes. The runtime is expected to change because it has a map of all the chunk ids. Changing only these two files is ideal.

  • https://webpack.js.org/configuration/other-options/#recordspath
    • recordsPath: Use this option to generate a JSON file containing webpack "records" – pieces of data used to store module identifiers across multiple builds. You can use this file to track how modules change between builds.

  • https://github.com/search?q=path%3A%22webpack.records.json%22&type=code
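As a very rough first step toward reconstructing the module-to-chunk half of such a records file ourselves, something like the following heuristic could scan the unpacked chunks (a hypothetical sketch; real chunks would need proper parsing rather than a regex over the module table):

```javascript
// Sketch: map webpack module ids to the chunk file that contains them, by
// matching the `123: function (...)` / `123: (...) =>` module table entries.
// Heuristic only; the function name and input shape are hypothetical.
function moduleIdsByChunk(chunks) {
  const mapping = {};
  for (const [chunkFile, source] of Object.entries(chunks)) {
    const ids = [...source.matchAll(/(\d+):\s*(?:function|\()/g)].map((m) => m[1]);
    for (const id of ids) mapping[id] = chunkFile;
  }
  return mapping;
}
```

Diffing two such mappings between builds would show modules that moved chunks, similar to what webpack records are meant to prevent.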

I'm not 100% sure if this would be useful, or partially useful, but I think I am thinking of it tangentially in relation to things like:

Script to calculate the raw/minimised lines of diff change for each file

Being able to see at a glance what the raw and minimised diff lines are for each chunk file can help in determining how much effort reviewing a build will take. If we can copy/paste this as markdown (or insert it directly into the default CHANGELOG entry), then we can also give some useful stats for a build even if we do no deeper manual analysis.

eg.

- TODO: The following files haven't been deeply reviewed:
  - `unpacked/_next/static/chunks/101.js` (`931` lines)
  - `unpacked/_next/static/chunks/2637.js` (`5290` lines)
  - `unpacked/_next/static/chunks/3032.js` (`150,855` lines)
  - `unpacked/_next/static/chunks/30750f44.js` (diff: `45,368` lines, minimised diff: `17,151` lines)
  - `unpacked/_next/static/chunks/3453.js` (`403` lines)
    - Seem to be a bunch of images, likely related to image generation styling or similar.
  - `unpacked/_next/static/chunks/3472.js` (`320` lines)
    - Statsig, Feature Gates, Experimental Gates, etc
  - `unpacked/_next/static/chunks/3842.js` (`755` lines)
  - `unpacked/_next/static/chunks/3a34cc27.js` (diff: `4373` lines, minimised diff: `1633` lines)
  - `unpacked/_next/static/chunks/4114.js` (diff: `1411` lines, minimised diff: `1373` lines)
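Once the per-file counts are computed (eg. via `git diff ... | wc -l` and the minimised variant), the markdown itself is trivial to generate; a hypothetical formatter:

```javascript
// Sketch: format per-file diff stats as a markdown TODO list for the CHANGELOG.
// The function name and input shape are hypothetical; the real work is running
// git diff / the diff minimiser per file to get the numbers.
function formatDiffStats(stats) {
  const lines = ["- TODO: The following files haven't been deeply reviewed:"];
  for (const [file, { diff, minDiff }] of Object.entries(stats)) {
    const detail =
      minDiff != null
        ? `diff: \`${diff.toLocaleString('en-US')}\` lines, minimised diff: \`${minDiff.toLocaleString('en-US')}\` lines`
        : `\`${diff.toLocaleString('en-US')}\` lines`;
    lines.push(`  - \`${file}\` (${detail})`);
  }
  return lines.join('\n');
}
```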

Fix script for extracting CSS URLs from `webpack.js` + unpacking `*.css` files

In the past there was only a single *.css URL extracted from webpack.js (from the miniCssF field), so it was unpacked as miniCssF.css (the *.css file hashes change every time they are re-built, and they don't seem to have a static chunk part in their filenames when downloaded)

More recently, there have been new *.css files specific to certain chunks (sometimes shared among multiple chunks), and so the scripts for extracting these are broken and produce an entry like this:

https://cdn.oaistatic.com/_next/undefined

We also need to think about how best to name the files. I think main.css would probably work for the 'main' chunk (previously what we called miniCssF). For the *.css files related to the other chunks: if they only applied to a single chunk I would probably name them based on that chunk, but sometimes they are used by multiple chunks. If doing it manually we could probably figure out what they are used for and name them based on that, but I'm not sure of the best way to do this automatically. We can't use the hash of the *.css file, as that changes every time the file changes.
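One possible direction for the extraction fix, sketched below: pull the per-chunk hash map out of the miniCssF function body in webpack.js and build URLs from it (hypothetical and heuristic; this assumes the runtime embeds a `{chunkId: "hash"}` style map just before the `".css"` suffix, which would need verifying against the current webpack.js):

```javascript
// Sketch: recover per-chunk CSS URLs from the miniCssF hash map in webpack.js.
// Assumes a `{chunkId: "hexhash"}` map adjacent to the ".css" suffix; the
// function name and URL layout here are assumptions, not robust parsing.
function extractCssUrls(webpackJsSource, cdnBase) {
  const urls = {};
  const mapMatch = webpackJsSource.match(/\{([^{}]*)\}\s*\[\w+\]\s*\+\s*"\.css"/);
  if (!mapMatch) return urls;
  for (const m of mapMatch[1].matchAll(/(\d+):\s*"([0-9a-f]+)"/g)) {
    const [, chunkId, hash] = m;
    urls[chunkId] = `${cdnBase}/_next/static/css/${hash}.css`;
  }
  return urls;
}
```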

Explore stack graphs / scope graphs

Stack Graphs (an evolution of Scope Graphs) sound like they could be really interesting/useful with regards to code navigation, symbol mapping, etc. Perhaps we could use them for module identification, or variable/function identifier naming stabilisation or similar?

  • https://github.blog/changelog/2024-03-14-precise-code-navigation-for-typescript-projects/
    • Precise code navigation is now available for all TypeScript repositories.
      Precise code navigation gives more accurate results by only considering the set of classes, functions, and imported definitions that are visible at a given point in your code.

      Precise code navigation is powered by the stack graphs framework.
      You can read about how we use stack graphs for code navigation and visit the stack graphs definition for TypeScript to learn more.

      • https://github.blog/2021-12-09-introducing-stack-graphs/
        • Introducing stack graphs

        • Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job.

        • LOTS of interesting stuff in this post..
        • As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere.

        • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

          To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas.

  • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github
    • GitHub has developed two code navigation approaches based on the open source tree-sitter and stack-graphs library:

      • Search-based - searches all definitions and references across a repository to find entities with a given name
      • Precise - resolves definitions and references based on the set of classes, functions, and imported definitions at a given point in your code

      To learn more about these approaches, see "Precise and search-based navigation."

      • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github#precise-and-search-based-navigation
        • Precise and search-based navigation
          Certain languages supported by GitHub have access to precise code navigation, which uses an algorithm (based on the open source stack-graphs library) that resolves definitions and references based on the set of classes, functions, and imported definitions that are visible at any given point in your code. Other languages use search-based code navigation, which searches all definitions and references across a repository to find entities with a given name. Both strategies are effective at finding results and both make sure to avoid inappropriate results such as comments, but precise code navigation can give more accurate results, especially when a repository contains multiple methods or functions with the same name.

  • https://pl.ewi.tudelft.nl/research/projects/scope-graphs/
    • Scope Graphs | A Theory of Name Resolution

    • Scope graphs provide a new approach to defining the name binding rules of programming languages. A scope graph represents the name binding facts of a program using the basic concepts of declarations and reference associated with scopes that are connected by edges. Name resolution is defined by searching for paths from references to declarations in a scope graph. Scope graph diagrams provide an illuminating visual notation for explaining the bindings in programs.
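To make the core idea above concrete, here is a toy sketch in plain JavaScript (not the actual stack-graphs crate/API; all names are made up) of resolving a reference to a declaration by searching a path through a graph of scopes:

```javascript
// Toy scope-graph sketch: declarations live in scopes connected by edges,
// and name resolution is a path search from a reference's scope to a scope
// containing a matching declaration.
class Scope {
  constructor(parent = null) {
    this.parent = parent;   // lexical "parent scope" edge
    this.decls = new Map(); // name -> declaration site
  }
  declare(name, site) { this.decls.set(name, site); }
  // Resolve a reference: follow parent edges until some scope declares `name`
  resolve(name) {
    for (let s = this; s !== null; s = s.parent) {
      if (s.decls.has(name)) return s.decls.get(name);
    }
    return null; // unresolved reference
  }
}
```

A real stack graph generalises this single lexical-parent chain to arbitrary edges (imports, member access, etc.), using symbol push/pop nodes to keep the path search precise, as the GitHub blog post above describes.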

Potentially Related

  • https://en.wikipedia.org/wiki/Code_property_graph
    • A code property graph of a program is a graph representation of the program obtained by merging its abstract syntax trees (AST), control-flow graphs (CFG) and program dependence graphs (PDG) at statement and predicate nodes. The resulting graph is a property graph, which is the underlying graph model of graph databases such as Neo4j, JanusGraph and OrientDB where data is stored in the nodes and edges as key-value pairs. In effect, code property graphs can be stored in graph databases and queried using graph query languages.

    • Joern CPG. The original code property graph was implemented for C/C++ in 2013 at University of Göttingen as part of the open-source code analysis tool Joern. This original version has been discontinued and superseded by the open-source Joern Project, which provides a formal code property graph specification applicable to multiple programming languages. The project provides code property graph generators for C/C++, Java, Java bytecode, Kotlin, Python, JavaScript, TypeScript, LLVM bitcode, and x86 binaries (via the Ghidra disassembler).

      • https://github.com/joernio/joern
        • Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs.

        • Joern is a platform for analyzing source code, bytecode, and binary executables. It generates code property graphs (CPGs), a graph representation of code for cross-language code analysis. Code property graphs are stored in a custom graph database. This allows code to be mined using search queries formulated in a Scala-based domain-specific query language. Joern is developed with the goal of providing a useful tool for vulnerability discovery and research in static program analysis.

        • https://joern.io/
        • https://cpg.joern.io/
          • Code Property Graph Specification 1.1

          • This is the specification of the Code Property Graph, a language-agnostic intermediate graph representation of code designed for code querying.

            The code property graph is a directed, edge-labeled, attributed multigraph. This specification provides the graph schema, that is, the types of nodes and edges and their properties, as well as constraints that specify which source and destination nodes are permitted for each edge type.

            The graph schema is structured into multiple layers, each of which provide node, property, and edge type definitions. A layer may depend on multiple other layers and make use of the types it provides.

  • https://docs.openrewrite.org/concepts-explanations/lossless-semantic-trees
    • A Lossless Semantic Tree (LST) is a tree representation of code. Unlike the traditional Abstract Syntax Tree (AST), OpenRewrite's LST offers a unique set of characteristics that make it possible to perform accurate transformations and searches across a repository.
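The code property graph idea above can be sketched as a single node set with the AST, CFG and PDG overlaid as differently-labelled edges (a hypothetical mini-schema for illustration only; Joern's real schema is the spec at cpg.joern.io):

```javascript
// Minimal property-graph sketch of the CPG idea: nodes carry key-value
// properties, and AST / CFG / data-dependence relations are edge labels.
const cpg = { nodes: new Map(), edges: [] };
const addNode = (id, props) => cpg.nodes.set(id, props);
const addEdge = (from, to, label) => cpg.edges.push({ from, to, label });

// A tiny fragment: a strcpy call whose argument is data-dependent on a read
addNode('read1', { type: 'CALL', name: 'gets' });
addNode('call1', { type: 'CALL', name: 'strcpy' });
addNode('arg1',  { type: 'IDENTIFIER', name: 'buf' });
addEdge('call1', 'arg1', 'AST');     // syntactic structure
addEdge('read1', 'call1', 'CFG');    // control flow
addEdge('read1', 'arg1', 'REACHES'); // data dependence (PDG)

// Query in the Joern spirit: strcpy calls with a data-dependent argument
function riskyCalls() {
  return [...cpg.nodes.entries()]
    .filter(([, p]) => p.type === 'CALL' && p.name === 'strcpy')
    .filter(([id]) => cpg.edges.some(a => a.label === 'AST' && a.from === id &&
      cpg.edges.some(d => d.label === 'REACHES' && d.to === a.to)))
    .map(([id]) => id);
}
```

The point of merging the three representations is exactly this kind of query: syntax, control flow, and data flow can be matched in a single graph traversal rather than in three separate tools.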
