
chatgpt-source-watch's Issues

Update README to align with current scripts/process

I haven't really updated the 'Helper Scripts' / 'Getting Started' sections of the README for quite a while, so they aren't fully aligned with how I'm actually doing things these days (they're more based on the older manual-ish methods, or maybe the first iteration of semi-automation).

It would be good to figure out what is completely outdated, what is still relevant but the 'old way' of doing things, and what is the 'new/current way' of doing things; and then update the README to capture that knowledge rather than it being locked up in my head/similar.


At a very high level, off the top of my head, my current process is basically:

  • Load ChatGPT and let my userscript check/notify me if there are any new script files
  • If there are new scripts, use the 'Copy ChatGPT Script data to clipboard' menu option in Tampermonkey
  • Run the following script to get a filtered list of the JSON (with dates) and a list of URLs to be downloaded:
    • pbpaste | ./scripts/filter-urls-not-in-changelog.js --json-with-urls
    • # Example
      ⇒ pbpaste | ./scripts/filter-urls-not-in-changelog.js --json-with-urls
      {
        url: 'https://cdn.oaistatic.com/_next/static/chunks/pages/_app-783c9d3d0c38be69.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      {
        url: 'https://cdn.oaistatic.com/_next/static/chunks/webpack-2e4c364289bb4774.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      {
        url: 'https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_buildManifest.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      {
        url: 'https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_ssgManifest.js',
        date: '2024-02-24T02:18:13.376Z'
      }
      https://cdn.oaistatic.com/_next/static/chunks/pages/_app-783c9d3d0c38be69.js
      https://cdn.oaistatic.com/_next/static/chunks/webpack-2e4c364289bb4774.js
      https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_buildManifest.js
      https://cdn.oaistatic.com/_next/static/WRJHgIqMF1lNwSuszzsvl/_ssgManifest.js
  • Copy the output of this command and paste it into SublimeText as a scratch pad/reference
  • Copy the list of URLs, then run the following command
    • pbpaste | ./scripts/add-new-build-v2.sh 2>&1 | subl
    • This does the bulk of the automation: checking/downloading the URLs, extracting additional URLs from the _buildManifest.js and webpack.js + downloading those, unpacking/formatting the downloaded files, generating a copy/pasteable CHANGELOG entry, etc.
    • Note that as part of running this script, it will ask for the date of the build (from the above JSON) to be input at one point, before the CHANGELOG entry is generated
  • Manually copy/paste the generated CHANGELOG entry into the CHANGELOG.md file, generate and add the updated link in the Table of Contents, then modify the entry to add manual analysis notes, etc
  • Commit/push the downloaded files + updated CHANGELOG
  • Potentially write a tweet about the update linking back to the CHANGELOG update, etc
    • If we do, then we should also edit the CHANGELOG again to add a link to that Tweet/thread.
    • Sometimes I will also make a crossposted update on Reddit / LinkedIn / HackerNews / etc; if I do, I tend to also link to those posts in the Tweet thread (and maybe sometimes in the CHANGELOG as well, but I don't think I have bothered with that much lately)
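As a rough illustration of what the filtering step above boils down to (a hypothetical simplification; the real `filter-urls-not-in-changelog.js` also emits the JSON-with-dates view shown above):

```javascript
// Sketch: given candidate script URLs and the CHANGELOG contents, keep only
// URLs whose filename doesn't already appear somewhere in the CHANGELOG.
// (Hypothetical simplification of filter-urls-not-in-changelog.js.)
function filterNewUrls(urls, changelogText) {
  return urls.filter((url) => {
    const filename = url.split('/').pop();
    return !changelogText.includes(filename);
  });
}

// Example usage (hypothetical data):
const changelog = 'Added `webpack-2e4c364289bb4774.js` in the last build.';
const urls = [
  'https://cdn.oaistatic.com/_next/static/chunks/webpack-2e4c364289bb4774.js',
  'https://cdn.oaistatic.com/_next/static/chunks/pages/_app-783c9d3d0c38be69.js',
];
console.log(filterNewUrls(urls, changelog)); // keeps only the not-yet-seen _app-….js URL
```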

There might be bits in there that aren't perfectly documented, or little snippets of nuance that I've missed, but that is roughly my current process.


In manually reviewing the diffs to add my 'manual analysis' to the CHANGELOG, there is often a lot of 'diff churn' noise from the minified variable names changing between webpack builds/etc. I've been working on some new scripts that help minimise that; I haven't pushed them yet, but you can see some of my notes about them in this issue:

Currently, I sort of roughly/hackily run them with a command similar to this:

# Line count of the raw diff for a file:
diffmin-wc-raw () { git diff --diff-algorithm=patience $1 | wc -l; };

# Line count after passing the diff through the minimiser:
diffmin-wc () { git diff --diff-algorithm=patience $1 | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l; };

# View the minimised diff with delta:
diffmin () { git diff --diff-algorithm=patience $1 | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | delta; };

# diffmin-wc-raw unpacked/_next/static/chunks/pages/_app.js
# diffmin-wc unpacked/_next/static/chunks/pages/_app.js
diffmin unpacked/_next/static/chunks/pages/_app.js
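The core idea of the (not-yet-pushed) minimiser can be sketched as: split the unified diff into hunks, normalise away short minified identifiers on the removed/added lines, and drop any hunk where the two sides then match. This regex version is a crude stand-in for the real AST-based approach, and the function names here are hypothetical:

```javascript
// Sketch: drop diff hunks whose only change is minified identifier renames.
// The regex normalisation is a crude approximation of an AST-based rename check;
// it will occasionally over-merge (eg. short keywords like 'in' also match).
function normalise(line) {
  // Strip the +/- marker, then replace 1-2 char identifiers (typical
  // minified names) with a placeholder.
  return line.slice(1).replace(/\b[a-zA-Z_]{1,2}\b/g, '_');
}

function minimiseDiff(diffText) {
  const hunks = diffText.split(/(?=^@@ )/m);
  return hunks
    .filter((hunk) => {
      if (!hunk.startsWith('@@')) return true; // keep any preamble/headers
      const lines = hunk.split('\n');
      const removed = lines.filter((l) => l.startsWith('-')).map(normalise);
      const added = lines.filter((l) => l.startsWith('+')).map(normalise);
      // Keep the hunk only if something beyond identifier renames changed.
      return removed.join('\n') !== added.join('\n');
    })
    .join('');
}
```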

See Also

Script to identify language/translation files + list them + diff them

Currently it's a manual process to identify which of the chunk files are language/translation files, list them in the CHANGELOG, identify the English translation file, extract/parse the JSON within it, and then do a sorted/JSON diff to determine what changed (while also minimising the noise of renamed keys/etc.)

It would be good to write a script to automate this process.
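One possible heuristic for the identification step, sketched below (hypothetical; the threshold and edge cases would need tuning against real chunks): a chunk is likely a translation file when string literals dominate its source text.

```javascript
// Sketch: flag a chunk as a likely language/translation file when string
// literals make up most of its source text. Heuristic only; threshold is a guess.
function looksLikeTranslationFile(source, threshold = 0.6) {
  const stringRe = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g;
  const stringChars = (source.match(stringRe) || []).join('').length;
  return stringChars / source.length >= threshold;
}
```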

This is an entry that had language file changes:

Simple diff analysis based on strings and identifiers

This is an idea for a type of analysis of code diffs. This issue is just for tracking notes and ideas

Example input 1

+            guidance: function (e) {
+              var t = e.isSearchable,
+                n = e.isMulti,
+                r = e.isDisabled,
+                i = e.tabSelectsValue;
+              switch (e.context) {
+                case "menu":
+                  return "Use Up and Down to choose options"
+                    .concat(
+                      r
+                        ? ""
+                        : ", press Enter to select the currently focused option",
+                      ", press Escape to exit the menu",
+                    )
+                    .concat(
+                      i
+                        ? ", press Tab to select the option and exit the menu"
+                        : "",
+                      ".",
+                    );

Here the extraction might be

guidance
e
t
isSearchable
n
isMulti
isDisabled
i
tabSelectsValue
context
"menu"
"Use Up and Down to choose options"
concat
r
""
", press Enter to select the currently focused option"
", press Escape to exit the menu"
", press Tab to select the option and exit the menu"
"."

Of course, this gives you far less information than the original, but I think it could be a good trade-off in cases where you want to look at the diff a little bit but don't have time to see everything.
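A dependency-free sketch of the extraction itself (a real implementation would more likely use a proper JS parser, as discussed further below; this regex tokeniser is only a rough approximation and will mis-handle things like regex literals and template strings):

```javascript
// Sketch: pull out identifiers and string literals from a JS source chunk,
// de-duplicated in order of first appearance. Regex-based approximation only.
function extractNames(source) {
  const seen = new Set();
  const out = [];
  const tokenRe = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[A-Za-z_$][A-Za-z0-9_$]*/g;
  for (const token of source.match(tokenRe) || []) {
    if (!seen.has(token)) {
      seen.add(token);
      out.push(token);
    }
  }
  return out;
}
```

A later step would diff this output between builds, showing only names/strings not seen previously.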

Example input 2

Since common generic ones like e, t, n, "", and "." would show up frequently in any context, they would already have been seen in the past and therefore be filtered out. You'd focus more on strings like "Use Up and Down to choose options", with some kind of convenient way to jump back and see them in-context in the code.

For input like the following:

+      var a = n(72843);
+      function s(e, t) {
+        for (var n = 0; n < t.length; n++) {
+          var r = t[n];
+          (r.enumerable = r.enumerable || !1),
+            (r.configurable = !0),
+            "value" in r && (r.writable = !0),
+            Object.defineProperty(e, (0, a.Z)(r.key), r);
+        }
+      }
+      function l(e, t, n) {
+        return (
+          t && s(e.prototype, t),
+          n && s(e, n),
+          Object.defineProperty(e, "prototype", { writable: !1 }),
+          e
+        );
+      }

For input like that, none of the names or strings would likely be new, and so you wouldn't see it at all. This is intended: I can't glean any conclusions from looking at it, and would prefer not to see it.

Glenn's comments

https://twitter.com/_devalias/status/1770284997385277554

I think given the size of a lot of the JS files, and the diffs themselves, it would probably end up being a LOT of strings; which might be confusing when removed from the context of the surrounding code.

For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on its own, which could lead to something good in-context.

It should be fairly easy to prototype a script using the Babel parser and Babel traverse though.
You would add a rule or two to the traverse so that it matches whatever strings appear in the AST, and then output them to the console or a file or similar.

Haven't worked with Babel, but some relevant docs seem to be

Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required

Then you would just diff that output file of strings between one build and the next.
If code moves around between builds it might introduce its own form of noise (but maybe git diff --color-moved would handle that anyway)

I haven't seen enough diffs to anticipate exactly how these would look, but there might be different solutions like --color-moved that could work depending on how it goes

I also noticed you liked some of my tweets about my more generalised diff minimiser, which would reduce the noise of things a fair bit overall as well.
I still need to polish that and commit/upload it; I've been super busy lately and haven't had a chance yet.

Related:

Feel free to open an issue on the ChatGPT Source watch repo about the string extractor idea + link back to these tweets/copy the relevant info in.
I’d be happy to give some more pointers about it and/or include it in the repo if you wanted to work on it.

Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient approach might be to integrate with a text editor, which makes it harder to replicate

Explore AST based diff tools

There can be a lot of 'noise' when diffing minified bundled code, as the bundler will often change the minified variable names it uses between builds (even if the rest of the code hasn't changed)

We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:

⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
  116000

⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35826

⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35835

⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35844

Musings

⭐ Suggestion

It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)

Ideally the output would still be text (not the AST tree), but the actual diffing could be done at the AST level.

💻 Use Cases

This would be really useful for minimising the noise when diffing minified source builds, looking for the 'real changes' between builds (not just minified variable name churn, etc.)

Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.

Some alternative potential workarounds I've considered are pre-processing the files to standardize their variable/function names, and/or post-processing the diff output to try to detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js

Originally posted by @0xdevalias in ast-grep/ast-grep#901

See Also

Automate checking for new builds with GitHub action/similar

Currently the process of checking for new builds is somewhat of a 'manual assisted' process: browsing to the ChatGPT site, letting the chatgpt-web-app-script-update-notifier user script check if any of the script files have changed, then potentially reacting to that notification with more manual steps.

You can see the full manual steps outlined on this issue:

But the core initial steps are summarised below:

At a very high level, off the top of my head, my current process is basically:

Originally posted by @0xdevalias in #7 (comment)

Because the notifier currently only runs when the ChatGPT app is accessed, it is easy both to miss updates (eg. if updates happen but the ChatGPT app isn't accessed), and to get distracted from the task that ChatGPT was originally opened for by the fact that there is a new update (leading to a tantalising procrastination/avoidance 'treat' when the task at hand brings less dopamine)

The proposed solution would be to use GitHub Actions or similar to schedule an 'update check' at a regular interval (eg. once per hour). The following are some notes I made in an initial prompt to ChatGPT for exploring/implementing this:

Can you plan out and create a github action that will:

- run on a schedule (eg. every 1hr)
- check the HTML on a specified webpage and extract some .js script URLs related to a bundled webpack/next.js app
- check (against a cache or similar? not sure of the best way to implement this on github actions) if those URLs have been previously recorded
- if they are new URLs, notify the user and/or kick off further processing (this will probably involve executing one or more scripts that will then download/process the URLs)

That describes the most basic features this should be able to handle (off the top of my head), but the ideal plan is that the solution will be expandable to be able to handle and automate more of the process in future. Some ideas for future features would be:

  • being able to open a Pull Request for each new build, that contains the downloaded files, and the results of various scripts being run on them. This PR would also serve as an interface to prompt the user with any manual actions that are required of them, and some 'bot commands'/workflow for finalising the updates to the CHANGELOG/etc (eg. rebase the PR)
  • etc
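As a very rough sketch of what the scheduled workflow could look like (the page URL, file names, and extraction regex below are all hypothetical placeholders, not a tested implementation):

```yaml
name: check-for-new-builds
on:
  schedule:
    - cron: '0 * * * *' # hourly
  workflow_dispatch: {}

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch page and extract script URLs
        run: |
          curl -s https://chatgpt.com/ \
            | grep -oE 'https://cdn\.oaistatic\.com/[^"]+\.js' \
            | sort -u > current-urls.txt
      - name: Compare against previously seen URLs
        run: |
          # A seen-urls.txt committed to the repo acts as the 'cache'
          comm -13 <(sort seen-urls.txt) current-urls.txt > new-urls.txt
          if [ -s new-urls.txt ]; then
            echo "New build detected:" && cat new-urls.txt
            # TODO: notify and/or kick off the download/processing scripts here
          fi
```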

See Also

Explore running CodeQL queries against the extracted/unpacked webpack source

From a chat with a friend:

Dunno how well it will work in reality.. but apparently I can run codeql against a random site's webpacked frontend code that I downloaded locally (in this case chatgpt)

codeql database create ~/Desktop/chatgpt-codeql-test-db --language=javascript --source-root ./unpacked

And I could use Chrome Devtools Protocol (CDP) to watch a site for when scripts are parsed, and then to access the source of those parsed scripts (which I could then automagically save locally/similar, and then run codeql on)

codeql database analyze ~/Desktop/chatgpt-codeql-test-db --format=csv --output=./chatgpt-codeql-output.csv --download codeql/javascript-queries


Huh.. it actually worked and output a bunch of warnings. They could be false positives/irrelevant/etc., and I'd need to manually look closer to understand more about them and whether they are actually interesting.. but the fact that it worked at all on webpacked code (that had only been run through prettier to format it) is pretty neat

"Improper code sanitization","Escaping code as HTML does not provide protection against code injection.","error","Code construction depends on an [[""improperly sanitized value""|""relative:///_next/static/chunks/pages/_app.js:28576:35:28576:52""]].","/_next/static/chunks/pages/_app.js","28576","21","28576","60"
"Improper code sanitization","Escaping code as HTML does not provide protection against code injection.","error","Code construction depends on an [[""improperly sanitized value""|""relative:///_next/static/chunks/pages/_app.js:28581:35:28581:52""]].","/_next/static/chunks/pages/_app.js","28581","21","28581","60"
"Incomplete URL substring sanitization","Security checks on the substrings of an unparsed URL are often vulnerable to bypassing.","warning","'[[""slack.com""|""relative:///_next/static/chunks/496.js:8801:33:8801:43""]]' can be anywhere in the URL, and arbitrary hosts may come before or after it.","/_next/static/chunks/496.js","8801","11","8801","44"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [&'()*+,\-.\/0-9:;].","/_next/static/chunks/653.js","42385","18","42385","20"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [?@A-Z].","/_next/static/chunks/653.js","42385","22","42385","24"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].","/_next/static/chunks/653.js","48571","30","48571","32"
"Overly permissive regular expression range","Overly permissive regular expression ranges match a wider range of characters than intended. This may allow an attacker to bypass a filter or sanitizer.","warning","Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].","/_next/static/chunks/653.js","52124","34","52124","36"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""*"".","/_next/static/chunks/1f110208.js","7333","17","7333","33"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""\\"".","/_next/static/chunks/1f110208.js","8042","33","8042","51"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""\\"".","/_next/static/chunks/1f110208.js","8048","33","8048","52"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This does not escape backslash characters in the input.","/_next/static/chunks/653.js","55568","32","55568","40"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of /%3A/i.","/_next/static/chunks/main.js","5109","18","5109","46"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of ""#"".","/_next/static/chunks/main.js","5130","18","5130","26"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of /[\]]/.","/_next/static/chunks/pages/_app.js","24434","20","24434","50"
"Incomplete string escaping or encoding","A string transformer that does not replace or escape all occurrences of a meta-character may be ineffective.","warning","This replaces only the first occurrence of /[[]/.","/_next/static/chunks/pages/_app.js","24434","20","24434","28"
"Prototype-polluting function","Functions recursively assigning properties on objects may be the cause of accidental modification of a built-in prototype object.","warning","The property chain [[""here""|""relative:///_next/static/chunks/pages/_app.js:38412:19:38412:22""]] is recursively assigned to [[""Y""|""relative:///_next/static/chunks/pages/_app.js:38414:46:38414:46""]] without guarding against prototype pollution.","/_next/static/chunks/pages/_app.js","38414","46","38414","46"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4811","29","4811","38"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4812","31","4812","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4819","29","4819","38"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4820","31","4820","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4828","31","4828","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4829","33","4829","42"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4837","29","4837","38"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4838","31","4838","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4850","31","4850","40"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","4851","33","4851","42"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","5079","25","5079","34"
"Insecure randomness","Using a cryptographically weak pseudo-random number generator to generate a security-sensitive value may allow an attacker to predict what value will be generated.","warning","This uses a cryptographically insecure random number generated at [[""Math.random()""|""relative:///_next/static/chunks/polyfills.js:182:9:182:21""]] in a security context.","/_next/static/chunks/polyfills.js","5080","25","5080","34"

Explore creating a 'reverse engineered' records.json / stats.json file from a webpack build

This is an idea I've had in passing a few times, but keep forgetting to document it:

  • https://medium.com/@songawee/long-term-caching-using-webpack-records-9ed9737d96f2
    • there are many factors that go into getting consistent filenames. Using Webpack records helps generate longer lasting filenames (cacheable for a longer period of time) by reusing metadata, including module/chunk information, between successive builds. This means that as each build runs, modules won’t be re-ordered and moved to another chunk as often which leads to less cache busting.

    • The first step is achieved by a Webpack configuration setting: recordsPath: path.resolve(__dirname, './records.json')
      This configuration setting instructs Webpack to write out a file containing build metadata to a specified location after a build is completed.

    • It keeps track of a variety of metadata including module and chunk ids which are useful to ensure modules do not move between chunks on successive builds when the content has not changed.

    • With the configuration in place, we can now enjoy consistent file hashes across builds!

    • In the following example, we are adding a dependency (superagent) to the vendor-two chunk.

      We can see that all of the chunks change. This is due to the module ids changing. This is not ideal as it forces users to re-download content that has not changed.

      The following example adds the same dependency, but uses Webpack records to keep module ids consistent across the builds. We can see that only the vendor-two chunk and the runtime changes. The runtime is expected to change because it has a map of all the chunk ids. Changing only these two files is ideal.

  • https://webpack.js.org/configuration/other-options/#recordspath
    • recordsPath: Use this option to generate a JSON file containing webpack "records" – pieces of data used to store module identifiers across multiple builds. You can use this file to track how modules change between builds.

  • https://github.com/search?q=path%3A%22webpack.records.json%22&type=code
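As a very rough first step toward reconstructing the module-to-chunk half of such a records file ourselves, something like the following heuristic could scan the unpacked chunks (a hypothetical sketch; real chunks would need proper parsing rather than a regex over the module table):

```javascript
// Sketch: map webpack module ids to the chunk file that contains them, by
// matching the `123: function (...)` / `123: (...) =>` module table entries.
// Heuristic only; the function name and input shape are hypothetical.
function moduleIdsByChunk(chunks) {
  const mapping = {};
  for (const [chunkFile, source] of Object.entries(chunks)) {
    const ids = [...source.matchAll(/(\d+):\s*(?:function|\()/g)].map((m) => m[1]);
    for (const id of ids) mapping[id] = chunkFile;
  }
  return mapping;
}
```

Diffing two such mappings between builds would show modules that moved chunks, similar to what webpack records are meant to prevent.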

I'm not 100% sure if this would be useful, or partially useful, but I think I am thinking of it tangentially in relation to things like:

Script to calculate the raw/minimised lines of diff change for each file

Being able to see at a glance what the raw and minimised diff lines are for each chunk file can help in determining how much effort reviewing a build will take. If we can copy/paste this as markdown (or insert it directly into the default CHANGELOG entry), then we can also give some useful stats for a build even if we do no deeper manual analysis.

eg.

- TODO: The following files haven't been deeply reviewed:
  - `unpacked/_next/static/chunks/101.js` (`931` lines)
  - `unpacked/_next/static/chunks/2637.js` (`5290` lines)
  - `unpacked/_next/static/chunks/3032.js` (`150,855` lines)
  - `unpacked/_next/static/chunks/30750f44.js` (diff: `45,368` lines, minimised diff: `17,151` lines)
  - `unpacked/_next/static/chunks/3453.js` (`403` lines)
    - Seem to be a bunch of images, likely related to image generation styling or similar.
  - `unpacked/_next/static/chunks/3472.js` (`320` lines)
    - Statsig, Feature Gates, Experimental Gates, etc
  - `unpacked/_next/static/chunks/3842.js` (`755` lines)
  - `unpacked/_next/static/chunks/3a34cc27.js` (diff: `4373` lines, minimised diff: `1633` lines)
  - `unpacked/_next/static/chunks/4114.js` (diff: `1411` lines, minimised diff: `1373` lines)
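Once the per-file counts are computed (eg. via `git diff ... | wc -l` and the minimised variant), the markdown itself is trivial to generate; a hypothetical formatter:

```javascript
// Sketch: format per-file diff stats as a markdown TODO list for the CHANGELOG.
// The function name and input shape are hypothetical; the real work is running
// git diff / the diff minimiser per file to get the numbers.
function formatDiffStats(stats) {
  const lines = ["- TODO: The following files haven't been deeply reviewed:"];
  for (const [file, { diff, minDiff }] of Object.entries(stats)) {
    const detail =
      minDiff != null
        ? `diff: \`${diff.toLocaleString('en-US')}\` lines, minimised diff: \`${minDiff.toLocaleString('en-US')}\` lines`
        : `\`${diff.toLocaleString('en-US')}\` lines`;
    lines.push(`  - \`${file}\` (${detail})`);
  }
  return lines.join('\n');
}
```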

Fix script for extracting CSS URLs from `webpack.js` + unpacking `*.css` files

In the past there was only a single *.css URL extracted from webpack.js (from the miniCssF field), so it was unpacked as miniCssF.css (the *.css file hashes change every time they are re-built, and they don't seem to have a static chunk part in their filenames when downloaded)

More recently, there have been new *.css files specific to certain chunks (sometimes shared among multiple chunks), and so the scripts for extracting these are broken and produce an entry like this:

https://cdn.oaistatic.com/_next/undefined

We also need to think about how best to name the files. I think main.css would probably work for the 'main' chunk (previously what we called miniCssF). For the *.css files related to the other chunks: if they only applied to a single chunk I would probably name them based on that chunk, but sometimes they are used by multiple chunks. If doing it manually we could probably figure out what they are used for and name them based on that, but I'm not sure of the best way to do this automatically. We can't use the hash of the *.css file, as that changes every time the file changes.
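One possible direction for the extraction fix, sketched below: pull the per-chunk hash map out of the miniCssF function body in webpack.js and build URLs from it (hypothetical and heuristic; this assumes the runtime embeds a `{chunkId: "hash"}` style map just before the `".css"` suffix, which would need verifying against the current webpack.js):

```javascript
// Sketch: recover per-chunk CSS URLs from the miniCssF hash map in webpack.js.
// Assumes a `{chunkId: "hexhash"}` map adjacent to the ".css" suffix; the
// function name and URL layout here are assumptions, not robust parsing.
function extractCssUrls(webpackJsSource, cdnBase) {
  const urls = {};
  const mapMatch = webpackJsSource.match(/\{([^{}]*)\}\s*\[\w+\]\s*\+\s*"\.css"/);
  if (!mapMatch) return urls;
  for (const m of mapMatch[1].matchAll(/(\d+):\s*"([0-9a-f]+)"/g)) {
    const [, chunkId, hash] = m;
    urls[chunkId] = `${cdnBase}/_next/static/css/${hash}.css`;
  }
  return urls;
}
```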

Explore stack graphs / scope graphs

Stack Graphs (an evolution of Scope Graphs) sound like they could be really interesting/useful with regards to code navigation, symbol mapping, etc. Perhaps we could use them for module identification, or variable/function identifier naming stabilisation or similar?

  • https://github.blog/changelog/2024-03-14-precise-code-navigation-for-typescript-projects/
    • Precise code navigation is now available for all TypeScript repositories.
      Precise code navigation gives more accurate results by only considering the set of classes, functions, and imported definitions that are visible at a given point in your code.

      Precise code navigation is powered by the stack graphs framework.
      You can read about how we use stack graphs for code navigation and visit the stack graphs definition for TypeScript to learn more.

      • https://github.blog/2021-12-09-introducing-stack-graphs/
        • Introducing stack graphs

        • Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job.

        • LOTS of interesting stuff in this post..
        • As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere.

        • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

          To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas.

  • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github
    • GitHub has developed two code navigation approaches based on the open source tree-sitter and stack-graphs library:

      • Search-based - searches all definitions and references across a repository to find entities with a given name
      • Precise - resolves definitions and references based on the set of classes, functions, and imported definitions at a given point in your code

      To learn more about these approaches, see "Precise and search-based navigation."

      • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github#precise-and-search-based-navigation
        • Precise and search-based navigation
          Certain languages supported by GitHub have access to precise code navigation, which uses an algorithm (based on the open source stack-graphs library) that resolves definitions and references based on the set of classes, functions, and imported definitions that are visible at any given point in your code. Other languages use search-based code navigation, which searches all definitions and references across a repository to find entities with a given name. Both strategies are effective at finding results and both make sure to avoid inappropriate results such as comments, but precise code navigation can give more accurate results, especially when a repository contains multiple methods or functions with the same name.

  • https://pl.ewi.tudelft.nl/research/projects/scope-graphs/
    • Scope Graphs | A Theory of Name Resolution

    • Scope graphs provide a new approach to defining the name binding rules of programming languages. A scope graph represents the name binding facts of a program using the basic concepts of declarations and reference associated with scopes that are connected by edges. Name resolution is defined by searching for paths from references to declarations in a scope graph. Scope graph diagrams provide an illuminating visual notation for explaining the bindings in programs.
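To make the core idea above concrete, here is a toy sketch in plain JavaScript (not the actual stack-graphs crate/API; all names are made up) of resolving a reference to a declaration by searching a path through a graph of scopes:

```javascript
// Toy scope-graph sketch: declarations live in scopes connected by edges,
// and name resolution is a path search from a reference's scope to a scope
// containing a matching declaration.
class Scope {
  constructor(parent = null) {
    this.parent = parent;   // lexical "parent scope" edge
    this.decls = new Map(); // name -> declaration site
  }
  declare(name, site) { this.decls.set(name, site); }
  // Resolve a reference: follow parent edges until some scope declares `name`
  resolve(name) {
    for (let s = this; s !== null; s = s.parent) {
      if (s.decls.has(name)) return s.decls.get(name);
    }
    return null; // unresolved reference
  }
}
```

A real stack graph generalises this single lexical-parent chain to arbitrary edges (imports, member access, etc.), using symbol push/pop nodes to keep the path search precise, as the GitHub blog post above describes.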

Potentially Related

  • https://en.wikipedia.org/wiki/Code_property_graph
    • A code property graph of a program is a graph representation of the program obtained by merging its abstract syntax trees (AST), control-flow graphs (CFG) and program dependence graphs (PDG) at statement and predicate nodes. The resulting graph is a property graph, which is the underlying graph model of graph databases such as Neo4j, JanusGraph and OrientDB where data is stored in the nodes and edges as key-value pairs. In effect, code property graphs can be stored in graph databases and queried using graph query languages.

    • Joern CPG. The original code property graph was implemented for C/C++ in 2013 at University of Göttingen as part of the open-source code analysis tool Joern. This original version has been discontinued and superseded by the open-source Joern Project, which provides a formal code property graph specification applicable to multiple programming languages. The project provides code property graph generators for C/C++, Java, Java bytecode, Kotlin, Python, JavaScript, TypeScript, LLVM bitcode, and x86 binaries (via the Ghidra disassembler).

      • https://github.com/joernio/joern
        • Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs.

        • Joern is a platform for analyzing source code, bytecode, and binary executables. It generates code property graphs (CPGs), a graph representation of code for cross-language code analysis. Code property graphs are stored in a custom graph database. This allows code to be mined using search queries formulated in a Scala-based domain-specific query language. Joern is developed with the goal of providing a useful tool for vulnerability discovery and research in static program analysis.

        • https://joern.io/
        • https://cpg.joern.io/
          • Code Property Graph Specification 1.1

          • This is the specification of the Code Property Graph, a language-agnostic intermediate graph representation of code designed for code querying.

            The code property graph is a directed, edge-labeled, attributed multigraph. This specification provides the graph schema, that is, the types of nodes and edges and their properties, as well as constraints that specify which source and destination nodes are permitted for each edge type.

            The graph schema is structured into multiple layers, each of which provide node, property, and edge type definitions. A layer may depend on multiple other layers and make use of the types it provides.

  • https://docs.openrewrite.org/concepts-explanations/lossless-semantic-trees
    • A Lossless Semantic Tree (LST) is a tree representation of code. Unlike the traditional Abstract Syntax Tree (AST), OpenRewrite's LST offers a unique set of characteristics that make it possible to perform accurate transformations and searches across a repository.
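The code property graph idea above can be sketched as a single node set with the AST, CFG and PDG overlaid as differently-labelled edges (a hypothetical mini-schema for illustration only; Joern's real schema is the spec at cpg.joern.io):

```javascript
// Minimal property-graph sketch of the CPG idea: nodes carry key-value
// properties, and AST / CFG / data-dependence relations are edge labels.
const cpg = { nodes: new Map(), edges: [] };
const addNode = (id, props) => cpg.nodes.set(id, props);
const addEdge = (from, to, label) => cpg.edges.push({ from, to, label });

// A tiny fragment: a strcpy call whose argument is data-dependent on a read
addNode('read1', { type: 'CALL', name: 'gets' });
addNode('call1', { type: 'CALL', name: 'strcpy' });
addNode('arg1',  { type: 'IDENTIFIER', name: 'buf' });
addEdge('call1', 'arg1', 'AST');     // syntactic structure
addEdge('read1', 'call1', 'CFG');    // control flow
addEdge('read1', 'arg1', 'REACHES'); // data dependence (PDG)

// Query in the Joern spirit: strcpy calls with a data-dependent argument
function riskyCalls() {
  return [...cpg.nodes.entries()]
    .filter(([, p]) => p.type === 'CALL' && p.name === 'strcpy')
    .filter(([id]) => cpg.edges.some(a => a.label === 'AST' && a.from === id &&
      cpg.edges.some(d => d.label === 'REACHES' && d.to === a.to)))
    .map(([id]) => id);
}
```

The point of merging the three representations is exactly this kind of query: syntax, control flow, and data flow can be matched in a single graph traversal rather than in three separate tools.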
