Comments (41)
Interesting about the TypeScript issue. Looks like it's caused by the files in
src/qt/locale
, which all have.ts
extensions (guess it's the extension of Qt translation files, too).b) perhaps this is related to C++ header detection?
Most likely. What it comes down to is that
tokei
, our dependency for language detection, naively using filenames instead of the actual content for categorization.github-linguist
performs heuristics on file contents, helping make it much more accurate when there are conflicts with extensions. You can try creating an issue over there, but looking at the issue history it might be closed as a duplicate of XAMPPRocky/tokei#67.
I reported this issue over a year ago: XAMPPRocky/tokei#719.
from onefetch.
Sorry, I incorrectly assumed that tokei allowed language overrides, but I guess that's not implemented yet.
Well, the only workaround that I know of is renaming all Verilog files to *.vg
😅
from onefetch.
This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale
bot.
from onefetch.
Interesting about the TypeScript issue. Looks like it's caused by the files in src/qt/locale
, which all have .ts
extensions (guess it's the extension of Qt translation files, too).
b) perhaps this is related to C++ header detection?
Most likely. What it comes down to is that tokei
, our dependency for language detection, naively using filenames instead of the actual content for categorization. github-linguist
performs heuristics on file contents, helping make it much more accurate when there are conflicts with extensions. You can try creating an issue over there, but looking at the issue history it might be closed as a duplicate of XAMPPRocky/tokei#67.
from onefetch.
@o2sh We might want to create a known-issues.md
file, documenting workarounds (alternate extensions to use, tokei config file snippets, etc.).
from onefetch.
Coming back to a really old issue to document a potential solution:
It might be worth creating a new crate that acts as a wrapper for tokei. This wrapper would provide its own function for getting languages, adding the following:
- heuristics using regexes
- naive bayesian classification using code samples
Basically, it should do what github-linguist does.
Also, it should probably re-export the rest of tokei's public interface to make usage easier.
Such a crate should probably be in a separate repository, as I anticipate releases occurring on a very different schedule from onefetch.
Additionally, this crate would probably need a lot of community support to provide the heuristics and code samples.
I might attempt to do this sometime, but I can't promise that it will be soon. If someone else wants to take this on, I'll be happy to help and discuss this further.
from onefetch.
Hey everyone following this 👋
There's been a bit of discussion here, but to keep you all up to date: I went ahead and started a project called gengo that should be more linguist-like, to hopefully improve our language detection eventually. Unlike tokei, there can be file extension collisions, and gengo will try to pick the right language using heuristics. For example, for this comment, it would need to register ts
as an XML file extension, and include a heuristic to be confident that the .ts
file is actually XML.
But right now, gengo doesn't support nearly enough languages. While I can just grab the data from linguist (and maybe I eventually will), right now I'm hoping that language support grows more organically, with discussion for each added language. So if you'd like to contribute, please do! I'll definitely need help with languages that I'm unfamiliar with, especially when it comes to adding heuristics, for example for C and C++ .h
header files.
Edit: See spenserblack/gengo#34
from onefetch.
Maybe onefetch could by default show whatever it finds and then people that want less output can play with filters.
The issue is that data languages can very easily become the most common language, even if they're not the primary language used. JavaScript projects will often have many more lines of JSON due to package-lock.json
(or YAML if you use yarn
or pnpm
), Rust projects can have more lines of TOML than Rust due to Cargo.lock
, etc. So we hide data and prose by default because they often don't represent the "real" primary language even when they're the most common.
I'd be happy to discuss this further, but if you want to continue the discussion please open a new issue or a discussion, since this issue is "why is language X detected as Y?" instead of "why isn't language X displayed?" 🙂
from onefetch.
But why ?
https://github.com/Aaronepower/tokei is written in Rust and does a great job detecting languages.
from onefetch.
Is that what Onefetch currently uses? It detects C++ as C in the case of Godot, and it didn't detect anything for the repo of a Godot project (while GitHub detects GDScript).
from onefetch.
it only detects the languages that are currently supported by onefetch (WIP):
C
Clojure
C++
C#
Go
Haskell
Java
Lisp
Lua
Python
R
Ruby
Rust
Scala
Shell
TypeScript
JavaScript
Php
Also tokei ignores all commented lines which is why the language distribution sometimes differs from GH.
Supported languages by tokei --> https://github.com/Aaronepower/tokei#supported-languages
from onefetch.
Upstream issues: XAMPPRocky/tokei#305 and XAMPPRocky/tokei#67
We can leave this closed though if you want.
from onefetch.
Ok, with the new title it makes more sense to keep this open.
We'll wait for tokei to fix it then.
Thx @aaronfranke
from onefetch.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from onefetch.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from onefetch.
This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale
bot.
from onefetch.
Hi, I added the c header and cpp header to language.rs file to my fork
https://github.com/mapau/onefetch/tree/feature/add-c-cpp-header
https://github.com/Aaronepower/tokei already detects the c header and cpp header only the mapping in onefetch is missing.
Here is a PR #365
from onefetch.
I'm not very fund of this idea of having separate entries for header files (CHeader and C++Header). I personally prefer the GitHub Linguist approach of extending C and C++ detection scope to include their respective header files:
C++:
type: programming
tm_scope: source.c++
ace_mode: c_cpp
codemirror_mode: clike
codemirror_mime_type: text/x-c++src
color: "#f34b7d"
aliases:
- cpp
extensions:
- ".cpp"
- ".c++"
- ".cc"
- ".cp"
- ".cxx"
- ".h"
- ".h++"
- ".hh"
- ".hpp"
- ".hxx"
- ".inc"
- ".inl"
- ".ino"
- ".ipp"
- ".re"
- ".tcc"
- ".tpp"
C:
type: programming
color: "#555555"
extensions:
- ".c"
- ".cats"
- ".h"
- ".idc"
interpreters:
- tcc
tm_scope: source.c
ace_mode: c_cpp
codemirror_mode: clike
codemirror_mime_type: text/x-csrc
language_id: 41
I doubt the people over at tokei would be ready to make that shift...So, either we stick to tokei's detection rules and merge @mapau's PR, or we override the logic in Onefetch or...
from onefetch.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from onefetch.
This issue still exists, though it is seen by the devs as low priority, so I'll probably have to bump this again later to please the stale
bot. The problem still deserves some kind of solution eventually.
from onefetch.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from onefetch.
This issue still exists, though it is seen by the devs as low priority, so I'll probably have to bump this again later to please the stale
bot. The problem still deserves some kind of solution eventually.
from onefetch.
FWIW I tried this on the Bitcoin Core repo and it thinks it's 50.5%...
a) theres no typescript in the project
b) perhaps this is related to C++ header detection?
onefetch 2.11.0
github-linguist
does return the correct information on this repo:
[20:05] ubuntu:bitcoin (master) | ₿ github-linguist
67.10% 9517488 C++
18.93% 2684890 Python
8.65% 1227100 C
1.66% 235453 M4
1.47% 209083 Shell
0.98% 138998 Makefile
0.40% 56897 Sage
0.21% 29132 CMake
0.20% 28178 Assembly
0.17% 24552 Scheme
0.15% 21833 HTML
0.04% 5497 Objective-C++
0.01% 1721 Dockerfile
0.01% 1256 Cap'n Proto
0.00% 541 Java
0.00% 438 QMake
from onefetch.
Right. I did take a quick look at those issues but agree that it seems likely that it will be closed as a duplicate.
Can't really see how this can be improved without either using the github tool (which works).
from onefetch.
I don't think we should be using github-linguist
directly, since it's written in Ruby, and any interfacing with it would likely be hack-y. However, it could be possible to make a Rust port of linguist that uses the same data files (similar to what linguist-js does). But one cool thing that tokei does is categorize lines so that commented lines can be treated separately from actual code, which I think we would lose by migrating to a lingusit-like crate. Removing the tokei dependency would also be a large change on our end at this point.
Another option is to fork tokei to fix this.
Unfortunately, for your use case, it looks like you would have to use a tokei config file for now to ignore "TypeScript," since that is tokei's preferred solution.
from onefetch.
I think we should start collecting all "incorrect language detected" issues together, so I'm going to rename this to something more general and pin it.
from onefetch.
Incorrect detection of Verilog using tokei.
tokei uses file extension *.vg
for verilog while the more popular choice is *.v
tokei *.v
is defined as COQ, all this is described in tokei issue 520
When considering a new approach, please consider verilog file identification as a useful test case.
from onefetch.
Thanks for the report, @atluft.
I guess this also means that tokei can have a problem with V, it it is ever implemented.
from onefetch.
I'd be happy to do so, but do we actually have any workaround for this? 🤔
As far as I know tokei still doesn't provide an option to allows users to override the extensions - as suggested here
from onefetch.
that acts as a wrapper for tokei.
You mean exposing the same set of APIs? or is it gonna reuse some of tokei's code?
If I'm understanding correctly, the new project will be similar to github-linguist
but implemented in Rust. Does that mean sacrificing some performance for better accuracy?
Regardless, it's definitely an intriguing challenge. If executed well, it could gain a lot of traction, especially given the current state/limitations of existing solutions 😢.
I'd be happy to help 👍
from onefetch.
You mean exposing the same set of APIs? or is it gonna reuse some of tokei's code?
Mostly exposing the same API. Basically a bunch of pub use
. For the re-implementation of get_statistics
, some code reuse might be necessary.
This could also be a fork of tokei. I was just thinking about "wrapping" tokei since AFAIK get_statistics
is the only part we're not satisfied with. But maybe forking makes more sense so we wouldn't have to re-implement as much internal code 🤷
Also, since we're mentioning github-linguist, I should note that linguist actually analyzes the HEAD
(or other rev) of a repository, not the filesystem, which is pretty cool. But I'm not sure if we'd want that since we display pending change stats.
So I guess the first question is: do we want to improve tokei for our purposes, or port linguist to Rust?
from onefetch.
I'm seeing this as well on a repository:
$ github-linguist
46.47% 205323 Shell
38.84% 171601 Lua
8.45% 37319 Perl
2.97% 13138 Ruby
2.91% 12836 Python
0.19% 861 Scheme
0.17% 750 Vim Script
Which is correct, while onefetch
things I have a lot of Go code:
$ onefetch
--==============-- Dan Sully ~ git version 2.40.0
.-==-.===oooo=oooooo=ooooo===--===- ------------------------------
.== =o=oGGGGGGo=oo=oGGGGGGGG=o= oo- Project: dotfiles
-o= oo=G .=GGGGGo=o== .=GGGGG=ooo o=- HEAD: f76ffe4 (master)
.-=oo=o==oGGGGG=oo=oooGGGGGo=oooo. Pending: 12+- 153+
-ooooo=oooooo=. .==ooo==oooooo- Created: 3 years ago
-ooooooooooo====_====ooooooooooo= Languages:
-oooooooooooo==#.#==ooooooooooooo ● Go (21.6 %) ● HTML (18.9 %)
-ooooooooooooo=#.#=oooooooooooooo ● Rust (14.8 %) ● Python (9.9 %)
.oooooooooooooooooooooooooooooooo. ● C (7.2 %) ● Vim script (7.1 %)
oooooooooooooooooooooooooooooooo. ● Other (20.5 %)
..oooooooooooooooooooooooooooooooo.. Authors: 99% Dan Sully 1397
-=o-=ooooooooooooooooooooooooooooooo-oo. 0% Dan Sully 5
.=- oooooooooooooooooooooooooooooooo-.- 0% Dan Sully 3
.oooooooooooooooooooooooooooooooo- Last change: 17 hours ago
-oooooooooooooooooooooooooooooooo- Contributors: 7
-oooooooooooooooooooooooooooooooo- URL: [email protected]:dsully/dotfiles.git
-oooooooooooooooooooooooooooooooo- Commits: 1409
.oooooooooooooooooooooooooooooooo Lines of code: 5023922
=oooooooooooooooooooooooooooooo- Size: 1.13 MiB (241 files)
.=oooooooooooooooooooooooooooo-
-=oooooooooooooooooooooooo=.
=oo====oooooooooooooooo==-oo=-
.-==- .--=======--- .==-
from onefetch.
Hi @dsully
You may want to try the --include-hidden
flag if you wish to account for hidden files. Here is what you would get:
from onefetch.
Pending: 12+- 153+
To add to what @o2sh said, github-linguist
considers committed files, but onefetch
considers all files, which could explain why the golang code is detected. There's likely some Go code in a subfolder of your local dotfiles repo. In ~/go/
, for example.
from onefetch.
Got it. I had wrongly assumed onefetch
only looked at the repo / committed files. Should there be an option for that?
from onefetch.
Tokei doesn't support that AFAIK. We could potentially make use of git ls-files
, but that wouldn't be accurate since we'd be analyzing tracked files, but filesystem contents instead of committed contents. But this is getting off-topic of the initial issue, the wrong language being detected, IMO. Analyzing tracked contents instead of filesystem contents is an exciting idea, and I'd be happy to discuss it further, but I think we should continue in a separate issue 🙂
from onefetch.
Agreed about moving to a new issue - would you like me to create one?
from onefetch.
Sure, go ahead 👍
from onefetch.
Is YAML currently supported? Our repository contains like 99% of YAML files and still onefetch says it's 100% Shell.
from onefetch.
Is YAML currently supported?
Yes. By default onefetch doesn't report on data languages (JSON, YAML). Check the -T
option to choose which types of languages are reported.
🤔 We've gotten this question several times. Is there any way that we can make this easier to discover?
from onefetch.
@spenserblack Oh dear. I even saw -T option but I didn't even think that default values would exclude YAML.
Maybe onefetch could by default show whatever it finds and then people that want less output can play with filters.
Anyway thanks for your answer. Works like a charm now.
from onefetch.
Related Issues (20)
- mercurial support HOT 8
- Code in subdirectory HOT 3
- Display code metrics HOT 1
- Chore: add branch protections/rulesets to `main` branch HOT 6
- Add option to exclude files from churn HOT 4
- Errs when invoked on partial clones HOT 9
- .cls file extension not recognized HOT 6
- Make Debian packages HOT 6
- Use workspace inheritance
- build for `aarch64-pc-windows-gnullvm` is broken HOT 3
- Specify rev besides HEAD? HOT 1
- Arduino language support request HOT 15
- Switch from `actions-rs/*` actions HOT 1
- Astro support HOT 4
- [doc] Universal install anchor is broken HOT 2
- True color support not recognized on Windows Terminal HOT 6
- Make it easier to discover that data languages (JSON, YAML) and prose (Markdown) are not reported by default HOT 2
- Translate `README.md` to German 🇩🇪
- Parse .gitattributes HOT 1
- GPL-3.0-or-later licenses wrongly reported as GPL-3.0-only HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from onefetch.