Files containing ascii8 are not indexed (feature/request)

google commented on May 5, 2024

Files containing ascii8 are not indexed (feature/request)

from codesearch.

Comments (4)

GoogleCodeExporter commented on May 5, 2024

But I don't see a log message on the console about skipping the file. The 
cindex run looks normal.

Original comment by [email protected] on 21 Nov 2012 at 7:36

GoogleCodeExporter commented on May 5, 2024

I encountered all these issues you mentioned and was annoyed enough by them to 
implement the following changes for myself at 
https://github.com/junkblocker/codesearch

1) Do not stop at the first bad UTF-8 character encountered. Instead, allow a 
percentage of non-UTF-8 characters to be in the document. These are ignored but 
the rest of the document gets indexed. The option, which I call 
-maxinvalidutf8ratio, defaults to 0.1. This, combined with treating any 
document containing a 0x00 byte as binary, has been working great for me.

2) Allow a custom trigram limit (-maxtrigrams). The current hardcoded limit is 
20000 trigrams, but I sadly have to work on code with one important source file 
beyond that.

3) Add a message giving the reason for every document skipped during indexing.
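
As a rough illustration of changes 1 and 3, here is a minimal Go sketch. It is not code from either repository: the names `invalidUTF8Ratio` and `skipReason` are hypothetical, and measuring the ratio per byte is an assumption about how such a threshold could be defined. Only the 0.1 default and the 0x00-means-binary rule come from the description above.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// invalidUTF8Ratio returns the fraction of bytes in data that do not belong
// to a valid UTF-8 sequence. Empty input yields 0.
// (Hypothetical helper; the per-byte definition is an assumption.)
func invalidUTF8Ratio(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	invalid := 0
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// DecodeRune reports an invalid byte as (RuneError, 1).
		if r == utf8.RuneError && size == 1 {
			invalid++
		}
		i += size
	}
	return float64(invalid) / float64(len(data))
}

// skipReason returns an empty string if the document should be indexed,
// or a human-readable reason if it should be skipped.
func skipReason(data []byte, maxInvalidRatio float64) string {
	if bytes.IndexByte(data, 0) >= 0 {
		return "contains a NUL byte, treated as binary"
	}
	if ratio := invalidUTF8Ratio(data); ratio > maxInvalidRatio {
		return fmt.Sprintf("invalid UTF-8 ratio %.3f exceeds %.3f", ratio, maxInvalidRatio)
	}
	return ""
}

func main() {
	// A Latin-1 encoded author name inside otherwise plain ASCII source.
	latin1 := []byte("// Author: Jos\xe9 Garc\xeda\nfunc main() {}\n")
	fmt.Printf("%q\n", skipReason(latin1, 0.1))

	// A NUL byte marks the data as binary regardless of the ratio.
	fmt.Printf("%q\n", skipReason([]byte("ELF\x00\x01\x02"), 0.1))
}
```

With a threshold like this, a source file with a couple of Latin-1 bytes stays under the limit and gets indexed, while a skipped file always comes with a reason that can be printed.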

I would love to get those changes merged, or at least considered for an 
alternate implementation here in the official sources, but I am not sure how 
alive this project is.

Original comment by [email protected] on 21 Nov 2012 at 3:09

GoogleCodeExporter commented on May 5, 2024

The project is not super alive. Mostly the code just works and we
leave it alone. I think the UTF-8 heuristic works pretty well as does
the trigram size heuristic. It's possible to tune these forever, of
course. How many trigrams does your important file have?

I thought that the indexer already printed a message about files it skipped if
you run it in verbose mode, but maybe I am misremembering.

Original comment by [email protected] on 6 Dec 2012 at 4:30

GoogleCodeExporter commented on May 5, 2024

All source files being UTF-8 is a pretty big assumption. A lot of files may be 
Latin-1 etc., which is the most common problem I encountered. Having a random 
European author's name with a diacritic in the source, or some Cyrillic, for 
example, loses a whole file from the index, making codesearch something that 
can't be depended on at all. When I am changing code based on what codesearch 
finds in my codebase, I don't wanna miss some files for this reason. codesearch 
should not be less reliable than a regular grep.

The file I mentioned is around 30K trigrams. It was simple to just add a custom 
limit flag.
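
For anyone wanting to check a file against a trigram limit, the following Go sketch counts distinct byte trigrams. It is an illustration only: `countTrigrams` and `exceedsTrigramLimit` are hypothetical names, and it assumes the limit applies to distinct three-byte sequences, which may not match the indexer's exact accounting.

```go
package main

import "fmt"

// countTrigrams returns the number of distinct three-byte sequences in data.
// (Hypothetical helper; assumes the limit is on distinct byte trigrams.)
func countTrigrams(data []byte) int {
	seen := make(map[uint32]struct{})
	for i := 0; i+3 <= len(data); i++ {
		// Pack three consecutive bytes into one 24-bit key.
		t := uint32(data[i])<<16 | uint32(data[i+1])<<8 | uint32(data[i+2])
		seen[t] = struct{}{}
	}
	return len(seen)
}

// exceedsTrigramLimit reports whether data has more distinct trigrams than
// maxTrigrams allows.
func exceedsTrigramLimit(data []byte, maxTrigrams int) bool {
	return countTrigrams(data) > maxTrigrams
}

func main() {
	src := []byte("abcabcabd")
	fmt.Println(countTrigrams(src))              // 4: abc, bca, cab, abd
	fmt.Println(exceedsTrigramLimit(src, 20000)) // false
}
```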

The indexer misses the warning in a couple of places, mainly because of the 
assumptions it makes about the input data. The one example I recall off the top 
of my head is quietly ignoring symlinked paths (I submitted another patch that 
optionally stops ignoring them).

Original comment by [email protected] on 6 Dec 2012 at 5:47
