
mentat's People

Contributors

a-joshi, biobootloader, chadwhitacre, dheerajck, eltociear, galer7, granawkins, hesiod-au, jakethekoenig, laszlovandenhoek, nervousapps, nilshamerlinck, pcswingle, rogeriochaves, rupurt, swinglejohn, vickunwu, waydegg, xadhrit, yodon


mentat's Issues

petals.dev support

Hey, I think there should be support for the petals.dev network, which uses a bittorrent-style method of serving LLMs like LLaMA 2 and Stable Beluga 2 on a network of machines.

Add optional predefined set of user prompts

Add a config option to have a list of prompt aliases to use while being in session.

For example, we might then add single-word aliases for commonly used custom prompts such as:

  • "debug": "Try to identify issues with this code"
  • "refactor": "Refactor this code, DRY it up"
  • "cleanup": "Identity which parts of the code are not used anywhere"
  • etc

These could then be used while being inside of the session by writing a single word instead of the whole sentence.
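A minimal sketch of how alias expansion could work in the input loop; the alias names and the `expand_alias` helper are hypothetical, not part of Mentat:

```python
# Hypothetical alias table; in practice this would come from the user's config.
PROMPT_ALIASES = {
    "debug": "Try to identify issues with this code",
    "refactor": "Refactor this code, DRY it up",
    "cleanup": "Identify which parts of the code are not used anywhere",
}

def expand_alias(user_input: str) -> str:
    # A single-word input that matches an alias is replaced by the full prompt;
    # everything else passes through unchanged.
    return PROMPT_ALIASES.get(user_input.strip(), user_input)

assert expand_alias("debug") == "Try to identify issues with this code"
assert expand_alias("fix the failing test") == "fix the failing test"
```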

Refactor: keep `git_root` private to `CodeFileManager`

Right now, git_root gets passed around a lot, including being stored in each CodeChange object. This is necessary because CodeFileManager.file_lines maps full paths to files.

To clean this up, we should refactor to have CodeFileManager.file_lines keys just be the relative file paths within the project (from the git root). CodeFileManager is the only thing that needs to know the actual path to the git project on the system, since it reads / writes to files.
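A rough sketch of the proposed keying scheme; the class and methods here are simplified stand-ins, not Mentat's actual implementation:

```python
from pathlib import Path

class CodeFileManager:
    """Sketch: only this class knows the absolute git root, and file_lines is
    keyed by paths relative to that root."""

    def __init__(self, git_root: str):
        self._git_root = Path(git_root)            # kept private to this class
        self.file_lines: dict[Path, list[str]] = {}  # relative path -> lines

    def _read_file(self, rel_path: Path) -> list[str]:
        # Absolute paths are resolved only at the point of file I/O.
        with open(self._git_root / rel_path) as f:
            return f.read().split("\n")

    def add_file(self, rel_path: str) -> None:
        rel = Path(rel_path)
        self.file_lines[rel] = self._read_file(rel)
```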

Add autocomplete for file, class, and function names

Using ctags (#66) to get relevant names, autocomplete them as the user types. Right now we have autocomplete for prompt history, but that's just for the entire request the user types, not at the word level. I'm not sure what switching that will look like.

This will probably have some conflicts with #59, since that also changes user input. Might be good to merge that first.
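One possible shape for this, assuming ctags is installed and using prompt_toolkit's `WordCompleter`; whether Mentat's input loop uses prompt_toolkit this way is an assumption, and the helper names are illustrative:

```python
import subprocess
from prompt_toolkit import PromptSession
from prompt_toolkit.completion import WordCompleter

def symbol_names(repo_path: str) -> list[str]:
    # `ctags -R -x` prints a cross-reference listing; the first column is the tag name.
    out = subprocess.run(
        ["ctags", "-R", "-x", repo_path], capture_output=True, text=True
    ).stdout
    return sorted({line.split()[0] for line in out.splitlines() if line.strip()})

def prompt_with_completion(repo_path: str) -> str:
    completer = WordCompleter(symbol_names(repo_path), ignore_case=True)
    # complete_while_typing gives word-level suggestions as the user types.
    return PromptSession().prompt(
        ">>> ", completer=completer, complete_while_typing=True
    )
```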

Add an option to run Mentat on the diff between two commits or branches

Currently, Mentat is aware of uncommitted changes to files in context (the git diff for these files).

It'd be nice if you could make it aware of changes that have been committed, e.g. have it look at the changes from the last 3 commits, or the difference between two branches (a PR review assistant?). This could make it more helpful at continuing changes you are working on, or at helping you review a PR.

This could also be a way to automatically give context - instead of manually choosing files, just look at all the files that have changes.
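A hedged sketch of how a committed-diff option might gather context; the `--diff <range>` idea and the helper below are hypothetical, but the git invocations are standard:

```python
import subprocess

def committed_diff(ref_range: str, git_root: str = ".") -> tuple[list[str], str]:
    """Return (changed files, unified diff) for a range like 'HEAD~3..HEAD'
    or 'main..feature'."""
    files = subprocess.run(
        ["git", "-C", git_root, "diff", "--name-only", ref_range],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    diff = subprocess.run(
        ["git", "-C", git_root, "diff", ref_range],
        capture_output=True, text=True, check=True,
    ).stdout
    return files, diff
```

The `--name-only` list could also drive automatic context selection, as suggested above: include exactly the files that appear in the range.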

Add Command Line Option for Model Selection

Problem Statement:

The current implementation of Mentat does not allow users to choose which language model to use, including the more advanced GPT-4 and its 32K version.

Proposed Solution:

  • Implement a command-line option, for example, --model, that lets users specify the language model they want to use.
  • Include validation to ensure the specified model is supported.

Acceptance Criteria:

  • A user can specify the model via the command line.
  • The system validates the model name against the supported models and returns an error if it's not supported.

Priority: High

🛠️ This feature is key to user customization and leveraging the full potential of the latest language models.
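One way this could look with argparse; the `--model` flag name comes from the proposal, while the supported-model tuple and wiring are only illustrative:

```python
import argparse

# Example model list for illustration; the real supported set would live in
# Mentat's config rather than a hard-coded tuple.
SUPPORTED_MODELS = ("gpt-4", "gpt-4-32k", "gpt-3.5-turbo")

def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="mentat")
    parser.add_argument("paths", nargs="*", default=["."])
    parser.add_argument(
        "--model",
        default="gpt-4",
        choices=SUPPORTED_MODELS,  # argparse rejects unsupported names with an error
        help="language model to use for responses",
    )
    return parser.parse_args(argv)
```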

Experiment with message ordering

Currently the conversation we send to the model is structured like this:

  1. initial system prompt, describing the edit format that needs to be followed
  2. user message
  3. assistant response
  4. user message
  5. system "code message", containing all the context of the codebase (included files, diffs, codemaps, etc)

(In this example the model has already responded once; the structure is the same for shorter or longer conversations, always with a system "code message" as the final message.)

This order is likely not optimal, because models seem to pay the most attention to the end of the prompt, which here we are filling with code context. My best guess for a better ordering would be:

  1. system prompt describing edit format
  2. system "code message"
  3. user message
  4. assistant message
  5. user message

This would put the most important thing last (the user message). It may be better to reverse the order of (1) and (2). We should experiment to see if this improves performance at all. However running these experiments effectively will require longer context benchmarks than we currently have.
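A sketch of the proposed ordering as an OpenAI-style message list; the function name and the shape of `history` are illustrative:

```python
def build_messages(system_prompt, code_message, history, user_message):
    """Proposed ordering: edit-format prompt, then code context, then the
    running conversation, so the latest user message comes last."""
    messages = [
        {"role": "system", "content": system_prompt},  # 1. edit format
        {"role": "system", "content": code_message},   # 2. code context
    ]
    messages += history                                 # 3-4. prior user/assistant turns
    messages.append({"role": "user", "content": user_message})  # 5. latest request
    return messages
```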

Add difficult benchmark that forces use of signatures in code maps

Once code maps are merged (#66), create a new benchmark that requires the model to use the function signature in the code map correctly to pass. Perhaps this could be in the form of it using methods on a class defined in a file not included in context.

It should be difficult enough that the pass rate is ~40%-70%, which gives us something we can realistically aim to improve relatively quickly.

Generate-Context Algorithm

The way we currently generate context is:

  1. Add files that the user has selected
  2. Add diff annotations to those files for the diff or pr-diff they select
  3. Calculate how many tokens we've used. If it's over the model's max, throw an error.
  4. Else, if no_code_map is false:
    a) Try to include filename/functions/signatures
    b) If it's too big, try to include filename/functions
    c) If it's too big, try to include just filenames

At 4), we want to use the remaining context in the most valuable way - not just fill it in with code_map. To do this we will:

  • a) Make a list of all the features we could potentially include. Would include (mentat/app.py, 'code'), (mentat/app.py, 'diff'), (mentat/app.py, 'cmap_signatures') etc. for different features of a file, as well as smaller chunks, e.g. (mentat/app.py:run 'code'), (mentat/app.py:loop 'code'). Chunks within files should cover the entire file without overlap.
  • b) Assign a relevance score to each feature based on (i) its embedding, relative to the current prompt (ii) if/how it relates to user-specified paths/diff, (iii) which functions it calls and is called by, etc.
  • c) Divide the score by some length factor - maybe the literal number of tokens, maybe a parameter like cmap_signature_weight. Just want to prioritize higher-density information.
  • d) Sort all the features by score, and add them one by one until the context is full. If there are overlap conflicts, e.g. <file>:<func> is already included and you add <file>, keep the higher-level item (see the sketch below).
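Roughly, step d) could look like the following greedy fill. This is a sketch with hypothetical names; scoring comes from steps b)-c), and overlap handling is only hinted at in a comment:

```python
def fill_context(features, score, token_limit):
    """`features` is a list of (feature, token_count) pairs; `score(feature)`
    is the relevance score already divided by the length factor."""
    chosen, used = [], 0
    for feature, tokens in sorted(features, key=lambda ft: score(ft[0]), reverse=True):
        if used + tokens > token_limit:
            continue  # doesn't fit; keep trying smaller features
        # A real version would also drop a contained <file>:<func> feature
        # when the whole <file> is already included (overlap resolution).
        chosen.append(feature)
        used += tokens
    return chosen
```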

Happy for questions or suggestions on the approach! My plan moving forward is:

  • Move the get_code_message to CodeFile, update diff and codemaps to work on individual files. Eventually CodeFile will become CodeFeature and can be anything in a).
  • Setup a refresh workflow and caching of code message
  • Build a basic version of the algo using just diff and codemaps
  • Add embeddings (with some type of persistent storage) and use to prioritize items in b)
  • Add Tree-sitter to parse files into smaller chunks.

Rate limit error

Am I the only one running into this?

File "/minecraft/CODE/mentat/mentat/mentat/app.py", line 36, in run
loop(paths, cost_tracker)
File "/minecraft/CODE/mentat/mentat/mentat/app.py", line 59, in loop
explanation, code_changes = conv.get_model_response(code_file_manager, config)
File "/minecraft/CODE/mentat/mentat/mentat/conversation.py", line 35, in get_model_response
state = run_async_stream_and_parse_llm_response(
File "/minecraft/CODE/mentat/mentat/mentat/parsing.py", line 129, in run_async_stream_and_parse_llm_response
asyncio.run(
File "/opt/conda/envs/Py10/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/envs/Py10/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/minecraft/CODE/mentat/mentat/mentat/parsing.py", line 152, in stream_and_parse_llm_response
response = await call_llm_api(messages, model)
File "/minecraft/CODE/mentat/mentat/mentat/llm_api.py", line 44, in call_llm_api
response = await openai.ChatCompletion.acreate(
File "/opt/conda/envs/Py10/lib/python3.10/site-packages/openai/api_resources/chat_completion.py", line 45, in acreate
return await super().acreate(*args, **kwargs)
File "/opt/conda/envs/Py10/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 217, in acreate
response, _, api_key = await requestor.arequest(
File "/opt/conda/envs/Py10/lib/python3.10/site-packages/openai/api_requestor.py", line 382, in arequest
resp, got_stream = await self._interpret_async_response(result, stream)
File "/opt/conda/envs/Py10/lib/python3.10/site-packages/openai/api_requestor.py", line 726, in _interpret_async_response
self._interpret_response_line(
File "/opt/conda/envs/Py10/lib/python3.10/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
raise self.handle_error_response(
openai.error.RateLimitError: Rate limit reached for 10KTPM-200RPM in organization org-Ab4yA9 on tokens per min. Limit: 10000 / min. Please try again in 6ms. Contact us through our help center at help.openai.com if you continue to have issues.

Feature Request: Add option to have LLM refactor suggestion instead of just y/n

[image: screenshot of Mentat presenting changes in interactive mode]

Would be cool if when presented with multiple changes in interactive mode, you could give feedback for updates to a specific function during that flow.

i.e.

in this example, if I wanted mentat to make the function more concise or to use a different method for calculating primes, I could just ask it to update its suggestion.

your OpenAI API key doesn't have access to gpt-4-0314

I get the error your OpenAI API key doesn't have access to gpt-4-0314 when I try to run mentat. I thought GPT-4 API was generally available now? Do I need to change the code to use a specific version of GPT-4 API?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I got an error when running `mentat .`

app
├── * .gitignore
├── .gitlab-ci.yml
├── Dockerfile
├── README.md
├── main.py
├── helper
│   ├── __init__.py
│   ├── variables.py
├── model
│   ├── .gitignore
│   ├── clf_model.pkl
│   ├── wiki.txt.gz
│   └── word2vec,pkl
├── requirements.txt
├── sample
│   └── .gitignore
└── tests
    ├── mock_training.json
    ├── model
    │   ├── .gitignore
    │   └── wiki.txt.gz
    └── tes_training.py

Total session cost: $0.00
Traceback (most recent call last):
  File "/home/user/app/venv/bin/mentat", line 8, in <module>
    sys.exit(run_cli())
  File "/home/user/app/venv/lib/python3.10/site-packages/mentat/app.py", line 24, in run_cli
    run(paths)
  File "/home/user/app/venv/lib/python3.10/site-packages/mentat/app.py", line 35, in run
    loop(paths, cost_tracker)
  File "/home/user/app/venv/lib/python3.10/site-packages/mentat/app.py", line 48, in loop
    tokens = count_tokens(code_file_manager.get_code_message())
  File "/home/user/app/venv/lib/python3.10/site-packages/mentat/code_file_manager.py", line 260, in get_code_message
    self._read_all_file_lines()
  File "/home/user/app/venv/lib/python3.10/site-packages/mentat/code_file_manager.py", line 257, in _read_all_file_lines
    self.file_lines[abs_path] = self._read_file(abs_path)
  File "/home/user/app/venv/lib/python3.10/site-packages/mentat/code_file_manager.py", line 251, in _read_file
    lines = f.read().split("\n")
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

SSL certificate error

This is the error I'm seeing, please help.

Last login: Wed Jul 26 13:13:02 on ttys002
rileylovett@Rileys-MacBook-Air ~ % /Users/rileylovett/.venv_new/bin/mentat /Users/rileylovett/Discord2
Files included in context:
Discord2
├── .env
├── README.md
├── app.py
├── bot.py
├── boty.py
├── pp.py
└── requirements.txt

File token count: 6671
Type 'q' or use Ctrl-C to quit at any time.

What can I do for you?

explain

Total token count: 7518

Total session cost: $0.00
Traceback (most recent call last):
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/connector.py", line 980, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1112, in create_connection
transport, protocol = await self._create_connection_transport(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1145, in _create_connection_transport
await waiter
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/sslproto.py", line 575, in _on_handshake_complete
raise handshake_exc
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/sslproto.py", line 557, in _do_handshake
self._sslobj.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 979, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/openai/api_requestor.py", line 592, in arequest_raw
result = await session.request(**request_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/client.py", line 536, in _request
conn = await self._connector.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/connector.py", line 540, in connect
proto = await self._create_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/connector.py", line 901, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
raise last_exc
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/aiohttp/connector.py", line 982, in _wrap_create_connection
raise ClientConnectorCertificateError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host api.openai.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)')]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/rileylovett/.venv_new/bin/mentat", line 8, in
sys.exit(run_cli())
^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/app.py", line 24, in run_cli
run(paths)
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/app.py", line 35, in run
loop(paths, cost_tracker)
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/app.py", line 57, in loop
explanation, code_changes = conv.get_model_response(code_file_manager, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/conversation.py", line 52, in get_model_response
state = run_async_stream_and_parse_llm_response(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/parsing.py", line 129, in run_async_stream_and_parse_llm_response
asyncio.run(
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/parsing.py", line 152, in stream_and_parse_llm_response
response = await call_llm_api(messages, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/mentat/llm_api.py", line 46, in call_llm_api
response = await openai.ChatCompletion.acreate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/openai/api_resources/chat_completion.py", line 45, in acreate
return await super().acreate(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 217, in acreate
response, _, api_key = await requestor.arequest(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/openai/api_requestor.py", line 304, in arequest
result = await self.arequest_raw(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/rileylovett/.venv_new/lib/python3.11/site-packages/openai/api_requestor.py", line 609, in arequest_raw
raise error.APIConnectionError("Error communicating with OpenAI") from e
openai.error.APIConnectionError: Error communicating with OpenAI
rileylovett@Rileys-MacBook-Air ~ %

Mentat Debugger

I'm imagining a workflow where you run like mentat --debug python myfile.py arg1 from the terminal, and it:

  1. Tries to run the command in a subprocess (python myfile.py arg1)
  2. If an exception is raised:
    • Initialize a mentat conversation with (1) the complete error message (2) all the functions on the current call stack (can we get individual functions with cmap?)
    • Send it to the llm automatically and initiate a mentat session - stream it back, propose changes, accept/reject, etc
  3. If/when you accept a proposed change, restart the subprocess and try again, check for exception...
  4. When there are no more exceptions, end with a success message.

Anyone else think this would be useful? Maybe it could be implemented as a new interface.
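A rough sketch of the run/fix/retry loop, with the Mentat session abstracted behind a hypothetical `propose_fix` callback (none of these names exist in Mentat today):

```python
import subprocess
from typing import Callable

def debug_loop(command: list[str],
               propose_fix: Callable[[str], bool],
               max_attempts: int = 5) -> bool:
    """Sketch of the proposed `mentat --debug <command>` flow. `propose_fix`
    stands in for a Mentat conversation that sees the error (plus call-stack
    functions) and applies an accepted edit; it returns True if a change was
    accepted."""
    for _ in range(max_attempts):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            print("Command succeeded, no more exceptions.")
            return True
        print("Command failed with:\n" + result.stderr)
        if not propose_fix(result.stderr):
            return False  # user rejected all proposed changes; give up
    return False
```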

Add git subdirectory tests

Add tests for running mentat from a subdirectory of the git repository (we might have them already but I don't think we do; would be good to check first!).

Fix weird behavior with `ctrl+c`

I seem to "randomly" have my ctrl+c keybind disabled during response streaming from OpenAI. KeyboardInterrrupt is getting swallowed somewhere, however I'm still trying to consistently reproduce this behavior.

Run benchmarks in parallel

These take forever to run sequentially, and since we have temperature set to a non-zero value we should run each benchmark several times and average the pass/fail totals.

Conversation class proposal

@biobootloader @waydegg @PCSwingle

I see a lot of indecisiveness around the conversation class, with many things that seem out of scope for it.

May I propose a structure with better separation of concerns and a bit more forethought (possibly as a future dependency)?

Note, I'm thinking ahead here to fully logged, model-agnostic, multi-agent conversations (similar to ChatDev), with model training in mind, where manual & automated training data augmentation can occur with traceability to obey the TOS of any model used. -- I know, that's a bit much, but I think it will ultimately be needed for any such AI pair programming system to scale.

(Mind you, I really want to emphasize fine-tuning here, because even if you favor OpenAI models, GPT-4 will likely be fine-tunable in time, as GPT-3.5 now is.)

I've been prototyping this in AbstractAI, and am currently re-working it locally for readability, simplicity, and easier serialization to SQLite and JSON, plus transport & syncing via Flask. But when I saw the number of changes to the conversation class, which I'm currently using to log my mentat conversations on my instance, I wondered whether a discussion was warranted.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# `internal_field` is a helper from the author's AbstractAI prototype and is
# not defined here (presumably a `field` variant excluded from serialization).

@dataclass
class MessageSource:
	pass
	
@dataclass
class Message:
	content: str
	
	creation_time: datetime = field(default_factory=datetime.now)
	
	prev_message: "Message" = None
	conversation: "Conversation" = None
	
	source: MessageSource = None
	
	_children: List["Message"] = internal_field(default_factory=list)

@dataclass
class MessageSequence:
	conversation: "Conversation"
	messages: List[Message] = field(default_factory=list)
	
	def add_message(self, message:Message):
		message.prev_message = None if len(self.messages) == 0 else self.messages[-1]
		self.messages.append(message)
	
@dataclass
class Conversation:
	name: str
	description: str
	
	creation_time: datetime = field(default_factory=datetime.now)
	
	message_sequence:MessageSequence = None
	
	_all_messages: List[Message] = internal_field(default_factory=list)
	_root_messages: List[Message] = internal_field(default_factory=list)
	
	def __post_init__(self, message_sequence:MessageSequence = None):
		if message_sequence is None:
			self.message_sequence = MessageSequence(self)
			
	def add_message(self, message:Message):
		self._all_messages.append(message)
		self.message_sequence.add_message(message)
		if message.prev_message is None:
			self._root_messages.append(message)
		else:
			message.prev_message._children.append(message)

@dataclass
class EditSource(MessageSource):
	original: Message
	new: Message
	new_message_source: MessageSource
	
@dataclass
class ModelSource(MessageSource):
	model_name: str
	model_parameters: dict
	message_sequence: MessageSequence

@dataclass
class UserSource(MessageSource):
	user_name: str = None

Real World Performance Benchmark

In order to get a realistic sense of Mentat's performance, it'd be best to compare the code it generates to human-generated code. We could pick out some high-quality open source GitHub repos with good standards around writing commit messages, issues, and pull requests, check out the code before a commit, and prompt Mentat to produce a diff that accomplishes $COMMIT_MESSAGE toward solving $ISSUE. We can then evaluate it in two ways:

  1. Does it modify the same files as the original commit?
  2. We can feed both diffs to gpt and ask which is higher quality.

I'm interested in evaluating (1) even though it's not that informative on its own, because it gives us a benchmark to track as we implement ideas aimed at expanding the model's effective context (Issue 3).

We can discuss other ideas for benchmarks here and break good ideas out into separate issues.
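A sketch of evaluation (1): compare the files Mentat touched against the files in the original commit. The helper names are hypothetical; the git commands are standard:

```python
import subprocess

def files_changed(commit: str, repo: str = ".") -> set[str]:
    """Files modified by `commit`, via `git show --name-only`."""
    out = subprocess.run(
        ["git", "-C", repo, "show", "--name-only", "--pretty=format:", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line.strip()}

def file_overlap_score(mentat_files: set[str], commit: str, repo: str = ".") -> float:
    """Fraction of the original commit's files that Mentat also modified."""
    original = files_changed(commit, repo)
    return len(mentat_files & original) / len(original) if original else 1.0
```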

File "/home/gussand/github/mentat/mentat/app.py", line 117

Hi:
Great demos, but when I installed from GitHub and tried to run my local copy I always get this error. I created its own virtual environment and all. Any ideas?
Thank you so much! Very excited to try it.

[1] % mentat .
Traceback (most recent call last):
File "/home/gussand/.venv/bin/mentat", line 5, in
from mentat.app import run_cli
File "/home/gussand/github/mentat/mentat/app.py", line 117
match user_response.lower():
^
SyntaxError: invalid syntax

Feature request: Automated file selection and indexing for streamlined workflow

The inclusion of an automated file selection feature could greatly enhance workflow efficiency, eliminating the need for manual selection of relevant files. Here's a proposed implementation:

Automated Descriptions with Local Indexing: We could introduce a locally stored index.json file containing automatically generated descriptions of each folder and file, outlining their respective roles.

Index Updating Command: We could create a command, for instance, mentat index (/optional/path), designed to keep the index.json file current.

Query-Based File Suggestion: With a command like mentat code $query, the system could leverage the index.json file to determine and suggest the most relevant files based on the user's query. The user could then either confirm the selection or provide feedback.

Seamless Integration with Existing Workflow: Once the appropriate files are selected, the user can proceed with their usual editing tasks.
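For illustration, the index and the query-based suggestion could be as simple as the sketch below; the index shape, field names, and matching logic are all hypothetical:

```python
# Hypothetical shape of the proposed index.json, shown as a Python dict.
EXAMPLE_INDEX = {
    "mentat/app.py": {
        "description": "CLI entry point; wires up the user input loop.",
    },
    "mentat/code_file_manager.py": {
        "description": "Reads and writes files in context and applies code changes.",
    },
}

def suggest_files(query: str, index: dict) -> list[str]:
    # Naive keyword match for illustration; the issue imagines something
    # smarter (embeddings, call graphs, etc.).
    words = set(query.lower().split())
    return [
        path for path, meta in index.items()
        if words & set(meta["description"].lower().split())
    ]
```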

Implement Auto-Test and Auto-Debug Features

Problem Statement:

Mentat currently lacks the ability to automatically test and debug code.

Proposed Solution:

  • Integrate a testing framework that can automatically run unit tests on the code.
  • Develop an auto-debugging feature to identify and flag potential issues.

Acceptance Criteria:

  • Auto-testing of code is possible and reports are generated.
  • Auto-debugging flags issues and possibly suggests fixes.

Priority: Medium

🔍 This will drastically speed up the development and debugging process.

Refactor CodeChange code

Refactor CodeChange into a concrete and an abstract format: the concrete format would manage parsing the model's output and would be able to convert into an abstract change. The abstract change would use an internal format and would handle actually changing files. This would allow us to easily create new formats and test them out without having to change anything outside of the parsing (and maybe streaming?) logic. After talking with @biobootloader, here is the current plan:

Overview

Concrete Change: Contains logic for parsing different model formats; can be converted to an abstract change. Can also be converted from an abstract change so that we can automatically generate examples for different models (and possibly tests) from a single example set.

Additions/Deletions: Make up an Abstract Change. These will be the simplest possible changes, either adding a block of code at a certain line or deleting a certain block of code. Ideally any conflicts will be resolved by the concrete change parser, but even if they aren't the Additions and Deletions will be processed in order and so conflicts shouldn't be an issue.

Abstract Change: Will be stored per file; each file will contain a list of Additions and Deletions; additionally, each Addition and Deletion will be marked with a specific Concrete Change; this will allow the user to use interactive mode to accept or reject (and eventually maybe ask the model to change?) specific changes that it outputs.
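A minimal sketch of what Additions, Deletions, and an Abstract Change could look like; names, fields, and the bottom-up apply order are illustrative, not the final design:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Union

@dataclass
class Addition:
    """Insert `new_lines` before 1-indexed line `line`."""
    line: int
    new_lines: list[str]

@dataclass
class Deletion:
    """Delete 1-indexed lines `start` through `end`, inclusive."""
    start: int
    end: int

@dataclass
class AbstractChange:
    """Per-file additions/deletions; a real version would also tag each edit
    with the Concrete Change it came from, for interactive accept/reject."""
    file: str
    edits: list[Union[Addition, Deletion]] = field(default_factory=list)

    def apply(self, lines: list[str]) -> list[str]:
        new = list(lines)
        # Apply bottom-up so earlier line numbers stay valid.
        def key(e): return e.line if isinstance(e, Addition) else e.start
        for edit in sorted(self.edits, key=key, reverse=True):
            if isinstance(edit, Deletion):
                del new[edit.start - 1 : edit.end]
            else:
                new[edit.line - 1 : edit.line - 1] = edit.new_lines
        return new
```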

Plan

  1. Create Abstract Changes, Additions/Deletions, and modify CodeFileManager to use them when applying changes
  2. Create tests for Abstract Changes
  3. Create Concrete Change basic interface
  4. Look into seeing if possible to have unified streaming logic for different Concrete Change formats; if so, create it
  5. Migrate current format parsing and streaming over to Concrete Change
  6. Test current format Concrete Change
  7. Create two-way conversion between Abstract Change and Concrete Change
  8. Add testing for conversion
  9. Add un-parsing for current format Concrete Change -- this will be used for tests and prompt examples
  10. Create Abstract Change examples and tests, and use automatic conversion to Concrete Change all the way to pre-parsed text for both prompt examples and tests

Possible alternative?

Another design choice we could have made is to have Abstract Changes represent an entire file, and simply be the text of the file after applying all changes, rather than having Additions and Deletions; this would resolve some problems with conflicts and renaming, but wouldn't let us reference specific Concrete Changes from the Abstract Changes, meaning we would most likely end up simply using hunks (groups of changed lines) in interactive mode. For now, @biobootloader and I have decided to go with the previous plan to allow interactive mode to be more powerful.

Cost estimation to run `mentat`

Hi, nice work!

I just wanted to ask if you have any cost estimations for the usage of the tool. It would be really nice to see some numbers on what we can expect.

Set up the CodeFileManager to allow for partial files

This is a necessary step for working with larger codebases that don't fit in context. We'll have some sort of automatic process to gather key locations to include in context, instead of just user added files. This is also necessary to run Mentat at all on single files that exceed context limits.

It's not clear how users could easily add sections of files themselves, but that's not the primary goal here anyway. The requirements are:

  • Refactor the CodeFileManager to keep track of "file segments" instead of whole files that are in context. Perhaps for each file it would also store a list of tuples, encoding the start and end line numbers of included sections?
  • Add tests that use this, including making edits on files
  • Ensure edits are only made inside included file segments - the model may try to make edits outside of these sections
  • Figure out the best way to provide file segments to the LLM in CodeFileManager.get_code_message
  • Keep in mind that the code message also includes the current git diff for each file, which may display parts of the file that aren't in the included file segments. That's probably fine or even good (we want the LLM to be aware of current diff).

There are probably several details and decisions to be made here that I haven't thought of yet!
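A sketch of the "file segments" idea as inclusive line intervals per file; the class name, fields, and rendering format are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class FileSegments:
    """Inclusive 1-indexed (start, end) intervals of a file that are in context."""
    path: str
    intervals: list[tuple[int, int]] = field(default_factory=list)

    def contains(self, line: int) -> bool:
        # Used to reject model edits that fall outside the included segments.
        return any(start <= line <= end for start, end in self.intervals)

    def code_message(self, lines: list[str]) -> str:
        # Render only the included segments, marking each region so the model
        # knows the file continues outside its view.
        parts = []
        for start, end in sorted(self.intervals):
            parts.append(f"... lines {start}-{end} of {self.path} ...")
            parts.extend(
                f"{i}: {lines[i - 1]}"
                for i in range(start, min(end, len(lines)) + 1)
            )
        return "\n".join(parts)
```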

How do I change the OpenAI model?

Hello! I love this project and it is honestly really useful. But the cost is a lot when using GPT-4. Is there a way to manually change the model from GPT-4 to GPT-3.5 Turbo? I thought I had seen an issue about this before, but I do not see it anymore (not in the closed issues either). There might be a way to do this already; if there is, I have not found it yet, sorry.

Regarding the project, it is yet another example of what great things can be created when great developers and AI work together. Thanks for creating it!

If the codebase is too large, retrieve a subset of the codebase

If we incorporate the entire codebase in the prompt sent to the LLM, we might overflow the maximum context length allowed by the LLM. So if the codebase is large, maybe we want to add a retrieval step where we identify the pieces of code most relevant to the user's instructions and only add those to the LLM input.

Generic Interface PR follow-ups

This issue is a place to collect follow-ups from: #119

Just edit and add to this list:

  • Typing for color argument in StreamMessage. Since we just copied the cprint api with .send, we can/should have types for colors here that the static type checker verifies are correct
    ...

Install fails while following set up video: ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.

Traceback (most recent call last):
  File "/Users/.../.pyenv/versions/3.11.0/bin/mentat", line 8, in <module>
    sys.exit(run_cli())
             ^^^^^^^^^
  File "/Users/.../.pyenv/versions/3.11.0/lib/python3.11/site-packages/mentat/app.py", line 24, in run_cli
    run(paths)
  File "/Users/.../.pyenv/versions/3.11.0/lib/python3.11/site-packages/mentat/app.py", line 35, in run
    loop(paths, cost_tracker)
  File "/Users/.../.pyenv/versions/3.11.0/lib/python3.11/site-packages/mentat/app.py", line 48, in loop
    tokens = count_tokens(code_file_manager.get_code_message())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../.pyenv/versions/3.11.0/lib/python3.11/site-packages/mentat/llm_api.py", line 57, in count_tokens
    return len(tiktoken.encoding_for_model("gpt-4").encode(message))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../.pyenv/versions/3.11.0/lib/python3.11/site-packages/tiktoken/core.py", line 117, in encode
    raise_disallowed_special_token(match.group())
  File "/Users/.../.pyenv/versions/3.11.0/lib/python3.11/site-packages/tiktoken/core.py", line 351, in raise_disallowed_special_token
    raise ValueError(
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
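The error message itself points at one possible fix: mirror the `count_tokens` call shown in the traceback but pass `disallowed_special=()` so the text is encoded as normal text. This is a sketch of that workaround, not necessarily how Mentat should handle it:

```python
import tiktoken

def count_tokens(message: str) -> int:
    # disallowed_special=() makes tiktoken treat special-token text such as
    # '<|endoftext|>' as ordinary text instead of raising ValueError.
    return len(
        tiktoken.encoding_for_model("gpt-4").encode(message, disallowed_special=())
    )
```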

Automatic failure analysis during benchmarking

We are experimenting with edit specification formats for the models. We benchmark performance using different formats, but that just tells us how successful they are. It'd be very helpful to understand why some edits are bad, i.e. did they lead to duplicated lines or accidentally deleted lines?

By sending a unified diff of the result after applying each edit (or multiple edits) to GPT-4, we can ask it what went wrong and have it list issues for us. We could define issues for it to use as labels, such as duplicate-line, mis-deleted-line, incorrect-insert-location, etc., as well as letting it give a freeform response for errors that don't match.

This would be something we could run during / after benchmarking.
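A sketch of how the analysis prompt could be assembled from a unified diff plus the label set; the label names come from the issue, everything else is illustrative:

```python
import difflib

FAILURE_LABELS = [
    "duplicate-line", "mis-deleted-line", "incorrect-insert-location", "other",
]

def failure_analysis_prompt(expected: str, actual: str, path: str) -> str:
    """Build the GPT-4 prompt: a unified diff of expected vs. actual file
    contents plus the labels it should classify the failure into."""
    diff = "\n".join(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile=f"{path} (expected)", tofile=f"{path} (after edits)", lineterm="",
    ))
    return (
        "The following diff shows how an automated edit diverged from the "
        "intended result. Classify what went wrong using one or more of these "
        f"labels: {', '.join(FAILURE_LABELS)}. Explain briefly if none fit.\n\n"
        + diff
    )
```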

Use embeddings of files/functions to select context for prompt

We'd like mentat to select files related to a prompt based on how similar their embeddings are. I've set up a test in a Jupyter Notebook (I can share directly). The process is:

  1. Get embeddings from openai using text-embedding-ada-002. We can send up to 8192 tokens per API call, either a string or a list of strings (batch).
  2. I use ast and astor to parse Python functions from the mentat code. There doesn't seem to be a one-stop solution to parsing different languages, but we have some leads.
  3. I use cosine similarity on the embeddings of the prompt vs each file/function, then sort them and return the top N.

I wrote 3 test prompts, with a list of files and functions I expected to see. Then I ran the algo and returned the top 10, and counted how many of the expected were included.
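For reference, the core of that experiment can be sketched with the pre-1.0 OpenAI client (as used elsewhere in this codebase) and NumPy; the helper names are mine, not the notebook's:

```python
import numpy as np
import openai

def embed(texts: list[str]) -> np.ndarray:
    # One batched call; the API accepts a list of strings per request.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

def top_n(prompt: str, chunks: list[str], n: int = 10) -> list[str]:
    vectors = embed([prompt] + chunks)
    prompt_vec, chunk_vecs = vectors[0], vectors[1:]
    # Cosine similarity of each file/function chunk against the prompt.
    sims = chunk_vecs @ prompt_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(prompt_vec)
    )
    order = np.argsort(sims)[::-1][:n]
    return [chunks[i] for i in order]
```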

Results

Prompt: Give me a summary of how asyncio is used in the code.
Files: 3/5
Functions: 1/4

Prompt: Add a new pytest fixture function in conftest to call git commands and return the results. Then, replace calls to subprocess/git in the tests with that new fixture.
Files: 3/4
Functions: 3/10

Prompt: Allow the user to use gpt-3.5 instead of gpt-4, but give them a warning. Add relevant, cost, metadata, etc for the new model, as well as tests and stats.
Files: 2/3
Functions: 2/7

Separate Core Logic from Terminal Interface

To prepare for future editor extensions and other interfaces, we need to separate the core logic from the terminal interface (both input and output). The goal is to allow for multiple interfaces: terminal, VSCode extension, NeoVim extension, etc

Add a test of code map fallback behavior

Once code maps are merged (#66), write a test (not a benchmark, it should run on GH actions) that checks that calling CodeMap.get_message with different token_limits correctly transitions between a full map (with signatures), a map without signatures, and just a file map.

Reason for test: this is an important feature that would be easy to break by accident without immediately noticing.
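A hedged sketch of the assertion logic; `get_message(token_limit=...)` is taken from the issue text, while the token limits and the signature check are placeholders a real test would tune to the testbed's actual sizes:

```python
def check_code_map_fallback(code_map, big=100_000, medium=5_000, small=500):
    """`code_map` is whatever object exposes CodeMap.get_message."""
    full = code_map.get_message(token_limit=big)          # full map with signatures
    no_sigs = code_map.get_message(token_limit=medium)    # map without signatures
    names_only = code_map.get_message(token_limit=small)  # just the file map

    assert len(names_only) < len(no_sigs) < len(full)
    assert "(" in full and "(" not in no_sigs             # crude signature check
```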

ability to ignore files/folders?

Does it have an ability to ignore files or folders? For example, I've got a large project with node_modules or other files that I would like to exclude from the context. Or I would like to work only on particular files?

ValueError: 'rename-file' is not a valid CodeChangeAction

getting this error when getting mentat to rename a JS file

running the repo locally on Ubuntu 22.04.2, haven't tried on any other OS

I'm not a Python guy, so apologies if it's an obvious env thing.
Here's how I recreated it, along with the stack trace:

What can I do for you?
>>> rename the JS file to app.js, and update package.json with the new filename

Total token count: 1458

streaming...  use control-c to interrupt the model at any point

I will rename the JS file to app.js and update the package.json with the new filename.

Steps:
1. Rename the download-data.js file to app.js.
2. Update the package.json file with the new filename.

2023-07-26 23:48:42,828 - ERROR - an error occurred during closing of asynchronous generator <async_generator object APIRequestor.arequest.<locals>.wrap_resp at 0x7f275e2a23c0>
asyncgen: <async_generator object APIRequestor.arequest.<locals>.wrap_resp at 0x7f275e2a23c0>
Traceback (most recent call last):
  File "/home/zhoug/.local/lib/python3.10/site-packages/openai/api_requestor.py", line 324, in wrap_resp
    yield r
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhoug/.local/lib/python3.10/site-packages/openai/api_requestor.py", line 326, in wrap_resp
    await ctx.__aexit__(None, None, None)
  File "/usr/lib/python3.10/contextlib.py", line 206, in __aexit__
    await anext(self.gen)
RuntimeError: anext(): asynchronous generator is already running

Total session cost: $0.00
Traceback (most recent call last):
  File "/home/zhoug/.local/bin/mentat", line 8, in <module>
    sys.exit(run_cli())
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/app.py", line 24, in run_cli
    run(paths)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/app.py", line 35, in run
    loop(paths, cost_tracker)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/app.py", line 57, in loop
    explanation, code_changes = conv.get_model_response(code_file_manager, config)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/conversation.py", line 52, in get_model_response
    state = run_async_stream_and_parse_llm_response(
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/parsing.py", line 129, in run_async_stream_and_parse_llm_response
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/parsing.py", line 158, in stream_and_parse_llm_response
    await _process_response(state, response, printer, code_file_manager)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/parsing.py", line 178, in _process_response
    _process_content_line(state, content_line, printer, code_file_manager)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/parsing.py", line 201, in _process_content_line
    state.new_line(code_file_manager)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/parsing.py", line 93, in new_line
    self.create_code_change(code_file_manager)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/parsing.py", line 71, in create_code_change
    CodeChange(json_data, self.code_lines, self.git_root, code_file_manager)
  File "/home/zhoug/.local/lib/python3.10/site-packages/mentat/code_change.py", line 50, in __init__
    self.action = CodeChangeAction(self.json_data["action"])
  File "/usr/lib/python3.10/enum.py", line 385, in __call__
    return cls.__new__(cls, value)
  File "/usr/lib/python3.10/enum.py", line 710, in __new__
    raise ve_exc
ValueError: 'rename-file' is not a valid CodeChangeAction

Segmentation fault - but generates some code

This plan should provide a solid foundation for the implementation of the desired functionality.

stream_handler error: 'content'
{}

--------smol dev done!---------
Segmentation fault

Benchmarking: track output tokens required

Different edit formats can improve the success rate of Mentat, but not all edit formats are as concise. For example, we could ask GPT-4 to rewrite entire files, which would be easier than specifying edits that involve insertions/deletions. But this would be slow (and more expensive!) to use.

To keep track of this we should record how many tokens the models generate while using each edit format during benchmarks. Comparison is tricky as different tasks will require different numbers of tokens. Perhaps we could choose a set of exercises that are all solved on the first attempt when using each edit format.

Fix `test_start_project_from_scratch` benchmark

The test_start_project_from_scratch benchmark is raising a FileNotFoundError on the main branch.

It's looking for calculator.py, which shouldn't be needed in this test anyway 😕

===================================== test session starts ======================================
platform darwin -- Python 3.11.4, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/waydegg/ghq/github.com/biobootloader/mentat
plugins: reportlog-0.4.0, mock-3.11.1, repeat-0.9.1
collected 34 items

tests/benchmark_test.py ....F                                                            [100%]

=========================================== FAILURES ===========================================
_______________________________ test_start_project_from_scratch ________________________________

mock_collect_user_input = <MagicMock id='4711084368'>

    def test_start_project_from_scratch(mock_collect_user_input):
        # Clear the testbed so we can test that it works with empty directories
        for item in os.listdir("."):
            if os.path.isfile(item):
                os.remove(item)
            elif os.path.isdir(item):
                if item != ".git":
                    shutil.rmtree(item)

        mock_collect_user_input.side_effect = [
            "make a file that does fizzbuzz, named fizzbuzz.py, going up to 10",
            "y",
            KeyboardInterrupt,
        ]
>       run(["."])

/Users/waydegg/ghq/github.com/biobootloader/mentat/tests/benchmark_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/app.py:70: in run
    loop(paths, exclude_paths, cost_tracker)
/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/app.py:92: in loop
    code_file_manager = CodeFileManager(
/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/code_file_manager.py:110: in __init__
    self._set_file_paths(paths, exclude_paths)
/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/code_file_manager.py:143: in _set_file_paths
    file_paths_direct, file_paths_from_dirs = _abs_file_paths_from_list(
/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/code_file_manager.py:90: in _abs_file_paths_from_list
    file_paths_from_dirs.update(
/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/code_file_manager.py:92: in <lambda>
    lambda f: (not check_for_text) or _is_file_text_encoded(f),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

file_path = '/private/var/folders/f0/l3msd5g16kn5vkkrdy0hbsnw0000gn/T/tmphjm0i8zm/testbed/multifile_calculator/calculator.py'

    def _is_file_text_encoded(file_path):
        try:
            # The ultimate filetype test
>           with open(file_path) as f:
E           FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/f0/l3msd5g16kn5vkkrdy0hbsnw0000gn/T/tmphjm0i8zm/testbed/multifile_calculator/calculator.py'

/Users/waydegg/ghq/github.com/biobootloader/mentat/mentat/code_file_manager.py:65: FileNotFoundError
------------------------------------- Captured stdout call -------------------------------------

Total session cost: $0.00
=================================== short test summary info ====================================
FAILED tests/benchmark_test.py::test_start_project_from_scratch - FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/f0/l3msd5g16k...
=========================== 1 failed, 4 passed in 126.61s (0:02:06) ============================

Add mypy to github action lint checks

I'm assuming we're using mypy since we have mypy-extensions as a dependency in requirements.txt; however, we don't have any checks set up in GH actions. It would be pretty trivial to add a command that checks for this.

@biobootloader any specific reason for choosing mypy over Pyright? I'm not sure what the best choice would be, tbh.
