c-loftus / sight-free-talon

Integrate Talon voice dictation commands with TTS, screen readers, braille, and more!

Home Page: http://colton.place/sight-free-talon/

License: GNU General Public License v3.0

Languages: Python 86.60%, Talon 6.84%, Shell 0.78%, Go 3.30%, AppleScript 1.22%, Smarty 0.43%, CSS 0.48%, PowerShell 0.36%
Topics: talonvoice, accessibility, nvda, eyestrain, screenreader, voice-dictation, hci, human-computer-interaction, blind, dictation

sight-free-talon's Introduction

Hi, I'm Colton!

I love working on projects related to accessibility, Linux, and UX design.

I graduated in 2023 from Princeton, focusing on ML and HCI. I currently work in healthcare consulting and am most experienced in Python, JS, and Rust.

Please feel free to reach out to me on my website

sight-free-talon's People

Contributors: c-loftus, firechickenproductivity, pre-commit-ci[bot]

sight-free-talon's Issues

Support OpenAI TTS Model

It would be useful to have a way to query the new OpenAI text-to-speech model. It probably would not be economical to use in large quantities, but it may be helpful for certain specialized cases.

Support VoiceOver

I do not have a Mac, so I have no way of testing or developing this at the moment. I would appreciate any help, or the ability to test-run code.

Create command to launch Microsoft Immersive Reader

Microsoft Edge has a free Immersive Reader mode with a more natural text-to-speech voice than most, which makes it a good choice over robot-like TTS when the user needs to listen to a long article intended to be read with natural speech.

Other programs, such as Microsoft Word, also use Immersive Reader, so it would be useful to see whether there is a common shortcut or voice command that could trigger all of these modes.

The Python program edge-tts that I experimented with seems to use the same API endpoint in the background. However, it is much easier to simply take advantage of the Immersive Reader functionality where it is an option.

Improve Installation for Blind / Vision Impaired Users

Installation should be made easier for those with no or low vision. There are many steps: install Talon, knausj, and NVDA, then remove character/word dictation. All of these settings add significant user friction. It is not entirely clear what the best approach is, though, since knausj is a git repo and thus should really be cloned, which means the user needs git and likely yet another step. It is difficult to say what is necessary and what can be automated in the background during installation.

Give Talon access to screen reader's accessibility tree

I do not know to what degree this is possible, so let me preface my idea with that. :)
I think, for me at least, the largest limitation Talon has is its reliance on Cursorless, mouse grids, and essentially any variation on the concept of tagging UI elements with numbers, two-letter abbreviations, etc.
Due to the sequential nature of screen readers, only one element at a time can be in focus, which means the rest of the screen is unperceivable in the meantime. So without first focusing the element in question to find its identifier (this sort of works with the Cursorless browser extension, for example), screen reader users aren't going to know what to tell Talon in order to interact with that element unless they focus it through some other means, at which point they may as well just continue using the modality they were already using, if able.
Mind you, this info is based on Talon as it was several months ago; I don't know if this has improved.
Dragon is generally able to receive a command like "Click OK" and find and interact with the right button. I believe there is a Talon OCR plugin that does something similar, but in my tests it wasn't always very reliable.
My thought was that NVDA has a rendition of the accessibility tree at any given point; that's what's used for object navigation, for example. It is also able to do OCR using the Windows 10/11 APIs, which far outstrip open-source offerings like Tesseract at present. A mode might be developed that provides Talon with this information, allowing the user to access on-screen controls similarly to Dragon, or possibly even better.

Support Emacspeak

It seems that emacspeak is a screen reader option used by some blind or visually impaired programmers. That being said, it is a bit convoluted to set up and may run differently in WSL2.

It is probably a very small portion of users who would use both Talon and emacspeak, but it might be interesting from a design perspective to see if there is potential for integrating the two.

I think I would need to discuss this with someone who actually uses the software in order to get a better perspective.

Way to stop echo this

The "echo this" command does not stop until I quit Talon. It would be good to have a command to stop it.

Echo Dictation Subtitles via TTS

For users without eyesight who need to use Talon without looking at the computer, it would be useful to have a way to provide auditory feedback to ensure accurate command execution.

This auditory feedback should be quick, so that it does not interfere with dictation or become an annoyance.

Handle addon path for NVDA portable Version

Use os.path.join(globalVars.appArgs.configPath, "talon_server_spec.json") instead of os.path.expanduser. This will also
handle portable versions of NVDA, or NVDA running with an alternate configuration path.

I am waiting to finish this until the add-on is stabilized and working.
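The path logic above can be sketched as follows. This is a minimal illustration, not the add-on's actual code: inside NVDA the configuration directory would come from globalVars.appArgs.configPath, so a parameter stands in for it here.

```python
import os

# Hypothetical helper: resolve the server-spec path relative to NVDA's
# active configuration directory. os.path.join follows whatever config
# dir NVDA is actually using, so portable installs and alternate
# configuration paths both work, unlike os.path.expanduser, which
# hard-codes a location under the user profile.
def spec_file_path(config_path: str, name: str = "talon_server_spec.json") -> str:
    return os.path.join(config_path, name)
```

In the add-on itself, the call would presumably be `spec_file_path(globalVars.appArgs.configPath)`.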

Integrate with Screen Readers

I imagine that Talon has many ways it could potentially interact with screen readers. For instance, instead of needing to hear every potential option on the screen, the user could say a command to jump directly to the option they want. This could be particularly powerful when integrated with a large language model and Talon's locate library to do on-screen OCR.

I am not particularly familiar with screen readers: whether they directly support some form of inter-process communication, and what the best way of sending information to them would be.

There would likely be times in which text-to-speech from the screen reader would not need to be echoed, especially if we are already echoing words from Talon, such as any log errors.

Sightless Help Menu

If a user wants to find out all the Talon script commands that can be called, they have to either say one of the help commands, which triggers a graphical pop-up, or look within the Talon file itself. Neither of these solutions is particularly accessible.

It would be helpful to have some sort of action that could echo out all the commands that can be spoken within a given file. It could potentially be useful to have some sort of tag, declared at the top of the file, which would automatically enable a help command; this help command would take all the commands declared within that file and echo them out to the user. That said, we would have to parse the Talon files somehow to get the spoken names of the commands, which might be rather difficult, and we would have to know exactly where the file is located as well. However, the syntax for Talon files is essentially "everything before the colon," so a basic heuristic might not be too difficult.
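The "everything before the colon" heuristic might look something like this. It is a rough sketch, not a real .talon parser: captures, lists, settings blocks, and multi-line rule bodies would all need more care.

```python
import re

def spoken_commands(talon_source: str) -> list[str]:
    """Heuristic: in a .talon file, everything after the '-' separator
    line is rules, and a rule's spoken form is the text before the
    first colon on an unindented line."""
    lines = talon_source.splitlines()
    body = lines
    # The context header (if any) ends at a line containing only '-'.
    for i, line in enumerate(lines):
        if line.strip() == "-":
            body = lines[i + 1:]
            break
    commands = []
    for line in body:
        # Unindented, non-comment lines ending a "spoken form:" prefix.
        m = re.match(r"^([^\s#][^:]*):", line)
        if m:
            commands.append(m.group(1).strip())
    return commands

# Hypothetical file contents; the action names are made up.
example = """\
app: vscode
-
echo this: user.tts_echo()
speak errors: user.tts_log_errors()
    insert("indented body line: ignored")
"""
```

Here `spoken_commands(example)` would return `["echo this", "speak errors"]`, skipping the context header and the indented body line.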

Restore setting instead of toggling on/off during pre and post phrase

This can be done by adding:

  • A new NVDA server command to get the state of the setting before disabling the speak/interrupt settings
  • Using that state to determine the client-side IPC request for the post:phrase callback

We will probably need a better JSON serialization format, and preferably one that can be shared between both the server and the client in the repo.

  • The schema should include both the status code of the response and any content that gets returned, i.e. the state of the variable that was requested
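A shared response schema along those lines might be sketched as below. The field names are assumptions for illustration, not the add-on's actual wire format.

```python
import json

# Hypothetical shared schema: a status code plus whatever content was
# requested (e.g. the saved value of an NVDA speech setting).
def make_response(status: int, content=None) -> str:
    return json.dumps({"status": status, "content": content})

def parse_response(raw: str):
    data = json.loads(raw)
    return data["status"], data["content"]

# Server records the interrupt setting before disabling it...
raw = make_response(200, {"speechInterruptForCharacters": True})
# ...and the client reads it back to restore state in post:phrase.
status, content = parse_response(raw)
```

Because both halves are a few lines of plain JSON handling, the same module could be vendored into the NVDA add-on and the Talon client.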

Standardized Command Grammar

As of right now, some commands use the word "speak" while others use the word "echo." Sometimes it makes more sense to use "echo," such as when Talon echoes back what the user said to confirm it was dictated correctly. However, it is probably preferable to use the same word in all contexts. Additionally, it would be preferable to have a consistent grammar, either verb-subject or subject-verb. I have to think about this more and potentially get some user feedback.

Echo errors in the log via TTS

Terminals and other programs that operate on a text log are usually not very accessible. It would be useful to find a way to watch the log file, see if there are any errors, and if so echo something to the user via text-to-speech.

This functionality could also be useful for users who do have full sight but still don't want to keep switching back and forth to the log on one monitor.
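The watching logic could be sketched like this. It is a minimal illustration: the error markers are guesses, the list append stands in for an actual TTS call, and a real watcher would poll the Talon log file on a timer rather than read a string buffer.

```python
import io

def check_new_lines(log, spoken: list, markers=("ERROR", "Traceback")):
    """Scan newly appended log lines and collect any that look like
    errors. Appending to `spoken` stands in for speaking via TTS."""
    for line in log:
        if any(m in line for m in markers):
            spoken.append(line.strip())

# Simulated log tail; a real implementation would seek to the last
# read offset of the log file and read from there.
log = io.StringIO("2024-01-01 INFO ready\n2024-01-01 ERROR mic not found\n")
spoken = []
check_new_lines(log, spoken)
```

After this runs, `spoken` holds only the ERROR line, which is what would be handed to the active speech backend.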

Change repository name

Similar to Cursorless or Rango, it would probably make sense to have a more specialized name for this repository instead of a generic one.
We need to think of something that is clever, short, preferably easily dictated with Talon, and has decent SEO. It should use inclusive language, and should imply the ability to use Talon along with screen readers and low-vision tools without requiring users to have a particular ability status. (E.g., "sightless talon" is not a great name, since it implies the program is only for non-sighted users.)

Support JAWS

I have the baseline Python code but no JAWS license. I could use help with testing this.

Prevent race condition when using screen reader

If you are using a screen reader, there is currently a setting that makes the TTS function use the screen reader rather than Talon's own text-to-speech.

However, if you do that, you have to press a key in order to read from the clipboard, and holding this key down could mangle text that is input via dictation.

There might need to be some sort of lock, or other concurrency primitive, to prevent this race condition.
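One possible shape for that, sketched with a plain lock (the function names are illustrative, not Talon's API): both the clipboard-read key press and dictation output must hold the lock, so one can never interleave with the other mid-phrase.

```python
import threading

io_lock = threading.Lock()
typed = []  # stands in for keystrokes reaching the focused app

def press_read_key():
    # Screen-reader clipboard read; must not land mid-dictation.
    with io_lock:
        typed.append("<read-clipboard-key>")

def insert_dictation(text: str):
    # Dictated text is emitted atomically with respect to the read key.
    with io_lock:
        for ch in text:
            typed.append(ch)

t1 = threading.Thread(target=insert_dictation, args=("hello",))
t2 = threading.Thread(target=press_read_key)
t1.start(); t2.start(); t1.join(); t2.join()
```

Whichever thread wins, the key press lands before or after the whole word, never inside it. A real fix would also need to cover key-up timing, which a lock alone doesn't model.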

COM Errors

I'm not particularly familiar with COM functionality on Windows, but it seems like if COM gets in a messed up state, then the functions that rely on COM interfaces won't work correctly. This could include both the screen reader and the simple Windows TTS included in this repository.

This is fixed only by a reboot, and it is not particularly clear what can cause it.

Create Talon script to echo context to the user

This would be useful for both sighted and visually impaired users.
Talon has a series of useful contextual state variables: for instance, whether certain tags or modes are enabled, or even more fundamental things like the title of the running application and the window name. It would be useful to be able to echo these out via text-to-speech. That could help people of all backgrounds who want to debug their Talon scripts but don't necessarily want to copy and paste things or open the debug window.

For a user with a vision impairment, it would be useful to quickly get a sense of what mode they are in, or other important context.

Support text to speech of various speeds and voices

By default, the speaker object within the Python Windows COM library speaks at a rate that is relatively slow. It would be useful to see if there is a way to make it faster.

The default speaker in this library is also particularly robotic. This is OK in some circumstances, and I have been told that for accessibility reasons this sometimes allows for faster listening, though I have personally not found that to be the case. Regardless, it is useful to have other options as well.

I have experimented with ElevenLabs, which provides a free API key, but there are pretty heavy rate limits.
I have also experimented with the Python library edge-tts, but it is awkward to call from a Python subprocess. You can use it asynchronously with a streaming API, but Talon does not allow asynchronous Python code, so it would need to be called from another interpreter, which would likely be a hassle. Perhaps there is a better way.
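One alternative to a second interpreter, offered as an untested assumption rather than a known-good Talon pattern: run an asyncio event loop on a background thread and submit the streaming coroutine to it, so no async syntax leaks into the synchronous calling code. Here `fake_stream_tts` is a stand-in for edge-tts's async streaming API.

```python
import asyncio
import threading

async def fake_stream_tts(text: str) -> bytes:
    # Placeholder for an async TTS call that streams audio chunks.
    await asyncio.sleep(0)
    return b"audio:" + text.encode()

# Dedicated event loop living on a daemon thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def speak_sync(text: str) -> bytes:
    # Synchronous entry point: blocks only this caller while the
    # coroutine runs on the background loop.
    future = asyncio.run_coroutine_threadsafe(fake_stream_tts(text), loop)
    return future.result(timeout=5)

audio = speak_sync("hello")
```

Whether Talon's embedded interpreter tolerates a background event-loop thread like this is exactly the open question; if it doesn't, the separate-interpreter subprocess remains the fallback.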

Setting to change tts volume

This should ideally integrate with both the system TTS volume and the screen reader TTS volume, although the latter might be difficult.

For context, things like videos or video calls sometimes have a very different sound level, so we need a way to normalize everything.

Integrate with Cursorless

There is lots of potential for using Cursorless to intelligently navigate around VS Code when coding. This is blocked until Cursorless keyboard mode is done, since the API doesn't expose everything at the moment.
We should investigate using a pedal as well.

We can essentially use the tree structure as a way to interact with the editor more intelligently than a typical screen reader can.

NVDA quits if Talon is launched via the terminal and then is force quit

I am not sure why this is happening; it is likely something on NVDA's side with the controller client and DLL injection. It happens regardless of whether my extension is installed. However, Talon has undefined behavior if force-quit from the terminal, and the user is not intended to launch it this way anyway, so perhaps it is not a significant issue.
