Advanced AI assistant for NVDA powered by Google Gemini. Features Smart Translation, Dictation, Vision, and CAPTCHA solving.


Ed-Fe/VisionAssistantPro

Vision Assistant Pro Documentation

Vision Assistant Pro is an advanced, multi-modal AI assistant for NVDA. It leverages Google's Gemini models to provide intelligent screen reading, translation, voice dictation, and document analysis capabilities.

This add-on was released to the community in honor of the International Day of Persons with Disabilities.

1. Setup & Configuration

Go to NVDA Menu > Preferences > Settings > Vision Assistant Pro.

  • API Key: Required. You can enter multiple keys (separated by commas or new lines). The assistant will automatically rotate between them if a quota limit is reached.
  • AI Model: Choose between Flash (Fastest/Free), Lite, or Pro (High Intelligence) models.
  • Proxy URL: Optional. Use this if Google is blocked in your region. It must be a web address that acts as a bridge to the Gemini API.
  • OCR Engine: Choose between Chrome (Fast) for quick results or Gemini (Formatted) for superior layout preservation and table recognition.
  • TTS Voice: Select the preferred voice style for generating audio files from document pages.
  • Smart Swap: Automatically swaps languages if the source text matches the target language.
  • Direct Output: Skips the chat window and announces the AI response directly via speech.
  • Clipboard Integration: Automatically copies the AI response to the clipboard.
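
The multi-key behaviour described for the API Key setting can be sketched roughly as follows. This is a minimal illustration, not the add-on's actual code: `QuotaExceeded`, `parse_api_keys`, and `call_with_rotation` are hypothetical names, and the real add-on may detect quota errors differently.

```python
import re
from typing import Callable, List


class QuotaExceeded(Exception):
    """Hypothetical error raised when a key has hit its quota."""


def parse_api_keys(raw: str) -> List[str]:
    """Split a settings string on commas or new lines, dropping blanks."""
    return [k.strip() for k in re.split(r"[,\n]", raw) if k.strip()]


def call_with_rotation(keys: List[str], request: Callable[[str], str]) -> str:
    """Try each key in order; move to the next one on a quota error."""
    for key in keys:
        try:
            return request(key)
        except QuotaExceeded:
            continue  # this key is exhausted, try the next one
    raise RuntimeError("All API keys failed")
```

Here `request` stands in for whatever function actually sends the prompt; passing it in keeps the rotation logic independent of the network layer.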

2. Command Layer & Shortcuts

To prevent keyboard conflicts, this add-on uses a Command Layer.

  1. Press NVDA + Shift + V (Master Key) to activate the layer (you will hear a beep).
  2. Release the keys, then press one of the following single keys:
| Key | Function | Description |
|---|---|---|
| T | Smart Translator | Translates text under navigator cursor or selection. |
| Shift + T | Clipboard Translator | Translates content currently in the clipboard. |
| R | Text Refiner | Summarize, Fix Grammar, Explain, or run Custom Prompts. |
| V | Object Vision | Describes the current navigator object. |
| O | Full Screen Vision | Analyzes the entire screen layout and content. |
| Shift + V | Online Video Analysis | Analyze YouTube, Instagram, or Twitter (X) videos via URL. |
| D | Document Reader | Advanced reader for PDF and images with page range selection. |
| F | File OCR | Direct text recognition from selected image, PDF, or TIFF files. |
| A | Audio Transcription | Transcribe MP3, WAV, or OGG files into text. |
| C | CAPTCHA Solver | Captures and solves CAPTCHAs on the screen or navigator object. |
| S | Smart Dictation | Converts speech to text. Press to start recording, again to stop/type. |
| L | Status Reporting | Announces current progress (e.g., "Scanning...", "Idle"). |
| U | Update Check | Manually check GitHub for the latest version of the add-on. |
| H | Commands Help | Displays a list of all available shortcuts within the command layer. |
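
The two-step flow above (master key, then a single key) can be modelled as a small state machine. The sketch below is a toy model only: it does not use NVDA's gesture API, the `CommandLayer` class is hypothetical, and it assumes the layer deactivates after one key press.

```python
from typing import Callable, Dict, Optional


class CommandLayer:
    """Toy model of a two-step shortcut layer (not the NVDA API)."""

    def __init__(self, commands: Dict[str, Callable[[], str]]):
        self.commands = commands
        self.active = False

    def press_master(self) -> None:
        # Corresponds to NVDA + Shift + V: arm the layer for one key.
        self.active = True

    def press(self, key: str) -> Optional[str]:
        # Keys are ignored unless the layer was armed first.
        if not self.active:
            return None
        self.active = False  # assumption: the layer handles exactly one key
        handler = self.commands.get(key)
        return handler() if handler else None


layer = CommandLayer({
    "t": lambda: "Smart Translator",
    "shift+t": lambda: "Clipboard Translator",
    "h": lambda: "Commands Help",
})
```

Because single keys only act while the layer is armed, they cannot collide with shortcuts that other applications or NVDA itself already use.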

2.1 Document Reader Shortcuts (Inside Viewer)

Once a document is opened via the D command:

  • Ctrl + PageDown: Move to the next page (announces page number).
  • Ctrl + PageUp: Move to the previous page (announces page number).
  • Alt + A: Open a chat dialog to ask questions about the document.
  • Alt + R: Force a re-scan of the current page or all pages using the Gemini engine.
  • Alt + G: Generate and save a high-quality audio file (WAV) from the content.
  • Alt + S / Ctrl + S: Save the extracted text as a TXT or HTML file.

3. Custom Prompts & Variables

Open Settings > Prompts > Manage Prompts... to configure system and custom prompts.

  • Default Prompts tab: edit built-in prompts. You can reset a single prompt or reset all defaults.
  • Custom Prompts tab: add, edit, remove, and reorder custom prompts.
  • Variables Guide button: opens a help window with all supported variables and input types.

Available Variables

| Variable | Description | Input Type |
|---|---|---|
| [selection] | Currently selected text | Text |
| [clipboard] | Clipboard content | Text |
| [screen_obj] | Screenshot of the navigator object | Image |
| [screen_full] | Full screen screenshot | Image |
| [file_ocr] | Select image/PDF file for text extraction | Image, PDF, TIFF |
| [file_read] | Select document for reading | TXT, Code, PDF |
| [file_audio] | Select audio file for analysis | MP3, WAV, OGG |

Example Custom Prompts

  • Quick OCR: My OCR:[file_ocr]
  • Translate Image: Translate Img:Extract text from this image and translate to English. [file_ocr]
  • Analyze Audio: Summarize Audio:Listen to this recording and summarize the main points. [file_audio]
  • Code Debugger: Debug:Find bugs in this code and explain them: [selection]
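
The examples above appear to follow a `Name:Prompt text` pattern with bracketed variables inside the prompt. Under that assumption, the sketch below shows one way such an entry could be split and its variables located. The function names are hypothetical and the add-on's real parser may differ; only text variables are substituted here, since image, file, and audio variables would need to be attached as data rather than pasted in as text.

```python
import re
from typing import Dict, List, Tuple

# Variable names taken from the Available Variables list.
KNOWN_VARIABLES = {
    "selection", "clipboard", "screen_obj", "screen_full",
    "file_ocr", "file_read", "file_audio",
}


def parse_prompt(entry: str) -> Tuple[str, str]:
    """Split 'Name:Prompt text' on the first colon only."""
    name, _, template = entry.partition(":")
    return name.strip(), template.strip()


def find_variables(template: str) -> List[str]:
    """Return known [variable] placeholders in order of appearance."""
    found = re.findall(r"\[([a-z_]+)\]", template)
    return [v for v in found if v in KNOWN_VARIABLES]


def fill(template: str, values: Dict[str, str]) -> str:
    """Replace placeholders that have a value; leave the rest untouched."""
    def repl(match: "re.Match[str]") -> str:
        return values.get(match.group(1), match.group(0))
    return re.sub(r"\[([a-z_]+)\]", repl, template)
```

Splitting on the first colon only matters for entries such as the Code Debugger example, where the prompt text itself contains a colon.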

Note: An active internet connection is required for all AI features. Multi-page documents and TIFFs are processed automatically.

Changes for 4.5

  • Advanced Prompt Manager: Introduced a dedicated management dialog in settings to customize default system prompts and manage user-defined prompts with full support for adding, editing, reordering, and previewing.
  • Comprehensive Proxy Support: Resolved network connectivity issues by ensuring that user-configured proxy settings are strictly applied to all API requests, including translation, OCR, and speech generation.
  • Automated Data Migration: Integrated a smart migration system to automatically upgrade legacy prompt configurations to a robust v2 JSON format upon the first run without data loss.
  • Updated Compatibility (2025.1): Raised the minimum required NVDA version to 2025.1, because library dependencies in advanced features such as the Document Reader need it for stable performance.
  • Optimized Settings Interface: Streamlined the settings interface by reorganizing prompt management into a separate dialog, providing a cleaner and more accessible user experience.
  • Prompt Variables Guide: Added a built-in guide within the prompt dialogs to help users easily identify and use dynamic variables such as [selection], [clipboard], and [screen_obj].

Changes for 4.0.3

  • Enhanced Network Resilience: Added an automatic retry mechanism to better handle unstable internet connections and temporary server errors, ensuring more reliable AI responses.
  • Visual Translation Dialog: Introduced a dedicated window for translation results. Users can now easily navigate and read long translations line-by-line, similar to OCR results.
  • Aggregated Formatted View: The "View Formatted" feature in the Document Reader now displays all processed pages in a single, organized window with clear page headers.
  • Optimized OCR Workflow: Automatically skips the page range selection for single-page documents, making the recognition process faster and more seamless.
  • Improved API Stability: Switched to a more robust header-based authentication method, resolving potential "All API Keys failed" errors caused by key rotation conflicts.
  • Bug Fixes: Resolved several potential crashes, including an issue during add-on termination and a focus error in the chat dialog.
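
For readers unfamiliar with header-based authentication, the sketch below builds a Gemini `generateContent` request that carries the key in the `x-goog-api-key` header instead of the URL. The changelog does not say which header or endpoint the add-on uses, so the endpoint, the default model name, and the `build_request` helper are assumptions for illustration; the request is built but never sent.

```python
import json
import urllib.request

# Assumed endpoint, for illustration only.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta"


def build_request(api_key: str, prompt: str,
                  model: str = "gemini-flash-latest") -> urllib.request.Request:
    """Build (but do not send) a generateContent request with header auth."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        f"{BASE_URL}/models/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": api_key,  # key travels in a header, not the URL
        },
        method="POST",
    )
```

Keeping the key out of the URL means that switching keys never changes the request address, and the key is less likely to end up in logged URLs.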

Changes for 4.0.1

  • Advanced Document Reader: A powerful new viewer for PDF and images with page range selection, background processing, and seamless Ctrl+PageUp/Down navigation.
  • New Tools Submenu: Added a dedicated "Vision Assistant" submenu under NVDA's Tools menu for quicker access to core features, settings, and documentation.
  • Flexible Customization: You can now choose your preferred OCR engine and TTS voice directly from the settings panel.
  • Multiple API Key Support: Added support for multiple Gemini API keys. You can enter one key per line or separate them with commas in the settings.
  • Alternative OCR Engine: Introduced a new OCR engine to ensure reliable text recognition even when hitting Gemini API quota limits.
  • Smart API Key Rotation: Automatically switches to and remembers the fastest working API key to bypass quota limits.
  • Document to MP3/WAV: Integrated capability to generate and save high-quality audio files in both MP3 (128kbps) and WAV formats directly within the reader.
  • Instagram Stories Support: Added the ability to describe and analyze Instagram Stories using their URLs.
  • TikTok Support: Introduced support for TikTok videos, allowing for full visual description and audio transcription of clips.
  • Redesigned Update Dialog: Features a new accessible interface with a scrollable text box to clearly read version changes before installing.
  • Unified Status & UX: Standardized file dialogs across the add-on and enhanced the 'L' command to report real-time progress.

Changes for 3.6.0

  • Help System: Added a help command (H) within the Command Layer to provide an easy-to-access list of all shortcuts and their functions.
  • Online Video Analysis: Expanded support to include Twitter (X) videos. Also improved URL detection and stability for a more reliable experience.
  • Project Contribution: Added an optional donation dialog for users who wish to support the project’s future updates and continuous growth.

Changes for 3.5.0

  • Command Layer: Introduced a Command Layer system (default: NVDA+Shift+V) to group shortcuts under a single master key. For example, instead of pressing NVDA+Control+Shift+T for translation, you now press NVDA+Shift+V followed by T.
  • Online Video Analysis: Added a new feature to analyze YouTube and Instagram videos directly by providing a URL.

Changes for 3.1.0

  • Direct Output Mode: Added an option to skip the chat dialog and hear AI responses directly via speech for a faster and more seamless experience.
  • Clipboard Integration: Added a new setting to automatically copy AI responses to the clipboard.

Changes for 3.0

  • New Languages: Added Persian and Vietnamese translations.
  • Expanded AI Models: Reorganized the model selection list with clear prefixes ([Free], [Pro], [Auto]) to help users distinguish between free and rate-limited (paid) models. Added support for Gemini 3.0 Pro and Gemini 2.0 Flash Lite.
  • Dictation Stability: Significantly improved Smart Dictation stability. Added a safety check to ignore audio clips shorter than 1 second, preventing AI hallucinations and empty errors.
  • File Handling: Fixed an issue where uploading files with non-English names would fail.
  • Prompt Optimization: Improved Translation logic and structured Vision results.
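
The dictation safety check mentioned above can be illustrated with Python's standard `wave` module. This is a sketch under the assumption that recordings are available as in-memory WAV data; `wav_duration` and `should_transcribe` are hypothetical names, not the add-on's functions.

```python
import io
import wave

MIN_SECONDS = 1.0  # threshold mentioned in the changelog


def wav_duration(data: bytes) -> float:
    """Return the length of an in-memory WAV recording in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def should_transcribe(data: bytes) -> bool:
    """Skip clips shorter than the minimum instead of sending them to the AI."""
    return wav_duration(data) >= MIN_SECONDS
```

Rejecting very short clips locally avoids asking the model to transcribe what is effectively silence, which is where invented text tends to come from.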

Changes for 2.9

  • Added French and Turkish translations.
  • Formatted View: Added a "View Formatted" button in chat dialogs to view the conversation with proper styling (Headings, Bold, Code) in a standard browseable window.
  • Markdown Setting: Added a new option "Clean Markdown in Chat" in Settings. Unchecking this allows users to see raw Markdown syntax (e.g., **, #) in the chat window.
  • Dialog Management: Fixed an issue where the "Refine Text" or chat windows would open multiple times or fail to focus correctly.
  • UX Improvements: Standardized file dialog titles to "Open" and removed redundant speech announcements (e.g., "Opening menu...") for a smoother experience.

Changes for 2.8

  • Added Italian translation.
  • Status Reporting: Added a new command (NVDA+Control+Shift+I) to announce the current status of the add-on (e.g., "Uploading...", "Analyzing...").
  • HTML Export: The "Save Content" button in result dialogs now saves output as a formatted HTML file, preserving styles like headings and bold text.
  • Settings UI: Improved the Settings panel layout with accessible grouping.
  • New Models: Added support for gemini-flash-latest and gemini-flash-lite-latest.
  • Languages: Added Nepali to supported languages.
  • Refine Menu Logic: Fixed a critical bug where "Refine Text" commands would fail if the NVDA interface language was not English.
  • Dictation: Improved silence detection to prevent incorrect text output when no speech is input.
  • Update Settings: "Check for updates on startup" is now disabled by default to comply with Add-on Store policies.
  • Code Cleanup.

Changes for 2.7

  • Migrated project structure to the official NV Access Add-on Template for better standards compliance.
  • Implemented automatic retry logic for HTTP 429 (Rate Limit) errors to ensure reliability during high traffic.
  • Optimized translation prompts for higher accuracy and better "Smart Swap" logic handling.
  • Updated Russian translation.
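
Retry logic for HTTP 429 responses is often written with an increasing delay between attempts. The changelog does not describe the add-on's retry policy, so the sketch below is only one plausible shape: the attempt count, the exponential delay, and the `retry_on_429` helper are all assumptions.

```python
import time
from typing import Callable, Tuple


def retry_on_429(send: Callable[[], Tuple[int, str]],
                 max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call send() until it returns a status other than 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, then 2s, then 4s, ...
    return status, body  # still rate limited after the last attempt
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.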

Changes for 2.6

  • Added Russian translation support (Thanks to nvda-ru).
  • Updated error messages to provide more descriptive feedback regarding connectivity.
  • Changed default target language to English.

Changes for 2.5

  • Added Native File OCR Command (NVDA+Control+Shift+F).
  • Added "Save Chat" button to result dialogs.
  • Implemented full localization support (i18n).
  • Migrated audio feedback to NVDA's native tones module.
  • Switched to Gemini File API for better handling of PDF and audio files.
  • Fixed crash when translating text containing curly braces.
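
The changelog does not say what caused the curly-brace crash. One common cause in Python, shown here purely as an illustration, is building a prompt by concatenating user text into a `str.format` template, so that braces in the text are parsed as format fields. Both function names below are hypothetical.

```python
def build_prompt_unsafe(text: str, lang: str) -> str:
    # Bug pattern: user text becomes part of the format string itself,
    # so any { or } in it is interpreted as a replacement field.
    return ("Translate to {lang}: " + text).format(lang=lang)


def build_prompt_safe(text: str, lang: str) -> str:
    # Fix: pass user text as an argument; str.format never re-parses arguments.
    return "Translate to {lang}: {text}".format(lang=lang, text=text)
```

With input such as `data = {"a": 1}`, the first function raises an exception while the second returns the text unchanged inside the prompt.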

Changes for 2.1.1

  • Fixed an issue where the [file_ocr] variable was not functioning correctly within Custom Prompts.

Changes for 2.1

  • Standardized all shortcuts to use NVDA+Control+Shift to eliminate conflicts with NVDA's Laptop layout and system hotkeys.

Changes for 2.0

  • Implemented built-in Auto-Update system.
  • Added Smart Translation Cache for instant retrieval of previously translated text.
  • Added Conversation Memory to contextually refine results in chat dialogs.
  • Added Dedicated Clipboard Translation command (NVDA+Control+Shift+Y).
  • Optimized AI prompts to strictly enforce target language output.
  • Fixed crash caused by special characters in input text.
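
A translation cache of the kind mentioned above can be as simple as a dictionary keyed by the source text and target language. The sketch below assumes an in-memory cache and a hypothetical `translate` callable; the add-on's actual cache may be keyed or stored differently.

```python
from typing import Callable, Dict, Tuple


class TranslationCache:
    """In-memory cache keyed by (text, target language)."""

    def __init__(self, translate: Callable[[str, str], str]):
        self._translate = translate
        self._store: Dict[Tuple[str, str], str] = {}

    def get(self, text: str, target: str) -> str:
        key = (text, target)
        if key not in self._store:
            # Only call the translator (and the network) on a cache miss.
            self._store[key] = self._translate(text, target)
        return self._store[key]
```

Including the target language in the key keeps a cached English result from being returned when the same text is later translated into another language.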

Changes for 1.5

  • Added support for over 20 new languages.
  • Implemented Interactive Refine Dialog for follow-up questions.
  • Added Native Smart Dictation feature.
  • Added "Vision Assistant" category to NVDA's Input Gestures dialog.
  • Fixed COMError crashes in specific applications like Firefox and Word.
  • Added automatic retry mechanism for server errors.

Changes for 1.0

  • Initial release.
