Vision Assistant Pro is an advanced, multi-modal AI assistant for NVDA. It leverages Google's Gemini models to provide intelligent screen reading, translation, voice dictation, and document analysis capabilities.
This add-on was released to the community in honor of the International Day of Persons with Disabilities.
Go to NVDA Menu > Preferences > Settings > Vision Assistant Pro.
- API Key: Required. You can enter multiple keys (separated by commas or new lines). The assistant will automatically rotate between them if a quota limit is reached.
- AI Model: Choose between Flash (Fastest/Free), Lite, or Pro (High Intelligence) models.
- Proxy URL: Optional. Use this if Google is blocked in your region. It must be a web address that acts as a bridge to the Gemini API.
- OCR Engine: Choose between Chrome (Fast) for quick results or Gemini (Formatted) for superior layout preservation and table recognition.
- TTS Voice: Select the preferred voice style for generating audio files from document pages.
- Smart Swap: Automatically swaps languages if the source text matches the target language.
- Direct Output: Skips the chat window and announces the AI response directly via speech.
- Clipboard Integration: Automatically copies the AI response to the clipboard.
To prevent keyboard conflicts, this add-on uses a Command Layer.
- Press NVDA + Shift + V (Master Key) to activate the layer (you will hear a beep).
- Release keys, then press one of the following single keys:
| Key | Function | Description |
|---|---|---|
| T | Smart Translator | Translates text under navigator cursor or selection. |
| Shift + T | Clipboard Translator | Translates content currently in the clipboard. |
| R | Text Refiner | Summarize, Fix Grammar, Explain, or run Custom Prompts. |
| V | Object Vision | Describes the current navigator object. |
| O | Full Screen Vision | Analyzes the entire screen layout and content. |
| Shift + V | Online Video Analysis | Analyze YouTube, Instagram, or Twitter (X) videos via URL. |
| D | Document Reader | Advanced reader for PDF and images with page range selection. |
| F | File OCR | Direct text recognition from selected image, PDF, or TIFF files. |
| A | Audio Transcription | Transcribe MP3, WAV, or OGG files into text. |
| C | CAPTCHA Solver | Captures and solves CAPTCHAs on the screen or navigator object. |
| S | Smart Dictation | Converts speech to text. Press to start recording, again to stop/type. |
| L | Status Reporting | Announces current progress (e.g., "Scanning...", "Idle"). |
| U | Update Check | Manually check GitHub for the latest version of the add-on. |
| H | Commands Help | Displays a list of all available shortcuts within the command layer. |
Once a document is opened via the D command:
- Ctrl + PageDown: Move to the next page (announces page number).
- Ctrl + PageUp: Move to the previous page (announces page number).
- Alt + A: Open a chat dialog to ask questions about the document.
- Alt + R: Force a re-scan of the current page or all pages using the Gemini engine.
- Alt + G: Generate and save a high-quality audio file (WAV) from the content.
- Alt + S / Ctrl + S: Save the extracted text as a TXT or HTML file.
Open Settings > Prompts > Manage Prompts... to configure system and custom prompts.
- Default Prompts tab: edit built-in prompts. You can reset a single prompt or reset all defaults.
- Custom Prompts tab: add, edit, remove, and reorder custom prompts.
- Variables Guide button: opens a help window with all supported variables and input types.
| Variable | Description | Input Type |
|---|---|---|
[selection] |
Currently selected text | Text |
[clipboard] |
Clipboard content | Text |
[screen_obj] |
Screenshot of the navigator object | Image |
[screen_full] |
Full screen screenshot | Image |
[file_ocr] |
Select image/PDF file for text extraction | Image, PDF, TIFF |
[file_read] |
Select document for reading | TXT, Code, PDF |
[file_audio] |
Select audio file for analysis | MP3, WAV, OGG |
- Quick OCR:
My OCR:[file_ocr] - Translate Image:
Translate Img:Extract text from this image and translate to English. [file_ocr] - Analyze Audio:
Summarize Audio:Listen to this recording and summarize the main points. [file_audio] - Code Debugger:
Debug:Find bugs in this code and explain them: [selection]
Note: An active internet connection is required for all AI features. Multi-page documents and TIFFs are processed automatically.
- Advanced Prompt Manager: Introduced a dedicated management dialog in settings to customize default system prompts and manage user-defined prompts with full support for adding, editing, reordering, and previewing.
- Comprehensive Proxy Support: Resolved network connectivity issues by ensuring that user-configured proxy settings are strictly applied to all API requests, including translation, OCR, and speech generation.
- Automated Data Migration: Integrated a smart migration system to automatically upgrade legacy prompt configurations to a robust v2 JSON format upon the first run without data loss.
- Updated Compatibility (2025.1): Set the minimum required NVDA version to 2025.1 due to library dependencies in advanced features like the Document Reader to ensure stable performance.
- Optimized Settings Interface: Streamlined the settings interface by reorganizing prompt management into a separate dialog, providing a cleaner and more accessible user experience.
- Prompt Variables Guide: Added a built-in guide within the prompt dialogs to help users easily identify and use dynamic variables such as [selection], [clipboard], and [screen_obj].
- Enhanced Network Resilience: Added an automatic retry mechanism to better handle unstable internet connections and temporary server errors, ensuring more reliable AI responses.
- Visual Translation Dialog: Introduced a dedicated window for translation results. Users can now easily navigate and read long translations line-by-line, similar to OCR results.
- Aggregated Formatted View: The "View Formatted" feature in the Document Reader now displays all processed pages in a single, organized window with clear page headers.
- Optimized OCR Workflow: Automatically skips the page range selection for single-page documents, making the recognition process faster and more seamless.
- Improved API Stability: Switched to a more robust header-based authentication method, resolving potential "All API Keys failed" errors caused by key rotation conflicts.
- Bug Fixes: Resolved several potential crashes, including an issue during add-on termination and a focus error in the chat dialog.
- Advanced Document Reader: A powerful new viewer for PDF and images with page range selection, background processing, and seamless
Ctrl+PageUp/Downnavigation. - New Tools Submenu: Added a dedicated "Vision Assistant" submenu under NVDA's Tools menu for quicker access to core features, settings, and documentation.
- Flexible Customization: You can now choose your preferred OCR engine and TTS voice directly from the settings panel.
- Multiple API Key Support: Added support for multiple Gemini API keys. You can enter one key per line or separate them with commas in the settings.
- Alternative OCR Engine: Introduced a new OCR engine to ensure reliable text recognition even when hitting Gemini API quota limits.
- Smart API Key Rotation: Automatically switches to and remembers the fastest working API key to bypass quota limits.
- Document to MP3/WAV: Integrated capability to generate and save high-quality audio files in both MP3 (128kbps) and WAV formats directly within the reader.
- Instagram Stories Support: Added the ability to describe and analyze Instagram Stories using their URLs.
- TikTok Support: Introduced support for TikTok videos, allowing for full visual description and audio transcription of clips.
- Redesigned Update Dialog: Features a new accessible interface with a scrollable text box to clearly read version changes before installing.
- Unified Status & UX: Standardized file dialogs across the add-on and enhanced the 'L' command to report real-time progress.
- Help System: Added a help command (
H) within the Command Layer to provide an easy-to-access list of all shortcuts and their functions. - Online Video Analysis: Expanded support to include Twitter (X) videos. Also improved URL detection and stability for a more reliable experience.
- Project Contribution: Added an optional donation dialog for users who wish to support the project’s future updates and continuous growth.
* **Command Layer:** Introduced a Command Layer system (default: NVDA+Shift+V) to group shortcuts under a single master key. For example, instead of pressing NVDA+Control+Shift+T for translation, you now press NVDA+Shift+V followed by T.
* **Online Video Analysis:** Added a new feature to analyze YouTube and Instagram videos directly by providing a URL.
- Direct Output Mode: Added an option to skip the chat dialog and hear AI responses directly via speech for a faster and more seamless experience.
- Clipboard Integration: Added a new setting to automatically copy AI responses to the clipboard.
- New Languages: Added Persian and Vietnamese translations.
- Expanded AI Models: Reorganized the model selection list with clear prefixes (
[Free],[Pro],[Auto]) to help users distinguish between free and rate-limited (paid) models. Added support for Gemini 3.0 Pro and Gemini 2.0 Flash Lite. - Dictation Stability: Significantly improved Smart Dictation stability. Added a safety check to ignore audio clips shorter than 1 second, preventing AI hallucinations and empty errors.
- File Handling: Fixed an issue where uploading files with non-English names would fail.
- Prompt Optimization: Improved Translation logic and structured Vision results.
- Added French and Turkish translations.
- Formatted View: Added a "View Formatted" button in chat dialogs to view the conversation with proper styling (Headings, Bold, Code) in a standard browseable window.
- Markdown Setting: Added a new option "Clean Markdown in Chat" in Settings. Unchecking this allows users to see raw Markdown syntax (e.g.,
**,#) in the chat window. - Dialog Management: Fixed an issue where the "Refine Text" or chat windows would open multiple times or fail to focus correctly.
- UX Improvements: Standardized file dialog titles to "Open" and removed redundant speech announcements (e.g., "Opening menu...") for a smoother experience.
- Added Italian translation.
- Status Reporting: Added a new command (NVDA+Control+Shift+I) to announce the current status of the add-on (e.g., "Uploading...", "Analyzing...").
- HTML Export: The "Save Content" button in result dialogs now saves output as a formatted HTML file, preserving styles like headings and bold text.
- Settings UI: Improved the Settings panel layout with accessible grouping.
- New Models: Added support for gemini-flash-latest and gemini-flash-lite-latest.
- Languages: Added Nepali to supported languages.
- Refine Menu Logic: Fixed a critical bug where "Refine Text" commands would fail if the NVDA interface language was not English.
- Dictation: Improved silence detection to prevent incorrect text output when no speech is input.
- Update Settings: "Check for updates on startup" is now disabled by default to comply with Add-on Store policies.
- Code Cleanup.
- Migrated project structure to the official NV Access Add-on Template for better standards compliance.
- Implemented automatic retry logic for HTTP 429 (Rate Limit) errors to ensure reliability during high traffic.
- Optimized translation prompts for higher accuracy and better "Smart Swap" logic handling.
- Updated Russian translation.
- Added Russian translation support (Thanks to nvda-ru).
- Updated error messages to provide more descriptive feedback regarding connectivity.
- Changed default target language to English.
- Added Native File OCR Command (NVDA+Control+Shift+F).
- Added "Save Chat" button to result dialogs.
- Implemented full localization support (i18n).
- Migrated audio feedback to NVDA's native tones module.
- Switched to Gemini File API for better handling of PDF and audio files.
- Fixed crash when translating text containing curly braces.
- Fixed an issue where the [file_ocr] variable was not functioning correctly within Custom Prompts.
- Standardized all shortcuts to use NVDA+Control+Shift to eliminate conflicts with NVDA's Laptop layout and system hotkeys.
- Implemented built-in Auto-Update system.
- Added Smart Translation Cache for instant retrieval of previously translated text.
- Added Conversation Memory to contextually refine results in chat dialogs.
- Added Dedicated Clipboard Translation command (NVDA+Control+Shift+Y).
- Optimized AI prompts to strictly enforce target language output.
- Fixed crash caused by special characters in input text.
- Added support for over 20 new languages.
- Implemented Interactive Refine Dialog for follow-up questions.
- Added Native Smart Dictation feature.
- Added "Vision Assistant" category to NVDA's Input Gestures dialog.
- Fixed COMError crashes in specific applications like Firefox and Word.
- Added automatic retry mechanism for server errors.
- Initial release.