Installation
Overview
Smart DOM Reader is designed to extract relevant information from web pages while minimizing token usage for LLM consumption. It provides two extraction approaches:
- Full Extraction (SmartDOMReader) - Complete single-pass extraction
- Progressive Extraction (ProgressiveExtractor) - Step-by-step, token-efficient approach
Key Features
- Token-Efficient: Extracts only interactive and semantic elements
- Two Extraction Strategies: Full or progressive extraction based on your needs
- Selector Generation: Generates reliable CSS/XPath selectors for extracted elements
- Interactive Elements: Identifies buttons, links, inputs, and other interactive components
- Form Detection: Extracts form structure with field information
- Semantic Content: Preserves heading hierarchy and meaningful text
- Shadow DOM Support: Can extract from shadow DOM trees
- Browser & Node.js: Works in both browser environments and Node.js (with jsdom)
Full Extraction Approach
Use SmartDOMReader when you need all information upfront and have sufficient token budget.
Basic Usage
Extraction Modes
Interactive Mode (Default)
Extracts only interactive elements (buttons, links, inputs, forms).
Full Mode
Includes interactive elements plus semantic content (headings, images, tables).
Constructor Options
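As a minimal sketch of how constructor options and per-call overrides might combine, the example below uses only the three option names that appear in this document (mode, mainContentOnly, attributeTruncateLength). The type name, the helper resolveOptions, and the default values for the last two options are assumptions for illustration, not the library's API.

```typescript
// Hypothetical option shape: only the option names come from this document.
type ExtractionMode = "interactive" | "full";

interface ReaderOptionsSketch {
  mode: ExtractionMode;
  mainContentOnly: boolean;
  attributeTruncateLength: number;
}

const ASSUMED_DEFAULTS: ReaderOptionsSketch = {
  mode: "interactive",          // the document marks interactive mode as the default
  mainContentOnly: false,       // assumed default
  attributeTruncateLength: 100, // assumed default
};

// Later sources win: defaults < constructor options < per-call overrides.
function resolveOptions(
  constructorOptions: Partial<ReaderOptionsSketch> = {},
  runtimeOverrides: Partial<ReaderOptionsSketch> = {},
): ReaderOptionsSketch {
  return { ...ASSUMED_DEFAULTS, ...constructorOptions, ...runtimeOverrides };
}

// Configure once, then switch to full mode for a single call.
const perCall = resolveOptions({ mainContentOnly: true }, { mode: "full" });
console.log(perCall);
```

Keeping the merge in one place means a per-call override never mutates the options the reader was constructed with.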
Runtime Options Override
You can override constructor options at extraction time.
Node.js with jsdom
Progressive Extraction Approach
Use ProgressiveExtractor for token-efficient, step-by-step extraction.
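The step-by-step flow can be sketched as follows: fetch a cheap outline, choose one region, then request detail and text for that region only. The interface, its method names, and runProgressive are hypothetical stand-ins with the roles the steps describe; they are not the signatures of ProgressiveExtractor.

```typescript
// Hypothetical callbacks, one per step. Not the library's real signatures.
interface StepwiseExtractorSketch<Outline, Region> {
  getStructure(): Outline;              // step 1: high-level page overview
  getRegion(selector: string): Region;  // step 2: detail for one region
  getContent(selector: string): string; // step 3: readable text of that region
}

// Run the three steps in order. `chooseRegion` is where an LLM (or any other
// policy) would pick a region selector after seeing only the outline.
function runProgressive<Outline, Region>(
  extractor: StepwiseExtractorSketch<Outline, Region>,
  chooseRegion: (outline: Outline) => string,
) {
  const outline = extractor.getStructure();
  const selector = chooseRegion(outline);
  return {
    outline,
    selector,
    region: extractor.getRegion(selector),
    text: extractor.getContent(selector),
  };
}

// Stub extractor so the flow can be exercised without a DOM.
const stubExtractor: StepwiseExtractorSketch<string[], { selector: string; buttons: number }> = {
  getStructure: () => ["header", "main", "footer"],
  getRegion: (selector) => ({ selector, buttons: 2 }),
  getContent: (selector) => `text of ${selector}`,
};

const outcome = runProgressive(stubExtractor, (regions) =>
  regions.indexOf("main") >= 0 ? "main" : "body",
);
console.log(outcome.selector, outcome.text);
```

Only the outline is produced for the whole page; the more expensive region and content calls run once, on the region that was chosen.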
Step 1: Extract Structure
Get a high-level overview of the page structure (minimal tokens).
Step 2: Extract Specific Region
Extract detailed information from a specific region.
Step 3: Extract Content
Extract readable text content from a region.
Return Types
SmartDOMResult
ExtractedElement
MCP Server Integration
Use Smart DOM Reader with Model Context Protocol.
Use Cases
Web Scraping for LLMs
Browser Automation
Progressive LLM Interaction
Form Analysis
Advanced Features
Selector Generation
The package provides robust selector generation.
Content Detection
Automatically detect main content areas.
Custom Element Filtering
Filter extracted elements with custom logic.
Best Practices
Token Efficiency
- Use mode: 'interactive' if you don’t need semantic content
- Set mainContentOnly: true to skip headers/footers/nav
- Use ProgressiveExtractor for multi-step interactions with LLMs
- Truncate long attributes with attributeTruncateLength
- Extract from specific containers instead of the whole document
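The last item, extracting from a specific container, can be sketched without committing to any particular reader call. The example assumes only the standard querySelector method; QueryableRoot and extractFromContainer are hypothetical names, and the extract callback stands in for whatever extraction call you actually use.

```typescript
// Minimal structural type: anything shaped like a DOM Document or Element
// that exposes querySelector. Declared locally so the sketch needs no DOM.
interface QueryableRoot {
  querySelector(selector: string): QueryableRoot | null;
}

// Run `extract` on the container matched by `containerSelector`,
// falling back to the whole root when nothing matches.
function extractFromContainer<T>(
  root: QueryableRoot,
  containerSelector: string,
  extract: (scope: QueryableRoot) => T,
): T {
  const container = root.querySelector(containerSelector);
  return extract(container ?? root);
}

// Fake page with a single <main>-like container, for illustration only.
const fakeMain: QueryableRoot = { querySelector: () => null };
const fakePage: QueryableRoot = {
  querySelector: (selector) => (selector === "main" ? fakeMain : null),
};

const usedMain = extractFromContainer(fakePage, "main", (scope) => scope === fakeMain);
const usedFallback = extractFromContainer(fakePage, "#missing", (scope) => scope === fakePage);
console.log(usedMain, usedFallback);
```

Falling back to the whole root keeps the call usable on pages where the container is missing, at the cost of a larger extraction in that case.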
Selector Reliability
- Generated selectors prioritize stability:
  - IDs (#my-id)
  - Data attributes ([data-testid="..."])
  - ARIA labels
  - Unique classes
  - Structural paths
- Check selector.candidates for alternative selectors
- Re-query before interaction to ensure the element still exists
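The points above can be sketched as two small helpers: one builds candidate selectors in the listed order (assuming that order is the priority order), the other re-queries with the first candidate that still matches. ElementHints, rankSelectorCandidates, and requery are hypothetical names and do not reflect how the package generates or stores its candidates.

```typescript
// Hypothetical description of an element; field names are illustrative.
interface ElementHints {
  id?: string;
  testId?: string;
  ariaLabel?: string;
  uniqueClass?: string;
  structuralPath: string; // e.g. "form > button"
}

// Build candidates, most stable first. Values are interpolated as-is;
// a real implementation would need to escape special characters.
function rankSelectorCandidates(hints: ElementHints): string[] {
  const candidates: string[] = [];
  if (hints.id) candidates.push(`#${hints.id}`);
  if (hints.testId) candidates.push(`[data-testid="${hints.testId}"]`);
  if (hints.ariaLabel) candidates.push(`[aria-label="${hints.ariaLabel}"]`);
  if (hints.uniqueClass) candidates.push(`.${hints.uniqueClass}`);
  candidates.push(hints.structuralPath);
  return candidates;
}

// Anything with a querySelector method, such as a DOM Document.
interface SelectorRoot<E> {
  querySelector(selector: string): E | null;
}

// Re-query just before interacting: first candidate that still matches wins.
function requery<E>(root: SelectorRoot<E>, candidates: string[]): E | null {
  for (const selector of candidates) {
    const match = root.querySelector(selector);
    if (match !== null) return match;
  }
  return null;
}

const ranked = rankSelectorCandidates({
  testId: "submit",
  uniqueClass: "btn-primary",
  structuralPath: "form > button",
});

// Fake root where the data-testid attribute has disappeared but the class remains.
const fakeRoot: SelectorRoot<string> = {
  querySelector: (selector) => (selector === ".btn-primary" ? "button element" : null),
};
const found = requery(fakeRoot, ranked);
console.log(ranked, found);
```

Trying candidates in order means a page change that removes the most stable hook degrades to the next candidate instead of failing outright.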
Error Handling
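One generic way to keep a failed extraction from crashing the caller is to wrap the call and return a tagged result. This is a sketch under the assumption that extraction can throw (for example, when a container is missing); this document does not state which errors the library raises, and safeExtract and ExtractionOutcome are hypothetical names.

```typescript
// Tagged result: either the extracted value or an error message.
type ExtractionOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Run any extraction callback and convert a thrown error into a value.
function safeExtract<T>(extract: () => T): ExtractionOutcome<T> {
  try {
    return { ok: true, value: extract() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in callbacks; a real caller would invoke its reader inside the callback.
const good = safeExtract(() => ["button", "link"]);
const bad = safeExtract<string[]>(() => {
  throw new Error("container not found");
});
console.log(good, bad);
```

Returning the error as data lets an LLM-driven loop report the failure or retry with a different selector.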
Bundle String Export
For injection into pages via extension or userscript.
Browser Compatibility
- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support
- Node.js with jsdom: Full support
Related Packages
- @mcp-b/extension-tools - Includes DOM tools for Chrome extensions
- @modelcontextprotocol/sdk - Official MCP SDK
