Token-efficient DOM extraction library for AI-powered browser automation. Extracts interactive elements, forms, and semantic content from web pages in a compact, AI-readable format.

Installation

npm install @mcp-b/smart-dom-reader

Overview

Smart DOM Reader is designed to extract relevant information from web pages while minimizing token usage for LLM consumption. It provides two extraction approaches:
  1. Full Extraction (SmartDOMReader) - Complete single-pass extraction
  2. Progressive Extraction (ProgressiveExtractor) - Step-by-step, token-efficient approach

Key Features

  • Token-Efficient: Extracts only interactive and semantic elements
  • Two Extraction Strategies: Full or progressive extraction based on your needs
  • Selector Generation: Generates reliable CSS/XPath selectors for extracted elements
  • Interactive Elements: Identifies buttons, links, inputs, and other interactive components
  • Form Detection: Extracts form structure with field information
  • Semantic Content: Preserves heading hierarchy and meaningful text
  • Shadow DOM Support: Can extract from shadow DOM trees
  • Browser & Node.js: Works in both browser environments and Node.js (with jsdom)

Full Extraction Approach

Use SmartDOMReader when you need all information upfront and have sufficient token budget.

Basic Usage

import { SmartDOMReader } from '@mcp-b/smart-dom-reader';

// Create reader instance
const reader = new SmartDOMReader({
  mode: 'interactive', // or 'full'
  maxDepth: 5,
  includeHidden: false
});

// Extract from current page
const result = reader.extract(document);

console.log(result);
// Output:
// {
//   mode: 'interactive',
//   timestamp: 1234567890,
//   page: { url: '...', title: '...', hasErrors: false, ... },
//   landmarks: { navigation: [...], main: [...], forms: [...], ... },
//   interactive: {
//     buttons: [{ tag: 'button', text: 'Submit', selector: {...}, ... }],
//     links: [{ tag: 'a', text: 'Home', selector: {...}, href: '/', ... }],
//     inputs: [...],
//     forms: [...],
//     clickable: [...]
//   }
// }

Extraction Modes

Interactive Mode (Default)

Extracts only interactive elements (buttons, links, inputs, forms):
const reader = new SmartDOMReader({ mode: 'interactive' });
const result = reader.extract(document);

// result.interactive contains all UI elements
console.log(result.interactive.buttons); // All buttons
console.log(result.interactive.links);   // All links
console.log(result.interactive.inputs);  // All form inputs
console.log(result.interactive.forms);   // All forms

Full Mode

Includes interactive elements plus semantic content (headings, images, tables):
const reader = new SmartDOMReader({ mode: 'full' });
const result = reader.extract(document);

// All interactive elements
console.log(result.interactive);

// Plus semantic elements
console.log(result.semantic.headings); // h1-h6 elements
console.log(result.semantic.images);   // img elements
console.log(result.semantic.tables);   // table elements
console.log(result.semantic.lists);    // ul/ol elements

// Plus metadata
console.log(result.metadata.totalElements);
console.log(result.metadata.mainContent);

Constructor Options

interface ExtractionOptions {
  mode?: 'interactive' | 'full';
  maxDepth?: number;                  // Max traversal depth (default: 5)
  includeHidden?: boolean;            // Include hidden elements (default: false)
  includeShadowDOM?: boolean;         // Include shadow DOM (default: true)
  includeIframes?: boolean;           // Include iframe content (default: false)
  viewportOnly?: boolean;             // Only visible viewport (default: false)
  mainContentOnly?: boolean;          // Only main content area (default: false)
  customSelectors?: string[];         // Additional selectors to extract
  attributeTruncateLength?: number;   // Max attribute length (default: 100)
  dataAttributeTruncateLength?: number; // Max data-* length (default: 50)
  textTruncateLength?: number;        // Max text length (default: unlimited)
}

Runtime Options Override

You can override constructor options at extraction time:
const reader = new SmartDOMReader({ mode: 'interactive' });

// Normal extraction
const interactive = reader.extract(document);

// Override to full mode for one extraction
const full = reader.extract(document, { mode: 'full' });
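
The precedence implied above (defaults, then constructor options, then per-call options) can be sketched as a shallow merge. This is only an illustration: resolveOptions and DEFAULTS are hypothetical names, the library's internal merge logic is not shown in this document, and the interface below is a trimmed local copy rather than an import.

```typescript
// Trimmed local copy of the option names listed above; not imported from the library.
interface ExtractionOptions {
  mode?: 'interactive' | 'full';
  maxDepth?: number;
  includeHidden?: boolean;
  mainContentOnly?: boolean;
}

// Defaults taken from the comments in the interface above.
// The default for `mode` follows the "Interactive Mode (Default)" heading.
const DEFAULTS: Required<ExtractionOptions> = {
  mode: 'interactive',
  maxDepth: 5,
  includeHidden: false,
  mainContentOnly: false,
};

// Hypothetical helper: shallow merge in which later sources win.
function resolveOptions(
  ctor: ExtractionOptions = {},
  runtime: ExtractionOptions = {},
): Required<ExtractionOptions> {
  return { ...DEFAULTS, ...ctor, ...runtime };
}

// Constructor sets maxDepth, the per-call override switches the mode.
const perCall = resolveOptions({ mode: 'interactive', maxDepth: 3 }, { mode: 'full' });
```

Under this assumption, perCall keeps maxDepth from the constructor while mode comes from the per-call override, and the reader's own configuration is left untouched for later calls.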

Node.js with jsdom

import { JSDOM } from 'jsdom';
import { SmartDOMReader } from '@mcp-b/smart-dom-reader';

const dom = new JSDOM(`
  <html>
    <body>
      <h1>My Page</h1>
      <button id="cta">Click Me</button>
      <form>
        <input name="email" type="email" />
        <button type="submit">Submit</button>
      </form>
    </body>
  </html>
`);

const reader = new SmartDOMReader();
const result = reader.extract(dom.window.document);
console.log(result.interactive.buttons); // Extracted buttons

Progressive Extraction Approach

Use ProgressiveExtractor for token-efficient, step-by-step extraction.

Step 1: Extract Structure

Get a high-level overview of the page structure (minimal tokens):
import { ProgressiveExtractor } from '@mcp-b/smart-dom-reader';

const structure = ProgressiveExtractor.extractStructure(document);

console.log(structure);
// {
//   regions: {
//     header: { selector: 'header', label: 'Page Header', interactiveCount: 5 },
//     navigation: [{ selector: 'nav', label: 'Main Navigation', interactiveCount: 10 }],
//     main: { selector: 'main', label: 'Main Content', interactiveCount: 25 },
//     sections: [...],
//     sidebars: [...],
//     footer: { selector: 'footer', label: 'Footer', interactiveCount: 8 }
//   }
// }
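
Because the structure mixes single regions and arrays of regions, it can help to flatten it before showing it to an LLM or choosing where to look next. The sketch below assumes the shape of the sample output above; RegionInfo, PageStructure, listRegions, busiestRegion, and describeRegions are illustrative names, not library exports.

```typescript
// Shapes inferred from the sample output above; not imported from the library.
interface RegionInfo {
  selector: string;
  label: string;
  interactiveCount: number;
}

interface PageStructure {
  regions: {
    header?: RegionInfo;
    navigation?: RegionInfo[];
    main?: RegionInfo;
    sections?: RegionInfo[];
    sidebars?: RegionInfo[];
    footer?: RegionInfo;
  };
}

// Flatten single regions and region arrays into one list.
function listRegions(structure: PageStructure): RegionInfo[] {
  const out: RegionInfo[] = [];
  for (const value of Object.values(structure.regions)) {
    if (!value) continue;
    out.push(...(Array.isArray(value) ? value : [value]));
  }
  return out;
}

// Region with the highest interactive element count, if any.
function busiestRegion(structure: PageStructure): RegionInfo | undefined {
  let best: RegionInfo | undefined;
  for (const region of listRegions(structure)) {
    if (!best || region.interactiveCount > best.interactiveCount) best = region;
  }
  return best;
}

// One compact line per region, suitable for a prompt.
function describeRegions(structure: PageStructure): string {
  return listRegions(structure)
    .map(r => `${r.label} (${r.selector}): ${r.interactiveCount} interactive`)
    .join('\n');
}

const sample: PageStructure = {
  regions: {
    header: { selector: 'header', label: 'Page Header', interactiveCount: 5 },
    navigation: [{ selector: 'nav', label: 'Main Navigation', interactiveCount: 10 }],
    main: { selector: 'main', label: 'Main Content', interactiveCount: 25 },
    footer: { selector: 'footer', label: 'Footer', interactiveCount: 8 },
  },
};
```

The selector of the chosen region can then be passed on to the region extraction step.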

Step 2: Extract Specific Region

Extract detailed information from a specific region:
// Extract just the main content area
const mainContent = ProgressiveExtractor.extractRegion('main', document);

console.log(mainContent);
// SmartDOMResult focused on the main content area only

You can also use selectors from the structure extraction:
const structure = ProgressiveExtractor.extractStructure(document);
const headerSelector = structure.regions.header?.selector;

if (headerSelector) {
  const headerData = ProgressiveExtractor.extractRegion(headerSelector, document);
}

Step 3: Extract Content

Extract readable text content from a region:
const content = ProgressiveExtractor.extractContent('article', document, {
  includeMarkdown: true,
  preserveStructure: true
});

console.log(content);
// {
//   selector: 'article',
//   text: '...',  // Plain text content
//   markdown: '...', // Markdown-formatted content (if includeMarkdown: true)
//   structure: { headings: [...], paragraphs: [...] } // If preserveStructure: true
// }

Return Types

SmartDOMResult

interface SmartDOMResult {
  mode: 'interactive' | 'full';
  timestamp: number;
  page: {
    url: string;
    title: string;
    hasErrors: boolean;
    isLoading: boolean;
    hasModals: boolean;
    hasFocus?: string;
  };
  landmarks: {
    navigation: string[];  // Selector strings
    main: string[];
    forms: string[];
    headers: string[];
    footers: string[];
    articles: string[];
    sections: string[];
  };
  interactive: {
    buttons: ExtractedElement[];
    links: ExtractedElement[];
    inputs: ExtractedElement[];
    forms: FormInfo[];
    clickable: ExtractedElement[];
  };
  semantic?: {  // Only in 'full' mode
    headings: ExtractedElement[];
    images: ExtractedElement[];
    tables: ExtractedElement[];
    lists: ExtractedElement[];
    articles: ExtractedElement[];
  };
  metadata?: {  // Only in 'full' mode
    totalElements: number;
    extractedElements: number;
    mainContent?: string;
    language?: string;
  };
}
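
Since semantic and metadata are only present in 'full' mode, code that consumes a result should treat them as optional. A minimal sketch, assuming only the fields declared above; ResultLike, headingOutline, and extractionRatio are hypothetical names, and the types are trimmed local copies rather than imports.

```typescript
// Trimmed local copies of the result shape above; not imported from the library.
interface HeadingLike {
  tag: string;
  text: string;
}

interface ResultLike {
  mode: 'interactive' | 'full';
  semantic?: { headings: HeadingLike[] };
  metadata?: { totalElements: number; extractedElements: number };
}

// Indented heading outline; empty when the result has no semantic section.
function headingOutline(result: ResultLike): string[] {
  const headings = result.semantic?.headings ?? [];
  return headings.map(h => {
    // 'h2' -> 2; fall back to level 1 for anything unexpected.
    const level = Number(h.tag.toLowerCase().replace('h', '')) || 1;
    return `${'  '.repeat(level - 1)}${h.text}`;
  });
}

// Share of page elements that were extracted, or undefined without metadata.
function extractionRatio(result: ResultLike): number | undefined {
  const meta = result.metadata;
  if (!meta || meta.totalElements === 0) return undefined;
  return meta.extractedElements / meta.totalElements;
}

const fullResult: ResultLike = {
  mode: 'full',
  semantic: {
    headings: [
      { tag: 'h1', text: 'Title' },
      { tag: 'h2', text: 'Intro' },
    ],
  },
  metadata: { totalElements: 200, extractedElements: 50 },
};
```

Both helpers return an empty or undefined value for an 'interactive' result instead of throwing, which keeps calling code the same for either mode.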

ExtractedElement

interface ExtractedElement {
  tag: string;
  text: string;
  selector: {
    css: string;
    xpath: string;
    textBased?: string;
    dataTestId?: string;
    ariaLabel?: string;
    candidates?: SelectorCandidate[];  // Ranked by stability
  };
  attributes: Record<string, string>;
  context: {
    nearestForm?: string;
    nearestSection?: string;
    nearestMain?: string;
    nearestNav?: string;
    parentChain: string[];
  };
  interaction: {
    click?: boolean;
    change?: boolean;
    submit?: boolean;
    nav?: boolean;
    disabled?: boolean;
    hidden?: boolean;
    role?: string;
    form?: string;
  };
  children?: ExtractedElement[];
}
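
The interaction flags and the optional children array make it possible to narrow a result to elements that can actually be acted on. The following sketch assumes that disabled and hidden mean what their names suggest; flatten and actionable are illustrative helpers, not part of the package.

```typescript
// Trimmed local copies of the element shape above; not imported from the library.
interface InteractionFlags {
  click?: boolean;
  nav?: boolean;
  disabled?: boolean;
  hidden?: boolean;
}

interface ElementLike {
  tag: string;
  text: string;
  interaction: InteractionFlags;
  children?: ElementLike[];
}

// Depth-first flattening of elements and their nested children.
function flatten(elements: ElementLike[]): ElementLike[] {
  return elements.flatMap(el => [el, ...flatten(el.children ?? [])]);
}

// Keep only elements that are neither disabled nor hidden.
function actionable(elements: ElementLike[]): ElementLike[] {
  return flatten(elements).filter(
    el => !el.interaction.disabled && !el.interaction.hidden,
  );
}

const extracted: ElementLike[] = [
  { tag: 'button', text: 'Save', interaction: { click: true } },
  { tag: 'button', text: 'Delete', interaction: { click: true, disabled: true } },
  {
    tag: 'div',
    text: 'Menu',
    interaction: {},
    children: [
      { tag: 'a', text: 'Docs', interaction: { nav: true } },
      { tag: 'a', text: 'Admin', interaction: { nav: true, hidden: true } },
    ],
  },
];

const usable = actionable(extracted).map(el => el.text);
```

Here usable holds the text of Save, Menu, and Docs; Delete and Admin are dropped because of their flags.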

MCP Server Integration

Use Smart DOM Reader with Model Context Protocol:
import { SmartDOMReader } from '@mcp-b/smart-dom-reader';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';

const server = new McpServer({
  name: 'dom-reader-server',
  version: '1.0.0'
});

const reader = new SmartDOMReader();

server.tool(
  'extract_page_elements',
  'Extract interactive elements from the current page',
  {
    mode: z.enum(['interactive', 'full']).optional(),
    mainContentOnly: z.boolean().optional()
  },
  async (args) => {
    const result = reader.extract(document, {
      mode: args.mode || 'interactive',
      mainContentOnly: args.mainContentOnly || false
    });

    return {
      content: [{
        type: 'text',
        text: JSON.stringify(result, null, 2)
      }]
    };
  }
);

Use Cases

Web Scraping for LLMs

// Extract page content for LLM processing with minimal tokens
const reader = new SmartDOMReader({ mode: 'interactive' });
const pageData = reader.extract(document);

// Send to LLM with minimal tokens
const prompt = `
  Page: ${pageData.page.title}

  Interactive elements:
  ${pageData.interactive.buttons.map(el => `- Button: "${el.text}"`).join('\n')}
  ${pageData.interactive.links.map(el => `- Link: "${el.text}" → ${el.attributes.href}`).join('\n')}

  Please click the "Submit" button.
`;

Browser Automation

// Find and interact with elements
const reader = new SmartDOMReader();
const result = reader.extract(document);

const submitButton = result.interactive.buttons.find(el =>
  el.text.toLowerCase().includes('submit')
);

if (submitButton) {
  // Type the match as HTMLElement so click() is available in TypeScript
  const element = document.querySelector<HTMLElement>(submitButton.selector.css);
  element?.click();
}

Progressive LLM Interaction

// Step 1: Show overview (minimal tokens)
const structure = ProgressiveExtractor.extractStructure(document);
console.log('Page regions:', Object.keys(structure.regions));

// Step 2: LLM decides which region to explore
const targetRegion = 'main'; // From LLM decision

// Step 3: Extract detailed info from that region only
const regionData = ProgressiveExtractor.extractRegion(targetRegion, document);
console.log('Interactive elements in main:', regionData.interactive);

Form Analysis

// Analyze forms on a page
const reader = new SmartDOMReader({ mode: 'interactive' });
const result = reader.extract(document);

result.interactive.forms.forEach(form => {
  console.log(`Form at ${form.selector}:`);
  console.log(`  Action: ${form.action || 'none'}`);
  console.log(`  Method: ${form.method || 'GET'}`);
  console.log('  Fields:');

  form.inputs.forEach(field => {
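    // Boolean attributes such as `required` may be stored as an empty string, so test for presence
    const required = ('required' in field.attributes) ? ' (required)' : '';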
    console.log(`    - ${field.attributes.name}: ${field.attributes.type}${required}`);
  });
});

Advanced Features

Selector Generation

The package provides robust selector generation:
import { SelectorGenerator } from '@mcp-b/smart-dom-reader';

const button = document.querySelector('#my-button');
const selectors = SelectorGenerator.generateSelectors(button);

console.log(selectors);
// {
//   css: '#my-button',
//   xpath: '//*[@id="my-button"]',
//   dataTestId: '[data-testid="my-button"]',
//   candidates: [
//     { type: 'id', value: '#my-button', score: 100 },
//     { type: 'data-testid', value: '[data-testid="my-button"]', score: 95 },
//     ...
//   ]
// }

Content Detection

Automatically detect main content areas:
import { ContentDetection } from '@mcp-b/smart-dom-reader';

const mainContent = ContentDetection.findMainContent(document);
const landmarks = ContentDetection.detectLandmarks(document);

console.log('Main content element:', mainContent);
console.log('Page landmarks:', landmarks);

Custom Element Filtering

Filter extracted elements with custom logic:
const reader = new SmartDOMReader({
  mode: 'interactive',
  filter: {
    // Only buttons with specific text
    textContains: ['submit', 'save', 'continue'],

    // Only elements with data-testid
    hasAttributes: ['data-testid'],

    // Exclude elements within nav
    excludeSelectors: ['nav *']
  }
});

const result = reader.extract(document);
// Only filtered elements included
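
The same kind of narrowing can also be applied after extraction, on the returned elements. The sketch below is a post-processing illustration only: it assumes case-insensitive substring matching for textContains and that every listed attribute must be present, which may differ from how the filter option behaves internally. PostFilter and matchesFilter are hypothetical names, and excludeSelectors is left out because it needs a live DOM.

```typescript
// Trimmed local copy of the element shape; not imported from the library.
interface FilterableElement {
  text: string;
  attributes: Record<string, string>;
}

// Hypothetical post-filter mirroring two of the criteria shown above.
interface PostFilter {
  textContains?: string[];
  hasAttributes?: string[];
}

function matchesFilter(el: FilterableElement, filter: PostFilter): boolean {
  const text = el.text.toLowerCase();
  // Assumption: any one of the needles is enough, compared case-insensitively.
  const textOk =
    !filter.textContains?.length ||
    filter.textContains.some(needle => text.includes(needle.toLowerCase()));
  // Assumption: all listed attributes must be present, whatever their value.
  const attrsOk =
    !filter.hasAttributes?.length ||
    filter.hasAttributes.every(name => name in el.attributes);
  return textOk && attrsOk;
}

const buttons: FilterableElement[] = [
  { text: 'Save changes', attributes: { 'data-testid': 'save' } },
  { text: 'Submit', attributes: {} },
  { text: 'Cancel', attributes: { 'data-testid': 'cancel' } },
];

const kept = buttons.filter(el =>
  matchesFilter(el, { textContains: ['submit', 'save'], hasAttributes: ['data-testid'] }),
);
```

With these assumptions only "Save changes" survives: "Submit" lacks the attribute and "Cancel" does not match the text.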

Best Practices

Token Efficiency

  • Use mode: 'interactive' if you don’t need semantic content
  • Set mainContentOnly: true to skip headers/footers/nav
  • Use ProgressiveExtractor for multi-step interactions with LLMs
  • Truncate long attributes with attributeTruncateLength
  • Extract from specific containers instead of the whole document:
    const main = document.querySelector('main');
    const result = reader.extract(main);
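
To check whether these measures help, the size of a payload can be estimated before it is sent. This sketch uses a rough rule of thumb of about four characters per token, which is an assumption and varies by tokenizer and content; estimateTokens and compact are illustrative helpers, not library functions.

```typescript
// Trimmed local copy of the element shape; not imported from the library.
interface SizedElement {
  tag: string;
  text: string;
  selector: { css: string; xpath: string };
  attributes: Record<string, string>;
}

// Rough estimate: ~4 characters per token (heuristic, not a real tokenizer).
function estimateTokens(value: unknown): number {
  const json = JSON.stringify(value) ?? '';
  return Math.ceil(json.length / 4);
}

// Keep only the fields an LLM usually needs to pick an element.
function compact(elements: SizedElement[]): { tag: string; text: string; css: string }[] {
  return elements.map(({ tag, text, selector }) => ({ tag, text, css: selector.css }));
}

const elements: SizedElement[] = [
  {
    tag: 'button',
    text: 'Submit',
    selector: { css: '#submit', xpath: '//*[@id="submit"]' },
    attributes: { id: 'submit', class: 'btn btn-primary', type: 'submit' },
  },
];

const before = estimateTokens(elements);
const after = estimateTokens(compact(elements));
```

Comparing before and after gives a quick sense of how much a projection saves; a real tokenizer is needed for exact counts.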
    

Selector Reliability

  • Generated selectors prioritize stability:
    1. IDs (#my-id)
    2. Data attributes ([data-testid="..."])
    3. ARIA labels
    4. Unique classes
    5. Structural paths
  • Check selector.candidates for alternative selectors
  • Re-query before interaction to ensure element still exists
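
The last two points can be combined: re-query with the primary selector and fall back to the ranked candidates if it no longer matches. The sketch below assumes each candidate value is a CSS selector and that a higher score means more stable, as the sample output suggests; resolveWithFallback is a hypothetical helper, and the query function is injected so the logic does not depend on a DOM.

```typescript
// Shapes inferred from the selector output shown in this document; not imported.
interface SelectorCandidate {
  type: string;
  value: string;
  score: number;
}

interface SelectorSet {
  css: string;
  candidates?: SelectorCandidate[];
}

// Try the primary selector first, then candidates from highest to lowest score.
function resolveWithFallback<T>(
  selector: SelectorSet,
  query: (css: string) => T | null,
): { match: T; used: string } | null {
  const ranked = [...(selector.candidates ?? [])]
    .sort((a, b) => b.score - a.score)
    .map(c => c.value);
  const ordered = [selector.css, ...ranked.filter(v => v !== selector.css)];

  for (const css of ordered) {
    let match: T | null = null;
    try {
      match = query(css);
    } catch {
      continue; // Skip selectors the query function rejects as invalid.
    }
    if (match !== null) return { match, used: css };
  }
  return null;
}

// Stand-in for a DOM in which the id changed but the test id survived.
const fakeDom = new Map<string, string>([['[data-testid="save"]', 'save-button']]);

const resolved = resolveWithFallback(
  {
    css: '#save',
    candidates: [
      { type: 'id', value: '#save', score: 100 },
      { type: 'data-testid', value: '[data-testid="save"]', score: 95 },
    ],
  },
  css => fakeDom.get(css) ?? null,
);
```

In a browser, the query argument would typically be css => document.querySelector<HTMLElement>(css).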

Error Handling

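try {
  const reader = new SmartDOMReader();
  const result = reader.extract(document);

  // The page may contain no buttons, so guard before dereferencing
  const button = result.interactive.buttons[0];
  if (!button) {
    console.error('No buttons found on the page');
  } else {
    // Re-query: the DOM may have changed since extraction
    const element = document.querySelector<HTMLElement>(button.selector.css);

    if (!element) {
      console.error('Element no longer exists in DOM');
    } else {
      element.click();
    }
  }
} catch (error) {
  console.error('Failed to extract or interact with DOM content:', error);
}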

Bundle String Export

For injection into pages via an extension, userscript, or automation driver:
import { SMART_DOM_READER_BUNDLE } from '@mcp-b/smart-dom-reader/bundle-string';

// Inject into a page (`page` is assumed to be a Puppeteer- or Playwright-style page handle)
await page.evaluate(SMART_DOM_READER_BUNDLE);

// Now SmartDOMReader is available in the page context
const content = await page.evaluate(() => {
  const reader = new SmartDOMReader();
  return reader.extract(document);
});

Browser Compatibility

  • Chrome/Edge: Full support
  • Firefox: Full support
  • Safari: Full support
  • Node.js with jsdom: Full support

License

MIT - see LICENSE for details