
feat(skills): Add markitdown skill with benchmarked web fetching hierarchy

- Add markitdown skill for local document conversion (PDF, Word, Excel, PPT, images)
- Add benchmark script comparing markitdown vs Jina vs Firecrawl
- Update web fetching hierarchy with performance data:
  - Jina: 0.5s avg, 10/10 success (recommended default)
  - Firecrawl: 4.5s avg, 10/10 success (anti-bot specialist)
  - markitdown: 2.5s avg, 9/10 success (best for local files)
- Update cli-tools.md, tools/README.md with benchmark results
- Clarify markitdown's niche: local files, not web scraping

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
0xDarkMatter 2 months ago
parent
commit
d969b5411e
6 changed files with 497 additions and 29 deletions
  1. AGENTS.md (+1 -1)
  2. README.md (+1 -0)
  3. rules/cli-tools.md (+33 -13)
  4. skills/markitdown/SKILL.md (+92 -0)
  5. tests/markitdown-vs-jina/benchmark.py (+325 -0)
  6. tools/README.md (+45 -15)

+ 1 - 1
AGENTS.md

@@ -56,7 +56,7 @@ On "INIT:" message at session start:
 
 ## Quick Reference
 
-**CLI Tools:** Use `rg` over grep, `fd` over find, `eza` over ls, `bat` over cat
+**CLI Tools:** Use `rg` over grep, `fd` over find, `eza` over ls, `bat` over cat, `markitdown` for documents
 
 **Web Fetching:** WebFetch → Jina (`r.jina.ai/`) → `firecrawl` → firecrawl-expert agent
 

+ 1 - 0
README.md

@@ -139,6 +139,7 @@ Then symlink or copy to your Claude directories:
 | [find-replace](skills/find-replace/) | Modern find-and-replace with sd |
 | [code-stats](skills/code-stats/) | Analyze codebase with tokei and difft |
 | [data-processing](skills/data-processing/) | Process JSON with jq, YAML/TOML with yq |
+| [markitdown](skills/markitdown/) | Convert PDF, Word, Excel, PowerPoint, images to markdown |
 | [structural-search](skills/structural-search/) | Search code by AST structure with ast-grep |
 
 #### Workflow Skills

+ 33 - 13
rules/cli-tools.md

@@ -51,6 +51,23 @@ jq '.dependencies | keys' package.json
 yq '.services | keys' docker-compose.yml
 ```
 
+## Document Conversion
+
+| Instead of | Use | Why |
+|------------|-----|-----|
+| PyMuPDF/pdfplumber | `markitdown` | One tool for PDF, Word, Excel, PowerPoint |
+| python-docx | `markitdown` | Consistent markdown output |
+| Manual OCR | `markitdown` | Built-in image text extraction |
+
+```bash
+# Convert documents to markdown (use markitdown)
+markitdown document.pdf           # PDF to markdown
+markitdown report.docx            # Word to markdown
+markitdown data.xlsx              # Excel to markdown tables
+markitdown slides.pptx            # PowerPoint to markdown
+markitdown screenshot.png         # OCR image text
+```
+
 ## Git Operations
 
 | Instead of | Use | Why |
@@ -161,33 +178,36 @@ just test                     # Run test task
 
 ## Web Fetching (URL Retrieval)
 
-When fetching web content, use this hierarchy in order:
+When fetching web content, use this hierarchy based on benchmarked performance:
 
-| Priority | Tool | When to Use |
-|----------|------|-------------|
-| 1 | `WebFetch` | First attempt - fast, built-in |
-| 2 | `r.jina.ai/URL` | JS-rendered pages, PDFs, cleaner extraction |
-| 3 | `firecrawl <url>` | Anti-bot bypass, blocked sites (403, Cloudflare) |
-| 4 | `firecrawl-expert` agent | Complex scraping, structured extraction |
+| Priority | Tool | Speed | Use Case |
+|----------|------|-------|----------|
+| 1 | `WebFetch` | Instant | First attempt - built-in |
+| 2 | `r.jina.ai/URL` | **0.5s avg** | Default fallback - 5-9x faster than the other tools benchmarked |
+| 3 | `firecrawl <url>` | 4-5s avg | Anti-bot bypass, Cloudflare, heavy JS |
+| 4 | `markitdown <url>` | 2-3s avg | Simple static pages (or local files) |
 
 ```bash
-# Jina Reader - prefix any URL (free, 10M tokens)
+# Jina Reader - fastest option (free, 10M tokens)
 curl https://r.jina.ai/https://example.com
 
 # Jina Search - search + fetch in one call
 curl https://s.jina.ai/your%20search%20query
 
-# Firecrawl CLI - when WebFetch gets blocked
+# Firecrawl CLI - anti-bot bypass
 firecrawl https://blocked-site.com
-firecrawl https://example.com -o output.md
 firecrawl https://example.com --json
+
+# markitdown - simple pages or local files
+markitdown https://example.com
+markitdown document.pdf
 ```
 
 **Decision Tree:**
 1. Try `WebFetch` first (instant, free)
-2. If 403/blocked/JS-heavy → Try Jina: `r.jina.ai/URL`
-3. If still blocked → Try `firecrawl <url>`
-4. For complex scraping → Use `firecrawl-expert` agent
+2. If blocked → Try Jina: `r.jina.ai/URL` (fastest, 10/10 success rate)
+3. If anti-bot/Cloudflare → Try `firecrawl <url>` (designed for bypass)
+4. For local files (PDF, Word, Excel) → Use `markitdown`
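
A minimal sketch of the decision tree above as an ordered plan. The names `FetchStep`, `is_url`, and `fetch_plan` are hypothetical and not part of this commit. The function only builds the list of commands to try and does not run them. `WebFetch` appears as a label with no command because it is a built-in tool rather than a CLI.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class FetchStep:
    """One rung of the hierarchy: a label plus the command to try (if any)."""
    tool: str
    command: tuple[str, ...]  # empty for built-in tools such as WebFetch


def is_url(target: str) -> bool:
    """Treat only http(s) targets as URLs; everything else is a local path."""
    return urlparse(target).scheme in ("http", "https")


def fetch_plan(target: str) -> list[FetchStep]:
    """Return the tools to try, in order, for a URL or a local file."""
    if not is_url(target):
        # Step 4: local files (PDF, Word, Excel) go straight to markitdown.
        return [FetchStep("markitdown", ("markitdown", target))]
    return [
        FetchStep("WebFetch", ()),                                   # step 1
        FetchStep("jina", ("curl", f"https://r.jina.ai/{target}")),  # step 2
        FetchStep("firecrawl", ("firecrawl", target)),               # step 3
    ]
```

A caller would walk the list and stop at the first step that returns usable content.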
 
 
 ## Reference
 

+ 92 - 0
skills/markitdown/SKILL.md

@@ -0,0 +1,92 @@
+---
+name: markitdown
+description: "Convert local documents to Markdown using Microsoft's markitdown CLI. Best for: PDF, Word, Excel, PowerPoint, images (OCR), audio. Can fetch URLs but Jina is faster for web. Triggers on: convert to markdown, read PDF, parse document, extract text from, docx, xlsx, pptx, OCR image, local file."
+compatibility: "Requires markitdown. Install: pip install markitdown"
+allowed-tools: "Bash"
+---
+
+# markitdown - Document to Markdown
+
+Convert local documents to clean Markdown. One tool for PDF, Word, Excel, PowerPoint, images, and more.
+
+## When to Use markitdown
+
+| Use Case | Recommendation |
+|----------|----------------|
+| **Local files (PDF, Word, Excel)** | ✅ **Use markitdown** - unique capability |
+| **Web pages** | ❌ Use Jina (`r.jina.ai/`) - 5x faster |
+| **Blocked/anti-bot sites** | ❌ Use Firecrawl |
+| **OCR on images** | ✅ **Use markitdown** |
+| **Audio transcription** | ✅ **Use markitdown** |
+
+## Basic Usage
+
+```bash
+# Local files (primary use case)
+markitdown document.pdf
+markitdown report.docx
+markitdown data.xlsx
+markitdown slides.pptx
+markitdown screenshot.png    # OCR
+
+# URLs (works, but Jina is faster)
+markitdown https://example.com
+
+# Save output
+markitdown document.pdf > document.md
+```
+
+## Supported Formats
+
+| Format | Extensions | Notes |
+|--------|------------|-------|
+| PDF | `.pdf` | Text extraction, tables |
+| Word | `.docx` | Formatting preserved |
+| Excel | `.xlsx` | Tables to markdown |
+| PowerPoint | `.pptx` | Slides as sections |
+| Images | `.jpg`, `.png` | OCR text extraction |
+| HTML | `.html` | Clean conversion |
+| Audio | `.mp3`, `.wav` | Speech-to-text |
+| Text | `.txt`, `.csv`, `.json`, `.xml` | Pass-through/structure |
+| URLs | `https://...` | Works but slower than Jina |
+
+## Benchmarked Performance (URLs)
+
+| Tool | Avg Speed | Success Rate |
+|------|-----------|--------------|
+| Jina | **0.5s** | 10/10 |
+| markitdown | 2.5s | 9/10 |
+| Firecrawl | 4.5s | 10/10 |
+
+**Verdict**: For URLs, use Jina. For local files, markitdown is the only one of the three tools that applies.
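
The "5x faster" figure quoted for web pages follows from the averages in the table. A small worked check, using only those three numbers (the variable and function names are illustrative):

```python
# Average seconds per URL, copied from the benchmark table above.
avg_seconds = {"jina": 0.5, "markitdown": 2.5, "firecrawl": 4.5}


def speedup(baseline: str, other: str) -> float:
    """How many times faster `baseline` is than `other`, by average time."""
    return avg_seconds[other] / avg_seconds[baseline]


print(speedup("jina", "markitdown"))  # 5.0
print(speedup("jina", "firecrawl"))   # 9.0
```

These are averages over ten URLs, so the ratios are rough guidance rather than a guarantee for any single page.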
+
+## Examples
+
+```bash
+# PDF to markdown (primary use case)
+markitdown report.pdf > report.md
+
+# Excel spreadsheet
+markitdown financials.xlsx
+
+# Image with text (OCR)
+markitdown screenshot.png
+
+# PowerPoint deck
+markitdown presentation.pptx > slides.md
+
+# Audio transcription
+markitdown meeting.mp3 > transcript.md
+```
+
+## Comparison with Alternatives
+
+| Task | markitdown | Alternative |
+|------|------------|-------------|
+| PDF text | `markitdown file.pdf` | PyMuPDF, pdfplumber |
+| Word docs | `markitdown file.docx` | python-docx |
+| Excel | `markitdown file.xlsx` | pandas, openpyxl |
+| OCR | `markitdown image.png` | Tesseract |
+| Web pages | Use Jina instead | `r.jina.ai/URL` (5x faster) |
+
+**markitdown's advantage**: One CLI for all local document formats. No code needed.
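
When many files need converting, the CLI can still be driven from a short script. A hedged sketch, assuming `markitdown <path>` writes Markdown to stdout as in the examples above. `SUPPORTED`, `output_path_for`, and `convert_tree` are hypothetical names, and the extension set is a subset of the Supported Formats table.

```python
import subprocess
from pathlib import Path

# Document extensions taken from the Supported Formats table (illustrative subset).
SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx"}


def output_path_for(source: Path) -> Path:
    """Place the Markdown next to the source: report.pdf -> report.md."""
    return source.with_suffix(".md")


def convert_tree(root: Path) -> list[Path]:
    """Convert every supported file under `root`; return the files written."""
    written = []
    for source in sorted(root.rglob("*")):
        if source.suffix.lower() not in SUPPORTED:
            continue
        # Same invocation as `markitdown document.pdf > document.md`.
        result = subprocess.run(
            ["markitdown", str(source)],
            capture_output=True, text=True, encoding="utf-8", errors="replace",
        )
        if result.returncode == 0 and result.stdout:
            target = output_path_for(source)
            target.write_text(result.stdout, encoding="utf-8")
            written.append(target)
    return written
```

Files that fail to convert are skipped silently here; a real script would probably want to log `result.stderr`.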

+ 325 - 0
tests/markitdown-vs-jina/benchmark.py

@@ -0,0 +1,325 @@
+#!/usr/bin/env python3
+"""
+Benchmark: markitdown vs Jina Reader vs Firecrawl
+Compare speed, success rate, output size, and parallel execution
+"""
+
+import subprocess
+import time
+import os
+import sys
+import concurrent.futures
+from pathlib import Path
+from urllib.parse import quote
+
+# Force UTF-8 encoding on Windows
+if sys.platform == "win32":
+    import codecs
+    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.buffer, errors="replace")
+    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.buffer, errors="replace")
+
+# Test corpus - 10 URLs of varying complexity
+URLS = [
+    # News articles - use stable landing pages
+    ("guardian-tech", "https://www.theguardian.com/technology"),
+    ("bbc-news", "https://www.bbc.com/news"),
+
+    # Documentation
+    ("python-docs", "https://docs.python.org/3/library/asyncio.html"),
+    ("mdn-fetch", "https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"),
+    ("rust-book", "https://doc.rust-lang.org/book/ch04-01-what-is-ownership.html"),
+
+    # Feature-rich / Complex
+    ("github-repo", "https://github.com/microsoft/markitdown"),
+    ("hackernews", "https://news.ycombinator.com/"),
+    ("wikipedia", "https://en.wikipedia.org/wiki/Markdown"),
+
+    # Simple / Minimal
+    ("example-com", "https://example.com"),
+    ("httpbin", "https://httpbin.org/html"),
+]
+
+OUTPUT_DIR = Path(__file__).parent / "output"
+
+def fetch_with_markitdown(url: str, name: str) -> dict:
+    """Fetch URL with markitdown, return timing and output"""
+    output_file = OUTPUT_DIR / f"{name}_markitdown.md"
+    start = time.perf_counter()
+    try:
+        result = subprocess.run(
+            ["markitdown", url],
+            capture_output=True,
+            text=True,
+            timeout=60,
+            encoding="utf-8",
+            errors="replace"
+        )
+        elapsed = time.perf_counter() - start
+        output = result.stdout or ""
+        error = result.stderr or ""
+        success = result.returncode == 0 and len(output) > 50
+    except subprocess.TimeoutExpired:
+        elapsed = 60.0
+        output = ""
+        error = "TIMEOUT"
+        success = False
+    except Exception as e:
+        elapsed = time.perf_counter() - start
+        output = ""
+        error = str(e)
+        success = False
+
+    if success and output:
+        output_file.write_text(output, encoding="utf-8")
+
+    return {
+        "tool": "markitdown",
+        "name": name,
+        "url": url,
+        "time": elapsed,
+        "success": success,
+        "output_len": len(output),
+        "error": error if not success else None,
+        "output_file": str(output_file) if success else None
+    }
+
+def fetch_with_jina(url: str, name: str) -> dict:
+    """Fetch URL with Jina Reader, return timing and output"""
+    output_file = OUTPUT_DIR / f"{name}_jina.md"
+    jina_url = f"https://r.jina.ai/{url}"
+    start = time.perf_counter()
+    try:
+        result = subprocess.run(
+            ["curl", "-s", "-L", "--max-time", "60", jina_url],
+            capture_output=True,
+            text=True,
+            timeout=65,
+            encoding="utf-8",
+            errors="replace"
+        )
+        elapsed = time.perf_counter() - start
+        output = result.stdout or ""
+        error = result.stderr
+        success = result.returncode == 0 and len(output) > 100
+    except subprocess.TimeoutExpired:
+        elapsed = time.perf_counter() - start  # measured; the subprocess timeout here is 65s, not 60s
+        output = ""
+        error = "TIMEOUT"
+        success = False
+    except Exception as e:
+        elapsed = time.perf_counter() - start
+        output = ""
+        error = str(e)
+        success = False
+
+    if success and output:
+        output_file.write_text(output, encoding="utf-8")
+
+    return {
+        "tool": "jina",
+        "name": name,
+        "url": url,
+        "time": elapsed,
+        "success": success,
+        "output_len": len(output),
+        "error": error if not success else None,
+        "output_file": str(output_file) if success else None
+    }
+
+def fetch_with_firecrawl(url: str, name: str) -> dict:
+    """Fetch URL with Firecrawl, return timing and output"""
+    output_file = OUTPUT_DIR / f"{name}_firecrawl.md"
+    start = time.perf_counter()
+    try:
+        # On Windows, firecrawl is a .cmd script - need shell=True (URL quoted so & is not treated as a shell operator)
+        result = subprocess.run(
+            f'firecrawl "{url}"',
+            capture_output=True,
+            text=True,
+            timeout=90,  # Firecrawl can be slower due to JS rendering
+            encoding="utf-8",
+            errors="replace",
+            shell=True
+        )
+        elapsed = time.perf_counter() - start
+        output = result.stdout or ""
+        error = result.stderr
+        success = result.returncode == 0 and len(output) > 100
+    except subprocess.TimeoutExpired:
+        elapsed = 90.0
+        output = ""
+        error = "TIMEOUT"
+        success = False
+    except Exception as e:
+        elapsed = time.perf_counter() - start
+        output = ""
+        error = str(e)
+        success = False
+
+    if success and output:
+        output_file.write_text(output, encoding="utf-8")
+
+    return {
+        "tool": "firecrawl",
+        "name": name,
+        "url": url,
+        "time": elapsed,
+        "success": success,
+        "output_len": len(output),
+        "error": error if not success else None,
+        "output_file": str(output_file) if success else None
+    }
+
+def run_sequential():
+    """Run all tests sequentially"""
+    print("\n" + "="*60)
+    print("SEQUENTIAL EXECUTION")
+    print("="*60)
+
+    results = {"markitdown": [], "jina": [], "firecrawl": []}
+
+    for name, url in URLS:
+        print(f"\nTesting: {name}")
+        print(f"  URL: {url}")
+
+        # markitdown
+        r1 = fetch_with_markitdown(url, name)
+        status1 = "OK" if r1["success"] else "FAIL"
+        print(f"  markitdown: {r1['time']:.2f}s, {r1['output_len']:,} chars - {status1}")
+        results["markitdown"].append(r1)
+
+        # jina
+        r2 = fetch_with_jina(url, name)
+        status2 = "OK" if r2["success"] else "FAIL"
+        print(f"  jina:       {r2['time']:.2f}s, {r2['output_len']:,} chars - {status2}")
+        results["jina"].append(r2)
+
+        # firecrawl
+        r3 = fetch_with_firecrawl(url, name)
+        status3 = "OK" if r3["success"] else "FAIL"
+        print(f"  firecrawl:  {r3['time']:.2f}s, {r3['output_len']:,} chars - {status3}")
+        results["firecrawl"].append(r3)
+
+    return results
+
+def run_parallel(tool: str, max_workers: int = 5):
+    """Run all tests in parallel for a single tool"""
+    print(f"\n{'='*60}")
+    print(f"PARALLEL EXECUTION: {tool} (max_workers={max_workers})")
+    print("="*60)
+
+    fetch_fns = {
+        "markitdown": fetch_with_markitdown,
+        "jina": fetch_with_jina,
+        "firecrawl": fetch_with_firecrawl
+    }
+    fetch_fn = fetch_fns[tool]
+
+    start = time.perf_counter()
+    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
+        futures = {
+            executor.submit(fetch_fn, url, name): name
+            for name, url in URLS
+        }
+        results = []
+        for future in concurrent.futures.as_completed(futures):
+            name = futures[future]
+            result = future.result()
+            status = "OK" if result["success"] else "FAIL"
+            print(f"  {name}: {result['time']:.2f}s - {status}")
+            results.append(result)
+
+    total_time = time.perf_counter() - start
+    print(f"\nTotal parallel time: {total_time:.2f}s")
+
+    return results, total_time
+
+def print_summary(seq_results: dict, par_results: dict):
+    """Print comparison summary"""
+    print("\n" + "="*60)
+    print("SUMMARY")
+    print("="*60)
+
+    # Sequential times
+    md_times = [r["time"] for r in seq_results["markitdown"] if r["success"]]
+    jina_times = [r["time"] for r in seq_results["jina"] if r["success"]]
+    fc_times = [r["time"] for r in seq_results["firecrawl"] if r["success"]]
+
+    md_success = sum(1 for r in seq_results["markitdown"] if r["success"])
+    jina_success = sum(1 for r in seq_results["jina"] if r["success"])
+    fc_success = sum(1 for r in seq_results["firecrawl"] if r["success"])
+
+    md_chars = sum(r["output_len"] for r in seq_results["markitdown"] if r["success"])
+    jina_chars = sum(r["output_len"] for r in seq_results["jina"] if r["success"])
+    fc_chars = sum(r["output_len"] for r in seq_results["firecrawl"] if r["success"])
+
+    def safe_avg(times):
+        return sum(times)/len(times) if times else 0
+
+    print("\n## Speed (Sequential)")
+    print(f"| Metric | markitdown | Jina | Firecrawl |")
+    print(f"|--------|------------|------|-----------|")
+    print(f"| Avg time | {safe_avg(md_times):.2f}s | {safe_avg(jina_times):.2f}s | {safe_avg(fc_times):.2f}s |")
+    print(f"| Total time | {sum(md_times):.2f}s | {sum(jina_times):.2f}s | {sum(fc_times):.2f}s |")
+    print(f"| Success rate | {md_success}/{len(URLS)} | {jina_success}/{len(URLS)} | {fc_success}/{len(URLS)} |")
+
+    print("\n## Speed (Parallel, 5 workers)")
+    print(f"| Metric | markitdown | Jina | Firecrawl |")
+    print(f"|--------|------------|------|-----------|")
+    print(f"| Total time | {par_results['markitdown'][1]:.2f}s | {par_results['jina'][1]:.2f}s | {par_results['firecrawl'][1]:.2f}s |")
+
+    print("\n## Output Size")
+    print(f"| Metric | markitdown | Jina | Firecrawl |")
+    print(f"|--------|------------|------|-----------|")
+    print(f"| Total chars | {md_chars:,} | {jina_chars:,} | {fc_chars:,} |")
+    print(f"| Avg chars | {md_chars//max(md_success,1):,} | {jina_chars//max(jina_success,1):,} | {fc_chars//max(fc_success,1):,} |")
+
+    print("\n## Per-URL Comparison")
+    print(f"| URL | markitdown | Jina | Firecrawl | Winner |")
+    print(f"|-----|------------|------|-----------|--------|")
+    for i, (name, url) in enumerate(URLS):
+        md = seq_results["markitdown"][i]
+        jn = seq_results["jina"][i]
+        fc = seq_results["firecrawl"][i]
+
+        md_str = f"{md['time']:.1f}s" if md["success"] else "FAIL"
+        jn_str = f"{jn['time']:.1f}s" if jn["success"] else "FAIL"
+        fc_str = f"{fc['time']:.1f}s" if fc["success"] else "FAIL"
+
+        # Determine winner by speed among successful tools
+        successful = []
+        if md["success"]: successful.append(("markitdown", md["time"]))
+        if jn["success"]: successful.append(("Jina", jn["time"]))
+        if fc["success"]: successful.append(("Firecrawl", fc["time"]))
+
+        if successful:
+            winner = min(successful, key=lambda x: x[1])[0]
+        else:
+            winner = "None"
+
+        print(f"| {name} | {md_str} | {jn_str} | {fc_str} | {winner} |")
+
+    print(f"\nOutput files saved to: {OUTPUT_DIR}")
+
+def main():
+    # Create output directory
+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+    print("Benchmark: markitdown vs Jina Reader vs Firecrawl")
+    print(f"Testing {len(URLS)} URLs")
+
+    # Run sequential tests
+    seq_results = run_sequential()
+
+    # Run parallel tests
+    par_results = {
+        "markitdown": run_parallel("markitdown", max_workers=5),
+        "jina": run_parallel("jina", max_workers=5),
+        "firecrawl": run_parallel("firecrawl", max_workers=5),
+    }
+
+    # Print summary
+    print_summary(seq_results, par_results)
+
+if __name__ == "__main__":
+    main()
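
The three `fetch_with_*` functions share one pattern: time a subprocess, treat a zero exit code plus a minimum output length as success, and record the error otherwise. One possible consolidation is sketched below. `timed_run` and its parameters are hypothetical and not part of this commit.

```python
import subprocess
import sys
import time


def timed_run(cmd: list[str], timeout: float, min_len: int) -> dict:
    """Run `cmd`, returning elapsed time, success flag, output size, and error."""
    start = time.perf_counter()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True,
            timeout=timeout, encoding="utf-8", errors="replace",
        )
        output = result.stdout or ""
        error = result.stderr or ""
        # Same success rule as the benchmark: clean exit and enough output.
        success = result.returncode == 0 and len(output) > min_len
    except subprocess.TimeoutExpired:
        output, error, success = "", "TIMEOUT", False
    except OSError as exc:  # e.g. the tool is not installed
        output, error, success = "", str(exc), False
    return {
        "time": time.perf_counter() - start,  # measured in every branch
        "success": success,
        "output_len": len(output),
        "error": None if success else error,
    }


if __name__ == "__main__":
    # Stand-in command so the sketch runs without any of the three tools.
    demo = timed_run([sys.executable, "-c", "print('x' * 200)"], timeout=30, min_len=100)
    print(demo["success"], demo["output_len"])
```

Each tool would then reduce to a command list and two thresholds, for example `timed_run(["markitdown", url], timeout=60, min_len=50)`.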

+ 45 - 15
tools/README.md

@@ -44,6 +44,27 @@ Token-efficient CLI tools that replace verbose legacy commands. These tools are
 | JSON manual | `jq` | Structured queries and transforms |
 | YAML manual | `yq` | Same as jq for YAML/TOML |
 
+### Document Conversion
+
+| Legacy | Modern | Improvement |
+|--------|--------|-------------|
+| PyMuPDF/pdfplumber | `markitdown` | One CLI for all document types |
+| python-docx | `markitdown` | Consistent markdown output |
+| Tesseract (OCR) | `markitdown` | Built-in image text extraction |
+
+**markitdown** (Microsoft) - Convert documents to markdown:
+```bash
+pip install markitdown
+
+# Usage
+markitdown document.pdf       # PDF
+markitdown report.docx        # Word
+markitdown data.xlsx          # Excel (tables)
+markitdown slides.pptx        # PowerPoint
+markitdown image.png          # OCR
+```
+Supports: PDF, DOCX, XLSX, PPTX, images (OCR), HTML, audio (speech-to-text), CSV, JSON, XML
+
 ### Git Operations
 
 | Legacy | Modern | Improvement |
@@ -153,15 +174,16 @@ export PERPLEXITY_API_KEY="your-key-here"
 
 ### Web Fetching (URL Retrieval Hierarchy)
 
-When Claude's built-in `WebFetch` gets blocked (403, Cloudflare, etc.), use these alternatives in order:
+Benchmarked performance (10 URLs, varying complexity):
 
-| Tool | When to Use | Setup |
-|------|-------------|-------|
-| **WebFetch** | First attempt - fast, built-in | None required |
-| **Jina Reader** | JS-rendered pages, PDFs, cleaner extraction | Prefix URL with `r.jina.ai/` |
-| **Firecrawl** | Anti-bot bypass, complex scraping, structured extraction | Use `firecrawl-expert` agent |
+| Tool | Avg Speed | Success | Best For |
+|------|-----------|---------|----------|
+| **WebFetch** | Instant | Varies | First attempt - built-in |
+| **Jina Reader** | **0.5s** | 10/10 | Default fallback - 5-9x faster than the other tools benchmarked |
+| **Firecrawl** | 4-5s | 10/10 | Anti-bot bypass, Cloudflare |
+| **markitdown** | 2-3s | 9/10 | Local files + simple pages |
 
-**Jina Reader** (free tier: 10M tokens):
+**Jina Reader** (free tier: 10M tokens) - **Recommended default**:
 ```bash
 # Simple - just prefix any URL
 curl https://r.jina.ai/https://example.com
@@ -170,9 +192,9 @@ curl https://r.jina.ai/https://example.com
 curl https://s.jina.ai/your%20search%20query
 ```
 
-**Firecrawl** (requires API key):
+**Firecrawl** (requires API key) - **Anti-bot specialist**:
 ```bash
-# Simple URL scrape (globally available)
+# When Jina fails due to anti-bot
 firecrawl https://blocked-site.com
 
 # Save to file
@@ -180,19 +202,27 @@ firecrawl https://example.com -o output.md
 
 # With JSON metadata
 firecrawl https://example.com --json
-
-# For complex scraping, use the firecrawl-expert agent
 ```
 - Handles Cloudflare, Datadome, and other anti-bot systems
 - Supports interactive scraping (click, scroll, fill forms)
 - AI-powered structured data extraction
-- CLI: `E:\Projects\Coding\Firecrawl\scripts\fc.py`
+
+**markitdown** - **Local files + URLs**:
+```bash
+# URLs (slower than Jina; the page is fetched over the network but converted locally)
+markitdown https://example.com
+
+# Local files (unique capability)
+markitdown document.pdf
+markitdown report.docx
+markitdown data.xlsx
+```
 
 **Decision Tree:**
 1. Try `WebFetch` first (instant, free)
-2. If blocked/JS-heavy → Try `r.jina.ai/URL` prefix
-3. If still blocked → Try `firecrawl <url>` CLI
-4. For complex scraping/extraction → Use `firecrawl-expert` agent
+2. If blocked → Try Jina `r.jina.ai/URL` (fastest, best success rate)
+3. If anti-bot/Cloudflare → Try `firecrawl <url>` (designed for bypass)
+4. For local files (PDF, Word, Excel) → Use `markitdown`
 
 ## Token Efficiency Benchmarks