4 months ago · d969b5411e
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -56,7 +56,7 @@ On "INIT:" message at session start:
 
																 ## Quick Reference
															
 
																-**CLI Tools:** Use `rg` over grep, `fd` over find, `eza` over ls, `bat` over cat
															
 
																+**CLI Tools:** Use `rg` over grep, `fd` over find, `eza` over ls, `bat` over cat, `markitdown` for documents
															
 
																 **Web Fetching:** WebFetch → Jina (`r.jina.ai/`) → `firecrawl` → firecrawl-expert agent
															
--- a/README.md
+++ b/README.md
@@ -139,6 +139,7 @@ Then symlink or copy to your Claude directories:
 
																 | [find-replace](skills/find-replace/) | Modern find-and-replace with sd |
															
 
																 | [code-stats](skills/code-stats/) | Analyze codebase with tokei and difft |
															
 
																 | [data-processing](skills/data-processing/) | Process JSON with jq, YAML/TOML with yq |
															
 
																+| [markitdown](skills/markitdown/) | Convert PDF, Word, Excel, PowerPoint, images to markdown |
															
 
																 | [structural-search](skills/structural-search/) | Search code by AST structure with ast-grep |
															
 
																 #### Workflow Skills
															
--- a/rules/cli-tools.md
+++ b/rules/cli-tools.md
@@ -51,6 +51,23 @@ jq '.dependencies | keys' package.json
 
																 yq '.services | keys' docker-compose.yml
															
 
																 ```
															
 
																+## Document Conversion
															
 
																+
															
 
																+| Instead of | Use | Why |
															
 
																+|------------|-----|-----|
															
 
																+| PyMuPDF/pdfplumber | `markitdown` | One tool for PDF, Word, Excel, PowerPoint |
															
 
																+| python-docx | `markitdown` | Consistent markdown output |
															
 
																+| Manual OCR | `markitdown` | Built-in image text extraction |
															
 
																+
															
 
																+```bash
															
 
																+# Convert documents to markdown (use markitdown)
															
 
																+markitdown document.pdf           # PDF to markdown
															
 
																+markitdown report.docx            # Word to markdown
															
 
																+markitdown data.xlsx              # Excel to markdown tables
															
 
																+markitdown slides.pptx            # PowerPoint to markdown
															
 
																+markitdown screenshot.png         # OCR image text
															
 
																+```
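The one-file commands above extend naturally to a whole directory. The following is a minimal sketch, not part of the commit: `convert_tree`, `output_path_for`, and `DOC_EXTENSIONS` are hypothetical names, and it assumes only the `markitdown <file>` invocation shown above, which prints Markdown to stdout.

```python
import subprocess
from pathlib import Path

# Assumption: extensions taken from the shell examples above; extend as needed.
DOC_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".png"}


def output_path_for(source: Path) -> Path:
    """Return the sibling .md path for a source document."""
    return source.with_suffix(".md")


def convert_tree(root: Path, run=subprocess.run) -> list[Path]:
    """Convert every matching document under root; return the files written."""
    written = []
    # sorted() materializes the listing first, so files written below are not revisited.
    for source in sorted(root.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in DOC_EXTENSIONS:
            continue
        # Same call as the shell examples: markitdown <file>, Markdown on stdout.
        result = run(
            ["markitdown", str(source)],
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
        )
        if result.returncode != 0:
            continue  # skip files markitdown could not convert
        target = output_path_for(source)
        target.write_text(result.stdout, encoding="utf-8")
        written.append(target)
    return written
```

The `run` parameter exists only so the function can be exercised without markitdown installed; in normal use, call `convert_tree(Path("docs"))`.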
															
 
																+
															
 
																 ## Git Operations
															
 
																 | Instead of | Use | Why |
															
@@ -161,33 +178,36 @@ just test                     # Run test task
 
																 ## Web Fetching (URL Retrieval)
															
 
																-When fetching web content, use this hierarchy in order:
															
 
																+When fetching web content, use this hierarchy based on benchmarked performance:
															
 
																-| Priority | Tool | When to Use |
															
 
																-|----------|------|-------------|
															
 
																-| 1 | `WebFetch` | First attempt - fast, built-in |
															
 
																-| 2 | `r.jina.ai/URL` | JS-rendered pages, PDFs, cleaner extraction |
															
 
																-| 3 | `firecrawl <url>` | Anti-bot bypass, blocked sites (403, Cloudflare) |
															
 
																-| 4 | `firecrawl-expert` agent | Complex scraping, structured extraction |
															
 
																+| Priority | Tool | Speed | Use Case |
															
 
																+|----------|------|-------|----------|
															
 
																+| 1 | `WebFetch` | Instant | First attempt - built-in |
															
 
																+| 2 | `r.jina.ai/URL` | **0.5s avg** | Default fallback - about 5x faster than markitdown and 9x faster than Firecrawl in the benchmark |
															
 
																+| 3 | `firecrawl <url>` | 4-5s avg | Anti-bot bypass, Cloudflare, heavy JS |
															
 
																+| 4 | `markitdown <url>` | 2-3s avg | Simple static pages (or local files) |
															
 
																 ```bash
															
 
																-# Jina Reader - prefix any URL (free, 10M tokens)
															
 
																+# Jina Reader - fastest option (free, 10M tokens)
															
 
																 curl https://r.jina.ai/https://example.com
															
 
																 # Jina Search - search + fetch in one call
															
 
																 curl https://s.jina.ai/your%20search%20query
															
 
																-# Firecrawl CLI - when WebFetch gets blocked
															
 
																+# Firecrawl CLI - anti-bot bypass
															
 
																 firecrawl https://blocked-site.com
															
 
																-firecrawl https://example.com -o output.md
															
 
																 firecrawl https://example.com --json
															
 
																+
															
 
																+# markitdown - simple pages or local files
															
 
																+markitdown https://example.com
															
 
																+markitdown document.pdf
															
 
																 ```
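Both Jina endpoints are plain URL prefixes, so the curl calls above can be built programmatically. A small sketch, assuming only the two URL shapes shown in the shell block; the helper names are hypothetical.

```python
from urllib.parse import quote


def jina_reader_url(url: str) -> str:
    """Prefix a page URL with the Jina Reader endpoint."""
    return "https://r.jina.ai/" + url


def jina_search_url(query: str) -> str:
    """Percent-encode a free-text query and append it to the Jina Search endpoint."""
    return "https://s.jina.ai/" + quote(query, safe="")


print(jina_reader_url("https://example.com"))  # https://r.jina.ai/https://example.com
print(jina_search_url("your search query"))    # https://s.jina.ai/your%20search%20query
```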
															
 
																 **Decision Tree:**
															
 
																 1. Try `WebFetch` first (instant, free)
															
 
																-2. If 403/blocked/JS-heavy → Try Jina: `r.jina.ai/URL`
															
 
																-3. If still blocked → Try `firecrawl <url>`
															
 
																-4. For complex scraping → Use `firecrawl-expert` agent
															
 
																+2. If blocked → Try Jina: `r.jina.ai/URL` (fastest, 10/10 on the benchmark URLs)
															
 
																+3. If anti-bot/Cloudflare → Try `firecrawl <url>` (designed for bypass)
															
 
																+4. For local files (PDF, Word, Excel) → Use `markitdown`
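The decision tree can be read as a small pure function. This is an illustrative sketch rather than anything the commit ships: `pick_fetcher` and its `blocked` and `anti_bot` flags are hypothetical, and the caller is assumed to know whether an earlier attempt was blocked.

```python
from urllib.parse import urlparse


def pick_fetcher(target: str, blocked: bool = False, anti_bot: bool = False) -> str:
    """Return the tool the decision tree would choose for this target."""
    is_url = urlparse(target).scheme in ("http", "https")
    if not is_url:
        return "markitdown"  # step 4: local files (PDF, Word, Excel)
    if anti_bot:
        return "firecrawl"   # step 3: anti-bot or Cloudflare protection
    if blocked:
        return "jina"        # step 2: WebFetch was blocked
    return "WebFetch"        # step 1: first attempt


print(pick_fetcher("https://example.com"))                # WebFetch
print(pick_fetcher("https://example.com", blocked=True))  # jina
print(pick_fetcher("report.pdf"))                         # markitdown
```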
															
 
																 ## Reference
															
--- a/skills/markitdown/SKILL.md
+++ b/skills/markitdown/SKILL.md
@@ -0,0 +1,92 @@
 
																+---
															
 
																+name: markitdown
															
 
																+description: "Convert local documents to Markdown using Microsoft's markitdown CLI. Best for: PDF, Word, Excel, PowerPoint, images (OCR), audio. Can fetch URLs but Jina is faster for web. Triggers on: convert to markdown, read PDF, parse document, extract text from, docx, xlsx, pptx, OCR image, local file."
															
 
																+compatibility: "Requires markitdown. Install: pip install 'markitdown[all]' (the extras pull in the PDF, Office, and audio converters)"
															
 
																+allowed-tools: "Bash"
															
 
																+---
															
 
																+
															
 
																+# markitdown - Document to Markdown
															
 
																+
															
 
																+Convert local documents to clean Markdown. One tool for PDF, Word, Excel, PowerPoint, images, and more.
															
 
																+
															
 
																+## When to Use markitdown
															
 
																+
															
 
																+| Use Case | Recommendation |
															
 
																+|----------|----------------|
															
 
																+| **Local files (PDF, Word, Excel)** | ✅ **Use markitdown** - unique capability |
															
 
																+| **Web pages** | ❌ Use Jina (`r.jina.ai/`) - 5x faster |
															
 
																+| **Blocked/anti-bot sites** | ❌ Use Firecrawl |
															
 
																+| **OCR on images** | ✅ **Use markitdown** |
															
 
																+| **Audio transcription** | ✅ **Use markitdown** |
															
 
																+
															
 
																+## Basic Usage
															
 
																+
															
 
																+```bash
															
 
																+# Local files (primary use case)
															
 
																+markitdown document.pdf
															
 
																+markitdown report.docx
															
 
																+markitdown data.xlsx
															
 
																+markitdown slides.pptx
															
 
																+markitdown screenshot.png    # OCR
															
 
																+
															
 
																+# URLs (works, but Jina is faster)
															
 
																+markitdown https://example.com
															
 
																+
															
 
																+# Save output
															
 
																+markitdown document.pdf > document.md
															
 
																+```
															
 
																+
															
 
																+## Supported Formats
															
 
																+
															
 
																+| Format | Extensions | Notes |
															
 
																+|--------|------------|-------|
															
 
																+| PDF | `.pdf` | Text extraction, tables |
															
 
																+| Word | `.docx` | Formatting preserved |
															
 
																+| Excel | `.xlsx` | Tables to markdown |
															
 
																+| PowerPoint | `.pptx` | Slides as sections |
															
 
																+| Images | `.jpg`, `.png` | OCR text extraction |
															
 
																+| HTML | `.html` | Clean conversion |
															
 
																+| Audio | `.mp3`, `.wav` | Speech-to-text |
															
 
																+| Text | `.txt`, `.csv`, `.json`, `.xml` | Pass-through/structure |
															
 
																+| URLs | `https://...` | Works but slower than Jina |
															
 
																+
															
 
																+## Benchmarked Performance (URLs)
															
 
																+
															
 
																+| Tool | Avg Speed | Success Rate |
															
 
																+|------|-----------|--------------|
															
 
																+| Jina | **0.5s** | 10/10 |
															
 
																+| markitdown | 2.5s | 9/10 |
															
 
																+| Firecrawl | 4.5s | 10/10 |
															
 
																+
															
 
																+**Verdict**: For URLs, use Jina. For local files, markitdown is the only one of the three tools that converts them.
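The "5x faster" figure quoted elsewhere in this skill follows directly from the averages in the table. A quick check, using only the three numbers above; `slowdown_vs` is a hypothetical helper name.

```python
# Average times in seconds, copied from the benchmark table above.
AVG_SECONDS = {"Jina": 0.5, "markitdown": 2.5, "Firecrawl": 4.5}


def slowdown_vs(baseline: str, timings: dict[str, float]) -> dict[str, float]:
    """Return how many times slower each tool is than the baseline."""
    base = timings[baseline]
    return {tool: secs / base for tool, secs in timings.items() if tool != baseline}


ratios = slowdown_vs("Jina", AVG_SECONDS)
print(ratios)  # {'markitdown': 5.0, 'Firecrawl': 9.0}
```

With ten URLs per tool these ratios are indicative rather than statistically robust.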
															
 
																+
															
 
																+## Examples
															
 
																+
															
 
																+```bash
															
 
																+# PDF to markdown (primary use case)
															
 
																+markitdown report.pdf > report.md
															
 
																+
															
 
																+# Excel spreadsheet
															
 
																+markitdown financials.xlsx
															
 
																+
															
 
																+# Image with text (OCR)
															
 
																+markitdown screenshot.png
															
 
																+
															
 
																+# PowerPoint deck
															
 
																+markitdown presentation.pptx > slides.md
															
 
																+
															
 
																+# Audio transcription
															
 
																+markitdown meeting.mp3 > transcript.md
															
 
																+```
															
 
																+
															
 
																+## Comparison with Alternatives
															
 
																+
															
 
																+| Task | markitdown | Alternative |
															
 
																+|------|------------|-------------|
															
 
																+| PDF text | `markitdown file.pdf` | PyMuPDF, pdfplumber |
															
 
																+| Word docs | `markitdown file.docx` | python-docx |
															
 
																+| Excel | `markitdown file.xlsx` | pandas, openpyxl |
															
 
																+| OCR | `markitdown image.png` | Tesseract |
															
 
																+| Web pages | Use Jina instead | `r.jina.ai/URL` (5x faster) |
															
 
																+
															
 
																+**markitdown's advantage**: One CLI for all local document formats. No code needed.
															
--- a/tests/markitdown-vs-jina/benchmark.py
+++ b/tests/markitdown-vs-jina/benchmark.py
@@ -0,0 +1,325 @@
 
																+#!/usr/bin/env python3
															
 
																+"""
															
 
																+Benchmark: markitdown vs Jina Reader vs Firecrawl
															
 
																+Compare speed, accuracy, formatting, and parallel execution
															
 
																+"""
															
 
																+
															
 
																+import subprocess
															
 
																+import time
															
 
																+import os
															
 
																+import sys
															
 
																+import concurrent.futures
															
 
																+from pathlib import Path
															
 
																+from urllib.parse import quote
															
 
																+
															
 
																+# Force UTF-8 encoding on Windows
															
 
																+if sys.platform == "win32":
															
 
																+    import codecs
															
 
																+    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.buffer, errors="replace")
															
 
																+    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.buffer, errors="replace")
															
 
																+
															
 
																+# Test corpus - 10 URLs of varying complexity
															
 
																+URLS = [
															
 
																+    # News articles - use stable landing pages
															
 
																+    ("guardian-tech", "https://www.theguardian.com/technology"),
															
 
																+    ("bbc-news", "https://www.bbc.com/news"),
															
 
																+
															
 
																+    # Documentation
															
 
																+    ("python-docs", "https://docs.python.org/3/library/asyncio.html"),
															
 
																+    ("mdn-fetch", "https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"),
															
 
																+    ("rust-book", "https://doc.rust-lang.org/book/ch04-01-what-is-ownership.html"),
															
 
																+
															
 
																+    # Feature-rich / Complex
															
 
																+    ("github-repo", "https://github.com/microsoft/markitdown"),
															
 
																+    ("hackernews", "https://news.ycombinator.com/"),
															
 
																+    ("wikipedia", "https://en.wikipedia.org/wiki/Markdown"),
															
 
																+
															
 
																+    # Simple / Minimal
															
 
																+    ("example-com", "https://example.com"),
															
 
																+    ("httpbin", "https://httpbin.org/html"),
															
 
																+]
															
 
																+
															
 
																+OUTPUT_DIR = Path(__file__).parent / "output"
															
 
																+
															
 
																+def fetch_with_markitdown(url: str, name: str) -> dict:
															
 
																+    """Fetch URL with markitdown, return timing and output"""
															
 
																+    output_file = OUTPUT_DIR / f"{name}_markitdown.md"
															
 
																+    start = time.perf_counter()
															
 
																+    try:
															
 
																+        result = subprocess.run(
															
 
																+            ["markitdown", url],
															
 
																+            capture_output=True,
															
 
																+            text=True,
															
 
																+            timeout=60,
															
 
																+            encoding="utf-8",
															
 
																+            errors="replace"
															
 
																+        )
															
 
																+        elapsed = time.perf_counter() - start
															
 
																+        output = result.stdout or ""
															
 
																+        error = result.stderr or ""
															
 
																+        success = result.returncode == 0 and len(output) > 50
															
 
																+    except subprocess.TimeoutExpired:
															
 
																+        elapsed = 60.0
															
 
																+        output = ""
															
 
																+        error = "TIMEOUT"
															
 
																+        success = False
															
 
																+    except Exception as e:
															
 
																+        elapsed = time.perf_counter() - start
															
 
																+        output = ""
															
 
																+        error = str(e)
															
 
																+        success = False
															
 
																+
															
 
																+    if success and output:
															
 
																+        output_file.write_text(output, encoding="utf-8")
															
 
																+
															
 
																+    return {
															
 
																+        "tool": "markitdown",
															
 
																+        "name": name,
															
 
																+        "url": url,
															
 
																+        "time": elapsed,
															
 
																+        "success": success,
															
 
																+        "output_len": len(output),
															
 
																+        "error": error if not success else None,
															
 
																+        "output_file": str(output_file) if success else None
															
 
																+    }
															
 
																+
															
 
																+def fetch_with_jina(url: str, name: str) -> dict:
															
 
																+    """Fetch URL with Jina Reader, return timing and output"""
															
 
																+    output_file = OUTPUT_DIR / f"{name}_jina.md"
															
 
																+    jina_url = f"https://r.jina.ai/{url}"
															
 
																+    start = time.perf_counter()
															
 
																+    try:
															
 
																+        result = subprocess.run(
															
 
																+            ["curl", "-s", "-L", "--max-time", "60", jina_url],
															
 
																+            capture_output=True,
															
 
																+            text=True,
															
 
																+            timeout=65,
															
 
																+            encoding="utf-8",
															
 
																+            errors="replace"
															
 
																+        )
															
 
																+        elapsed = time.perf_counter() - start
															
 
																+        output = result.stdout or ""
															
 
																+        error = result.stderr
															
 
																+        success = result.returncode == 0 and len(output) > 100
															
 
																+    except subprocess.TimeoutExpired:
															
 
																+        elapsed = time.perf_counter() - start  # subprocess timeout is 65s here, not 60s
															
 
																+        output = ""
															
 
																+        error = "TIMEOUT"
															
 
																+        success = False
															
 
																+    except Exception as e:
															
 
																+        elapsed = time.perf_counter() - start
															
 
																+        output = ""
															
 
																+        error = str(e)
															
 
																+        success = False
															
 
																+
															
 
																+    if success and output:
															
 
																+        output_file.write_text(output, encoding="utf-8")
															
 
																+
															
 
																+    return {
															
 
																+        "tool": "jina",
															
 
																+        "name": name,
															
 
																+        "url": url,
															
 
																+        "time": elapsed,
															
 
																+        "success": success,
															
 
																+        "output_len": len(output),
															
 
																+        "error": error if not success else None,
															
 
																+        "output_file": str(output_file) if success else None
															
 
																+    }
															
 
																+
															
 
																+def fetch_with_firecrawl(url: str, name: str) -> dict:
															
 
																+    """Fetch URL with Firecrawl, return timing and output"""
															
 
																+    output_file = OUTPUT_DIR / f"{name}_firecrawl.md"
															
 
																+    start = time.perf_counter()
															
 
																+    try:
															
 
																+        # On Windows, firecrawl is a .cmd script - need shell=True
															
 
																+        result = subprocess.run(
															
 
																+            f'firecrawl "{url}"',  # quote the URL so the shell does not interpret & or ?
															
 
																+            capture_output=True,
															
 
																+            text=True,
															
 
																+            timeout=90,  # Firecrawl can be slower due to JS rendering
															
 
																+            encoding="utf-8",
															
 
																+            errors="replace",
															
 
																+            shell=True
															
 
																+        )
															
 
																+        elapsed = time.perf_counter() - start
															
 
																+        output = result.stdout or ""
															
 
																+        error = result.stderr
															
 
																+        success = result.returncode == 0 and len(output) > 100
															
 
																+    except subprocess.TimeoutExpired:
															
 
																+        elapsed = 90.0
															
 
																+        output = ""
															
 
																+        error = "TIMEOUT"
															
 
																+        success = False
															
 
																+    except Exception as e:
															
 
																+        elapsed = time.perf_counter() - start
															
 
																+        output = ""
															
 
																+        error = str(e)
															
 
																+        success = False
															
 
																+
															
 
																+    if success and output:
															
 
																+        output_file.write_text(output, encoding="utf-8")
															
 
																+
															
 
																+    return {
															
 
																+        "tool": "firecrawl",
															
 
																+        "name": name,
															
 
																+        "url": url,
															
 
																+        "time": elapsed,
															
 
																+        "success": success,
															
 
																+        "output_len": len(output),
															
 
																+        "error": error if not success else None,
															
 
																+        "output_file": str(output_file) if success else None
															
 
																+    }
															
 
																+
															
 
																+def run_sequential():
															
 
																+    """Run all tests sequentially"""
															
 
																+    print("\n" + "="*60)
															
 
																+    print("SEQUENTIAL EXECUTION")
															
 
																+    print("="*60)
															
 
																+
															
 
																+    results = {"markitdown": [], "jina": [], "firecrawl": []}
															
 
																+
															
 
																+    for name, url in URLS:
															
 
																+        print(f"\nTesting: {name}")
															
 
																+        print(f"  URL: {url}")
															
 
																+
															
 
																+        # markitdown
															
 
																+        r1 = fetch_with_markitdown(url, name)
															
 
																+        status1 = "OK" if r1["success"] else "FAIL"
															
 
																+        print(f"  markitdown: {r1['time']:.2f}s, {r1['output_len']:,} chars - {status1}")
															
 
																+        results["markitdown"].append(r1)
															
 
																+
															
 
																+        # jina
															
 
																+        r2 = fetch_with_jina(url, name)
															
 
																+        status2 = "OK" if r2["success"] else "FAIL"
															
 
																+        print(f"  jina:       {r2['time']:.2f}s, {r2['output_len']:,} chars - {status2}")
															
 
																+        results["jina"].append(r2)
															
 
																+
															
 
																+        # firecrawl
															
 
																+        r3 = fetch_with_firecrawl(url, name)
															
 
																+        status3 = "OK" if r3["success"] else "FAIL"
															
 
																+        print(f"  firecrawl:  {r3['time']:.2f}s, {r3['output_len']:,} chars - {status3}")
															
 
																+        results["firecrawl"].append(r3)
															
 
																+
															
 
																+    return results
															
 
																+
															
 
																+def run_parallel(tool: str, max_workers: int = 5):
															
 
																+    """Run all tests in parallel for a single tool"""
															
 
																+    print(f"\n{'='*60}")
															
 
																+    print(f"PARALLEL EXECUTION: {tool} (max_workers={max_workers})")
															
 
																+    print("="*60)
															
 
																+
															
 
																+    fetch_fns = {
															
 
																+        "markitdown": fetch_with_markitdown,
															
 
																+        "jina": fetch_with_jina,
															
 
																+        "firecrawl": fetch_with_firecrawl
															
 
																+    }
															
 
																+    fetch_fn = fetch_fns[tool]
															
 
																+
															
 
																+    start = time.perf_counter()
															
 
																+    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
															
 
																+        futures = {
															
 
																+            executor.submit(fetch_fn, url, name): name
															
 
																+            for name, url in URLS
															
 
																+        }
															
 
																+        results = []
															
 
																+        for future in concurrent.futures.as_completed(futures):
															
 
																+            name = futures[future]
															
 
																+            result = future.result()
															
 
																+            status = "OK" if result["success"] else "FAIL"
															
 
																+            print(f"  {name}: {result['time']:.2f}s - {status}")
															
 
																+            results.append(result)
															
 
																+
															
 
																+    total_time = time.perf_counter() - start
															
 
																+    print(f"\nTotal parallel time: {total_time:.2f}s")
															
 
																+
															
 
																+    return results, total_time
															
 
																+
															
 
																+def print_summary(seq_results: dict, par_results: dict):
															
 
																+    """Print comparison summary"""
															
 
																+    print("\n" + "="*60)
															
 
																+    print("SUMMARY")
															
 
																+    print("="*60)
															
 
																+
															
 
																+    # Sequential times
															
 
																+    md_times = [r["time"] for r in seq_results["markitdown"] if r["success"]]
															
 
																+    jina_times = [r["time"] for r in seq_results["jina"] if r["success"]]
															
 
																+    fc_times = [r["time"] for r in seq_results["firecrawl"] if r["success"]]
															
 
																+
															
 
																+    md_success = sum(1 for r in seq_results["markitdown"] if r["success"])
															
 
																+    jina_success = sum(1 for r in seq_results["jina"] if r["success"])
															
 
																+    fc_success = sum(1 for r in seq_results["firecrawl"] if r["success"])
															
 
																+
															
 
																+    md_chars = sum(r["output_len"] for r in seq_results["markitdown"] if r["success"])
															
 
																+    jina_chars = sum(r["output_len"] for r in seq_results["jina"] if r["success"])
															
 
																+    fc_chars = sum(r["output_len"] for r in seq_results["firecrawl"] if r["success"])
															
 
																+
															
 
																+    def safe_avg(times):
															
 
																+        return sum(times)/len(times) if times else 0
															
 
																+
															
 
																+    print("\n## Speed (Sequential)")
															
 
																+    print(f"| Metric | markitdown | Jina | Firecrawl |")
															
 
																+    print(f"|--------|------------|------|-----------|")
															
 
																+    print(f"| Avg time | {safe_avg(md_times):.2f}s | {safe_avg(jina_times):.2f}s | {safe_avg(fc_times):.2f}s |")
															
 
																+    print(f"| Total time | {sum(md_times):.2f}s | {sum(jina_times):.2f}s | {sum(fc_times):.2f}s |")
															
 
																+    print(f"| Success rate | {md_success}/{len(URLS)} | {jina_success}/{len(URLS)} | {fc_success}/{len(URLS)} |")
															
 
																+
															
 
																+    print("\n## Speed (Parallel, 5 workers)")
															
 
																+    print(f"| Metric | markitdown | Jina | Firecrawl |")
															
 
																+    print(f"|--------|------------|------|-----------|")
															
 
																+    print(f"| Total time | {par_results['markitdown'][1]:.2f}s | {par_results['jina'][1]:.2f}s | {par_results['firecrawl'][1]:.2f}s |")
															
 
																+
															
 
																+    print("\n## Output Size")
															
 
																+    print(f"| Metric | markitdown | Jina | Firecrawl |")
															
 
																+    print(f"|--------|------------|------|-----------|")
															
 
																+    print(f"| Total chars | {md_chars:,} | {jina_chars:,} | {fc_chars:,} |")
															
 
																+    print(f"| Avg chars | {md_chars//max(md_success,1):,} | {jina_chars//max(jina_success,1):,} | {fc_chars//max(fc_success,1):,} |")
															
 
																+
															
 
																+    print("\n## Per-URL Comparison")
															
 
																+    print(f"| URL | markitdown | Jina | Firecrawl | Winner |")
															
 
																+    print(f"|-----|------------|------|-----------|--------|")
															
 
																+    for i, (name, url) in enumerate(URLS):
															
 
																+        md = seq_results["markitdown"][i]
															
 
																+        jn = seq_results["jina"][i]
															
 
																+        fc = seq_results["firecrawl"][i]
															
 
																+
															
 
																+        md_str = f"{md['time']:.1f}s" if md["success"] else "FAIL"
															
 
																+        jn_str = f"{jn['time']:.1f}s" if jn["success"] else "FAIL"
															
 
																+        fc_str = f"{fc['time']:.1f}s" if fc["success"] else "FAIL"
															
 
																+
															
 
																+        # Determine winner by speed among successful tools
															
 
																+        successful = []
															
 
																+        if md["success"]: successful.append(("markitdown", md["time"]))
															
 
																+        if jn["success"]: successful.append(("Jina", jn["time"]))
															
 
																+        if fc["success"]: successful.append(("Firecrawl", fc["time"]))
															
 
																+
															
 
																+        if successful:
															
 
																+            winner = min(successful, key=lambda x: x[1])[0]
															
 
																+        else:
															
 
																+            winner = "None"
															
 
																+
															
 
																+        print(f"| {name} | {md_str} | {jn_str} | {fc_str} | {winner} |")
															
 
																+
															
 
																+    print(f"\nOutput files saved to: {OUTPUT_DIR}")
															
 
																+
															
 
																+def main():
															
 
																+    # Create output directory
															
 
																+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
															
 
																+
															
 
																+    print("Benchmark: markitdown vs Jina Reader vs Firecrawl")
															
 
																+    print(f"Testing {len(URLS)} URLs")
															
 
																+
															
 
																+    # Run sequential tests
															
 
																+    seq_results = run_sequential()
															
 
																+
															
 
																+    # Run parallel tests
															
 
																+    par_results = {
															
 
																+        "markitdown": run_parallel("markitdown", max_workers=5),
															
 
																+        "jina": run_parallel("jina", max_workers=5),
															
 
																+        "firecrawl": run_parallel("firecrawl", max_workers=5),
															
 
																+    }
															
 
																+
															
 
																+    # Print summary
															
 
																+    print_summary(seq_results, par_results)
															
 
																+
															
 
																+if __name__ == "__main__":
															
 
																+    main()
															
--- a/tools/README.md
+++ b/tools/README.md
@@ -44,6 +44,27 @@ Token-efficient CLI tools that replace verbose legacy commands. These tools are
 
																 | JSON manual | `jq` | Structured queries and transforms |
															
 
																 | YAML manual | `yq` | Same as jq for YAML/TOML |
															
 
																+### Document Conversion
															
 
																+
															
 
																+| Legacy | Modern | Improvement |
															
 
																+|--------|--------|-------------|
															
 
																+| PyMuPDF/pdfplumber | `markitdown` | One CLI for all document types |
															
 
																+| python-docx | `markitdown` | Consistent markdown output |
															
 
																+| Tesseract (OCR) | `markitdown` | Built-in image text extraction |
															
 
																+
															
 
																+**markitdown** (Microsoft) - Convert documents to markdown:
															
 
																+```bash
															
 
																+pip install 'markitdown[all]'  # [all] pulls in the optional PDF/Office converters
															
 
																+
															
 
																+# Usage
															
 
																+markitdown document.pdf       # PDF
															
 
																+markitdown report.docx        # Word
															
 
																+markitdown data.xlsx          # Excel (tables)
															
 
																+markitdown slides.pptx        # PowerPoint
															
 
																+markitdown image.png          # OCR
															
 
																+```
															
 
																+Supports: PDF, DOCX, XLSX, PPTX, images (OCR), HTML, audio (speech-to-text), CSV, JSON, XML
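For converting many files at once, the per-file commands above can be generated rather than typed. The following is a minimal sketch, not part of this change: `build_commands`, the `SUPPORTED` set, and the `converted` output directory are hypothetical names, the extension list is an assumed subset of the formats listed above, and the `-o` output flag should be checked against `markitdown --help` for the installed version.

```python
from pathlib import Path

# Assumed subset of the formats listed above; adjust to what your install supports.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx", ".png", ".html", ".csv", ".json", ".xml"}


def build_commands(paths, out_dir="converted"):
    """Return one markitdown command (as an argv list) per supported file."""
    commands = []
    for p in map(Path, paths):
        # Compare suffixes case-insensitively so "slides.PPTX" is accepted.
        if p.suffix.lower() in SUPPORTED:
            target = Path(out_dir) / (p.stem + ".md")
            commands.append(["markitdown", str(p), "-o", str(target)])
    return commands


if __name__ == "__main__":
    # Unsupported files (here: notes.txt) are skipped silently.
    for cmd in build_commands(["report.docx", "notes.txt", "slides.PPTX"]):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists; returning argv lists instead of shell strings avoids quoting problems with file names that contain spaces.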
															
 
																+
															
 
																 ### Git Operations
															
 
																 | Legacy | Modern | Improvement |
															
@@ -153,15 +174,16 @@ export PERPLEXITY_API_KEY="your-key-here"
 
																 ### Web Fetching (URL Retrieval Hierarchy)
															
 
																-When Claude's built-in `WebFetch` gets blocked (403, Cloudflare, etc.), use these alternatives in order:
															
 
																+Benchmarked performance (10 URLs, varying complexity):
															
 
																-| Tool | When to Use | Setup |
															
 
																-|------|-------------|-------|
															
 
																-| **WebFetch** | First attempt - fast, built-in | None required |
															
 
																-| **Jina Reader** | JS-rendered pages, PDFs, cleaner extraction | Prefix URL with `r.jina.ai/` |
															
 
																-| **Firecrawl** | Anti-bot bypass, complex scraping, structured extraction | Use `firecrawl-expert` agent |
															
 
																+| Tool | Avg Speed | Success | Best For |
															
 
																+|------|-----------|---------|----------|
															
 
																+| **WebFetch** | Instant | Varies | First attempt - built-in |
															
 
																+| **Jina Reader** | **0.5s** | 10/10 | Default fallback - roughly 5-10x faster than markitdown and Firecrawl |
															
 
																+| **Firecrawl** | 4-5s | 10/10 | Anti-bot bypass, Cloudflare |
															
 
																+| **markitdown** | 2-3s | 9/10 | Local files + simple pages |
															
 
																-**Jina Reader** (free tier: 10M tokens):
															
 
																+**Jina Reader** (free tier: 10M tokens) - **Recommended default**:
															
 
																 ```bash
															
 
																 # Simple - just prefix any URL
															
 
																 curl https://r.jina.ai/https://example.com
															
@@ -170,9 +192,9 @@ curl https://r.jina.ai/https://example.com
 
																 curl https://s.jina.ai/your%20search%20query
															
 
																 ```
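The two `curl` calls above only differ in the prefix and in how the argument is encoded. A small sketch of building those URLs programmatically follows; `reader_url` and `search_url` are hypothetical helper names, and only the `r.jina.ai` and `s.jina.ai` prefixes come from the examples above.

```python
from urllib.parse import quote

READER_PREFIX = "https://r.jina.ai/"
SEARCH_PREFIX = "https://s.jina.ai/"


def reader_url(url: str) -> str:
    """Prefix a page URL so it is fetched through Jina Reader."""
    return READER_PREFIX + url


def search_url(query: str) -> str:
    """Percent-encode a free-text query for the search endpoint."""
    # safe="" also encodes "/" so a query cannot be mistaken for a path.
    return SEARCH_PREFIX + quote(query, safe="")


if __name__ == "__main__":
    print(reader_url("https://example.com"))
    print(search_url("your search query"))
```

The resulting strings can be passed to `curl` or any HTTP client unchanged.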
															
 
																-**Firecrawl** (requires API key):
															
 
																+**Firecrawl** (requires API key) - **Anti-bot specialist**:
															
 
																 ```bash
															
 
																-# Simple URL scrape (globally available)
															
 
																+# When Jina fails due to anti-bot
															
 
																 firecrawl https://blocked-site.com
															
 
																 # Save to file
															
@@ -180,19 +202,27 @@ firecrawl https://example.com -o output.md
 
																 # With JSON metadata
															
 
																 firecrawl https://example.com --json
															
 
																-
															
 
																-# For complex scraping, use the firecrawl-expert agent
															
 
																 ```
															
 
																 - Handles Cloudflare, Datadome, and other anti-bot systems
															
 
																 - Supports interactive scraping (click, scroll, fill forms)
															
 
																 - AI-powered structured data extraction
															
 
																-- CLI: `E:\Projects\Coding\Firecrawl\scripts\fc.py`
															
 
																+
															
 
																+**markitdown** - **Local files + URLs**:
															
 
																+```bash
															
 
																+# URLs (slower than Jina; runs locally without an API service, but still needs network access)
															
 
																+markitdown https://example.com
															
 
																+
															
 
																+# Local files (unique capability)
															
 
																+markitdown document.pdf
															
 
																+markitdown report.docx
															
 
																+markitdown data.xlsx
															
 
																+```
															
 
																 **Decision Tree:**
															
 
																 1. Try `WebFetch` first (instant, free)
															
 
																-2. If blocked/JS-heavy → Try `r.jina.ai/URL` prefix
															
 
																-3. If still blocked → Try `firecrawl <url>` CLI
															
 
																-4. For complex scraping/extraction → Use `firecrawl-expert` agent
															
 
																+2. If blocked → Try Jina `r.jina.ai/URL` (fastest; 10/10 success in the benchmark)
															
 
																+3. If anti-bot/Cloudflare → Try `firecrawl <url>` (designed for bypass)
															
 
																+4. For local files (PDF, Word, Excel) → Use `markitdown`
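The decision tree can be read as an ordered fallback list. The sketch below is illustrative only: `fetch_order` and `next_tool` are hypothetical helpers, the tool names are plain labels rather than real function calls, and treating anything that is not an `http(s)` URL as a local file is a simplifying assumption.

```python
def fetch_order(target: str) -> list:
    """Return the tools to try, in order, for a URL or a local file."""
    if not target.startswith(("http://", "https://")):
        # Step 4: local documents go straight to markitdown.
        return ["markitdown"]
    # Steps 1-3: built-in fetch first, then Jina, then Firecrawl.
    return ["WebFetch", "jina", "firecrawl"]


def next_tool(target: str, failed: set):
    """Pick the first tool in the order that has not already failed."""
    for tool in fetch_order(target):
        if tool not in failed:
            return tool
    return None  # every option is exhausted


if __name__ == "__main__":
    print(next_tool("https://example.com", {"WebFetch"}))
    print(next_tool("report.pdf", set()))
```

Keeping the order in one function means a change to the hierarchy (for example, promoting Jina ahead of `WebFetch`) is a one-line edit.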
															
 
																 ## Token Efficiency Benchmarks