4 months ago · d969b5411e
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -56,7 +56,7 @@ On "INIT:" message at session start:
 
				 
			
 
				 ## Quick Reference
			
 
				 
			
 
				-**CLI Tools:** Use `rg` over grep, `fd` over find, `eza` over ls, `bat` over cat
			
 
				+**CLI Tools:** Use `rg` over grep, `fd` over find, `eza` over ls, `bat` over cat, `markitdown` for documents
			
 
				 
			
 
				 **Web Fetching:** WebFetch → Jina (`r.jina.ai/`) → `firecrawl` → firecrawl-expert agent
			
 
				 
			
--- a/README.md
+++ b/README.md
@@ -139,6 +139,7 @@ Then symlink or copy to your Claude directories:
 
				 | [find-replace](skills/find-replace/) | Modern find-and-replace with sd |
			
 
				 | [code-stats](skills/code-stats/) | Analyze codebase with tokei and difft |
			
 
				 | [data-processing](skills/data-processing/) | Process JSON with jq, YAML/TOML with yq |
			
 
				+| [markitdown](skills/markitdown/) | Convert PDF, Word, Excel, PowerPoint, images to markdown |
			
 
				 | [structural-search](skills/structural-search/) | Search code by AST structure with ast-grep |
			
 
				 
			
 
				 #### Workflow Skills
			
--- a/rules/cli-tools.md
+++ b/rules/cli-tools.md
@@ -51,6 +51,23 @@ jq '.dependencies | keys' package.json
 
				 yq '.services | keys' docker-compose.yml
			
 
				 ```
			
 
				 
			
 
				+## Document Conversion
			
 
				+
			
 
				+| Instead of | Use | Why |
			
 
				+|------------|-----|-----|
			
 
				+| PyMuPDF/pdfplumber | `markitdown` | One tool for PDF, Word, Excel, PowerPoint |
			
 
				+| python-docx | `markitdown` | Consistent markdown output |
			
 
				+| Manual OCR | `markitdown` | Built-in image text extraction |
			
 
				+
			
 
				+```bash
			
 
				+# Convert documents to markdown (use markitdown)
			
 
				+markitdown document.pdf           # PDF to markdown
			
 
				+markitdown report.docx            # Word to markdown
			
 
				+markitdown data.xlsx              # Excel to markdown tables
			
 
				+markitdown slides.pptx            # PowerPoint to markdown
			
 
				+markitdown screenshot.png         # OCR image text
			
 
				+```
			
 
				+
			
 
				 ## Git Operations
			
 
				 
			
 
				 | Instead of | Use | Why |
			
@@ -161,33 +178,36 @@ just test                     # Run test task
 
				 
			
 
				 ## Web Fetching (URL Retrieval)
			
 
				 
			
 
				-When fetching web content, use this hierarchy in order:
			
 
				+When fetching web content, use this hierarchy based on benchmarked performance:
			
 
				 
			
 
				-| Priority | Tool | When to Use |
			
 
				-|----------|------|-------------|
			
 
				-| 1 | `WebFetch` | First attempt - fast, built-in |
			
 
				-| 2 | `r.jina.ai/URL` | JS-rendered pages, PDFs, cleaner extraction |
			
 
				-| 3 | `firecrawl <url>` | Anti-bot bypass, blocked sites (403, Cloudflare) |
			
 
				-| 4 | `firecrawl-expert` agent | Complex scraping, structured extraction |
			
 
				+| Priority | Tool | Speed | Use Case |
			
 
				+|----------|------|-------|----------|
			
 
				+| 1 | `WebFetch` | Instant | First attempt - built-in |
			
 
				+| 2 | `r.jina.ai/URL` | **0.5s avg** | Default fallback - 5-10x faster than alternatives |
			
 
				+| 3 | `firecrawl <url>` | 4-5s avg | Anti-bot bypass, Cloudflare, heavy JS |
			
 
				+| 4 | `markitdown <url>` | 2-3s avg | Simple static pages (or local files) |
			
 
				 
			
 
				 ```bash
			
 
				-# Jina Reader - prefix any URL (free, 10M tokens)
			
 
				+# Jina Reader - fastest option (free, 10M tokens)
			
 
				 curl https://r.jina.ai/https://example.com
			
 
				 
			
 
				 # Jina Search - search + fetch in one call
			
 
				 curl https://s.jina.ai/your%20search%20query
			
 
				 
			
 
				-# Firecrawl CLI - when WebFetch gets blocked
			
 
				+# Firecrawl CLI - anti-bot bypass
			
 
				 firecrawl https://blocked-site.com
			
 
				-firecrawl https://example.com -o output.md
			
 
				 firecrawl https://example.com --json
			
 
				+
			
 
				+# markitdown - simple pages or local files
			
 
				+markitdown https://example.com
			
 
				+markitdown document.pdf
			
 
				 ```
			
 
				 
			
 
				 **Decision Tree:**
			
 
				 1. Try `WebFetch` first (instant, free)
			
 
				-2. If 403/blocked/JS-heavy → Try Jina: `r.jina.ai/URL`
			
 
				-3. If still blocked → Try `firecrawl <url>`
			
 
				-4. For complex scraping → Use `firecrawl-expert` agent
			
 
				+2. If blocked → Try Jina: `r.jina.ai/URL` (fastest, 10/10 success rate)
			
 
				+3. If anti-bot/Cloudflare → Try `firecrawl <url>` (designed for bypass)
			
 
				+4. For local files (PDF, Word, Excel) → Use `markitdown`
			
 
				 
			
 
				 ## Reference
			
 
				 
			
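The four-step decision tree added to `rules/cli-tools.md` can be read as a small routing function. The sketch below is illustrative only: `choose_tool` and its `blocked` and `anti_bot` flags are hypothetical names that are not part of this commit, and it assumes that any target which is not an `http(s)` URL is a local file.

```python
def choose_tool(target: str, blocked: bool = False, anti_bot: bool = False) -> str:
    """Return the tool name suggested by the decision tree (hypothetical helper)."""
    # Step 4: local files (PDF, Word, Excel) go to markitdown.
    # Assumption: anything that is not an http(s) URL is a local path.
    if not target.lower().startswith(("http://", "https://")):
        return "markitdown"
    # Step 3: anti-bot or Cloudflare protection calls for firecrawl.
    if anti_bot:
        return "firecrawl"
    # Step 2: a plain block (for example a 403) falls back to Jina Reader.
    if blocked:
        return "jina"
    # Step 1: otherwise try the built-in WebFetch first.
    return "WebFetch"


print(choose_tool("https://example.com"))
print(choose_tool("https://example.com", blocked=True))
print(choose_tool("report.pdf"))
```

The local-file check comes first because none of the URL-oriented steps apply to a path on disk.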
--- a/skills/markitdown/SKILL.md
+++ b/skills/markitdown/SKILL.md
@@ -0,0 +1,92 @@
 
				+---
			
 
				+name: markitdown
			
 
				+description: "Convert local documents to Markdown using Microsoft's markitdown CLI. Best for: PDF, Word, Excel, PowerPoint, images (OCR), audio. Can fetch URLs but Jina is faster for web. Triggers on: convert to markdown, read PDF, parse document, extract text from, docx, xlsx, pptx, OCR image, local file."
			
 
				+compatibility: "Requires markitdown. Install: pip install markitdown"
			
 
				+allowed-tools: "Bash"
			
 
				+---
			
 
				+
			
 
				+# markitdown - Document to Markdown
			
 
				+
			
 
				+Convert local documents to clean Markdown. One tool for PDF, Word, Excel, PowerPoint, images, and more.
			
 
				+
			
 
				+## When to Use markitdown
			
 
				+
			
 
				+| Use Case | Recommendation |
			
 
				+|----------|----------------|
			
 
				+| **Local files (PDF, Word, Excel)** | ✅ **Use markitdown** - unique capability |
			
 
				+| **Web pages** | ❌ Use Jina (`r.jina.ai/`) - 5x faster |
			
 
				+| **Blocked/anti-bot sites** | ❌ Use Firecrawl |
			
 
				+| **OCR on images** | ✅ **Use markitdown** |
			
 
				+| **Audio transcription** | ✅ **Use markitdown** |
			
 
				+
			
 
				+## Basic Usage
			
 
				+
			
 
				+```bash
			
 
				+# Local files (primary use case)
			
 
				+markitdown document.pdf
			
 
				+markitdown report.docx
			
 
				+markitdown data.xlsx
			
 
				+markitdown slides.pptx
			
 
				+markitdown screenshot.png    # OCR
			
 
				+
			
 
				+# URLs (works, but Jina is faster)
			
 
				+markitdown https://example.com
			
 
				+
			
 
				+# Save output
			
 
				+markitdown document.pdf > document.md
			
 
				+```
			
 
				+
			
 
				+## Supported Formats
			
 
				+
			
 
				+| Format | Extensions | Notes |
			
 
				+|--------|------------|-------|
			
 
				+| PDF | `.pdf` | Text extraction, tables |
			
 
				+| Word | `.docx` | Formatting preserved |
			
 
				+| Excel | `.xlsx` | Tables to markdown |
			
 
				+| PowerPoint | `.pptx` | Slides as sections |
			
 
				+| Images | `.jpg`, `.png` | OCR text extraction |
			
 
				+| HTML | `.html` | Clean conversion |
			
 
				+| Audio | `.mp3`, `.wav` | Speech-to-text |
			
 
				+| Text | `.txt`, `.csv`, `.json`, `.xml` | Pass-through/structure |
			
 
				+| URLs | `https://...` | Works but slower than Jina |
			
 
				+
			
 
				+## Benchmarked Performance (URLs)
			
 
				+
			
 
				+| Tool | Avg Speed | Success Rate |
			
 
				+|------|-----------|--------------|
			
 
				+| Jina | **0.5s** | 10/10 |
			
 
				+| markitdown | 2.5s | 9/10 |
			
 
				+| Firecrawl | 4.5s | 10/10 |
			
 
				+
			
 
				+**Verdict**: For URLs, use Jina. For local files, markitdown is the only one of the three benchmarked tools that applies.
			
 
				+
			
 
				+## Examples
			
 
				+
			
 
				+```bash
			
 
				+# PDF to markdown (primary use case)
			
 
				+markitdown report.pdf > report.md
			
 
				+
			
 
				+# Excel spreadsheet
			
 
				+markitdown financials.xlsx
			
 
				+
			
 
				+# Image with text (OCR)
			
 
				+markitdown screenshot.png
			
 
				+
			
 
				+# PowerPoint deck
			
 
				+markitdown presentation.pptx > slides.md
			
 
				+
			
 
				+# Audio transcription
			
 
				+markitdown meeting.mp3 > transcript.md
			
 
				+```
			
 
				+
			
 
				+## Comparison with Alternatives
			
 
				+
			
 
				+| Task | markitdown | Alternative |
			
 
				+|------|------------|-------------|
			
 
				+| PDF text | `markitdown file.pdf` | PyMuPDF, pdfplumber |
			
 
				+| Word docs | `markitdown file.docx` | python-docx |
			
 
				+| Excel | `markitdown file.xlsx` | pandas, openpyxl |
			
 
				+| OCR | `markitdown image.png` | Tesseract |
			
 
				+| Web pages | Use Jina instead | `r.jina.ai/URL` (5x faster) |
			
 
				+
			
 
				+**markitdown's advantage**: One CLI for all local document formats. No code needed.
			
--- a/tests/markitdown-vs-jina/benchmark.py
+++ b/tests/markitdown-vs-jina/benchmark.py
@@ -0,0 +1,325 @@
 
				+#!/usr/bin/env python3
			
 
				+"""
			
 
				+Benchmark: markitdown vs Jina Reader vs Firecrawl
			
 
				+Compare speed, accuracy, formatting, and parallel execution
			
 
				+"""
			
 
				+
			
 
				+import subprocess
			
 
				+import time
			
 
				+import os
			
 
				+import sys
			
 
				+import concurrent.futures
			
 
				+from pathlib import Path
			
 
				+from urllib.parse import quote
			
 
				+
			
 
				+# Force UTF-8 encoding on Windows
			
 
				+if sys.platform == "win32":
			
 
				+    import codecs
			
 
				+    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.buffer, errors="replace")
			
 
				+    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.buffer, errors="replace")
			
 
				+
			
 
				+# Test corpus - 10 URLs of varying complexity
			
 
				+URLS = [
			
 
				+    # News articles - use stable landing pages
			
 
				+    ("guardian-tech", "https://www.theguardian.com/technology"),
			
 
				+    ("bbc-news", "https://www.bbc.com/news"),
			
 
				+
			
 
				+    # Documentation
			
 
				+    ("python-docs", "https://docs.python.org/3/library/asyncio.html"),
			
 
				+    ("mdn-fetch", "https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"),
			
 
				+    ("rust-book", "https://doc.rust-lang.org/book/ch04-01-what-is-ownership.html"),
			
 
				+
			
 
				+    # Feature-rich / Complex
			
 
				+    ("github-repo", "https://github.com/microsoft/markitdown"),
			
 
				+    ("hackernews", "https://news.ycombinator.com/"),
			
 
				+    ("wikipedia", "https://en.wikipedia.org/wiki/Markdown"),
			
 
				+
			
 
				+    # Simple / Minimal
			
 
				+    ("example-com", "https://example.com"),
			
 
				+    ("httpbin", "https://httpbin.org/html"),
			
 
				+]
			
 
				+
			
 
				+OUTPUT_DIR = Path(__file__).parent / "output"
			
 
				+
			
 
				+def fetch_with_markitdown(url: str, name: str) -> dict:
			
 
				+    """Fetch URL with markitdown, return timing and output"""
			
 
				+    output_file = OUTPUT_DIR / f"{name}_markitdown.md"
			
 
				+    start = time.perf_counter()
			
 
				+    try:
			
 
				+        result = subprocess.run(
			
 
				+            ["markitdown", url],
			
 
				+            capture_output=True,
			
 
				+            text=True,
			
 
				+            timeout=60,
			
 
				+            encoding="utf-8",
			
 
				+            errors="replace"
			
 
				+        )
			
 
				+        elapsed = time.perf_counter() - start
			
 
				+        output = result.stdout or ""
			
 
				+        error = result.stderr or ""
			
 
				+        success = result.returncode == 0 and len(output) > 50
			
 
				+    except subprocess.TimeoutExpired:
			
 
				+        elapsed = 60.0
			
 
				+        output = ""
			
 
				+        error = "TIMEOUT"
			
 
				+        success = False
			
 
				+    except Exception as e:
			
 
				+        elapsed = time.perf_counter() - start
			
 
				+        output = ""
			
 
				+        error = str(e)
			
 
				+        success = False
			
 
				+
			
 
				+    if success and output:
			
 
				+        output_file.write_text(output, encoding="utf-8")
			
 
				+
			
 
				+    return {
			
 
				+        "tool": "markitdown",
			
 
				+        "name": name,
			
 
				+        "url": url,
			
 
				+        "time": elapsed,
			
 
				+        "success": success,
			
 
				+        "output_len": len(output),
			
 
				+        "error": error if not success else None,
			
 
				+        "output_file": str(output_file) if success else None
			
 
				+    }
			
 
				+
			
 
				+def fetch_with_jina(url: str, name: str) -> dict:
			
 
				+    """Fetch URL with Jina Reader, return timing and output"""
			
 
				+    output_file = OUTPUT_DIR / f"{name}_jina.md"
			
 
				+    jina_url = f"https://r.jina.ai/{url}"
			
 
				+    start = time.perf_counter()
			
 
				+    try:
			
 
				+        result = subprocess.run(
			
 
				+            ["curl", "-s", "-L", "--max-time", "60", jina_url],
			
 
				+            capture_output=True,
			
 
				+            text=True,
			
 
				+            timeout=65,
			
 
				+            encoding="utf-8",
			
 
				+            errors="replace"
			
 
				+        )
			
 
				+        elapsed = time.perf_counter() - start
			
 
				+        output = result.stdout or ""
			
 
				+        error = result.stderr
			
 
				+        success = result.returncode == 0 and len(output) > 100
			
 
				+    except subprocess.TimeoutExpired:
			
 
				+        elapsed = time.perf_counter() - start  # subprocess timeout is 65s, not 60s
			
 
				+        output = ""
			
 
				+        error = "TIMEOUT"
			
 
				+        success = False
			
 
				+    except Exception as e:
			
 
				+        elapsed = time.perf_counter() - start
			
 
				+        output = ""
			
 
				+        error = str(e)
			
 
				+        success = False
			
 
				+
			
 
				+    if success and output:
			
 
				+        output_file.write_text(output, encoding="utf-8")
			
 
				+
			
 
				+    return {
			
 
				+        "tool": "jina",
			
 
				+        "name": name,
			
 
				+        "url": url,
			
 
				+        "time": elapsed,
			
 
				+        "success": success,
			
 
				+        "output_len": len(output),
			
 
				+        "error": error if not success else None,
			
 
				+        "output_file": str(output_file) if success else None
			
 
				+    }
			
 
				+
			
 
				+def fetch_with_firecrawl(url: str, name: str) -> dict:
			
 
				+    """Fetch URL with Firecrawl, return timing and output"""
			
 
				+    output_file = OUTPUT_DIR / f"{name}_firecrawl.md"
			
 
				+    start = time.perf_counter()
			
 
				+    try:
			
 
				+        # On Windows, firecrawl is a .cmd script - need shell=True
			
 
				+        result = subprocess.run(
			
 
				+            f"firecrawl {url}",
			
 
				+            capture_output=True,
			
 
				+            text=True,
			
 
				+            timeout=90,  # Firecrawl can be slower due to JS rendering
			
 
				+            encoding="utf-8",
			
 
				+            errors="replace",
			
 
				+            shell=True
			
 
				+        )
			
 
				+        elapsed = time.perf_counter() - start
			
 
				+        output = result.stdout or ""
			
 
				+        error = result.stderr
			
 
				+        success = result.returncode == 0 and len(output) > 100
			
 
				+    except subprocess.TimeoutExpired:
			
 
				+        elapsed = 90.0
			
 
				+        output = ""
			
 
				+        error = "TIMEOUT"
			
 
				+        success = False
			
 
				+    except Exception as e:
			
 
				+        elapsed = time.perf_counter() - start
			
 
				+        output = ""
			
 
				+        error = str(e)
			
 
				+        success = False
			
 
				+
			
 
				+    if success and output:
			
 
				+        output_file.write_text(output, encoding="utf-8")
			
 
				+
			
 
				+    return {
			
 
				+        "tool": "firecrawl",
			
 
				+        "name": name,
			
 
				+        "url": url,
			
 
				+        "time": elapsed,
			
 
				+        "success": success,
			
 
				+        "output_len": len(output),
			
 
				+        "error": error if not success else None,
			
 
				+        "output_file": str(output_file) if success else None
			
 
				+    }
			
 
				+
			
 
				+def run_sequential():
			
 
				+    """Run all tests sequentially"""
			
 
				+    print("\n" + "="*60)
			
 
				+    print("SEQUENTIAL EXECUTION")
			
 
				+    print("="*60)
			
 
				+
			
 
				+    results = {"markitdown": [], "jina": [], "firecrawl": []}
			
 
				+
			
 
				+    for name, url in URLS:
			
 
				+        print(f"\nTesting: {name}")
			
 
				+        print(f"  URL: {url}")
			
 
				+
			
 
				+        # markitdown
			
 
				+        r1 = fetch_with_markitdown(url, name)
			
 
				+        status1 = "OK" if r1["success"] else "FAIL"
			
 
				+        print(f"  markitdown: {r1['time']:.2f}s, {r1['output_len']:,} chars - {status1}")
			
 
				+        results["markitdown"].append(r1)
			
 
				+
			
 
				+        # jina
			
 
				+        r2 = fetch_with_jina(url, name)
			
 
				+        status2 = "OK" if r2["success"] else "FAIL"
			
 
				+        print(f"  jina:       {r2['time']:.2f}s, {r2['output_len']:,} chars - {status2}")
			
 
				+        results["jina"].append(r2)
			
 
				+
			
 
				+        # firecrawl
			
 
				+        r3 = fetch_with_firecrawl(url, name)
			
 
				+        status3 = "OK" if r3["success"] else "FAIL"
			
 
				+        print(f"  firecrawl:  {r3['time']:.2f}s, {r3['output_len']:,} chars - {status3}")
			
 
				+        results["firecrawl"].append(r3)
			
 
				+
			
 
				+    return results
			
 
				+
			
 
				+def run_parallel(tool: str, max_workers: int = 5):
			
 
				+    """Run all tests in parallel for a single tool"""
			
 
				+    print(f"\n{'='*60}")
			
 
				+    print(f"PARALLEL EXECUTION: {tool} (max_workers={max_workers})")
			
 
				+    print("="*60)
			
 
				+
			
 
				+    fetch_fns = {
			
 
				+        "markitdown": fetch_with_markitdown,
			
 
				+        "jina": fetch_with_jina,
			
 
				+        "firecrawl": fetch_with_firecrawl
			
 
				+    }
			
 
				+    fetch_fn = fetch_fns[tool]
			
 
				+
			
 
				+    start = time.perf_counter()
			
 
				+    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
			
 
				+        futures = {
			
 
				+            executor.submit(fetch_fn, url, name): name
			
 
				+            for name, url in URLS
			
 
				+        }
			
 
				+        results = []
			
 
				+        for future in concurrent.futures.as_completed(futures):
			
 
				+            name = futures[future]
			
 
				+            result = future.result()
			
 
				+            status = "OK" if result["success"] else "FAIL"
			
 
				+            print(f"  {name}: {result['time']:.2f}s - {status}")
			
 
				+            results.append(result)
			
 
				+
			
 
				+    total_time = time.perf_counter() - start
			
 
				+    print(f"\nTotal parallel time: {total_time:.2f}s")
			
 
				+
			
 
				+    return results, total_time
			
 
				+
			
 
				+def print_summary(seq_results: dict, par_results: dict):
			
 
				+    """Print comparison summary"""
			
 
				+    print("\n" + "="*60)
			
 
				+    print("SUMMARY")
			
 
				+    print("="*60)
			
 
				+
			
 
				+    # Sequential times
			
 
				+    md_times = [r["time"] for r in seq_results["markitdown"] if r["success"]]
			
 
				+    jina_times = [r["time"] for r in seq_results["jina"] if r["success"]]
			
 
				+    fc_times = [r["time"] for r in seq_results["firecrawl"] if r["success"]]
			
 
				+
			
 
				+    md_success = sum(1 for r in seq_results["markitdown"] if r["success"])
			
 
				+    jina_success = sum(1 for r in seq_results["jina"] if r["success"])
			
 
				+    fc_success = sum(1 for r in seq_results["firecrawl"] if r["success"])
			
 
				+
			
 
				+    md_chars = sum(r["output_len"] for r in seq_results["markitdown"] if r["success"])
			
 
				+    jina_chars = sum(r["output_len"] for r in seq_results["jina"] if r["success"])
			
 
				+    fc_chars = sum(r["output_len"] for r in seq_results["firecrawl"] if r["success"])
			
 
				+
			
 
				+    def safe_avg(times):
			
 
				+        return sum(times)/len(times) if times else 0
			
 
				+
			
 
				+    print("\n## Speed (Sequential)")
			
 
				+    print(f"| Metric | markitdown | Jina | Firecrawl |")
			
 
				+    print(f"|--------|------------|------|-----------|")
			
 
				+    print(f"| Avg time | {safe_avg(md_times):.2f}s | {safe_avg(jina_times):.2f}s | {safe_avg(fc_times):.2f}s |")
			
 
				+    print(f"| Total time | {sum(md_times):.2f}s | {sum(jina_times):.2f}s | {sum(fc_times):.2f}s |")
			
 
				+    print(f"| Success rate | {md_success}/{len(URLS)} | {jina_success}/{len(URLS)} | {fc_success}/{len(URLS)} |")
			
 
				+
			
 
				+    print("\n## Speed (Parallel, 5 workers)")
			
 
				+    print(f"| Metric | markitdown | Jina | Firecrawl |")
			
 
				+    print(f"|--------|------------|------|-----------|")
			
 
				+    print(f"| Total time | {par_results['markitdown'][1]:.2f}s | {par_results['jina'][1]:.2f}s | {par_results['firecrawl'][1]:.2f}s |")
			
 
				+
			
 
				+    print("\n## Output Size")
			
 
				+    print(f"| Metric | markitdown | Jina | Firecrawl |")
			
 
				+    print(f"|--------|------------|------|-----------|")
			
 
				+    print(f"| Total chars | {md_chars:,} | {jina_chars:,} | {fc_chars:,} |")
			
 
				+    print(f"| Avg chars | {md_chars//max(md_success,1):,} | {jina_chars//max(jina_success,1):,} | {fc_chars//max(fc_success,1):,} |")
			
 
				+
			
 
				+    print("\n## Per-URL Comparison")
			
 
				+    print(f"| URL | markitdown | Jina | Firecrawl | Winner |")
			
 
				+    print(f"|-----|------------|------|-----------|--------|")
			
 
				+    for i, (name, url) in enumerate(URLS):
			
 
				+        md = seq_results["markitdown"][i]
			
 
				+        jn = seq_results["jina"][i]
			
 
				+        fc = seq_results["firecrawl"][i]
			
 
				+
			
 
				+        md_str = f"{md['time']:.1f}s" if md["success"] else "FAIL"
			
 
				+        jn_str = f"{jn['time']:.1f}s" if jn["success"] else "FAIL"
			
 
				+        fc_str = f"{fc['time']:.1f}s" if fc["success"] else "FAIL"
			
 
				+
			
 
				+        # Determine winner by speed among successful tools
			
 
				+        successful = []
			
 
				+        if md["success"]: successful.append(("markitdown", md["time"]))
			
 
				+        if jn["success"]: successful.append(("Jina", jn["time"]))
			
 
				+        if fc["success"]: successful.append(("Firecrawl", fc["time"]))
			
 
				+
			
 
				+        if successful:
			
 
				+            winner = min(successful, key=lambda x: x[1])[0]
			
 
				+        else:
			
 
				+            winner = "None"
			
 
				+
			
 
				+        print(f"| {name} | {md_str} | {jn_str} | {fc_str} | {winner} |")
			
 
				+
			
 
				+    print(f"\nOutput files saved to: {OUTPUT_DIR}")
			
 
				+
			
 
				+def main():
			
 
				+    # Create output directory
			
 
				+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
			
 
				+
			
 
				+    print("Benchmark: markitdown vs Jina Reader vs Firecrawl")
			
 
				+    print(f"Testing {len(URLS)} URLs")
			
 
				+
			
 
				+    # Run sequential tests
			
 
				+    seq_results = run_sequential()
			
 
				+
			
 
				+    # Run parallel tests
			
 
				+    par_results = {
			
 
				+        "markitdown": run_parallel("markitdown", max_workers=5),
			
 
				+        "jina": run_parallel("jina", max_workers=5),
			
 
				+        "firecrawl": run_parallel("firecrawl", max_workers=5),
			
 
				+    }
			
 
				+
			
 
				+    # Print summary
			
 
				+    print_summary(seq_results, par_results)
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    main()
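A note on how the summary numbers are derived: `print_summary` averages only the runs that succeeded, so a failed or timed-out fetch lowers the success rate but does not affect the average time. The minimal sketch below isolates that calculation. The `summarize` helper and the sample records are hypothetical and use made-up timings, not results from the benchmark.

```python
def summarize(results: list[dict]) -> dict:
    """Average time over successful runs only, plus a success count."""
    ok = [r for r in results if r["success"]]
    avg = sum(r["time"] for r in ok) / len(ok) if ok else 0.0
    return {"avg_time": avg, "successes": len(ok), "total": len(results)}


# Hypothetical records shaped like the dicts returned by the fetch_with_* functions.
sample = [
    {"time": 0.4, "success": True},
    {"time": 0.6, "success": True},
    {"time": 60.0, "success": False},  # a timeout: excluded from the average
]

summary = summarize(sample)
print(f"{summary['avg_time']:.2f}s avg, {summary['successes']}/{summary['total']} succeeded")
```

Read this way, an average reported next to a 9/10 success rate covers nine runs, so it should be compared with the other tools' averages alongside their success rates rather than on its own.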
			
--- a/tools/README.md
+++ b/tools/README.md
@@ -44,6 +44,27 @@ Token-efficient CLI tools that replace verbose legacy commands. These tools are
 
				 | JSON manual | `jq` | Structured queries and transforms |
			
 
				 | YAML manual | `yq` | Same as jq for YAML/TOML |
			
 
				 
			
 
				+### Document Conversion
			
 
				+
			
 
				+| Legacy | Modern | Improvement |
			
 
				+|--------|--------|-------------|
			
 
				+| PyMuPDF/pdfplumber | `markitdown` | One CLI for all document types |
			
 
				+| python-docx | `markitdown` | Consistent markdown output |
			
 
				+| Tesseract (OCR) | `markitdown` | Built-in image text extraction |
			
 
				+
			
 
				+**markitdown** (Microsoft) - Convert documents to markdown:
			
 
				+```bash
			
 
				+pip install markitdown
			
 
				+
			
 
				+# Usage
			
 
				+markitdown document.pdf       # PDF
			
 
				+markitdown report.docx        # Word
			
 
				+markitdown data.xlsx          # Excel (tables)
			
 
				+markitdown slides.pptx        # PowerPoint
			
 
				+markitdown image.png          # OCR
			
 
				+```
			
 
				+Supports: PDF, DOCX, XLSX, PPTX, images (OCR), HTML, audio (speech-to-text), CSV, JSON, XML
			
 
				+
			
 
				 ### Git Operations
			
 
				 
			
 
				 | Legacy | Modern | Improvement |
			
@@ -153,15 +174,16 @@ export PERPLEXITY_API_KEY="your-key-here"
 
				 
			
 
				 ### Web Fetching (URL Retrieval Hierarchy)
			
 
				 
			
 
				-When Claude's built-in `WebFetch` gets blocked (403, Cloudflare, etc.), use these alternatives in order:
			
 
				+Benchmarked performance (10 URLs, varying complexity):
			
 
				 
			
 
				-| Tool | When to Use | Setup |
			
 
				-|------|-------------|-------|
			
 
				-| **WebFetch** | First attempt - fast, built-in | None required |
			
 
				-| **Jina Reader** | JS-rendered pages, PDFs, cleaner extraction | Prefix URL with `r.jina.ai/` |
			
 
				-| **Firecrawl** | Anti-bot bypass, complex scraping, structured extraction | Use `firecrawl-expert` agent |
			
 
				+| Tool | Avg Speed | Success | Best For |
			
 
				+|------|-----------|---------|----------|
			
 
				+| **WebFetch** | Instant | Varies | First attempt - built-in |
			
 
				+| **Jina Reader** | **0.5s** | 10/10 | Default fallback - 5-10x faster |
			
 
				+| **Firecrawl** | 4-5s | 10/10 | Anti-bot bypass, Cloudflare |
			
 
				+| **markitdown** | 2-3s | 9/10 | Local files + simple pages |
			
 
				 
			
 
				-**Jina Reader** (free tier: 10M tokens):
			
 
				+**Jina Reader** (free tier: 10M tokens) - **Recommended default**:
			
 
				 ```bash
			
 
				 # Simple - just prefix any URL
			
 
				 curl https://r.jina.ai/https://example.com
			
@@ -170,9 +192,9 @@ curl https://r.jina.ai/https://example.com
 
				 curl https://s.jina.ai/your%20search%20query
			
 
				 ```
			
 
				 
			
 
				-**Firecrawl** (requires API key):
			
 
				+**Firecrawl** (requires API key) - **Anti-bot specialist**:
			
 
				 ```bash
			
 
				-# Simple URL scrape (globally available)
			
 
				+# When Jina fails due to anti-bot
			
 
				 firecrawl https://blocked-site.com
			
 
				 
			
 
				 # Save to file
			
@@ -180,19 +202,27 @@ firecrawl https://example.com -o output.md
 
				 
			
 
				 # With JSON metadata
			
 
				 firecrawl https://example.com --json
			
 
				-
			
 
				-# For complex scraping, use the firecrawl-expert agent
			
 
				 ```
			
 
				 - Handles Cloudflare, Datadome, and other anti-bot systems
			
 
				 - Supports interactive scraping (click, scroll, fill forms)
			
 
				 - AI-powered structured data extraction
			
 
				-- CLI: `E:\Projects\Coding\Firecrawl\scripts\fc.py`
			
 
				+
			
 
				+**markitdown** - **Local files + URLs**:
			
 
				+```bash
			
 
				+# URLs (slower than Jina in the benchmark; still needs network access)
			
 
				+markitdown https://example.com
			
 
				+
			
 
				+# Local files (unique capability)
			
 
				+markitdown document.pdf
			
 
				+markitdown report.docx
			
 
				+markitdown data.xlsx
			
 
				+```
			
 
				 
			
 
				 **Decision Tree:**
			
 
				 1. Try `WebFetch` first (instant, free)
			
 
				-2. If blocked/JS-heavy → Try `r.jina.ai/URL` prefix
			
 
				-3. If still blocked → Try `firecrawl <url>` CLI
			
 
				-4. For complex scraping/extraction → Use `firecrawl-expert` agent
			
 
				+2. If blocked → Try Jina `r.jina.ai/URL` (fastest, best success rate)
			
 
				+3. If anti-bot/Cloudflare → Try `firecrawl <url>` (designed for bypass)
			
 
				+4. For local files (PDF, Word, Excel) → Use `markitdown`
			
 
				 
			
 
				 ## Token Efficiency Benchmarks